DarkNecromancer

Regular Expression

4 posts in this topic

Hey, I'm trying to parse the domain name out of a URL and I'm having a little trouble doing so. I went with regular expressions because I can't guarantee that the URL will always have an http, or a www, or anything for that matter other than a domain name. Enough of that; here's a test list I came up with just to test the functionality of the expression:

vi.wikipedia.org/wiki/Wikipedia:Phi%C3%AAn_b%E1%BA%A3n_ng%C3%B4n_ng%E1%BB%AF

wiktionary.org/

cu.wikipedia.org/wiki/%D0%93%D0%BB%D0%B0%D0%B2%D1%8C%D0%BD%D0%B0_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0

lij.wikipedia.org/wiki/Pagina_prin%C3%A7ip%C3%A2

http://nrm.wikipedia.org/wiki/Page_d%C3%A9_garde

http://uz.wikipedia.org/wiki/Bosh_Sahifa

Ok I'm working in the latest version of the expression tester and I've come up with the following expression:

(?:(?:http[s]?://)?(?:www.)?)(.*)(?:/.*)\n?

and I'm getting the results:

0 => vi.wikipedia.org/wiki

1 => wiktionary.org

2 => cu.wikipedia.org/wiki

3 => lij.wikipedia.org/wiki

4 => nrm.wikipedia.org/wiki

5 => uz.wikipedia.org/wiki

However, as far as I know (which isn't dependable), isn't the trailing group supposed to take the /wiki on those ones as well? What am I doing wrong here?

DarkNecromancer


Just wonderful. I decided to see if I could find some tools for OS X to play with, and after about 3 minutes of doing real-time string analysis with expressions I realized that the .* was 'greedy', and that was why it wasn't stopping at the first /. I feel stupid now. Just for the record, the fixed expression is:

(?:(?:http[s]?://)?(?:www.)?)((?U).*)/.*\n?
If anyone has any ways to improve the accuracy of the expression, please let me know. Otherwise, sorry for wasting your time.
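For anyone following along, here is a minimal sketch of the greedy versus lazy behaviour described above. It is written in Python purely for illustration and is not the tool used in this thread. Python's re module does not support the PCRE (?U) switch, so the lazy quantifier .*? stands in for it; the names GREEDY, LAZY and urls are made up for the example, and the lazy pattern also escapes the dot after www, which the posted expression does not.

```python
import re

# Greedy version from the first post: (.*) grabs as much as it can, then
# backtracks only far enough for the trailing "/.*" to match, so the
# capture ends at the LAST slash.
GREEDY = re.compile(r"(?:(?:http[s]?://)?(?:www.)?)(.*)(?:/.*)")

# Lazy version: ".*?" plays the role of PCRE's "(?U).*" (an assumption made
# for this sketch, since Python's re has no (?U)). The dot after "www" is
# escaped so it only matches a literal dot.
LAZY = re.compile(r"(?:https?://)?(?:www\.)?(.*?)/.*")

# A shortened, hypothetical subset of the test list.
urls = [
    "vi.wikipedia.org/wiki/Wikipedia:Phi%C3%AAn",
    "wiktionary.org/",
    "http://uz.wikipedia.org/wiki/Bosh_Sahifa",
]

greedy_hosts = [GREEDY.match(u).group(1) for u in urls]
lazy_hosts = [LAZY.match(u).group(1) for u in urls]

print(greedy_hosts)
print(lazy_hosts)
```

The greedy list keeps the /wiki segment wherever the URL has more than one slash, while the lazy list stops at the first slash.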

DarkNecromancer


Hi,

What do you need from this?

http://uz.wikipedia.org/wiki/Bosh_Sahifa

http://uz.wikipedia.org

uz.wikipedia.org

.org

???

So long,

Mega




Well, I'm making a client in one of my classes for a web page crawling system our teacher is developing. The client is given a starting domain, and from there we need to crawl the site and gather statistical data. I've been able to extract all of the relevant links using regular expressions, but because addresses can be written in so many ways, I'd like to have a base string that contains only the domain name and the extension (.com/.edu/...), without any extra directory stuff and without any http, https, or www. This will also carry over into some CSS URL code I want the client to look at.

I spoke a little too soon up above, because the expression I posted can't properly handle a URL that doesn't have a / in it. At the moment I'm going to work around the bug by forcing a / onto the end if there isn't one in the string, but if anyone knows how to get around that, let me know :whistle:. The lists I posted before were just some test data I came up with after copying some of what the client pulled off Wikipedia's main index page. There wasn't anything special in their meaning. Who knows, maybe you guys know of a better way to do this, so I'll just post what it's doing.
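On the URL without a / in it: one possible way around forcing a slash onto the end is to capture "everything that is not a slash" instead of "anything up to a slash". The sketch below is in Python for illustration only; HOST and extract_host are hypothetical names, and it assumes plain hosts without ports or user:password@ prefixes.

```python
import re

# Capture everything up to the first "/" (or the end of the string), so a
# trailing slash is no longer required. Scheme and "www." stay optional.
HOST = re.compile(r"^(?:https?://)?(?:www\.)?([^/\s]+)", re.IGNORECASE)


def extract_host(url):
    """Return the host without scheme or leading "www.", or None if nothing matches."""
    m = HOST.match(url.strip())
    return m.group(1).lower() if m else None


print(extract_host("http://uz.wikipedia.org"))
print(extract_host("www.google.com"))
print(extract_host("google.com/"))
```

Because the character class [^/\s] cannot cross a slash, no lazy or ungreedy switch is needed here, which should also make the pattern easier to carry between regex engines.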

It gets provided a single domain, let's say www.google.com. We then need to take that and crawl until there isn't anything further to crawl. However, we aren't allowed to leave the domain we were provided; instead we need to create a list of the domains we wanted to go to that would have forced us to leave our domain. The challenge comes when you're determining whether you've already looked at something: you need to do some kind of string comparison, but how do you do that if the strings can be different but resolve to the same thing (http://www.google.com, www.google.com, google.com, http://google.com)? I figured that the only thing they have in common is the 'google.com', and I wanted to extract it so I can then just use StringInStr.
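A rough sketch of that comparison step, in Python for illustration and using the standard urllib.parse module rather than a regular expression. The names host_key and partition_links are made up for this example, and it assumes every link is either an absolute URL or a bare host (relative links such as /wiki/Page would have to be resolved against the current page first).

```python
from urllib.parse import urlsplit


def host_key(url):
    """Normalise a URL or bare host to a comparable key such as "google.com"."""
    url = url.strip()
    # urlsplit only recognises the host when a "//" separator is present,
    # so add one for inputs like "www.google.com/page".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def partition_links(start, links):
    """Split links into (links on the start domain, set of outside domains)."""
    base = host_key(start)
    internal, external = [], set()
    for link in links:
        key = host_key(link)
        if key == base:
            internal.append(link)   # allowed to crawl
        else:
            external.add(key)       # record it, but do not follow
    return internal, external


internal, external = partition_links(
    "www.google.com",
    ["http://www.google.com/a", "google.com/b", "http://google.com",
     "https://en.wikipedia.org/wiki/X", "www.wikipedia.org"],
)
print(internal)
print(sorted(external))
```

This compares the normalised keys for equality instead of doing a substring search, since a substring check for google.com would also match a host like notgoogle.com. Subdomains are treated as separate domains here (en.wikipedia.org is not wikipedia.org); whether that is what the assignment wants is an open question.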

Well let me know what you guys think

DarkNecromancer

