DarkNecromancer Posted July 18, 2007

Hey, I'm trying to parse the domain name out of a URL and I'm having a little trouble doing so. I went with regular expressions because I can't guarantee that the URL will always have an http, or a www, or anything for that matter other than a domain name. Anyway, here's a test list I came up with just to test the functionality of the expression:

vi.wikipedia.org/wiki/Wikipedia:Phi%C3%AAn_b%E1%BA%A3n_ng%C3%B4n_ng%E1%BB%AF
wiktionary.org/
cu.wikipedia.org/wiki/%D0%93%D0%BB%D0%B0%D0%B2%D1%8C%D0%BD%D0%B0_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0
lij.wikipedia.org/wiki/Pagina_prin%C3%A7ip%C3%A2
http://nrm.wikipedia.org/wiki/Page_d%C3%A9_garde
http://uz.wikipedia.org/wiki/Bosh_Sahifa

OK, I'm working in the latest version of the expression tester and I've come up with the following expression:

(?:(?:http[s]?://)?(?:www.)?)(.*)(?:/.*)\n?

and I'm getting these results:

0 => vi.wikipedia.org/wiki
1 => wiktionary.org
2 => cu.wikipedia.org/wiki
3 => lij.wikipedia.org/wiki
4 => nrm.wikipedia.org/wiki
5 => uz.wikipedia.org/wiki

However, as far as I know (which isn't dependable), isn't it supposed to capture the /wiki on the other ones as well? What am I doing wrong here?

DarkNecromancer
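The behavior in those results can be reproduced outside the expression tester. A minimal sketch using Python's `re` module (just as a stand-in engine; the thread itself isn't using Python): because `.*` is greedy, the capture group runs to the *last* slash, which is exactly why `/wiki` ends up inside the captured text.

```python
import re

# The pattern from the post: (?:(?:http[s]?://)?(?:www.)?)(.*)(?:/.*)
# The greedy (.*) consumes as much as possible and only backtracks far
# enough for (?:/.*) to match, i.e. back to the LAST "/" in the string.
pattern = re.compile(r'(?:(?:https?://)?(?:www\.)?)(.*)(?:/.*)')

m = pattern.match("http://nrm.wikipedia.org/wiki/Page_d%C3%A9_garde")
print(m.group(1))  # nrm.wikipedia.org/wiki
```

So the `/wiki` isn't missing from the match at all; the greedy group has swallowed it, and only the final path segment is left for the `(?:/.*)` part.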
DarkNecromancer Posted July 18, 2007 (Author)

Just wonderful. I decided to see if I could find some tools for OS X that I could play with, and after about 3 minutes of being able to do real-time string analysis with expressions I realized that the .* was 'greedy', and that was why it wasn't stopping at the first /. I feel stupid now. Just for the record, the fixed expression is:

(?:(?:http[s]?://)?(?:www.)?)((?U).*)/.*\n?

If anyone has any ways to improve the accuracy of the expression, please let me know. Otherwise, sorry for wasting your time.

DarkNecromancer
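For anyone following along in another engine: the `(?U)` ungreedy modifier is PCRE-specific. In engines without it (Python's `re`, for example), the lazy quantifier `.*?` gives the same effect here, a sketch:

```python
import re

# PCRE's (?U) flips quantifiers to ungreedy; Python's re has no (?U),
# but the lazy quantifier .*? does the same job: the capture group now
# stops at the FIRST "/" instead of the last one.
pattern = re.compile(r'(?:(?:https?://)?(?:www\.)?)(.*?)/.*')

m = pattern.match("lij.wikipedia.org/wiki/Pagina_prin%C3%A7ip%C3%A2")
print(m.group(1))  # lij.wikipedia.org
```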
Xenobiologist Posted July 18, 2007

Hi, what do you need from this?

http://uz.wikipedia.org/wiki/Bosh_Sahifa
http://uz.wikipedia.org
uz.wikipedia.org
.org
???

So long,
Mega
DarkNecromancer Posted July 18, 2007 (Author)

Well, I'm making a client in one of my classes for a system our teacher is developing for web page crawling. The client gets provided with a starting domain, and from there we need to crawl the site and gather statistical data. I've been able to extract all of the relevant links using regular expressions, but because of the variability in how addresses can be written, I'd like to have a base string that contains only the domain name and the extension (.com/.edu/...), without any extra directory stuff and without any http, https, or www. This will also carry over into some CSS URL code I want to have the client look at.

I spoke a little too soon above, because the expression I posted can't properly handle a URL that doesn't have a / in it. At the moment I'm going to work around the bug by forcing a / onto the end of the string if there isn't one, but if anyone knows how to get around that, let me know. The stuff I posted before was just a test list I put together after copying some of what the client pulled off Wikipedia's main index page; there wasn't anything special in its meaning.

Who knows, maybe you guys know of a better way to do this, so I'll just describe what it's doing. The client gets provided a single domain, let's say www.google.com. We then need to take that and crawl until there isn't anything further to crawl; however, we aren't allowed to leave the domain we were provided. Instead, we need to build a list of domains we wanted to visit but that would have forced us to leave our domain. The challenge comes when you consider that to decide whether you've already looked at something you need to do some kind of string comparison, but how do you do that if the strings can be different yet resolve to the same thing?

(http://www.google.com, www.google.com, google.com, http://google.com)

The only similarity in that set is 'google.com', so I wanted to extract it out; then I can just use StringInStr. Let me know what you guys think.

DarkNecromancer
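One way to avoid forcing a trailing slash onto the input is to capture with a negated character class instead of an ungreedy `.*`: everything up to the first `/` or the end of the string, whichever comes first. A hedged sketch in Python's `re` (same caveat as above, the thread isn't using Python, and `host` is just an illustrative name), normalizing the four variants from the post to one comparison key:

```python
import re

# [^/]+ matches up to the first "/" OR the end of the string, so a URL
# with no path at all ("google.com") needs no trailing-slash workaround.
# Note this only strips a literal "www." prefix; it makes no attempt to
# distinguish other subdomains (vi., cu., ...) from the base domain.
host = re.compile(r'(?:https?://)?(?:www\.)?([^/]+)')

variants = ["http://www.google.com", "www.google.com",
            "google.com", "http://google.com"]
keys = {host.match(u).group(1) for u in variants}
print(keys)  # all four variants normalize to the same key: {'google.com'}
```

With every URL reduced to the same key, the "have I seen this?" check becomes a plain string comparison (StringInStr in AutoIt terms) against the stored key.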