DrDemonic Posted June 9, 2011

I am trying to write an HTML spider that captures a specific kind of domain and copies it to a text file. The spider is working excellently apart from two small issues.

1. I have it set up so that it crawls one page, copies every URL to a queue, and the special domains to a different text file. That part is working. I can't seem to figure out a way to make the spider search the queue file (.txt) to make sure that it isn't repeating strings. This is that specific part of the code:

For $oLink In $oLinks
    $repeat = FileRead("follow.txt")
    $repeatcheck = StringInStr($repeat, $oLink)
    If @error = 0 Then
        FileWriteLine("follow.txt", $oLink.href)
    EndIf

That was the best I could come up with, and it doesn't work.

2. The other problem that I have is the special domains. It doesn't output them to the other file. This is what I have:

$parse = StringInStr($oLink, ".edu")
$parser = @error
If $parser = -1 Then
    FileWriteLine("results.txt", $oLink.href)
EndIf

Any help is greatly appreciated.
MrMitchell Posted June 9, 2011

Maybe you can load all your hyperlinks into an array, then use another command to filter out the unique values in the array. Then loop through the array and pick out what you want based on what's in each element, using StringInStr() or StringRegExp().
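[Editor's note: a minimal sketch of the approach suggested above, using the standard Array.au3 UDF. The array contents and the "results.txt" filename are illustrative, not from the thread.]

```autoit
#include <Array.au3>

; Assume the spider has already collected one URL per element
; (hypothetical data here for demonstration).
Local $aLinks[4] = ["http://example.edu/a", "http://example.com/b", _
        "http://example.edu/a", "http://example.org/c"]

; Filter out duplicates. _ArrayUnique() returns a new array with
; the element count stored in index 0.
Local $aUnique = _ArrayUnique($aLinks)

; Loop through the unique values and pick out the ones you want.
For $i = 1 To $aUnique[0]
    If StringInStr($aUnique[$i], ".edu") Then
        FileWriteLine("results.txt", $aUnique[$i])
    EndIf
Next
```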
DrDemonic Posted June 9, 2011 Author

I'll try it. I just thought there was a limit on the number of strings that could be inserted into an array, and we're talking about thousands and thousands of URLs being entered every couple of minutes, so it seemed more logical to use text files. I guess I could try the array. Wouldn't hurt to try it out real quick and see what happens. Thanks for the help, bro! I'll post an update on whether or not it worked.
MrMitchell Posted June 9, 2011

No prob... and here's a little FYI, courtesy of the Help File:

"You can use up to 64 dimensions in an Array. The total number of entries cannot be greater than 2^24 (16 777 216)."

Given that, I think you're good!
DrDemonic Posted June 9, 2011 Author

Arrays don't seem to work either. Whenever I try to put a URL into an array index, it says that it's an invalid array value. I've tried with and without the http://. I really don't understand why nothing is working; I've literally been at this non-stop since about 11 o'clock this morning. It seems like every path I take just gives out a different error.

I decided to skip over the searching-for-repeats part for now and just try to get the .edu links to copy to the text file, but that won't work either. It either puts every domain type in, or it copies over nothing. This is the code I have currently, which doesn't work:

$parse = StringInStr($oLink, "edu")
If @error = -1 Then
    FileWriteLine("results.txt", $oLink.href)
EndIf

I think the problem is that I'm trying to control this with error functions, but the StringInStr command doesn't really give you any options. The help file says this:

Success: Returns the position of the substring.
Failure: Returns 0 if substring not found.
@error:
0 - Normal operation
1 - Invalid "start" or "occurrence" parameter given.

It says it returns zero for a normal operation with the substring not found, and 1 for invalid starts or occurrences, but that's all. So, how can I get my script to recognize it as a success?
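[Editor's note: to illustrate the help-file excerpt quoted above, StringInStr() signals "not found" through its return value (0), not through @error, so the return value itself is the thing to test. A quick sketch with a made-up URL:]

```autoit
Local $iPos = StringInStr("http://www.example.edu/page", ".edu")
; $iPos is now 19 (the 1-based position of ".edu");
; it would be 0 if the substring were absent.
If $iPos > 0 Then
    ConsoleWrite("found at position " & $iPos & @CRLF)
EndIf
```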
MrMitchell Posted June 13, 2011

$parse = StringInStr($oLink, "edu")
If @error = -1 Then
    FileWriteLine("results.txt", $oLink.href)
EndIf

Not sure what you're trying to do; you should just post your whole code. As for the above, ($parse =) should be StringInStr($oLink.href, "edu").
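[Editor's note: putting the two corrections from the thread together — use the link object's .href property, and test StringInStr()'s return value rather than @error — the filtering block might look like the sketch below. It assumes $oLinks is a link collection such as one returned by _IELinkGetCollection() from IE.au3; that surrounding setup is not shown in the thread.]

```autoit
#include <IE.au3>

For $oLink In $oLinks
    ; StringInStr() returns the match position (> 0) on success
    ; and 0 when the substring is not found, so test the return value.
    If StringInStr($oLink.href, ".edu") Then
        FileWriteLine("results.txt", $oLink.href)
    EndIf
Next
```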