DrDemonic

Search Text file for strings/repeats

6 posts in this topic

I am trying to write an HTML spider that captures a specific kind of domain and copies it to a text file. The spider is working well apart from two small issues.

1. I have it set up so that it crawls one page, copies every URL to a queue, and copies the special domains to a different text file. That part is working. What I can't figure out is a way to make the spider search the queue file (.txt) to make sure it isn't writing duplicate strings.

This is that specific part of the code:

For $oLink In $oLinks
    $repeat = FileRead("follow.txt")
    $repeatcheck = StringInStr($repeat, $oLink)
    If @error = 0 Then
        FileWriteLine("follow.txt", $oLink.href)
    EndIf
Next

That was the best I could come up with, and it doesn't work.

The other problem I have is with the special domains. It doesn't output them to the other file.

This is what I have:

$parse = StringInStr($oLink, ".edu")
$parser = @error
If $parser = -1 Then
    FileWriteLine("results.txt", $oLink.href)
EndIf

Any help is greatly appreciated.

Maybe you can load all your hyperlinks into an array, then use another command to filter out the unique values in the array. Then loop through the array and pick out what you want based on what's in each element, using StringInStr() or StringRegExp().
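A minimal sketch of that approach, using _ArrayUnique() and _ArrayAdd() from the standard Array.au3 UDF. This assumes $oLinks is the same link collection your loop already iterates over (the $oLink.href usage in your snippets suggests an IE document.links collection):

```autoit
#include <Array.au3>

; Collect every href into a plain 1D array
; (assumes $oLinks is your existing link collection)
Local $aLinks[0]
For $oLink In $oLinks
    _ArrayAdd($aLinks, $oLink.href)
Next

; _ArrayUnique() returns a new array with the element count in [0]
Local $aUnique = _ArrayUnique($aLinks)

; Write every unique link to the queue, and only the .edu
; links to the results file
For $i = 1 To $aUnique[0]
    FileWriteLine("follow.txt", $aUnique[$i])
    If StringInStr($aUnique[$i], ".edu") > 0 Then
        FileWriteLine("results.txt", $aUnique[$i])
    EndIf
Next
```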


I'll try it. I just thought there was a limit on the number of strings that could be inserted in an array, and we're talking about thousands and thousands of URLs being entered every couple of minutes. So it seemed more logical to use text files. I guess I could use an array, though. Wouldn't hurt to try it out real quick and see what happens.

Thanks for the help bro! I'll post an update on whether or not it worked.


I'll try it. I just thought there was a limit on the amount of strings that could be inserted in an array, and we're talking about thousands and thousands of urls being entered every couple of minutes.

No prob... and here's a little FYI courtesy of the Help File:

You can use up to 64 dimensions in an Array. The total number of entries cannot be greater than 2^24 (16 777 216).

Given that, I think you're good! :huh2:


Arrays don't seem to work either. Whenever I try to put a URL into an array index, it says it's an invalid array value. I've tried with and without the http://. I really don't understand why nothing is working. I've literally been at this non-stop since about 11 o'clock this morning. It seems like every path I take to accomplish this just gives a different error.

I decided to skip over the searching-for-repeats part for now and just try to get the .edu links copied to the text file, but that won't work either. It either writes every domain type, or it copies over nothing. This is the code I currently have, which doesn't work:

$parse = StringInStr($oLink, "edu")
If @error = -1 Then
    FileWriteLine("results.txt", $oLink.href)
EndIf

I think the problem is that I'm trying to control this with error checks, but the StringInStr() function doesn't really give you any options. In the help file it says this:

Success: Returns the position of the substring.
Failure: Returns 0 if substring not found.
@error: 0 - Normal operation
        1 - Invalid "start" or "occurrence" parameter given.

It says it returns zero for a normal operation with the substring not found, and sets @error to 1 for an invalid start or occurrence, but that's all. So how can I get my script to recognize it as a success?


$parse = StringInStr($oLink, "edu")
If @error = -1 Then
    FileWriteLine("results.txt", $oLink.href)
EndIf

Not sure what you're trying to do; you should just post your whole code. As for the above, the ($parse =) line should be StringInStr($oLink.href, "edu").
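Also, to answer your question about recognizing success: test StringInStr()'s return value rather than @error. It returns the 1-based position on a match and 0 when the substring isn't found; @error is only set for bad parameters. A sketch, assuming $oLinks is the link collection your loop runs over:

```autoit
For $oLink In $oLinks
    ; StringInStr() returns the position (> 0) on success and 0 when
    ; the substring is not found, so compare the return value directly
    If StringInStr($oLink.href, ".edu") > 0 Then
        FileWriteLine("results.txt", $oLink.href)
    EndIf
Next
```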

