zackrspv

[I SOLVED haha] Array Help, Loops and Website Scraping

3 posts in this topic

#1 ·  Posted (edited)

Hello,

I'm writing a program for personal use, to see whether I can better understand how to parse page sources. It is not for commercial use, and I will not use it, or the information it pulls, in any way that violates the site's TOS. I just want a local copy of the dictionary on MY system in case their system goes down.

The problem with the below code:

1. I wrote it, so of course it is very very basic and messy

2. I doubt I got any of the regexps right lol

3. Although the source looks the same for every $line it grabs, the script doesn't always grab the information and often skips over some of it.

What in the world am I missing?

#include <INet.au3>
#include <GUIConstants.au3>
filedelete("defs.txt")
filedelete("terms.txt")

$line = ""
$str = ""
$source = ""

GUICreate("Hello World", 600, 500)

;~ AutoItSetOption("GUICoordMode", "0")

GUISetState(@SW_SHOW)

;~ $item = InputBox("Search", "Enter search phrase")
;~ $item = StringReplace($item," ","+")

$item = ""

Func getTerms()
$source = (_INetGetSource("http://www.investopedia.com/terms/"&$item&"/"))
;~ MsgBox(0,"test",$source)


$nOffset = 1
$str = ""
while 1

    $array = StringRegExp($source, '<(?i)a href="(.*?)">', 1, $nOffset)
    
    
    if @error = 0 Then
        $nOffset = @extended
    Else
        ExitLoop
    EndIf
    for $i = 0 to UBound($array) - 1
        if StringLeft($array[$i],9) = "/terms/"&$item&"/" Then
            $testme = StringInStr($array[$i], ".asp")
            if $testme then
                $str = $str & $array[$i] & @CRLF & @CRLF
            Else
            endif
    Else
        EndIf
    Next
WEnd

filewrite("terms.txt", $str)
;~ GUICtrlCreateEdit($str, -1, 0,600,500,BitOR($WS_VSCROLL,$ES_READONLY))

;~ Do
;~   $msg = GUIGetMsg()
;~ Until $msg = $GUI_EVENT_CLOSE
EndFunc

Func startTerms()
    ; Index pages to crawl: "1" (numeric terms) followed by a-z.
    Local $pages = "1abcdefghijklmnopqrstuvwxyz"
    GUICtrlCreateLabel("Do: ", 0, 32, 32, 32)
    For $n = 1 To StringLen($pages)
        ; $item is the global that getTerms() reads to build its URL.
        $item = StringMid($pages, $n, 1)
        GUICtrlCreateLabel($item, 32, 32, 32, 32)
        getTerms()
    Next
EndFunc


Func getDefs()
    $source = ""
    $array = ""
    $str = ""
    $source = (_INetGetSource("http://www.investopedia.com/"&$line))
    
    $nOffset = 1
    
    while 1
                                       
        $array = StringRegExp($source, 'dic_termdefs">(.*?)<', 1, $nOffset)
        
            if @error = 0 Then
                $nOffset = @extended
            Else
                ExitLoop
            EndIf
        for $i = 0 to UBound($array) - 1
;~              msgbox(0,"INFO", "Array info for: $array["&$i&"]"&@LF&$array[$i])
            if $array[$i] = "" Then
                msgbox(0,"error", "Array is blank for: $array["&$i&"]")
            Else
            filewrite("defs.txt", $line & "," & $array[$i] & @CRLF & @CRLF)
            EndIf
        Next
    WEnd
    


EndFunc


call("startTerms")

$url = "http://www.investopedia.com/"

$file = FileOpen("terms.txt", 0)

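while 1
    $line = FileReadLine($file)
    ; FileReadLine sets @error at end of file; without this check the loop never ends.
    if @error then ExitLoop
    ; terms.txt has a blank line after every entry, so skip those.
    if $line = "" then ContinueLoop
    guictrlcreatelabel("Do: "&$line, 0, 32, 600, 32)
    call("getDefs")
WEnd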
FileClose($file)
Edited by zackrspv

-_-------__--_-_-____---_-_--_-__-__-_ ^^€ñ†®øÞÿ ë×阮§ wï†høµ† ƒë@®, wï†høµ† †ïmë, @ñd wï†høµ† @ †ïmïdï†ÿ ƒø® !ïƒë. €×阮 ñø†, bµ† ïñ§†ë@d wï†hïñ, ñ@ÿ, †h®øµghøµ† †hë 맧ëñ§ë øƒ !ïƒë.




So, I've been going over this again and again, and I still can't figure out why it keeps skipping over some of the links in the terms file. Anyone have any idea?
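
One possible cause, sketched here without having checked the real page markup: the link pattern '<(?i)a href="(.*?)">' only matches anchors where href is the first attribute, and its lazy group can run past the closing quote when other attributes follow. The sample HTML below is hypothetical. Flag 3 of StringRegExp returns every match in one array, which also removes the offset bookkeeping.

```autoit
; Hypothetical markup for illustration, not copied from the real site.
Local $html = '<a href="/terms/a/abs.asp">ABS</a>' & @LF & _
        '<a class="t" href="/terms/a/accrual.asp">Accrual</a>' & @LF & _
        '<a href="/terms/a/alpha.asp" title="Alpha">Alpha</a>'

; [^>]*? allows other attributes before href; [^"]* keeps the capture inside the quotes.
; Flag 3 returns all matches at once instead of one match per call.
Local $links = StringRegExp($html, '(?i)<a\s[^>]*?href="([^"]*)"', 3)
If @error Then
    ConsoleWrite("no links found" & @CRLF)
Else
    For $i = 0 To UBound($links) - 1
        ConsoleWrite($links[$i] & @CRLF)
    Next
EndIf
```

Comparing the number of links this finds against the entries in terms.txt would show whether the pattern, rather than the loop, is dropping links.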



Ha, I got it. The regexp just wasn't matching properly. I changed it to: $array = StringRegExp($source, '(?i)class="dic_termdefs">(.*?)\n', 1, $nOffset) and boom, it works. I did make some modifications to the underlying script, though: I removed the function and moved its body into the primary call loop, so it looks like:

$url = "http://www.investopedia.com/"

$file = FileOpen("terms.txt", 0)

while 1
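    $line = FileReadLine($file)
    ; FileReadLine sets @error at end of file; exit here so the loop terminates.
    if @error then ExitLoop
    if $line = "" then 
    Else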
        guictrlsetdata($info, $line)
        $source = ""
        $array = ""
        $str = ""
        $source = (_INetGetSource("http://www.investopedia.com/"&$line))
        guictrlsetdata($redit, "Grabbing source for:  " & $line)
        
;~      sleep(4000)
        guictrlsetdata($edit, $source)
        
        
        $nOffset = 1
    
        while 1
                                    
            $array = StringRegExp($source, '(?i)class="dic_termdefs">(.*?)\n', 1, $nOffset)
                    $nOffset = @extended
                    if @error = "1" Then 
                        MsgBox(0, "error", "array didn't return result")
                        Exit
                    EndIf
                    guictrlsetdata($redit, "Set array extended for offset for: "  & $line)
;~                  sleep(3000)

            for $i = 0 to UBound($array) - 1
                guictrlsetdata($redit, "Going to write data for: "  & $line)
                $str = $array[$i]
                $str = StringRegExpReplace($str, "&#(.*?);", "" )
                $str = StringRegExpReplace($str, "<(.*?)>", "" )
                $str = StringRegExpReplace($str, "</(.*?)>", "" )
                $str = StringRegExpReplace($str, "&(.*?);", "" )
;~                  sleep(3000)
                filewrite("defs.txt", $line & "," & $str & @CRLF & @CRLF)
                guictrlsetdata($redit, $str)
;~              sleep(4000)
            
            Next
        ExitLoop
        WEnd
    EndIf
WEnd            
FileClose($file)

So, at least it is working :)
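
A minimal sketch of why the pattern change plausibly helps, assuming a definition can begin with an inline tag (the sample string is hypothetical, not taken from the site). The original pattern's group stops at the first "<" after the class attribute, while the revised one captures the rest of the line, which the cleanup step then strips of tags and entities.

```autoit
; Hypothetical page fragment for illustration, not copied from the real site.
Local $sample = '<p class="dic_termdefs"><b>Bond</b> &#8211; a debt investment &amp; loan.' & @LF & '</p>'

; Original pattern: the lazy group ends at the next "<", which here is the <b> tag
; that immediately follows, so the capture is an empty string.
Local $old = StringRegExp($sample, 'dic_termdefs">(.*?)<', 1)
If Not @error Then ConsoleWrite("old: [" & $old[0] & "]" & @CRLF)

; Revised pattern: the group runs to the end of the line, inline tags included.
Local $new = StringRegExp($sample, '(?i)class="dic_termdefs">(.*?)\n', 1)
If Not @error Then
    ; Same kind of cleanup as in the loop above: remove tags, then entities.
    Local $text = StringRegExpReplace($new[0], "<(.*?)>", "")
    $text = StringRegExpReplace($text, "&(.*?);", "")
    ConsoleWrite("new: [" & $text & "]" & @CRLF)
EndIf
```

Note that the revised pattern only captures up to the first line break, so a definition that spans several lines in the source would still be cut short.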

