John117

trying to parse webpage

24 posts in this topic

Hi, I am currently running into trouble getting data from a webpage. It seems that the data sometimes loads after the page reports complete, so the HTML I read is missing the data:

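#include <IE.au3>

$oIE = _IECreate("http://www.westhoustoninfiniti.com/j/i/30002/UsedInventory.html", 1, 1) ; try to attach to an existing window, visible
_IELoadWait($oIE) ; waits for the page load to complete; script-loaded data may still arrive later
$Source = _IEDocReadHTML($oIE)
ConsoleWrite($Source & @LF)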

Also: I need the page to show all results, or at least 100, not only 25.

So I need the results-per-page value set to 100 before reading the HTML, or some other method to pull all the data quickly. Any suggestions?

John117,

If you are interested in the web page source, try something like this:

Local $src = InetGet("http://www.westhoustoninfiniti.com/j/i/30002/UsedInventory.html", "c:\tmp\src.txt") ; download the raw page to a file
If $src = 0 Then MsgBox(0, '', 'inetget error = ' & @error)
Run("notepad.exe c:\tmp\src.txt") ; open the downloaded source in Notepad

kylomas


Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

Getting closer, but I need help submitting the change so the page returns the results.

#include <IE.au3>
#include <string.au3>
$oIE = _IECreate("http://www.westhoustoninfiniti.com/j/i/30002/UsedInventory.html", 1, 1)
_IELoadWait ($oIE)
$Source = _IEDocReadHTML($oIE)
$Results = StringReplace($Source, "<INPUT id=results-per-page-state value=value;25; type=text name=f7>", "<INPUT id=results-per-page-state value=value;100; type=text name=f7>")
;~ $oIE.submit
$Source = _IEDocReadHTML($oIE)
ConsoleWrite($Source & @LF)

John117,

Perhaps it would help if you would tell us what you are trying to do...

kylomas


Thanks for the InetGet suggestion. From what I can tell, the page source contains the script that fetches the data afterward, so grabbing the raw source returns very little :-)
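
One way to check this claim is to download the raw HTML without a browser and look for a marker that only appears once results are rendered. The following is a minimal sketch, not a tested solution: the marker string "details-box" is an assumption about the page's markup, and InetRead is used here instead of InetGet only so the result stays in memory.

```autoit
; Download the raw HTML directly; no page scripts run in this path.
Local $sUrl = "http://www.westhoustoninfiniti.com/j/i/30002/UsedInventory.html"
Local $dRaw = InetRead($sUrl, 1) ; 1 = force a reload from the remote site

If @error Then
    ConsoleWrite("InetRead failed, @error = " & @error & @LF)
Else
    Local $sRaw = BinaryToString($dRaw) ; convert the binary result to text
    ; "details-box" is an assumed marker for one result entry; adjust to the real markup.
    Local $bHasResults = (StringInStr($sRaw, "details-box") > 0)
    ConsoleWrite("Raw source length: " & StringLen($sRaw) & ", contains results: " & $bHasResults & @LF)
EndIf
```

If the marker is absent here but present in what _IEBodyReadHTML returns after the page settles, the data is being added by script after the initial load.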

If you manually open the website, you will see the car data load. If you set the results per page to 100, you will see all the available car data. I want to pull all of it (VIN, year, make, etc.) with the results set to 100.

So you are not really trying to parse a webpage, are you? You are trying to set an element to a particular value on a webpage.

You might want to reflect that in your thread title.

No, I am trying to parse a webpage. That is the whole point. Setting the element just fetches the rest of the data to parse. As is, 25 of the 39 results can be parsed without setting the element to 100.

The problem is grabbing the data to parse, because it is not part of the source until some loading has happened.

Ahhh, now I am getting the complete(?) picture...


No, it's not part of the source until you have set that element to 100.

If you do a bit of searching on how to set that element, then you can parse your webpage.


AutoIt Absolute Beginners    Require a serial    Pause Script    Video Tutorials by Morthawt   ipify 

Monkey's are, like, natures humans.

Yes, it is part of the source; by default the page loads 25 results. However, the data only becomes part of the source after loading.

I only need to set the element after I have the parsing worked out.

One has little to do with the other, which is why the element question was an "also" and not part of the main post.

The data (car info) does load, but not always before this line runs:

$Source = _IEBodyReadHTML($oIE)

Since the data is not part of the initial source, I must wait for it to load rather than just grabbing the source.

I would like to grab that data without missing it and without having to add

Sleep(10000)

before

$Source = _IEBodyReadHTML($oIE)

Is there a method to pull the info myself rather than waiting to parse it (after it eventually loads)? Or a way to ensure it is ready to parse?
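
One common alternative to a fixed Sleep is to poll the live DOM until the expected elements appear, giving up after a timeout. The sketch below assumes each result is a div whose class name contains "details-box" (an assumption about the page's markup); _WaitForResults is a hypothetical helper written for this illustration, not part of IE.au3.

```autoit
#include <IE.au3>

; Hypothetical helper: poll until at least $iMinCount <div> elements whose class
; name contains $sClass exist, or until $iTimeoutMs milliseconds have elapsed.
; Returns the number of matching elements seen on the last poll.
Func _WaitForResults(ByRef $oIE, $sClass, $iMinCount = 1, $iTimeoutMs = 15000)
    Local $hTimer = TimerInit(), $iFound = 0
    Do
        $iFound = 0
        Local $oDivs = _IETagNameGetCollection($oIE, "div") ; all divs currently in the DOM
        If IsObj($oDivs) Then
            For $oDiv In $oDivs
                If StringInStr($oDiv.className, $sClass) Then $iFound += 1
            Next
        EndIf
        If $iFound >= $iMinCount Then ExitLoop
        Sleep(250) ; short pause between polls
    Until TimerDiff($hTimer) > $iTimeoutMs
    Return $iFound
EndFunc

Local $oIE = _IECreate("http://www.westhoustoninfiniti.com/j/i/30002/UsedInventory.html", 1, 1)
; "details-box" is an assumed class name; adjust it to the real markup.
Local $iFound = _WaitForResults($oIE, "details-box")
If $iFound > 0 Then
    Local $sSource = _IEBodyReadHTML($oIE)
    ConsoleWrite($sSource & @LF)
Else
    ConsoleWrite("Timed out waiting for results" & @LF)
EndIf
```

With this approach the script continues as soon as the data is present and only waits the full timeout when nothing loads.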

I think I see what you mean now.

When you view the source of the page from your browser, all the info you want is there...39 found.

But when you try it with _IE functions the raw page source is being returned (not browser parsed source)

Is this correct?


Yes, that is correct. After I do it manually and the page loads completely, the data is there. I need it to run on its own, though, and still return the data in a timely manner, not 15 seconds later or however long I have to wait just to ensure it has loaded. Is there a part of the fetch that I can mimic without actually loading the rest of the page?

Hi,

you could just set one of the select boxes for results-per-page to 100, which triggers their JavaScript to load new results.

After that you can read out the HTML or do some more checks first, here is an example...

#include <IE.au3>


Global $oIE = _IECreate("http://www.westhoustoninfiniti.com/j/i/30002/UsedInventory.html", 1, 0) ; go to the page

; get the page to display 100 results, or all if there are fewer than 100
Local $oElems = _IETagNameGetCollection($oIE, "select") ; get all select tags
For $oElem In $oElems
    If StringInStr($oElem.className, 'resultsPerPage') Then ; this is the results-per-page select box
        _IEFormElementOptionSelect($oElem, 100) ; set it to 100 results per page
        ExitLoop ; exit the loop, there are more 'resultsPerPage' selects, no need to set it twice
    EndIf
Next

; get the result count of the page, to check if it worked
Local $oTable = _IETableGetCollection($oIE, 4) ; get the results table (5th table, zero based so it becomes index 4)
Local $oElem = _IETagNameGetCollection($oTable, "td", 0) ; get the first td tag, contains a string with the result count
Local $iResultCount = Int(StringRegExpReplace($oElem.innerText, "\D+", "")) ; strip every non digit, now we have the result count
ConsoleWrite("-> Result count A: " & $iResultCount & @CRLF) ; gives me 39 results atm

; you could check if the result count changed...
;~ If $iResultCount <= 25 Then  ; it did not work or page is not done yet
;~ EndIf

Local $oElem = _IETagNameGetCollection($oIE, "div") ; get all div tags
Local $iCount = 0, $aDetail, $oFoo, $sLabel
For $oDiv In $oElem ; loop over all div tags to get the results (detail boxes)

    If StringInStr($oDiv.className, 'details-box') Then ; this is a result

        $iCount += 1
        ConsoleWrite("-- " & StringFormat("%03d", $iCount) & " -------------------------" & @CRLF)

        $oFoo = _IETagNameGetCollection($oDiv, "a", 1) ; get the 2nd link, this is the title of the detail box
        $sLabel = $oFoo.innertext
        ConsoleWrite("Label: " & $sLabel & @CRLF)

        $oFoo = _IETagNameGetCollection($oDiv, "div") ; get all div tags in this result
        For $oDivDetail In $oFoo

            If StringInStr($oDivDetail.className, 'details-text-item') Then ; this is a detail item

                $aDetail = StringRegExp($oDivDetail.innerHTML, "(?i)<strong>(.*?)</strong>\s+([^>]+)", 3) ; get the key/value
                If @error Then ContinueLoop

                ConsoleWrite($aDetail[0] & ": " & $aDetail[1] & @CRLF)
;~                 Switch StringStripWS(StringLower($aDetail[0]), 3) ; here you could do something with each item
;~                     Case "retail price"
;~                     Case "price"
;~                     Case "mileage"
;~                     Case "transmission"
;~                     case "stock"
;~                     case "ext. color"
;~                     case "int. color"
;~                     case "engine"
;~                     case "vin"
;~                 EndSwitch

            EndIf

        Next

    EndIf

Next

Edit: added a little debugging code

Edit 2: Just read what you are attempting to do so rewrote the example, this works for me...

Edited by Robjong
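
The two regular expressions in the example can be tried on their own, without a browser. In the sketch below the sample strings are made up for illustration and are not taken from the live page. `\D+` matches runs of non-digits and `\s+` matches whitespace; AutoIt string literals have no escape sequences, so a single backslash is all that is needed.

```autoit
; Made-up sample text standing in for the first table cell of the results table.
Local $sCountText = "39 Vehicles Found"
; Remove every run of non-digit characters, then convert what is left to a number.
Local $iCount = Int(StringRegExpReplace($sCountText, "\D+", ""))
ConsoleWrite("Count: " & $iCount & @CRLF)

; Made-up sample markup standing in for the innerHTML of one detail item.
Local $sItemHtml = "<STRONG>Mileage:</STRONG> 45,210"
; Flag 3 returns an array of all captured groups: [0] = label, [1] = value.
Local $aDetail = StringRegExp($sItemHtml, "(?i)<strong>(.*?)</strong>\s+([^>]+)", 3)
If Not @error Then ConsoleWrite($aDetail[0] & " " & $aDetail[1] & @CRLF)
```

The `(?i)` prefix makes the match case-insensitive, which matters because older Internet Explorer versions tend to return tag names in upper case from innerHTML.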

Robjong,

Thank you...just learned a bunch about navigating a web page programmatically...

kylomas


Hi,

glad you learned something from it :)

Did you see the new example? I had some trouble posting, so I had to clean up my own mess.

yes, again tx...

