
Parsing html

dvlkn

Hi

Just wondering if it's possible to parse HTML. I am trying to pick up a number from a webpage for further calculations, for example:

<html>You can buy <b>8</b> bla bla bla </html>

I looked in the help file but could not find a function, or a combination of the existing ones, that does the job.

I know about _IEBodyReadHTML or _IEBodyReadText but how can I get only the required part rather than the whole body ? Any help appreciated

oMBRa

U get the whole body then u filter the required part. Look in the help file about String Management
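A minimal sketch of that "read everything, then filter" approach, assuming the sample markup from the first post. In a real script the string would come from _IEBodyReadHTML rather than a literal, and the pattern here is only one possible choice:

```autoit
; Sample markup from the question; in a real script this string
; would come from _IEBodyReadHTML($oIE) (IE.au3).
Local $sHtml = "<html>You can buy <b>8</b> bla bla bla </html>"

; Flag 1 returns an array of the captured groups from the first match.
Local $aMatch = StringRegExp($sHtml, "You can buy <b>(\d+)</b>", 1)

If @error Then
    ConsoleWrite("No match found" & @CRLF)
Else
    ; Number() converts the captured text for further calculations.
    ConsoleWrite("Shares: " & Number($aMatch[0]) & @CRLF)
EndIf
```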

dbzfanatic
dvlkn

OK I looked, is that the String Regular Expression that I should use ? looks complicated ;) and dbzfanatic, yes it's always in bold but not the only text that is bolded, would that interfere ?

dvlkn

Hmm I am not familiar with Regex and having a difficult time parsing the bold part in: "You can buy *random number here* shares". Can someone help me with the expression pls ?

Szhlopp

RegExReplace tester =)

#include <GUIConstants.au3>
#include <EditConstants.au3>
;


$hMainGUI = GUICreate("StringRegExReplace Tester", 574, 454, 245, 130)
$hMainEdit = GUICtrlCreateEdit("", 0, 0, 573, 182)
GUICtrlCreateGroup("Input", 2, 182, 569, 270)
$hPatternInput = GUICtrlCreateInput("", 70, 206, 150, 21)
$hReplaceInput = GUICtrlCreateInput("", 70, 236, 150, 21)
GUICtrlCreateLabel("Pattern:", 12, 208, 41, 17)
GUICtrlCreateLabel("Replace:", 12, 239, 47, 17)
$hCountInput = GUICtrlCreateInput("0", 70, 266, 150, 21)
GUICtrlCreateLabel("Count:", 12, 269, 35, 17)
$hReturn = GUICtrlCreateEdit("", 5, 326, 560, 121 )
GUICtrlSendMsg($hReturn, $EM_SETREADONLY, -1, 0)
GUICtrlCreateLabel("@Error:", 385, 206, 40, 17)
GUICtrlCreateLabel("@Extended:", 385, 234, 63, 17)
$hErrorInput = GUICtrlCreateInput("", 454, 204, 62, 21)
GUICtrlSetState(-1, $GUI_DISABLE)
$hExtendedInput = GUICtrlCreateInput("", 454, 230, 62, 21)
GUICtrlSetState(-1, $GUI_DISABLE)
$hTestButton = GUICtrlCreateButton("RegExReplace", 231, 298, 108, 26, 0)
GUICtrlCreateGroup("", -99, -99, 1, 1)
GUISetState(@SW_SHOW)


While 1
    $nMsg = GUIGetMsg()
    Switch $nMsg
        Case $GUI_EVENT_CLOSE
            Exit
        Case $hTestButton
            ;Sleep(3000)
            $sRegEx = StringRegExpReplace(GUICtrlRead($hMainEdit), GUICtrlRead($hPatternInput), GUICtrlRead($hReplaceInput) , GUICtrlRead($hCountInput))
            $iError = @error
            $iExt = @extended
            
            If $iError = 2 Then 
                GUICtrlSetState($hPatternInput, $GUI_FOCUS)
                GUICtrlSendMsg($hPatternInput, $EM_SETSEL, $iExt - 2 , $iExt - 1)
                
            EndIf
            
            GUICtrlSetData($hErrorInput, $iError)
            GUICtrlSetData($hExtendedInput, $iExt)
            GUICtrlSetData($hReturn, $sRegEx)
            
            
    EndSwitch
WEnd

#include <GUIConstants.au3>
#include <EditConstants.au3>
#include <WindowsConstants.au3>
#include <ListViewConstants.au3>
#include <GuiListView.au3>

$GUI = GUICreate("RegEx Tester", 622, 450)
$Group1 = GUICtrlCreateGroup("Input Parameters", 16, 16, 569, 185)
    $String = GUICtrlCreateEdit("", 88, 48, 465, 97)
    $Test = GUICtrlCreateButton("Test", 464, 160, 89, 25, 0)
    $Pattern = GUICtrlCreateInput("", 96, 160, 137, 21)
    $Flag = GUICtrlCreateInput("", 320, 160, 73, 21, BitOR($ES_AUTOHSCROLL,$ES_NUMBER))
    GUICtrlCreateLabel("String:", 32, 88, 34, 17)
    GUICtrlCreateLabel("Pattern:", 32, 160, 41, 17)
    GUICtrlCreateLabel("Flag", 272, 162, 24, 17)
GUICtrlCreateGroup("", -99, -99, 1, 1)
$Group2 = GUICtrlCreateGroup("Output", 16, 216, 577, 217)
    $Group3 = GUICtrlCreateGroup("Return Values", 328, 240, 249, 129)
        GUICtrlCreateLabel("Return: ", 344, 264, 42, 17)
        GUICtrlCreateLabel("@Extended", 344, 296, 60, 17)
        GUICtrlCreateLabel("@error", 344, 328, 36, 17)
        $Return = GUICtrlCreateInput("", 408, 264, 137, 21, BitOR($ES_AUTOHSCROLL,$ES_READONLY))
        $Extended = GUICtrlCreateInput("", 408, 296, 137, 21, BitOR($ES_AUTOHSCROLL,$ES_READONLY))
        $Error = GUICtrlCreateInput("", 408, 328, 137, 21, BitOR($ES_AUTOHSCROLL,$ES_READONLY))
    GUICtrlCreateGroup("", -99, -99, 1, 1)
    $ListView = GUICtrlCreateListView("Element|Data", 40, 240, 257, 177, -1, BitOR($WS_EX_CLIENTEDGE,$LVS_EX_GRIDLINES,$LVS_EX_FULLROWSELECT))
GUICtrlCreateGroup("", -99, -99, 1, 1)
GUISetState(@SW_SHOW)


While 1
    $nMsg = GUIGetMsg()
    Switch $nMsg
        Case $GUI_EVENT_CLOSE
            Exit
        Case $Test
            If GuiCtrlRead($String) = "" or GUICtrlRead($Pattern) = "" Then
                MsgBox(0,"Error", "Please fill in all required parameters.")
            Else
                If GUICtrlRead($Flag) = "" then 
                    Process(GUICtrlRead($String), GuiCtrlRead($Pattern))
                Else
                    Process(GUICtrlRead($String), GuiCtrlRead($Pattern), GUICtrlRead($Flag))
                EndIf
            EndIf
    EndSwitch
WEnd

Func Process($String, $Pattern, $Flag=0)
    _GUICtrlListView_DeleteAllItems(GUICtrlGetHandle($ListView))
    $Result = StringRegExp($String,$Pattern,$Flag)
    $Err = @error
    $Ext = @extended
    If IsArray($Result) then
        Dim $Item[UBound($Result)]
        For $x = 0 to UBound($Result)-1
            $Item[$x] = GUICtrlCreateListViewItem("["&$x&"]|"&$Result[$x],$ListView)
        Next
        GUICtrlSetData($Return, "Array")
        GUICtrlSetData($Extended, $Ext)
        GUICtrlSetData($Error, $Err)
    Else
        GUICtrlSetData($Return, $Result)
        GUICtrlSetData($Extended, $Ext)
        GUICtrlSetData($Error, $Err)
    EndIf
EndFunc
dvlkn

Thanks. Using the program I could only do half the job so I still need some help:

This is the part of the text I need to parse:

<br>You currently have <b>412</b> shares.<br>You can buy a maximum of <b>20</b> shares.<br>

With the pattern ([0-9]{1,3})(?:</b> shares) and flag 1, it returns 412</b> shares ;) What I actually need is the second number, 20 in this example.

Szhlopp

Heh here ya go=)

Text used: <br>You currently have <b>412</b> shares.<br>You can buy a maximum of <b>20</b> shares.<br>

Pattern:

<b>(.*?)<

Flag: 3

Array[0] is 412

Array[1] is 20

:D
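As a script, the same test might look like the sketch below. The second pattern is an addition for illustration, not part of the tester output above; it assumes the words "maximum of" always precede the number you want:

```autoit
Local $sText = "<br>You currently have <b>412</b> shares.<br>You can buy a maximum of <b>20</b> shares.<br>"

; Flag 3 returns every captured group from every match in one array.
Local $aAll = StringRegExp($sText, "<b>(.*?)<", 3)
If Not @error Then
    For $i = 0 To UBound($aAll) - 1
        ConsoleWrite("[" & $i & "] " & $aAll[$i] & @CRLF)
    Next
EndIf

; Anchoring on the surrounding words picks out only the second number.
Local $aMax = StringRegExp($sText, "maximum of <b>(\d+)</b>", 1)
If Not @error Then ConsoleWrite("Maximum: " & $aMax[0] & @CRLF)
```

Anchoring on nearby text avoids depending on how many bold tags appear earlier in the page.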

dvlkn

Thanks for your help. With a bit of tweaking I managed to filter the required part ;) I used the function _IEBodyReadText instead of html.

But now I am getting another error which says "Subscript used with non-Array variable"

And this is the part that triggers it :

$sResult = StringRegExp($sText, 'You can buy a maximum of (.*?) shares', 3)
$Amount = Int ($sResult[0] / 24) * 24

Strangely, sometimes it works and sometimes it does not :S Can anyone help me clear this please

mc83

afaik, if StringRegExp does not return an array, it means it could not match anything with your pattern

i've written a couple of web crawlers in PHP, and i can tell you one thing: never trust "user" input ;) (well, it's a well known fact, actually)

maybe, in some cases, there is no _space_ before or after the number, so the pattern can't match. Maybe it would be a good idea to capture anything in between maximum of and shares and then trim the result.
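a rough sketch of that capture-then-trim idea (the sample string with odd spacing is made up for illustration, and so is the pattern):

```autoit
Local $sText = "You can buy a maximum of  20shares."  ; hypothetical irregular spacing

; Capture whatever sits between the two phrases, spaces included.
Local $aHit = StringRegExp($sText, "maximum of(.*?)shares", 1)
If @error Then
    ConsoleWrite("Pattern did not match" & @CRLF)
Else
    ; StringStripWS flag 3 = strip leading (1) + trailing (2) whitespace.
    Local $sNumber = StringStripWS($aHit[0], 3)
    ConsoleWrite("Captured: " & $sNumber & @CRLF)
EndIf
```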

you could also fetch a large number of such pages and analyze the text yourself, in order to see whether it really changes every once in a while.

anyway, i think you should use regexes on the actual html code... i believe it's easier to extract bits of text by using the tag structure of a html document

hope it helps

Radu

edit: typo

dvlkn

Thanks for the advice, I'll run a couple more of tests. I am pretty sure my coding is good though because when I used a msgbox to return a value, it was correct and worked every time.

Szhlopp

Check @error right after the StringRegExp call. Also use UBound to see how big the array is.
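Something like this sketch would do it. The $sText literal is a stand-in for whatever _IEBodyReadText returned, and the variable names are only for illustration:

```autoit
Local $sText = "You can buy a maximum of 20 shares"  ; stand-in for the page text

Local $aResult = StringRegExp($sText, "You can buy a maximum of (.*?) shares", 3)
Local $iError = @error  ; save it before any other call resets it

If $iError = 1 Then
    ; No match: $aResult is not an array, so indexing it would fail.
    ConsoleWrite("No match in the text" & @CRLF)
ElseIf $iError = 2 Then
    ConsoleWrite("Bad pattern" & @CRLF)
Else
    ConsoleWrite("Matches found: " & UBound($aResult) & @CRLF)
    Local $iAmount = Int($aResult[0] / 24) * 24
    ConsoleWrite("Amount: " & $iAmount & @CRLF)
EndIf
```

With the guard in place, a page that has not finished loading shows up as "No match" instead of the "Subscript used with non-Array variable" error.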

Try this too...

Text:

You can buy a maximum 54 shares or you could buy a minimum of 640 =P
You can buy a maximum 59shares and minimum 70.
A  maximum of 100!  shares and minimum70?.

Pattern:

(?:maximum|minimum) (?:of)?\s?([0-9]*)\s?

Flag: 3

Always returns what you want=)

dvlkn

It's working now. I just had to add a short delay, even though I was already calling _IELoadWait.
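For anyone who would rather not rely on a fixed delay, one alternative is to poll the page until the pattern matches. This is only a sketch: _WaitForCapture is a hypothetical helper name, and the timeout and 250 ms interval are arbitrary assumptions.

```autoit
#include <IE.au3>

; Hypothetical helper: re-reads the page text until the pattern matches
; or the timeout expires. Returns the first captured group; on timeout
; it returns "" and sets @error to 1.
Func _WaitForCapture($oIE, $sPattern, $iTimeoutMs = 10000)
    Local $aHit
    Local $hTimer = TimerInit()
    While TimerDiff($hTimer) < $iTimeoutMs
        $aHit = StringRegExp(_IEBodyReadText($oIE), $sPattern, 1)
        If Not @error Then Return $aHit[0]
        Sleep(250) ; give the page time to finish rendering
    WEnd
    Return SetError(1, 0, "")
EndFunc

; Usage, assuming $oIE came from _IECreate() or _IEAttach():
; Local $sShares = _WaitForCapture($oIE, "maximum of (.*?) shares")
; If Not @error Then ConsoleWrite(Int($sShares / 24) * 24 & @CRLF)
```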

Thanks everyone, this forum and autoIT rocks!

/thread.
