Parsing HTML


dvlkn

Hi

Just wondering if it's possible to parse HTML. I am trying to pick up a number from a webpage for further calculations, for example:

<html>You can buy <b>8</b> bla bla bla </html>

I looked in the help file but could not find a function that does the job, or a way to combine the existing ones to do it.

I know about _IEBodyReadHTML and _IEBodyReadText, but how can I get only the required part rather than the whole body? Any help appreciated.


Hmm, I am not familiar with regex and am having a difficult time parsing the bold part in: "You can buy *random number here* shares". Can someone help me with the expression, please?

StringRegExpReplace tester =)

#include <GUIConstants.au3>
#include <EditConstants.au3>
;


$hMainGUI = GUICreate("StringRegExReplace Tester", 574, 454, 245, 130)
$hMainEdit = GUICtrlCreateEdit("", 0, 0, 573, 182)
GUICtrlCreateGroup("Input", 2, 182, 569, 270)
$hPatternInput = GUICtrlCreateInput("", 70, 206, 150, 21)
$hReplaceInput = GUICtrlCreateInput("", 70, 236, 150, 21)
GUICtrlCreateLabel("Pattern:", 12, 208, 41, 17)
GUICtrlCreateLabel("Replace:", 12, 239, 47, 17)
$hCountInput = GUICtrlCreateInput("0", 70, 266, 150, 21)
GUICtrlCreateLabel("Count:", 12, 269, 35, 17)
$hReturn = GUICtrlCreateEdit("", 5, 326, 560, 121 )
GUICtrlSendMsg($hReturn, $EM_SETREADONLY, -1, 0)
GUICtrlCreateLabel("@Error:", 385, 206, 40, 17)
GUICtrlCreateLabel("@Extended:", 385, 234, 63, 17)
$hErrorInput = GUICtrlCreateInput("", 454, 204, 62, 21)
GUICtrlSetState(-1, $GUI_DISABLE)
$hExtendedInput = GUICtrlCreateInput("", 454, 230, 62, 21)
GUICtrlSetState(-1, $GUI_DISABLE)
$hTestButton = GUICtrlCreateButton("RegExReplace", 231, 298, 108, 26, 0)
GUICtrlCreateGroup("", -99, -99, 1, 1)
GUISetState(@SW_SHOW)


While 1
    $nMsg = GUIGetMsg()
    Switch $nMsg
        Case $GUI_EVENT_CLOSE
            Exit
        Case $hTestButton
            ;Sleep(3000)
            $sRegEx = StringRegExpReplace(GUICtrlRead($hMainEdit), GUICtrlRead($hPatternInput), GUICtrlRead($hReplaceInput) , GUICtrlRead($hCountInput))
            $iError = @error
            $iExt = @extended
            
            If $iError = 2 Then 
                GUICtrlSetState($hPatternInput, $GUI_FOCUS)
                GUICtrlSendMsg($hPatternInput, $EM_SETSEL, $iExt - 2 , $iExt - 1)
                
            EndIf
            
            GUICtrlSetData($hErrorInput, $iError)
            GUICtrlSetData($hExtendedInput, $iExt)
            GUICtrlSetData($hReturn, $sRegEx)
            
            
    EndSwitch
WEnd

#include <GUIConstants.au3>
#include <EditConstants.au3>
#include <WindowsConstants.au3>
#include <ListViewConstants.au3>
#include <GuiListView.au3>

$GUI = GUICreate("RegEx Tester", 622, 450)
$Group1 = GUICtrlCreateGroup("Input Parameters", 16, 16, 569, 185)
    $String = GUICtrlCreateEdit("", 88, 48, 465, 97)
    $Test = GUICtrlCreateButton("Test", 464, 160, 89, 25, 0)
    $Pattern = GUICtrlCreateInput("", 96, 160, 137, 21)
    $Flag = GUICtrlCreateInput("", 320, 160, 73, 21, BitOR($ES_AUTOHSCROLL,$ES_NUMBER))
    GUICtrlCreateLabel("String:", 32, 88, 34, 17)
    GUICtrlCreateLabel("Pattern:", 32, 160, 41, 17)
    GUICtrlCreateLabel("Flag", 272, 162, 24, 17)
GUICtrlCreateGroup("", -99, -99, 1, 1)
$Group2 = GUICtrlCreateGroup("Output", 16, 216, 577, 217)
    $Group3 = GUICtrlCreateGroup("Return Values", 328, 240, 249, 129)
        GUICtrlCreateLabel("Return: ", 344, 264, 42, 17)
        GUICtrlCreateLabel("@Extended", 344, 296, 60, 17)
        GUICtrlCreateLabel("@error", 344, 328, 36, 17)
        $Return = GUICtrlCreateInput("", 408, 264, 137, 21, BitOR($ES_AUTOHSCROLL,$ES_READONLY))
        $Extended = GUICtrlCreateInput("", 408, 296, 137, 21, BitOR($ES_AUTOHSCROLL,$ES_READONLY))
        $Error = GUICtrlCreateInput("", 408, 328, 137, 21, BitOR($ES_AUTOHSCROLL,$ES_READONLY))
    GUICtrlCreateGroup("", -99, -99, 1, 1)
    $ListView = GUICtrlCreateListView("Element|Data", 40, 240, 257, 177, -1, BitOR($WS_EX_CLIENTEDGE,$LVS_EX_GRIDLINES,$LVS_EX_FULLROWSELECT))
GUICtrlCreateGroup("", -99, -99, 1, 1)
GUISetState(@SW_SHOW)


While 1
    $nMsg = GUIGetMsg()
    Switch $nMsg
        Case $GUI_EVENT_CLOSE
            Exit
        Case $Test
            If GuiCtrlRead($String) = "" or GUICtrlRead($Pattern) = "" Then
                MsgBox(0,"Error", "Please fill in all required parameters.")
            Else
                If GUICtrlRead($Flag) = "" then 
                    Process(GUICtrlRead($String), GuiCtrlRead($Pattern))
                Else
                    Process(GUICtrlRead($String), GuiCtrlRead($Pattern), GUICtrlRead($Flag))
                EndIf
            EndIf
    EndSwitch
WEnd

Func Process($String, $Pattern, $Flag=0)
    _GUICtrlListView_DeleteAllItems(GUICtrlGetHandle($ListView))
    $Result = StringRegExp($String,$Pattern,$Flag)
    $Err = @error
    $Ext = @extended
    If IsArray($Result) then
        Dim $Item[UBound($Result)]
        For $x = 0 to UBound($Result)-1
            $Item[$x] = GUICtrlCreateListViewItem("["&$x&"]|"&$Result[$x],$ListView)
        Next
        GUICtrlSetData($Return, "Array")
        GUICtrlSetData($Extended, $Ext)
        GUICtrlSetData($Error, $Err)
    Else
        GUICtrlSetData($Return, $Result)
        GUICtrlSetData($Extended, $Ext)
        GUICtrlSetData($Error, $Err)
    EndIf
EndFunc
Edited by Szhlopp

Thanks. Using the program I could only do half the job so I still need some help:

This is the part of the text I need to parse:

<br>You currently have <b>412</b> shares.<br>You can buy a maximum of <b>20</b> shares.<br>

With the pattern ([0-9]{1,3})(?:</b> shares) and flag 1, it returns 412</b> shares ;) What I actually need is the second number, 20 in this example.

Heh here ya go=)

Text used: <br>You currently have <b>412</b> shares.<br>You can buy a maximum of <b>20</b> shares.<br>

Pattern:

<b>(.*?)<

Flag: 3

Array[0] is 412

Array[1] is 20

:D
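
For anyone who wants to try this outside the tester GUI, here is a minimal sketch of the same pattern and flag in a script. The variable names are made up for illustration, and $sHtml is just a hard-coded copy of the text above; in a real script it would presumably come from something like _IEBodyReadHTML.

```autoit
; Sketch only: $sHtml is a hard-coded copy of the fragment posted above.
Local $sHtml = '<br>You currently have <b>412</b> shares.<br>You can buy a maximum of <b>20</b> shares.<br>'

; Flag 3 returns an array holding the captured group of every match.
Local $aNumbers = StringRegExp($sHtml, '<b>(.*?)<', 3)
Local $iErr = @error ; save @error before any other call can overwrite it

If $iErr Then
    ConsoleWrite('No match or bad pattern, @error = ' & $iErr & @CRLF)
Else
    For $i = 0 To UBound($aNumbers) - 1
        ConsoleWrite('[' & $i & '] ' & $aNumbers[$i] & @CRLF)
    Next
EndIf
```

With the text above, $aNumbers[1] holds the second number, which is the one asked for.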

Edited by Szhlopp

Thanks for your help. With a bit of tweaking I managed to filter the required part ;) I used the function _IEBodyReadText instead of _IEBodyReadHTML.

But now I am getting another error, which says "Subscript used with non-Array variable".

And this is the part that triggers it :

$sResult = StringRegExp($sText, 'You can buy a maximum of (.*?) shares', 3)
$Amount = Int ($sResult[0] / 24) * 24

Strangely, sometimes it works and sometimes it does not :S Can anyone help me clear this up, please?

Edited by dvlkn

afaik, if StringRegExp does not return an array, it means it could not match anything with your pattern (it also sets @error in that case)

i've written a couple of web crawlers in PHP, and i can tell you one thing: never trust "user" input ;) (well, it's a well known fact, actually)

maybe, in some cases, there is no _space_ before or after the number, so the pattern can't match. Maybe it would be a good idea to capture anything in between "maximum of" and "shares" and then trim the result.
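
A rough sketch of that capture-then-trim idea, assuming a made-up sample string with no space before the number and two after it (the sample text and variable names are hypothetical, not taken from the real page):

```autoit
; Hypothetical sample: no space before the number, two spaces after it.
Local $sText = 'You can buy a maximum of20  shares.'

; Capture whatever sits between "maximum of" and "shares", spaces included.
Local $aMatch = StringRegExp($sText, 'maximum of(.*?)shares', 3)

Local $sNumber = ''
If IsArray($aMatch) Then
    ; StringStripWS flag 3 = strip leading (1) + trailing (2) whitespace.
    $sNumber = StringStripWS($aMatch[0], 3)
    ConsoleWrite('Captured: "' & $sNumber & '"' & @CRLF)
EndIf
```

Because the spaces are no longer part of the fixed pattern text, the match does not depend on exactly one space being present on either side.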

you could also fetch a large number of such pages and analyze the text yourself, in order to see whether it really changes every once in a while.

anyway, i think you should use regexes on the actual html code... i believe it's easier to extract bits of text by using the tag structure of an html document

hope it helps

Radu

edit: typo

Edited by mc83

Thanks for the advice, I'll run a couple more tests. I am pretty sure my coding is good though, because when I used a MsgBox to show the value, it was correct and worked every time.

Check @error right after the StringRegExp call. Also use UBound to see how big the array is.
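
Something along these lines, as a sketch only. _GetBuyAmount is a hypothetical helper name, and the pattern and the round-down-to-24 arithmetic are copied from the snippet posted earlier in the thread:

```autoit
; _GetBuyAmount is a hypothetical helper, not part of the original script.
; Returns the amount rounded down to a multiple of 24, or sets @error = 1
; (with the StringRegExp error code in @extended) when nothing matched.
Func _GetBuyAmount($sText)
    Local $aResult = StringRegExp($sText, 'You can buy a maximum of (.*?) shares', 3)
    Local $iErr = @error
    If $iErr Or Not IsArray($aResult) Or UBound($aResult) < 1 Then
        Return SetError(1, $iErr, 0)
    EndIf
    Return Int(Number($aResult[0]) / 24) * 24
EndFunc

; Usage with a made-up sample string.
Local $iAmount = _GetBuyAmount('You can buy a maximum of 100 shares.')
If @error Then
    ConsoleWrite('Pattern did not match, nothing to calculate.' & @CRLF)
Else
    ConsoleWrite('Amount: ' & $iAmount & @CRLF)
EndIf
```

The point is that $aResult[0] is only touched after the checks pass, so a page where the text is missing or worded differently no longer triggers the "Subscript used with non-Array variable" error.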

Try this too...

Text:

You can buy a maximum 54 shares or you could buy a minimum of 640 =P
You can buy a maximum 59shares and minimum 70.
A  maximum of 100!  shares and minimum70?.

Pattern:

(?:maximum|minimum) (?:of)?\s?([0-9]*)\s?

Flag: 3

Always returns what you want=)
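
One caveat, as far as I can tell: the pattern as posted needs a literal space right after maximum/minimum, so "minimum70" in the third line is skipped, and [0-9]* is allowed to capture an empty string. Below is a slightly more tolerant variant as a sketch; the \s* and [0-9]+ changes are my own assumption about what is wanted, not part of the pattern above.

```autoit
; Same three sample lines as above, joined into one string.
Local $sText = 'You can buy a maximum 54 shares or you could buy a minimum of 640 =P' & @CRLF & _
        'You can buy a maximum 59shares and minimum 70.' & @CRLF & _
        'A  maximum of 100!  shares and minimum70?.'

; \s* allows zero or more whitespace characters; [0-9]+ requires at least one digit.
Local $aAll = StringRegExp($sText, '(?:maximum|minimum)\s*(?:of)?\s*([0-9]+)', 3)

If IsArray($aAll) Then
    For $i = 0 To UBound($aAll) - 1
        ConsoleWrite('[' & $i & '] ' & $aAll[$i] & @CRLF)
    Next
EndIf
```

On this sample it should pick up all six numbers, including the 70 glued to "minimum".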
