
ChunkV2 - For breaking up web page source code to get data


Morthawt

For example, you see a top 40 music charts web page and you want to make an AutoIt script to check it now and then and let you know when something new comes up. This UDF helps you to get the raw list of data so that you can make use of it.

Or perhaps your favourite web store has a "New Items" page and you would like to be the first to see new items. You can look at the page's source code, use your browser's find feature to locate markers that get you closer and closer to the data, and then use this UDF to pull out the raw data. You might extract two sets of data, one for the product titles and another for the URLs, so you can quickly see new products.

So many uses. I have used this on my phone carrier's forum to detect, every 10 seconds, when a new post with no replies has been made, so that I can help the new guys out instead of checking manually at random in the hope that someone made a thread while I am in the mood to help.

#include-once
#include <String.au3>
#include <Inet.au3>
Global $_Cassensitive = 1, $_StripLEADTRAILSpaceFromSrc = 2, $_StripLineFeedsFromSrc = 4, $_StripLEADTRAILTABSFromSrc = 8, $_StripConsecutiveSpacesFromResults = 16, $_StripHTMLComments = 32, $_DebugMode = 8192, $_DebugConsole, $_optionsChosen

; #FUNCTION# ====================================================================================================================
; Name ..........: ChunkV2
; Description ...: A new version of Morthawt's Chunk UDF, to break apart web page source code etc into component parts to get at particular pieces of data, such as a top 40 music charts listing by
;                   using the html formatting to gradually break the source up into a reliable way to get parts of the information.
; Syntax ........: ChunkV2($_Source, $_input, [$_options])
; Parameters ....:  $_Source = Either a BINARY variable containing webpage source code OR a STRING containing a web URL.
;                   $_input = A 2D array with 2 columns, Before and After, to capture the source between <THIS> and <THAT>. One row per zone, each row narrowing the previous result further.
;                   $_options = [OPTIONAL] ~ A number in the form of an Integer, which is the sum of all desired options/flags.
;
;                   Options:
;                   1 =     case sensitive chunking information
;                   2 =     Remove trailing and leading space
;                   4 =     Remove all line feeds (CR and LF)
;                   8 =     Remove all leading and trailing tabs
;                   16 =    Remove all duplicate repeated spaces from the end results
;                   32 =    Remove HTML Comments "<!--Message here--!>" "<!--Message here-->" (Single and multi-line)
;                   8192 =  Add verbose debugging code to the console (Can slow down the entire process quite a lot)
;
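; Return values .: Success: An array of the extracted strings.
;                  Failure: A string describing the problem, with @error set to 1 (source), 2 (zones array) or -1 (a zone failed to match; @extended holds the zone row).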
; Author ........: Morthawt
; Modified ......: 2017-07-04 13:47
; Remarks .......:
; Related .......:
; Link ..........:
; Example .......:  as follows based on the web page source code being: "<First><Second>Hello</Second></First><First><Second>Hello again</Second></First>"
;
;~ #include <includes\ChunkV2.au3>

;~ Global $Zones[2][2], $url = "https://www.morthawt.com/chunkv2-example.txt"
;~ $Zones[0][0] = "<First>"
;~ $Zones[0][1] = "</First>"

;~ $Zones[1][0] = "<Second>"
;~ $Zones[1][1] = "</Second>"

;~ $source = InetRead($url, 3) ; $source will contain a BINARY version of the source code.
;~ $Results = ChunkV2($source, $Zones, 1 + 16) ; I could substitute $source for $url and it would still work. Uses options 1 and 16
;~ If @error Then
;~  $er = @error ; Store the error code, which corresponds to the parameter most likely responsible, starting with 1 for the first parameter (-1 means a zone failed to match).
;~  $ex = @extended ; In the case of extended information, such as a particular Zone row where a failure to match occurred, store this for use in the below message box.
;~  MsgBox(0, $er & "   " & $ex, $Results) ; Since this If statement runs only on error, the result from the ChunkV2 function will be a string containing English information to help troubleshoot.
;~  Exit
;~ EndIf

;~ For $a In $Results
;~  ConsoleWrite($a & @LF)
;~ Next

;
; ===============================================================================================================================
Func ChunkV2($_Source, $_input, $_options = 0)
    $_optionsChosen = $_options
    $_DebugConsole = ((BitAND($_optionsChosen, $_DebugMode)) ? (1) : (0)) ; Sets $_DebugConsole to 1 if the debug flag (8192) makes up part of the user's options, else sets it to 0.

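    If Not IsArray($_input) Or UBound($_input, 0) <> 2 Or UBound($_input, 2) <> 2 Then ; The zones must be a 2D array with exactly 2 columns (before, after).
        SetError(2)
        $_ChunkErrMsg = 'Chunk zones are not in a 2 column array format.'
        If $_DebugConsole Then
            ConsoleWrite('!Error: ' & 2 & @CRLF & $_ChunkErrMsg & @CRLF & @CRLF)
            SetError(2) ; Set again, since @error is reset whenever a function such as ConsoleWrite is called.
        EndIf
        Return $_ChunkErrMsg
    EndIf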

    If IsBinary($_Source) Then
        $_Source = BinaryToString($_Source) ; Brings the source code of the web page back into string format.
    ElseIf IsString($_Source) Then
        If $_DebugConsole Then ConsoleWrite('Getting sourcecode for: ' & $_Source & @CRLF & @CRLF)
        $_Source = _INetGetSource($_Source, True)
        If @error Then
            SetError(1)
            $_ChunkErrMsg = 'Problem downloading webpage sourcecode.'
            If $_DebugConsole Then
                ConsoleWrite('!Error: ' & 1 & @CRLF & $_ChunkErrMsg & @CRLF & @CRLF)
                SetError(1)
            EndIf
            Return $_ChunkErrMsg
        EndIf
    Else
        SetError(1)
        $_ChunkErrMsg = 'Webpage sourcecode was neither in Binary format nor a String containing a URL.'
        If $_DebugConsole Then
            ConsoleWrite('!Error: ' & 1 & @CRLF & $_ChunkErrMsg & @CRLF & @CRLF)
            SetError(1)
        EndIf
        Return $_ChunkErrMsg
    EndIf

    If BitAND($_optionsChosen, $_StripHTMLComments) Then ; If this option is a component of the full options chosen by the user, then perform the following additional task(s)
        $_Source = StringRegExpReplace($_Source, '<!--[\s\S]*?--!?>', '') ; Removes HTML comments from the source code, handy when people leave old code there commented out, throwing unwanted entries into your results.
    EndIf

    If BitAND($_optionsChosen, $_StripLEADTRAILSpaceFromSrc) Then ; If this option is a component of the full options chosen by the user, then perform the following additional task(s)
        $_Source = StringStripWS($_Source, 3) ; Removes leading and trailing whitespace from the source as a whole (flag 3 = leading + trailing).
        If $_DebugConsole Then ConsoleWrite('Removing leading and trailing spaces.' & @CRLF & @CRLF)
    EndIf

    If BitAND($_optionsChosen, $_StripLEADTRAILTABSFromSrc) Then ; If this option is a component of the full options chosen by the user, then perform the following additional task(s)
        $_Source = StringRegExpReplace($_Source, '(?m)^\t+', '') ; Removes the leading tabs on each line, which are used to visually and logically format web sourcecode, to make getting at data easier.
        $_Source = StringRegExpReplace($_Source, '(?m)\t+$', '') ; Removes the trailing tabs on each line.
        If $_DebugConsole Then ConsoleWrite('Removing leading and trailing tabs from the webpage sourcecode.' & @CRLF & @CRLF)
    EndIf

    If BitAND($_optionsChosen, $_StripLineFeedsFromSrc) Then ; If this option is a component of the full options chosen by the user, then perform the following additional task(s)
        $_Source = StringReplace($_Source, @CR, '') ; Strips carriage returns if desired by the user's preferences/flags
        $_Source = StringReplace($_Source, @LF, '') ; Strips line feeds if desired by the user's preferences/flags
        If $_DebugConsole Then ConsoleWrite('Removing linefeeds and carriage returns from the webpage sourcecode.' & @CRLF & @CRLF)
    EndIf

    Global $_ChunksNeedingProcessing[1] = [$_Source] ;  Initiates a new array with a single entry that holds the entirety of the web page source code. After each zone, this array is replaced with all-new rows of pared-down data needing
    ;                                                   further breaking down to obtain the actual data desired by the user.

    For $_The = 0 To UBound($_input) - 1
        If $_DebugConsole Then ConsoleWrite('Processing chunk-zone: ' & $_The & @CRLF & @CRLF)
        _ProcessBetweens($_input[$_The][0], $_input[$_The][1]) ; This runs as many times as there are before/after zones needing to pare the source code down and down to reach the desired data.
        If @error Then
            SetError(-1, $_The)
            $_ChunkErrMsg = 'Error occurred with Zone array row: ' & $_The & ', this row number is documented in @extended for further use. Webpage source could have changed or there are no further results if you are checking consecutive numbered pages and you reached one that does not exist or lacks desired chunked data.'
            If $_DebugConsole Then
                ConsoleWrite('!Error: ' & - 1 & @CRLF & $_ChunkErrMsg & @CRLF & @CRLF)
                SetError(-1, $_The)
            EndIf
            Return $_ChunkErrMsg
        EndIf
    Next
    If $_DebugConsole Then ConsoleWrite('COMPLETED.' & @CRLF & @CRLF)
    Return $_ChunksNeedingProcessing ; Returns the resulting data desired  by the user, such as a top 40 list of chart music from a web page table containing a list etc.
EndFunc   ;==>ChunkV2

; #FUNCTION# ====================================================================================================================
; Name ..........: _ProcessBetweens
; Description ...: Internal UDF usage to do the heavy lifting of breaking a page's source code down into chunks.
; Syntax ........: _ProcessBetweens($_Start, $_End)
; Parameters ....: $_Start              - The string that precedes the parts you want.
;                  $_End                - The string that terminates the parts you want.
; Return values .: None. Sets @error to 1 if a chunk contains no match for the given zone.
; Author ........: Morthawt
; Modified ......:
; Remarks .......: This is iteratively called to narrow down the code, paring parts away to provide an array listing the data you wanted from the page.
; Related .......:
; Link ..........:
; Example .......: No
; ===============================================================================================================================
Func _ProcessBetweens($_Start, $_End) ; Uses an already established array which contains the current break-down-level of the source code and uses provided before/after zones to further break the source down.
    Local $_tempBuffer[0]

    For $_go In $_ChunksNeedingProcessing
        $_tmp = _StringBetween($_go, $_Start, $_End, Default, ((BitAND($_optionsChosen, $_Cassensitive)) ? (1) : (0))) ; Case sensitive is set to 1 if the option for this was part of the ChunkV2's options. Else it will be set to 0

        If Not @error Then
            For $_go2 In $_tmp
                If BitAND($_optionsChosen, $_StripLEADTRAILSpaceFromSrc) Then ; If this option is a component of the full options chosen by the user, then perform the following additional task(s)
                    $_go2 = StringStripWS($_go2, 3) ; Newly added to remove lead/trail spaces from results that weren't resolved by removing them from the raw page sourcecode.
                    If $_DebugConsole Then ConsoleWrite('Removing leading and trailing spaces from the final results.' & @CRLF & @CRLF)
                EndIf

                If BitAND($_optionsChosen, $_StripLEADTRAILTABSFromSrc) Then ; If this option is a component of the full options chosen by the user, then perform the following additional task(s)
                    $_go2 = StringRegExpReplace($_go2, '(?m)^\t+', '') ; Newly added to remove leading tabs from results that weren't resolved by removing them from the raw page sourcecode.
                    $_go2 = StringRegExpReplace($_go2, '(?m)\t+$', '') ; Newly added to remove trailing tabs from results that weren't resolved by removing them from the raw page sourcecode.
                    If $_DebugConsole Then ConsoleWrite('Removing leading and trailing tabs from the final results that weren''t solved via removing them from the source.' & @CRLF & @CRLF)
                EndIf ; The above additions for space/tab could be removed if I can process the web sourcecode better MAYBE.

                If BitAND($_optionsChosen, $_StripConsecutiveSpacesFromResults) Then ; If this option is a component of the full options chosen by the user, then perform the following additional task(s)
                    If $_DebugConsole Then ConsoleWrite('Removing consecutive spaces.' & @CRLF & @CRLF)
                    $_go2 = StringRegExpReplace($_go2, '\s+', ' ')
                EndIf

                ReDim $_tempBuffer[UBound($_tempBuffer) + 1]
                $_tempBuffer[UBound($_tempBuffer) - 1] = $_go2
            Next

        Else ; If there was an error breaking the source code into chunks (such as no desired strings detected or the source was missing valid zones), throw an error up that can be detected in the main ChunkV2 function and sent back to the calling script.
            SetError(1)
            Return
        EndIf
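    Next
    $_ChunksNeedingProcessing = $_tempBuffer ; Swap in the narrowed results only once every chunk has been processed, rather than reassigning the array while it is still being iterated.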

EndFunc   ;==>_ProcessBetweens

Example of use: (as follows based on the web page source code being: "<First><Second>Hello</Second></First><First><Second>Hello again</Second></First>")

#include <includes\ChunkV2.au3>

Global $Zones[2][2], $url = "https://www.morthawt.com/chunkv2-example.txt"
$Zones[0][0] = "<First>"
$Zones[0][1] = "</First>"

$Zones[1][0] = "<Second>"
$Zones[1][1] = "</Second>"

$source = InetRead($url, 3) ; $source will contain a BINARY version of the source code.
$Results = ChunkV2($source, $Zones, 1 + 16) ; I could substitute $source for $url and it would still work. Uses options 1 and 16
If @error Then
    $er = @error ; Store the error code, which corresponds to the parameter most likely responsible, starting with 1 for the first parameter (-1 means a zone failed to match).
    $ex = @extended ; In the case of extended information, such as a particular Zone row where a failure to match occurred, store this for use in the below message box.
    MsgBox(0, $er & "   " & $ex, $Results) ; Since this If statement runs only on error, the result from the ChunkV2 function will be a string containing English information to help troubleshoot.
    Exit
EndIf

For $a In $Results
    ConsoleWrite($a & @LF)
Next

It may not be the most elegant code, but so far it is working. This is an upgraded version, written from scratch, of the Chunk UDF that I wrote a long time ago; I think I posted that here somewhere too.

Updated: I added a new flag/option 32 which removes comments from the HTML, to prevent non-relevant "results" from showing up when people just comment out, for example, products no longer sold.

Updated: I have simplified the user's options/flags using BitAND rather than arithmetically processing them to whittle them down.
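As a rough illustration of the BitAND approach mentioned above, here is a minimal sketch. The flag values are the ones from the ChunkV2 header; the variable names are made up for this example and are not the ones the UDF uses.

```autoit
; Flag values copied from the ChunkV2 header; the variable names are illustrative only.
Local Const $iCaseSensitive = 1, $iStripLineFeeds = 4, $iCollapseSpaces = 16

; Summing distinct powers of two gives one integer that carries every chosen option.
Local $iOptions = $iCaseSensitive + $iCollapseSpaces

; BitAND returns the flag's own value when it is part of the sum, otherwise 0.
ConsoleWrite('Case sensitive:  ' & (BitAND($iOptions, $iCaseSensitive) <> 0) & @CRLF)
ConsoleWrite('Strip linefeeds: ' & (BitAND($iOptions, $iStripLineFeeds) <> 0) & @CRLF)
ConsoleWrite('Collapse spaces: ' & (BitAND($iOptions, $iCollapseSpaces) <> 0) & @CRLF)
```

Because each option is a distinct power of two, a single BitAND per option is enough and no subtraction or ordering of checks is needed.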

Updated the format of calling the function, to make it a single line call as you would normally expect to call a function. I did away with needing static variables or arrays. I have added an option for verbose debugging to console.

Updated to include the removal of repeated         spaces anywhere in each resulting array entry, as I was having problems with this on yet another site. Also added some error detection, so that I can, for example, parse any number of successive pages on a site and react when an error occurs (because either the page source is different or there are no "results"). At that point I can either end the script or move on. For example, if each section of a site is categorised by letter with numbered pages per letter, such as 15 pages of A's, then once the script gets to page 16 and ChunkV2 errors out because it failed to pull results, the script can check @error and ExitLoop to continue with B, restarting the page count from 0.
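The paging idea could be sketched roughly like this. To keep it runnable offline, the "pages" are made-up strings held in an array and converted with StringToBinary; a real script would build a URL per page number instead. The include path is the one used in the example in this post and may differ on your setup.

```autoit
#include "includes\ChunkV2.au3" ; Path as used in the example in this post; adjust to your setup.

; Stand-in pages (hypothetical data). The last one has no matching zones, like a page past the end.
Local $aPages[3] = ['<b>Alpha</b><b>Avocado</b>', '<b>Apple</b>', 'No results found']

Local $aZones[1][2] = [['<b>', '</b>']]
Local $iPage = 0, $vResults

While $iPage < UBound($aPages)
    $vResults = ChunkV2(StringToBinary($aPages[$iPage]), $aZones)
    If @error Then
        ; On failure ChunkV2 returns an English message instead of an array.
        ConsoleWrite('Stopping at page ' & $iPage & ': ' & $vResults & @CRLF)
        ExitLoop
    EndIf
    For $sItem In $vResults
        ConsoleWrite($sItem & @CRLF)
    Next
    $iPage += 1
WEnd
```

An outer loop over letters (A, B, C and so on) could wrap this, resetting $iPage to 0 each time the inner loop exits.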

Updated to process the removal of leading and trailing spaces or tabs from each individual result, because my tests on http://uk.ign.com/games/pc?sortBy=title&sortOrder=asc&startIndex=0 were showing tons of leading and trailing spaces on the results, even though the source had all trailing and leading spaces supposedly removed from it.

 

Edited by Morthawt

Updated and added some information.

Basics of updates are:
An option modification to remove leading/trailing spaces from the end results as well because I was having issues with such spacing in results.
A new option to remove repeated       spaces from results, because again I ran into this problem.
Added some error detection, details in first post on examples of using it.
 


Updated: I have simplified the user's options/flags using BitAND rather than arithmetically processing them to whittle them down. I have tested it and it appears to be working well. Maybe it will slightly increase the speed at which it operates but it has certainly dropped the line count.


I have updated and added a new option, 32. I was having problems with some product pages where the site owner comments out whole product blocks for items that are no longer sold or are entirely out of stock. My program that scans for new products was showing tons of items that appear in the source code, all formatted nicely, but that aren't actually "there". Worse, if he later uncommented those products I would not get notified, because ChunkV2 had technically already seen the items listed in the page's source code.

I had to google for the regex string and add a !? on the end because the guy's site I am interacting with finished some comments with --!> and sometimes -->
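To show what that pattern does on its own, here is a small sketch. The markup is invented for illustration; the pattern is the one ChunkV2 uses for option 32.

```autoit
; Made-up markup: one single-line comment ending in --> and one multi-line comment ending in --!>
Local $sHtml = '<li>Knife A</li><!-- <li>Old knife</li> -->' & @CRLF & _
        '<li>Knife B</li><!--' & @CRLF & '<li>Discontinued</li>' & @CRLF & '--!>'

; [\s\S]*? matches lazily across line breaks; !? accepts both the --> and --!> endings.
Local $sClean = StringRegExpReplace($sHtml, '<!--[\s\S]*?--!?>', '')
ConsoleWrite($sClean & @CRLF)
```

The lazy quantifier matters here: a greedy `[\s\S]*` would swallow everything from the first comment opener to the last comment closer, including the live markup in between.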

By the way, here is a real-world example of using ChunkV2

#include <includes\ChunkV2.au3>
#include <Array.au3>
AutoItSetOption('TrayAutoPause', 0)

Global $Zones[4][2], $Zones2[2][2], $PreviouslySeen[0], $url = "https://www.heinnie.com/all-products"
$PreviouslySeen = FileReadToArray('productlist.txt')

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

$Zones[0][0] = '<ul id="productlist"'
$Zones[0][1] = '</ul>'

$Zones[1][0] = '<li class="item">'
$Zones[1][1] = '</li>'

$Zones[2][0] = '<h2 class="product-name">'
$Zones[2][1] = '</h2>'

$Zones[3][0] = '" title="'
$Zones[3][1] = '">'

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

$Zones2[0][0] = '<ul id="productlist"'
$Zones2[0][1] = '</ul>'

$Zones2[1][0] = '<li class="item">'
$Zones2[1][1] = '</li>'

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

While 1

    $source = _INetGetSource($url, False)
    $Results = ChunkV2($source, $Zones) ; Returns raw product names
    $Results2 = ChunkV2($source, $Zones2) ; Returns the entire block of code for each product, so the count will be the same as the product names above. (no subscript errors)
    Local $new = False
    For $a = 0 To UBound($Results) - 1
        $product = ((StringInStr($Results2[$a], '<div class="ukfriendlycarry">')) ? ('(UK Legal)' & @TAB) : (@TAB & @TAB & @TAB)) & $Results[$a]
        $test = _ArraySearch($PreviouslySeen, $product)
        If $test < 0 Then
            $new = True
            ConsoleWrite($product & @LF)
            FileWriteLine('productlist.txt', $product)
            If Not IsArray($PreviouslySeen) Then Global $PreviouslySeen[0]
            ReDim $PreviouslySeen[UBound($PreviouslySeen) + 1]
            $PreviouslySeen[UBound($PreviouslySeen) - 1] = $product
        EndIf
    Next
    If $new Then Beep(500, 1000)

    Sleep(300000)
WEnd

 


  • 2 weeks later...

You're welcome. Glad there are people out there finding this capability as useful as I have over the years. It has really helped me when APIs are not available and pages need breaking down, as well as with APIs where JSON information needs breaking down so I can get a particular parameter's data from it.
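For the JSON case, a rough sketch might look like this. The JSON text and its field names are invented for the example, and the include path is the one used earlier in this thread. Note that this is plain text matching between two markers, not real JSON parsing, so it only suits simple, predictable responses.

```autoit
#include "includes\ChunkV2.au3" ; Path as used in the examples in this thread; adjust to your setup.

; Made-up API response so the sketch runs offline.
Local $sJson = '{"pair":"BTCUSD","last":"42000.50","volume":"12.3"}'

; One zone: everything between "last":" and the next double quote.
Local $aZones[1][2] = [['"last":"', '"']]

Local $vResult = ChunkV2(StringToBinary($sJson), $aZones, 1) ; Option 1 = case-sensitive matching
If @error Then
    ConsoleWrite('ChunkV2 failed: ' & $vResult & @CRLF)
Else
    ConsoleWrite('Last price: ' & $vResult[0] & @CRLF)
EndIf
```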

I've used it for getting bitcoin exchange rates, finding new products on websites, spotting a new entry in a top 40 chart, seeing when show notes for a new podcast are added to a site, getting alerted when the price of a product online goes down, and getting alerted to new threads in a "need help" section of my phone carrier's website. I have gotten endless use out of my original Chunk. This new ChunkV2 is so much nicer.

If the site gets updated and your chunk zones no longer work, all you need to do is just make new chunk zones that will re-acquire the chunks you need and the rest of your code should be fine still.

