Morthawt

ChunkV2 - For breaking up web page source code to get data

For example, say you find a top 40 music chart web page and want an AutoIt script that checks it now and then and lets you know when something new comes up. This UDF helps you get the raw list of data so that you can make use of it.

Or perhaps your favourite web store has a "New Items" page and you would like to be the first to see new items. Look at the page's source code and use your browser's find feature to locate markers that get you closer and closer to the data, then use this UDF to pull the raw data out. You might pull two sets of data, one for the product titles and another for the URLs, so you can quickly see new products.
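
The "two sets of data" idea can be sketched roughly as follows. This is only an illustration: the HTML snippet, the zone strings and the file name ChunkV2.au3 are assumptions made up for the example, not taken from any real store. The source is embedded as a string and converted with StringToBinary so the sketch can be tried without a network connection.

```autoit
#include "ChunkV2.au3" ; Assumption: the UDF from this post has been saved under this file name.

; Hypothetical "New Items" markup, embedded so the sketch runs offline.
Local $sHtml = '<li class="item"><a href="/p/101">Blue Widget</a></li>' & @CRLF & _
        '<li class="item"><a href="/p/102">Red Widget</a></li>'
Local $bSource = StringToBinary($sHtml) ; ChunkV2 only parses input that is binary.

; Set 1: product titles. Narrow to each list item, then to the text between "> and </a>.
Local $aTitleZones[2][2] = [['<li class="item">', '</li>'], ['">', '</a>']]
ChunkV2($aTitleZones)
Local $aTitles = ChunkV2($bSource)
If @error Then Exit 1

; Set 2: product URLs. The zones are supplied again because they are cleared after each parse.
Local $aUrlZones[1][2] = [['<a href="', '">']]
ChunkV2($aUrlZones)
Local $aUrls = ChunkV2($bSource)
If @error Then Exit 1

; Pair the two sets up by index, assuming both passes found the same number of items.
If UBound($aTitles) = UBound($aUrls) Then
    For $i = 0 To UBound($aTitles) - 1
        ConsoleWrite($aTitles[$i] & ' -> ' & $aUrls[$i] & @LF)
    Next
EndIf
```

On a real page the zone strings would come from whatever markers surround the data in that page's source.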

There are many uses. I have used this on my phone carrier's forum to detect, every 10 seconds, when a new un-replied-to post has been made, so that I can help the new guys out instead of manually checking at random in the hope that someone made a thread while I am in the mood to help.

#include-once
#include <String.au3>
Global Static $_Cassensitive, $_StripLEADTRAILSpaceFromSrc, $_StripLineFeedsFromSrc, $_StripLEADTRAILTABSFromSrc, $_StripConsecutiveSpacesFromResults

; #FUNCTION# ====================================================================================================================
; Name ..........: ChunkV2
; Description ...: A new version of Morthawt's Chunk UDF, to break apart web page source code etc into component parts to get at particular pieces of data, such as a top 40 music charts listing by
;                   using the html formatting to gradually break the source up into a reliable way to get parts of the information.
; Syntax ........: ChunkV2($_input)
; Parameters ....:  $_input ~ An array input contains the before/after bits of strings to get closer and closer to the desired data.
;                   A binary input is treated as web source code to be processed.
;                   A number input is seen as options to configure how the function will operate.
;
;                   Prior to feeding it the binary source code, you can issue some options by adding the flags together:
;
;                   1 =  case sensitive chunking information
;                   2 =  Remove trailing and leading space
;                   4 =  Remove all line feeds (CR and LF)
;                   8 =  Remove all leading and trailing tabs
;                   16 = Remove all duplicate repeated spaces from the end results
;
;                   These flags are issued as a single parameter for the function ChunkV2 prior to feeding it the binary source code, e.g. ChunkV2(1 + 4)
;                   Error reporting is basic: if a zone fails to match, @error is set to 1 and @extended holds the row number of the failing zone.
;
; Return values .: Success: an array of the extracted strings.
;                  Failure: an error message string, with @error set to 1 and @extended set to the failing zone row.
; Author ........: Morthawt
; Modified ......: 2017-06-21 18:01
; Remarks .......:
; Related .......:
; Link ..........:
; Example .......:  as follows based on the web page source code being: "<First><Second>Hello</Second></First><First><Second>Hello again</Second></First>"
;
;                   Global $Chunky[2][2]
;                   $Chunky[0][0] = "<First>"
;                   $Chunky[0][1] = "</First>"
;
;                   $Chunky[1][0] = "<Second>"
;                   $Chunky[1][1] = "</Second>"
;
;                   ChunkV2($Chunky)
;                   $source = InetRead('https://www.morthawt.com/chunkv2-example.txt', 3) ; $source will contain a BINARY version of the source code, which alerts ChunkV2 to process it correctly since it is known what this data format is for.
;                   $Results = ChunkV2($source)
;                   For $a in $Results
;                       ConsoleWrite($a & @LF)
;                   Next
;
; ===============================================================================================================================
Func ChunkV2($_input)
    Local Static $_Zones[0][2]

    Select
        Case IsArray($_input) ; An array input is treated as the chunks of information used to pull apart the web source code into component parts to get data from.
            $_Zones = $_input ; Creates a static Array for use with the next call(s) of the ChunkV2 function.

        Case IsBinary($_input) ; A binary input is the actual source code of a web page (or other data presented as binary), which is processed with the previously established chunk zones (start and end) to obtain the data desired by the user.
            $_input = BinaryToString($_input) ; Brings the source code of the web page back into string format.

            If $_StripLEADTRAILSpaceFromSrc = 1 Then $_input = StringStripWS($_input, 3) ; Removes leading and trailing spaces if user wants

            If $_StripLEADTRAILTABSFromSrc = 1 Then ; Removes leading and trailing tabs that are used to visually and logically format web sourcecode to make getting at data easier.
                $_input = StringRegExpReplace($_input, '(?m)^\t+', '')
                $_input = StringRegExpReplace($_input, '(?m)\t+$', '')
            EndIf

            If $_StripLineFeedsFromSrc = 1 Then ; Strips line feeds and carriage returns if desired by the user's preferences/flags
                $_input = StringReplace($_input, @CR, '')
                $_input = StringReplace($_input, @LF, '')
            EndIf

            Global $_ChunksNeedingProcessing[1] = [$_input] ;   Initiates a new array with a single entry that holds the entirety of the web page source code. This array is replaced after each zone pass with new rows of pared-down data needing
            ;                                                   further breaking down to obtain the actual data desired by the user.

            For $_The = 0 To UBound($_Zones) - 1
                _ProcessBetweens($_Zones[$_The][0], $_Zones[$_The][1]) ; This runs as many times as there are before/after zones needing to pare the source code down and down to reach the desired data.
                If @error Then
                    Return SetError(1, $_The, 'Error occurred with Zone array row: ' & $_The & ', this row number is documented in @extended for further use')
                EndIf
            Next

            ReDim $_Zones[0][2] ; Clears the zones for next use, keeping the two-column shape the array was declared with
            $_Cassensitive = False ; Resets the variable for next use
            $_StripLEADTRAILSpaceFromSrc = 0 ; Resets the variable for next use
            $_StripLineFeedsFromSrc = 0 ; Resets the variable for next use
            $_StripLEADTRAILTABSFromSrc = 0 ; Resets the variable for next use
            $_StripConsecutiveSpacesFromResults = 0 ; Resets the variable for next use
            Return $_ChunksNeedingProcessing ; Returns the resulting data desired  by the user, such as a top 40 list of chart music from a web page table containing a list etc.

        Case IsNumber($_input) ; This is where the user's preferences/flags are detected and processed.
            ; Each flag is tested independently with BitAND, so any combination of 1, 2, 4, 8 and 16 is decoded correctly.
            If BitAND($_input, 16) Then $_StripConsecutiveSpacesFromResults = 1 ; Strip all duplicate, consecutive spaces from the end result chunks of desired data?
            If BitAND($_input, 8) Then $_StripLEADTRAILTABSFromSrc = 1 ; Strip all LEADING/TRAILING tabs from the source code?
            If BitAND($_input, 4) Then $_StripLineFeedsFromSrc = 1 ; Strip all line feeds of any kind from the source code?
            If BitAND($_input, 2) Then $_StripLEADTRAILSpaceFromSrc = 1 ; Strip all LEADING/TRAILING spaces from the source code?
            If BitAND($_input, 1) Then $_Cassensitive = True ; Case sensitive _StringBetween? (Default False)

    EndSelect
EndFunc   ;==>ChunkV2

; #FUNCTION# ====================================================================================================================
; Name ..........: _ProcessBetweens
; Description ...: Internal UDF usage to do the heavy lifting of breaking a page's source code down into chunks.
; Syntax ........: _ProcessBetweens($_Start, $_End)
; Parameters ....: $_Start              - The string that precedes the parts you want.
;                  $_End                - The string that terminates the parts you want.
; Return values .: None. Sets @error to 1 if a chunk contains no match for the given zone.
; Author ........: Morthawt
; Modified ......:
; Remarks .......: This is iteratively called to narrow down the code, paring parts away to provide an array listing the data you wanted from the page.
; Related .......:
; Link ..........:
; Example .......: No
; ===============================================================================================================================
Func _ProcessBetweens($_Start, $_End) ; Uses an already established array which contains the current break-down-level of the source code and uses provided before/after zones to further break the source down.
    Local $_tempBuffer[0]

    For $_go In $_ChunksNeedingProcessing
        Local $_tmp = _StringBetween($_go, $_Start, $_End, Default, $_Cassensitive) ; Declared Local so the temporary result does not become a global variable.
        If Not @error Then
            For $_go2 In $_tmp
                If $_StripLEADTRAILSpaceFromSrc = 1 Then $_go2 = StringStripWS($_go2, 3) ; Newly added to remove lead/trail spaces from results that weren't resolved by removing them from the raw page sourcecode.
                If $_StripLEADTRAILTABSFromSrc = 1 Then ; Newly added to remove lead/trail tabs from results that weren't resolved by removing them from the raw page sourcecode.
                    $_go2 = StringRegExpReplace($_go2, '(?m)^\t+', '')
                    $_go2 = StringRegExpReplace($_go2, '(?m)\t+$', '')
                EndIf ; The above additions for space/tab could be removed if I can process the web sourcecode better MAYBE.

                If $_StripConsecutiveSpacesFromResults = 1 Then
                    $_go2 = StringRegExpReplace($_go2, '\s+', ' ')
                EndIf

                ReDim $_tempBuffer[UBound($_tempBuffer) + 1]
                $_tempBuffer[UBound($_tempBuffer) - 1] = $_go2
            Next

        Else ; If there was an error breaking the source code into chunks (such as no desired strings detected or the source was missing valid zones), throw an error up that can be detected in the main ChunkV2 function and sent back to the calling script.
            Return SetError(1)
        EndIf
    Next
    $_ChunksNeedingProcessing = $_tempBuffer ; Swap in the narrowed results once, after every chunk has been processed.

EndFunc   ;==>_ProcessBetweens

It may not be the most elegant code, but so far it is working. This is an upgraded version, written from scratch, of the Chunk UDF that I wrote a long time ago; I think I posted that here somewhere too.

Updated to include the removal of repeated spaces anywhere in each resulting array entry, as I was having problems with this on yet another site. Also added some error detection, so that I can, for example, parse an open-ended number of successive pages on a site and react when an error occurs (because either the page source is different or there are no "results"). At that point I can either end the script or move on. For example, if each section of a site is categorised by letter, with pages per letter (say 15 pages of A's), then once the script gets to page 16 and errors out due to the failure to pull results, it can check @error and ExitLoop to continue to B, restarting the page count from 0.
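
The letter-and-page pattern described above might look something like this. It is a rough sketch only: the URL pattern, the zone strings and the file name ChunkV2.au3 are hypothetical, and a real script would probably want a delay between requests.

```autoit
#include "ChunkV2.au3" ; Assumption: the UDF from this post has been saved under this file name.

; Hypothetical zones and letters, for illustration only.
Local $aZones[1][2] = [['<h3 class="title">', '</h3>']]
Local $sLetters = 'ABC'

For $i = 1 To StringLen($sLetters)
    Local $sLetter = StringMid($sLetters, $i, 1)
    Local $iPage = 0 ; Restart the page count for every letter.

    While 1
        ; Flags and zones are both reset after each parse, so they are set again for every page.
        ChunkV2(2 + 16)
        ChunkV2($aZones)

        ; Hypothetical URL pattern; option 1 forces a reload from the remote site.
        Local $bSource = InetRead('https://example.com/games/' & $sLetter & '?page=' & $iPage, 1)
        If @error Then ExitLoop ; The download itself failed.

        Local $aResults = ChunkV2($bSource)
        If @error Then ExitLoop ; No results on this page, so continue with the next letter.

        For $sTitle In $aResults
            ConsoleWrite($sLetter & ' / page ' & $iPage & ': ' & $sTitle & @LF)
        Next
        $iPage += 1
    WEnd
Next
```

The @error check has to come directly after the ChunkV2 call, because the next function call resets @error.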

Updated to process the removal of leading and trailing spaces or tabs from each individual result, because my tests on http://uk.ign.com/games/pc?sortBy=title&sortOrder=asc&startIndex=0 were showing tons of leading and trailing spaces on the results, even though the source had all trailing and leading spaces supposedly removed from it. (StringStripWS with flag 3 trims only the very start and end of the whole source string, not each line, so padding inside the page survives into the results.)
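
As a small illustration of what the per-result clean-up does, here are the same two built-in calls the UDF uses, applied to a made-up padded string. Note that the '\s+' pattern collapses any run of whitespace (tabs and line breaks included), not just spaces.

```autoit
; A made-up result with the kind of padding described above.
Local $sRaw = @TAB & '   Some   Game    Title   ' & @TAB

; Flag 2 equivalent: trim leading and trailing whitespace (StringStripWS flag 3 = leading + trailing).
Local $sTrimmed = StringStripWS($sRaw, 3)

; Flag 16 equivalent: collapse each run of whitespace into a single space.
Local $sCollapsed = StringRegExpReplace($sTrimmed, '\s+', ' ')

ConsoleWrite('[' & $sCollapsed & ']' & @LF) ; [Some Game Title]
```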

 

Updated and added some information.

The updates in brief:
The leading/trailing space option now also trims each end result, because I was having issues with such spacing in results.
A new option removes repeated spaces from results, because again I ran into this problem.
Added some error detection; the first post has details and examples of using it.
 
