WKHtmlToX 1.0.0


About This File

Class using AutoItObject to generate PDF or image (JPG, BMP, GIF, PNG...) files from webpages or HTML files through wkhtmltopdf and wkhtmltoimage.
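The class's own methods are not shown on this page, so the following is only a minimal sketch of the underlying idea: it calls the wkhtmltopdf command-line tool directly (input first, output second) with RunWait instead of going through the class. The install path, the input URL and the output file name are all assumptions for illustration.

```autoit
; Sketch only: calls wkhtmltopdf.exe directly, not the WKHtmlToX class.
; Assumed install location; adjust to where wkhtmltopdf.exe really is.
Local $sExe = @ProgramFilesDir & "\wkhtmltopdf\bin\wkhtmltopdf.exe"
; Assumed input (a URL or a local HTML file) and output file.
Local $sInput = "https://www.autoitscript.com/"
Local $sOutput = @ScriptDir & "\page.pdf"

; Command line: wkhtmltopdf <input> <output>, each argument quoted.
Local $sCmd = '"' & $sExe & '" "' & $sInput & '" "' & $sOutput & '"'
Local $iExitCode = RunWait($sCmd, "", @SW_HIDE)

If @error Then
    MsgBox(16, "Error", "Could not start wkhtmltopdf.exe")
ElseIf $iExitCode <> 0 Then
    MsgBox(16, "Error", "wkhtmltopdf exited with code " & $iExitCode)
Else
    MsgBox(64, "Done", "PDF written to " & $sOutput)
EndIf
```

wkhtmltoimage takes its input and output in the same order, so the same pattern should apply to image output by swapping the executable and the output extension.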






  • Similar Content

    • nitekram
      _INetGetSource coming back with strange characters
      By nitekram
      This function does not appear to return the correct characters for a site, even though the page source shows the right characters. Can someone show me how to correct this, or is this a bug in the function? I worked around it in my code by simply replacing these characters, but should they even be part of the string?
      Here are the two strange character sequences:
      “
      .â€.
       
      #include <Inet.au3>
      #include <String.au3>
      $linc = "http://www.dictionary.com/browse/diablo?s=t"
      $str = _INetGetSource($linc, 1) ; has this in it ;“devil.”.
      $str = BinaryToString($str)
      ConsoleWrite($str)
      $str = _StringBetween($str, '<div class="def-set">', '</div>', 1)
      MsgBox('', 'raw data', $str[0])
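One plausible reading of the two sequences above is that they are UTF-8 encoded curly quotes whose bytes were each decoded as a separate ANSI character. The sketch below reproduces that effect for the left quote and then reads the page as raw bytes and decodes it as UTF-8. That the page is served as UTF-8 is an assumption, and the variable names are made up for illustration; the flag values 1 (ANSI) and 4 (UTF-8) are the documented ones for StringToBinary and BinaryToString.

```autoit
; U+201C (left double quotation mark), built with ChrW so that the
; encoding of the script file itself does not matter.
Local $sQuote = ChrW(0x201C)

; Encoded as UTF-8 this is three bytes: 0xE2809C.
Local $bUtf8 = StringToBinary($sQuote, 4) ; 4 = UTF-8

; Decoding those bytes as ANSI yields one character per byte on a
; single-byte code page such as Windows-1252.
Local $sWrong = BinaryToString($bUtf8, 1) ; 1 = ANSI
; Decoding them as UTF-8 restores the single original character.
Local $sRight = BinaryToString($bUtf8, 4) ; 4 = UTF-8

ConsoleWrite("Bytes: " & $bUtf8 & @CRLF)
ConsoleWrite("Characters when read as ANSI:  " & StringLen($sWrong) & @CRLF)
ConsoleWrite("Characters when read as UTF-8: " & StringLen($sRight) & @CRLF)

; Applied to the download (same URL as in the post): fetch the raw
; bytes and decode them as UTF-8 instead of ANSI.
Local $sUrl = "http://www.dictionary.com/browse/diablo?s=t"
Local $sHtml = BinaryToString(InetRead($sUrl), 4)
If @error Then ConsoleWrite("Download or conversion failed" & @CRLF)
```

If the page really is UTF-8, the quotes in $sHtml should come back as single characters, and replacing the stray sequences by hand would no longer be needed.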
    • nbg15
      Tesseract doesn't detect the easiest image *image to text*
      By nbg15
      Hello everybody,
       
      I have this picture (attached) and this script:
       
      $ImageToReadPath = @MyDocumentsDir & "\GDIPlus_Image2.jpg"
      $ResultTextPath = @MyDocumentsDir & "\Result"
      $OutPutPath = $ResultTextPath & ".txt"
      $TesseractExePath = @MyDocumentsDir & "\Tesseract.exe"
      ShellExecuteWait($TesseractExePath, '"' & $ImageToReadPath & '" "' & $ResultTextPath & '"', "", "", @SW_HIDE)
      If @error Then
          Exit MsgBox(0, "Error", @error)
      EndIf
      MsgBox(0, "Result", FileRead($OutPutPath))
      FileDelete($OutPutPath)
      But Tesseract doesn't recognize the correct word and gives me garbage back.

      This is the image >> 
      And the result was >> "samm" 

      The image was a normal JPG, generated with this code:
       
      _ScreenCapture_Capture(@MyDocumentsDir & "\GDIPlus_Image2.jpg", 712,268,853,284)
      Could anybody give me a hint on what I can do better to convert this simple image to text?
       
      Thank you very much!
       
       
      Edit: I also tried capturing the screen as a BMP with a higher resolution, but nothing changed.
       
       
      _ScreenCapture_SetBMPFormat(4)
      _ScreenCapture_Capture(@MyDocumentsDir & "\GDIPlus_Image.bmp", 712,279,853,295)
    • RyukShini
      Image quality is bad compared to paint
      By RyukShini
      #Region ;**** Directives created by AutoIt3Wrapper_GUI ****
      #AutoIt3Wrapper_Icon=car.ico
      #EndRegion ;**** Directives created by AutoIt3Wrapper_GUI ****
      #include <GDIPlus.au3>
      #include <File.au3>
      #include <Array.au3>
      #include <ColorConstants.au3>
      #include <GUIConstantsEx.au3>
      #include <WindowsConstants.au3>
      #include <ProgressConstants.au3>

      ; Declare array
      Dim $Images[1]

      ; Gets all JPG files in the current directory (@ScriptDir).
      Local $search = FileFindFirstFile("*.jpg")

      ; Check if the search was successful
      If $search = -1 Then
          MsgBox(0, "Error", "No JPG files could be found.")
          Exit
      EndIf

      ; Resize array
      While 1
          If IsArray($Images) Then
              Local $Bound = UBound($Images)
              ReDim $Images[$Bound+1]
          EndIf
          $Images[$Bound] = FileFindNextFile($search)
          If @error Then ExitLoop
      WEnd

      ; Close the search handle
      FileClose($search)

      ; Create directory "resized" if not there yet
      $nymappe = InputBox("Mappe / Bil Navn", "Mappe / Bil Navn")
      If NOT FileExists(@ScriptDir & "\" & $nymappe & "\") Then
          DirCreate(@ScriptDir & "\" & $nymappe & "\")
      EndIf

      ; Loop for JPGs - gets dimension of JPG and calls resize function to resize to 15% width and 15% height
      For $i = 1 to Ubound($Images)-1
          If $Images[$i] <> "" AND FileExists(@ScriptDir & "\" & $Images[$i]) Then
              Local $ImagePath = @ScriptDir & "\" & $Images[$i]
              _GDIPlus_Startup()
              Local $hImage = _GDIPlus_ImageLoadFromFile($ImagePath)
              Local $ImageWidth = _GDIPlus_ImageGetWidth($hImage)
              Local $ImageHeight = _GDIPlus_ImageGetHeight($hImage)
              _GDIPlus_ImageDispose($hImage)
              _GDIPlus_Shutdown()
              ;MsgBox(0,"DEBUG", $ImageWidth & " x " & $ImageHeight)
              Local $NewImageWidth = ($ImageWidth / 100) * 15
              Local $NewImageHeight = ($ImageHeight / 100) * 15
              ;MsgBox(0,"DEBUG: " & $i,$Images[$i])
              _ImageResize(@ScriptDir & "\" & $Images[$i], @ScriptDir & "\" & $nymappe & "\" & $Images[$i], $NewImageWidth, $NewImageHeight)
          EndIf
      Next

      ; Resize function
      Func _ImageResize($sInImage, $sOutImage, $iW, $iH)
          Local $hWnd, $hDC, $hBMP, $hImage1, $hImage2, $hGraphic, $CLSID, $i = 0
          ;OutFile path, to use later on.
          Local $sOP = StringLeft($sOutImage, StringInStr($sOutImage, "\", 0, -1))
          ;OutFile name, to use later on.
          Local $sOF = StringMid($sOutImage, StringInStr($sOutImage, "\", 0, -1) + 1)
          ;OutFile extension , to use for the encoder later on.
          Local $Ext = StringUpper(StringMid($sOutImage, StringInStr($sOutImage, ".", 0, -1) + 1))
          ; Win api to create blank bitmap at the width and height to put your resized image on.
          $hWnd = _WinAPI_GetDesktopWindow()
          $hDC = _WinAPI_GetDC($hWnd)
          $hBMP = _WinAPI_CreateCompatibleBitmap($hDC, $iW, $iH)
          _WinAPI_ReleaseDC($hWnd, $hDC)
          ;Start GDIPlus
          _GDIPlus_Startup()
          ;Get the handle of blank bitmap you created above as an image
          $hImage1 = _GDIPlus_BitmapCreateFromHBITMAP ($hBMP)
          ;Load the image you want to resize.
          $hImage2 = _GDIPlus_ImageLoadFromFile($sInImage)
          ;Get the graphic context of the blank bitmap
          $hGraphic = _GDIPlus_ImageGetGraphicsContext ($hImage1)
          ;Draw the loaded image onto the blank bitmap at the size you want
          _GDIPLus_GraphicsDrawImageRect($hGraphic, $hImage2, 0, 0, $iW, $iH)
          ;Get the encoder to save the resized image in the format you want.
          $CLSID = _GDIPlus_EncodersGetCLSID($Ext)
          ;Generate a number for out file that doesn't already exist, so you don't overwrite an existing image.
          Do
              $i += 1
          Until (Not FileExists($sOP & $i & "_" & $sOF))
          ;Prefix the number to the beginning of the output filename
          $sOutImage = $sOP & $i & "_" & $sOF
          ;Save the new resized image.
          _GDIPlus_ImageSaveToFileEx($hImage1, $sOutImage, $CLSID)
          ;Clean up and shutdown GDIPlus.
          _GDIPlus_ImageDispose($hImage1)
          _GDIPlus_ImageDispose($hImage2)
          _GDIPlus_GraphicsDispose ($hGraphic)
          _WinAPI_DeleteObject($hBMP)
          _GDIPlus_Shutdown()
      EndFunc
      The quality gets quite bad when resizing with GDIPlus, compared to using Paint / Photoshop.
      Any idea how to make the quality better?
      Thanks in advance
    • UEZ
      _GDIPlus_BitmapApplyFilter UDF beta
      By UEZ
      A collection of image filter effects usable with AutoIt!
       
      IMPORTANT: You are not allowed to sell this code or just parts of it in a commercial project or modify it and distribute it with a different name!
      Distributing copies of this UDF incl. _GDIPlus_BitmapApplyFilter.dll in compiled format (exe) must be free of any fee!
       
      More information can be found in the forum thread!
       
    • Chimp
      Read data from html Tables from raw HTML source
      By Chimp
      This extracts data from HTML tables into an array.
      It takes raw HTML source as input and does not rely on any browser.
      You can get the HTML source with functions such as InetGet(), InetRead(), _INetGetSource() or _IEDocReadHTML(), or load an HTML file from disk.
      It also accounts for the position of the data in the table when rowspan and colspan are used, trying to keep the same layout in the generated array.
      Optionally, it can fill all the array cells that correspond to a "span" area with the value of the first cell of that area.
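As a rough illustration of the description above, the following sketch calls _HtmlTableGetWriteToArray() on a small hand-written table. The table content and variable names are made up for the example, and it assumes the UDF has been saved next to the script as _HtmlTable2Array.au3 (the name suggested in the UDF's first comment line).

```autoit
#include <Array.au3>
#include "_HtmlTable2Array.au3" ; assumed location: same folder as this script

; Hypothetical test table with one colspan cell in the header row.
Local $sHtml = '<table>' & _
        '<tr><th colspan="2">Name</th><th>Age</th></tr>' & _
        '<tr><td>Ada</td><td>Lovelace</td><td>36</td></tr>' & _
        '</table>'

; 1     = parse the first table found in $sHtml
; True  = repeat the first cell's value across the whole span area
; 8+16  = strip html tags (8) and convert simple entities (16)
Local $aData = _HtmlTableGetWriteToArray($sHtml, 1, True, 8 + 16)
If @error Then
    ConsoleWrite("Parsing failed, @error = " & @error & @CRLF)
Else
    _ArrayDisplay($aData, "First table")
EndIf
```

The filter values are meant to be summed, as the demo posted with the UDF does with its checkboxes, so 8 + 16 requests both the tag filter and the entity conversion.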
      ; save this as _HtmlTable2Array.au3
      #include-once
      #include <array.au3>
      ;
      ; #FUNCTION# ====================================================================================================================
      ; Name ..........: _HtmlTableGetList
      ; Description ...: Finds and enumerates all the html tables contained in an html listing (even if nested).
      ;                  if the optional parameter $i_index is passed, then only that table is returned
      ; Syntax ........: _HtmlTableGetList($sHtml[, $i_index = -1])
      ; Parameters ....: $sHtml   - A string value containing an html page listing
      ;                  $i_index - [optional] An integer value indicating the number of the table to be returned (1 based)
      ;                             with the default value of -1 an array with all found tables is returned
      ; Return values .: Success; Returns an 1D 1 based array containing all or single html table found in the html.
      ;                  element [0] (and @extended as well) contains the number of tables found (or 0 if no tables are returned)
      ;                  if an error occurs then an empty string is returned and the following @error code is set
      ;                  @error: 1 - no tables are present in the passed HTML
      ;                          2 - error while parsing tables, (opening and closing tags are not balanced)
      ;                          3 - error while parsing tables, (open/close mismatch error)
      ;                          4 - invalid table index request (requested table nr. is out of boundaries)
      ; ===============================================================================================================================
      Func _HtmlTableGetList($sHtml, $i_index = -1)
          Local $aTables = _ParseTags($sHtml, "<table", "</table>")
          If @error Then
              Return SetError(@error, 0, "")
          ElseIf $i_index = -1 Then
              Return SetError(0, $aTables[0], $aTables)
          Else
              If $i_index > 0 And $i_index <= $aTables[0] Then
                  Local $aTemp[2] = [1, $aTables[$i_index]]
                  Return SetError(0, 1, $aTemp)
              Else
                  Return SetError(4, 0, "") ; bad index
              EndIf
          EndIf
      EndFunc   ;==>_HtmlTableGetList

      ; #FUNCTION# ====================================================================================================================
      ; Name ..........: _HtmlTableWriteToArray
      ; Description ...: It writes values from an html table to a 2D array. It tries to take care of the rowspan and colspan formats
      ; Syntax ........: _HtmlTableWriteToArray($sHtmlTable[, $bFillSpan = False[, $iFilter = 0]])
      ; Parameters ....: $sHtmlTable - A string value containing the html code of the table to be parsed
      ;                  $bFillSpan  - [optional] Default is False. If span areas have to be filled by repeating the data
      ;                                contained in the first cell of the span area
      ;                  $iFilter    - [optional] Default is 0 (no filters) data extracted from cells is returned unchanged.
      ;                                - 0 = no filter
      ;                                - 1 = removes non ascii characters
      ;                                - 2 = removes all double whitespaces
      ;                                - 4 = removes all double linefeeds
      ;                                - 8 = removes all html-tags
      ;                                - 16 = simple html-tag / entities convertor
      ; Return values .: Success: 2D array containing data from the html table
      ;                  Failure: An empty string and sets @error as following:
      ;                  @error: 1 - no table content is present in the passed HTML
      ;                          2 - error while parsing rows and/or columns, (opening and closing tags are not balanced)
      ;                          3 - error while parsing rows and/or columns, (open/close mismatch error)
      ; ===============================================================================================================================
      Func _HtmlTableWriteToArray($sHtmlTable, $bFillSpan = False, $iFilter = 0)
          $sHtmlTable = StringReplace(StringReplace($sHtmlTable, "<th", "<td"), "</th>", "</td>") ; th becomes td
          ; rows of the wanted table
          Local $iError, $aTempEmptyRow[2] = [1, ""]
          Local $aRows = _ParseTags($sHtmlTable, "<tr", "</tr>") ; $aRows[0] = nr. of rows
          If @error Then Return SetError(@error, 0, "")
          Local $aCols[$aRows[0] + 1], $aTemp
          For $i = 1 To $aRows[0]
              $aTemp = _ParseTags($aRows[$i], "<td", "</td>")
              $iError = @error
              If $iError = 1 Then ; check if it's an empty row
                  $aTemp = $aTempEmptyRow ; Empty Row
              Else
                  If $iError Then Return SetError($iError, 0, "")
              EndIf
              If $aCols[0] < $aTemp[0] Then $aCols[0] = $aTemp[0] ; $aTemp[0] = max nr. of columns in table
              $aCols[$i] = $aTemp
          Next
          Local $aResult[$aRows[0]][$aCols[0]], $iStart, $iEnd, $aRowspan, $aColspan, $iSpanY, $iSpanX, $iSpanRow, $iSpanCol, $iMarkerCode, $sCellContent
          Local $aMirror = $aResult
          For $i = 1 To $aRows[0] ; scan all rows in this table
              $aTemp = $aCols[$i] ; <td ..> xx </td> .....
              For $ii = 1 To $aTemp[0] ; scan all cells in this row
                  $iSpanY = 0
                  $iSpanX = 0
                  $iY = $i - 1 ; zero base index for vertical ref
                  $iX = $ii - 1 ; zero based indexes for horizontal ref
                  ; following RegExp kindly provided by SadBunny in this post:
                  ; http://www.autoitscript.com/forum/topic/167174-how-to-get-a-number-located-after-a-name-from-within-a-string/?p=1222781
                  $aRowspan = StringRegExp($aTemp[$ii], "(?i)rowspan\s*=\s*[""']?\s*(\d+)", 1) ; check presence of rowspan
                  If IsArray($aRowspan) Then
                      $iSpanY = $aRowspan[0] - 1
                      If $iSpanY + $iY > $aRows[0] Then
                          $iSpanY -= $iSpanY + $iY - $aRows[0] + 1
                      EndIf
                  EndIf
                  ;
                  $aColspan = StringRegExp($aTemp[$ii], "(?i)colspan\s*=\s*[""']?\s*(\d+)", 1) ; check presence of colspan
                  If IsArray($aColspan) Then $iSpanX = $aColspan[0] - 1
                  ;
                  $iMarkerCode += 1 ; code to mark this span area or single cell
                  If $iSpanY Or $iSpanX Then
                      $iX1 = $iX
                      For $iSpY = 0 To $iSpanY
                          For $iSpX = 0 To $iSpanX
                              $iSpanRow = $iY + $iSpY
                              If $iSpanRow > UBound($aMirror, 1) - 1 Then
                                  $iSpanRow = UBound($aMirror, 1) - 1
                              EndIf
                              $iSpanCol = $iX1 + $iSpX
                              If $iSpanCol > UBound($aMirror, 2) - 1 Then
                                  ReDim $aResult[$aRows[0]][UBound($aResult, 2) + 1]
                                  ReDim $aMirror[$aRows[0]][UBound($aMirror, 2) + 1]
                              EndIf
                              ;
                              While $aMirror[$iSpanRow][$iX1 + $iSpX] ; search first free column
                                  $iX1 += 1 ; $iSpanCol += 1
                                  If $iX1 + $iSpX > UBound($aMirror, 2) - 1 Then
                                      ReDim $aResult[$aRows[0]][UBound($aResult, 2) + 1]
                                      ReDim $aMirror[$aRows[0]][UBound($aMirror, 2) + 1]
                                  EndIf
                              WEnd
                          Next
                      Next
                  EndIf
                  ;
                  $iX1 = $iX
                  ; following RegExp kindly provided by mikell in this post:
                  ; http://www.autoitscript.com/forum/topic/167309-how-to-remove-from-a-string-all-between-and-pairs/?p=1224207
                  $sCellContent = StringRegExpReplace($aTemp[$ii], '<[^>]+>', "")
                  If $iFilter Then $sCellContent = _HTML_Filter($sCellContent, $iFilter)
                  For $iSpX = 0 To $iSpanX
                      For $iSpY = 0 To $iSpanY
                          $iSpanRow = $iY + $iSpY
                          If $iSpanRow > UBound($aMirror, 1) - 1 Then
                              $iSpanRow = UBound($aMirror, 1) - 1
                          EndIf
                          While $aMirror[$iSpanRow][$iX1 + $iSpX]
                              $iX1 += 1
                              If $iX1 + $iSpX > UBound($aMirror, 2) - 1 Then
                                  ReDim $aResult[$aRows[0]][$iX1 + $iSpX + 1]
                                  ReDim $aMirror[$aRows[0]][$iX1 + $iSpX + 1]
                              EndIf
                          WEnd
                          $aMirror[$iSpanRow][$iX1 + $iSpX] = $iMarkerCode ; 1
                          If $bFillSpan Then $aResult[$iSpanRow][$iX1 + $iSpX] = $sCellContent
                      Next
                      $aResult[$iY][$iX1] = $sCellContent
                  Next
              Next
          Next
          ; _ArrayDisplay($aMirror, "Debug")
          Return SetError(0, $aResult[0][0], $aResult)
      EndFunc   ;==>_HtmlTableWriteToArray
      ;
      ; #FUNCTION# ====================================================================================================================
      ; Name ..........: _HtmlTableGetWriteToArray
      ; Description ...: extract the html code of the required table from the html listing and copy the data of the table to a 2D array
      ; Syntax ........: _HtmlTableGetWriteToArray($sHtml[, $iWantedTable = 1[, $bFillSpan = False[, $iFilter = 0]]])
      ; Parameters ....: $sHtml        - A string value containing the html listing
      ;                  $iWantedTable - [optional] An integer value. The nr. of the table to be parsed (default is first table)
      ;                  $bFillSpan    - [optional] Default is False. If all span areas have to be filled by repeating the data
      ;                                  contained in the first cell of the span area
      ;                  $iFilter      - [optional] Default is 0 (no filters) data extracted from cells is returned unchanged.
      ;                                  - 0 = no filter
      ;                                  - 1 = removes non ascii characters
      ;                                  - 2 = removes all double whitespaces
      ;                                  - 4 = removes all double linefeeds
      ;                                  - 8 = removes all html-tags
      ;                                  - 16 = simple html-tag / entities convertor
      ; Return values .: Success: 2D array containing data from the wanted html table.
      ;                  Failure: An empty string and sets @error as following:
      ;                  @error: 1 - no tables are present in the passed HTML
      ;                          2 - error while parsing tables, (opening and closing tags are not balanced)
      ;                          3 - error while parsing tables, (open/close mismatch error)
      ;                          4 - invalid table index request (requested table nr. is out of boundaries)
      ; ===============================================================================================================================
      Func _HtmlTableGetWriteToArray($sHtml, $iWantedTable = 1, $bFillSpan = False, $iFilter = 0)
          Local $aSingleTable = _HtmlTableGetList($sHtml, $iWantedTable)
          If @error Then Return SetError(@error, 0, "")
          Local $aTableData = _HtmlTableWriteToArray($aSingleTable[1], $bFillSpan, $iFilter)
          If @error Then Return SetError(@error, 0, "")
          Return SetError(0, $aTableData[0][0], $aTableData)
      EndFunc   ;==>_HtmlTableGetWriteToArray

      ; #FUNCTION# ====================================================================================================================
      ; Name ..........: _ParseTags
      ; Description ...: searches and extract all portions of html code within opening and closing tags inclusive.
      ;                  Returns an array containing a collection of <tag ...... </tag> lines. one in each element (even if are nested)
      ; Syntax ........: _ParseTags($sHtml, $sOpening, $sClosing)
      ; Parameters ....: $sHtml    - A string value containing the html listing
      ;                  $sOpening - A string value indicating the opening tag
      ;                  $sClosing - A string value indicating the closing tag
      ; Return values .: Success: an 1D 1 based array containing all the portions of html code representing the element
      ;                  element [0] of the array (and @extended as well) contains the counter of found elements
      ;                  Failure: An empty string and sets @error as following:
      ;                  @error: 1 - no tables are present in the passed HTML
      ;                          2 - error while parsing tables, (opening and closing tags are not balanced)
      ;                          3 - error while parsing tables, (open/close mismatch error)
      ;                          4 - invalid table index request (requested table nr. is out of boundaries)
      ; ===============================================================================================================================
      Func _ParseTags($sHtml, $sOpening, $sClosing) ; example: $sOpening = '<table', $sClosing = '</table>'
          ; it finds how many of such tags are on the HTML page
          StringReplace($sHtml, $sOpening, $sOpening) ; in @extended nr. of occurrences
          Local $iNrOfThisTag = @extended
          ; I assume that opening <tag and closing </tag> tags are balanced (as should be)
          ; (so NO check is made to see if they are actually balanced)
          If $iNrOfThisTag Then ; if there is at least one of this tag
              ; $aThisTagsPositions array will contain the positions of the
              ; starting <tag and ending </tag> tags within the HTML
              Local $aThisTagsPositions[$iNrOfThisTag * 2 + 1][3] ; 1 based (make room for all open and close tags)
              ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
              For $i = 1 To $iNrOfThisTag
                  $aThisTagsPositions[$i][0] = StringInStr($sHtml, $sOpening, 0, $i) ; start position of $i occurrence of <tag opening tag
                  $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
                  $aThisTagsPositions[$i][2] = $i ; nr of this tag
                  $aThisTagsPositions[$iNrOfThisTag + $i][0] = StringInStr($sHtml, $sClosing, 0, $i) + StringLen($sClosing) - 1 ; end position of $i^ occurrence of </tag> closing tag
                  $aThisTagsPositions[$iNrOfThisTag + $i][1] = $sClosing ; it marks which kind of tag is this
              Next
              _ArraySort($aThisTagsPositions, 0, 1) ; now all opening and closing tags are in the same sequence as they appear in the HTML
              Local $aStack[UBound($aThisTagsPositions)][2]
              Local $aTags[Ceiling(UBound($aThisTagsPositions) / 2)] ; will contain the collection of <tag ..... </tag> from the html
              For $i = 1 To UBound($aThisTagsPositions) - 1
                  If $aThisTagsPositions[$i][1] = $sOpening Then ; opening <tag
                      $aStack[0][0] += 1 ; nr of tags in html
                      $aStack[$aStack[0][0]][0] = $sOpening
                      $aStack[$aStack[0][0]][1] = $i
                  ElseIf $aThisTagsPositions[$i][1] = $sClosing Then ; a closing </tag> was found
                      If Not $aStack[0][0] Or Not ($aStack[$aStack[0][0]][0] = $sOpening And $aThisTagsPositions[$i][1] = $sClosing) Then
                          Return SetError(3, 0, "") ; Open/Close mismatch error
                      Else ; pair detected (the reciprocal tag)
                          ; now get coordinates of the 2 tags
                          ; 1) extract this tag <tag ..... </tag> from the html to the array
                          $aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]] = StringMid($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0], 1 + $aThisTagsPositions[$i][0] - $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0])
                          ; 2) remove that tag <tag ..... </tag> from the html
                          $sHtml = StringLeft($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0] - 1) & StringMid($sHtml, $aThisTagsPositions[$i][0] + 1)
                          ; 3) adjust the references to the new positions of remaining tags
                          For $ii = $i To UBound($aThisTagsPositions) - 1
                              $aThisTagsPositions[$ii][0] -= StringLen($aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]])
                          Next
                          $aStack[0][0] -= 1 ; nr of tags still in html
                      EndIf
                  EndIf
              Next
              If Not $aStack[0][0] Then ; all tags were parsed correctly
                  $aTags[0] = $iNrOfThisTag
                  Return SetError(0, $iNrOfThisTag, $aTags) ; OK
              Else
                  Return SetError(2, 0, "") ; opening and closing tags are not balanced
              EndIf
          Else
              Return SetError(1, 0, "") ; there are no such tags on this HTML page
          EndIf
      EndFunc   ;==>_ParseTags

      ; #=============================================================================
      ; Name ..........: _HTML_Filter
      ; Description ...: Filter for strings
      ; AutoIt Version : V3.3.0.0
      ; Syntax ........: _HTML_Filter(ByRef $sString[, $iMode = 0])
      ; Parameter(s): .: $sString - String to filter
      ;                  $iMode   - Optional: (Default = 0) : removes nothing
      ;                             - 0 = no filter
      ;                             - 1 = removes non ascii characters
      ;                             - 2 = removes all double whitespaces
      ;                             - 4 = removes all double linefeeds
      ;                             - 8 = removes all html-tags
      ;                             - 16 = simple html-tag / entities convertor
      ; Return Value ..: Success - Filtered String
      ;                  Failure - Input String
      ; Author(s) .....: Thorsten Willert, Stephen Podhajecki {gehossafats at netmdc. com} _ConvertEntities
      ; Date ..........: Wed Jan 27 20:49:59 CET 2010
      ; modified ......: by Chimp Removed a double "&nbsp;" entities declaration,
      ;                  replace it with char(160) instead of chr(32),
      ;                  declaration of the $aEntities array as Static instead of just Local
      ; ==============================================================================
      Func _HTML_Filter(ByRef $sString, $iMode = 0)
          If $iMode = 0 Then Return $sString
          ;16 simple HTML tag / entities converter
          If $iMode >= 16 And $iMode < 32 Then
              Static Local $aEntities[95][2] = [["&quot;", 34],["&amp;", 38],["&lt;", 60],["&gt;", 62],["&nbsp;", 160] _
                      ,["&iexcl;", 161],["&cent;", 162],["&pound;", 163],["&curren;", 164],["&yen;", 165],["&brvbar;", 166] _
                      ,["&sect;", 167],["&uml;", 168],["&copy;", 169],["&ordf;", 170],["&not;", 172],["&shy;", 173] _
                      ,["&reg;", 174],["&macr;", 175],["&deg;", 176],["&plusmn;", 177],["&sup2;", 178],["&sup3;", 179] _
                      ,["&acute;", 180],["&micro;", 181],["&para;", 182],["&middot;", 183],["&cedil;", 184],["&sup1;", 185] _
                      ,["&ordm;", 186],["&raquo;", 187],["&frac14;", 188],["&frac12;", 189],["&frac34;", 190],["&iquest;", 191] _
                      ,["&Agrave;", 192],["&Aacute;", 193],["&Atilde;", 195],["&Auml;", 196],["&Aring;", 197],["&AElig;", 198] _
                      ,["&Ccedil;", 199],["&Egrave;", 200],["&Eacute;", 201],["&Ecirc;", 202],["&Igrave;", 204],["&Iacute;", 205] _
                      ,["&Icirc;", 206],["&Iuml;", 207],["&ETH;", 208],["&Ntilde;", 209],["&Ograve;", 210],["&Oacute;", 211] _
                      ,["&Ocirc;", 212],["&Otilde;", 213],["&Ouml;", 214],["&times;", 215],["&Oslash;", 216],["&Ugrave;", 217] _
                      ,["&Uacute;", 218],["&Ucirc;", 219],["&Uuml;", 220],["&Yacute;", 221],["&THORN;", 222],["&szlig;", 223] _
                      ,["&agrave;", 224],["&aacute;", 225],["&acirc;", 226],["&atilde;", 227],["&auml;", 228],["&aring;", 229] _
                      ,["&aelig;", 230],["&ccedil;", 231],["&egrave;", 232],["&eacute;", 233],["&ecirc;", 234],["&euml;", 235] _
                      ,["&igrave;", 236],["&iacute;", 237],["&icirc;", 238],["&iuml;", 239],["&eth;", 240],["&ntilde;", 241] _
                      ,["&ograve;", 242],["&oacute;", 243],["&ocirc;", 244],["&otilde;", 245],["&ouml;", 246],["&divide;", 247] _
                      ,["&oslash;", 248],["&ugrave;", 249],["&uacute;", 250],["&ucirc;", 251],["&uuml;", 252],["&thorn;", 254]]
              $sString = StringRegExpReplace($sString, '(?i)<p.*?>', @CRLF & @CRLF)
              $sString = StringRegExpReplace($sString, '(?i)<br>', @CRLF)
              Local $iE = UBound($aEntities) - 1
              For $x = 0 To $iE
                  $sString = StringReplace($sString, $aEntities[$x][0], Chr($aEntities[$x][1]), 0, 2)
              Next
              For $x = 32 To 255
                  $sString = StringReplace($sString, "&#" & $x & ";", Chr($x))
              Next
              $iMode -= 16
          EndIf
          ;8 Tag filter
          If $iMode >= 8 And $iMode < 16 Then
              ;$sString = StringRegExpReplace($sString, '<script.*?>.*?</script>', "")
              $sString = StringRegExpReplace($sString, "<[^>]*>", "")
              $iMode -= 8
          EndIf
          ; 4 remove all double cr, lf
          If $iMode >= 4 And $iMode < 8 Then
              $sString = StringRegExpReplace($sString, "([ \t]*[\n\r]+[ \t]*)", @CRLF)
              $sString = StringRegExpReplace($sString, "[\n\r]+", @CRLF)
              $iMode -= 4
          EndIf
          ; 2 remove all double whitespaces
          If $iMode = 2 Or $iMode = 3 Then
              $sString = StringRegExpReplace($sString, "[[:blank:]]+", " ")
              $sString = StringRegExpReplace($sString, "\n[[:blank:]]+", @CRLF)
              $sString = StringRegExpReplace($sString, "[[:blank:]]+\n", "")
              $iMode -= 2
          EndIf
          ; 1 remove all non ASCII (remove all chars with ascii code > 127)
          If $iMode = 1 Then
              $sString = StringRegExpReplace($sString, "[^\x00-\x7F]", " ")
          EndIf
          Return $sString
      EndFunc   ;==>_HTML_Filter
      This simple demo lets you test those functions, showing what they can extract from the HTML tables in a web page of your choice or in an HTML file loaded from disk.
      ; #include <_HtmlTable2Array.au3> ; <--- udf already included (hard coded) at bottom of this demo #include <GUIConstantsEx.au3> #include <EditConstants.au3> #include <WindowsConstants.au3> #include <File.au3> ; needed for _FileWriteFromArray() #include <array.au3> #include <IE.au3> Local $oIE1 = _IECreateEmbedded(), $oIE2 = _IECreateEmbedded(), $iFilter = 0 Local $sHtml_File, $iIndex, $aTable, $aMyArray, $sFilePath GUICreate("Html tables to array demo", 1000, 450, (@DesktopWidth - 1000) / 2, (@DesktopHeight - 450) / 2 _ , $WS_OVERLAPPEDWINDOW + $WS_CLIPSIBLINGS + $WS_CLIPCHILDREN) GUICtrlCreateObj($oIE1, 010, 10, 480, 360) ; left browser GUICtrlCreateTab(500, 10, 480, 360) GUICtrlCreateTabItem("view table") GUICtrlCreateObj($oIE2, 502, 33, 474, 335) ; right browser GUICtrlCreateTabItem("view html") Local $idLabel_HtmlTable = GUICtrlCreateInput("", 502, 33, 474, 335, $ES_MULTILINE + $ES_AUTOVSCROLL) GUICtrlSetFont(-1, 10, 0, 0, "Courier new") GUICtrlCreateTabItem("") Local $idInputUrl = GUICtrlCreateInput("", 10, 380, 440, 20) Local $idButton_Go = GUICtrlCreateButton("Go", 455, 380, 25, 20) Local $idButton_Load = GUICtrlCreateButton("Load html from disk", 10, 410, 480, 30) Local $idButton_Prev = GUICtrlCreateButton("Prev <-", 510, 375, 50, 30) Local $idLabel_NunTable = GUICtrlCreateLabel("00 / 00", 570, 375, 40, 30) GUICtrlSetFont(-1, 9, 700) Local $idButton_Next = GUICtrlCreateButton("Next ->", 620, 375, 50, 30) GUICtrlCreateGroup("Fill Span", 680, 370, 80, 40) Local $iFillSpan = GUICtrlCreateCheckbox("", 715, 388, 15, 15) GUICtrlCreateGroup("", -99, -99, 1, 1) ;close group Local $idButton_Array0 = GUICtrlCreateButton("Preview array", 770, 375, 100, 30) Local $idButton_Array1 = GUICtrlCreateButton("Write array to file", 880, 375, 100, 30) ; options for filtering GUICtrlCreateGroup("Filters", 510, 410, 470, 35) Local $iFilter01 = GUICtrlCreateCheckbox("non ascii", 520, 425, 85, 15) Local $iFilter02 = GUICtrlCreateCheckbox("double spaces", 610, 425, 85, 15) 
Local $iFilter04 = GUICtrlCreateCheckbox("double @LF", 700, 425, 85, 15) Local $iFilter08 = GUICtrlCreateCheckbox("html-tags", 790, 425, 85, 15) Local $iFilter16 = GUICtrlCreateCheckbox("tags to entities", 880, 425, 85, 15) GUICtrlCreateGroup("", -99, -99, 1, 1) ;close group GUISetState(@SW_SHOW) ;Show GUI ; _IEDocWriteHTML($oIE2, "<HTML></HTML>") GUICtrlSetData($idInputUrl, "http://www.danshort.com/HTMLentities/") ; GUICtrlSetData($idInputUrl, "http://www.mojotoad.com/sisk/projects/HTML-TableExtract/tables.html") ; example page ControlClick("", "", $idButton_Go) ; _IEAction($oIE1, "stop") Do; Waiting for user to close the window $iMsg = GUIGetMsg() Select Case $iMsg = $idButton_Go _IENavigate($oIE1, GUICtrlRead($idInputUrl)) ; _IEAction($oIE1, "stop") $aTables = _HtmlTableGetList(_IEBodyReadHTML($oIE1)) If Not @error Then ; _ArrayDisplay($aTables, "Tables contained in this html") $iIndex = 1 _IEBodyWriteHTML($oIE2, "<html>" & $aTables[$iIndex] & "</html>") ControlClick("", "", $idButton_Prev) _IEAction($oIE2, "stop") Else MsgBox(0, 0, "@error " & @error) EndIf Case $iMsg = $idButton_Load ConsoleWrite("$idButton_Load" & @CRLF) $sHtml_File = FileOpenDialog("Choose an html file", @ScriptDir & "\", "html page (*.htm;*.html)") If Not @error Then GUICtrlSetData($idInputUrl, $sHtml_File) ControlClick("", "", $idButton_Go) EndIf Case $iMsg = $idButton_Next If IsArray($aTables) Then $iIndex += $iIndex < $aTables[0] GUICtrlSetData($idLabel_NunTable, "Table" & @CRLF & $iIndex & " / " & $aTables[0]) GUICtrlSetData($idLabel_HtmlTable, $aTables[$iIndex]) _IEBodyWriteHTML($oIE2, "<html>" & $aTables[$iIndex] & "</html>") _IEAction($oIE2, "stop") EndIf Case $iMsg = $idButton_Prev If IsArray($aTables) Then $iIndex -= $iIndex > 1 GUICtrlSetData($idLabel_NunTable, "Table" & @CRLF & $iIndex & " / " & $aTables[0]) GUICtrlSetData($idLabel_HtmlTable, $aTables[$iIndex]) _IEBodyWriteHTML($oIE2, "<html>" & $aTables[$iIndex] & "</html>") _IEAction($oIE2, "stop") EndIf Case $iMsg = 
$idButton_Array0 ; Preview Array
            If IsArray($aTables) Then
                $iFilter = 1 * _IsChecked($iFilter01) + 2 * _IsChecked($iFilter02) + 4 * _IsChecked($iFilter04) + 8 * _IsChecked($iFilter08) + 16 * _IsChecked($iFilter16)
                $aMyArray = _HtmlTableWriteToArray($aTables[$iIndex], _IsChecked($iFillSpan), $iFilter)
                If Not @error Then _ArrayDisplay($aMyArray)
            EndIf
        Case $iMsg = $idButton_Array1 ; Saves the array in a csv file of your choice
            If IsArray($aTables) Then
                $iFilter = 1 * _IsChecked($iFilter01) + 2 * _IsChecked($iFilter02) + 4 * _IsChecked($iFilter04) + 8 * _IsChecked($iFilter08) + 16 * _IsChecked($iFilter16)
                $aMyArray = _HtmlTableWriteToArray($aTables[$iIndex], _IsChecked($iFillSpan), $iFilter)
                If Not @error Then
                    $sFilePath = FileSaveDialog("Choose a file to save to", @ScriptDir, "(*.csv)")
                    If $sFilePath <> "" Then
                        If Not _FileWriteFromArray($sFilePath, $aMyArray, 0, Default, ",") Then
                            MsgBox(0, "Error on file write", "Error code is " & @error & @CRLF & @CRLF & "@error meaning:" & @CRLF & _
                                    "1 - Error opening specified file" & @CRLF & _
                                    "2 - $aArray is not an array" & @CRLF & _
                                    "3 - Error writing to file" & @CRLF & _
                                    "4 - $aArray is not a 1D or 2D array" & @CRLF & _
                                    "5 - Start index is greater than the $iUbound parameter")
                        EndIf
                    EndIf
                EndIf
            EndIf
    EndSelect
Until $iMsg = $GUI_EVENT_CLOSE
GUIDelete()

; returns True if the checkbox is checked
Func _IsChecked($idControlID) ; $GUI_CHECKED = 1
    Return GUICtrlRead($idControlID) = $GUI_CHECKED
EndFunc   ;==>_IsChecked

; ------------------------------------------------------------------------
; Following code should be included by the #include <_HtmlTable2Array.au3>
; hard coded here for easy load and run to try the example
; ------------------------------------------------------------------------
#include-once
#include <array.au3>
;
; #FUNCTION# ====================================================================================================================
; Name ..........: _HtmlTableGetList
; Description ...: Finds and enumerates all the html tables contained in an html listing (even if nested).
;                  If the optional parameter $i_index is passed, then only that table is returned
; Syntax ........: _HtmlTableGetList($sHtml[, $i_index = -1])
; Parameters ....: $sHtml   - A string value containing an html page listing
;                  $i_index - [optional] An integer value indicating the number of the table to be returned (1 based)
;                             with the default value of -1 an array with all found tables is returned
; Return values .: Success: Returns a 1D 1 based array containing all the html tables found in the html, or the single requested one.
;                  element [0] (and @extended as well) contains the number of tables found (or 0 if no tables are returned)
;                  if an error occurs then an empty string is returned and one of the following @error codes is set
;                  @error: 1 - no tables are present in the passed HTML
;                          2 - error while parsing tables, (opening and closing tags are not balanced)
;                          3 - error while parsing tables, (open/close mismatch error)
;                          4 - invalid table index request (requested table nr. is out of boundaries)
; ===============================================================================================================================
Func _HtmlTableGetList($sHtml, $i_index = -1)
    Local $aTables = _ParseTags($sHtml, "<table", "</table>")
    If @error Then
        Return SetError(@error, 0, "")
    ElseIf $i_index = -1 Then
        Return SetError(0, $aTables[0], $aTables)
    Else
        If $i_index > 0 And $i_index <= $aTables[0] Then
            Local $aTemp[2] = [1, $aTables[$i_index]]
            Return SetError(0, 1, $aTemp)
        Else
            Return SetError(4, 0, "") ; bad index
        EndIf
    EndIf
EndFunc   ;==>_HtmlTableGetList

; #FUNCTION# ====================================================================================================================
; Name ..........: _HtmlTableWriteToArray
; Description ...: It writes values from an html table to a 2D array. It tries to take care of the rowspan and colspan formats
; Syntax ........: _HtmlTableWriteToArray($sHtmlTable[, $bFillSpan = False[, $iFilter = 0]])
; Parameters ....: $sHtmlTable - A string value containing the html code of the table to be parsed
;                  $bFillSpan  - [optional] Default is False. If span areas have to be filled by repeating the data
;                                contained in the first cell of the span area
;                  $iFilter    - [optional] Default is 0 (no filters) data extracted from cells is returned unchanged.
;                                -  0 = no filter
;                                -  1 = removes non ascii characters
;                                -  2 = removes all double whitespaces
;                                -  4 = removes all double linefeeds
;                                -  8 = removes all html-tags
;                                - 16 = simple html-tag / entities convertor
; Return values .: Success: 2D array containing data from the html table
;                  Failure: An empty string and sets @error as follows:
;                  @error: 1 - no table content is present in the passed HTML
;                          2 - error while parsing rows and/or columns, (opening and closing tags are not balanced)
;                          3 - error while parsing rows and/or columns, (open/close mismatch error)
; ===============================================================================================================================
Func _HtmlTableWriteToArray($sHtmlTable, $bFillSpan = False, $iFilter = 0)
    $sHtmlTable = StringReplace(StringReplace($sHtmlTable, "<th", "<td"), "</th>", "</td>") ; th becomes td
    ; rows of the wanted table
    Local $iError, $aTempEmptyRow[2] = [1, ""]
    Local $aRows = _ParseTags($sHtmlTable, "<tr", "</tr>") ; $aRows[0] = nr. of rows
    If @error Then Return SetError(@error, 0, "")
    Local $aCols[$aRows[0] + 1], $aTemp
    For $i = 1 To $aRows[0]
        $aTemp = _ParseTags($aRows[$i], "<td", "</td>")
        $iError = @error
        If $iError = 1 Then ; check if it's an empty row
            $aTemp = $aTempEmptyRow ; Empty Row
        Else
            If $iError Then Return SetError($iError, 0, "")
        EndIf
        If $aCols[0] < $aTemp[0] Then $aCols[0] = $aTemp[0] ; $aCols[0] = max nr. of columns in table
        $aCols[$i] = $aTemp
    Next
    Local $aResult[$aRows[0]][$aCols[0]], $iStart, $iEnd, $aRowspan, $aColspan, $iSpanY, $iSpanX, $iSpanRow, $iSpanCol, $iMarkerCode, $sCellContent
    Local $aMirror = $aResult
    For $i = 1 To $aRows[0] ; scan all rows in this table
        $aTemp = $aCols[$i] ; <td ..> xx </td> .....
        For $ii = 1 To $aTemp[0] ; scan all cells in this row
            $iSpanY = 0
            $iSpanX = 0
            $iY = $i - 1 ; zero based index for vertical ref
            $iX = $ii - 1 ; zero based index for horizontal ref
            ; following RegExp kindly provided by SadBunny in this post:
            ; http://www.autoitscript.com/forum/topic/167174-how-to-get-a-number-located-after-a-name-from-within-a-string/?p=1222781
            $aRowspan = StringRegExp($aTemp[$ii], "(?i)rowspan\s*=\s*[""']?\s*(\d+)", 1) ; check presence of rowspan
            If IsArray($aRowspan) Then
                $iSpanY = $aRowspan[0] - 1
                If $iSpanY + $iY > $aRows[0] Then
                    $iSpanY -= $iSpanY + $iY - $aRows[0] + 1
                EndIf
            EndIf
            ;
            $aColspan = StringRegExp($aTemp[$ii], "(?i)colspan\s*=\s*[""']?\s*(\d+)", 1) ; check presence of colspan
            If IsArray($aColspan) Then $iSpanX = $aColspan[0] - 1
            ;
            $iMarkerCode += 1 ; code to mark this span area or single cell
            If $iSpanY Or $iSpanX Then
                $iX1 = $iX
                For $iSpY = 0 To $iSpanY
                    For $iSpX = 0 To $iSpanX
                        $iSpanRow = $iY + $iSpY
                        If $iSpanRow > UBound($aMirror, 1) - 1 Then
                            $iSpanRow = UBound($aMirror, 1) - 1
                        EndIf
                        $iSpanCol = $iX1 + $iSpX
                        If $iSpanCol > UBound($aMirror, 2) - 1 Then
                            ReDim $aResult[$aRows[0]][UBound($aResult, 2) + 1]
                            ReDim $aMirror[$aRows[0]][UBound($aMirror, 2) + 1]
                        EndIf
                        ;
                        While $aMirror[$iSpanRow][$iX1 + $iSpX] ; search first free column
                            $iX1 += 1 ; $iSpanCol += 1
                            If $iX1 + $iSpX > UBound($aMirror, 2) - 1 Then
                                ReDim $aResult[$aRows[0]][UBound($aResult, 2) + 1]
                                ReDim $aMirror[$aRows[0]][UBound($aMirror, 2) + 1]
                            EndIf
                        WEnd
                    Next
                Next
            EndIf
            ;
            $iX1 = $iX
            ; following RegExp kindly provided by mikell in this post:
            ; http://www.autoitscript.com/forum/topic/167309-how-to-remove-from-a-string-all-between-and-pairs/?p=1224207
            $sCellContent = StringRegExpReplace($aTemp[$ii], '<[^>]+>', "")
            If $iFilter Then $sCellContent = _HTML_Filter($sCellContent, $iFilter)
            For $iSpX = 0 To $iSpanX
                For $iSpY = 0 To $iSpanY
                    $iSpanRow = $iY + $iSpY
                    If $iSpanRow > UBound($aMirror, 1) - 1 Then
                        $iSpanRow = UBound($aMirror, 1) - 1
                    EndIf
                    While $aMirror[$iSpanRow][$iX1 + $iSpX]
                        $iX1 += 1
                        If $iX1 + $iSpX > UBound($aMirror, 2) - 1 Then
                            ReDim $aResult[$aRows[0]][$iX1 + $iSpX + 1]
                            ReDim $aMirror[$aRows[0]][$iX1 + $iSpX + 1]
                        EndIf
                    WEnd
                    $aMirror[$iSpanRow][$iX1 + $iSpX] = $iMarkerCode ; 1
                    If $bFillSpan Then $aResult[$iSpanRow][$iX1 + $iSpX] = $sCellContent
                Next
                $aResult[$iY][$iX1] = $sCellContent
            Next
        Next
    Next
    ; _ArrayDisplay($aMirror, "Debug")
    Return SetError(0, $aResult[0][0], $aResult)
EndFunc   ;==>_HtmlTableWriteToArray
;
; #FUNCTION# ====================================================================================================================
; Name ..........: _HtmlTableGetWriteToArray
; Description ...: extracts the html code of the required table from the html listing and copies the data of the table to a 2D array
; Syntax ........: _HtmlTableGetWriteToArray($sHtml[, $iWantedTable = 1[, $bFillSpan = False[, $iFilter = 0]]])
; Parameters ....: $sHtml        - A string value containing the html listing
;                  $iWantedTable - [optional] An integer value. The nr. of the table to be parsed (default is first table)
;                  $bFillSpan    - [optional] Default is False. If all span areas have to be filled by repeating the data
;                                  contained in the first cell of the span area
;                  $iFilter      - [optional] Default is 0 (no filters) data extracted from cells is returned unchanged.
;                                  -  0 = no filter
;                                  -  1 = removes non ascii characters
;                                  -  2 = removes all double whitespaces
;                                  -  4 = removes all double linefeeds
;                                  -  8 = removes all html-tags
;                                  - 16 = simple html-tag / entities convertor
; Return values .: success: 2D array containing data from the wanted html table.
;                  failure: An empty string and sets @error as follows:
;                  @error: 1 - no tables are present in the passed HTML
;                          2 - error while parsing tables, (opening and closing tags are not balanced)
;                          3 - error while parsing tables, (open/close mismatch error)
;                          4 - invalid table index request (requested table nr. is out of boundaries)
; ===============================================================================================================================
Func _HtmlTableGetWriteToArray($sHtml, $iWantedTable = 1, $bFillSpan = False, $iFilter = 0)
    Local $aSingleTable = _HtmlTableGetList($sHtml, $iWantedTable)
    If @error Then Return SetError(@error, 0, "")
    Local $aTableData = _HtmlTableWriteToArray($aSingleTable[1], $bFillSpan, $iFilter)
    If @error Then Return SetError(@error, 0, "")
    Return SetError(0, $aTableData[0][0], $aTableData)
EndFunc   ;==>_HtmlTableGetWriteToArray

; #FUNCTION# ====================================================================================================================
; Name ..........: _ParseTags
; Description ...: searches and extracts all portions of html code within opening and closing tags inclusive.
;                  Returns an array containing a collection of <tag ...... </tag> blocks, one in each element (even if they are nested)
; Syntax ........: _ParseTags($sHtml, $sOpening, $sClosing)
; Parameters ....: $sHtml    - A string value containing the html listing
;                  $sOpening - A string value indicating the opening tag
;                  $sClosing - A string value indicating the closing tag
; Return values .: success: a 1D 1 based array containing all the portions of html code representing the element
;                  element [0] of the array (and @extended as well) contains the counter of found elements
;                  failure: An empty string and sets @error as follows:
;                  @error: 1 - no such tags are present in the passed HTML
;                          2 - error while parsing tags, (opening and closing tags are not balanced)
;                          3 - error while parsing tags, (open/close mismatch error)
; ===============================================================================================================================
Func _ParseTags($sHtml, $sOpening, $sClosing) ; example: $sOpening = '<table', $sClosing = '</table>'
    ; it finds how many of such tags are on the HTML page
    StringReplace($sHtml, $sOpening, $sOpening) ; @extended now holds the nr. of occurrences
    Local $iNrOfThisTag = @extended
    ; I assume that there are as many closing </tag> tags as opening <tag tags (as should be);
    ; the counts are not compared up front, only the nesting is checked by the stack below
    If $iNrOfThisTag Then ; if there is at least one of this tag
        ; $aThisTagsPositions array will contain the positions of the
        ; starting <tag and ending </tag> tags within the HTML
        Local $aThisTagsPositions[$iNrOfThisTag * 2 + 1][3] ; 1 based (make room for all open and close tags)
        ; find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag
            $aThisTagsPositions[$i][0] = StringInStr($sHtml, $sOpening, 0, $i) ; start position of the $i-th occurrence of the <tag opening tag
            $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag this is
            $aThisTagsPositions[$i][2] = $i ; nr of this tag
            $aThisTagsPositions[$iNrOfThisTag + $i][0] = StringInStr($sHtml, $sClosing, 0, $i) + StringLen($sClosing) - 1 ; end position of the $i-th occurrence of the </tag> closing tag
            $aThisTagsPositions[$iNrOfThisTag + $i][1] = $sClosing ; it marks which kind of tag this is
        Next
        _ArraySort($aThisTagsPositions, 0, 1) ; now all opening and closing tags are in the same sequence as they appear in the HTML
        Local $aStack[UBound($aThisTagsPositions)][2]
        Local $aTags[Ceiling(UBound($aThisTagsPositions) / 2)] ; will contain the collection of <tag ..... </tag> from the html
        For $i = 1 To UBound($aThisTagsPositions) - 1
            If $aThisTagsPositions[$i][1] = $sOpening Then ; opening <tag
                $aStack[0][0] += 1 ; nr of tags in html
                $aStack[$aStack[0][0]][0] = $sOpening
                $aStack[$aStack[0][0]][1] = $i
            ElseIf $aThisTagsPositions[$i][1] = $sClosing Then ; a closing </tag> was found
                If Not $aStack[0][0] Or Not ($aStack[$aStack[0][0]][0] = $sOpening And $aThisTagsPositions[$i][1] = $sClosing) Then
                    Return SetError(3, 0, "") ; Open/Close mismatch error
                Else ; pair detected (the reciprocal tag)
                    ; now get coordinates of the 2 tags
                    ; 1) extract this tag <tag ..... </tag> from the html to the array
                    $aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]] = StringMid($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0], 1 + $aThisTagsPositions[$i][0] - $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0])
                    ; 2) remove that tag <tag ..... </tag> from the html
                    $sHtml = StringLeft($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0] - 1) & StringMid($sHtml, $aThisTagsPositions[$i][0] + 1)
                    ; 3) adjust the references to the new positions of remaining tags
                    For $ii = $i To UBound($aThisTagsPositions) - 1
                        $aThisTagsPositions[$ii][0] -= StringLen($aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]])
                    Next
                    $aStack[0][0] -= 1 ; nr of tags still in html
                EndIf
            EndIf
        Next
        If Not $aStack[0][0] Then ; all tags were parsed correctly
            $aTags[0] = $iNrOfThisTag
            Return SetError(0, $iNrOfThisTag, $aTags) ; OK
        Else
            Return SetError(2, 0, "") ; opening and closing tags are not balanced
        EndIf
    Else
        Return SetError(1, 0, "") ; there are no such tags on this HTML page
    EndIf
EndFunc   ;==>_ParseTags

; #=============================================================================
; Name ..........: _HTML_Filter
; Description ...: Filter for strings
; AutoIt Version : V3.3.0.0
; Syntax ........: _HTML_Filter(ByRef $sString[, $iMode = 0])
; Parameter(s): .: $sString - String to filter
;                  $iMode   - Optional: (Default = 0) : removes nothing
;                             -  0 = no filter
;                             -  1 = removes non ascii characters
;                             -  2 = removes all double whitespaces
;                             -  4 = removes all double linefeeds
;                             -  8 = removes all html-tags
;                             - 16 = simple html-tag / entities convertor
; Return Value ..: Success - Filtered String
;                  Failure - Input String
; Author(s) .....: Thorsten Willert, Stephen Podhajecki {gehossafats at netmdc. com} _ConvertEntities
; Date ..........: Wed Jan 27 20:49:59 CET 2010
; modified ......: by Chimp: removed a double "&nbsp;" entities declaration,
;                  replaced it with Chr(160) instead of Chr(32),
;                  declared the $aEntities array as Static instead of just Local
; ==============================================================================
Func _HTML_Filter(ByRef $sString, $iMode = 0)
    If $iMode = 0 Then Return $sString
    ;16 simple HTML tag / entities converter
    If $iMode >= 16 And $iMode < 32 Then
        Static Local $aEntities[95][2] = [["&quot;", 34],["&amp;", 38],["&lt;", 60],["&gt;", 62],["&nbsp;", 160] _
                ,["&iexcl;", 161],["&cent;", 162],["&pound;", 163],["&curren;", 164],["&yen;", 165],["&brvbar;", 166] _
                ,["&sect;", 167],["&uml;", 168],["&copy;", 169],["&ordf;", 170],["&not;", 172],["&shy;", 173] _
                ,["&reg;", 174],["&macr;", 175],["&deg;", 176],["&plusmn;", 177],["&sup2;", 178],["&sup3;", 179] _
                ,["&acute;", 180],["&micro;", 181],["&para;", 182],["&middot;", 183],["&cedil;", 184],["&sup1;", 185] _
                ,["&ordm;", 186],["&raquo;", 187],["&frac14;", 188],["&frac12;", 189],["&frac34;", 190],["&iquest;", 191] _
                ,["&Agrave;", 192],["&Aacute;", 193],["&Atilde;", 195],["&Auml;", 196],["&Aring;", 197],["&AElig;", 198] _
                ,["&Ccedil;", 199],["&Egrave;", 200],["&Eacute;", 201],["&Ecirc;", 202],["&Igrave;", 204],["&Iacute;", 205] _
                ,["&Icirc;", 206],["&Iuml;", 207],["&ETH;", 208],["&Ntilde;", 209],["&Ograve;", 210],["&Oacute;", 211] _
                ,["&Ocirc;", 212],["&Otilde;", 213],["&Ouml;", 214],["&times;", 215],["&Oslash;", 216],["&Ugrave;", 217] _
                ,["&Uacute;", 218],["&Ucirc;", 219],["&Uuml;", 220],["&Yacute;", 221],["&THORN;", 222],["&szlig;", 223] _
                ,["&agrave;", 224],["&aacute;", 225],["&acirc;", 226],["&atilde;", 227],["&auml;", 228],["&aring;", 229] _
                ,["&aelig;", 230],["&ccedil;", 231],["&egrave;", 232],["&eacute;", 233],["&ecirc;", 234],["&euml;", 235] _
                ,["&igrave;", 236],["&iacute;", 237],["&icirc;", 238],["&iuml;", 239],["&eth;", 240],["&ntilde;", 241] _
                ,["&ograve;", 242],["&oacute;", 243],["&ocirc;", 244],["&otilde;", 245],["&ouml;", 246],["&divide;", 247] _
                ,["&oslash;", 248],["&ugrave;", 249],["&uacute;", 250],["&ucirc;", 251],["&uuml;", 252],["&thorn;", 254]]
        $sString = StringRegExpReplace($sString, '(?i)<p.*?>', @CRLF & @CRLF)
        $sString = StringRegExpReplace($sString, '(?i)<br>', @CRLF)
        Local $iE = UBound($aEntities) - 1
        For $x = 0 To $iE
            ; casesense = 1 (case sensitive): "&Agrave;" and "&agrave;" are different entities,
            ; a case-insensitive replace would turn both into the upper case character
            $sString = StringReplace($sString, $aEntities[$x][0], Chr($aEntities[$x][1]), 0, 1)
        Next
        For $x = 32 To 255
            $sString = StringReplace($sString, "&#" & $x & ";", Chr($x))
        Next
        $iMode -= 16
    EndIf
    ;8 Tag filter
    If $iMode >= 8 And $iMode < 16 Then
        ;$sString = StringRegExpReplace($sString, '<script.*?>.*?</script>', "")
        $sString = StringRegExpReplace($sString, "<[^>]*>", "")
        $iMode -= 8
    EndIf
    ; 4 remove all double cr, lf
    If $iMode >= 4 And $iMode < 8 Then
        $sString = StringRegExpReplace($sString, "([ \t]*[\n\r]+[ \t]*)", @CRLF)
        $sString = StringRegExpReplace($sString, "[\n\r]+", @CRLF)
        $iMode -= 4
    EndIf
    ; 2 remove all double whitespaces
    If $iMode = 2 Or $iMode = 3 Then
        $sString = StringRegExpReplace($sString, "[[:blank:]]+", " ")
        $sString = StringRegExpReplace($sString, "\n[[:blank:]]+", @CRLF)
        $sString = StringRegExpReplace($sString, "[[:blank:]]+\n", "")
        $iMode -= 2
    EndIf
    ; 1 remove all non ASCII (replaces all chars with ascii code > 127 with a space)
    If $iMode = 1 Then
        $sString = StringRegExpReplace($sString, "[^\x00-\x7F]", " ")
    EndIf
    Return $sString
EndFunc   ;==>_HTML_Filter

Any error reports or suggestions for enhancements are welcome.
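For anyone who wants to try the functions without the GUI example, here is a minimal sketch of a direct call. It assumes the functions above have been saved as _HtmlTable2Array.au3 (the file name mentioned in the comment above) somewhere on the include path. The HTML string is made-up sample data, not something from the original post. The filter value 2 + 16 combines the "double whitespaces" and "entities convertor" options, since the filter codes are meant to be added together.

```autoit
#include <Array.au3>
#include <_HtmlTable2Array.au3> ; assumed file name, see the comment in the script above

; Hypothetical sample data, only for illustration
Local $sHtml = '<html><body><table>' & _
        '<tr><th>Name</th><th>Qty</th></tr>' & _
        '<tr><td>Apples &amp; pears</td><td>3</td></tr>' & _
        '<tr><td colspan="2">spans both columns</td></tr>' & _
        '</table></body></html>'

; table nr. 1, fill the span areas, filter = 2 (double whitespaces) + 16 (entities)
Local $aData = _HtmlTableGetWriteToArray($sHtml, 1, True, 2 + 16)
If @error Then
    ; @error codes 1 to 4 are described in the function header above
    MsgBox(0, "Error", "_HtmlTableGetWriteToArray failed, @error = " & @error)
Else
    _ArrayDisplay($aData, "Parsed table")
EndIf
```

With $bFillSpan set to True the text of the colspan cell should be repeated in both columns of the last row; with False only the first cell of the span area is filled.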