Decipher

HTML Parser UDF

8 posts in this topic

Hi,

I'm inviting all autoit forum members to contribute to a HTML parser udf. I going to attempt to replicate a python module called BeautifulSoup. It would be greatly appreciated if some senior Autoit programmers took interest in this topic. There is no template other than the module written in python located here and the documentation here.

I can't wait to see what this develops into. :robot:


Spoiler

censored.jpg

 

Share this post


Link to post
Share on other sites



I don't see the need, there is the _IE UDF, and with my link below, you can focus on any node(s) through an XPATH.


IEbyXPATH-Grab IE DOM objects by XPATH IEscriptRecord-Makings of an IE script recorder ExcelFromXML-Create Excel docs without excel installed GetAllWindowControls-Output all control data on a given window.

Share this post


Link to post
Share on other sites

I see the need for simplicity. Would you care to give an example of how to use the functions mentioned above for data extraction.


Spoiler

censored.jpg

 

Share this post


Link to post
Share on other sites

I stand corrected. There are templates for parsing HTML... and in UDF format. Thank You @ Jdelaney & KaFu.

If anyone else has UDF's or examples they'd like to share for anyone following path here that would be great. :)


Spoiler

censored.jpg

 

Share this post


Link to post
Share on other sites

#6 ·  Posted (edited)

Finally found a way to "mute" IE -- making it unable to load external resources except the already cached ones -- makes it load page faster. Other IE instances won't be affected, only the one we used as html parser.

This simple wrapper functions:

_HtmlParser_Startup

_HtmlParser_GetDocument ; the IHtmlDocument ref

_HtmlParser_LoadHtml ; load plain text, optionally remove all script tags

_HtmlParser_LoadUrl

_HtmlParser_LoadScript ; to use jquery, xpath, etc.

_HtmlParser_ClearScript

_HtmlParser_Exec ; execute js and get return value, see sample

Save as "HtmlParser.au3" and run, the sample code is included. Clear IE cache first for the best result. Hope it helps :)

#include-once
#include <WinAPI.au3>
#include <WindowsConstants.au3>
#include <GuiConstantsEx.au3>
#include <IE.au3>

; Opt("MustDeclareVars", 1)

Global $_HtmlParser_Debug = False
Global $_HtmlParser_Script = ""

Global Const $_HtmlParser_ScriptName = "HtmlParser.au3"
Global Const $tagINTERNET_PROXY_INFO = "dword dwAccessType; ptr lpszProxy; ptr lpszProxyBypass";
Global Const $tagCOPYDATA = _
  "ULONG_PTR;" & _  ; dwData, The data to be passed to the receiving application
  "DWORD;" & _    ; cbData, The size, in bytes, of the data pointed to by the lpData member
  "PTR"          ; lpData, The data to be passed to the receiving application. This member can be NULL.

Func _HtmlParser_Startup ($iPort=843)
    If IsDeclared("_HtmlParser_IE") AND IsObj($_HtmlParser_IE) Then Return 1

    Global $_HtmlParser_Port = $iPort
    __HtmlParser_ParseCmdLine()
    
    Global $_HtmlParser_HWND = WinWait($_HtmlParser_GUID, "", 5)
    If NOT $_HtmlParser_HWND Then Return SetError(1, 0, 0) ; Daemon failed to start
    WinSetTitle($_HtmlParser_HWND, "", "")
    
    ; Attach IE
    Global $_HtmlParser_IE = _IEAttach(WinGetHandle($_HtmlParser_GUID), "embedded")
    If NOT IsObj(_HtmlParser_GetDocument()) Then Return SetError(2, 0, 0)
    
    OnAutoItExitRegister("__HtmlParser_Shutdown")
    Return 1
EndFunc

Func _HtmlParser_GetDocument ()
    If IsObj($_HtmlParser_IE) Then Return _IEDocGetObj($_HtmlParser_IE)
    Return SetError(1, 0, 0)
EndFunc

Func _HtmlParser_LoadHtml ( $sHtml, $fRemoveScriptTags=0 )
    _IENavigate($_HtmlParser_IE, "about:blank")
    Local $doc = _HtmlParser_GetDocument()
    If $fRemoveScriptTags Then $sHtml = StringRegExpReplace($sHtml, "<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>", "")
    $doc.Write($sHtml & @CRLF & '<script language="javascript">' & @CRLF & $_HtmlParser_Script & @CRLF & 'Array.prototype.set=function(i,v){this[i]=v};Array.prototype.get=function(i){return this[i]};document.scripts[document.scripts.length-1].removeNode(false)</script>')
    Return $doc
EndFunc

Func _HtmlParser_LoadUrl ( $sUrl, $fRemoveScriptTags=0 )
    Local $http = ObjCreate("winhttp.winhttprequest.5.1")
    $http.Open("GET", $sUrl)
    $http.Send()
    Return _HtmlParser_LoadHtml($http.Responsetext, $fRemoveScriptTags)
EndFunc

Func _HtmlParser_LoadScript ( $sFilename )
    Local $script = ""
    If FileExists ($sFilename) Then
        $script = FileRead($sFilename)
    Else
        If StringInStr($sFilename, "http://") OR StringInStr($sFilename, "https://") OR StringInStr($sFilename, "ftp://") Then
            Local $http = ObjCreate("winhttp.winhttprequest.5.1")
            $http.Open("GET", $sFilename)
            $http.Send()
            $script = $http.Responsetext
        EndIf
    EndIf
    If $script Then
        $_HtmlParser_Script &= @CRLF & @CRLF & $script
        Return 1
    EndIf
    Return 0
EndFunc

Func _HtmlParser_ClearScript ( $sFilename )
    $_HtmlParser_Script = ""
EndFunc

Func _HtmlParser_Exec ( $sScript )
    Local $doc = _HtmlParser_GetDocument()
    If IsObj($doc) AND IsObj($doc.parentWindow) Then
        Local $window = $doc.parentWindow
        $window.execScript('window._HtmlParser_Result=(function(){' & $sScript & '})();', 'Javascript')
        If $window._HtmlParser_Result OR IsObj($window._HtmlParser_Result) Then
            Return $window._HtmlParser_Result
        Else
            Return SetError(2, 0, 0)
        EndIf
    EndIf
    Return SetError(1, 0, 0)
EndFunc

#region >> Internals

Func __HtmlParser_Shutdown ()
    Global $_HtmlParser_IE = 0
    If IsDeclared("_HtmlParser_HWND") Then WinClose($_HtmlParser_HWND)
    OnAutoItExitUnregister("__HtmlParser_Shutdown")
EndFunc

Func __HtmlParser_ParseCmdLine ()
    If @compiled Then
      Global Const $_HtmlParser_Exec = '"' & @ScriptFullPath & '"'
    Else
      Global Const $_HtmlParser_Exec = '"' & @AutoItExe & '" "' & @ScriptFullPath & '"'
    EndIf

    Local $cmd = StringInStr($CmdLineRaw, "/_HtmlParser_", 1), $val
    If $cmd > 0 Then
        #NoTrayIcon
        $cmd = StringRegExp(StringMid($CmdLineRaw, $cmd), '/(_HtmlParser_[a-zA-Z]+)\:"([^"]*)"', 3)
        For $i = 0 To UBound($cmd)-1 Step 2
            $val = $cmd[$i+1]
            If $val = ("" & Number($val)) Then $val = Number($val)
            Assign($cmd[$i], $val, 2)
        Next
        __HtmlParser_DaemonRun()
        Exit
    Else
        Global $_HtmlParser_GUID = __WinAPI_CreateGUID()
        __HtmlParser_Daemonstart()
    EndIf
EndFunc

Func __HtmlParser_Daemonstart ()
    Run ( $_HtmlParser_Exec & _
        ' /_HtmlParser_GUID:"' & $_HtmlParser_GUID & '"' & _
        ' /_HtmlParser_Debug:"' & (0 + $_HtmlParser_Debug) & '"' & _
        ' /_HtmlParser_Port:"' & $_HtmlParser_Port & '"' )
EndFunc

Func __HtmlParser_DaemonRun ()
    ; Initialize GUI
    Local $hGUI = GUICreate("", 500, 400, 10, 10, $WS_SIZEBOX, $WS_EX_TOOLWINDOW)
    Global $_HtmlParser_IE = _IECreateEmbedded()
    Local $hIE = GUICtrlCreateObj($_HtmlParser_IE, 0, 0, _WinAPI_GetClientWidth($hGUI), _WinAPI_GetClientHeight($hGUI))
    GUICtrlSetResizing($hIE, $GUI_DOCKAUTO)
    GUIRegisterMsg($WM_SYSCOMMAND, "__HtmlParser_SysCommand")
    If $_HtmlParser_Debug Then GUISetState()
    
    ; Initialize IE
    _IE_SetSessionProxy("127.0.0.1:" & $_HtmlParser_Port)
    _IENavigate($_HtmlParser_IE, "about:blank")
    _IEPropertySet($_HtmlParser_IE, "silent", True)
    WinSetTitle($hGUI, "", $_HtmlParser_GUID)
    
    Local $timeout = TimerInit()
    Do
        If TimerDiff($timeout) > 5000 Then Exit
    Until WinGetTitle($hGUI) <> $_HtmlParser_GUID
    If $_HtmlParser_Debug Then WinSetTitle($hGUI, "", "HtmlParser Debug Window")
    
    While 1
        Sleep(100)
    WEnd
EndFunc

Func __HtmlParser_SysCommand ($hWnd, $Msg, $wParam, $lParam)
    #forceref $Msg, $wParam, $lParam
    If BitAND($wParam, 0xFFF0) = 0xF060 Then Exit
    Return $GUI_RUNDEFMSG
EndFunc

#endregion << Internals

#region >> WinAPIEx

Func __WinAPI_CreateGUID()

    Local $tGUID, $Ret

    $tGUID = DllStructCreate($tagGUID)
    $Ret = DllCall('ole32.dll', 'uint', 'CoCreateGuid', 'ptr', DllStructGetPtr($tGUID))
    If @error Then
        Return SetError(1, 0, '')
    Else
        If $Ret[0] Then
            Return SetError(1, $Ret[0], 0)
        EndIf
    EndIf
    $Ret = DllCall('ole32.dll', 'int', 'StringFromGUID2', 'ptr', DllStructGetPtr($tGUID), 'wstr', '', 'int', 39)
    If (@error) Or (Not $Ret[0]) Then
        Return SetError(1, 0, '')
    EndIf
    Return $Ret[2]
EndFunc   ;==>__WinAPI_CreateGUID

Func _IE_SetSessionProxy ($sProxyAddress, $sBypassList="")
    Local $tProxyAddress = DllStructCreate("char[" & StringLen($sProxyAddress) + 1 & "]"), _
          $tBypassList = DllStructCreate("char[" & StringLen($sBypassList) + 1 & "]"), _
          $tINTERNET_PROXY_INFO = DllStructCreate($tagINTERNET_PROXY_INFO)
          
    DllStructSetData($tINTERNET_PROXY_INFO, "dwAccessType", 0x3)
    DllStructSetData($tProxyAddress, 1, $sProxyAddress)
    DllStructSetData($tINTERNET_PROXY_INFO, "lpszProxy", DllStructGetPtr($tProxyAddress))
    DllStructSetData($tBypassList, 1, $sBypassList)
    DllStructSetData($tINTERNET_PROXY_INFO, "lpszProxyBypass", DllStructGetPtr($tBypassList))
    
    Local $aRet = DllCall("urlmon.dll", "INT", "UrlMkSetSessionOption", _
                            "uint", 0x26, _
                            "ptr", DllStructGetPtr($tINTERNET_PROXY_INFO), _
                            "int", DllStructGetSize($tINTERNET_PROXY_INFO), _
                            "int", 0 )
    If @error OR $aRet[0] Then Return SetError(1, @error, 0)
    
    Return 1
EndFunc

#endregion << WinAPIEx

#region >> Debugging

Func _Alert ($msg, $fDialog=1)
  If $fDialog Then
    MsgBox(0, @ScriptName, $msg)
  Else
    TrayTip(@ScriptName, $msg, 5000)
  EndIf
EndFunc

Func _Critical ($ret, $rel=0, $msg="Fatal Error", $err=@error, $ext=@extended, $ln = @ScriptLineNumber)
  If $err Then
    $ln += $rel
    Local $LastError = _WinAPI_GetLastError(), _
          $LastErrorMsg = _WinAPI_GetLastErrorMessage(), _
          $LastErrorHex = Hex($LastError)
    $LastErrorHex = "0x" & StringMid($LastErrorHex, StringInStr($LastErrorHex, "0", 1, -1)+1)
    $msg &= @CRLF & "at line " & $ln & @CRLF & @CRLF & "AutoIt Error: " & $err & " (0x" & Hex($err)  & ") Extended: " & $ext
    If $LastError Then $msg &= @CRLF & "WinAPI Error: " & $LastError & " (" & $LastErrorHex & ")" & @CRLF & $LastErrorMsg
    ClipPut($msg)
    MsgBox(270352, "Fatal Error - " & @ScriptName, $msg)
    Exit
  EndIf
  Return $ret
EndFunc

#endregion << Debugging

; ==============================================================================

Func _HtmlParser_Test ()
    $_HtmlParser_Debug = True
    
    _Critical( _HtmlParser_Startup() )
    
    ; warming up
    Local $doc = _HtmlParser_GetDocument()
    $doc.write("Hello AutoIt World")
    _Alert("now for real")
    
    _HtmlParser_LoadScript("https://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js")
    ; Or:
    ; _HtmlParser_LoadScript(@ScriptDir & "\jquery.min.js")
    Local $doc = _HtmlParser_LoadUrl("http://www.autoitscript.com", True)
    Local $pages = _Critical( _HtmlParser_Exec('var p=[];$("p").each(function(index) { p.push("[" + (index+1) + "] " + $(this).text());});return p') )
    Local $s = ""
    For $i = 0 To $pages.length-1
        $s &= $pages.get($i) & @CRLF
    Next
    _Alert($s)
    
    Local $saved = $doc.body.parentElement.outerHTML
    Local $divs = _HtmlParser_Exec('return $(".featitem.clearfix")')
    Local $div, $features = ""
    For $i = 0 To $divs.length-1
        $div = $divs.get($i)
        $features &= $div.outerHTML
    Next
    $doc.body.innerHTML = $features
    _Alert("A lot faster parsing in javascript though")
    
    _HtmlParser_LoadHtml($saved, True)
    Local $features = _HtmlParser_Exec('var s=""; $(".featitem.clearfix").each(function(){ s += this.outerHTML }); document.body.innerHTML=s; return s')  
    _Alert($features)
    
EndFunc
If @ScriptName = $_HtmlParser_ScriptName Then _HtmlParser_Test()

Edit: added $fRemoveScriptTags option for _HtmlParser_LoadUrl and _HtmlParser_LoadHtml

Edited by eimhym

Share this post


Link to post
Share on other sites

#7 ·  Posted (edited)

Heres another, IE-less, example. Needs libtidy (attached) to clean pages and pass the well-formed html into MSXML ActiveX control (see MSDN documentation here).

pros: 1) lightweight, lighting fast. 2) XPath available by default. 3) Script tags won't executed, css or images won't loaded. 4) No problem in deleting SCRIPT tag.

cons: 1) does fail with some pages, always check @error after calling _HXmlParser_LoadUrl or _HXmlParser_LoadHtml. 2) libtidy crash on HTML5 pages, you have to reload the dll. 3) Doesn't handle html tags within textarea correctly, suggestion for workaround expected. 4) Can't use JS framework.

The sample code do the same as HtmlParser, for comparison.

#include-once
#include "libtidy.au3"

; Opt("MustDeclareVars", 1)

Global Const $_HXmlParser_ScriptName = "HXmlParser.au3"
Global $_HXmlParser_DOM = 0

Func _HXmlParser_Startup ( $sConfFilename="tidy-xml-settings.cfg" )
    _LibTidy_Startup()
    If @error Then Return SetError(@error, @extended, 0)
    _LibTidy_LoadConfig($sConfFilename)
    If @error Then
        _LibTidy_Shutdown()
        Return SetError(@error, @extended, 0)
    EndIf
    $_HXmlParser_DOM = ObjCreate("MSXML2.DOMDocument")
    OnAutoItExitRegister("__HXmlParser_Shutdown")
    $_HXmlParser_DOM.validateOnParse = False;
    $_HXmlParser_DOM.resolveExternals = False;
    Return 1
EndFunc

Func __HXmlParser_Shutdown ()
    $_HXmlParser_DOM = 0
    OnAutoItExitUnRegister("__HXmlParser_Shutdown")
EndFunc

Func _HXmlParser_GetErrorString ()
    If IsObj($_HXmlParser_DOM.parseError) AND $_HXmlParser_DOM.parseError.errorCode Then
        Return "Error loading page " & _
                " (" & Hex($_HXmlParser_DOM.parseError.errorCode) & _
                ") at line: " & $_HXmlParser_DOM.parseError.line & _
                ", position: " & $_HXmlParser_DOM.parseError.linepos & _
                ", reason: " & $_HXmlParser_DOM.parseError.reason
    EndIf
    Return 0
EndFunc

Func _HXmlParser_LoadHtml ( $sHtml )
    If NOT $sHtml Then Return SetError(4, 0, 0)
    _LibTidy_LoadString( $sHtml )
    If @error Then Return SetError(@error, @extended, 0)
    $sHtml = _LibTidy_CleanAndRepair()
    If @error OR NOT $sHtml Then Return SetError(5, @error, 0)
    
    $sHtml = StringRegExp(StringMid($sHtml, StringInStr($sHtml, "<html")), "(?s)^<html[^>]*>[^<]*(<.*)", 1)
    If @error Then Return SetError(6, @error, 0)

    $sHtml = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [' _
        & '<!ENTITY nbsp " "><!ENTITY iexcl "¡"><!ENTITY cent "¢"><!ENTITY pound "£"><!ENTITY curren "¤"><!ENTITY yen "¥"><!ENTITY brvbar "¦"><!ENTITY sect "§"><!ENTITY uml "¨"><!ENTITY copy "©"><!ENTITY ordf "ª"><!ENTITY laquo "«"><!ENTITY not "¬"><!ENTITY shy "­"><!ENTITY reg "®"><!ENTITY macr "¯"><!ENTITY deg "°"><!ENTITY plusmn "±"><!ENTITY sup2 "²"><!ENTITY sup3 "³"><!ENTITY acute "´"><!ENTITY micro "µ"><!ENTITY para "¶"><!ENTITY middot "·"><!ENTITY cedil "¸"><!ENTITY sup1 "¹"><!ENTITY ordm "º"><!ENTITY raquo "»"><!ENTITY frac14 "¼"><!ENTITY frac12 "½"><!ENTITY frac34 "¾"><!ENTITY iquest "¿"><!ENTITY times "×"><!ENTITY divide "÷">' _
        & '<!ENTITY Agrave "À"><!ENTITY Aacute "Á"><!ENTITY Acirc "Â"><!ENTITY Atilde "Ã"><!ENTITY Auml "Ä"><!ENTITY Aring "Å"><!ENTITY AElig "Æ"><!ENTITY Ccedil "Ç"><!ENTITY Egrave "È"><!ENTITY Eacute "É"><!ENTITY Ecirc "Ê"><!ENTITY Euml "Ë"><!ENTITY Igrave "Ì"><!ENTITY Iacute "Í"><!ENTITY Icirc "Î"><!ENTITY Iuml "Ï"><!ENTITY ETH "Ð"><!ENTITY Ntilde "Ñ"><!ENTITY Ograve "Ò"><!ENTITY Oacute "Ó"><!ENTITY Ocirc "Ô"><!ENTITY Otilde "Õ"><!ENTITY Ouml "Ö"><!ENTITY Oslash "Ø"><!ENTITY Ugrave "Ù"><!ENTITY Uacute "Ú"><!ENTITY Ucirc "Û"><!ENTITY Uuml "Ü"><!ENTITY Yacute "Ý"><!ENTITY THORN "Þ"><!ENTITY szlig "ß"><!ENTITY agrave "à"><!ENTITY aacute "á"><!ENTITY acirc "â"><!ENTITY atilde "ã"><!ENTITY auml "ä"><!ENTITY aring "å"><!ENTITY aelig "æ"><!ENTITY ccedil "ç"><!ENTITY egrave "è"><!ENTITY eacute "é"><!ENTITY ecirc "ê"><!ENTITY euml "ë"><!ENTITY igrave "ì"><!ENTITY iacute "í"><!ENTITY icirc "î"><!ENTITY iuml "ï"><!ENTITY eth "ð"><!ENTITY ntilde "ñ"><!ENTITY ograve "ò"><!ENTITY oacute "ó"><!ENTITY ocirc "ô"><!ENTITY otilde "õ"><!ENTITY ouml "ö"><!ENTITY oslash "ø"><!ENTITY ugrave "ù"><!ENTITY uacute "ú"><!ENTITY ucirc "û"><!ENTITY uuml "ü"><!ENTITY yacute "ý"><!ENTITY thorn "þ"><!ENTITY yuml "ÿ">' _
        & ']>' & @CRLF _
        & '<html>' & @CRLF _
        & $sHtml[0]

    $_HXmlParser_DOM.loadXML($sHtml);
    If IsObj($_HXmlParser_DOM.parseError) AND $_HXmlParser_DOM.parseError.errorCode Then
        SetError(7, $_HXmlParser_DOM.parseError.errorCode, 0)
    EndIf
    
    $_HXmlParser_DOM.setProperty("SelectionLanguage", "XPath");
    
    Return $_HXmlParser_DOM
EndFunc

Func _HXmlParser_LoadUrl ( $sUrl )
    Local $http = ObjCreate("winhttp.winhttprequest.5.1")
    $http.Open("GET", $sUrl)
    $http.Send()
    Local $ret = _HXmlParser_LoadHtml($http.Responsetext)
    If @error Then Return SetError(@error, @extended, $ret)
    Return $ret
EndFunc


#region >> Debugging

Func _Alert ($msg, $fDialog=1, $err=@error, $ext=@extended, $ln = @ScriptLineNumber)
  If $fDialog Then
    MsgBox(0, @ScriptName, $msg)
  Else
    TrayTip(@ScriptName, $msg, 5000)
  EndIf
  If $err Then Return SetError($err, $ext, $ln)
  Return 0
EndFunc

Func _Critical ($ret, $rel=0, $msg="Fatal Error", $err=@error, $ext=@extended, $ln = @ScriptLineNumber)
  If $err Then
    $ln += $rel
    Local $LastError = _WinAPI_GetLastError(), _
          $LastErrorMsg = _WinAPI_GetLastErrorMessage(), _
          $LastErrorHex = Hex($LastError)
    $LastErrorHex = "0x" & StringMid($LastErrorHex, StringInStr($LastErrorHex, "0", 1, -1)+1)
    $msg &= @CRLF & "at line " & $ln & @CRLF & @CRLF & "AutoIt Error: " & $err & " (0x" & Hex($err)  & ") Extended: " & $ext
    If $LastError Then $msg &= @CRLF & "WinAPI Error: " & $LastError & " (" & $LastErrorHex & ")" & @CRLF & $LastErrorMsg
    $msg &= @CRLF & @CRLF & _HXmlParser_GetErrorString()
    ClipPut($msg)
    MsgBox(270352, "Fatal Error - " & @ScriptName, $msg)
    Exit
  EndIf
  Return $ret
EndFunc

#endregion << Debugging

Func _HXmlParser_Test ()
    
    _Critical( _HXmlParser_Startup() )
    Local $dom = _Critical( _HXmlParser_LoadUrl("http://www.AutoItScript.com") )
   
    _Alert("Removing all SCRIPT tags")
    Local $begin = TimerInit()
    Local $nodes = $dom.selectNodes("//script")
    If $nodes.length Then
        For $i = 0 To $nodes.length - 1
            Local $node = $nodes.item($i)
            $node.parentNode.removeChild($node)
        Next
    EndIf
    _Alert("Done in " & TimerDiff($begin) & " ms")
    
    _Alert("Collecting all P tags")
    Local $count = 1, $s = ""
    $begin = TimerInit()
    $nodes = $dom.selectNodes("//p")
    If $nodes.length Then
        For $i = 0 To $nodes.length - 1
            Local $node = $nodes.item($i)
            $s &= "[" & $count & "] " & $node.text & @CRLF
            $count += 1
        Next
    EndIf
    _Alert("Done in " & TimerDiff($begin) & " ms")
    _Alert($s)

    $s = ""
    _Alert("Collecting all feature DIVs")
    $begin = TimerInit()
    $nodes = $dom.selectNodes("//div[contains(@class, 'featitem')]")
    If $nodes.length Then
        For $i = 0 To $nodes.length - 1
            Local $node = $nodes.item($i)
            $s &= $node.xml & @CRLF
        Next
    EndIf
    _Alert("Done in " & TimerDiff($begin) & " ms")
    _Alert($s)
    
    ClipPut($dom.xml)
    _Alert("HTML content is in clipboard")
    
EndFunc
If @ScriptName = $_HXmlParser_ScriptName Then _HXmlParser_Test()

libtidy.7z

Edited by eimhym

Share this post


Link to post
Share on other sites

#8 ·  Posted (edited)

I get an "libtidy.au3(106,68) : ERROR: $tidyLoadConfig: undeclared global variable."

What am I doing wrong here ?

Edited by level20peon

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now

  • Similar Content

    • islandspapand
      By islandspapand
      Hi all
      i am currently trying to click on an element in a HTML Table, but just can get it to work.
      i am able to click the top of the table so it changes to sort  but just can't click on the element in the table.
      an i need to click on element to continue in the site.
      i have attached the code so far and pictures of the table  element want to click plus the source of the table.
      i am able to get data in the table with $oTable = _IETableGetCollection($oIE, 2) but not able to click on them.
       
      Help is very much appreciated
       
      #cs ---------------------------------------------------------------------------- AutoIt Version: 3.3.14.2 Author: myName Script Function: Template AutoIt script. #ce ---------------------------------------------------------------------------- ; Script Start - Add your code below here #include <IE.au3> #include "DOM.au3" #include <Array.au3> #include <MsgBoxConstants.au3> Global $oIE = _IECreate("*") _IELoadWait($oIE) Sleep(2000) _PageLogin($oIE) _PageLoadWait() _PageNewReq($oIE) _PageLoadWait() _InputModelInf($oIE) _PageLoadWait() Sleep(1000) $aTableLink = BGe_IEGetDOMObjByXPathWithAttributes($oIE, "//table/tbody/tr/td[.='Name Of user']", 2000) ;~ $aTableLink = BGe_IEGetDOMObjByXPathWithAttributes($oIE, "//table/tbody/tr", 2000) ;~ _ArrayDisplay($aTableLink,"$aTableLink") If IsArray($aTableLink) Then ConsoleWrite("Able to BGe_IEGetDOMObjByXPathWithAttributes($oIE, //table/tbody/tr/td[.='Name Of user'])" & @CRLF) For $i = 0 To UBound($aTableLink)-1 ConsoleWrite(" OuterHTML : " & $aTableLink[$i].outerHTML & @CRLF) ConsoleWrite(" Parentnode : " & $aTableLink[$i].parentnode & @CRLF) ConsoleWrite(" Parentnode.click : " & $aTableLink[$i].parentnode.fireEvent("onclick","click") & @CRLF) $objClick = $aTableLink[$i].parentnode ;~ _IEAction($aTableLink[$i] , "focus") _IEAction($objClick , "focus") ;~ If _IEAction($aTableLink[$i], "click") Then If _IEAction($objClick, "click") Then ConsoleWrite("Able to _IEAction($aForumLink[0], 'click')" & @CRLF) _IELoadWait($oIE) Else ConsoleWrite("UNable to _IEAction($aForumLink[0], 'click')" & @CRLF) Exit 3 EndIf Next Else ConsoleWrite("Unable to BGe_IEGetDOMObjByXPathWithAttributes($oIE, //table/tbody/tr/td[.='Name Of user'])" & @CRLF) Exit 2 EndIf _PageLoadWait() Func _InputModelInf($oTmpIE) ; Add Var for Model & Serial in Func $oModelInput = _IEGetObjById($oTmpIE,"model") _IEAction($oModelInput,"focus") _IEDocInsertText($oModelInput, "*") $oSerialInput = _IEGetObjById($oTmpIE,"serial") _IEAction($oModelInput,"focus") _IEDocInsertText($oSerialInput, "*") $links = $oTmpIE.document.getElementsByClassName("btn btn-primary ng-scope") For $link In $links If $link.innertext = "Søg" Or $link.innertext = "Search" Then $link.click() ExitLoop EndIf Next Return True EndFunc Func _PageNewReq($oTmpIE) $links = $oTmpIE.document.getElementsByClassName("ng-scope k-link") For $link In $links If $link.innertext = "Send ny fejlmelding" Or $link.innertext = "Submit a New Service Request" Then $link.click() ExitLoop EndIf Next Return True EndFunc Func _PageLogin($oTmpIE) $oUserInput = _IEGetObjById($oTmpIE,"loginid") _IEDocInsertText($oUserInput, "*") $oPasswordInput = _IEGetObjById($oTmpIE,"password") _IEDocInsertText($oPasswordInput, "*") $links = $oTmpIE.document.getElementsByClassName("btn btn-primary login ng-scope") For $link In $links If $link.innertext = "Sign in" Then $link.click() ExitLoop EndIf Next Return True EndFunc Func _PageLoadWait() Local $PageLoadWait = False ;~ nav navbar-nav navbar-right ng-hide ;~ nav navbar-nav navbar-right $tags = $oIE.document.GetElementsByTagName("ul") For $tag in $tags $class_value = $tag.GetAttribute("class") If $class_value = "nav navbar-nav navbar-right" Then ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : Webpage loading :) ' & @CRLF) ;### Debug Console $PageLoadWait = True ExitLoop EndIf Next Do sleep(250) For $tag in $tags $class_value = $tag.GetAttribute("class") If $class_value = "nav navbar-nav navbar-right ng-hide" Then ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : Webpage load finished :)'& @CRLF) ;### Debug Console $PageLoadWait = False ExitLoop EndIf Next Until $PageLoadWait = False EndFunc  
      Thanks in advance
       
       


    • imitto
      By imitto
      Hello all!
      I use Autoit for a while, already made some automation for a TV station's master control room with it. I made a UDF to easily work with PAL timecode and time with milliseconds, convert, add or subtract them. Feel free to use it if you want something like this
      #Region ;**** Directives created by AutoIt3Wrapper_GUI **** #AutoIt3Wrapper_Res_Description=PAL Timecode Calculator UDF #AutoIt3Wrapper_Res_LegalCopyright=horvath.imre@gmail.com #EndRegion ;**** Directives created by AutoIt3Wrapper_GUI **** ; ; #FUNCTION# ; Name...........: _tcAdd ; Description....: Returns addition of two timecodes ; Syntax.........: _tcAdd($fTc1, fTc2 [, $fFormat = "P"]) ; ; Parameters.....: $fTc1 - First timecode in hh:mm:ss.ff format ; $fTc2 - Second timecode in hh:mm:ss.ff format ; $fFormat - Time base - "P" (default): PAL (25 fps) ; "M" : millisecond ; ; Return value...: Sum of the two timecode in the selected format Func _tcAdd($fTc1, $fTc2, $fFormat = "P", $fHourFormat = 1) Local $fMs1 = _tcToMs($fTc1) Local $fMs2 = _tcToMs($fTc2) Local $fSumMs = $fMs1 + $fMs2 Return _msToTc($fSumMs, $fFormat, $fHourFormat) EndFunc ; #FUNCTION# ; Name...........: _tcsSub ; Description....: Returns addition of two timecodes ; Syntax.........: _tcSub($fTc1, fTc2 [, $fFormat = "P"]) ; ; Parameters.....: $fTc1 - First timecode in hh:mm:ss.ff format ; $fTc2 - Second timecode in hh:mm:ss.ff format ; $fFormat - Time base - "P" (default): PAL (25 fps) ; "M" : millisecond ; ; Return value...: Subtract $fTc2 from $fTc1 in the source format Func _tcSub($fTc1, $fTc2, $fFormat = "P") Local $fMs1 = _tcToMs($fTc1) Local $fMs2 = _tcToMs($fTc2) Local $fSumMs = $fMs1 - $fMs2 If $fSumMs < 0 Then $fSumMs = _tcToMs("24:00:00.00") - ($fSumMs * -1) EndIf Return _msToTc($fSumMs, $fFormat) EndFunc ; #FUNCTION# ; Name...........: _tcToMs ; Description....: Returns timecode converted to total milliseconds ; Syntax.........: _tcToMs($fTc) ; ; Parameters.....: $fTc - Timecode in hh:mm:ss.ff or hh:mm:ss:xxx format, where xxx are milliseconds ; ; Return value...: Milliseconds as an integer value Func _tcToMs($fTc) Local $fTemp = StringSplit($fTc, ":.") Local $fChr = StringLen($fTemp[4]) Switch $fChr Case 2 Return ($fTemp[4] * 40) + ($fTemp[3] * 1000) + ($fTemp[2] * 60000) + ($fTemp[1] * 3600000) Case 3 Return ($fTemp[4]) + ($fTemp[3] * 1000) + ($fTemp[2] * 60000) + ($fTemp[1] * 3600000) EndSwitch EndFunc ; #FUNCTION# ; Name...........: _msToTc ; Description....: Converts total milliseconds to timecode ; Syntax.........: _msToTc($fIn, $fFormat = "P", $fHourFormat = 1) ; ; Parameters.....: $fIn - Time in milliseconds ; $fFormat - Output format "P": PAL TC (default) ; "M": hh:mm:ss.xxx where xxx are milliseconds ; $fHourFormat - Hour format "1": max. value is 23, then starts from 0 (default) ; "0": hours can be more then 23 ; ; Return value...: Timecode as string in the selected format Func _msToTc($fIn, $fFormat = "P", $fHourFormat = 1) Switch $fFormat Case "P" Local $fFr = StringFormat("%02i", (StringRight($fIn, 3) - Mod(StringRight($fIn, 3), 40)) / 40) Case "M" Local $fFr = StringFormat("%03i", StringRight($fIn, 3)) EndSwitch $fIn = StringTrimRight($fIn, 3) Local $fSec = StringFormat("%02i", Mod($fIn, 60)) $fIn -= $fSec Local $fMinTot = $fIn / 60 Local $fMin = StringFormat("%02i", Mod($fMinTot, 60)) $fIn -= $fMin*60 Local $fHourTot = $fIn / 60 / 60 Switch $fHourFormat Case 1 $fHour = StringFormat("%02i", Mod($fHourTot, 24)) Case 0 $fHour = StringFormat("%02i", $fHourTot) EndSwitch Return($fHour & ":" & $fMin & ":" & $fSec & "." & $fFr) EndFunc ; #FUNCTION# ; Name...........: _tcFormatChange ; Description....: Toggle TC format ; Syntax.........: _tcFormatChange($fTc) ; ; Parameters.....: $fTc - Timecode in hh:mm:ss.ff or hh:mm:ss:xxx format, where xxx are milliseconds ; ; Return value...: PAL timecode or time with milliseconds as string, depends on input Func _tcFormatChange($fTc) Local $fTemp = StringSplit($fTc, ":.") Local $fChr = StringLen($fTemp[4]) Switch $fChr Case 2 Return $fTemp[1]&":"&$fTemp[2]&":"&$fTemp[3]&"."&StringFormat("%03i", $fTemp[4]*40) Case 3 Return $fTemp[1]&":"&$fTemp[2]&":"&$fTemp[3]&"."&StringFormat("%02i", ($fTemp[4]-Mod($fTemp[4], 40))/40) EndSwitch EndFunc And the example script:
      #include<_PAL_TC_Calc.au3> $palTC1 = "00:01:12.20" $palTC2 = "23:59:50.02" $msTC1 = "00:01:12.800" $msTC2 = "23:59:50.120" MsgBox(0, "1", _tcAdd($palTC1, $palTC2)); Adds $palTC1 to $palTC2, turns hour back to 0 after 23, returns PAL TC format MsgBox(0, "2", _tcAdd($palTC1, $palTC2, "M")); Adds $palTC1 to $palTC2, turns hour back to 0 after 23, returns time with milliseconds format MsgBox(0, "3", _tcAdd($palTC1, $palTC2, "M", 0)); Adds $palTC1 to $palTC2, hours can be infinite, returns time with milliseconds format MsgBox(0, "4", _tcAdd($msTC1, $msTC2)); Adds $palTC1 to $palTC2, turns hour back to 0 after 23, returns PAL TC format MsgBox(0, "5", _tcAdd($msTC1, $msTC2, "M")); Adds $palTC1 to $palTC2, turns hour back to 0 after 23, returns time with milliseconds format MsgBox(0, "6", _tcAdd($msTC1, $msTC2, "M", 0)); Adds $palTC1 to $palTC2, hours can be infinite, returns time with milliseconds format MsgBox(0, "7", _tcSub($palTC2, $palTC1)); Subtract $palTC1 from $palTC2, returns PAL TC format MsgBox(0, "8", _tcSub($palTC2, $palTC1, "M")); Subtract $palTC1 from $palTC2, time with milliseconds format MsgBox(0, "9", _tcSub($msTC1, $msTC2)); Subtract $palTC1 from $palTC2, returns PAL TC format - when hits zero, counts back from 24:00:00.00 MsgBox(0, "10", _tcSub($msTC1, $msTC2, "M")); Subtract $palTC1 from $palTC2, time with milliseconds format - when hits zero, counts back from 24:00:00.000 MsgBox(0, "11", _tcFormatChange($palTC2)); Convert PAL TC to time with milliseconds and back MsgBox(0, "12", _tcFormatChange($msTC2)); Convert PAL TC to time with milliseconds and back  
      TC_CALC_example.au3
      _PAL_TC_Calc.au3
    • TRAGENALPHA
      By TRAGENALPHA
      A small UDF to Modify the Console Interface.
      #include-once ;#AutoIt3Wrapper_Au3Check_Parameters=-q -d -w 1 -w 2 -w 3 -w- 4 -w 5 -w 6 -w- 7 ; =============================================================================================================================== ; Name...........: Console Modify ; Description ...: A small UDF to manipulate the Console Interface for scripts that are compiled as a console application. ; Syntax.........: _ConsoleClear() -- Clears the Console ; _ConsoleTitle("VALUE") - Sets the console title. ; _ConsoleWindow([WIDTH BUFFER SIZE], [HEIGHT BUFFER SIZE]) - Sets the Width and Height Buffer size of the console. ; Parameters ....: $bh6e5v_ctval = Title of the Console Window. ; $bh6e5v_cwwidth = Window width buffer size. ; $bh6e5v_cwheight = Window height buffer size. ; Return values .: True = Console window buffer size has been changed ; False = Failed to change console window buffer size. ; Author ........: TRAGENALPHA <3 ; Example .......: _ConsoleTitle("This is the new title") // _ConsoleWindow(200, 60) ; =============================================================================================================================== ; -- This is here because writing RunDos and including a whole UDF is too much. But this is basically just _RunDos() ;Func cmd($bh6e5v_ldvar) ; RunWait(@ComSpec & " /c " & $bh6e5v_ldval) ;EndFunc Func _ConsoleClear() RunWait(@ComSpec & " /c cls") EndFunc Func _ConsoleTitle($bh6e5v_ctval) RunWait(@ComSpec & " /c title " & $bh6e5v_ctval) EndFunc Func _ConsoleWindow($bh6e5v_cwwidth, $bh6e5v_cwheight) If IsNumber($bh6e5v_cwwidth) And IsNumber($bh6e5v_cwheight) And ($bh6e5v_cwwidth > 0) And ($bh6e5v_cwheight > 0) Then RunWait(@ComSpec & " /c mode con: cols=" & $bh6e5v_cwwidth & " lines=" & $bh6e5v_cwheight) Return True Else Return False EndIf EndFunc  
      ConsoleModify.au3
    • Spask
      By Spask
      Hi, I'm trying to find a text value inside of a html.
      This is what the line looks like normally:
      <p id="line1" class> <span class="bot">TEXT HERE</span> </p> The text then changes to a non breaking space:
      <p id="line1" class> <span class="bot">&nbsp;</span> </p> And then it changes back to normal text but it's different every time.
      Can I code this so that it grabs the text every time it changes and has a variable that represents it?
      I currently have this inside of my loop:
      $span = .document.getElementsByTagName("span") For $text In $span If $text.value = "&nbsp;" Then Sleep(50) MsgBox(0,0,0) ;messagebox to test if it can be found, but I don't know how to grab the text EndIf Next The problem is that there are many other lines in the html that have the same span but are called "line3", "line5", etc and the one I need is from "line1".
      I will appreciate if anyone can help with this!
    • FrancescoDiMuro
      By FrancescoDiMuro
      Good evening everyone
      Before all, I want to say that I'm doing this script to see how _IE* functions work, and see if my studs can hack a quiz I'm working on.
      I want to clarify that I'm not automating any game, bypassing any CAPTCHAs, or anything that could damage anyone.
      I was trying to autofill a form, based on which question is displayed.
      The question is always stored in here:
      <header> <h1><span class="questionid">1. </span>Here goes the question</h1> </header> And answers are stored in here:
      <ul class="answers"> <li><label><span><input id="answer_0" name="answer[]" type="radio" value="0">Answer 1</span></label></li> <li><label><span><input id="answer_1" name="answer[]" type="radio" value="1">Answer 2</span></label></li> <li><label><span><input id="answer_2" name="answer[]" type="radio" value="2">Anwser 3</span></label></li> <li><label><span><input id="answer_3" name="answer[]" type="radio" value="3">Answer 4</span></label></li> </ul></fieldset></form></div> And, there are 15 questions like this.
      How can automatically fill my form?
      Thanks in advance
      Francesco