Mikeman27294

[Problem] Cutting divs from a source code string read from a saved HTML page

20 posts in this topic

#1 ·  Posted (edited)

Hey everyone,

I am working on a script for which I have a webpage saved. At startup, the program loads the source code into a variable, removes all script and noscript elements, and then must cut out a div called product (<div id="product">). The problem is that each time I try to write this function, I have trouble detecting how many child divs are nested in it. I have also tried writing a function that determines the number of characters in the div, so that I can trim to the right of the div, but this has not worked. I don't wish to use an embedded browser for my program; I just want to work with the strings. Could anybody give me any pointers?

Thanks.

EDIT

I have found the following forum post:

The problem, though, is that (as far as I am aware) it requires you to use the embedded browser or Internet Explorer itself, which will not help me as I just want to read the source code. I am aware that I can load the source into the browser and then hide the browser, but I would rather just "cut" the div out of the source.

Thanks.

SOLUTION

Thanks to thanlankon. Please note that this also returns the div tags (opening and closing) for the specified div. It also removes the " character from either side of the ID name.

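; Usage: returns the outer HTML of the div, or '' with @error set on failure.
_HTMLGetDIVCode('html code here', 'id of div here')

Func _HTMLGetDIVCode($html_code, $div_id)
    ; HTMLFILE parses the markup in memory without showing a browser window.
    Local $o_htmlfile = ObjCreate('HTMLFILE')
    If Not IsObj($o_htmlfile) Then Return SetError(-1, 0, '')
    $o_htmlfile.open()
    $o_htmlfile.write($html_code)
    $o_htmlfile.close()
    Local $div = $o_htmlfile.getElementByID($div_id)
    ; Use a non-zero @error so the caller can tell that the div was not found.
    If Not IsObj($div) Then Return SetError(-2, 0, '')
    Return $div.outerHTML
EndFunc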

Edited by Mikeman27294




Please show the text before and after the cutting. That way it should be easy to have a look.


Scripts & functions Organize Includes Let Scite organize the include files

Yahtzee The game "Yahtzee" (Kniffel, DiceLion)

LoginWrapper Secure scripts by adding a query (authentication)

_RunOnlyOnThis UDF Make sure that a script can only be executed on ... (Windows / HD / ...)

Internet-Café Server/Client Application Open CD, Start Browser, Lock remote client, etc.

MultipleFuncsWithOneHotkey Start different funcs by hitting one hotkey different times


Ok, I will attach 2 HTML documents.

Before

After

If anybody has doubts about whether these files contain viruses, I can email them to you (they are just 2 HTML files called File and Result, respectively). It would also be appreciated if anybody who downloads them could verify that they are virus free (just to clear up any doubts), and thanks to anybody who does.

This is what I was hoping to achieve from the program. I wish to remove the div named product, and write that to a file. I already have my program removing script and noscript, which is also demonstrated in the result file.
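
For readers following along, the preprocessing described above (load the saved page, strip the script and noscript blocks, write the result) might look roughly like the sketch below. The file names are placeholders based on the File and Result attachments, and the patterns assume reasonably well-formed tags; it is an illustration, not the poster's actual code.

```autoit
; Hypothetical paths, named after the "File" and "Result" attachments.
Local $sIn = @ScriptDir & '\File.html'
Local $sOut = @ScriptDir & '\Result.html'

If Not FileExists($sIn) Then Exit 1
Local $sHtml = FileRead($sIn)

; (?is): ignore case and let "." match line breaks, so multi-line blocks are removed.
; The lazy .*? stops at the first closing tag that follows each opening tag.
$sHtml = StringRegExpReplace($sHtml, '(?is)<script\b.*?</script\s*>', '')
$sHtml = StringRegExpReplace($sHtml, '(?is)<noscript\b.*?</noscript\s*>', '')

Local $hOut = FileOpen($sOut, 2) ; 2 = open for writing and erase previous contents
If $hOut = -1 Then Exit 2
FileWrite($hOut, $sHtml)
FileClose($hOut)
```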

If you want to view the original page, it was saved from here:

http://www.oo.com.au/Prima_72_Bottle_Wine_Cooler_-__P4022.cfm

Thanks in advance for any help.


#4 ·  Posted (edited)

Download doesn't work.

_________________________________________________________________________

This file is currently set to private.

When a file is set to private by its owner only the owner of the file can access it. If you are the owner of the file please log into your account to access this file.

If you believe you have reached this page in error, please contact support.

Click here to view our help resources

Edited by Xenobiologist


#5 ·  Posted (edited)

Sorry, I thought you would still be able to download them. I will fix that.

EDIT

Ok, I fixed that. Thank you for letting me know.

Edited by Mikeman27294


How do you determine it manually? All I can see is that it starts at id="product" and ends at

</div>

</div>

</div>

</div>

Everything in between is what you extracted.



I found an object called HTMLFILE. It is similar to Shell.Explorer.2 (used in _IECreateEmbedded), but HTMLFILE only works with the source code and does not display anything the way Shell.Explorer.2 does.

How about the code below, Mikeman:

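; Usage: returns the outer HTML of the div, or '' with @error set on failure.
_HTMLGetDIVCode('html code here', 'id of div here')

Func _HTMLGetDIVCode($html_code, $div_id)
    ; HTMLFILE parses the markup in memory without showing a browser window.
    Local $o_htmlfile = ObjCreate('HTMLFILE')

    If Not IsObj($o_htmlfile) Then Return SetError(-1, 0, '')

    $o_htmlfile.open()
    $o_htmlfile.write($html_code)
    $o_htmlfile.close()

    Local $div = $o_htmlfile.getElementByID($div_id)

    ; Use a non-zero @error so the caller can tell that the div was not found.
    If Not IsObj($div) Then Return SetError(-2, 0, '')

    Return $div.outerHTML
EndFunc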


thanlankon,

Did you find any MS doc for "HTMLfile"?

I had a VBS routine using the IE parsing engine and this object to extract just what the OP is looking for, however, I have not been able to successfully translate it to AutoIT...finally gave up and used SRE for parsing.

kylomas



"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill


#9 ·  Posted (edited)


What I was doing was this: I had a do while loop and a variable that was incremented each time an opening div tag was detected and decremented each time a closing div tag was detected. This was meant to return the number of lines in the div (the starting line is known), so that I could then read each line and write it to the file, but it just returned 0 :S


That looks good. I will place it in my code and see how well it works, thanks :D

Edited by Mikeman27294


Yep, that code works. I will paste it in a spoiler in my first post for future reference. Thank you, thanlankon.



Actually no, kylomas. I think the methods and properties of HTMLFILE are like those of the "document" object in JavaScript (of course there are some differences, but I'm not sure what they are). I don't know much about VBS, but can you post your VBS code here? Maybe I can help you translate it.


Not at all, Mikeman.


Ok, so I was using that code, and it was working quite well, until I found that it was stripping the " characters from div IDs. Are there any workarounds for that? I don't understand what the problem is, but I know that the " characters are there before the string enters that function and they aren't there when the string leaves it :S


I'm not entirely sure what data you are after. I suggest mapping the opening and closing tag positions using StringInStr with the occurrence parameter. Then you can grab whatever characters you wish between any two tags. Things like this can also be done using regular expressions.


Basically, I want to go onto an online webstore, and get data about the product being sold.

I have tried using StringInStr, but that didn't work properly (not a clue why, I use that function all the time :S). I also don't really know how to do RegEx, so for the time being I will stay away from it until I have a bit of free time to learn it.


Here's a simple way to get starting positions of the opening divs. The code is just an example of how you might go about solving this problem. To make it work you will have to make some modifications. I hope it gives you some ideas.

Local $html = "<div><div><div>Hello World</div></div></div>"

Local $iDivCount = 1, $sDelimStr = ""
While StringInStr ( $html, "<div" , 0 , $iDivCount)
    $sDelimStr &= StringInStr( $html, "<div" , 0 , $iDivCount) & ","
    $iDivCount += 1
WEnd
MsgBox(0, "Starting positions", StringTrimRight($sDelimStr, 1))


Earlier, the version I had a go at basically read each line individually, and if '<div' was in it, it would increment a variable. Then, if it found '</div>', it would do the opposite.

What I would really need is the beginning, and the ends of each one.

I might have another crack at it soon.


I don't see any other way than adding and subtracting increments. But my suggestion is to simplify the process by first just looking for the positions of the tags. Once you know this you can sort them in order and select between them more easily.
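
A minimal sketch of that add-and-subtract idea, working on the whole string rather than line by line, might look like this. The function name _ExtractDivById is made up for illustration, and the sketch assumes the id is written as id="..." with double quotes, that the id belongs to a div, and that the div tags are balanced.

```autoit
; Hypothetical helper: returns the whole <div ...>...</div> block for an id,
; or '' with @error set if the block cannot be located.
Func _ExtractDivById($sHtml, $sId)
    ; Find the id attribute, then back up to the nearest "<div" before it.
    Local $iAttr = StringInStr($sHtml, 'id="' & $sId & '"')
    If $iAttr = 0 Then Return SetError(1, 0, '')
    Local $iStart = StringInStr(StringLeft($sHtml, $iAttr), '<div', 0, -1)
    If $iStart = 0 Then Return SetError(2, 0, '')

    Local $iDepth = 0, $iPos = $iStart, $iOpen, $iClose, $iEnd
    While 1
        ; Next opening and next closing div tag at or after the current position.
        $iOpen = StringInStr($sHtml, '<div', 0, 1, $iPos)
        $iClose = StringInStr($sHtml, '</div', 0, 1, $iPos)
        If $iClose = 0 Then Return SetError(3, 0, '') ; unbalanced markup

        If $iOpen > 0 And $iOpen < $iClose Then
            ; An opening tag comes first: go one level deeper.
            $iDepth += 1
            $iPos = $iOpen + 4
        Else
            ; A closing tag comes first: come back up one level.
            $iDepth -= 1
            $iPos = $iClose + 5
            If $iDepth = 0 Then
                ; This closing tag matches the starting div; include its ">".
                $iEnd = StringInStr($sHtml, '>', 0, 1, $iPos)
                If $iEnd = 0 Then Return SetError(3, 0, '')
                Return StringMid($sHtml, $iStart, $iEnd - $iStart + 1)
            EndIf
        EndIf
    WEnd
EndFunc

; Made-up sample markup for a quick check.
Local $sSample = '<div id="outer"><div id="product"><div>a</div></div></div>'
ConsoleWrite(_ExtractDivById($sSample, 'product') & @CRLF)
```

Because the counter starts at the target div itself, the function stops at the first closing tag that brings the depth back to zero, no matter how many child divs are nested inside.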


Something like this might work if you're after the specific info.

Add your own divs afterwards if you need them.

Local $url = 'http://www.oo.com.au/Prima_72_Bottle_Wine_Cooler_-__P4022.cfm'
Local $htm = BinaryToString(InetRead($url, 1)) ; 1 = force a reload from the remote site
;
Local $s = ''
$s &= GetPageInfo($htm, 'inner">' & @CRLF & '<h1>', '</h1>') & @CRLF
$s &= 'Price: ' & GetPageInfo($htm, 'Price: <strong>', '</strong>') & @CRLF
$s &= GetPageInfo($htm, 'rrp"><strong>', '</strong>') & @CRLF
MsgBox(0, '', $s)
;
Exit
;
Func GetPageInfo($htm, $tag1, $tag2)
    ; (?i) ignores case, (?s) lets "." span line breaks, (.*?) captures the shortest text between the tags.
    Local $a = StringRegExp($htm, '(?i)(?s)' & $tag1 & '(.*?)' & $tag2, 3)
    If IsArray($a) Then Return $a[0]
    Return '' ; no match found
EndFunc
;
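
One caveat with this approach: the two delimiters are pasted straight into the pattern, so characters such as $, ( or ? inside them would be read as regex syntax. A possible variant, sketched below, quotes both delimiters with \Q...\E so they are matched literally. The name _GetBetween and the sample string are made up for illustration, and the sketch assumes the delimiters never contain \E themselves.

```autoit
; Hypothetical variant of GetPageInfo that treats both delimiters as literal text.
Func _GetBetween($sText, $sBefore, $sAfter)
    ; \Q...\E turns off regex syntax for everything in between.
    Local $aMatch = StringRegExp($sText, '(?is)\Q' & $sBefore & '\E(.*?)\Q' & $sAfter & '\E', 3)
    If @error Then Return SetError(1, 0, '') ; no match
    Return $aMatch[0]
EndFunc

; Made-up sample text; "$" and "(" would otherwise be interpreted by the regex engine.
Local $sSample = '<p>Price: $9.99 (inc. GST)</p>'
ConsoleWrite(_GetBetween($sSample, 'Price: $', ' (inc') & @CRLF) ; prints 9.99
```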

"The mediocre teacher tells. The Good teacher explains. The superior teacher demonstrates. The great teacher inspires." -William Arthur Ward



That happens because my code uses HTMLFILE, and the HTMLFILE object automatically removes the " characters that wrap the values of some tag attributes, such as ID. But I wonder what problem this causes. Although the " characters are removed, you can still parse the code normally, so I think it does not matter much.



Ok, thanks.
