[Question] Cutting Div content from html page

lilx

Hello,

I have the following question: I am trying to extract the content of a specific div. I know there is a function for this, _IEGetObjById($oIE, "divID"),

but I end up with too much garbage, because the div that has an id contains several child divs that only have a class. So I wondered: is it possible to look an element up by class name instead of by ID?

jdelaney

Provide a snippet of the HTML and I can show you how to use the IEbyXPATH function from my signature... or you can use DOM methods to loop through the children of the parent returned by: _IEGetObjById($oIE, "divID")

http://www.w3schools.com/htmldom/dom_methods.asp


IEbyXPATH-Grab IE DOM objects by XPATH IEscriptRecord-Makings of an IE script recorder ExcelFromXML-Create Excel docs without excel installed GetAllWindowControls-Output all control data on a given window.

DaleHohm

You can use _IETagNameGetCollection with the "div" tag name, starting from the DIV you already have, and use the 0-based index to pick the nested DIV you want. Or you can omit the index, loop through the DIV collection that is returned, and look at attributes like className for a match.
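A minimal sketch of the second approach (loop and compare className), not a definitive implementation. It assumes a version of IE.au3 in which _IETagNameGetCollection accepts a DOM element as its first argument, as described above. The URL is a placeholder; the id "rt-main" and the class "item-page" are taken from the HTML posted later in this thread.

```autoit
#include <IE.au3>

; Placeholder URL (assumption): replace it with the page you want to read
Local $oIE = _IECreate("http://www.example.com/")

; Start from the DIV that has an id
Local $oParent = _IEGetObjById($oIE, "rt-main")
If @error Then
    ConsoleWrite("! Parent DIV not found" & @CRLF)
    _IEQuit($oIE)
    Exit 1
EndIf

; No index given, so the whole collection of nested DIVs is returned
Local $oDivs = _IETagNameGetCollection($oParent, "div")

For $oDiv In $oDivs
    ; "=" compares case-insensitively; the whole class attribute must equal "item-page"
    If $oDiv.className = "item-page" Then
        ConsoleWrite($oDiv.innerText & @CRLF)
        ExitLoop
    EndIf
Next

_IEQuit($oIE)
```

If the DIV carries several classes (for example "rt-grid-8 rt-alpha"), an exact comparison will not match; testing with StringInStr($oDiv.className, "rt-alpha") is one possible alternative.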

Dale

Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y Doesn't work needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble

lilx

Hi, here is the part of the HTML page I am interested in.

<div id="rt-main" class="mb8-sa4">
<div class="rt-mainsection">
     <div class="rt-mainrow">
                                             <div class="rt-grid-8 rt-alpha">
                                                                     <div class="rt-block component-block">
             <div class="component-content">
                 <div class="item-page">
<h2>
<a href="/nokia/mock-up-windows-rt-tablet-van-nokia-duikt-op">
[b]Mock-up Windows RT tablet van Nokia duikt op[/b]</a>
</h2>

<dl class="article-info">
<dt class="article-info-term">Gegevens</dt>
[b]<dd class="category-name">
Categorie: <a href="/nokia">Nokia</a> </dd>
<dd class="published">
Gepubliceerd op woensdag, 02 januari 2013 </dd>[/b]
</dl>

[b]<p><img src="/images/Nokia/Nokia-Windows-RT.jpg" border="0" alt="Nokia Windows RT Mockup" style="border: 0px; display: block; margin-left: auto; margin-right: auto;" /></p>………………..[/b]

This site provides the latest news on smartphones. I want to be able to extract the data of a post so I can offer it as a news feed in a project I'm working on; for the record, the source will be credited.

The information the program needs to produce is in the parts of the code marked bold ([b]...[/b]). Do you have any suggestions on how to accomplish this?

lilx

bump..

kylomas

lilx,

Kludgy, non-IE solution (FWIW)

#include <array.au3>

local $str = '<div id="rt-main" class="mb8-sa4"> ' & _
'<div class="rt-mainsection">' & _
     '<div class="rt-mainrow">' & _
                                             '<div class="rt-grid-8 rt-alpha">' & _
                                                                     '<div class="rt-block component-block">' & _
             '<div class="component-content">' & _
                 '<div class="item-page">' & _
'<h2>' & _
'<a href="/nokia/mock-up-windows-rt-tablet-van-nokia-duikt-op">' & _
'[b]Mock-up Windows RT tablet van Nokia duikt op[/b]</a>' & _
'</h2>' & _
'<dl class="article-info">' & _
'<dt class="article-info-term">Gegevens</dt>' & _
'[b]<dd class="category-name">' & _
'Categorie: <a href="/nokia">Nokia</a> </dd>' & _
'<dd class="published">' & _
'Gepubliceerd op woensdag, 02 januari 2013 </dd>[/b]' & _
'</dl>' & _
'[b]<p><img src="/images/Nokia/Nokia-Windows-RT.jpg" border="0" alt="Nokia Windows RT Mockup" style="border: 0px; display: block; margin-left: auto; margin-right: auto;" /></p>………………..[/b]'

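; Strip the BBCode bold markers ([b] and [/b]) that were only added for the forum post
$str = StringRegExpReplace($str, '\[/?b\]', '')
; Flag 3 returns an array of all matches: the text between each '>' and the next '<'
Local $ret = StringRegExp($str, '>(.*?)<', 3)

; Walk backwards so that deleting an element does not shift the indexes still to be checked
For $i = UBound($ret) - 1 To 0 Step -1
    If StringLen(StringStripWS($ret[$i], 3)) = 0 Then _ArrayDelete($ret, $i)
Next

_ArrayDisplay($ret)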

kylomas


Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

kylomas

lilx,

This might give you some ideas for an IE solution.

#include <array.au3>
#include <ie.au3>

local $str = '<div id="rt-main" class="mb8-sa4"> ' & _
'<div class="rt-mainsection">' & _
     '<div class="rt-mainrow">' & _
                                             '<div class="rt-grid-8 rt-alpha">' & _
                                                                     '<div class="rt-block component-block">' & _
             '<div class="component-content">' & _
                 '<div class="item-page">' & _
'<h2>' & _
'<a href="/nokia/mock-up-windows-rt-tablet-van-nokia-duikt-op">' & _
'[b]Mock-up Windows RT tablet van Nokia duikt op[/b]</a>' & _
'</h2>' & _
'<dl class="article-info">' & _
'<dt class="article-info-term">Gegevens</dt>' & _
'[b]<dd class="category-name">' & _
'Categorie: <a href="/nokia">Nokia</a> </dd>' & _
'<dd class="published">' & _
'Gepubliceerd op woensdag, 02 januari 2013 </dd>[/b]' & _
'</dl>' & _
'[b]<p><img src="/images/Nokia/Nokia-Windows-RT.jpg" border="0" alt="Nokia Windows RT Mockup" style="border: 0px; display: block; margin-left: auto; margin-right: auto;" /></p>………………..[/b]'

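; Build an in-memory HTML document; no browser window is needed
Local $ohtml = ObjCreate('HTMLFILE')

If Not IsObj($ohtml) Then
    ConsoleWrite('! Could not create the HTMLFILE object' & @LF)
    Exit 1
EndIf

$ohtml.open()
$ohtml.write($str)
$ohtml.close()

; Collect every DIV in the document
Local $odivs = _IETagNameGetCollection($ohtml, 'div')

If Not IsObj($odivs) Then
    ConsoleWrite('! No DIV collection returned' & @LF)
    Exit 2
EndIf

; Print the identifying attributes and the text of each DIV
For $odiv In $odivs
    ConsoleWrite('!  Classname = ' & $odiv.classname & '   title = ' & $odiv.title & '   id = ' & $odiv.id & @LF)
    ConsoleWrite('>' & @TAB & @TAB & $odiv.innertext & @LF)
Next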

kylomas


DaleHohm

Your answer is in the reply I left for you. If you don't understand it, do some reading and ask questions, but I wouldn't suggest you just ignore it, bump and hope someone writes it for you.

Dale


lilx

Hi guys,

Sorry for the late response, but indeed I found my answer with _IETagNameGetCollection.

I bumped the topic because I couldn't get it to work the first time with _IETagNameGetCollection and other functions, but now I have got it working. Thank you for the tip.

Here is what I have so far:

$divs = _IETagNameGetCollection($oIE, "div")

For $div In $divs
    Local $Title, $ImgSource, $Publised
    If $div.className == "item-page" Then
        $htmlcontent = String($div.innerHTML)

        ; _StringBetween returns an array of matches (needs #include <String.au3>)
        $Title = _StringBetween($htmlcontent, '">', '')

        $ImgSource = _StringBetween($htmlcontent, 'src="', '"')

        $rawPublised = _StringBetween($htmlcontent, '<dd class="published">', '</dd>')
        ; Pass the first match, not the whole array, to the formatter
        If IsArray($rawPublised) Then _formatdate($rawPublised[0])
    EndIf
Next

_IEQuit($oIE)

Func _formatdate($sString) ; " Gepubliceerd op woensdag, 02 januari 2013 "
    $day = StringRegExp($sString, '([0-9]{2})', 1)
    $year = StringRegExp($sString, '([0-9]{4})', 1)
    $month = StringRegExp($sString, '([a-z]{3-9})(:? [0-9]{4})', 1) ; the pattern asked about below
    ConsoleWrite($day[0] & ' ' & $month[0] & ' ' & $year[0])
EndFunc

I now have a question about StringRegExp; I hope you can help me with it.

As you can see, $sString contains the value "Gepubliceerd op woensdag, 02 januari 2013", and I was able to extract the 02 and 2013 values.

But now I am trying to extract the month, and my pattern isn't working. My idea was to treat the string as "02 ", then the part to extract, then " 2013". Any ideas?

lilx

Never mind, I found what I was looking for: StringRegExp($sString, '([a-z]{3,9})(?: ([0-9]{4}))', 1).

I think I can finish it from here.

Thank you guys for the tips.

Topic can be closed
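For readers who land here with the same problem, the difference between the two patterns can be sketched as follows. This is an illustration under the assumption that AutoIt's PCRE engine treats an invalid quantifier such as {3-9} as literal text; the variable names $aBroken and $aMonth are made up for the example.

```autoit
Local $sString = " Gepubliceerd op woensdag, 02 januari 2013 "

; First attempt: {3-9} is not a valid quantifier (a range needs a comma),
; and (:? ...) is a capturing group that starts with an optional ':'
Local $aBroken = StringRegExp($sString, '([a-z]{3-9})(:? [0-9]{4})', 1)
If Not IsArray($aBroken) Then ConsoleWrite("first pattern: no match" & @CRLF)

; Working version: {3,9} means 3 to 9 lowercase letters,
; and (?: ...) groups the space and the year without capturing the space
Local $aMonth = StringRegExp($sString, '([a-z]{3,9})(?: ([0-9]{4}))', 1)
If IsArray($aMonth) Then
    ; Element 0 is the first capture group (month), element 1 the second (year)
    ConsoleWrite("month = " & $aMonth[0] & ", year = " & $aMonth[1] & @CRLF)
EndIf
```

Because [a-z] is case-sensitive by default, a capitalised month name would need [A-Za-z] or the (?i) option.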
