lilx

[Question] Cutting Div content from html page

Hello,

I have the following question: I am trying to extract the content of a specific div. I know there is a function, _IEGetObjById($oIE, "divID"),

but I end up with too much garbage, because the div that has the ID contains several child divs that only have a class. So I wondered: is it possible to look up an element by class name instead of by ID?


Provide a snippet of the HTML, and I can show you how to use my signature... Or you can use DOM methods to loop through the children of the parent returned by: _IEGetObjById($oIE, "divID")

http://www.w3schools.com/htmldom/dom_methods.asp


IEbyXPATH-Grab IE DOM objects by XPATH IEscriptRecord-Makings of an IE script recorder ExcelFromXML-Create Excel docs without excel installed GetAllWindowControls-Output all control data on a given window.


You can use _IETagNameGetCollection for DIV, starting from the DIV you already have, and use the 0-based index to specify the nested DIV you want. Or you can loop through the DIV collection returned without the index and check attributes like className for a match.
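
A minimal sketch of the second approach (looping and matching on className) might look like the following. The URL is a placeholder, "divID" is the id from the question, and "targetClass" is a hypothetical class name; none of these come from a real page.

```autoit
#include <IE.au3>

; Assumptions: placeholder URL, "divID" from the question, hypothetical class "targetClass"
Local $oIE = _IECreate("http://www.example.com/")

; Start from the div that has an id ...
Local $oParent = _IEGetObjById($oIE, "divID")
If @error Then
    _IEQuit($oIE)
    Exit 1
EndIf

; ... and collect only the divs nested inside it (no index = whole collection)
Local $oDivs = _IETagNameGetCollection($oParent, "div")

For $oDiv In $oDivs
    ; className holds the value of the element's class attribute
    If $oDiv.className = "targetClass" Then
        ConsoleWrite($oDiv.innerText & @CRLF)
        ExitLoop
    EndIf
Next

_IEQuit($oIE)
```

If the position of the nested div is known and stable, passing a 0-based index as the third parameter, e.g. _IETagNameGetCollection($oParent, "div", 0), returns that single div instead of a collection.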

Dale

Edited by DaleHohm

Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y

"Doesn't work" needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble


Hi, here is the part of the HTML page I am interested in.

<div id="rt-main" class="mb8-sa4">
<div class="rt-mainsection">
     <div class="rt-mainrow">
                                             <div class="rt-grid-8 rt-alpha">
                                                                     <div class="rt-block component-block">
             <div class="component-content">
                 <div class="item-page">
<h2>
<a href="/nokia/mock-up-windows-rt-tablet-van-nokia-duikt-op">
[b]Mock-up Windows RT tablet van Nokia duikt op[/b]</a>
</h2>

<dl class="article-info">
<dt class="article-info-term">Gegevens</dt>
[b]<dd class="category-name">
Categorie: <a href="/nokia">Nokia</a> </dd>
<dd class="published">
Gepubliceerd op woensdag, 02 januari 2013 </dd>[/b]
</dl>

[b]<p><img src="/images/Nokia/Nokia-Windows-RT.jpg" border="0" alt="Nokia Windows RT Mockup" style="border: 0px; display: block; margin-left: auto; margin-right: auto;" /></p>………………..[/b]

This site provides the latest news on smartphones. I want to be able to extract the data of a post to provide it as a news feed in a project I'm working on; for the record, the source will be credited.

The information the program needs to produce is in the code marked in bold (between the [b] and [/b] markers). Do you have any suggestions on how to accomplish this?


lilx,

Kludgy, non-IE solution (FWIW)

#include <array.au3>

local $str = '<div id="rt-main" class="mb8-sa4"> ' & _
'<div class="rt-mainsection">' & _
     '<div class="rt-mainrow">' & _
                                             '<div class="rt-grid-8 rt-alpha">' & _
                                                                     '<div class="rt-block component-block">' & _
             '<div class="component-content">' & _
                 '<div class="item-page">' & _
'<h2>' & _
'<a href="/nokia/mock-up-windows-rt-tablet-van-nokia-duikt-op">' & _
'[b]Mock-up Windows RT tablet van Nokia duikt op[/b]</a>' & _
'</h2>' & _
'<dl class="article-info">' & _
'<dt class="article-info-term">Gegevens</dt>' & _
'[b]<dd class="category-name">' & _
'Categorie: <a href="/nokia">Nokia</a> </dd>' & _
'<dd class="published">' & _
'Gepubliceerd op woensdag, 02 januari 2013 </dd>[/b]' & _
'</dl>' & _
'[b]<p><img src="/images/Nokia/Nokia-Windows-RT.jpg" border="0" alt="Nokia Windows RT Mockup" style="border: 0px; display: block; margin-left: auto; margin-right: auto;" /></p>………………..[/b]'

; Strip the forum's [b] and [/b] markers, then capture the text between tags
$str = StringRegExpReplace($str, '\[/?b\]', '')
Local $ret = StringRegExp($str, '>(.*?)<', 3)

; Walk backwards so deleting an element does not shift the ones still to be checked
For $i = UBound($ret) - 1 To 0 Step -1
    If StringLen(StringStripWS($ret[$i], 3)) = 0 Then _ArrayDelete($ret, $i)
Next

_ArrayDisplay($ret)

kylomas


Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill


lilx,

This might give you some ideas for an IE solution.

#include <array.au3>
#include <ie.au3>

local $str = '<div id="rt-main" class="mb8-sa4"> ' & _
'<div class="rt-mainsection">' & _
     '<div class="rt-mainrow">' & _
                                             '<div class="rt-grid-8 rt-alpha">' & _
                                                                     '<div class="rt-block component-block">' & _
             '<div class="component-content">' & _
                 '<div class="item-page">' & _
'<h2>' & _
'<a href="/nokia/mock-up-windows-rt-tablet-van-nokia-duikt-op">' & _
'[b]Mock-up Windows RT tablet van Nokia duikt op[/b]</a>' & _
'</h2>' & _
'<dl class="article-info">' & _
'<dt class="article-info-term">Gegevens</dt>' & _
'[b]<dd class="category-name">' & _
'Categorie: <a href="/nokia">Nokia</a> </dd>' & _
'<dd class="published">' & _
'Gepubliceerd op woensdag, 02 januari 2013 </dd>[/b]' & _
'</dl>' & _
'[b]<p><img src="/images/Nokia/Nokia-Windows-RT.jpg" border="0" alt="Nokia Windows RT Mockup" style="border: 0px; display: block; margin-left: auto; margin-right: auto;" /></p>………………..[/b]'

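; Build an in-memory HTML document from the string (no browser window needed)
Local $ohtml = ObjCreate('HTMLFILE')
If Not IsObj($ohtml) Then Exit 1

$ohtml.open()
$ohtml.write($str)
$ohtml.close()

; Collect every div and stop if the collection could not be created
Local $odivs = _IETagNameGetCollection($ohtml, 'div')
If Not IsObj($odivs) Then Exit 2

; Print each div's class, title, id and text so the wanted one can be picked by class name
For $odiv In $odivs
    ConsoleWrite('!  Classname = ' & $odiv.className & '   title = ' & $odiv.title & '   id = ' & $odiv.id & @LF)
    ConsoleWrite('>' & @TAB & @TAB & $odiv.innerText & @LF)
Next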

kylomas



Your answer is in the reply I left for you. If you don't understand it, do some reading and ask questions, but I wouldn't suggest you just ignore it, bump the topic, and hope someone writes it for you.

Dale



Hi guys,

Sorry for the late response, but I did indeed find my answer with _IETagNameGetCollection.

I bumped the topic because I couldn't get it to work the first time with _IETagNameGetCollection and other commands, but now I have it working. Thank you for the tip.

Here is what I have so far:

$divs = _IETagNameGetCollection($oIE, "div")

For $div In $divs
    Local $Title, $ImgSource, $Publised
    If $div.className == "item-page" Then
        $htmlcontent = String($div.innerHTML)

        $Title = _StringBetween($htmlcontent, '">', '')

        $ImgSource = _StringBetween($htmlcontent, 'src="', '"')

        ; _StringBetween returns an array, so pass its first element on
        $rawPublised = _StringBetween($htmlcontent, '<dd class="published">', '</dd>')
        _formatdate($rawPublised[0])
    EndIf
Next

_IEQuit($oIE)

Func _formatdate($sString) ; " Gepubliceerd op woensdag, 02 januari 2013 "
    $day = StringRegExp($sString, '([0-9]{2})', 1)
    $year = StringRegExp($sString, '([0-9]{4})', 1)
    $month = StringRegExp($sString, '([a-z]{3-9})(:? [0-9]{4})', 1)
    ConsoleWrite($day[0] & ' ' & $month[0] & ' ' & $year[0])
EndFunc

I now have a question about StringRegExp; I hope you can help me with it.

As you can see, $sString contains the value "Gepubliceerd op woensdag, 02 januari 2013", and I was able to extract the 02 and 2013 values.

But now I am trying to extract the month, and my pattern isn't working. My idea was to treat the string as "02 ", then the part to extract, then " 2013". Any ideas?


Never mind, I found what I was looking for: StringRegExp($sString, '([a-z]{3,9})(?: ([0-9]{4}))', 1).

I think I can finish it from here.

Thank you for the tips, guys.

The topic can be closed.
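
For anyone comparing the two patterns: the working one differs from the first attempt in two places. {3-9} is not a valid quantifier (a range is written {3,9}), and (:? ...) is an ordinary capturing group that expects an optional colon, whereas (?: ...) is a non-capturing group. A small stand-alone sketch, using the sample date string from the thread and variable names chosen here only for illustration:

```autoit
; Sample value taken from the comment in _formatdate above
Local $sString = "Gepubliceerd op woensdag, 02 januari 2013"

; [a-z]{3,9} matches a lowercase word of 3 to 9 letters; the non-capturing
; group requires a space and a four-digit year directly after it
Local $aMatch = StringRegExp($sString, '([a-z]{3,9})(?: ([0-9]{4}))', 1)

If @error Then
    ConsoleWrite("No match" & @CRLF)
Else
    ; With flag 1 the array holds the captured groups of the first match:
    ; element 0 is the month name, element 1 is the year captured inside the group
    ConsoleWrite("Month: " & $aMatch[0] & @CRLF)
    ConsoleWrite("Year:  " & $aMatch[1] & @CRLF)
EndIf
```

Checking @error before indexing the result avoids a crash when a date string does not match the pattern.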
