Warsaw, posted February 7, 2012 (edited)

This is my first post here, but I have been using AutoIt for a while at work. This problem is personal, however. I am trying to pull the posts that have been made from an account. I have logged into the account and am trying to use a script to extract the list of posts; the plan is then to pull up each post and extract its contents. I have made a start by adapting an existing script. Here is what I have tried working with:

    #include <IE.au3>
    #include <INet.au3>
    #include <Array.au3>

    Global $oIE, $asURL_Listings[1][2], $asURL_Parse, $id = 0, $sIndex, $sURL_Source

    $oIE = _IECreate("https://accounts.craigslist.org/login?filter_active=active&filter_cat=0&show_tab=postings")
    $sURL_Source = _IEDocReadHTML($oIE)
    $asURL_Parse = StringRegExp($sURL_Source, '<a href="https://post.craigslist.org/manage/(.*?)</a>', 3)
    If @error Then
        SetError(1)
        MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended: " & @extended)
    EndIf
    For $i = 0 To UBound($asURL_Parse) - 1
        ReDim $asURL_Listings[UBound($asURL_Listings) + 1][2]
        $asSplit = StringSplit($asURL_Parse[$i], '">', 3)
        $asURL_Listings[$i + 1][0] = $asSplit[0]
        $asURL_Listings[$i + 1][1] = $asSplit[1]
    Next
    $asURL_Listings[0][0] = (UBound($asURL_Listings) - 1)
    _ArrayDisplay($asURL_Listings)

It seems that the StringRegExp call is erroring out, but I'm not sure why. Anyone have any ideas?
Guest, posted February 7, 2012 (edited)

The problem is that when StringRegExp searches the HTML source, it can't find the tag you are looking for because it does not exist in the HTML source; therefore StringRegExp returns an error.

After I saw your HTML source, I tested with the following code, which runs the same pattern against a literal string that does contain the tag:

    #include <IE.au3>
    #include <INet.au3>
    #include <Array.au3>

    Global $oIE, $asURL_Listings[1][2], $asURL_Parse, $id = 0, $sIndex, $sURL_Source

    $oIE = _IECreate("https://accounts.craigslist.org/login?filter_active=active&filter_cat=0&show_tab=postings")
    $sURL_Source = _IEDocReadHTML($oIE)
    ConsoleWrite($sURL_Source)
    $asURL_Parse = StringRegExp('<a href="https://post.craigslist.org/manage/">4444</a>', '<a href="https://post.craigslist.org/manage/(.*?)</a>', 3)
    If @error Then
        SetError(1)
        MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended: " & @extended)
    EndIf
    For $i = 0 To UBound($asURL_Parse) - 1
        ReDim $asURL_Listings[UBound($asURL_Listings) + 1][2]
        $asSplit = StringSplit($asURL_Parse[$i], '">', 3)
        $asURL_Listings[$i + 1][0] = $asSplit[0]
        $asURL_Listings[$i + 1][1] = $asSplit[1]
    Next
    $asURL_Listings[0][0] = (UBound($asURL_Listings) - 1)
    _ArrayDisplay($asURL_Listings)

My advice to you is: just make sure the tag exists.
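One way to tell "the tag is not there" apart from "the pattern itself is broken" is to look at the value StringRegExp leaves in @error. The following is a minimal sketch, not the poster's script: `$sHtml` is a made-up stand-in for the page source, chosen so that the tag differs from the pattern only in letter case. It relies on the documented StringRegExp behaviour that @error is 1 when nothing matches and 2 when the pattern is invalid.

```autoit
; Hypothetical stand-in for the text returned by _IEDocReadHTML.
Local $sHtml = '<A href="https://post.craigslist.org/manage/123">Title</A>'

Local $aMatches = StringRegExp($sHtml, '<a href="https://post.craigslist.org/manage/(.*?)</a>', 3)
; Save both values immediately; the next function call resets them.
Local $iErr = @error, $iExt = @extended

Switch $iErr
    Case 0
        ConsoleWrite("Matches found: " & UBound($aMatches) & @CRLF)
    Case 1
        ConsoleWrite("No match: the pattern is valid, but the text was not found." & @CRLF)
    Case 2
        ConsoleWrite("Bad pattern, error at offset " & $iExt & @CRLF)
EndSwitch
```

With this sample the `Case 1` branch should run, because the pattern is matched case-sensitively by default. Copying @error into a variable before doing anything else avoids losing it to a later call such as SetError or MsgBox.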
Warsaw (author), posted February 8, 2012 (edited)

OK, I figured out my original problem. I didn't think about the search being case sensitive. That part is now working for me, but now I am on to the second part and the new RegEx is erroring out again. Here's my code:

    #include <IE.au3>
    #include <INet.au3>
    #include <Array.au3>

    Global $oIE, $asURL_Listings[1][2], $asURL_Parse, $id = 0, $sIndex, $sURL_Source, $sPOST_Source, $asPOST_Parse, $PostListing = ""

    $oIE = _IECreate("https://accounts.craigslist.org/login?filter_active=active&filter_cat=0&show_tab=postings")
    $sURL_Source = _IEDocReadHTML($oIE)
    $asURL_Parse = StringRegExp($sURL_Source, '<A href="https://post.craigslist.org/manage/(.*?)</A>', 3)
    If @error Then
        SetError(1)
        MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended: " & @extended)
    EndIf
    For $i = 0 To UBound($asURL_Parse) - 1
        ReDim $asURL_Listings[UBound($asURL_Listings) + 1][2]
        $asSplit = StringSplit($asURL_Parse[$i], '">', 3)
        $asURL_Listings[$i + 1][0] = $asSplit[0]
        $asURL_Listings[$i + 1][1] = $asSplit[1]
        _IENavigate($oIE, "https://post.craigslist.org/manage/" & $asSplit[0])
        $sPOST_Source = _IEDocReadHTML($oIE)
        $asPOST_Parse = StringRegExp($sPOST_Source, '</div><h2>(.*?)<ul class="blurbs">', 3)
        If @error Then
            SetError(1)
            MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended: " & @extended)
        EndIf
        $PostListing &= "<h2>" & $asPOST_Parse & "<br><br>"
    Next
    $asURL_Listings[0][0] = (UBound($asURL_Listings) - 1)
    FileWrite(@ScriptDir & "Test.html", $PostListing)

Here is an example of the source to search:

    <div class="posting">
    <div class="bchead">
    louisville craigslist > for sale / wanted > motorcycles/scooters - by owner
    </div><h2>This is the Title</h2>
    <hr>
    Date: 2012-02-05, 9:34PM EST<br>
    Reply to: see below
    <hr>
    <br>
    <div id="userbody">
    The Text Goes Here.345-6789<!-- START CLTAGS -->
    <br><br><ul class="blurbs">
    <li>it's NOT ok to contact this poster with services or other commercial interests</li></ul>
    <!-- END CLTAGS -->

Do I have to escape any of the characters in my RegEx search? Is it the multiple lines causing problems? Thanks for any help.
Guest, posted February 8, 2012 (edited)

Causes of the problems:

- The RegEx search only worked within a single line, because `.` does not match line breaks by default (fixed: `(?s)` lets the match span multiple lines).
- The RegEx search was case sensitive (fixed: `(?i)` turns case sensitivity off).
- You were using the array $asPOST_Parse as if it were a plain variable (fixed: now $asPOST_Parse[1]).

Try this:

    #include <IE.au3>
    #include <INet.au3>
    #include <Array.au3>

    Global $oIE, $asURL_Listings[1][2], $asURL_Parse, $id = 0, $sIndex, $sURL_Source, $sPOST_Source, $asPOST_Parse, $PostListing = ""

    $oIE = _IECreate("https://accounts.craigslist.org/login?filter_active=active&filter_cat=0&show_tab=postings")
    $sURL_Source = _IEDocReadHTML($oIE)
    $asURL_Parse = StringRegExp($sURL_Source, '(?i)<A href="https://post.craigslist.org/manage/(.*?)</A>', 3)
    If @error Then
        ; Report the values set by StringRegExp (1 = no match, 2 = bad pattern).
        MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended: " & @extended)
    EndIf
    For $i = 0 To UBound($asURL_Parse) - 1
        ReDim $asURL_Listings[UBound($asURL_Listings) + 1][2]
        $asSplit = StringSplit($asURL_Parse[$i], '">', 3)
        $asURL_Listings[$i + 1][0] = $asSplit[0]
        $asURL_Listings[$i + 1][1] = $asSplit[1]
        _IENavigate($oIE, "https://post.craigslist.org/manage/" & $asSplit[0])
        $sPOST_Source = _IEDocReadHTML($oIE)
        $asPOST_Parse = StringRegExp($sPOST_Source, '(?i)(?s)(?m)</div><h2>(.*?)(.*?)<ul class="blurbs">', 3)
        ; Check @error right away: any later function call (such as MsgBox) resets it.
        If @error Then
            MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended: " & @extended)
            ContinueLoop ; nothing to append for this post
        EndIf
        ; The first (.*?) matches nothing, so the post text ends up in element [1].
        MsgBox(0, "Text Found :)", "" & $asPOST_Parse[1])
        $PostListing &= "<h2>" & $asPOST_Parse[1] & "<br><br>"
    Next
    $asURL_Listings[0][0] = (UBound($asURL_Listings) - 1)
    ; @ScriptDir has no trailing backslash, so one is added before the file name.
    FileWrite(@ScriptDir & "\Test.html", $PostListing)

If you run into any more problems, you know where to ask.
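The effect of the `(?s)` option can be seen without a browser at all. The following is a minimal sketch under stated assumptions: `$sSample` is a shortened, hand-built version of the post source quoted earlier in the thread, and the pattern uses a single capture group (so the text is in element [0]) rather than the two groups used in the reply above.

```autoit
; Hypothetical sample, shortened from the post source quoted in the thread.
Local $sSample = '</div><h2>This is the Title</h2>' & @CRLF & _
        '<div id="userbody">' & @CRLF & _
        'The Text Goes Here.' & @CRLF & _
        '<br><br><ul class="blurbs">'

; Without (?s), "." does not cross line breaks, so no match is expected here.
StringRegExp($sSample, '</div><h2>(.*?)<ul class="blurbs">', 3)
Local $iErr = @error ; save before any other call resets it
ConsoleWrite("Without (?s): @error = " & $iErr & @CRLF)

; With (?s), "." also matches line breaks; (?i) ignores letter case.
Local $aPost = StringRegExp($sSample, '(?is)</div><h2>(.*?)<ul class="blurbs">', 3)
If @error Then
    ConsoleWrite("With (?s): no match" & @CRLF)
Else
    ; Flag 3 returns only the captured groups, so the text is in element [0].
    ConsoleWrite("With (?s): " & $aPost[0] & @CRLF)
EndIf
```

None of the characters in this pattern need escaping; `<`, `>`, `/`, `=` and `"` are ordinary characters in a regular expression. The only change the multi-line source requires is the `(?s)` option.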
Warsaw (author), posted February 8, 2012

I've built on this and now it does just what I want. Thanks so much. I've always had trouble with RegEx. Sometimes it just looks like gibberish.