Warsaw

Capturing Craigslist

5 posts in this topic

#1 ·  Posted (edited)

This is my first post here, but I have been using AutoIt at work for a while. This problem is personal, however. I am trying to pull the posts that were made from an account. I have logged into the account and am trying to use a script to extract the list of posts; the plan is then to pull up each post and extract its contents. I have gotten a start trying to adapt the script from

Here is what I have tried working with:

#include <IE.au3>
#include <INet.au3>
#include <Array.au3>
Global $oIE, $asURL_Listings[1][2], $asURL_Parse, $id = 0, $sIndex, $sURL_Source
$oIE = _IECreate("https://accounts.craigslist.org/login?filter_active=active&filter_cat=0&show_tab=postings")
$sURL_Source = _IEDocReadHTML($oIE)
$asURL_Parse = StringRegExp($sURL_Source, '<a href="https://post.craigslist.org/manage/(.*?)</a>', 3)
    If @error Then
        SetError(1)
  MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended: " & @extended)
    EndIf
    For $i = 0 To UBound($asURL_Parse) - 1
        ReDim $asURL_Listings[UBound($asURL_Listings) + 1][2]
        $asSplit = StringSplit($asURL_Parse[$i], '">', 3)
        $asURL_Listings[$i + 1][0] = $asSplit[0]
        $asURL_Listings[$i + 1][1] = $asSplit[1]
    Next
    $asURL_Listings[0][0] = (UBound($asURL_Listings) - 1)
_ArrayDisplay($asURL_Listings)

It seems that the StringRegExp is erroring out but I'm not sure why. Anyone have any ideas?

Edited by Warsaw

#2 ·  Posted (edited)

The problem is that when StringRegExp searches the HTML source, it can't find the tag you are looking for because that tag does not exist in the source in that form. With no match, StringRegExp sets @error.

I tested this with the following code after I saw your HTML source:

#include <IE.au3>
#include <INet.au3>
#include <Array.au3>
Global $oIE, $asURL_Listings[1][2], $asURL_Parse, $id = 0, $sIndex, $sURL_Source
$oIE = _IECreate("https://accounts.craigslist.org/login?filter_active=active&filter_cat=0&show_tab=postings")
$sURL_Source = _IEDocReadHTML($oIE)
; Dump the source IE actually returns so it can be compared with the pattern
ConsoleWrite($sURL_Source)
; Test the pattern against a known-good literal string instead of the live page
$asURL_Parse = StringRegExp('<a href="https://post.craigslist.org/manage/">4444</a>', '<a href="https://post.craigslist.org/manage/(.*?)</a>', 3)
If @error Then
    ; Report @error directly; calling SetError(1) first would overwrite the real value
    MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended: " & @extended)
EndIf
For $i = 0 To UBound($asURL_Parse) - 1
    ReDim $asURL_Listings[UBound($asURL_Listings) + 1][2]
    ; Flag 3: split on the whole delimiter string and return a 0-based array
    $asSplit = StringSplit($asURL_Parse[$i], '">', 3)
    $asURL_Listings[$i + 1][0] = $asSplit[0]
    $asURL_Listings[$i + 1][1] = $asSplit[1]
Next
$asURL_Listings[0][0] = (UBound($asURL_Listings) - 1)
_ArrayDisplay($asURL_Listings)

My advice to you is, just make sure the tag exists.
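
One way to check that, sketched below, is to run the pattern through a small helper that reports why StringRegExp failed. The helper name `_TryPattern` is hypothetical and not part of the script above. It relies on StringRegExp setting @error to 1 when nothing matches and to 2 when the pattern itself is invalid. The second call uses upper-case tags as one example of markup that looks the same but does not match.

```autoit
; Hypothetical helper (not part of the original script): run a pattern with
; flag 3 and report why it failed, if it did.
Func _TryPattern($sText, $sPattern)
    Local $aMatches = StringRegExp($sText, $sPattern, 3)
    ; Save both values before making any other function call
    Local $iError = @error, $iExtended = @extended
    Switch $iError
        Case 0
            ConsoleWrite("Captured strings: " & UBound($aMatches) & @CRLF)
        Case 1
            ConsoleWrite("No match for: " & $sPattern & @CRLF)
        Case 2
            ConsoleWrite("Bad pattern, error at offset " & $iExtended & @CRLF)
    EndSwitch
    Return SetError($iError, $iExtended, $aMatches)
EndFunc   ;==>_TryPattern

; The literal test string from this post, in lower case like the pattern
_TryPattern('<a href="https://post.craigslist.org/manage/">4444</a>', '<a href="https://post.craigslist.org/manage/(.*?)</a>')
; The same markup with upper-case tags; matching is case-sensitive by default
_TryPattern('<A href="https://post.craigslist.org/manage/">4444</A>', '<a href="https://post.craigslist.org/manage/(.*?)</a>')
```

Running the helper against the text written by ConsoleWrite($sURL_Source) should show quickly whether the page contains the tag in the form the pattern expects.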

Edited by Aipion

#3 ·  Posted (edited)

OK, I figured out my original problem. I didn't think about the search being case sensitive. That part is now working for me but now I am on to the second part and the new RegEx is erroring out again. Here's my code:

#include <IE.au3>
#include <INet.au3>
#include <Array.au3>
Global $oIE, $asURL_Listings[1][2], $asURL_Parse, $id = 0, $sIndex, $sURL_Source, $sPOST_Source, $asPOST_Parse, $PostListing = ""
$oIE = _IECreate("https://accounts.craigslist.org/login?filter_active=active&filter_cat=0&show_tab=postings")
$sURL_Source = _IEDocReadHTML($oIE)

$asURL_Parse = StringRegExp($sURL_Source, '<A href="https://post.craigslist.org/manage/(.*?)</A>', 3)
    If @error Then
        SetError(1)
  MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended: " & @extended)
    EndIf
    For $i = 0 To UBound($asURL_Parse) - 1
        ReDim $asURL_Listings[UBound($asURL_Listings) + 1][2]
        $asSplit = StringSplit($asURL_Parse[$i], '">', 3)
        $asURL_Listings[$i + 1][0] = $asSplit[0]
        $asURL_Listings[$i + 1][1] = $asSplit[1]
  _IENavigate($oIE, "https://post.craigslist.org/manage/" & $asSplit[0] )
  $sPOST_Source = _IEDocReadHTML($oIE)
  $asPOST_Parse = StringRegExp($sPOST_Source, '</div><h2>(.*?)<ul class="blurbs">', 3)
   If @error Then
    SetError(1)
    MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended: " & @extended)
   EndIf
  $PostListing &= "<h2>" & $asPOST_Parse & "<br><br>"
    Next
    $asURL_Listings[0][0] = (UBound($asURL_Listings) - 1)

FileWrite(@ScriptDir & "Test.html", $PostListing)

Here is an example of source to search:

<div class="posting">
<div class="bchead">
louisville craigslist &gt;  for sale / wanted &gt; motorcycles/scooters - by owner
</div><h2>This is the Title</h2>
<hr>
Date: 2012-02-05,  9:34PM EST<br>
Reply to: see below
<hr>
<br>
<div id="userbody">
The Text Goes Here.345-6789<!-- START CLTAGS -->

<br><br><ul class="blurbs">
<li>it's NOT ok to contact this poster with services or other commercial interests</li></ul>
<!-- END CLTAGS -->

Do I have to escape any of the characters in my RegEx search? Is it the multiple lines causing problems?

Thanks for any help.

Edited by Warsaw

#4 ·  Posted (edited)

Cause of problems:

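  • The RegEx was matching one line at a time, because "." does not match line breaks by default (Fixed: (?s) lets the match span multiple lines)
  • The RegEx search was case-sensitive (Fixed: (?i) turns case sensitivity off)
  • You were using the array $asPOST_Parse as if it were a plain variable (Fixed: now $asPOST_Parse[1])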
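
To see the first two points in isolation, here is a minimal sketch that runs against a hard-coded string instead of the live page. The sample text is an assumption: it is shortened from the source posted above and uses upper-case tags, as the page source apparently did. As far as I know, (?m) only changes how ^ and $ behave, so it should not matter for a pattern like this one.

```autoit
; Assumed sample, shortened from the posted source, with upper-case tags
Local $sSample = '</DIV><H2>This is the Title</H2>' & @CRLF & _
        '<HR>' & @CRLF & _
        'The Text Goes Here.' & @CRLF & _
        '<BR><BR><UL class="blurbs">'

; Case-insensitive only: "." still stops at line breaks, so no match is expected
Local $aNoDotAll = StringRegExp($sSample, '(?i)</div><h2>(.*?)<ul class="blurbs">', 3)
ConsoleWrite("(?i) only:  @error = " & @error & @CRLF)

; Case-insensitive plus (?s): "." also matches line breaks
Local $aDotAll = StringRegExp($sSample, '(?i)(?s)</div><h2>(.*?)<ul class="blurbs">', 3)
If @error Then
    ConsoleWrite("(?i)(?s):   @error = " & @error & @CRLF)
Else
    ; Flag 3 returns an array, so read an element rather than the array itself
    ConsoleWrite("(?i)(?s):   captured " & StringLen($aDotAll[0]) & " characters" & @CRLF)
EndIf
```

This sketch uses a single capture group, so the text is in element [0]. The script below uses two groups, which is why it reads element [1].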
______________________________________________

Try this:

#include <IE.au3>
#include <INet.au3>
#include <Array.au3>

Global $oIE, $asURL_Listings[1][2], $asURL_Parse, $id = 0, $sIndex, $sURL_Source, $sPOST_Source, $asPOST_Parse, $PostListing = ""
$oIE = _IECreate("https://accounts.craigslist.org/login?filter_active=active&filter_cat=0&show_tab=postings")
$sURL_Source = _IEDocReadHTML($oIE)
; (?i) makes the match case-insensitive, so <A ...> and <a ...> both match
$asURL_Parse = StringRegExp($sURL_Source, '(?i)<A href="https://post.craigslist.org/manage/(.*?)</A>', 3)
If @error Then
    ; Report @error directly; calling SetError(1) first would overwrite the real value
    MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended: " & @extended)
EndIf
For $i = 0 To UBound($asURL_Parse) - 1
    ReDim $asURL_Listings[UBound($asURL_Listings) + 1][2]
    $asSplit = StringSplit($asURL_Parse[$i], '">', 3)
    $asURL_Listings[$i + 1][0] = $asSplit[0]
    $asURL_Listings[$i + 1][1] = $asSplit[1]
    _IENavigate($oIE, "https://post.craigslist.org/manage/" & $asSplit[0])
    $sPOST_Source = _IEDocReadHTML($oIE)
    ; (?s) lets "." match line breaks, so the capture can span several lines
    $asPOST_Parse = StringRegExp($sPOST_Source, '(?i)(?s)(?m)</div><h2>(.*?)(.*?)<ul class="blurbs">', 3)
    ; Check @error before calling anything else, because function calls reset it
    If @error Then
        MsgBox(0, "Error", "Error: " & @error & @CRLF & "Extended3: " & @extended)
        ContinueLoop ; no match, so $asPOST_Parse is not an array
    EndIf
    ; The first lazy group matches an empty string, so the text is in element [1]
    MsgBox(0, "Text Found :)", "" & $asPOST_Parse[1])
    $PostListing &= "<h2>" & $asPOST_Parse[1] & "<br><br>"
Next
$asURL_Listings[0][0] = (UBound($asURL_Listings) - 1)
; @ScriptDir has no trailing backslash, so add one before the file name
FileWrite(@ScriptDir & "\Test.html", $PostListing)

If you run into any more problems you know where to ask.

Edited by Aipion

I've built on this and now it does just what I want. Thanks so much.

I've always had trouble with RegEx. Sometimes it just looks like gibberish.
