StringRegExp Craigslist parser.

I am making a script that takes a Craigslist posting, parses the car's brand, model, year, etc., and then automatically looks up the True Market Value at edmunds.com. This way I can instantly get an idea of how good a deal it is. I am having a problem parsing out the years because some people put 88 or 94 instead of 1988 and 1994. Right now I am just working on making it check the start of the listing for \d\d. However, because of compatibility problems with older scripts I use, I have to stick with version 3.1.1.132. I think this is how that version matches the start and end of a word:

\< Match beginning of word.

\> Match end of word.

Here is my script so far:

#include <INet.au3>

For $index = 1 To 1
    $Source = _INetGetSource('http://detroit.craigslist.org/car/index' & $index & '00.html')
    $Listings = StringRegExp($Source, '<p>(.*?)</p>', 3)

    Dim $CarDatabase[100][2]
    $CarDatabase[0][1] = "ford"
    $CarDatabase[0][0] = "mustang"

    Dim $Cars[UBound($Listings) ][9]

    For $i = 0 To UBound($Listings) - 1
        $Link = ""
        $City = ""
        $Year = ""
        $Brand = ""
        $Model = ""
        $Price = ""
        $TMV_Low = ""
        $TMV_High = ""
        $Bargin = ""


        $LinkSearch = StringRegExp($Listings[$i], '<a href="(.*?)">', 3)
        If @extended Then $Link = $LinkSearch[0]

        $CitySearch = StringRegExp($Listings[$i], '<font size="-1"> (.*?)</font>', 3)
        If @extended Then $City = $CitySearch[0]

        $YearSearch = StringRegExp($Listings[$i], '(19\d\d)|(200\d)\D', 3)
        If @extended Then
            $Year = $YearSearch[0]
        Else
            $YearSearch = StringRegExp($Listings[$i], '(\d\d)\D', 3)
            If @extended Then
                If StringRegExp($YearSearch[0], '(0\d)', 0) Then
                    $Year = '20' & $YearSearch[0]
                Else
                    $Year = '19' & $YearSearch[0]
                EndIf
            EndIf
        EndIf
        
        For $ii = 0 To UBound($CarDatabase, 1) - 1
            If StringInStr($Listings[$i], $CarDatabase[$ii][0]) Then
                $Brand = $CarDatabase[$ii][1]
                $Model = $CarDatabase[$ii][0]
                $Source = _INetGetSource('http://www.edmunds.com/used/' & $Year & '/' & $Brand & '/' & $Model & '/')
                $TMV_LowSearch = StringRegExp($Source, 'Dealer Retail:<b>&nbsp;$(.*?) - ', 3)
                If @extended Then $TMV_Low = $TMV_LowSearch[0]
                
                
                $TMV_HighSearch = StringRegExp($Source, ' - $(.*?)</b></font>', 3)
                If @extended Then $TMV_High = $TMV_HighSearch[0]

                ExitLoop
            EndIf
        Next



        $PriceSearch = StringRegExp($Listings[$i], ' - $(\d*)', 3)
        If @extended Then $Price = $PriceSearch[0]
        
        $Cars[$i][0] = $Link
        $Cars[$i][1] = $City
        $Cars[$i][2] = $Year
        $Cars[$i][3] = $Brand
        $Cars[$i][4] = $Model
        $Cars[$i][5] = $Price
        $Cars[$i][6] = $TMV_Low
        $Cars[$i][7] = $TMV_High
        $Cars[$i][8] = $Bargin
    Next
Next


$file = FileOpen("Output.html", 2)
FileWriteLine($file, '<table>')
FileWriteLine($file, '<tr><td>Listing</td><td>Year</td><td>Car Brand</td><td>Car Model</td><td>Price</td><td>TMV Low</td><td>TMV High</td></tr>')

For $i = 0 To UBound($Listings) - 1
    FileWriteLine($file, '<tr><td><a href="' & $Cars[$i][0] & '">' & $Listings[$i] & '</a></td><td>' & $Cars[$i][2] & '</td><td>' & $Cars[$i][3] & '</td><td>' & $Cars[$i][4] & '</td><td>' & $Cars[$i][5] & '</td><td>' & $Cars[$i][6] & '</td><td>' & $Cars[$i][7] & '</td></tr>')
Next

FileWriteLine($file, '</table>')
FileClose($file)

It's at a very early stage, as you can see. The problem lies here:

$YearSearch = StringRegExp($Listings[$i], '(19\d\d)|(200\d)\D', 3)
If @extended Then
    $Year = $YearSearch[0]
Else
    $YearSearch = StringRegExp($Listings[$i], '(\d\d)\D', 3)
    If @extended Then
        If StringRegExp($YearSearch[0], '(0\d)', 0) Then
            $Year = '20' & $YearSearch[0]
        Else
            $Year = '19' & $YearSearch[0]
        EndIf
    EndIf
EndIf

Basically, it searches for a 19xx or 200x year, and if it doesn't find one it falls back to looking for just \d\d\D (digit, digit, non-digit). Obviously this goes wrong because the price is in the listing as well, so I need it to search only the very beginning of the string.
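
One way to sidestep the anchor question is to test the first few characters with plain string functions instead of a regular expression. The sketch below is that swapped-in technique, not a regex solution. `_LeadingYear` is a hypothetical helper name; it assumes the listing title has already been separated from the surrounding HTML, it applies the same rule as the script above (two-digit years starting with 0 become 20xx, everything else 19xx), and it is untested on 3.1.1.132.

```autoit
; Hypothetical helper: returns a 4-digit year found at the very start of
; $sTitle, or "" if the title does not start with a year.
; Assumes $sTitle is plain title text (no HTML tags in front of it).
Func _LeadingYear($sTitle)
    Local $sText = StringStripWS($sTitle, 1) ; flag 1 = strip leading whitespace
    Local $sFour = StringLeft($sText, 4)
    Local $sTwo = StringLeft($sText, 2)
    Local $sNext

    ; Case 1: four digits such as 1988 or 2004, not followed by another digit.
    $sNext = StringMid($sText, 5, 1)
    If StringLen($sFour) = 4 And StringIsDigit($sFour) And Not ($sNext <> "" And StringIsDigit($sNext)) Then
        ; Accept only 19xx and 200x, the same range the script's pattern targets.
        If StringLeft($sFour, 2) = "19" Or StringLeft($sFour, 3) = "200" Then Return $sFour
        Return ""
    EndIf

    ; Case 2: two digits such as 88 or 04, not followed by another digit.
    $sNext = StringMid($sText, 3, 1)
    If StringLen($sTwo) = 2 And StringIsDigit($sTwo) And Not ($sNext <> "" And StringIsDigit($sNext)) Then
        If StringLeft($sTwo, 1) = "0" Then Return "20" & $sTwo
        Return "19" & $sTwo
    EndIf

    Return "" ; title does not start with a year
EndFunc
```

A call such as `$Year = _LeadingYear("88 mustang gt")` would take the two-digit branch. Because only the first characters are inspected, a price later in the title cannot be mistaken for a year.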

Now I just had an idea: I could remove the price from the search string first, and then the only numbers left should be years. Nonetheless, I would still like to know how to search the beginning of a string in AutoIt 3.1.1.132. I know newer versions use \A and \z, I think.
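
The price-removal idea could be sketched with `StringInStr` and `StringLeft`, without relying on any regex features. `_StripPrice` is a hypothetical helper name, and the sketch assumes the price is always introduced by the literal ` - $` that the script's price pattern looks for. Digits elsewhere in the listing, for example in the link URL, would still be seen by the year search.

```autoit
; Hypothetical helper: returns $sListing cut off just before the first
; " - $" marker, so the price and anything after it are dropped.
; Assumes " - $" only ever introduces the price.
Func _StripPrice($sListing)
    Local $iPos = StringInStr($sListing, " - $")
    If $iPos = 0 Then Return $sListing ; no price marker found, leave unchanged
    Return StringLeft($sListing, $iPos - 1)
EndFunc
```

The fallback search would then run on the stripped text, for example `$YearSearch = StringRegExp(_StripPrice($Listings[$i]), '(\d\d)\D', 3)`.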
