IMDb Top 250 extracter


stefionesco

I'm bored, so here is a basic example:

#include <Array.au3>
#include <IE.au3>

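Local $oIE = _IECreate("https://www.imdb.com/india/top-rated-indian-movies", 1)
Sleep(4000) ;~ Give the page time to finish rendering the chart
Local $aMovies ;~ Stays a non-array if the chart table is never found
Local $oMovies = _IETableGetCollection($oIE)
For $oMovie In $oMovies
    If $oMovie.ClassName = "chart full-width" Then
        $aMovies = _IETableWriteToArray($oMovie, True)
        ExitLoop
    EndIf
Next
If Not IsArray($aMovies) Then Exit MsgBox(16, "Error", "Chart table not found")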

Local $sMovies = @ScriptDir & "\" & @YEAR & "-" & @MON & "-" & @MDAY & "_Top-Rated-Indian-Movies.csv"
Local $hMovies = FileOpen($sMovies, 10) ;~ Create directory + overwrite contents
Filewrite($hMovies, '"' & _ArrayToString($aMovies, '","', -1, -1, '"' & @CRLF & '"', 1, 2) & '"' & @CRLF)
FileClose($hMovies)
_ArrayDisplay($aMovies)

 


17 minutes ago, FrancescoDiMuro said:

I think that a little bit of effort from your side should be shown.

No one is here to code for you, as stated in the forum etiquette.

I am trying, man. I don't want to be lazy, but I didn't get anywhere. That's why I'm trying to get some help from you. I don't want to be rude... Sorry if it came across that way.

 


I'm not 100% sure what the IMDb ID is; I believe it's the code after "title" in the URL, so here is how I'd get it.

#include <Array.au3>
#include <IE.au3>

Local $oIE = _IECreate("https://www.imdb.com/india/top-rated-indian-movies", 1)
Sleep(4000)
Local $aMovies[0][4]
Local $oMovies = _IETableGetCollection($oIE)
If IsObj($oMovies) Then
    For $oMovie In $oMovies
        If $oMovie.ClassName = "chart full-width" Then
            $oRows = _IETagNameGetCollection($oMovie, "tr")
            If IsObj($oRows) Then
                For $oRow In $oRows
                    ReDim $aMovies[UBound($aMovies) + 1][4]
                    $iMovies = UBound($aMovies) - 1
                    $oCells = _IETagNameGetCollection($oRow, "td")
                    For $oCell In $oCells
                        If $oCell.ClassName = "titleColumn" Then
                            $aMovies[$iMovies][0] = $oCell.InnerText
                            $oLinks = _IETagNameGetCollection($oCell, "a")
                            If IsObj($oLinks) Then
                                For $olink In $oLinks
                                    $aMovies[$iMovies][3] = StringLeft($olink.href, StringInStr($olink.href, "/", 0, 5))
                                    $aMovies[$iMovies][2] = StringTrimRight(StringReplace($aMovies[$iMovies][3], "https://www.imdb.com/title/", ""), 1)
                                Next
                            EndIf
                        EndIf
                        If $oCell.ClassName = "ratingColumn imdbRating" Then $aMovies[$iMovies][1] = $oCell.InnerText
                    Next
                Next
            EndIf
            ExitLoop
        EndIf
    Next
EndIf

$aMovies[0][0] = "Title"
$aMovies[0][1] = "IMDb Rating"
$aMovies[0][2] = "IMDb Id"
$aMovies[0][3] = "IMDb Url"

Local $sMovies = @ScriptDir & "\" & @YEAR & "-" & @MON & "-" & @MDAY & "_Top-Rated-Indian-Movies.csv"
Local $hMovies = FileOpen($sMovies, 10) ;~ Create directory + overwrite contents
Filewrite($hMovies, '"' & _ArrayToString($aMovies, '","', -1, -1, '"' & @CRLF & '"') & '"' & @CRLF)
FileClose($hMovies)
_ArrayDisplay($aMovies)
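
The ID extraction in the script above can be tried on its own, without a browser. This is a minimal sketch, assuming a link of the form https://www.imdb.com/title/tt.../; the sample URL and the variable names are made up for illustration, and the regular-expression variant is an alternative, not what the script above uses.

```autoit
#include <StringConstants.au3>

; Hypothetical sample link in the shape the title links take (not taken from the page).
Local $sHref = "https://www.imdb.com/title/tt0000001/?ref_=example"

; Same approach as the script above: keep everything up to the fifth "/".
Local $sUrl = StringLeft($sHref, StringInStr($sHref, "/", 0, 5))

; Strip the fixed prefix, then the trailing "/".
Local $sId = StringTrimRight(StringReplace($sUrl, "https://www.imdb.com/title/", ""), 1)

; Alternative: capture "tt" followed by digits with a regular expression.
Local $aMatch = StringRegExp($sHref, "/title/(tt\d+)", $STR_REGEXPARRAYMATCH)
Local $sIdRegex = IsArray($aMatch) ? $aMatch[0] : ""

ConsoleWrite($sUrl & @CRLF)     ; https://www.imdb.com/title/tt0000001/
ConsoleWrite($sId & @CRLF)      ; tt0000001
ConsoleWrite($sIdRegex & @CRLF) ; tt0000001
```

The slash-counting version depends on the link always starting with the same scheme and host; the regular expression only depends on the "/title/tt..." part, so it may be a bit more tolerant if the prefix differs.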

 


10 minutes ago, Subz said:

I'm not 100% sure what the IMDb ID is; I believe it's the code after "title" in the URL, so here is how I'd get it.

 

Thanks. This seems to be what I'm looking for. Now I can try to go ahead with my coding.

PS. And yes, the IMDb ID is the code after "title" in the URL (ttxxxxxxx).

Thanks again

PS2. After I finish what I have in mind I will post it here.


You're scraping web data, so you need some basic HTML knowledge. I normally use Chrome and inspect each element I need to capture. You need to identify unique information: for example, <div id="xyz"> is better than <div class="xyz">, since an id should only be used once per page (if coded correctly). Class names are normally reused throughout the document; however, in most instances people use class names as in the example I posted above, where all titles have the class name "titleColumn", which makes them easy to identify.

If you look at the link you posted and inspect the elements of the page, you'll notice it doesn't use tables but divs. Each title sits in a div with the class name "lister-item-content"; you'll note the heading "h3" is the title and holds the URL. So start with:

$oDivs = _IETagNameGetCollection($oIE, "div")

Loop and look for $oDiv.ClassName = "lister-item-content"

_IETagNameGetCollection($oDiv, "h3")

$oH3.InnerText will be your title

Use the code I posted above to get the links.

If you encounter any issues post your code and we can assist.
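
To see the difference between an id lookup and a class-name loop without touching any website, the steps above can be tried against a small local page. This is only a sketch: the HTML, the id "xyz" and the sample titles are made up, and it assumes Internet Explorer is available for the IE.au3 functions.

```autoit
#include <IE.au3>

; Placeholder markup: the id "xyz" and the sample titles are invented for this sketch.
Local $sHtml = '<html><body>' & _
        '<div id="xyz">Unique block</div>' & _
        '<div class="lister-item-content"><h3><a href="#">Example Movie One</a></h3></div>' & _
        '<div class="lister-item-content"><h3><a href="#">Example Movie Two</a></h3></div>' & _
        '</body></html>'

; Write the markup into an empty browser window, as in the _IEDocWriteHTML help example.
Local $oIE = _IECreate("about:blank")
_IEDocWriteHTML($oIE, $sHtml)
_IEAction($oIE, "refresh")

; An id should be unique, so a single lookup is enough.
Local $oUnique = _IEGetObjById($oIE, "xyz")
If IsObj($oUnique) Then ConsoleWrite("By id: " & $oUnique.innerText & @CRLF)

; A class can repeat, so walk the div collection and filter on className.
Local $oH3s
Local $oDivs = _IETagNameGetCollection($oIE, "div")
If IsObj($oDivs) Then
    For $oDiv In $oDivs
        If $oDiv.className = "lister-item-content" Then
            $oH3s = _IETagNameGetCollection($oDiv, "h3")
            If IsObj($oH3s) Then
                For $oH3 In $oH3s
                    ConsoleWrite("By class: " & $oH3.innerText & @CRLF)
                Next
            EndIf
        EndIf
    Next
EndIf

_IEQuit($oIE)
```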


1. First, I want to say I'm no coder, so if Somerset says I'm a lazy coder, I take it as a compliment. I have only learned some basic GUI functions, here on this forum and from some YouTube tutorials, and that's it. I have always tried to adapt the code I found here to my needs. This time I didn't get the result I wanted, which is why I asked for help. I'm not lazy. I just don't have the knowledge to understand and build what I have in mind. Sorry if I offended anybody.

2. For those who still want to help me, especially Subz, who tried to explain how to get the div class... I tried. I found the class, but it didn't work. I'm surely doing something wrong. To keep it simple, here is the code, your code, that I tried to modify to fit my needs:

#include <Array.au3>
#include <IE.au3>

Local $oIE = _IECreate("https://www.imdb.com/india/top-rated-indian-movies/", 1)

                SplashTextOn("Working", "Please wait...", 600, 50)

Sleep(4000)
Local $aMovies[0][4]
Local $oMovies = _IETableGetCollection($oIE)
If IsObj($oMovies) Then
    For $oMovie In $oMovies
        If $oMovie.ClassName = "chart full-width" Then
            $oRows = _IETagNameGetCollection($oMovie, "tr")
            If IsObj($oRows) Then
                For $oRow In $oRows
                    ReDim $aMovies[UBound($aMovies) + 1][4]
                    $iMovies = UBound($aMovies) - 1
                    $oCells = _IETagNameGetCollection($oRow, "td")
                    For $oCell In $oCells
                        If $oCell.ClassName = "titleColumn" Then
                            $aMovies[$iMovies][0] = $oCell.InnerText
                            $oLinks = _IETagNameGetCollection($oCell, "a")
                            If IsObj($oLinks) Then
                                For $olink In $oLinks
                                    $aMovies[$iMovies][3] = StringLeft($olink.href, StringInStr($olink.href, "/", 0, 5))
                                    $aMovies[$iMovies][1] = StringTrimRight(StringReplace($aMovies[$iMovies][3], "https://www.imdb.com/title/tt", ""), 1)
                                Next
                            EndIf
                        EndIf
                        If $oCell.ClassName = "ratingColumn seen-widget rated inline rating " Then $aMovies[$iMovies][2] = $oCell.InnerText
                    Next
                Next
            EndIf
            ExitLoop
        EndIf
    Next
EndIf


                SplashOff()

$aMovies[0][0] = "Title"
$aMovies[0][1] = "IMDb ID"
$aMovies[0][2] = "My Rating"

Local $sMovies = @ScriptDir & "\" & @YEAR & "-" & @MON & "-" & @MDAY & "_Top-Rated-Indian-Movies.ini"
Local $hMovies = FileOpen($sMovies, 10) ;~ Create directory + overwrite contents
Filewrite($hMovies, '' & _ArrayToString($aMovies, '= ', 1, -1, '' & @CRLF))
FileClose($hMovies)
_ArrayDisplay($aMovies)

Problems:

1. I changed the CSV file into an INI file, since I'd prefer to have it in INI format later. This seems to be no problem, though I think an INI file needs to have [sections].

2. (This is the tough one.) I'm not interested in the IMDb rating; instead I need my own ratings. I found the class for "my rating", but it didn't work for me: the result is an empty column. In any case, I need to exclude the titles I have already rated from the file written from $aMovies. Something like: if my rating is empty, write the line to the file; otherwise (there is already a rating) skip it. I know I could do this afterwards in Excel with the CSV file, but it would be easier to have the INI file without the movies I've already seen.

3. The final INI file needs only two columns of the array (Title = IMDb ID). In the code above (which has 4 columns) I can't work out where to change that. I know it's in FileWrite, but I can't find the right expression.

 

Thank you.

PS. Even if nobody helps me, thanks anyway for everything I've learned on this forum.

 

Edited by stefionesco

@stefionesco

Let @Somerset go! He was joking, as he does with a lot of people around here, so don't mind him! :)

For your requests, a database seems more appropriate, since you could query it and do almost everything there instead of in your script (for example, you could extract only films of a particular genre, or those with a rating above a certain value, and so on).

By the way, if you still want to use INI files, take a look at the Ini* functions in the Help file instead of using the File* functions to write your file :)
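
As a rough illustration of the Ini* functions, here is a minimal sketch that writes "Title = IMDb ID" pairs under one section and reads them back. The file name, the section name "Movies" and the sample data are all made up for the example.

```autoit
; Hypothetical file name and sample data, for illustration only.
Local $sIni = @ScriptDir & "\movies-example.ini"
Local $aSample[2][2] = [["Example Movie One", "tt0000001"], ["Example Movie Two", "tt0000002"]]

For $i = 0 To UBound($aSample) - 1
    ; IniWrite(file, section, key, value) creates the file and the [Movies] section as needed.
    IniWrite($sIni, "Movies", $aSample[$i][0], $aSample[$i][1])
Next

; IniRead returns the default (fourth argument) when the key is missing.
ConsoleWrite(IniRead($sIni, "Movies", "Example Movie One", "not found") & @CRLF) ; tt0000001
ConsoleWrite(IniRead($sIni, "Movies", "No Such Movie", "not found") & @CRLF)     ; not found
```

Using the title as the key is the simplest way to get "Title = IMDb ID" lines, but a title containing "=" would not survive as a key; using the ID as the key (or as the section name) avoids that.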

 

Edited by FrancescoDiMuro


Here is an example of how to handle the list page (your second URL) and also write the results to an INI file. In your code above, the class name you should be looking for is "ratingColumn"; the class you posted belongs to a div, not the cell, i.e. "td".

#include <Array.au3>
#include <IE.au3>

Local $oIE = _IECreate("https://www.imdb.com/list/ls045397191", 1)
Sleep(4000)
Local $aMovies[0][4]
Local $oDivs = _IETagNameGetCollection($oIE, "div")
If IsObj($oDivs) Then
    For $oDiv In $oDivs
        If $oDiv.ClassName = "lister-item-content" Then
            ReDim $aMovies[UBound($aMovies) + 1][4]
            $iMovies = UBound($aMovies) - 1
            $oHeading3s = _IETagNameGetCollection($oDiv, "h3")
            If IsObj($oHeading3s) Then
                For $oHeading3 In $oHeading3s
                    $aMovies[$iMovies][0] = $oHeading3.InnerText
                    SplashTextOn("IMDb Extractor", $aMovies[$iMovies][0], 400, 50)
                    $oLinks = _IETagNameGetCollection($oHeading3, "a")
                    If IsObj($oLinks) Then
                        For $olink In $oLinks
                            $aMovies[$iMovies][3] = StringLeft($olink.href, StringInStr($olink.href, "/", 0, 5))
                            $aMovies[$iMovies][2] = StringTrimRight(StringReplace($aMovies[$iMovies][3], "https://www.imdb.com/title/", ""), 1)
                        Next
                    EndIf
                Next
            EndIf
            $oLabels = _IETagNameGetCollection($oDiv, "label")
            If IsObj($oLabels) Then
                For $oLabel In $oLabels
                    If $oLabel.ClassName = "ipl-rating-interactive__star-container" Then
                        $aMovies[$iMovies][1] = StringStripWS($oLabel.InnerText, 3)
                    EndIf
                Next
            EndIf
        EndIf
    Next
EndIf
SplashOff()
_ArrayInsert($aMovies, 0, "Title|IMDb Rating|IMDb Id|IMDb Url")

Local $sCsvMovies = @ScriptDir & "\" & @YEAR & "-" & @MON & "-" & @MDAY & "_Top-Rated-Indian-Movies.csv"
Local $sIniMovies = @ScriptDir & "\" & @YEAR & "-" & @MON & "-" & @MDAY & "_Top-Rated-Indian-Movies.ini"
Local $hMovies = FileOpen($sCsvMovies, 10) ;~ Create directory + overwrite contents
Filewrite($hMovies, '"' & _ArrayToString($aMovies, '","', -1, -1, '"' & @CRLF & '"') & '"' & @CRLF)
FileClose($hMovies)

For $i = 1 To UBound($aMovies) - 1
    ;~ Check to see if the movie has already been rated, if not continue.
    If IniRead($sIniMovies, $aMovies[$i][2], "My Rating", "") = "" Then
        IniWrite($sIniMovies, $aMovies[$i][2], "Title", $aMovies[$i][0])
        IniWrite($sIniMovies, $aMovies[$i][2], "My Rating", $aMovies[$i][1])
    EndIf
Next
_ArrayDisplay($aMovies)

 


  • Moderators

@stefionesco While your project seems fairly innocuous, it has been pointed out that IMDb's Conditions of Use page states very clearly:

Quote

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

As I am guessing that you do not possess this in writing, I am locking this thread based on our forum rules. Please read these and familiarize yourself before posting again.

This topic is now closed to further replies.