
How would I go about collecting certain links on a webpage?



So I'd like to collect all of the video links on this page:

http://www.youtube.com/videos?s=mp&t=t&cr=CA&p=1

#include <IE.au3>


$oIE = _IECreate ("http://www.youtube.com/videos?s=mp&t=t&cr=CA&p=1")
$oLinks = _IELinkGetCollection ($oIE)
$iNumLinks = @extended
MsgBox(0, "Link Info", $iNumLinks & " links found")
For $oLink  In $oLinks
    MsgBox(0, "Link Info", $oLink)
Next

That's the only script I could find, but it picks up something like 150 links, most of which are useless. I only want the links to the actual 20 or so videos, and I don't want duplicates either.

Can anyone tell me how I can approach this? I want the video links saved to a .txt file.


Okay, after racking my brain, I got to this:

#include <IE.au3>

$oIE = _IECreate("http://www.youtube.com/videos?s=mp&t=t&cr=CA&p=1")
$oLinks = _IELinkGetCollection($oIE)

$file = FileOpen("links.txt", 1)    ; append mode, opened once instead of once per link

For $oLink In $oLinks
    ; keep only anchors that point at an actual video
    $findlink = StringInStr($oLink.href, "http://www.youtube.com/watch?v=")
    $done = StringReplace($oLink.href, "http://www.youtube.com/watch?v=", "")

    If $findlink <> 0 Then
;~      MsgBox(0, $oLink.href, $done)
        FileWrite($file, $done & @CRLF)
    EndIf
Next

FileClose($file)

Now it extracts all the video links; however, I end up with about four copies of every link. How do I make sure no duplicates get added to my .txt?

Here's the output

I'd also like to remove the lines with "hd" and "cc"

uelHwf8o7_U
uelHwf8o7_U
uelHwf8o7_U
uelHwf8o7_U&hd=1
toBLte0n8z8
toBLte0n8z8
toBLte0n8z8
m3lvxTo4Oq8
m3lvxTo4Oq8
m3lvxTo4Oq8
m3lvxTo4Oq8&hd=1
68XP29hivmY
68XP29hivmY
68XP29hivmY
68XP29hivmY&hd=1
1qghtxXiBMU
1qghtxXiBMU
1qghtxXiBMU
1qghtxXiBMU&hd=1
homJ9pE-OCc
homJ9pE-OCc
homJ9pE-OCc
ch1UBta1sg4
ch1UBta1sg4
ch1UBta1sg4
ch1UBta1sg4&hd=1
nQEEQEBySTk
nQEEQEBySTk
nQEEQEBySTk
-sDL98j3ii0
-sDL98j3ii0
-sDL98j3ii0
-sDL98j3ii0&hd=1
-sDL98j3ii0&cc=1
4cqFbEDsZqE
4cqFbEDsZqE
4cqFbEDsZqE
dDnvpGSskgc
dDnvpGSskgc
dDnvpGSskgc
Lzf1v_Km2mc
Lzf1v_Km2mc
Lzf1v_Km2mc
Lzf1v_Km2mc&hd=1
mcdKB17PDlg
mcdKB17PDlg
mcdKB17PDlg
c9R-lIs35fM
c9R-lIs35fM
c9R-lIs35fM
c9R-lIs35fM&hd=1
qqTMfmBEcPA
qqTMfmBEcPA
qqTMfmBEcPA
9lL0Wj_IXOk
9lL0Wj_IXOk
9lL0Wj_IXOk
9lL0Wj_IXOk&hd=1
Q_j6F66K-hE
Q_j6F66K-hE
Q_j6F66K-hE
Q_j6F66K-hE&hd=1
OjNydJV4Iuw
OjNydJV4Iuw
OjNydJV4Iuw
0DWHb6ZIPBs
0DWHb6ZIPBs
0DWHb6ZIPBs
0DWHb6ZIPBs&hd=1
L53gjP-TtGE
L53gjP-TtGE
L53gjP-TtGE
L53gjP-TtGE&hd=1
xIBCZfxUg6Q
xIBCZfxUg6Q
xIBCZfxUg6Q
mS2L-dxeMOg
mS2L-dxeMOg
mS2L-dxeMOg
hfhrqwe495g
hfhrqwe495g
hfhrqwe495g

Examining $oLink in that example is pretty useless... it is an object variable and isn't useful on its own.

Replace

MsgBox(0, "Link Info", $oLink)

with

ConsoleWrite("Link: " & $oLink.innerText & " href: " & $oLink.href & @CRLF)

Examine the results and you may be on your way.

Dale

Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTE Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y

"Doesn't work" needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions: 1) what did you try? 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble


Dale, I think you missed my second post; I'm sort of past that now. Thanks, though.

I got rid of the extra characters like "hd" and "cc" by using

$done1 = StringLeft ($done, 11)

So I only take the first 11 characters of each link, which works because YouTube video IDs are exactly 11 characters long.

Now, how would I go about removing duplicate lines? Or, better, how do I avoid writing them in the first place?
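
One possible way to handle both problems at once (a minimal sketch, assuming the same page and links.txt output as above): keep a running, pipe-delimited string of the IDs that have already been written and check it with StringInStr before each FileWrite, and use StringLeft(..., 11) to keep just the 11-character video ID, which also drops the "&hd=1" / "&cc=1" variants.

#include <IE.au3>

$oIE = _IECreate("http://www.youtube.com/videos?s=mp&t=t&cr=CA&p=1")
$oLinks = _IELinkGetCollection($oIE)

$sSeen = "|"                        ; pipe-delimited list of IDs already written
$file = FileOpen("links.txt", 1)    ; append mode, opened once

For $oLink In $oLinks
    ; keep only anchors that point at an actual video
    If StringInStr($oLink.href, "http://www.youtube.com/watch?v=") = 0 Then ContinueLoop

    ; strip the prefix, then keep the 11-character video ID (drops "&hd=1" / "&cc=1")
    $sId = StringLeft(StringReplace($oLink.href, "http://www.youtube.com/watch?v=", ""), 11)

    ; skip IDs that have already been written
    If StringInStr($sSeen, "|" & $sId & "|") Then ContinueLoop
    $sSeen &= $sId & "|"

    FileWrite($file, $sId & @CRLF)
Next

FileClose($file)

The pipe-delimited string is just one way to track what has already been written; _ArraySearch from Array.au3, or a Scripting.Dictionary object created with ObjCreate("Scripting.Dictionary"), would do the same job.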

