
extract from a list of links



vinnyMS1,

You have already been told not to bump your own threads within 24 hrs - please do not do it again.

M23



Does InetRead work on any website?
I'm not really sure about that.

If it doesn't work on the required website, then one way to do it is to write this kind of automated script:

1) Get the 1st link from the text file and copy it to the Clipboard.
https://www.example.org/list/?page=1

2) Activate the browser's window and paste the clipboard into the browser's url box (the browser should already be open before the script is run), then Send Enter to display the web page.

3) When the webpage is fully displayed, Send Ctrl+U to open the html source in a new tab of the browser (Ctrl+U works with Chrome, Firefox, Opera...)

4) Ctrl+A to select the whole source of the webpage, then Ctrl+C to copy the source to the Clipboard

5) Use the RegEx part to retrieve the desired links from the Clipboard (i.e. from the source)

6) Write those links in your output text file

7) Close the tabs from the Browser

8) Loop on 1) for the next link
https://www.example.org/list/?page=2

It's not as simple as InetRead, but it works if accurate Sleep() calls are added to the script, giving the pages time to load, etc.

But you really need to check precisely every part of the process, adding tests and error checking to make sure everything is going fine and nothing hangs.

As you have 374 links to check, it's worth spending the time to write the script, which will loop 374 times and automatically repeat all parts from 1) to 8). The script may also help you later when you reuse it... and it's a good/fun way to learn AutoIt!
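A minimal sketch of steps 1) to 5), for one url, could look like this (it assumes Firefox in the window title, Ctrl+L to focus the url box, and hard-coded Sleep() values that you would replace with the proper tests mentioned above):

$sUrl = "https://www.example.org/list/?page=1"

ClipPut($sUrl)                                   ; 1) copy the url to the clipboard
WinActivate("[REGEXPTITLE:Mozilla Firefox]")     ; 2) activate the browser window (already open)
WinWaitActive("[REGEXPTITLE:Mozilla Firefox]", "", 5)
Send("^l")                                       ;    focus the url box
Send("^v{ENTER}")                                ;    paste the url and load the page
Sleep(5000)                                      ;    give the page time to load (to be tuned / tested)
Send("^u")                                       ; 3) open the html source in a new tab
Sleep(2000)
Send("^a^c")                                     ; 4) select the whole source and copy it
Sleep(500)
$sSource = ClipGet()                             ; 5) the source is now ready for the RegEx part
Send("^w")                                       ; 7) close the source tab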

Good luck
 


Hmmm yes. The main problem is how you get the source of the page. InetRead doesn't always work indeed, so you can use the way pixelsearch mentioned, or use curl, etc.
After that the regex must obviously be adapted to fit the required data.
For example, using a txt file containing the source of the page 1 link you provided, this code works for me:

;https://www.lomcn.org/forum/members/list/?page=1

#include <Array.au3>

$txt = FileRead("site_page1.txt")  ; source
$list = ""
$items = StringRegExp($txt, 'forum/members/(\w+\.\d+)', 3)
$items = _ArrayUnique($items)
For $k = 0 To UBound($items) - 1
    $list &= $items[$k] & @CRLF
Next
MsgBox(0, "", $list)
;FileWrite("results.txt", $list)


Edit
@pixelsearch  please try this  :D

;https://www.lomcn.org/forum/members/list/?page=1

#include <Array.au3>

$out = _XP_Read("https://www.lomcn.org/forum/members/list/?page=1")
;ConsoleWrite($out & @CRLF)

$list = ""
$items = StringRegExp($out, 'forum/members/(\w+\.\d+)', 3)
$items = _ArrayUnique($items)
For $k = 0 To UBound($items) - 1
    $list &= $items[$k] & @CRLF
Next
MsgBox(0, "", $list)
;FileWrite("results.txt", $list)


Func _XP_Read($url)
    Local $cmd = "curl -L -s -k -A 'Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1)' " & $url
    Local $iPID = Run($cmd, "", @SW_HIDE, 2)  ;$STDOUT_CHILD
    ProcessWaitClose($iPID)
    Local $output = StdoutRead($iPID)
    Return $output
EndFunc

I love curl  :P

Edited by mikell

@mikell I've spent the last couple of hours dealing with curl (I had never used it before and discovered it in your last post, before you edited it). So I went to the curl website, trying to find the right version to download for Windows.

Finally I got it working, thanks to the syntax you described in this post, and it worked on the test link provided yesterday by the OP (as InetRead didn't work for me):

https://www.lomcn.org/forum/members/list/?page=1

The html source code filled the AutoIt console, no error!

Thanks also to this post, where "VIP (I'm trong)" had a problem with InetGet and Error 13, which was nicely solved by using HttpSetUserAgent; this could help users sometimes.
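Just as an illustration (a little sketch of mine, the agent string and the flag are only examples), it boils down to calling HttpSetUserAgent before InetRead:

HttpSetUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")   ; example agent string
$dData = InetRead("https://www.lomcn.org/forum/members/list/?page=1", 1)   ; 1 = force a reload from the remote site
If @error Then
    ConsoleWrite("InetRead failed, @error = " & @error & @CRLF)
Else
    ConsoleWrite(BinaryToString($dData) & @CRLF)   ; binary data converted to a string
EndIf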

Only now do I see your edited post above (which will definitely give a correct result, as it already worked for me using the syntax from the link I mentioned above).

Guess I'm starting to like curl too; it's never too late to learn something new!


Very good. How do I get multiple pages? With the latest version there's only one page address (page 1).

 

I have this:

 

;https://www.lomcn.org/forum/members/list/?page=1

#include <Array.au3>

$base_url = "https://www.lomcn.org/forum/members/list/"
For $i = 1 To 374
    $out = _XP_Read($base_url & "?page=" & $i)
    ;ConsoleWrite($out & @CRLF)
    $list = ""
    $items = StringRegExp($out, '\Q' & $base_url & '\E(\w+\.\d+)', 3)
    $items = _ArrayUnique($items)
    For $k = 0 To UBound($items) - 1
        $list &= $items[$k] & @CRLF
    Next
    FileWrite("results.txt", $list)
Next

Func _XP_Read($url)
    Local $cmd = "curl -L -s -k -A 'Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1)' " & $url
    Local $iPID = Run($cmd, "", @SW_HIDE, 2)  ;$STDOUT_CHILD
    ProcessWaitClose($iPID)
    Local $output = StdoutRead($iPID)
    Return $output
EndFunc

 

Edited by vinnyMS1

@vinnyMS1
Hey, you should think a little. If you don't try to understand what you read, you will never learn anything.
As written in my "roadmap" code and as pixelsearch said, you have to loop through the numbered pages.
This works for me to process the first 3 pages:

#include <Array.au3>

$base_url = "https://www.lomcn.org/"
$sub_url = "forum/members/"
$list = ""

For $i = 1 To 3
    $out = _XP_Read($base_url & $sub_url & "list/?page=" & $i)
    ;ConsoleWrite($out & @CRLF)
    $items = StringRegExp($out, $sub_url & '(\w+\.\d+)', 3)
    $items = _ArrayUnique($items, 0, 0, 0, 0)
    For $k = 0 To UBound($items) - 1
        $list &= $items[$k] & @CRLF
    Next
Next
MsgBox(0, "", $list)
;FileWrite("results.txt", $list)


Func _XP_Read($url)
    Local $cmd = "curl -L -s -k -A 'Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1)' " & $url
    Local $iPID = Run($cmd, "", @SW_HIDE, 2)  ;$STDOUT_CHILD
    ProcessWaitClose($iPID)
    Local $output = StdoutRead($iPID)
    Return $output
EndFunc

 


@mikell just to let you know about the minimal switches I just tried for a successful file download using curl:

Local $url = "https://www.autoitscript.com/autoit3/scite/download/Au3Stripper.zip"
Local $cmd = "C:\curl\curl.exe -O " & $url
Local $iPID = Run($cmd, "", @SW_HIDE, 2) ; 2 = $STDOUT_CHILD
ProcessWaitClose($iPID)
Local $output = StdoutRead($iPID)

The script downloads Au3Stripper.zip and saves it with the same name in the script folder, because of the -O switch:

-O, --remote-name
Write output to a local file named like the remote file we get [...]

But if we had to download this file:

Local $url = "https://www.autoitscript.com/cgi-bin/getfile.pl?autoit3/autoit-v3.zip"

Then it would download only a 1KB zip file (instead of 17MB!) because of the missing -L switch. In this case, the proper syntax should be:

Local $cmd = "C:\curl\curl.exe -O -L " & $url

So it seems always good to include the -L switch (as you indicated), no matter the url:

-L, --location
(HTTP) If the server reports that the requested page has moved to a different location [...] this option will make curl redo the request on the new place.

I avoided the -k (--insecure) switch after I read this in the help file:

WARNING: using this option makes the transfer insecure.

Without the -s/--silent switch, we can see nice progress lines in the AutoIt console, great :)

Finally, the -A/--user-agent <agent string> switch will certainly be useful depending on the site, but I just wanted to test the minimal mandatory switches needed for a successful download.

Both versions of curl did it: the old "curl_7_46_0_openssl_nghttp2_x86" I told you about, and the recent "curl-7.83.1_7-win32-mingw.zip" (May 2022).
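Putting it together, here is a minimal sketch with just -O and -L, plus a basic check that the file really landed (the curl path, the working directory and the FileExists check are my own assumptions):

Local $url = "https://www.autoitscript.com/autoit3/scite/download/Au3Stripper.zip"
Local $cmd = "C:\curl\curl.exe -O -L " & $url
Local $iPID = Run($cmd, @ScriptDir, @SW_HIDE, 2)   ; working dir = script folder, 2 = $STDOUT_CHILD
ProcessWaitClose($iPID)
Local $sFile = @ScriptDir & "\Au3Stripper.zip"     ; -O names the local file like the remote one
If FileExists($sFile) Then
    ConsoleWrite("Downloaded " & FileGetSize($sFile) & " bytes" & @CRLF)
Else
    ConsoleWrite("Download failed" & @CRLF)
EndIf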

Also, as you guessed, I found the following info written on the official curl website:

Microsoft ships curl too:
curl is also shipped by Microsoft as part of Windows 10 and 11.

You definitely knew all this, but it's very new to me!
Have a great Sunday and thanks for making us discover curl :bye:


9 hours ago, pixelsearch said:

I just wanted to test the minimal mandatory switches needed for a successful download.

I got some failures when using curl without the -k and/or -A switches...
For convenience I use a little UDF with curl 'read' and 'get' functions inside, and I need them to work in as many cases as possible :P
That's also the reason why I use -o instead of -O (a matter of versatility).
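Something along these lines, for instance (just a sketch, not the actual UDF: the names _Curl_Read / _Curl_Get and the hard-coded switches are mine):

; _Curl_Read() returns the page source, _Curl_Get() downloads to a file chosen by the caller (hence -o)
; e.g. $src = _Curl_Read("https://www.example.org/")
;      _Curl_Get("https://www.example.org/file.zip", @ScriptDir & "\file.zip")

Func _Curl_Read($sUrl)
    Local $sCmd = 'curl -L -s -k -A "Mozilla/5.0" "' & $sUrl & '"'
    Local $iPID = Run($sCmd, "", @SW_HIDE, 2)   ; 2 = $STDOUT_CHILD
    ProcessWaitClose($iPID)
    Return StdoutRead($iPID)
EndFunc

Func _Curl_Get($sUrl, $sOutFile)
    Local $sCmd = 'curl -L -s -k -A "Mozilla/5.0" -o "' & $sOutFile & '" "' & $sUrl & '"'
    Local $iPID = Run($sCmd, "", @SW_HIDE, 2)
    ProcessWaitClose($iPID)
    Return FileExists($sOutFile)                ; 1 if the file landed, 0 otherwise
EndFunc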

