
extract from a list of links



vinnyMS1,

You have already been told not to bump your own threads within 24 hrs - please do not do it again.

M23



Does InetRead work on any website?
I'm not really sure about that.

If it doesn't work on the required website, then one way to do it is to write this kind of automated script:

1) Get the 1st link from the text file and copy it to the Clipboard.
https://www.example.org/list/?page=1

2) Activate the browser's window and paste the clipboard into the browser's url box (the browser should already be open before the script is run), then Send Enter to display the web page.

3) When the webpage is fully displayed, Send Ctrl+U to open the html source in a new tab of the browser (Ctrl+U works with Chrome, Firefox, Opera...)

4) Ctrl+A to select the whole source of the webpage, then Ctrl+C to copy the source to the Clipboard

5) Use the RegEx part to retrieve the desired links from the Clipboard (i.e. from the source)

6) Write those links in your output text file

7) Close the tabs from the Browser

8) Loop on 1) for the next link
https://www.example.org/list/?page=2

It's not as simple as InetRead, but it works if accurate Sleep() calls are added to the script, giving the pages time to load, etc.

But you really need to check precisely every part of the process, adding tests and error checking to make sure everything is going fine and nothing hangs.

As you have 374 links to check, it's worth spending the time to write the script, which will loop 374 times and automatically repeat all parts from 1) to 8). The script may also help you later when you reuse it... and it's a good/fun way to learn AutoIt!
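A minimal sketch of steps 1) to 5), for one url, could look like this (it assumes Firefox in the window title, Ctrl+L to focus the url box, and hard-coded Sleep() values that you would replace with the proper tests mentioned above):

$sUrl = "https://www.example.org/list/?page=1"

ClipPut($sUrl)                                   ; 1) copy the url to the clipboard
WinActivate("[REGEXPTITLE:Mozilla Firefox]")     ; 2) activate the browser window (already open)
WinWaitActive("[REGEXPTITLE:Mozilla Firefox]", "", 5)
Send("^l")                                       ;    focus the url box
Send("^v{ENTER}")                                ;    paste the url and load the page
Sleep(5000)                                      ;    give the page time to load (to be tuned / tested)
Send("^u")                                       ; 3) open the html source in a new tab
Sleep(2000)
Send("^a^c")                                     ; 4) select the whole source and copy it
Sleep(500)
$sSource = ClipGet()                             ; 5) the source is now ready for the RegEx part
Send("^w")                                       ; 7) close the source tab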

Good luck
 


Hmmm yes. The main problem is how you get the source of the page. InetRead doesn't always work indeed, so you can use the way pixelsearch mentioned, or use curl, etc.
After that the regex must obviously be adapted to fit the required data.
For example, using a txt file containing the source of the page 1 link you provided, this code works for me:

;https://www.lomcn.org/forum/members/list/?page=1

#include <Array.au3>

$txt = FileRead("site_page1.txt")  ; source
$list = ""
$items = StringRegExp($txt, 'forum/members/(\w+\.\d+)', 3)
$items = _ArrayUnique($items)
For $k = 0 To UBound($items) - 1
    $list &= $items[$k] & @CRLF
Next
MsgBox(0, "", $list)
;FileWrite("results.txt", $list)


Edit
@pixelsearch  please try this  :D

;https://www.lomcn.org/forum/members/list/?page=1

#include <Array.au3>

$out = _XP_Read("https://www.lomcn.org/forum/members/list/?page=1")
;ConsoleWrite($out & @CRLF)

$list = ""
$items = StringRegExp($out, 'forum/members/(\w+\.\d+)', 3)
$items = _ArrayUnique($items)
For $k = 0 To UBound($items) - 1
    $list &= $items[$k] & @CRLF
Next
MsgBox(0, "", $list)
;FileWrite("results.txt", $list)


Func _XP_Read($url)
    Local $cmd = "curl -L -s -k -A 'Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1)' " & $url
    Local $iPID = Run($cmd, "", @SW_HIDE, 2)  ;$STDOUT_CHILD
    ProcessWaitClose($iPID)
    Local $output = StdoutRead($iPID)
    Return $output
EndFunc

I love curl  :P

Edited by mikell

@mikell I've spent the last couple of hours dealing with curl (I had never used it before and discovered it in your last post, before you edited it). So I went to the curl website, trying to find the right version to download for Windows.

Finally I got it working, thanks to the syntax you described in this post, and it worked on the test link provided yesterday by the OP (as InetRead didn't work for me):

https://www.lomcn.org/forum/members/list/?page=1

The html source code filled the AutoIt console, no error!

Thanks also to this post, where "VIP (I'm trong)" had a problem with InetGet and Error 13, which was nicely solved by using HttpSetUserAgent; this could help users sometimes.
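Just as an illustration (a little sketch of mine, the agent string and the flag are only examples), it boils down to calling HttpSetUserAgent before InetRead:

HttpSetUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")   ; example agent string
$dData = InetRead("https://www.lomcn.org/forum/members/list/?page=1", 1)   ; 1 = force a reload from the remote site
If @error Then
    ConsoleWrite("InetRead failed, @error = " & @error & @CRLF)
Else
    ConsoleWrite(BinaryToString($dData) & @CRLF)   ; binary data converted to a string
EndIf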

Only now do I see your edited post above (which will definitely give a correct result, as it already worked for me using the syntax from the link I mentioned above).

Guess I'm starting to like curl too; it's never too late to learn something new!


Very good. How do I get multiple pages? With the latest version there's only one page address (page 1).

 

I have this:

 

;https://www.lomcn.org/forum/members/list/?page=1

#include <Array.au3>

$base_url = "https://www.lomcn.org/forum/members/list/"
For $i = 1 To 374
    $out = _XP_Read($base_url & "?page=" & $i)
    ;ConsoleWrite($out & @CRLF)
    $list = ""
    $items = StringRegExp($out, '\Q' & $base_url & '\E(\w+\.\d+)', 3)
    $items = _ArrayUnique($items)
    For $k = 0 To UBound($items) - 1
        $list &= $items[$k] & @CRLF
    Next
    FileWrite("results.txt", $list)
Next

Func _XP_Read($url)
    Local $cmd = "curl -L -s -k -A 'Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1)' " & $url
    Local $iPID = Run($cmd, "", @SW_HIDE, 2)  ;$STDOUT_CHILD
    ProcessWaitClose($iPID)
    Local $output = StdoutRead($iPID)
    Return $output
EndFunc

 

Edited by vinnyMS1

@vinnyMS1
Hey, you should think a little. If you don't try to understand what you read, you will never learn anything.
As written in my "roadmap" code and as pixelsearch said, you have to loop through the numbered pages.
This works for me to process the first 3 pages:

#include <Array.au3>

$base_url = "https://www.lomcn.org/"
$sub_url = "forum/members/"
$list = ""

For $i = 1 To 3
    $out = _XP_Read($base_url & $sub_url & "list/?page=" & $i)
    ;ConsoleWrite($out & @CRLF)
    $items = StringRegExp($out, $sub_url & '(\w+\.\d+)', 3)
    $items = _ArrayUnique($items, 0, 0, 0, 0)
    For $k = 0 To UBound($items) - 1
        $list &= $items[$k] & @CRLF
    Next
Next
MsgBox(0, "", $list)
;FileWrite("results.txt", $list)


Func _XP_Read($url)
    Local $cmd = "curl -L -s -k -A 'Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1)' " & $url
    Local $iPID = Run($cmd, "", @SW_HIDE, 2)  ;$STDOUT_CHILD
    ProcessWaitClose($iPID)
    Local $output = StdoutRead($iPID)
    Return $output
EndFunc

 


@mikell just to let you know about the minimal switches I just tried for a successful file download using curl:

Local $url = "https://www.autoitscript.com/autoit3/scite/download/Au3Stripper.zip"
Local $cmd = "C:\curl\curl.exe -O " & $url
Local $iPID = Run($cmd, "", @SW_HIDE, 2) ; 2 = $STDOUT_CHILD
ProcessWaitClose($iPID)
Local $output = StdoutRead($iPID)

The script downloads Au3Stripper.zip and saves it with the same name in the script folder, because of the -O switch:

-O, --remote-name
Write output to a local file named like the remote file we get [...]

But if we had to download this file:

Local $url = "https://www.autoitscript.com/cgi-bin/getfile.pl?autoit3/autoit-v3.zip"

Then it would download only a 1KB zip file (instead of 17MB!) because of the missing -L switch. In this case, the proper syntax should be:

Local $cmd = "C:\curl\curl.exe -O -L " & $url

So it seems always good to include the -L switch (as you indicated), no matter the url:

-L, --location
(HTTP) If the server reports that the requested page has moved to a different location [...] this option will make curl redo the request on the new place.

I avoided the -k (--insecure) switch after I read this in the help file:

WARNING: using this option makes the transfer insecure.

Without the -s/--silent switch, we can see nice progress lines in the AutoIt console, great :)

Finally, the -A/--user-agent <agent string> switch will certainly be useful depending on the site, but I just wanted to test the minimal mandatory switches needed for a successful download.

Both versions of curl did it: the old "curl_7_46_0_openssl_nghttp2_x86" I told you about, and the recent "curl-7.83.1_7-win32-mingw.zip" (May 2022).
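Putting it together, here is a minimal sketch with just -O and -L, plus a basic check that the file really landed (the curl path, the working directory and the FileExists check are my own assumptions):

Local $url = "https://www.autoitscript.com/autoit3/scite/download/Au3Stripper.zip"
Local $cmd = "C:\curl\curl.exe -O -L " & $url
Local $iPID = Run($cmd, @ScriptDir, @SW_HIDE, 2)   ; working dir = script folder, 2 = $STDOUT_CHILD
ProcessWaitClose($iPID)
Local $sFile = @ScriptDir & "\Au3Stripper.zip"     ; -O names the local file like the remote one
If FileExists($sFile) Then
    ConsoleWrite("Downloaded " & FileGetSize($sFile) & " bytes" & @CRLF)
Else
    ConsoleWrite("Download failed" & @CRLF)
EndIf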

Also, as you guessed, I found the following info written on the official curl website:

Microsoft ships curl too:
curl is also shipped by Microsoft as part of Windows 10 and 11.

You definitely knew all this, but it's very new to me!
Have a great Sunday and thanks for making us discover curl :bye:


9 hours ago, pixelsearch said:

I just wanted to test the minimal mandatory switches needed for a successful download.

I got some failures when using curl without the -k and/or -A switches...
For convenience I use a little UDF with curl 'read' and 'get' functions inside, and I need them to work in as many cases as possible :P
That's also the reason why I use -o instead of -O (a matter of versatility).
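Something along these lines, for instance (just a sketch, not the actual UDF: the names _Curl_Read / _Curl_Get and the hard-coded switches are mine):

; _Curl_Read() returns the page source, _Curl_Get() downloads to a file chosen by the caller (hence -o)
; e.g. $src = _Curl_Read("https://www.example.org/")
;      _Curl_Get("https://www.example.org/file.zip", @ScriptDir & "\file.zip")

Func _Curl_Read($sUrl)
    Local $sCmd = 'curl -L -s -k -A "Mozilla/5.0" "' & $sUrl & '"'
    Local $iPID = Run($sCmd, "", @SW_HIDE, 2)   ; 2 = $STDOUT_CHILD
    ProcessWaitClose($iPID)
    Return StdoutRead($iPID)
EndFunc

Func _Curl_Get($sUrl, $sOutFile)
    Local $sCmd = 'curl -L -s -k -A "Mozilla/5.0" -o "' & $sOutFile & '" "' & $sUrl & '"'
    Local $iPID = Run($sCmd, "", @SW_HIDE, 2)
    ProcessWaitClose($iPID)
    Return FileExists($sOutFile)                ; 1 if the file landed, 0 otherwise
EndFunc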

