extract from a list of links


i have a list of links in a text file

https://www.example.org/list/?page=2
https://www.example.org/list/?page=3
https://www.example.org/list/?page=4
https://www.example.org/list/?page=5
etc


there are 20 links on every page and they are clickable

their link is what i'm after

https://www.example.org/list/item.9308/

from each page i want to extract the last part of those addresses, "item.9308", and save it as

item:9308

in a text file

i have a list of 1000 page links and there are 20000 item links on them

it could work with regex

the script should read the txt file i attached and save the item names and their numbers to a text file as a list.

like this screenshot

links sources.txt


  • Developers

In a rush or something?

It would be polite to wait at least 24 hours before bumping your question. In the meantime, maybe share what you have already, or are you expecting the code to be served on a silver platter? 😉



  • Developers
20 minutes ago, vinnyMS1 said:

i can describe programs but i don't program, i can't

Well, you are in an AutoIt3 support forum, not a "make the code for free for me" forum. ;)

So come back when you have an actual AutoIt3 question.



how do i read the lines of a text file, check that they are links to webpages, make the script visit each of those pages, and extract with regex the links that end in a word, a period, and a number?

in short, the autoit questions are:

  • how do i read the lines of a text file
  • how do i visit each link and find the 20 links on that page, based on how they end, like https://www.example.org/list/item.9308/
  • how do i extract with regex the links that end in a string, a period, and a number
  • how do i write the last part of each link to a text file as string, colon, and number

https://www.example.org/list/item.9308/

from each page i want to extract the last part of those addresses, "item.9308", and save it as

item:9308

etc
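
For the first question in the list above, a minimal sketch of reading a text file line by line could look like this. It assumes the attached list is saved as "links sources.txt" next to the script; the variable names are only illustrative.

```autoit
; Assumed location: the attached list saved next to the script
Local $sListFile = @ScriptDir & "\links sources.txt"

; FileReadToArray returns a 0-based array holding one line per element
Local $aLines = FileReadToArray($sListFile)
If @error Then
    ConsoleWrite("Could not read " & $sListFile & @CRLF)
    Exit 1
EndIf

For $i = 0 To UBound($aLines) - 1
    ; Treat a line as a page link only if it starts with http:// or https://
    If StringRegExp($aLines[$i], "^\s*https?://") Then
        ConsoleWrite("page link: " & StringStripWS($aLines[$i], 3) & @CRLF)
    EndIf
Next
```

Lines that are not links (such as a stray "etc") are simply skipped by the check inside the loop.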

 


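You could just use string functions, for example:

; Sample links, with and without a trailing slash
Global $g_aLinks[] = ["https://www.example.org/list/item.9308/", "https://www.example.org/list/item.9309", _
        "https://www.example.org/list/item.9310", "https://www.example.org/list/item.9311", "https://www.example.org/list/item.9312"]
Global $g_iOccurrence
For $i = 0 To UBound($g_aLinks) - 1
    ; Count "/" from the right: skip the final one when the link ends with "/"
    $g_iOccurrence = (StringRight($g_aLinks[$i], 1) = "/") ? -2 : -1
    ; Print everything after that "/", for example "item.9309" (a trailing "/" in the link is kept)
    ConsoleWrite(StringTrimLeft($g_aLinks[$i], StringInStr($g_aLinks[$i], "/", 0, $g_iOccurrence)) & @CRLF)
Next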

 


i don't have the links i'm trying to extract (item.9308); i only have the links (https://www.example.org/list/?page=1) to the pages that contain them

1 page link, 20 extractions

i don't know how the links end

it could be any random string followed by a period and any number
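
A pattern for "any string, a period, any number" at the end of a link can be sketched as below. The HTML fragment is made up for illustration, and the pattern assumes the links appear in double-quoted href attributes, which may not hold for the real site.

```autoit
; Made-up HTML fragment standing in for the source of one list page
Local $sHtml = '<a href="https://www.example.org/list/item.9308/">A</a>' & @CRLF & _
        '<a href="https://www.example.org/list/thing.12/">B</a>' & @CRLF & _
        '<a href="https://www.example.org/list/?page=3">Next</a>'

; Flag 3 returns every capture of every match in one flat array:
; name, number, name, number, ...
Local $aFound = StringRegExp($sHtml, 'href="[^"]*/([^/."]+)\.(\d+)/?"', 3)
If @error Then
    ConsoleWrite("No matching links found" & @CRLF)
    Exit 1
EndIf

; Join each name/number pair with a colon, as in "item:9308"
For $i = 0 To UBound($aFound) - 1 Step 2
    ConsoleWrite($aFound[$i] & ":" & $aFound[$i + 1] & @CRLF)
Next
```

With this sample the two item links should match and the "?page=3" link should not, because it does not end in a period followed by digits.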


i have this complete code that needs only a regex and a way to browse all the links from "links sources.txt"

 

#include <File.au3>
#include <Array.au3>
#include <MsgBoxConstants.au3>
#include <IE.au3>

_Example()
Func _Example()
    ; Error monitoring. This will trap all COM errors while alive.
    ; This particular object is declared as local, meaning after the function returns it will not exist.
    Local $oErrorHandler = ObjEvent("AutoIt.Error", "_ErrFunc")
    Local $oIE = _IE_Example("basic")
    Local $aWordst = _IEBodyReadText($oIE)
    Local $oDictionary = ObjCreate("Scripting.Dictionary")
    Local $mypath = @ScriptDir
    Local $aFiles = _FileListToArrayRec($mypath, "links sources.txt", 1, 1)


    If @error Then
        MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
        Exit
    Else
        MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
    EndIf

    Local $aWords
    For $i = 1 To $aFiles[0]
        $aWords = StringRegExp(FileRead($aFiles[$i]), "(?mi)^\s*(@.*)$", 3) ; change pattern to fit your definition of "word"
        Local $iError = @error
        If $iError = 0 Then
            For $Word In $aWords
                $oDictionary.add($Word, $Word)
            Next
        Else
            ;;MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
        EndIf
    Next

    $aWords = $oDictionary.Items
    FileWrite("saved result2.txt", _ArrayToString($aWords, @CRLF))

EndFunc   ;==>_Example


; User's COM error function. Will be called if COM error occurs
Func _ErrFunc($oError)
    ; Do anything here.
    ConsoleWrite(@ScriptName & " (" & $oError.scriptline & ") : ==> COM Error intercepted !" & @CRLF & _
            @TAB & "err.number is: " & @TAB & @TAB & "0x" & Hex($oError.number) & @CRLF & _
            @TAB & "err.windescription:" & @TAB & $oError.windescription & @CRLF & _
            @TAB & "err.description is: " & @TAB & $oError.description & @CRLF & _
            @TAB & "err.source is: " & @TAB & @TAB & $oError.source & @CRLF & _
            @TAB & "err.helpfile is: " & @TAB & $oError.helpfile & @CRLF & _
            @TAB & "err.helpcontext is: " & @TAB & $oError.helpcontext & @CRLF & _
            @TAB & "err.lastdllerror is: " & @TAB & $oError.lastdllerror & @CRLF & _
            @TAB & "err.scriptline is: " & @TAB & $oError.scriptline & @CRLF & _
            @TAB & "err.retcode is: " & @TAB & "0x" & Hex($oError.retcode) & @CRLF & @CRLF)
EndFunc   ;==>_ErrFunc

 


So simple...
Use InetRead to get the source code of the numbered pages. You don't need a txt file, just use a For/Next loop.
Then use a regex to extract the data you want from these texts.
It could be something like this - obviously untested, and raw (no error checking etc.)

$list = ""
$base_url = "https://www.example.org/list/"
For $i = 1 To 374
    ; InetRead returns binary data, so convert it to a string before matching
    $txt = BinaryToString(InetRead($base_url & "?page=" & $i))
    ; Capture the "name.number" part that follows the base URL
    $items = StringRegExp($txt, '\Q' & $base_url & '\E(\w+\.\d+)', 3)
    For $k = 0 To UBound($items) - 1
        $list &= $items[$k] & @CRLF
    Next
Next
FileWrite(".\results.txt", $list)

 


what do i put in $list = ""


As I don't know the site you are dealing with, the code I provided is nothing but a roadmap. You have to understand what the various instructions mean, so have a look at the documentation.
To test, first try page 1 only, using this:

; $list starts empty and collects one extracted item per line
$list = ""
$base_url = "https://www.example.org/list/"
; InetRead returns binary data, so convert it to a string before matching
$txt = BinaryToString(InetRead($base_url & "?page=1"))
$items = StringRegExp($txt, '\Q' & $base_url & '\E(\w+\.\d+)', 3)
For $k = 0 To UBound($items) - 1
    $list &= $items[$k] & @CRLF
Next
FileWrite("results.txt", $list)

If it works as intended, then try the next step, on several pages.
If it doesn't, then the helpfile is your best friend :)
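
Putting the steps from this thread together, a rough and untested sketch that reads the page links from a file instead of counting pages might look like this. The file names, the href pattern, and the UTF-8 conversion are assumptions, not details of the real site.

```autoit
#include <InetConstants.au3>

; Assumed file names; adjust to your own setup
Local $sListFile = @ScriptDir & "\links sources.txt"
Local $sOutFile = @ScriptDir & "\results.txt"

; One page link per line
Local $aPageLinks = FileReadToArray($sListFile)
If @error Then
    ConsoleWrite("Could not read " & $sListFile & @CRLF)
    Exit 1
EndIf

Local $sList = "", $sHtml, $aFound
For $i = 0 To UBound($aPageLinks) - 1
    ; InetRead returns binary data; flag 4 of BinaryToString decodes it as UTF-8
    $sHtml = BinaryToString(InetRead($aPageLinks[$i], $INET_FORCERELOAD), 4)
    If $sHtml = "" Then ContinueLoop ; download failed or the page was empty

    ; Flat array of name, number, name, number, ...
    $aFound = StringRegExp($sHtml, 'href="[^"]*/([^/."]+)\.(\d+)/?"', 3)
    If @error Then ContinueLoop ; no item links on this page

    For $k = 0 To UBound($aFound) - 1 Step 2
        $sList &= $aFound[$k] & ":" & $aFound[$k + 1] & @CRLF
    Next
Next

; FileWrite appends if the file already exists
FileWrite($sOutFile, $sList)
```

As suggested above, it is worth trying this on a list file with a single page link first. The same item may be listed more than once if a page links to it several times, so the result may need de-duplicating.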


  • Moderators

vinnyMS1,

When you reply in future, please use the "Reply to this topic" button at the top of the thread or the "Reply to this topic" editor at the bottom rather than the "Quote" button - responders know what they wrote and it just pads the thread unnecessarily. Thanks in advance for your cooperation.

M23
