SimonRu Posted March 2, 2020 (edited)

Hi all,

I'm having a bad day (so much so that I accidentally submitted this post without all the content). I'm trying to get all the URLs from <a> tags in a local HTML file and put them in an array. I've tried a few things; this is what I've got left, and I feel like I'm having an off-day.

    Global $cached = FileOpen(@ScriptDir & "\temp\index.html", 0)
    Global $href1 = _StringBetween($cached, '<a href="', '">')
    Global $href2 = _StringBetween($cached, '<a href="', '" >')
    _ArrayDisplay($href1)
    _ArrayDisplay($href2)

Any help or a point in the right direction would be greatly appreciated.

Edited March 2, 2020 by SimonRu
FrancescoDiMuro Posted March 2, 2020

@SimonRu Attaching a sample HTML file, or at least a string you want to extract the text from, would be really helpful to us in order to help you 😅
Danp2 Posted March 2, 2020

@SimonRu FileOpen returns a file handle, not the contents of the file. Maybe you should switch to using FileRead.
SimonRu (Author) Posted March 2, 2020

5 minutes ago, FrancescoDiMuro said: @SimonRu Attaching a sample html or at least a string you want to extract the text from would be really helpful to us in order to help you 😅

I understand. The HTML is a mess, but there are links all over the place in <div> and <p> tags. Here is a modified example of the footer:

    <div class="footer-menu">
      <div class="footer-section">
        <div class="footer-title">
          <a href="https://www.example.com/en_gb/motor.html">Moto</a>
        </div>
        <ul class="footer-options">
          <li> <a href="https://www.example.com/en_gb/moto/moto-helmets.html"> Moto Helmets </a> </li>
          <li> <a href="https://www.example.com/en_gb/moto/moto-helmet-visors.html"> Moto Visors </a> </li>
          <li> <a href="https://www.example.com/en_gb/moto/moto-helmet-bluetooth-audio.html"> Bluetooth Audio </a> </li>
          <li> <a href="https://www.example.com/en_gb/moto/gear.html"> Gear </a> </li>
        </ul>
      </div>
      <div class="footer-section">
        <div class="footer-title">
          <a href="https://www.example.com/en_gb/snow-sports.html"> Sports</a>
        </div>
        <ul class="footer-options">
          <li> <a href="https://www.example.com/en_gb/snow-sports/snowboard-helmets.html"> Helmets </a> </li>
          <li> <a href="https://www.example.com/en_gb/snow-sports/masks.html"> Masks </a> </li>
          <li> <a href="https://www.example.com/en_gb/snow-sports/optics.html"> Optics </a> </li>
          <li> <a href="https://www.example.com/en_gb/snow-sports/visors-mounts.html"> Visors & Mounts </a> </li>
          <li> <a href="https://www.example.com/en_gb/snow-sports/helmet-gaskets.html"> Gaskets </a> </li>
          <li> <a href="https://www.example.com/en_gb/snow-sports/helmet-bluetooth.html"> Bluetooth Audio </a> </li>
        </ul>
      </div>
      <div class="footer-section">
        <div class="footer-title">
          <a href="#">Pages</a>
        </div>
        <ul class="footer-options">
          <li> <a href="https://www.example.com/en_gb/contact/"> Contact </a> </li>
          <li> <a href="https://www.example.com/en_gb/sales/guest/form/"> Guest order tracker </a> </li>
          <li> <a href="https://www.example.com/en_gb/sizing/"> Sizing </a> </li>
          <li> <a href="https://www.example.com/en_gb/returns/"> Returns </a> </li>
          <li> <a href="https://www.example.com/en_gb/payment/"> About payment </a> </li>
        </ul>
      </div>

This is going to be part of a project to crawl links. Some links are like this:

    <a href="https://www.example.com" >text</a>

and others are like this:

    <a href="https://www.example.com">text</a>

Notice the space before the > in the first form.
SimonRu (Author) Posted March 2, 2020

8 minutes ago, Danp2 said: @SimonRu FileOpen returns a file handle, not the contents of the file. Maybe you should switch to using FileRead.

I understand. Thank you. I'm making all sorts of noob slip-ups today.
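Editor's note: Danp2's point can be sketched as follows. This is a minimal example rather than the original poster's final code, and it assumes the same file path as in post #1. It reads the file contents with FileRead and ends each match at the closing quote of the href value, so a single _StringBetween call should cover both the `">` and `" >` forms.

```autoit
#include <Array.au3>
#include <String.au3>

; FileRead accepts a file name directly and returns the file's contents,
; whereas FileOpen only returns a handle.
Local $sHtml = FileRead(@ScriptDir & "\temp\index.html")
If @error Then Exit MsgBox(16, "Error", "Could not read the file.")

; Stop at the closing quote of the attribute value, so it does not matter
; whether the tag continues with '>' or ' >'.
Local $aLinks = _StringBetween($sHtml, '<a href="', '"')
If @error Then Exit MsgBox(48, "No links", "No matching <a> tags were found.")

_ArrayDisplay($aLinks, "Links")
```

Note that this only finds tags where href is the first attribute and uses double quotes, and it will also return placeholder values such as "#".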
Nine Posted March 2, 2020

Try this instead of using the String* functions; it will save you a lot of effort:

    #include <Constants.au3>
    #include <IE.au3>

    Opt("MustDeclareVars", 1)

    ; Write the local file into an IE document so the DOM can be queried
    Local $oIE = _IECreate()
    Local $sHTML = FileRead("YourFile.html")
    _IEDocWriteHTML($oIE, $sHTML)
    _IELoadWait($oIE)

    ; Collect the href of every <a> tag, skipping tags without one
    Local $cTags = _IETagNameGetCollection($oIE, "a")
    Local $sList = "", $sAtt
    For $oTag In $cTags
        $sAtt = $oTag.getAttribute("href")
        If $sAtt = "" Then ContinueLoop
        $sList &= $sAtt & @CRLF
    Next
    MsgBox($MB_SYSTEMMODAL, "", $sList)
mikell Posted March 2, 2020

2 hours ago, Nine said: Try this instead of using String*, will save you a lot of effort

You're joking, I presume. This is a typical example of "how to use regular expressions":

    #include <Array.au3>

    $txt = FileRead("test.html")
    $res = StringRegExp($txt, 'a href="(h[^"]+)', 3)
    _ArrayDisplay($res)

But the attempt at using _StringBetween in post #1 was not so bad ...
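Editor's note: in mikell's pattern, the group `(h[^"]+)` captures everything from a leading "h" up to the closing quote, which is why the `">` versus `" >` difference no longer matters and why `href="#"` is skipped. The sketch below is a slightly more tolerant variant, not part of the thread: it also allows other attributes before href and single-quoted values. The inline sample string, including the `class="x"` attribute, is hypothetical and only there to make the example self-contained.

```autoit
#include <Array.au3>
#include <StringConstants.au3>

; Hypothetical sample input covering the variations discussed above.
Local $sHtml = '<a href="https://www.example.com" >text</a>' & @CRLF & _
        '<a class="x" href=''https://www.example.com/en_gb/contact/''>Contact</a>' & @CRLF & _
        '<a href="#">Pages</a>'

; (?i)            case-insensitive
; <a\s[^>]*?      an <a> tag, optionally with attributes before href
; href\s*=\s*     the attribute name, tolerating spaces around '='
; ["'']           a double or single opening quote ('' is an escaped ' in AutoIt)
; (https?://...)  capture only absolute http/https URLs up to the closing quote
Local $sPattern = '(?i)<a\s[^>]*?href\s*=\s*["''](https?://[^"'']+)'

Local $aLinks = StringRegExp($sHtml, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
If @error Then Exit MsgBox(48, "No links", "The pattern did not match.")

_ArrayDisplay($aLinks, "Links")
```

A regular expression like this is fine for quick extraction from known markup; for arbitrary pages, a DOM-based approach such as Nine's is less likely to be tripped up by unusual formatting.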
Nine Posted March 2, 2020

41 minutes ago, mikell said: You're joking I presume

I am??? Well, if you say so, I must be then.
SimonRu (Author) Posted March 3, 2020

13 hours ago, mikell said: You're joking I presume This is a typical example on "how to use regular expressions"

    #include <Array.au3>

    $txt = FileRead("test.html")
    $res = StringRegExp($txt, 'a href="(h[^"]+)', 3)
    _ArrayDisplay($res)

But the try using _StringBetween in post #1 was not so bad ...

Oh my goodness, that worked a charm! Thank you so much. It had been playing on my mind all night that I couldn't figure it out. Thank you again.
Aether Posted March 3, 2020

I like Nine's approach. It gives somewhat more control over the content of the HTML file.