mdwerne Posted May 12, 2020

Hello,

Before Broadcom took over Symantec, I was able to use the following code as the base to scrape daily definition downloads from the web. Since the pages were moved over to the Broadcom web servers, I get a page full of what I believe may be JavaScript instead of the fully rendered page that lies behind it. Does anyone have any suggestions as to how I can read the fully rendered webpage using AutoIt?

#include <IE.au3>

$oIE_DEFS = _IECreate("https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep14", 0, 0, 1, 1)
$sString_DEFS = _IEBodyReadHTML($oIE_DEFS)
MsgBox(0, "https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep14", $sString_DEFS)

The above code will show the JavaScript, but if you go to the URL in a browser that has JavaScript enabled, you will see the fully rendered page that I would like to access. I hope my question makes sense, and I would appreciate any suggestions to get this working again.

All the best,
-Mike
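For reference, a minimal sketch of the same IE.au3 approach with an explicit wait, on the assumption that the page is able to finish rendering inside Internet Explorer at all (it may never do so if the site requires a modern browser engine):

#include <IE.au3>

; Sketch only - assumes the Broadcom page will eventually render in IE.
; The extra _IELoadWait plus Sleep gives the page's JavaScript a chance to run
; before the body HTML is read back.
Local $sUrl = "https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep14"
Local $oIE = _IECreate($sUrl, 0, 0, 0, 1) ; do not wait inside _IECreate itself
_IELoadWait($oIE)                          ; wait until the document reports complete
Sleep(5000)                                ; arbitrary extra time for script-driven rendering
Local $sHtml = _IEBodyReadHTML($oIE)       ; may still be unrendered markup on script-heavy pages
_IEQuit($oIE)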
faustf Posted May 12, 2020

If you can find the links in the page source, use StringRegExp and match all the href attributes.
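As a rough illustration of that suggestion (a sketch only; the pattern and variable names are assumptions, and $sHtml is assumed to already hold the page source however it was obtained):

; Sketch: extract every href="..." value from an HTML string.
Local $aLinks = StringRegExp($sHtml, '(?i)href="([^"]+)"', 3) ; flag 3 = return all captured groups
If Not @error Then
    For $sLink In $aLinks
        ConsoleWrite($sLink & @CRLF)
    Next
EndIf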
mdwerne (Author) Posted May 12, 2020

Thank you for the suggestion, but unfortunately the links to the files don't appear to be within the JavaScript; they only appear after the page is fully rendered.
faustf Posted May 12, 2020

Scroll the page, and after it has loaded find a link, then use StringRegExp and match all the href attributes. Also, I remember that the big companies often have an anonymous FTP site for downloading tools and other files; try to check whether Symantec has one. Google is our friend. Bye
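If such an anonymous FTP mirror existed, downloading from it in AutoIt would be straightforward. The sketch below uses a purely hypothetical FTP URL, since no actual Symantec/Broadcom FTP address is given in the thread:

; Sketch with a HYPOTHETICAL FTP URL - not a real Symantec/Broadcom address.
; InetGet handles ftp:// URLs as well as http(s)://.
Local $sFtpUrl = "ftp://ftp.example.com/public/av-definitions/latest.exe"
Local $sDest   = @ScriptDir & "\latest.exe"
Local $iBytes  = InetGet($sFtpUrl, $sDest, 1) ; 1 = force reload from the remote site
If $iBytes > 0 Then ConsoleWrite("Downloaded " & $iBytes & " bytes" & @CRLF)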
TheXman Posted May 12, 2020 (edited)

<SNIP> When I ran my example, I thought I saw the rendered HTML, but I appear to have been mistaken, so I removed it.

Edited May 12, 2020 by TheXman: Removed example since it was incorrect.
MattHiggs Posted May 14, 2020

On 5/12/2020 at 3:29 PM, mdwerne said: Before Broadcom took over Symantec, I was able to use the following code as the base to scrape daily definition downloads from the web. ... Does anyone have any suggestions as to how I can read the fully rendered webpage using AutoIt?

When I am writing a script to scrape a website, there are several ways that I go about it:

1) If the HTML is fully available, then that is the easiest way to go about it.

2) I download the file manually in the browser, then copy the download link from the browser itself and see if the actual download links have a commonality that would allow me to download the files I need without actually pulling the download link from the web page.

3) In the event that the file download link doesn't directly reference the file, like in the case of the file at this URL: https://www.sordum.org/files/downloads.php?easy-context-menu then I use the wget tool and have it handle the redirections and force the downloaded file into the file format that I know it is supposed to be in. So, using the above example, I know the file is supposed to be a zip file, so I would run the wget command:

wget https://www.sordum.org/files/downloads.php?easy-context-menu -O "Path\to\save\file\file.zip"

Oh, also, wget is native to Linux operating systems, but there is a very decent Windows port of wget available here.
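To tie that back into the original AutoIt script, the wget call can be driven with RunWait. This is only a sketch; the wget.exe path and the output file name are assumptions, not values from the thread:

; Sketch: call a Windows port of wget from AutoIt (paths and names are assumptions).
Local $sWget = "C:\Tools\wget.exe"                  ; wherever the Windows wget port was installed
Local $sUrl  = "https://www.sordum.org/files/downloads.php?easy-context-menu"
Local $sOut  = @ScriptDir & "\file.zip"             ; force the expected .zip file name
Local $iExit = RunWait('"' & $sWget & '" "' & $sUrl & '" -O "' & $sOut & '"', @ScriptDir, @SW_HIDE)
If $iExit <> 0 Then ConsoleWrite("wget exited with code " & $iExit & @CRLF)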
mdwerne (Author) Posted May 19, 2020 (edited)

Thanks for the reply, @MattHiggs. This is something new to try, as I have been unable to get at the files any other way. I haven't played with wget much, but I'll give it a shot. Manually downloading a handful of different definition sets, every night, from this site: https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep14 is getting kind of old.

Again, thanks for your reply, I was at an impasse.

-Mike

P.S. Part of the issue is also that the download file names change a few times a day, which is why I need to scrape the fully rendered JavaScript (HTML) page each time.

Edited May 19, 2020 by mdwerne: Forgot some info...
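Because the file names change during the day, the end-to-end flow (once the rendered HTML has been obtained by some means) would look roughly like the sketch below. The regular expression and the assumption that the definition links end in .exe are guesses for illustration, not taken from the actual Broadcom page:

; Sketch: given rendered HTML in $sHtml, pull out definition download links and fetch them.
; The href pattern and the .exe extension are assumptions about how the page is built.
Local $aDefs = StringRegExp($sHtml, '(?i)href="([^"]*?\.exe)"', 3)
If Not @error Then
    For $sLink In $aDefs
        Local $sName = StringRegExpReplace($sLink, '^.*/', '') ; file name portion of the URL
        InetGet($sLink, @ScriptDir & "\" & $sName, 1)          ; 1 = force reload from the server
    Next
EndIf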