Grab downloads from webpage...now broken.



Hello,

Before Broadcom took over Symantec, I was able to use the following code as the base to scrape daily definition downloads from the web. Since the pages were moved over to the Broadcom web servers, I get a page full of what I believe may be JavaScript instead of the fully rendered page that lies behind it. Does anyone have any suggestions as to how I can read the full rendered webpage using AutoIt?

#include <IE.au3>

; Open the definitions page in a hidden IE instance and wait for it to finish loading.
$oIE_DEFS = _IECreate("https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep14", 0, 0, 1, 1)

; Read back the <body> HTML of whatever IE loaded.
$sString_DEFS = _IEBodyReadHTML($oIE_DEFS)
MsgBox(0, "https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep14", $sString_DEFS)

The above code will show the JavaScript, but if you go to the URL in a browser that has JavaScript enabled, you will see the fully rendered page that I would like to access. I hope my question makes sense, and I would appreciate any suggestions to get this working again.

All the best,
-Mike


Scroll the page, and after it finishes loading, use StringRegExp to match all of the href links.

Alternatively, I remember that the big companies often run anonymous FTP servers for downloading tools and other files; try to find out whether Symantec still has one. Google is our friend. Bye
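On the StringRegExp idea: once you have the page source, a single pattern can pull out every href attribute. AutoIt's StringRegExp uses PCRE-style syntax, so the same pattern is easy to prototype elsewhere; here is a sketch in Python for quick testing (the sample HTML and file names are made up for illustration, not taken from the Broadcom page):

```python
import re

# Stand-in for the page source returned by the scrape (made-up example).
html = '<a href="/files/defs-2020-05-12.exe">x64</a> <a href=\'/files/defs-i32.exe\'>x86</a>'

# Capture the value of every href attribute, whichever quote style is used.
hrefs = re.findall(r'href=["\']([^"\']+)["\']', html)
print(hrefs)
```

The same pattern dropped into StringRegExp with flag 3 ($STR_REGEXPARRAYGLOBALMATCH) would return all captures as an array.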


<SNIP>

When I ran my example, I thought I saw the rendered HTML but I appear to have been mistaken.  So I removed it.

 

Edited by TheXman
Removed example since it was incorrect.

On 5/12/2020 at 3:29 PM, mdwerne said:

<SNIP>

When I am writing a script to scrape a website, there are several ways I go about it:

1) If the HTML is fully available, then parsing it directly is the easiest way to go about it.

2) I download the file manually in the browser, then copy the download link from the browser itself and check whether the actual download links share a commonality that would let me download the files I need without actually pulling the download link from the web page:

[screenshots: copying the download link from the browser's downloads list]

3) In the event that the file download link doesn't directly reference the file, as in the case of the file at this URL: https://www.sordum.org/files/downloads.php?easy-context-menu

then I use the wget tool and have it handle the redirections, forcing the download to be saved under the file name and extension I know it is supposed to have. So, using the above example, I know the file is supposed to be a ZIP file, so I would run:
wget "https://www.sordum.org/files/downloads.php?easy-context-menu" -O "Path\to\save\file\file.zip"

Oh, also, wget is native to Linux operating systems, but there is a very decent Windows port of wget available here
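If you'd rather not ship an external tool, the same redirect-following download can be sketched with Python's standard library (urlopen follows HTTP redirects automatically); the URL and save path below are just the examples from above, not anything specific to the Broadcom site:

```python
import shutil
from urllib.request import urlopen

def download(url, dest):
    """Fetch url, following any HTTP redirects, and save the body to dest."""
    with urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)

# e.g. download("https://www.sordum.org/files/downloads.php?easy-context-menu",
#               r"C:\temp\easy-context-menu.zip")
```

Note this only helps when the link (after redirects) serves the file directly; like wget, it will not execute JavaScript, so it can't render the Broadcom page itself.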


Thanks for the reply, @MattHiggs. This is something new to try, as I have been unable to get at the files any other way. I haven't played with wget much, but I'll give it a shot. Manually downloading a handful of different definition sets, every night, from this site: https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep14
is getting kinda old.

Again, thanks for your reply, I was at an impasse.

-Mike

P.S. Part of the issue is also that the download file names change a few times a day...which is why I need to scrape the fully rendered (JavaScript-generated) HTML page each time.

Edited by mdwerne
Forgot some info...
