pushkar_nagle

How can I get URLs from an HTML page?

11 posts in this topic

Hi,

I would like to extract the URLs from the HTML code below:

</div>
        <div id="ctl00_middleContentPlaceHolder_ctrlAssets_pnlDetailAssetViewLists">
            
            
            
 
                <div id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_pnlList">
                
                    <input type="hidden" name="ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$cpeURLs_ClientState" id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_cpeURLs_ClientState" />
 
                    <div id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_pnlurlsTitle" class="vorbsgBoxHeading">
                    
                        <img id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_btnExpandWebsiteURLs" src="" style="border-width:0px;" />
                        <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_lblURLTitle"></span>
                    
                </div>
 
                    <div id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_pnlUrls">
                    
                        <div>
                        <table class="vorbsgTable vorbsgTableNoMargin" cellspacing="0" rules="all" border="1" id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs" style="border-collapse:collapse;">
                            <tr>
                                <th class="wordbreak" scope="col">Link to Site(s)</th><th scope="col">Parent</th><th scope="col">Original <br/> Website ID</th><th scope="col">URL Brand</th><th scope="col">URL Region</th><th scope="col">DDOS<br/>Protection<br/>in Place</th><th scope="col">Status</th><th scope="col">&nbsp;</th><th scope="col">&nbsp;</th>
                            </tr><tr>
                                <td class="wordbreak">
                                        <a href='intermediaries-ie.staging.ubserver.avbdev.com' title='intermediaries-ie.staging.ubserver.avbdev.com'
                                            target="_blank">
                                            <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl02_lblURL">intermediaries-ie.staging.ubserver. ...</span></a>
                                    </td><td>
                                        <span disabled="disabled" title="Shows which website will be displayed as the parent when searching. Only one item can be ticked at a time."><input id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl02_chkParentToDisplay" type="checkbox" name="ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl02$chkParentToDisplay" disabled="disabled" /></span>
                                    </td><td>
                                        <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl02_lblOriginalWebsiteID"></span>
                                    </td><td>
                                        <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl02_lblBrand"></span>
                                    </td><td>
                                        <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl02_lblRegion"></span>
                                    </td><td>
                                        <span disabled="disabled" title="Checked if DDOS protection is in place"><input id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl02_chkDDOS_Protection_in_Place" type="checkbox" name="ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl02$chkDDOS_Protection_in_Place" disabled="disabled" /></span>
                                    </td><td>Active</td><td>
                                        <a id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl02_lnkEditURL" AccessibleHeaderText="Edit user" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl02$lnkEditURL&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, true))">View</a>
                                    </td><td>
                                        <input type="image" name="ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl02$btnDeleteURL" id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl02_btnDeleteURL" title="Delete (non-display items only)" src="../images/delete_trash_can_small.jpg" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl02$btnDeleteURL&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, false))" style="border-width:0px;" />
                                    </td>
                            </tr><tr class="vorbsgTableAltRow">
                                <td class="wordbreak">
                                        <a href='http://www.ulsterbankintermediaries.ie/' title='http://www.ulsterbankintermediaries.ie/'
                                            target="_blank">
                                            <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl03_lblURL">http://www.ulsterbankintermediaries ...</span></a>
                                    </td><td>
                                        <span disabled="disabled" title="Shows which website will be displayed as the parent when searching. Only one item can be ticked at a time."><input id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl03_chkParentToDisplay" type="checkbox" name="ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl03$chkParentToDisplay" disabled="disabled" /></span>
                                    </td><td>
                                        <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl03_lblOriginalWebsiteID"></span>
                                    </td><td>
                                        <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl03_lblBrand">Ulster Bank</span>
                                    </td><td>
                                        <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl03_lblRegion">UK & Ireland</span>
                                    </td><td>
                                        <span disabled="disabled" title="Checked if DDOS protection is in place"><input id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl03_chkDDOS_Protection_in_Place" type="checkbox" name="ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl03$chkDDOS_Protection_in_Place" disabled="disabled" /></span>
                                    </td><td>Active</td><td>
                                        <a id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl03_lnkEditURL" AccessibleHeaderText="Edit user" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl03$lnkEditURL&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, true))">View</a>
                                    </td><td>
                                        <input type="image" name="ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl03$btnDeleteURL" id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl03_btnDeleteURL" title="Delete (non-display items only)" src="../images/delete_trash_can_small.jpg" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl03$btnDeleteURL&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, false))" style="border-width:0px;" />
                                    </td>
                            </tr><tr>
                                <td class="wordbreak">
                                        <a href='http://intermediariesuat.ulsterbank.ie/' title='http://intermediariesuat.ulsterbank.ie/'
                                            target="_blank">
                                            <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl04_lblURL">http://intermediariesuat.ulsterbank ...</span></a>
                                    </td><td disabled="disabled">
                                        <span disabled="disabled" title="Shows which website will be displayed as the parent when searching. Only one item can be ticked at a time."><input id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl04_chkParentToDisplay" type="checkbox" name="ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl04$chkParentToDisplay" checked="checked" disabled="disabled" /></span>
                                    </td><td>
                                        <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl04_lblOriginalWebsiteID"></span>
                                    </td><td>
                                        <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl04_lblBrand">Ulster Bank</span>
                                    </td><td>
                                        <span id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl04_lblRegion">UK & Ireland</span>
                                    </td><td>
                                        <span disabled="disabled" title="Checked if DDOS protection is in place"><input id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl04_chkDDOS_Protection_in_Place" type="checkbox" name="ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl04$chkDDOS_Protection_in_Place" disabled="disabled" /></span>
                                    </td><td>Active</td><td>
                                        <a id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl04_lnkEditURL" AccessibleHeaderText="Edit user" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl04$lnkEditURL&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, true))">View</a>
                                    </td><td disabled="disabled">
                                        <input type="image" name="ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$grdURLs$ctl04$btnDeleteURL" disabled="disabled" id="ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs_ctl04_btnDeleteURL" title="Can not delete since parent" src="../images/delete_trash_can_small.jpg" style="border-width:0px;" />
                                    </td>
                            </tr>
                        </table>
                    </div>

I am writing a script that automatically sends an email containing the URLs from the code above.

The code I have written so far is:

Local $oForm = _IEFormGetObjByName($oIE, "aspnetForm")
Local $oTag = _IEFormElementGetObjByName($oForm, "ctl00$middleContentPlaceHolder$ctrlAssets$ctrlWebsites$websiteLoginView$cpeURLs_ClientState")
Local $URL = _IEPropertyGet($oTag, "innertext")

 




Do you mean you want the URL of the page currently displayed by your browser (in this case, Internet Explorer)?

If so, just use the _IEPropertyGet function from IE.au3 like this:

_IEPropertyGet($oIE, "locationurl")

where $oIE is your Internet Explorer object.

That will return a String containing the URL of the page IE is currently displaying.


No, I want the URLs in the href attributes of the HTML code above.


Assuming all your href attributes belong to <a> tags, you can indeed use AutoBert's solution.

There is also this solution, which should work:

; Get a collection of all the <a> tags on the document
Local $aTagsCollection = _IETagNameGetCollection($oIE, "a")

; Loop through those tags
For $aTag In $aTagsCollection

    ; Get the 'href' attribute
    Local $href = $aTag.href

    ; Then do whatever you want with the 'href' attribute
    ; ...

Next

 


Using IE.au3, I suggest _IELinkGetCollection as the easiest solution.


The code above returns all the links in the document.

I want only the links from the HTML code I pasted above.

16 minutes ago, pushkar_nagle said:

The code above returns all the links in the document.

I want only the links from the HTML code I pasted above.

It returns all the links in your HTML snippet, but you have to feed it only that snippet, like this:

; Open a blank browser, insert an HTML snippet, get the link collection,
; then loop through the items and display the associated link URLs

#include <IE.au3>
#include <MsgBoxConstants.au3>

$sSnippet = FileRead('HTML.txt')

$oIE = _IECreate()
_IEBodyWriteHTML($oIE, $sSnippet)
ConsoleWrite('_IEBodyWriteHTML error: ' & @error & ' extended: ' & @extended & @CRLF)
$oLinks = _IELinkGetCollection($oIE)
$iNumLinks = @extended

$sTxt = $iNumLinks & " links found" & @CRLF & @CRLF
For $oLink In $oLinks
    $sTxt &= $oLink.href & @CRLF
Next
MsgBox($MB_SYSTEMMODAL, "Link Info", $sTxt)
_IEQuit($oIE)

The file HTML.txt contains the HTML code you pasted above.


Apologies if I was not clear enough.

The HTML code I pasted above is part of a document that contains many links.

But I want only the links that exist under the table with ID 'ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs'.

I have written the code below to get all the table tags and select that particular table. It should then get the 'a' tags from this table and display their links.

But it's not giving any output. Is this the correct way?

#include <IE.au3>

Local $oIE = _IECreate("http://lonms07231.fm.rbsgrp.net/WebSecurity/Assets.aspx?AssetID=900961")
; Get a collection of all the <table> tags on the document
Local $tableTagsCollection = _IETagNameGetCollection($oIE, "table")
; Get a collection of all the <a> tags on the document
Local $aTagsCollection = _IETagNameGetCollection($oIE, "a")

For $tableTag In $tableTagsCollection
;Select a particular table tag
   If $tableTag = ("ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs") Then
      For $aTag In $aTagsCollection
         ; Get the 'href' attribute
         Local $href = $aTag.href
         ; Display links
         ConsoleWrite($href & @CRLF)
      Next
   EndIf
Next

 


As there is only a placeholder for future URLs, you have to trigger the event that fills the placeholder with the real URLs. Therefore the URL, or at least the complete source of the site to automate, is needed.
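
Once the table is actually present in the loaded page, the search can be limited to it. Note that the attempt above compares the table object itself to a string, which never matches, and then loops over every <a> tag in the document instead of only those inside the table. The following is a minimal sketch, not a tested solution for this site: it assumes the table keeps the ID shown in the question, that the page has finished loading when _IECreate returns, and that the "View" links should be skipped because their hrefs are javascript: postbacks. The variable names are made up for illustration.

```autoit
#include <IE.au3>

; URL taken from the question; replace it with the page to automate.
Local $oIE = _IECreate("http://lonms07231.fm.rbsgrp.net/WebSecurity/Assets.aspx?AssetID=900961")

; ID of the table, as it appears in the HTML posted above.
Local $sTableId = "ctl00_middleContentPlaceHolder_ctrlAssets_ctrlWebsites_websiteLoginView_grdURLs"

; Fetch the table element directly by its ID instead of looping over all tables.
Local $oTable = _IEGetObjById($oIE, $sTableId)
Local $iErr = @error

If $iErr Or Not IsObj($oTable) Then
    ConsoleWrite("Table not found (error " & $iErr & ")" & @CRLF)
Else
    ; Standard DOM call: returns only the <a> elements inside this table.
    Local $oAnchors = $oTable.getElementsByTagName("a")
    For $oAnchor In $oAnchors
        ; Skip the "View" postback links, whose hrefs start with "javascript:".
        If StringLeft($oAnchor.href, 11) <> "javascript:" Then
            ConsoleWrite($oAnchor.href & @CRLF)
        EndIf
    Next
EndIf
```

If "Table not found" is printed, the table has probably not been filled in yet, which matches the placeholder explanation above.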
