tzeus

Regular Expression and array parsing

13 posts in this topic

#1 ·  Posted (edited)

Hello,

I have a web page with many links on it.

I am trying to collect those links into an array, then loop over them, visit each one, grab another link from the page it leads to, and populate a second array.

To explain:

I am visiting page A, which has a list of contact links in the following format:

<a href=javascript:showContact('12345678')>Doe, John</a>

So far the contact numbers I have seen are 7-8 digits long, but I think they might range from 6 to 9.

I want to read every link of this format and build an array of them, then go through each one, navigate to the link, and get another link from that page:

<a href="mailto:address@goesHere.com"></a>

I then want to collect all the email addresses into another array and save them to a file.

What would the regular expressions for the first and second link look like?

Can anyone help me with this piece of code? I have no idea how to do it. I know I have to use some loops, but I couldn't figure them out so far.

Thank you

Edited by tzeus




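#include <String.au3>
#include <Array.au3>
; Sample string taken from the question
$string = '<a href="mailto:address@goesHere.com"></a>'
; _StringBetween returns an array of every substring found between the two delimiters
$address = _StringBetween($string, '<a href="mailto:', '"></a>')
_ArrayDisplay($address)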


#include <ByteMe.au3>


For the first one it would be along the lines of

$sStr = BinaryToString(INetRead($sURL))
$aLinks1 = StringRegExp($sStr, "(?i)showcontact.+\x27(\d+)\x27.+?>(.+)</a>", 4)
For $i = 0 To Ubound($aLinks1)
    MsgBox(0, "Result", "Number: " & $aLinks1[$i][1] & @CRLF & "Name: " & $aLinks1[$i][2]);; you can use _ArrayDisplay here instead
Next

The second is a bit simpler.

Here $sURL would point to the second page:

$sStr = BinaryToString(INetRead($sURL))
$aLinks2 = StringRegExp($sStr, "(?i)mailto:(.+?)\x22.*>", 3)
Remember that the $aLinks2 array will be 0-based.

The _StringBetween method works but that's just a wrapper for a RegEx anyway so you might as well use the correct RegEx.


George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


I don't understand why, but it's not working.

I also tried to do it this way:

$oIE = _IEAttach ("PAge NAme")
;$sURL = "linkHere"
;On the first page, the one with the contacts
;$sStr = BinaryToString(INetRead($sURL))
$sStr = _IEDocReadHTML($oIE)
$aLinks1 = StringRegExp($sStr, "(?i)showcontact.+\x27(\d+)\x27.+?>(.+)</a>", 4)
For $i = 0 To Ubound($aLinks1)
    ;MsgBox(0, "Result", "Number: " & $aLinks1[$i][1] & @CRLF & "Name: " & $aLinks1[$i][2]);; you can use _ArrayDisplay here instead
Next
_ArrayDisplay($aLinks1)
MsgBox(0, "Titlu", UBound($aLinks1))

The array has 100 elements (0-99), but they are all empty. Maybe something is wrong with the regex?

Also, the method George provided may not work because I can only visit $sUrl when I am logged in, and I don't know how InetRead works.

But in both cases the array has 100 empty ("") elements.

Any insight on that?


Has to be something on the site. Maybe logon related.

Add a line

ClipPut($sStr)

after the

$sStr = BinaryToString(INetRead($sURL))

and see what you are getting from the page.

I re-tested that expression based on the example you gave and it's fine.


George



If I am not logged in via Internet Explorer, I get the login page and the array stays empty.

However, if I am logged in, $sStr contains all the HTML of $sUrl, and the array is created with 100 elements, all of them empty.

The <tr> for a contact looks like this:

<tr class='alt1'>
<td class='tblrowLeft'>
<input type='checkbox' class='checkbox' onclick='checkboxClicked()' name='ids' value='31669365' id='A_31669365' /> </td>
<td class='tblrow' style='text-align:center;'> </td>
<td class='tblrow nowrap' title='Chief Executive Officer'>Chief Executive Officer</td>
<td class='tblrow' title='Edell, David'><span class='nowrap'/>
<a href=javascript:showContact('31669365')>Edell, David</a></span>
</td>
<td class='tblrow' title='CCA Industries'><a class='nowrap' href='/id31765/cca_industries_company.xhtml?uid=12375093&tok=1319641896034-3215411191676879867'>CCA Industries</a></td>
<td class='tblrow' title='East Rutherford'><span class='nowrap'/>East Rutherford</span></td>
<td class='tblrow' title='NJ'><span class='nowrap'/>NJ</span></td>
<td class='tblrow' nowrap>02/01/11</td>
</tr>


#8 ·  Posted (edited)

Hi,

The second pattern from GEOSoft is missing a question mark, which is why the greedy .* now grabs everything up to the last > it can reach.

It should be like this:

"(?i)mailto:(.+?)\x22.*?>"

or

"(?i)mailto:(.+?)\x22[^>]*>"
Edited by Robjong
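
The difference between the greedy and the bounded tail can be seen on a small sample. This is only a sketch: the sample string and its two addresses are made up for illustration and do not come from the real page.

```autoit
; Hypothetical sample: two mailto links on a single line.
Local $sHtml = '<a href="mailto:first@example.com">A</a> <a href="mailto:second@example.com">B</a>'

; Greedy tail (.*>) runs on to the last ">" of the line,
; so the second link is consumed by the first match.
Local $aGreedy = StringRegExp($sHtml, "(?i)mailto:(.+?)\x22.*>", 3)

; Bounded tail ([^>]*>) stops at the first ">" after the address,
; leaving the second link available for the next match.
Local $aBounded = StringRegExp($sHtml, "(?i)mailto:(.+?)\x22[^>]*>", 3)

ConsoleWrite("Greedy tail matches:  " & UBound($aGreedy) & @CRLF)  ; 1
ConsoleWrite("Bounded tail matches: " & UBound($aBounded) & @CRLF) ; 2
```

When every mailto link sits on its own line the two patterns should return the same addresses, which may be why the problem is easy to miss.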


I didn't get to the second part yet. The first step is not working (not necessarily the pattern itself), and without it I cannot move on to the second part.

So the part that is not working is the one with showContact.


#10 ·  Posted (edited)

That is because when the flag for StringRegExp is 4 it returns an array filled with arrays, not a 2-dimensional array.

The pattern works but could also use a tweak.

$aLinks1 = StringRegExp($sStr, "(?i)showContact\('(\d+)'\)>(.+?)</a>", 4)
If IsArray($aLinks1) Then
    For $i = 0 To Ubound($aLinks1) - 1 ; last element is not the same as UBound because the array is zero based
        $array = $aLinks1[$i] ; get the array for this entry
        ConsoleWrite("Number: " & $array[1] & "  Name: " & $array[2] & @CRLF);; you can use _ArrayDisplay here instead
    Next
EndIf

Edited by Robjong
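
To make the "array of arrays" shape concrete, here is a minimal sketch that inspects what flag 4 returns. The sample string reuses the link format from the thread; the variable names are only illustrative.

```autoit
; Sample link in the format shown earlier in the thread.
Local $sHtml = "<a href=javascript:showContact('31669365')>Edell, David</a>"
Local $aLinks1 = StringRegExp($sHtml, "(?i)showContact\('(\d+)'\)>(.+?)</a>", 4)

If IsArray($aLinks1) Then
    ; Passing 0 as the dimension asks UBound for the number of dimensions.
    ConsoleWrite("Outer dimensions: " & UBound($aLinks1, 0) & @CRLF) ; 1, so [$i][1] is invalid

    ; Each outer element is itself an array: [0] full match, [1] number, [2] name.
    Local $aMatch = $aLinks1[0]
    ConsoleWrite("Inner elements: " & UBound($aMatch) & @CRLF) ; 3
    ConsoleWrite("Number: " & $aMatch[1] & "  Name: " & $aMatch[2] & @CRLF)
EndIf
```

Because the outer array has only one dimension, indexing it with two subscripts raises the "incorrect number of subscripts" error; copying the inner array out first avoids that.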


#11 ·  Posted (edited)

I keep going wrong somewhere.

The code above works; I tested it.

However, I am unable to extract the required numbers into an array outside the loop. Where do I go wrong?

For $i=0 to UBound($aLinks1) - 1
$numbers = $aLinks1[$i][1]
Next

This always gives me:

==> Array variable has incorrect number of subscripts or subscript dimension range exceeded.:

$numbers = $aLinks1[$i][1]

$numbers = ^ ERROR

I don't get it.

$aLinks is like this:

$aLinks[0][0] = showContact('34474726')>Bakhach, Nadim</a>

$aLinks[0][1] = 18404175

$aLinks[0][2] = Bakhach, Nadim

How do I extract only the number? This looks like a simple problem, but I don't get it.

EDIT:

I managed to do it, but I still don't know why I get that error. Could someone clear that up for me?

$numbers[$i] = $aLinks1[$i][1]   ; Why is this not working?

Am I not allowed to add an item to an array if it comes from a 2D array?

The way I did it is:

Global $numbers[100] ; fixed size: assumes at most 100 matches
If IsArray($aLinks1) Then
    For $i = 0 To UBound($aLinks1) - 1
        $array = $aLinks1[$i] ; inner array for this match
        $numbers[$i] = $array[1] ; first captured group, the contact number
    Next
EndIf
Edited by tzeus


#12 ·  Posted (edited)

"I managed to do it. But I still don't know why I get that error, could someone clear that up for me?

$numbers[$i] = $aLinks1[$i][1]   ; Why is this not working?
Am I not allowed to add an item to an array if it comes from a 2D array?"

I already explained that:

"That is because when the flag for StringRegExp is 4 it returns an array filled with arrays, not a 2-dimensional array."

"How do I extract only the number? This looks like a simple problem, but I don't get it."

You should take the time to read the documentation for StringRegExp, and check out the samples provided there.

Take a good look at the descriptions of flag 3, which you will need now, and flag 4, which was used in the snippets above.

Here is the pattern you will need:

"(?i)showContact\('(\d+)'\)>"

Good luck

Edited by Robjong
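
As a rough sketch of how that pattern behaves with flag 3: with a single capturing group, the result is one flat array of contact numbers, so no inner arrays need unpacking and no fixed-size array is required. The two sample links are assumptions built from the format shown in the thread.

```autoit
; Hypothetical two-link sample in the format from the thread.
Local $sHtml = "<a href=javascript:showContact('31669365')>Edell, David</a>" & @CRLF & _
        "<a href=javascript:showContact('12345678')>Doe, John</a>"

; Flag 3: global match, captured groups only, returned as a flat 1D array.
Local $aNumbers = StringRegExp($sHtml, "(?i)showContact\('(\d+)'\)>", 3)

If @error Then
    ; With flag 3, @error is set when nothing matched.
    ConsoleWrite("No contact links found" & @CRLF)
Else
    For $i = 0 To UBound($aNumbers) - 1
        ConsoleWrite("Contact ID: " & $aNumbers[$i] & @CRLF)
    Next
EndIf
```

The array is sized by StringRegExp itself, so it holds exactly as many entries as there are matching links.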


I did it. The "program" works well; it could use some tweaking, but it's OK.

One issue I still have is sending an email using Gmail. Because Gmail relies heavily on JavaScript (I think), I have no way of telling when loading ends, so _IELoadWait does not work. However, I've put in a Sleep(3200) and so far it's pretty good. If there is a solution to this, please point me to it.

I would like to thank everyone for having the patience and helping me.

Have a nice day!
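
One possible alternative to a fixed Sleep, offered only as a sketch: poll until a known element appears, with a timeout. The helper name _WaitForElementById and the element ID in the usage note are hypothetical; the real ID would have to be found by inspecting the page, and this has not been tested against Gmail.

```autoit
#include <IE.au3>

; Hypothetical helper: poll the page until an element with the given ID exists.
; Returns the element object, or sets @error = 1 if the timeout expires.
Func _WaitForElementById(ByRef $oIE, $sId, $iTimeoutMs = 10000)
    Local $hTimer = TimerInit()
    Local $oElem
    While TimerDiff($hTimer) < $iTimeoutMs
        $oElem = _IEGetObjById($oIE, $sId)
        If IsObj($oElem) Then Return $oElem
        Sleep(250) ; short pause between checks
    WEnd
    Return SetError(1, 0, 0) ; timed out
EndFunc
```

It could be called as $oBox = _WaitForElementById($oIE, "someElementId") followed by a check of @error, where "someElementId" is a placeholder. While the element is missing, _IEGetObjById may print warnings to the console.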

