
Extract e-mail addresses from HTML


littleclown

Hello.

We have an internal website used for communication. It has a lot of posts, some of which contain e-mail addresses, but they do not follow any specific format.

I need to pull just the e-mail addresses from this site so I can send everybody an invitation to register on the new user system. All I need is a script that extracts the e-mail addresses only.

Something that finds strings matching the e-mail pattern XXXX@XXXX.XXX. This is really a standard e-mail spider, but I don't want to run some strange spam-oriented shareware on the office local network.

I am not sure how to do this.

Thank you in advance.

littleclown

I found this:

#include <String.au3>

$Text = FileRead("email.txt")

$EmailFound = StringRegExp($Text, "([A-Za-z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})", 3)

If @extended = 1 Then
    For $i = 0 To UBound($EmailFound) - 1
        MsgBox(0, "E-Mail", $EmailFound[$i])
    Next
Else
    MsgBox(0, "E-Mail", "No E-Mail addresses found in the supplied text")
EndIf

But this doesn't work. I am missing something, but I can't see where my mistake is.

Melba23

littleclown,

I do not know where that script came from, but the If test after the StringRegExp call is wrong. When the SRE succeeds, @extended = 0, so the test fails and you get the "Not found" message. :( You need to check @error instead.

Change the If structure to read:

If @error = 1 Then ; @error = 1 means StringRegExp found no matches
    MsgBox(0, "E-Mail", "No E-Mail addresses found in the supplied text")
Else
    For $i = 0 To UBound($EmailFound) - 1
        MsgBox(0, "E-Mail", $EmailFound[$i])
    Next
EndIf

I can extract email addresses with no problems using that. :mellow:
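For sending out invitations, a list file is usually handier than one message box per address. The following is only a minimal sketch of that idea, not part of the original script: the output file name emails_found.txt and the variable names are assumptions, and the pattern is the one from the post without the outer brackets.

```autoit
; Sketch only: file names below are assumptions, adjust them to your setup.
Local $sInFile = @ScriptDir & "\email.txt"          ; saved page text (assumed name)
Local $sOutFile = @ScriptDir & "\emails_found.txt"  ; hypothetical output file

Local $sText = FileRead($sInFile)
If @error Then
    MsgBox(16, "E-Mail", "Could not read " & $sInFile)
    Exit 1
EndIf

; Flag 3 returns an array holding every match in the text
Local $aFound = StringRegExp($sText, "[A-Za-z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}", 3)
If @error Then
    MsgBox(0, "E-Mail", "No e-mail addresses found in the supplied text")
    Exit
EndIf

; Keep each address once; $sSeen is a line-feed delimited list of addresses so far
Local $sSeen = @LF, $sMail
For $i = 0 To UBound($aFound) - 1
    $sMail = StringLower($aFound[$i])
    If Not StringInStr($sSeen, @LF & $sMail & @LF) Then $sSeen &= $sMail & @LF
Next

; Drop the leading delimiter and append the list to the output file
FileWrite($sOutFile, StringTrimLeft($sSeen, 1))
MsgBox(0, "E-Mail", "Wrote the unique addresses to " & $sOutFile)
```

Addresses are lower-cased before comparison so that the same mailbox written with different capitalisation is listed once. Note that the {2,4} at the end of the pattern will miss top-level domains longer than four letters.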

M23


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind.

My UDFs:

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

littleclown

Thank you very much!

It works now.

littleclown

Could somebody modify this to do the same thing for URLs?

Melba23

littleclown,

Please post some examples of your data including the URLs you want to extract together with their surrounding characters so we can try and develop a SRE pattern for you. As URLs can vary in format quite a bit, trying to get a sense of how they are located in the data is vital. :mellow:

M23


littleclown

Yes, I know. The simplest example I can give you is <a href="http://someaddress/somepage.html">, but this pattern is not universal; sometimes the URLs look different.

Let me specify what I need if we forget that this is a URL and treat it as just a string. I need to extract every string that begins with "http://", or that comes right after href=, href=" or href=', and that ends just before a space, a double quote ("), a single quote ('), a ">" or some other symbol (I can add more later; basically every symbol that cannot appear in a valid URL).

I think this is what I need.

Thanks for your help!

Melba23

littleclown,

Try this:

$sText = 'rubbish_text<a href="http://someaddress/somepage.html">rubbish_text'
$sURL = StringRegExpReplace($sText, '(?i).*http:(.+)">.*', 'http:$1')
MsgBox(0, "", $sURL)

Explanation -

Pattern:

(?i) = Case insensitive

.* = any number of characters

http: = literal text

(.+) = capturing group of at least one character (capturing group means we can use it later, as you will see)

"> = literal string

.* = any number of characters

Replacement:

http: = literal string

$1 = first capturing group (what we had in brackets in the pattern)

Now as long as the URL starts with "http:" and the tag ends with the two characters "> (which I hope is all the time) you should be fine! :mellow:
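The replace approach hands back one URL for the string it is given. If a page holds several links, a global match is one way to collect them all. This is only a rough sketch: the sample text and the variable names are made up for illustration, and the pattern simply assumes a URL runs from http:// up to the first whitespace, quote or angle bracket.

```autoit
; Hypothetical sample text with two links
Local $sText = '<a href="http://someaddress/somepage.html">one</a> ' & _
        '<a href="http://someaddress/otherpage.html">two</a>'

; Inside a single-quoted AutoIt string, '' stands for one literal '
; Flag 3 returns an array of every match
Local $aURLs = StringRegExp($sText, '(?i)http://[^\s"''<>]+', 3)

If @error Then
    ConsoleWrite("No URLs found" & @CRLF)
Else
    For $i = 0 To UBound($aURLs) - 1
        ConsoleWrite($aURLs[$i] & @CRLF)
    Next
EndIf
```

Because the pattern has no capturing group, each array element is the whole match, so the http:// prefix is already included.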

M23

Edit: Added the explanation.

Fulano

It looks like you've already described most of the regular expression:

(?i)(?:href=["']|http://)+([^'>"]+)

Breaking it down:

(?:href=["']|http://)+ = looks for either href= followed by a ' or ", or http://, one or more times in a row (round brackets group the alternatives; square brackets would make a character class instead)
([^'>"]+) = store as many characters as you can that are not: ' > "

So: <a href="http://someaddress/somepage.html">

Becomes: someaddress/somepage.html

You'll want to prefix them with http://, but that is fairly trivial.

Hope this helps

Almost forgot: (?i) makes it case insensitive
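To show how such a pattern might be used with StringRegExp, here is a small sketch. The pattern is an adaptation and an assumption on my part: it uses a non-capturing group repeated one or more times, so that href=" and http:// are both skipped before capturing, and it also stops at whitespace. The sample text and variable names are invented for the example.

```autoit
; Hypothetical sample: one link inside a tag, one bare URL in the text
Local $sHtml = '<a href="http://someaddress/somepage.html">link</a> see http://example.org/x too'

; '' inside the single-quoted string is one literal '
; With flag 3 and one capturing group, the array holds the captured parts
Local $aHits = StringRegExp($sHtml, '(?i)(?:href=["'']|http://)+([^''">\s]+)', 3)

If Not @error Then
    For $i = 0 To UBound($aHits) - 1
        ; The captured text has no scheme, so put http:// back in front
        ConsoleWrite("http://" & $aHits[$i] & @CRLF)
    Next
EndIf
```

One caveat: a relative link such as href="page.html" would also be captured and then wrongly prefixed with http://, so the results are worth checking before use.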



littleclown

Thank you all for your replies and for the explanations about regular expressions. I am new to this, and it would be great if next time I could write my own regular expression without having to ask you. :mellow:
