Extract e-mail addresses from HTML


Hello.

We have an internal website with some communication on it. It has a lot of posts, some of them containing e-mail addresses, but in no consistent format.

I need to pull just the e-mail addresses from this site so I can send everybody an invitation to register on the new user system. All I need is a script that extracts the e-mail addresses only.

Something that finds strings matching the e-mail pattern XXXX@XXXX.XXX. This is really a standard e-mail spider, but I don't want to run some strange spam-oriented shareware on the office local network.

I am not sure how to do this.

Thank you in advance.

I found this:

#include <String.au3>

$Text = FileRead("email.txt")
$EmailFound = StringRegExp($Text, "([A-Za-z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})", 3)

If @extended = 1 Then
    For $i = 0 To UBound($EmailFound) - 1
        MsgBox(0, "E-Mail", $EmailFound[$i])
    Next
Else
    MsgBox(0, "E-Mail", "No E-Mail addressess found in the supplied text")
EndIf

But this doesn't work. I'm missing something, but I can't see where my mistake is.

Edited by littleclown
  • Moderators

littleclown,

I do not know where that script came from, but the If test after the StringRegExp is wrong. If the SRE is successful @extended = 0, which will give you the "Not found" message. :( You need to check for @error instead.

Change the If structure to read:

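; StringRegExp sets @error to 1 when the pattern finds no match
If @error = 1 Then
    MsgBox(0, "E-Mail", "No e-mail addresses found in the supplied text")
Else
    ; With flag 3 the matches come back as a zero-based array
    For $i = 0 To UBound($EmailFound) - 1
        MsgBox(0, "E-Mail", $EmailFound[$i])
    Next
EndIf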

I can extract email addresses with no problems using that. :mellow:
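
Putting the pieces together, a minimal sketch of the whole script might look like the following. It assumes the page text has been saved to email.txt (the file name from the script above). The slightly wider pattern (a + allowed in the local part, no upper limit on the top-level domain length) and the de-duplication through _ArrayUnique are additions for illustration, not part of the original script.

```autoit
#include <Array.au3>

; Assumption: the saved page source is in "email.txt", as in the script above.
Local $sText = FileRead("email.txt")
If @error Then
    MsgBox(16, "E-Mail", "Could not read email.txt")
    Exit
EndIf

; Flag 3 returns every match as a zero-based array.
Local $aEmails = StringRegExp($sText, "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", 3)
If @error Then
    MsgBox(0, "E-Mail", "No e-mail addresses found in the supplied text")
    Exit
EndIf

; _ArrayUnique drops repeats (case-insensitively by default) and
; stores the number of unique entries in element [0].
Local $aUnique = _ArrayUnique($aEmails)
Local $sList = ""
For $i = 1 To $aUnique[0]
    $sList &= $aUnique[$i] & @CRLF
Next
MsgBox(0, "E-Mail", $aUnique[0] & " unique address(es):" & @CRLF & $sList)
```

Dropping duplicates matters here because the same address is likely to appear in many posts, and each person should get only one invitation.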

M23

Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind

My UDFs:

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ---------- Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

  • Moderators

littleclown,

Please post some examples of your data including the URLs you want to extract together with their surrounding characters so we can try and develop a SRE pattern for you. As URLs can vary in format quite a bit, trying to get a sense of how they are located in the data is vital. :mellow:

M23


Yes I know, and I can give you the simplest example, <a href="http://someaddress/somepage.html">, but this pattern is not universal. I mean, sometimes the URLs are different.

Let me specify what I need if we forget this is a URL and treat it as just a string. I need to extract all strings that begin with "http://", or that are preceded by href=, href=" or href=', and that are followed by a space, a double quote ("), a single quote ('), ">" or some other symbol (I can add more later; really any symbol that can't appear in a valid URL).

I think this is what I need.

Thanks for your help!

  • Moderators

littleclown,

Try this:

$sText = 'rubbish_text<a href="http://someaddress/somepage.html">rubbish_text'
$sURL = StringRegExpReplace($sText, '(?i).*http:(.+)">.*', 'http:$1')
MsgBox(0, "", $sURL)

Explanation -

Pattern:

(?i) = Case insensitive
.* = any number of characters
http: = literal text
(.+) = capturing group of at least one character (capturing group means we can use it later, as you will see)
"> = literal string
.* = any number of characters

Replacement:

http: = literal string
$1 = first capturing group (what we had in brackets in the pattern)

Now as long as the URL starts with "http:" and is followed directly by the closing "> of the tag - which I hope is all the time - you should be fine! :mellow:

M23

Edit: Added the explanation.

Edited by Melba23

It looks like you've indicated the majority of the regular expression:

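(?i)(?:href=["']|http://)([^'>"]+)
Breaking it down:

(?:href=['"]|http://) = looks for either href= followed by an ' or ", or http:// (this has to be a non-capturing group; square brackets around it would define a character class instead)
([^'>"]+) = store as many characters as you can that are not: ' > "

So: <a href="http://someaddress/somepage.html">

Becomes: http://someaddress/somepage.html

A bare URL matched through the http:// alternative loses that prefix, so you'll want to add http:// back to those, but that is fairly trivial.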

Hope this helps

Almost forgot: (?i) makes it case insensitive
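
To collect every URL in the text rather than a single one, StringRegExp can be run in global mode (flag 3). The following is only a sketch under a few assumptions: the sample text is made up, and the pattern is a variant of the ones discussed above rather than a copy of either. It uses a lookahead so that bare URLs keep their http:// prefix, and it also stops at whitespace.

```autoit
; Hypothetical sample text; real page source would come from FileRead or similar.
Local $sText = '<a href="http://someaddress/somepage.html">one</a> plain http://other/page.html text'

; (?i)               case insensitive
; (?:href=["']?|...) either href= with an optional quote ...
; (?=http://)        ... or a position where http:// starts (nothing is consumed)
; ([^'">\s]+)        capture everything up to a quote, > or whitespace
; A single quote inside a single-quoted AutoIt string is written twice.
Local $sPattern = '(?i)(?:href=["'']?|(?=http://))([^''">\s]+)'

; With flag 3 and one capturing group, the array holds the captured text of each match.
Local $aURLs = StringRegExp($sText, $sPattern, 3)
If @error Then
    ConsoleWrite("No URLs found" & @CRLF)
Else
    For $i = 0 To UBound($aURLs) - 1
        ConsoleWrite($aURLs[$i] & @CRLF)
    Next
EndIf
```

Because the href= alternative does not require http://, relative links such as href="page.html" are captured as well. Whether that is wanted depends on the data.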
