Ran into a RegEx brick wall...


I'm writing a small script to read e-mails from the reading pane in Outlook 2003. So far I've got it to read the standard RTF messages and ASCII text messages, and to capture the text from HTML messages using _IEBodyReadText. Now comes the tricky part...

I'm trying to use RegEx to read only the current reply and not include the previous mail history. So far my expression looks like this:

((Look for all characters and new lines)Repeat. Look for _ mail separator)

((.*\n)*_)

In most cases it will skip the previous reply and include the mail previous to it. When I've modified it to search for > separators in ASCII text mails, it includes the entire mail. As for HTML messages: after the text has been imported, the _ separators are stripped out. The only separator I can see is 3 new lines, then From:

Am I on the right track here, or am I barking up the wrong tree?


I wrote an Email stripper a few years back that did this and also allowed you to, optionally, save the file as a .txt file. It was well before the days of RegExp in AutoIt but it was fairly simple to do. I'll look for that file in my archives and post it.

BTW: There are more characters that can be used than just >, so you also have to allow for that.

I think the RegExpReplace would be something like (untested)

$sText = StringRegExpReplace($sText, "(?i)(\W*)(\w*\r|\n)", "$2")

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


Thanks for the reply.

Gave it a try - it strips out the extra lines and spaces, but the body of the previous messages remain.

Curious to see your previous e-mail utility - hopefully it can do the job!

I need to create a RegEx which will read the text down to the separator ( _ , > , [3 blank lines] From:) and ignore/strip the rest.

Not sure where I'm going wrong with what I have done so far... SmOke_N anywhere to be found?

Maybe I misunderstood the previous question. Can you post an exact example? The way I understand this post, you want to strip everything AFTER (From:)

SmOke_N is usually too easily found. :)

George



That's correct. I just need to extract the last reply, down to the mail separator, and ignore the rest of the mail.

CODE

Text of the last reply

--------------------------------------------------------------------------------

From:whoever

The expression ((.*\n)*_) will work most of the time.

I can adapt the expression using | (or) to detect different mail separators; it's just getting the initial text on its own that is proving difficult...

Adapting it to ((.*\n)*>), so that it is supposed to capture only up to the > separator, didn't work: it included the entire mail.

Neither did ((.*\n)*\n\n\n); it doesn't stop at a separator of 3 blank lines.
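
One likely reason the patterns above overshoot is that (.*\n)* is greedy: it consumes as many lines as it can and then backs up to the last separator in the mail, not the first. A rough sketch of a lazy alternative follows. The function name _FirstReply and the exact list of separators are assumptions based on the separators mentioned in this thread, so treat it as a starting point, not a tested solution.

```autoit
; Hypothetical helper: return the text above the first mail separator.
; The separator alternatives are assumptions taken from this thread:
;   a run of 5+ underscores or hyphens, a line starting with ">",
;   or three line breaks followed by "From:".
Func _FirstReply($sMail)
    Local $sPattern = '(?s)\A(.*?)(?:\r?\n)(?:[_-]{5,}|>|(?:\r?\n){2}From:)'
    Local $aMatch = StringRegExp($sMail, $sPattern, 1)
    If @error Then Return $sMail ; no separator found, keep the whole text
    Return $aMatch[0]
EndFunc   ;==>_FirstReply

; Made-up sample with two separators; only the text above the first is wanted.
Local $sSample = "Text of the last reply" & @CRLF & _
        "-----------------------------" & @CRLF & _
        "From:whoever" & @CRLF & "Older text" & @CRLF & _
        "-----------------------------" & @CRLF & "From:someone else"
ConsoleWrite(_FirstReply($sSample) & @CRLF)
```

The lazy (.*?) expands one character at a time, so the match ends at the first separator it can reach. Further separators can be added as extra alternatives inside the last group.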


Hey,

how about this...

#include <Array.au3>

$sText = FileRead("mail.txt") ; read the sample mail shown below
$sText1 = StringRegExp($sText, "(?i)(?s)\A(.*?)[\r\n]{3}", 3)
If Not IsArray($sText1) Then Exit -1
ConsoleWrite($sText1[0] & @CRLF)
_ArrayDisplay($sText1)

mail.txt

Text of the last reply



--------------------------------------------------------------------------------
From:whoever
Edited by Robjong

Is this what you want?

(?i)(?s)(.*\n)[[:punct:]]

EDIT:

Actually this may be more in line with what you want.

(?i)(?s)(.*)(?:\n[[:punct:]]{5,})

Edited by GEOSoft

George
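
A detail that is easy to miss with POSIX classes in PCRE: the class name only works inside a bracket expression, and a repetition count has to come after the closing bracket. The sketch below uses a made-up sample string and shows [[:punct:]]{5,} written that way, combined with a lazy (.*?) so the match stops at the first separator line. The lazy quantifier is my variation, not part of the pattern posted above.

```autoit
; Sample text is made up for illustration.
Local $sMail = "Newest reply" & @CRLF & "----------" & @CRLF & _
        "From:whoever" & @CRLF & "Older reply" & @CRLF & "__________"

; [[:punct:]]{5,} = five or more punctuation characters in a row.
; (.*?) is lazy, so the capture ends at the first such run.
Local $aMatch = StringRegExp($sMail, '(?s)\A(.*?)\r?\n[[:punct:]]{5,}', 1)
If IsArray($aMatch) Then ConsoleWrite($aMatch[0] & @CRLF)
```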



Here's a silly question. Why aren't you just using the COM objects to enumerate through Outlook's emails, rather than having to have it open and go through the _IE* functions to get the text?

Using the COM objects you could use $o_email.Subject/$o_email.From/$o_email.Body etc... would make your life much easier I think.

Anyway, it would probably be best to show the script you're using to extract the emails and give a step by step on that, so that we can see the different types of returns you're getting.

One last thing... would reading the body as HTML, rather than as text, give you better regex anchors?
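
For anyone who wants to try the COM route, here is a minimal sketch. It assumes Outlook is installed with a configured MAPI profile, and it reads the default Inbox rather than the reading pane. As far as I know a MailItem exposes SenderName rather than a From property, so that is what the sketch uses. Older Outlook versions may also show a security prompt when a script reads the sender or body.

```autoit
; Sketch only: assumes Outlook is installed and a MAPI profile is configured.
Local $oOutlook = ObjCreate("Outlook.Application")
If Not IsObj($oOutlook) Then Exit 1

Local $oNamespace = $oOutlook.GetNamespace("MAPI")
Local $oInbox = $oNamespace.GetDefaultFolder(6) ; 6 = olFolderInbox
Local $oItems = $oInbox.Items

For $i = 1 To $oItems.Count ; Outlook collections are 1-based
    Local $oMail = $oItems.Item($i)
    If $oMail.Class <> 43 Then ContinueLoop ; 43 = olMail, skip meeting requests etc.
    ConsoleWrite($oMail.SenderName & " | " & $oMail.Subject & @CRLF)
    ConsoleWrite($oMail.Body & @CRLF)
Next
```

The Body property returns plain text whatever the message format, so the same separator regex could be applied to every mail.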

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

OK.. it took some relentless tweaking and testing, but I think I've got it. Here's a copy of the code with a test preview window.

(This is what i came up with before I had a chance to read your replies :) )

#include <GUIConstants.au3>
#include <IE.au3> ;Needed for __IEControlGetObjFromHWND and _IEBodyReadText

#Region ### START Koda GUI section ### Form=
$Form1 = GUICreate("Form1", 1349, 996, 193, 115)
$Label1 = GUICtrlCreateLabel("Label1", 10, 10, 1341, 1002)
GUISetState(@SW_SHOW)
#EndRegion ### END Koda GUI section ###

While 1
    $nMsg = GUIGetMsg()
    Switch $nMsg
        Case $GUI_EVENT_CLOSE
            Exit
    EndSwitch

    $Type = ""      ;Reset each pass so $Type is always declared before it is displayed

    $Text = ControlGetText("Outlook","","RichEdit20W7")     ;Get content of Preview Pane when viewing Text or RTF
    $TextCheck = StringRegExp($Text,'\(\d\d/\d\d/\d\d')     ;Look for old text mail separator (Sender name (Date))
    If $TextCheck = 1 Then
        $Type = "Text"
        $TestSearch = StringRegExp($Text,'((.*\n){1,25}).*\(\d\d/\d\d/\d\d',1)
        If IsArray($TestSearch) Then     ;Flag 1 returns an array only when the pattern matched
            $Type = "Text RegEx"
            $Text = $TestSearch[0]
        EndIf

    ;Display message as is, or RegEx to find RTF mail separator (_)
    ElseIf $Text <> "" Then
        $Type = "RTF/Text"
        $TestSearch = StringRegExp($Text,'((.*\n){1,25})_',1)
        If IsArray($TestSearch) Then
            $Text = $TestSearch[0]
            $Type = "RTF RegEx"
        EndIf
    EndIf

    ;Get content of Preview Pane when viewing HTML message
    $HTMLControlHandle  = ControlGetHandle("Outlook","","Internet Explorer_Server1")
    If $HTMLControlHandle <> "" Then
        $HTML = __IEControlGetObjFromHWND($HTMLControlHandle)
        $Text = _IEBodyReadText($HTML)
        $Type = "HTML"
    ;Display message as is, or RegEx to find HTML mail separator (From:)
        $TestSearch = StringRegExp($Text,'((.*\n){1,25})From:',1)
        If IsArray($TestSearch) Then
            $Type = "HTML RegEx"
            $Text = $TestSearch[0]
        EndIf
    EndIf

    ;Update Test window with message content
    GUICtrlSetData($Label1, $Type &@CRLF&@CRLF&$Text)
    Sleep(1000)
WEnd

The Outlook Preview Pane changes the name of the control depending on whether the content is RTF/Text or HTML. Once the HTML has been read with _IEBodyReadText, the normal mail separator (_) doesn't come through, so I had to find another (From:).

The expression ((.*\n){1,25}) was working, but I realised I needed to limit how far into the message it should go; between 1 and 25 lines is typical of a reply. After that it was a case of tweaking the detection of the mail separators.
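
The line cap described above can also be written so that it stops at the first separator inside the window, not the last one. This is only a sketch with a made-up sample: the non-capturing inner group and the lazy {1,25}? are my variation, not the pattern used in the script.

```autoit
; Made-up sample: two "From:" separators within the 25-line window.
Local $sText = "Reply line 1" & @LF & "Reply line 2" & @LF & _
        "From: first sender" & @LF & "Older line" & @LF & "From: second sender"

; (?:.*\n){1,25}? takes as few lines as possible (at most 25)
; before a line that starts with "From:".
Local $aMatch = StringRegExp($sText, '\A((?:.*\n){1,25}?)From:', 1)
If IsArray($aMatch) Then ConsoleWrite($aMatch[0])
```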

@GEOSoft, Tried the code you supplied - works a charm except on Text mails. Changed it to use \(\d\d/\d\d/\d\d and it's now working.. Thanks!

Reading it as I see it: (?i) and (?s) are modifiers that ignore case and let . match new lines as well, so (.*) captures every character.

(?:\n ...) is a non-capturing group, so the expression looks for the new line but doesn't include it in the result. [[:punct:]]{5,} matches a run of five or more punctuation characters (no letters or digits), which means a range of separators is covered.

Don't know why > is not one of them :lmao:

@SmOke_N, I'm doing my best to get my head around RegEx, but it's a long way off before I'm fluent. I don't know my way around COM objects at all yet, so if there's a more elegant solution I'm interested in learning how it's done :think:

Edited by DarkGUNMAN

The [[:punct:]]{5,} part is used to detect the separator and will only come into play if a separator character is found 5 or more times in a row. Example: in a couple of your posts you were testing for underscores (_), but in the test text you supplied the separator was the hyphen (-), so instead of asking the obvious I just used the [:punct:] class, which covers them all.

EDIT:

When testing a line to see if it is actually part of a reply (these usually start with >) and you want to use a RegExp to test for the [:punct:] class as the first character, then make sure you use something like

If StringLeft($sStr, 1) <> Chr(34) Then
    ; Put a RegExp in here
EndIf

That will avoid catching lines that may be starting with a quote.
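
A small sketch of how that guard might be wrapped up. The helper name _LooksQuoted and the sample lines are made up for illustration.

```autoit
; Hypothetical helper: True when a line looks like quoted history
; (starts with a punctuation character such as ">"), but not when it
; merely starts with a double quote.
Func _LooksQuoted($sLine)
    If StringLeft($sLine, 1) = Chr(34) Then Return False
    Return StringRegExp($sLine, '^[[:punct:]]') = 1
EndFunc   ;==>_LooksQuoted

ConsoleWrite(_LooksQuoted("> earlier text") & @CRLF)
ConsoleWrite(_LooksQuoted('"Quoted phrase" in a new reply') & @CRLF)
```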

Edited by GEOSoft

George

