Regular expression, exclude word if near other word

jchd · May 22, 2009

Those appear to be a match if I understood what the OP is asking for. It seems like "up shut up" should be matched. If the OP can post a few lines or sentences that the regexp should and shouldn't match, it'll be much more comfortable than guessing what the regexp should or should not match.

It all depends on how strictly you interpret the OP's own wording, which I already quoted:

In this case I just want a positive or negative result -- want to find the word 'up' any time it is not preceded on the same line by the word 'shut'.

[Double emphasis still mine]

I took it as if it were legalese. As a consequence, in "up shut up" I see the final word 'up' (blue) preceded by the word 'shut' (red) on the same line. Hence I considered it had to be a match failure, despite the presence of an earlier 'up' (green).

But look, there's little point in a "mine is bigger than yours" discussion, since even the OP seems to have lost interest in his own question and we might never know more formally what he intended or needed. Even the notions of 'word' and 'word separator' would have to be strictly defined.

We both know that regexes are a very powerful tool, cleverly enhanced by the new extensions, which your post prompts me to look at more closely.

jchd · May 22, 2009

#include <String.au3>
#include<array.au3>

Dim $aStr[11] = _
    ['shut the hell up', 'up hell the shut up', 'shutup', 'please, shut the window', _
    'shut the soup', "I'm feeling up", "I'm feeling up so shut up", "I'm feeling shut so up shut", _
    'shut word word shut up', 'shuton on upload', 'shut the upload']

$str=_ArrayToString($aStr," ")

$ss=StringSplit($str," ")
for $c=1 to $ss[0]
    if $ss[$c]="up" and $ss[$c-1]="shut" then MsgBox(0,"","Match "&$ss[$c]&" "&$ss[$c-1],0)

Next

Sorry Aceguy but you miss the point on this one.

Your code merges all test strings together. They should instead be tested individually. Then, if one follows the spirit of your loop, it only handles the case where the two words ('shut' and 'up') follow in direct sequence. You have to raise a flag or something to denote the occurrence of 'shut' before 'up'. Also, the OP was asking for a solution using a single regex giving a binary answer on a direct test, just because he thought it would be relatively easy to do so. It turns out it wasn't _that_ easy if you insist on a correct answer for every possible input string.
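The flag idea can be sketched as follows. This is only a minimal illustration, not code from the thread: the function name _UpNotAfterShut is made up, words are assumed to be separated by single spaces, and it follows the strict reading (any 'up' that has a 'shut' somewhere before it on the same line makes the line fail).

```autoit
; Hypothetical helper (not from the thread): True when the line contains
; the word 'up' and no 'up' has a 'shut' before it on the same line.
; Assumption: words are separated by spaces only; '=' compares case-insensitively.
Func _UpNotAfterShut($sLine)
    Local $aWords = StringSplit($sLine, " ")   ; $aWords[0] holds the word count
    Local $bShutSeen = False, $bUpFound = False
    For $i = 1 To $aWords[0]
        If $aWords[$i] = "shut" Then
            $bShutSeen = True                  ; raise the flag
        ElseIf $aWords[$i] = "up" Then
            If $bShutSeen Then Return False    ; this 'up' is preceded by 'shut'
            $bUpFound = True
        EndIf
    Next
    Return $bUpFound
EndFunc   ;==>_UpNotAfterShut

; Each test string is checked on its own, never merged with the others.
Local $aTests[3] = ["shut the hell up", "I'm feeling up", "up shut up"]
For $i = 0 To UBound($aTests) - 1
    ConsoleWrite($aTests[$i] & " -> " & _UpNotAfterShut($aTests[$i]) & @CRLF)
Next
```

Under the looser reading, where a single 'up' before any 'shut' is enough, one would return True at the first 'up' found while the flag is still down.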

There are also plenty of simple yet elegant solutions involving, for instance, _two_ simple StringRegExp() calls in a row linked by an And or Or condition. Even if this is not what the OP was looking for, the overall complexity of such a solution should be more favorable on average input than any loop written in AutoIt instructions.

To sum up, more straightforward code is possible (and certainly recommended from both readability and maintenance points of view), but that was not really the point of the thread, at least as I understand it.

Kiai · May 24, 2009

I was the original poster on here and am sort of amazed at the solution from jchd. I never considered reversing the entire string which, once I saw it, was a 'duh!' moment: an 'of course time can be variable, Einstein' sort of solution. Hats off to you, jchd!

The app I am working on can't, as-is, utilize this solution without some modifications. It is a medical text matching program where search terms (contained outside of the program in an xml file) can either just be words or regular expressions (but not chunks of code). The actual real world problem I was trying to solve was to answer yes or no to this: does a 2-page document contain the word(s) cancer|neoplasm|dysplasia|carcinoma NOT preceded by not|no|n't|without? The challenge is that, within a lengthy biopsy report of, for example, someone's pancreas, this may be the text:

Specimen 1: No evidence of carcinoma or dysplasia.

Specimen 2: Grade 3 adeno-carcinoma extending beyond surgical margins

Specimen 3: No evidence of carcinoma or dysplasia.

I want to match that document as POSITIVE for cancer, despite it containing phrases 1 and 3.

I obviously can't just match on 'carcinoma' because it would pick up the negative sentences.

At this point, I think I might add to the app a 'reverse search' string option -- use it only for these specific situations. ... As I think about it, if I reverse the entire string I am searching (the biopsy report), then this should work:

Reverse the entire string:

.aisalpsyd ro amonicrac fo ecnedive oN :3 nemicepS

snigram lacigrus dnoyeb gnidnetxe amonicrac-oneda 3 edarG :2 nemicepS

.aisalpsyd ro amonicrac fo ecnedive oN :1 nemicepS

search for

(?i)(aisalpsyd| amonicrac)(\w*\W*){0,4}(?!( on| ton))

It is looking for dysplasia or carcinoma (reversed) followed by 0-4 words but NOT followed by 'no ' or 'not ' (reversed). The only problem is that, after spending about 2 hours trying to make this work, it still matches when it shouldn't.

Suggestions on what I'm doing wrong for the problem of, match word1 NOT followed by word2 (within 0-6 words)?
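One likely reason the pattern above still matches: the group (\w*\W*){0,4} is free to match zero times, so the engine only needs one position after the word where ' on' or ' ton' does not follow, and it finds one straight away. A possible fix is to move the whole "up to four words, then the negation" window inside the negative lookahead. The sketch below assumes a 4-word window and the two reversed negations from the post; the pattern is an illustration, not something posted in the thread.

```autoit
; Reversed sample lines, copied from the post above.
Local $aReversed[3] = [ _
        ".aisalpsyd ro amonicrac fo ecnedive oN :3 nemicepS", _
        "snigram lacigrus dnoyeb gnidnetxe amonicrac-oneda 3 edarG :2 nemicepS", _
        ".aisalpsyd ro amonicrac fo ecnedive oN :1 nemicepS"]

; The lookahead now covers "0-4 words, then 'on' or 'ton' as a whole word",
; so the match fails if a negation appears anywhere inside that window.
Local $sPattern = "(?i)\b(?:aisalpsyd|amonicrac)\b(?!(?:\W+\w+){0,4}\W+(?:on|ton)\b)"

For $i = 0 To UBound($aReversed) - 1
    ; Flag 0 returns 1 for a match and 0 for no match.
    ConsoleWrite(StringRegExp($aReversed[$i], $sPattern, 0) & " : " & $aReversed[$i] & @CRLF)
Next
```

With these three lines only the middle one (Specimen 2) should report 1. Note that the window counts words blindly, so on a whole reversed document it can reach across a sentence boundary and pick up a negation that belongs to the neighbouring sentence.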


jchd · May 24, 2009


Nice to see you back on this thread. I found the problem intriguing, as you state it in the first post. The constraint of using a single regex was more a challenge than a requirement. I must admit I enjoyed it anyway.

Now your actual problem is embedded in a much more difficult context. The very nature of human language (medical jargon apart) makes it very difficult to capture semantics correctly every time, in almost any context. Having spent much time modelling system specifications using formal methods has made me extremely cautious when it comes to rigor expressed in human language.

In your particular context, a "one bit" error can have dramatic consequences, so let me insist that the following is not expert advice and has to be validated by other more rigorous means. I'm also directly concerned myself about the correctness of cancer reports, so I tend to insist boldly. This is NOT an exercise! as they say in the movies.

In your case I would favor reinforcing the outcome of the function by two means:

o) using simpler techniques (complex/unusual/convoluted code always gives more opportunities for bugs to creep in)

o) using a redundant determination employing distinct techniques

You can do so because I guess that superior efficiency is not the problem in your case. Your code can most probably be allowed to spend dozens of seconds or even a few minutes before replying (a HUGE potential for careful computations).

Using a regex on a whole report would be useless IMHO. This is too much to chew in one chunk. Rather, handle the text phrase by phrase, then look at each paragraph if it still makes sense. If your documents are XML, then they probably have a usable structure to help focus on the actual pieces of interest.

With your sample extract (I know it can be more verbose/complex than that), you can still run a dual regex.

Let's call <nasties> the whole lot of words you look for, and <negations> the whole lot of negative terms. Both sets inside <> can be direct alternations or more flexible root-based word detection. You can achieve a partial result by using a couple of regexes like:

$CancerPresenceDoubtRaised = StringRegExp($phrase, "<nasties>", 0) And Not StringRegExp($phrase, "<negations>.*?<nasties>", 0)

or

$CancerAbsenceDoubtRaised = Not StringRegExp($phrase, "<nasties>", 0) Or StringRegExp($phrase, "<negations>.*?<nasties>", 0)
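To make the placeholders concrete, here is a minimal sketch of the first variant. The word lists are only the terms mentioned earlier in the thread and are in no way complete; the function name is made up for the illustration, and "n't" is left out because, being a suffix, it would need separate handling.

```autoit
; Hypothetical helper: True when the phrase names a nasty
; that has no negation term somewhere before it.
Func _CancerPresenceDoubtRaised($sPhrase)
    Local Const $sNasties = "cancer|neoplasm|dysplasia|carcinoma"   ; illustrative list
    Local Const $sNegations = "not|no|without"                      ; illustrative list
    ; \b keeps 'no' from matching inside words such as 'adeno'.
    Local $sFind = "(?i)\b(?:" & $sNasties & ")\b"
    Local $sNegated = "(?i)\b(?:" & $sNegations & ")\b.*?\b(?:" & $sNasties & ")\b"
    Return StringRegExp($sPhrase, $sFind, 0) And Not StringRegExp($sPhrase, $sNegated, 0)
EndFunc   ;==>_CancerPresenceDoubtRaised

ConsoleWrite(_CancerPresenceDoubtRaised("Specimen 1: No evidence of carcinoma or dysplasia.") & @CRLF)
ConsoleWrite(_CancerPresenceDoubtRaised("Specimen 2: Grade 3 adeno-carcinoma extending beyond surgical margins") & @CRLF)
```

The first call should print False and the second True. Since '.' does not cross line breaks by default, this is meant to be called once per phrase, not on a whole report.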

But then I would only consider this a low-level component. More scrutiny is needed when examining the doubt-raising phrases. English is not my mother tongue, so I may miss obvious things with this approach. Even if medical reports tend to be more factual than random literature, both false positives and false negatives are to be avoided at all costs. Any positive or negative doubt raised should be confirmed further by other means. I bet there are components or libraries available to help you here that work in completely different ways.

Look, there certainly exist gotchas (of course I'm making this up entirely but you get the idea):

Specimen 1: No evidence of carcinoma or dysplasia.

Specimen 2: Evidence of successful surgery of an old adeno-carcinoma.

Specimen 3: No evidence of carcinoma or dysplasia.

Specimen 4: Evidence of a benign form of friendly-carcinoma whose presence is compatible with the patient's age.

I wish you the best of luck for a failsafe implementation.

Authenticity · May 24, 2009

#include <Array.au3>

Global $aStr[3] = ['Specimen 1: No evidence of carcinoma or dysplasia.', _
                  'Specimen 2: Grade 3 adeno-carcinoma extending beyond surgical margins', _
                  'Specimen 3: No evidence of carcinoma or dysplasia.']
                  
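; Scan atomically to the first nasty (group 2) or negation (group 3), whichever comes first;
; if it was a negation, (*FAIL) rejects the line, otherwise the rest of the line is matched.
Global $sPatt = '^((?i)(?>.*?\b(?:(dysplasia|carcinoma)|(not|no))\b)(?(3)(*FAIL)|.*))'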
Global $aMatch

For $i = 0 To 2
    $aMatch = StringRegExp($aStr[$i], $sPatt, 1)
    If IsArray($aMatch) Then _ArrayDisplay($aMatch, StringFormat('String: %d', $i+1))
Next

Kiai · May 24, 2009

I appreciate your concern about accuracy here. I am fortunately not going to send people to surgery or the morgue based on the output of the regex -- I'm a doctor (and hack programmer) trying to develop a system to flag text reports for users so they don't accidentally overlook the one positive mixed in with negative 'nasties'. As you point out, there will still be accidental positives, as in your examples #2 and #4.

At this point, I agree with you that the primary problem is with trying to do this with an entire 2-page report at once, as opposed to taking it phrase by phrase. I think the best approach appears to be to search for 'nasties' in the text and, when one is found, run a string search backwards looking for a negation term, then go on to the next found 'nasty'. Currently my app allows for 'inclusion' and 'exclusion' terms in the search, so I think I'll add a third category of 'not negated' search terms.
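A rough sketch of that approach, under several assumptions: sentences are split naively on '.', '!', '?' and line breaks, the word lists are just the terms from this thread, and the function name is invented for the example. Within one sentence, a negation found before the first 'nasty' also precedes every later one, so only the first occurrence per sentence is checked before moving on.

```autoit
; Hypothetical helper: True when some sentence of the report mentions a nasty
; with no negation term earlier in that same sentence.
Func _HasUnnegatedNasty($sReport)
    Local Const $sNasties = "cancer|neoplasm|dysplasia|carcinoma"   ; illustrative list
    Local Const $sNegations = "not|no|without"                      ; illustrative list
    ; Naive split: every '.', '!', '?', CR or LF ends a sentence.
    Local $aSentences = StringSplit($sReport, ".!?" & @CRLF)
    Local $aBefore
    For $i = 1 To $aSentences[0]
        ; Capture the text that precedes the first nasty in this sentence.
        $aBefore = StringRegExp($aSentences[$i], "(?i)^(.*?)\b(?:" & $sNasties & ")\b", 1)
        If Not IsArray($aBefore) Then ContinueLoop   ; no nasty in this sentence
        ; Look back from the nasty to the sentence start for a negation.
        If Not StringRegExp($aBefore[0], "(?i)\b(?:" & $sNegations & ")\b", 0) Then Return True
    Next
    Return False
EndFunc   ;==>_HasUnnegatedNasty

Local $sReport = "Specimen 1: No evidence of carcinoma or dysplasia." & @CRLF & _
        "Specimen 2: Grade 3 adeno-carcinoma extending beyond surgical margins" & @CRLF & _
        "Specimen 3: No evidence of carcinoma or dysplasia."
ConsoleWrite(_HasUnnegatedNasty($sReport) & @CRLF)
```

For the three-specimen sample this should print True, because of Specimen 2. The naive split will also cut at decimal points and abbreviations, so a real report would need a more careful sentence splitter.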

Thanks for your help with this!

btw, for someone who didn't grow up speaking English, your English is fantastic.


jchd · May 24, 2009

I appreciate your concern about accuracy here. I am fortunately not going to send people to surgery or the morgue based on the output of the regex -- I'm a doctor (and hack programmer) trying to develop a system to flag text reports for users so they don't accidentally overlook the one positive mixed in with negative 'nasties'. As you point out, there will still be accidental positives, as in your examples #2 and #4.

I feel some relief reading this. Hopefully you're not a rogue programmer paid low wages to decide life or death based on the outcome of a single regex.

At this point, I agree with you that the primary problem is with trying to do this with an entire 2-page report at once, as opposed to taking it phrase by phrase. I think the best approach appears to be to search for 'nasties' in the text and, when one is found, run a string search backwards looking for a negation term, then go on to the next found 'nasty'. Currently my app allows for 'inclusion' and 'exclusion' terms in the search, so I think I'll add a third category of 'not negated' search terms.

This is certainly a wise decision. It's now up to you to decide the balance between several constraints, being aware of the fact that you will never come up with the 100% failproof "algorithm of God".

W.r.t. my style, in English or French, I do my best to express myself in the clearest possible way, with respect for the reader and the reader's background in mind. I'm not always successful, but I keep trying ... and learning.

All the best; take care of yourself and of your files.

Edited May 24, 2009 by jchd

jchd · May 24, 2009

@ Authenticity

I think you hit this nail with the right hammer in this last code. (Beware: the forum turned the "b)" in "\b)" into an emoticon!)
