Regex question

JRowe · November 15, 2008

I need to split paragraphs into sentences. This was relatively easy to do, as it's a pattern very familiar to me. And well, to anyone who reads English, too.

I've tried to incorporate anything and everything that could be encountered in the English language. The only problem I'm having is in handling website addresses... the .com's and www.'s are throwing me for a loop.

$Value = "In computing, regular expressions provide a concise and flexible means for identifying strings of te2xt of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification. This String is a test paragraph & a foil? The sentences are delimited by various punctuation! Wow, aren't we excited. Bill Gates email is billgates_110002@hotmail.com."

StringRegExp($Value, "(['""\w\d\(\),\;\:\-\@\&\s]++[\.\!\?])", 3)

$Value holds examples of everything except for colons, semicolons, and quotes. Obviously, I'm not concerned with grammar at this point; I'm just trying to do a basic paragraph parser.

There should be 6 sentences returned, but it parses the last as 2 separate sentences and returns 7... the billgates_110002@hotmail. is read as terminating the sentence, so "com." is read as its very own sentence.
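
For anyone reproducing the miscount, here is a minimal sketch that runs the pattern from the post against a shortened sample and prints each match in brackets. The sample text and the variable names ($sSample, $aHits) are placeholders for illustration, not part of the original script. Because the period is not in the character class, the run of allowed characters should stop at "hotmail", and the address should come back in two pieces.

```autoit
; Shortened stand-in for $Value (hypothetical sample text).
Local $sSample = "Wow, aren't we excited. Bill Gates email is billgates_110002@hotmail.com."

; Same pattern as in the post: a run of allowed characters, then one of . ! ?
Local $aHits = StringRegExp($sSample, "(['""\w\d\(\),\;\:\-\@\&\s]++[\.\!\?])", 3)

If @error Then
    ConsoleWrite("No matches" & @CRLF)
Else
    ; Each element is one "sentence" as the pattern sees it.
    For $i = 0 To UBound($aHits) - 1
        ConsoleWrite($i & ": [" & $aHits[$i] & "]" & @CRLF)
    Next
EndIf
```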

Thanks :mellow:

Edited November 15, 2008 by jrowe

JRowe · November 15, 2008

(['""\w\d\(\),\;\:\-\@\&\s]+(.com)?+[\.\!\?])

This works, sort of. It still counts .com as its own sentence. How do I tell it to not count (.com) by itself as a complete sentence?

dbzfanatic · November 15, 2008

Make a minimum length? It won't work for everything but you could do something like

pseudocode:

If $sentence <> "" and $sentence <> "No" and $sentence <> "Yes" Then

Yes and No are the only words I can think of that are 3 chars or less and could be a full sentence.
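
One way the minimum-length idea could look in AutoIt is sketched below. The candidate list, the threshold of 5 characters, and the whitelist of short sentences are all assumptions made up for the example; it only illustrates the filter and does not touch the regex itself.

```autoit
; Hypothetical candidates, as a split step might return them.
Local $aCandidates[4] = ["Wow, aren't we excited.", "com.", "Yes.", "No."]
; Short strings that should still count as sentences (assumed whitelist).
Local $sAllowed = "|Yes.|No.|"
Local $iMinLen = 5 ; assumed threshold

For $i = 0 To UBound($aCandidates) - 1
    Local $sItem = $aCandidates[$i]
    ; Keep long candidates, plus short ones that appear on the whitelist.
    If StringLen($sItem) >= $iMinLen Or StringInStr($sAllowed, "|" & $sItem & "|") Then
        ConsoleWrite("keep: " & $sItem & @CRLF)
    Else
        ConsoleWrite("drop: " & $sItem & @CRLF)
    EndIf
Next
```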

JRowe · November 15, 2008

There are possible single-character answers within the domain of conversational English (giving the answer to a multiple-choice question, for example), so I need to focus on punctuation. I almost have it, I think; I'm looking at a non-capturing group to qualify and eliminate (\w\,\w) groups.

dbzfanatic · November 15, 2008

I was thinking standard conversation not oral test-taking lol.

JRowe · November 15, 2008

Woot.

(['""\w\d\(\),\;\:\-\@\&\s]+(?:.\w+)?+[\.\!\?])

Had to put an optional (may or may not appear) non-capturing group after the character class. Strictly speaking, the ?+ on that group is a possessive quantifier rather than a non-greedy one. :mellow:

I'm doing this as part of a bigger project; I want to be able to parse arbitrary English text. Paragraphs are easier: I just need to specify sentence groups, with newline delimiters. Pages, volumes, threads, chapters and other larger divisions can wait, I think.

Also, thanks to Szhlopp for the regex tester. It's been a huge help!

edit: fixed (?:.com) to read (?:.\w+), for .net, .org, .biz, etc. There are other exceptions to account for now, like Mr. and Mrs., but those are easily handled by reversing the .com test. I'll update when I have more.
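
A likely reason the earlier (.com) attempt still listed .com separately: with flag 3, StringRegExp returns the text of every capturing group, so the inner group's ".com" lands in the array next to the full sentence. The sketch below compares the two forms on a made-up sample string; the variable names are placeholders. In the second pattern the dot is written as \. so it matches only a literal period. That escape is an assumption of this sketch; the thread's pattern leaves the dot unescaped, which also matches any other character in that position.

```autoit
; Hypothetical one-sentence sample.
Local $sSample = "My email is blah@blahblah.com."

; Inner group captures: its text comes back as an extra array element.
Local $aCapturing = StringRegExp($sSample, "(['""\w\d\(\),\;\:\-\@\&\s]+(.com)?+[\.\!\?])", 3)
; Inner group is (?:...): only the outer group's text comes back.
Local $aPlain = StringRegExp($sSample, "(['""\w\d\(\),\;\:\-\@\&\s]+(?:\.\w+)?+[\.\!\?])", 3)

ConsoleWrite("capturing group:     " & UBound($aCapturing) & " element(s)" & @CRLF)
ConsoleWrite("non-capturing group: " & UBound($aPlain) & " element(s)" & @CRLF)
```

The first call should report two elements (the sentence and ".com") and the second only one.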

Edited November 15, 2008 by jrowe

GEOSoft · November 15, 2008

Good one. I just made one slight change that you might want to consider

\b(['""\w\d\(\),\;\:\-\@\&\s]+(?:.\w+)?+[\.\!\?])

\b causes the regex to start at a word boundary instead of picking up any white space preceding it. You could also put \h* in front of the capturing group to do the same thing.
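
To see the effect of \b, this small sketch runs the pattern with and without it on a made-up two-sentence string and prints each match in brackets so a leading space is visible. The sample text and variable names are placeholders. One caveat worth checking on real text: \b should also skip a leading quote or parenthesis, since those are not word characters.

```autoit
; Hypothetical sample with a space between the two sentences.
Local $sSample = "One sentence. Two sentences!"
Local $sBody = "(['""\w\d\(\),\;\:\-\@\&\s]+(?:.\w+)?+[\.\!\?])"

Local $aLoose = StringRegExp($sSample, $sBody, 3)        ; pattern as posted earlier
Local $aTight = StringRegExp($sSample, "\b" & $sBody, 3) ; same pattern behind \b

; Brackets make any leading space visible in the console output.
For $i = 0 To UBound($aLoose) - 1
    ConsoleWrite("without \b: [" & $aLoose[$i] & "]" & @CRLF)
Next
For $i = 0 To UBound($aTight) - 1
    ConsoleWrite("with \b:    [" & $aTight[$i] & "]" & @CRLF)
Next
```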

JRowe · November 15, 2008

I parse out white space during the word parse step, but that's a good one to know. :mellow:

I figured serial parsing made more sense for arbitrary English, so I have 3 stages so far: parsing paragraphs from the corpus, sentences from paragraphs, and words from sentences.

My plan is to parse Corpus -> Paragraphs -> Sentences -> Words, then to rebuild sentences as sentence diagrams, then diagrams as microtheories that represent contextual groupings of word types, which in turn can be associated with data acquired from other sources, such as previous paragraphs, other texts read, and so on.
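
The three parsing stages could be chained roughly as follows. This is only a sketch under assumptions: paragraphs are taken to be separated by a blank line, words are taken to be runs of \w characters and apostrophes, and the corpus text and variable names are invented for the example. The sentence pattern is the one from this thread.

```autoit
#include <Array.au3>

; Hypothetical two-paragraph corpus; a blank line separates paragraphs.
Local $sCorpus = "First one. Second one!" & @CRLF & @CRLF & "Another paragraph here. Is it parsed?"

; Stage 1: corpus -> paragraphs. Strip CRs, then split on blank lines.
; Flag 3 = 1 (use the whole delimiter string) + 2 (no count in element 0).
Local $aParagraphs = StringSplit(StringStripCR($sCorpus), @LF & @LF, 3)

For $p = 0 To UBound($aParagraphs) - 1
    ; Stage 2: paragraph -> sentences.
    Local $aSentences = StringRegExp($aParagraphs[$p], "\b(['""\w\d\(\),\;\:\-\@\&\s]+(?:.\w+)?+[\.\!\?])", 3)
    If @error Then ContinueLoop ; no sentences found in this paragraph

    For $s = 0 To UBound($aSentences) - 1
        ; Stage 3: sentence -> words.
        Local $aWords = StringRegExp($aSentences[$s], "[\w']+", 3)
        If @error Then ContinueLoop ; no words found in this sentence
        ConsoleWrite("P" & $p & " S" & $s & ": " & _ArrayToString($aWords, " | ") & @CRLF)
    Next
Next
```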

Ultimately, I'm aiming at a paragraph/story generator... give it a microtheory, such as "AutoIt web server catches fire, community is distraught, nobody knows what's going on" and have it generate a grammatically and syntactically correct paragraph conveying the thought. The challenge is in parsing enough sentence structures and identifying patterns that can be used algorithmically for the microtheories. Microtheories are a concept from the Cyc/OpenCyc project, meaning a bundle of assertions, or a concept dependent on previous data. Most inference engines have the equivalent.

To begin studying how I might solve that problem, I decided to parse large amounts of text. I'm going to parse several books, a few thousand news articles, maybe some Wikipedia articles, IRC logs, and see where I get. I'll probably have to add several thousand new words by hand, and I'll learn a lot about regular expressions during the process. Eventually, I'm hoping I'll have a large database of paragraph patterns representing various concepts (not dependent on the original inputs, but usable to generate new concepts and grammatical structures for those concepts).

JRowe · November 15, 2008

Actually, that improvement was great, GEOSoft. That'll teach me to question an MVP :mellow:

\b(['""\w\d\(\),\;\:\-\@\&\s']+(?:.\w+)?+[\.\!\?]'?)

#include <Array.au3>
; Sample paragraph: six sentences, including a quoted sentence and an email address.
Local $data = "Testing, 123, 456;this is a test. There are two sentences so far. Mister Jones said 'And now there are three.' Oh, and six sentences total in this paragraph. My email is blah@blahblah.net. I didn't lie!"
; Flag 3 returns an array holding every match found in $data.
Local $a = StringRegExp($data, "\b(['""\w\d\(\),\;\:\-\@\&\s']+(?:.\w+)?+[\.\!\?]'?)", 3)
_ArrayDisplay($a, "Extracted lines")

It allowed me to parse the quotes within a sentence correctly.

Mr., Mrs., and things like ellipses (...) and other grammatical and punctuation oddities will be parsed out and replaced with @<ellipses>@ or some such symbols, and filled back in after.
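
The replace-then-restore idea might be sketched like this. The placeholder tokens (__MR__, __MRS__, __ELLIPSIS__) are hypothetical and are built only from \w characters on purpose: the < and > in a token like @<ellipses>@ are not in the pattern's character class, and \b would skip a token's leading @, so either the token or the pattern would need adjusting. The sample text is made up, and the ellipsis sits mid-sentence; an ellipsis that ends a sentence would need separate handling.

```autoit
; Hypothetical sample containing two abbreviations and an ellipsis.
Local $sText = "Mr. Smith paused... then he left. Mrs. Jones stayed!"

; Original text and its placeholder token (tokens are assumptions of this sketch).
Local $aMask[3][2] = [["Mrs.", "__MRS__"], ["Mr.", "__MR__"], ["...", "__ELLIPSIS__"]]

; Step 1: hide the troublesome periods (last argument 1 = case-sensitive).
For $i = 0 To UBound($aMask) - 1
    $sText = StringReplace($sText, $aMask[$i][0], $aMask[$i][1], 0, 1)
Next

; Step 2: split into sentences with the pattern from this thread.
Local $aSentences = StringRegExp($sText, "\b(['""\w\d\(\),\;\:\-\@\&\s']+(?:.\w+)?+[\.\!\?]'?)", 3)

; Step 3: put the original text back into each sentence.
If Not @error Then
    For $s = 0 To UBound($aSentences) - 1
        For $i = 0 To UBound($aMask) - 1
            $aSentences[$s] = StringReplace($aSentences[$s], $aMask[$i][1], $aMask[$i][0], 0, 1)
        Next
        ConsoleWrite("[" & $aSentences[$s] & "]" & @CRLF)
    Next
EndIf
```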
