
Regex question


JRowe

I need to split paragraphs into sentences. This was relatively easy to do, as it's a pattern very familiar to me. And, well, to anyone who reads English, too. :(

I've tried to incorporate anything and everything that could be encountered in the English language. The only problem I'm having is in handling website addresses... the .com's and www.'s are throwing me for a loop.

$Value = "In computing, regular expressions provide a concise and flexible means for identifying strings of te2xt of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification. This String is a test paragraph & a foil? The sentences are delimited by various punctuation! Wow, aren't we excited. Bill Gates email is billgates_110002@hotmail.com."

StringRegExp($Value, "(['""\w\d\(\),\;\:\-\@\&\s]++[\.\!\?])", 3)

$Value holds examples of everything except for colons, semicolons, and quotes. Obviously, I'm not concerned with grammar at this point, I'm just trying to do a basic paragraph parser.

There should be 6 sentences returned, but it parses the last as 2 separate sentences and returns 7... the dot in "billgates_110002@hotmail." is read as terminating the sentence, so "com." is read as its very own sentence.
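
The behaviour described above can be reproduced on a shorter string. This is only a minimal sketch: the variable names are made up for illustration, and the pattern is copied from the first post unchanged.

```autoit
; Minimal reproduction using the pattern from the first post.
; $sText, $sPattern and $aHits are illustrative names, not from the thread.
Local $sText = "Wow, aren't we excited. Bill Gates email is billgates_110002@hotmail.com."
Local $sPattern = "(['""\w\d\(\),\;\:\-\@\&\s]++[\.\!\?])"

; Flag 3 returns an array holding every match in the string.
Local $aHits = StringRegExp($sText, $sPattern, 3)

If IsArray($aHits) Then
    For $i = 0 To UBound($aHits) - 1
        ; Brackets make leading whitespace in each match visible.
        ConsoleWrite($i & ": [" & $aHits[$i] & "]" & @CRLF)
    Next
Else
    ConsoleWrite("No matches" & @CRLF)
EndIf
```

The character class has no dot in it, so the first dot of the address can only be consumed by the terminator class [\.\!\?], which is why the address gets cut there.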

Thanks :mellow:

Edited by jrowe

(['""\w\d\(\),\;\:\-\@\&\s]+(.com)?+[\.\!\?])

This works, sort of. It still counts .com as its own sentence. How do I tell it not to count (.com) by itself as a complete sentence?


Make a minimum length? It won't work for everything, but you could do something like

pseudocode:

If $sentence <> "" And $sentence <> "No" And $sentence <> "Yes" Then

Yes and No are the only words I can think of that are 3 chars or less and could be a full sentence.
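
One way the minimum-length idea might look as a filter is sketched below. The function name, the 3-character threshold, and the decision to ignore trailing punctuation are all assumptions made for illustration; the pseudocode above does not spell them out.

```autoit
; Hypothetical helper: returns True if a candidate should be kept as a sentence.
; The name and the exact rules are assumptions, not taken from the thread.
Func _KeepCandidate($sCandidate)
    ; Trim surrounding whitespace (flag 3 = leading + trailing), then drop end punctuation.
    Local $sWord = StringRegExpReplace(StringStripWS($sCandidate, 3), "[.!?]+$", "")
    If $sWord = "" Then Return False
    ; Anything longer than 3 characters passes the minimum-length test.
    If StringLen($sWord) > 3 Then Return True
    ; Short candidates are kept only if they are one of the allowed one-word replies.
    ; The = operator compares strings case-insensitively in AutoIt.
    Return ($sWord = "Yes" Or $sWord = "No")
EndFunc

ConsoleWrite("com. kept? " & _KeepCandidate("com.") & @CRLF)
ConsoleWrite("Yes. kept? " & _KeepCandidate("Yes.") & @CRLF)
```

A filter like this drops the stray fragment rather than rejoining it to the sentence it came from, so the tail of the address would still be lost.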


There are possible single-character answers within the domain of conversational English (giving the answer to a multiple choice question, for example), so I need to focus on punctuation. I almost have it, I think; I'm looking at a non-capturing group to qualify and eliminate (\w\,\w) groups.


Woot.

(['""\w\d\(\),\;\:\-\@\&\s]+(?:.\w+)?+[\.\!\?])

Had to put an optional non-capturing group after the character class; the ?+ after it is the may/may not appear quantifier (possessive, strictly speaking, rather than non-greedy). :mellow:

I'm doing this as part of a bigger project; I want to be able to parse arbitrary English text. Paragraphs are easier: I just need to specify sentence groups with newline delimiters. Pages, volumes, threads, chapters, and other larger divisions can wait, I think.

Also, thanks to Szhlopp for the regex tester. It's been a huge help!

edit: fixed (?:.com) to read (?:.\w+), for dot net, org, biz, etc. There are other exceptions to handle now, like Mr. and Mrs., but those are easily accounted for by reversing the .com test. I'll update when I have more.
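
In the pattern as posted, the dot inside (?:.\w+) is unescaped, so it matches any character and not only a literal dot. A variant with the dot escaped is sketched below. That change is an assumption about the intent and is not something posted in the thread; the test string is stitched together from the original $Value.

```autoit
#include <Array.au3>

; Two sentences from the original $Value: one ends in "?", one ends in an email address.
Local $sText = "This String is a test paragraph & a foil? Bill Gates email is billgates_110002@hotmail.com."

; Same pattern as above, except the dot in the optional group is escaped (assumption).
; (?:\.\w+)?+ optionally consumes ".com", ".net", etc. before the sentence terminator.
Local $sPattern = "(['""\w\d\(\),\;\:\-\@\&\s]+(?:\.\w+)?+[\.\!\?])"

Local $aSentences = StringRegExp($sText, $sPattern, 3)
If IsArray($aSentences) Then _ArrayDisplay($aSentences, "Sentences")
```

Either version will still merge two sentences when the space after a full stop is missing, since "end.Next" looks just like a domain suffix to the optional group.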

Edited by jrowe

Good one. I just made one slight change that you might want to consider

\b(['""\w\d\(\),\;\:\-\@\&\s]+(?:.\w+)?+[\.\!\?])

\b causes the regex to start at a word boundary instead of picking up any white space preceding it. You could also put \h* in front of the capturing group to do the same thing.
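
To compare the three variants side by side, something like the following could be used. It is a rough sketch with a made-up two-sentence test string and illustrative variable names; the brackets in the output are only there to make any leading white space visible.

```autoit
Local $sText = "First one. Second one!"

; Body of the sentence pattern from the post above, without the outer group.
Local $sBody = "['""\w\d\(\),\;\:\-\@\&\s]+(?:.\w+)?+[\.\!\?]"

; No prefix, \b prefix, and \h* prefix (the \h* sits outside the capturing group).
Local $aPatterns[3] = ["(" & $sBody & ")", "\b(" & $sBody & ")", "\h*(" & $sBody & ")"]

For $sPattern In $aPatterns
    ConsoleWrite("Pattern: " & $sPattern & @CRLF)
    Local $aHits = StringRegExp($sText, $sPattern, 3)
    ; UBound returns 0 for a non-array, so the loop is skipped if nothing matched.
    For $i = 0 To UBound($aHits) - 1
        ConsoleWrite("  [" & $aHits[$i] & "]" & @CRLF)
    Next
Next
```

One difference worth keeping in mind: \b needs a word character next to it, so a sentence that opens with a quote or a parenthesis would start matching at the first letter instead.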

George


I parse out white space during the word parse step, but that's a good one to know. :mellow:

I figured serial parsing made more sense for arbitrary English, so I have 3 stages so far: parsing paragraphs from the corpus, sentences from paragraphs, and words from sentences.
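
The three stages might be chained roughly as follows. This is a sketch under assumptions: paragraphs are taken to be separated by a blank line (two CRLFs), words are taken to be runs of word characters and apostrophes, and all variable names are illustrative.

```autoit
; Hypothetical three-stage pass: corpus -> paragraphs -> sentences -> words.
Local $sCorpus = "First paragraph here. It has two sentences!" & @CRLF & @CRLF & "Second paragraph?"
Local $sSentencePattern = "\b(['""\w\d\(\),\;\:\-\@\&\s]+(?:.\w+)?+[\.\!\?])"

; Stage 1: split into paragraphs on a blank line (assumed delimiter).
; Flag 1 = treat the whole delimiter string as one separator, flag 2 = no count in element 0.
Local $aParagraphs = StringSplit($sCorpus, @CRLF & @CRLF, 1 + 2)

For $sParagraph In $aParagraphs
    ; Stage 2: sentences from each paragraph.
    Local $aSentences = StringRegExp($sParagraph, $sSentencePattern, 3)
    If Not IsArray($aSentences) Then ContinueLoop

    For $sSentence In $aSentences
        ; Stage 3: words from each sentence (apostrophes stay inside words like "didn't").
        Local $aWords = StringRegExp($sSentence, "[\w']+", 3)
        ConsoleWrite(UBound($aWords) & " words: " & $sSentence & @CRLF)
    Next
Next
```

Splitting paragraphs first matters here because the sentence character class contains \s, which also matches line breaks, so the sentence pattern on its own would happily run across a paragraph boundary.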

My plan is to parse Corpus -> Paragraphs -> Sentences -> Words, then to rebuild sentences as sentence diagrams, then diagrams as microtheories that represent contextual groupings of word types, which in turn can be associated with data acquired from other sources, such as previous paragraphs, other texts read, and so on.

Ultimately, I'm aiming at a paragraph/story generator... give it a microtheory, such as "AutoIt web server catches fire, community is distraught, nobody knows what's going on" and have it generate a grammatically and syntactically correct paragraph conveying the thought. The challenge is in parsing enough sentence structures and identifying patterns that can be used algorithmically for the microtheories. Microtheories are a concept from the Cyc/OpenCyc project, meaning a bundle of assertions, or a concept dependent on previous data. Most inference engines have the equivalent.

To begin studying how I might solve that problem, I decided to parse large amounts of text. I'm going to parse several books, a few thousand news articles, maybe some Wikipedia articles and IRC logs, and see where I get. I'll probably have to add several thousand new words by hand, and I'll learn a lot about regular expressions during the process. Eventually, I'm hoping I'll have a large database of paragraph patterns representing various concepts (not dependent on the original inputs, but usable to generate new concepts and grammatical structures for those concepts).


Actually, that improvement was great, GEOSoft. That'll teach me to question an MVP :mellow:

\b(['""\w\d\(\),\;\:\-\@\&\s']+(?:.\w+)?+[\.\!\?]'?)

#include <Array.au3>
; Test paragraph: six sentences, including a quoted sentence and an email address.
$data = "Testing, 123, 456;this is a test. There are two sentences so far. Mister Jones said 'And now there are three.' Oh, and six sentences total in this paragraph. My email is blah@blahblah.net. I didn't lie!"
; Flag 3 returns an array of all matches.
Local $a = StringRegExp($data, "\b(['""\w\d\(\),\;\:\-\@\&\s']+(?:.\w+)?+[\.\!\?]'?)", 3)
_ArrayDisplay($a, "Extracted lines")

It allowed me to parse the quotes within a sentence correctly.

Mr., Mrs., and things like ellipses (...) and other grammatical and punctuation oddities will be parsed out and replaced with @<ellipses>@ or some such symbols, and filled back in after.
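
The replace-then-restore idea could be sketched like this. The placeholder tokens here are hypothetical and deliberately differ from the @<ellipses>@ form mentioned above: they use only word characters, because < and > are not in the sentence character class and a token starting with @ would be cut off by the leading \b.

```autoit
; Sketch: protect abbreviations and ellipses before splitting, restore them afterwards.
Local $sText = "Mr. Jones paused... then spoke. Mrs. Jones agreed."

; Original text in column 0, hypothetical placeholder token in column 1.
Local $aMap[3][2] = [["Mrs.", "__MRS__"], ["Mr.", "__MR__"], ["...", "__ELLIPSIS__"]]

; Step 1: swap each oddity for its placeholder (occurrence 0 = all, casesense 1 = case-sensitive).
For $i = 0 To UBound($aMap) - 1
    $sText = StringReplace($sText, $aMap[$i][0], $aMap[$i][1], 0, 1)
Next

; Step 2: split into sentences with the pattern from the thread.
Local $aSentences = StringRegExp($sText, "\b(['""\w\d\(\),\;\:\-\@\&\s]+(?:.\w+)?+[\.\!\?])", 3)

; Step 3: put the original text back into each sentence.
For $i = 0 To UBound($aSentences) - 1
    For $j = 0 To UBound($aMap) - 1
        $aSentences[$i] = StringReplace($aSentences[$i], $aMap[$j][1], $aMap[$j][0], 0, 1)
    Next
    ConsoleWrite($aSentences[$i] & @CRLF)
Next
```

An ellipsis that actually ends a sentence would be hidden by its placeholder too, so that case would need its own rule.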
