JRowe Posted July 18, 2008

I'm trying to format text exported from pdf files. I'm running into a few problems, and I was wondering if someone could show me a regexp pattern which could solve them?

They both deal with line breaks appearing in inappropriate places. The first is sentences that are split by linebreaks. Sentences are words followed by a period, exclamation mark, or question mark. Some sentences have parentheses in them, others have parentheses around them.

The second is groups of 3 or 4 linebreaks (page breaks in the pdf doc itself).

I'm trying to format several books in order to parse and analyze the grammatical structures (more on this later). After spending about an hour doing it manually, I realized that a simple set of regexp search and replace functions would handle the whole thing automagically.

Thanks for any help!

samples:

"blahblahblah," character said, "blah blah blah
blah blah blah."

Sentence text
more, sentence text sentence sentence word
word. "Someone saying something something blah something
more from the same someone saying something."

Getting rid of all non-enclosed linebreaks would probably do the trick. At any rate, thanks for any help!
JRowe Posted July 18, 2008

    #include <Array.au3>

    ; Read the whole exported text file into one string.
    Local $data = FileOpen("test.txt", 0) ; 0 = read mode
    Local $chars = FileRead($data)
    FileClose($data)

    ; Flag 3 returns an array of all matches in the text.
    Local $a = StringRegExp($chars, "[\w\s]+?\.", 3)
    _ArrayDisplay($a)

That's what I've got so far, to read sentences that end in periods, I guess. Since \s also matches line breaks, a sentence split across lines is still captured as one match, but anything containing a comma, quote, or parenthesis gets cut short. I can't figure out the rest, at least not yet.
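A minimal sketch of the search-and-replace approach described in the first post, using AutoIt's StringRegExpReplace. It is not a tested solution for any particular book: the output file name "test_clean.txt" is hypothetical, and the rule for what counts as a sentence end (a period, exclamation mark, or question mark, optionally followed by a closing quote or parenthesis) is an assumption based on the samples.

```autoit
; Sketch only: file names are hypothetical, and the rules below are assumptions
; about what counts as a sentence end in the exported text.
Local $sText = FileRead("test.txt")

; Normalize CRLF / CR line endings to a single LF.
$sText = StringRegExpReplace($sText, "\r\n?", @LF)

; Drop trailing spaces and tabs so the look-behinds below see the punctuation.
$sText = StringRegExpReplace($sText, "[ \t]+\n", @LF)

; Collapse runs of 3 or more line breaks (page breaks) into one line break.
$sText = StringRegExpReplace($sText, "\n{3,}", @LF)

; Replace a lone line break with a space unless the text before it ends in
; . ! or ?, optionally followed by a closing quote or parenthesis.
$sText = StringRegExpReplace($sText, '(?<![.!?\n])(?<![.!?][")])\n(?!\n)', " ")

; Write the result to a new file (mode 2 = erase previous contents and write).
Local $hOut = FileOpen("test_clean.txt", 2)
FileWrite($hOut, $sText)
FileClose($hOut)
```

The order matters: page-break runs are collapsed first, so a sentence that continues across a page break is joined by the last replace like any other split sentence. Blank-line paragraph breaks (exactly two line breaks) are left alone. One known weakness of this rule is that a line ending in an abbreviation such as "Mr." is treated as a sentence end and will not be joined to the next line.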