JRowe

Text search/replace formatting issue

2 posts in this topic

I'm trying to clean up text exported from PDF files. I'm running into two problems, and I was wondering if someone could show me regexp patterns that would solve them.

Both involve line breaks appearing in inappropriate places. The first is sentences that are split by line breaks. A sentence here is a run of words ending in a period, exclamation mark, or question mark. Some sentences contain parentheses; others are entirely enclosed in them.

The second is groups of 3 or 4 line breaks in a row (page breaks in the PDF document itself).
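A minimal sketch of the second case, assuming each run of three or more line breaks should collapse to a single line break (whether a page break should become one break, two, or a space is a guess here). `$sText` and `$sClean` are made-up names, and the sample string is invented for illustration.

```autoit
; Sketch only: $sText is a hypothetical sample, not text from the books.
Local $sText = "End of one page." & @CRLF & @CRLF & @CRLF & @CRLF & "Start of the next page."

; \R matches any newline sequence (CRLF, LF or CR); {3,} means three or more in a row.
; Assumption: each such run should become exactly one line break.
Local $sClean = StringRegExpReplace($sText, "\R{3,}", @CRLF)

ConsoleWrite($sClean & @CRLF)
```

Runs of one or two line breaks are left alone, so ordinary paragraph breaks survive.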

I'm trying to format several books in order to parse and analyze their grammatical structures (more on this later). After spending about an hour doing it manually, I realized that a simple set of regexp search-and-replace calls would handle the whole thing automagically.

Thanks for any help!

Samples:

"blahblahblah," character said, "blah blah blah

blahblah blah."

Sentence text

more, sentence text sentence sentence word

word. "Someone saying something something blah something

more from the same someone saying something."

Getting rid of all non-enclosed linebreaks would probably do the trick. At any rate, thanks for any help!
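One way to read "non-enclosed" is "not directly after sentence-ending punctuation". Under that assumption, a rough sketch could replace every run of line breaks with a space unless it follows `.`, `!` or `?`, optionally followed by a closing `"` or `)`. The variable names and the sample string are invented; curly quotes, abbreviations such as "Mr." at the end of a line, and headings without punctuation are not handled.

```autoit
; Sketch only: $sText is a hypothetical sample modelled on the ones above.
Local $sText = '"blah," character said, "blah blah blah' & @CRLF & @CRLF & _
        'blahblah blah."' & @CRLF & 'Next sentence.'

; Strip trailing spaces and tabs so the lookbehinds below see the last real character.
$sText = StringRegExpReplace($sText, "[ \t]+(?=\R)", "")

; (?<![.!?\r\n])    previous character is not . ! ? and not part of a line break,
;                   so a match cannot start in the middle of a run
; (?<![.!?][")])    previous two characters are not punctuation plus a closing " or )
; \R+               the whole run of line breaks, replaced by one space
Local $sJoined = StringRegExpReplace($sText, '(?<![.!?\r\n])(?<![.!?][")])\R+', " ")

ConsoleWrite($sJoined & @CRLF)
```

Here the break inside the quotation is joined, while the break after `blah."` is kept because it follows a period and a closing quote.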


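#include <Array.au3>

; Open the exported text for reading (mode 0) and stop if that fails.
Local $data = FileOpen("test.txt", 0)
If $data = -1 Then Exit MsgBox(16, "Error", "Could not open test.txt")

Local $chars = FileRead($data)
FileClose($data)

; Flag 3 returns an array of all matches: the shortest run of word
; characters and whitespace that ends in a period.
Local $a = StringRegExp($chars, "[\w\s]+?\.", 3)
_ArrayDisplay($a)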

That's what I've got so far. It reads sentences that end in periods, I guess, but only ones made up of word characters and whitespace. I can't figure out the rest, at least not yet.
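A possible next step, offered as a sketch rather than a finished answer: match "anything that is not sentence-ending punctuation", then the punctuation, then any closing quote or parenthesis. This lets commas, quotes and parentheses appear inside a sentence, which `[\w\s]` does not. The sample string and the names `$sText` and `$aSentences` are invented for illustration.

```autoit
#include <Array.au3>

; Sketch only: $sText is a hypothetical sample; real input would come from FileRead.
Local $sText = 'First sentence. "Is this a quote?" she said. (An aside!) Last one.'

; [^.!?]+   one or more characters that are not sentence-ending punctuation
;           (this includes commas, quotes, parentheses and line breaks)
; [.!?]+    the closing punctuation itself
; [")]*     any closing quote or parenthesis directly after it
Local $aSentences = StringRegExp($sText, '[^.!?]+[.!?]+[")]*', 3)

_ArrayDisplay($aSentences)
```

Two limitations to be aware of: each match after the first keeps the whitespace that preceded it, and a quotation ending in `?` or `!` is split from a following "she said." because the pattern cannot tell that the sentence continues.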
