Realm

StringRegExp Pattern

27 posts in this topic

#1 ·  Posted (edited)

I have this one particular file I just can not formulate a proper StringRegExp pattern for and could use some input.

This is an example of the File I am having problems with:

Company                                         Est    Actual   Supervisor[Employees]
=== April 2012 ==== [Total Estimate/Total Actual: 88/99] =========================================================================================
Henderson, Smith and Thurman Project [HST]       136.25 145.5    Anderson[2] Hoskins[1] Tillman[3]
Tillman Remod [TR]                               71.5     70        Anderson[3]
Baker Automotive Supplies Project [BAS]         212             
Miller Sports Arena Project [MSA]                           451.1833  Baker2[14]

Each document has multiple sections similar to the one above.

Each line is a maximum of 147 characters in length. In this document...

The first field is always 51 characters in length,

The second and third are each 10 characters in length,

The last field varies to the end of text.

I am asked to extract to these new fields:

Company

[Company code]; Not including the brackets

Estimated number,

Actual Number

and Total Employees under supervision

Example

Henderson, Smith and Thurman Project

HST

136.25

145.5

6

I have spent a couple of hours trying to formulate an SRE pattern to at least capture the 4 separate fields, to no avail. When the last field has no data present, it captures the first letter of the first field on the next line with this example:

$text = 'Company                                            Est    Actual   Supervisor[Employees] ' & @CRLF _
      & '=== April 2012 ===================================================================================================================================' & @CRLF _
      & 'Henderson, Smith and Thurman Project [HST]      136.25 145.5    Anderson[2] Hoskins[1] Tillman[3] ' & @CRLF _
      & 'Tillman Remod [TR]                              71.5     70        Anderson[3] ' & @CRLF _
      & 'Baker Automotive Supplies Project [BAS]            212              ' & @CRLF _
      & 'Miller Sports Arena Project [MSA]                          451.1833  Baker2[14] ' & @CRLF

$SRE = StringRegExp( $text, '(.*?)\s\[(.*?)\]\s*(\d*.\d*)\s*(\d*.\d*)\s*(.*)\r\n', 3)

_ArrayDisplay($SRE)

Thanks in advance for any help!

Realm

Edit: Added an Example File

I could not find where to upload my file via my Profile page, so I hope this was the correct way.

http://www.autoitscript.com/forum/files/file/193-projectledgertxt/

Edited by Realm

My Contributions: Unix Timestamp: Calculate Unix time, or seconds since Epoch, accounting for your local timezone and daylight savings time. RegEdit Jumper: A Small & Simple interface based on Yashied's Reg Jumper Function, for searching Hives in your registry.  




Do it one line at a time, instead of the whole thing?


If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator


I tried going line by line with _FileReadToArray() last night, but it took more time than I had expected. I have almost 30 years' worth of these documents to convert, and they have already changed the data format they want twice, so I was hoping for something a little quicker in case they change how they want the data extracted again. With SRE I can grab the data from each file in just a few seconds; reading line by line, even through an array, took roughly 30 seconds to a minute per document. With over 2,000 similar documents to convert, I was really hoping for the faster solution. However, if this is the only way... so be it!



The problem with SRE is that it's unforgiving if the data isn't exactly the same and your search pattern doesn't take that into account. Make sure that you are only searching each line up until the line end (\n\r if I'm not mistaken) so that it's not trying to get something from the next line.

YMMV because I haven't dipped into Regex yet so I'm guessing. ;)
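To illustrate the "stop at the line end" idea, here is a minimal sketch on a made-up two-line sample (not the real ledger data). It assumes the PCRE build in your AutoIt version supports \h (horizontal whitespace). The first pattern uses \s*, which is allowed to swallow the line break; the second uses \h*, which is not.

```autoit
#include <Array.au3>

; Made-up two-line sample: the first record has nothing after its number.
Local $sText = 'AAA [A]   212   ' & @CRLF & 'BBB [B]   7  ' & @CRLF

; \s* may swallow the CRLF, so the last group can reach into the next line.
Local $aLoose = StringRegExp($sText, '\[(\w+)\]\s*(\d+)\s*(\w*)', 3)

; \h* stops at the line break, so the last group stays empty instead.
Local $aTight = StringRegExp($sText, '\[(\w+)\]\h*(\d+)\h*(\w*)', 3)

ConsoleWrite('\s version: ' & _ArrayToString($aLoose, '|') & @CRLF)
ConsoleWrite('\h version: ' & _ArrayToString($aTight, '|') & @CRLF)
```

With the \s version, the third capture of the first record picks up text from the following line, which is the same kind of spill-over described in the first post.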



GEOSoft would have fun with this one ;)


Realm,

You may want to supply an actual file that you are parsing.

kylomas


Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill


If the fields are fixed length, I wouldn't use a regex right away. I'd split the lines by character position, then use a regex on the pieces, although even that may not be necessary.
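A rough sketch of that split-by-position approach, assuming the 51/10/10 widths stated in the first post really hold. _SplitRecord is a hypothetical helper name, and the sample line is built with StringFormat only so the columns are guaranteed to line up.

```autoit
; Hypothetical helper, not from the thread: slices one record by column position.
; The widths 51 / 10 / 10 are the ones given in the first post (an assumption here).
Func _SplitRecord($sLine)
    Local $aOut[4]
    $aOut[0] = StringStripWS(StringMid($sLine, 1, 51), 3)  ; company name plus [code]
    $aOut[1] = StringStripWS(StringMid($sLine, 52, 10), 3) ; estimate, may be empty
    $aOut[2] = StringStripWS(StringMid($sLine, 62, 10), 3) ; actual, may be empty
    $aOut[3] = StringStripWS(StringMid($sLine, 72), 3)     ; supervisors, to end of line
    Return $aOut
EndFunc

; Build a padded test line so the columns are exact.
Local $sLine = StringFormat('%-51s%-10s%-10s%s', 'Tillman Remod [TR]', '71.5', '70', 'Anderson[3]')
Local $aRec = _SplitRecord($sLine)

; A small regex on the first piece only: name, then the code without brackets.
Local $aName = StringRegExp($aRec[0], '^(.*?)\h*\[([^\]]*)\]$', 1)
If Not @error Then ConsoleWrite($aName[0] & '|' & $aName[1] & '|' & $aRec[1] & '|' & $aRec[2] & '|' & $aRec[3] & @CRLF)
```

Because each piece is cut by position first, an empty Est or Actual column simply comes back as an empty string instead of shifting the later fields.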


Gerard J. Pinzone, gpinzone AT yahoo.com


#8 ·  Posted (edited)

@BrewManNH,

In my example I had included \r\n, which is the correct order for this file. However, it is still expecting a fourth field and pulling in the first character of the next line.

@kylomas,

I really wish I could since it could possibly make a difference, however being a third party in the contract, I am not certain if I can post data publicly that could be considered confidential or copyrighted. I attempted to provide an example that portrays the actual data as accurately as possible.

@GPinzone,

At first, when I realized how difficult an SRE pattern was going to be, I did script something similar to your suggested approach, and it yielded exactly the needed results. However, as I mentioned before, this process is a little more time consuming than I originally anticipated, and I may have to rerun this project a few times across a large number of documents. I was hoping for a quicker solution, as SRE generally provides.

I also noticed that it can have errors when there is data missing from the middle 2 fields as well, so I'm gonna go ahead and finish this with line by line reading. Next time I'll investigate file formats a bit deeper before bidding a project. Thanks for your input and attempts to help guys!

Realm

Edit: Typos

Edited by Realm



No, they aren't the fixed lengths you mention, at least in your sample. That would be too easy, probably!

If they were actually fixed length, the first would really be something like 48c, and the first supervisor name would start at a constant column.

Please confirm the format and one could eventually come up with a single regexp, even if some fields are left as whitespaces.

Note that the time increase you experience by applying a regexp to every line in turn in a loop over an array is due to the time PCRE needs for compiling the regexp, added to the time of function invocation. If you apply the regexp (even a long and complex one) once to a file loaded as one block of text, the compilation takes place only once, hence the speed gain. That's why it makes sense to try to come up with a huge, awful, ugly regexp able to digest every case, but you need to be explicit about what "all the cases" exactly are.
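As a sketch of the "load once, match once" shape described above: the file name and the deliberately simple pattern below are placeholders, not the real solution. The pattern only pulls the bracketed company codes, just to show a single StringRegExp call working over the whole file.

```autoit
; Sketch only: the path and the pattern are illustrative assumptions.
Local $sFile = 'ProjectLedger.txt'   ; hypothetical file name
Local $sText = FileRead($sFile)      ; whole file as one block of text
If @error Then
    ConsoleWrite('Could not read ' & $sFile & @CRLF)
    Exit 1
EndIf

; One call for the entire text; flag 3 returns every match in a flat array.
; (?m) makes ^ anchor at the start of each line.
Local $aCodes = StringRegExp($sText, '(?m)^[^\[\r\n]*\[([A-Z]{1,3})\]', 3)
If @error Then
    ConsoleWrite('No records matched' & @CRLF)
    Exit 2
EndIf

ConsoleWrite(UBound($aCodes) & ' company codes found' & @CRLF)
```

The same structure applies to a much larger pattern: only the second argument changes, while the file is still read and scanned in one pass.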


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


#10 ·  Posted (edited)

@jchd

They are fixed lengths. There are over 2000 documents with this format alone. I didn't manually check every single one, but I did pick 4 random ones: 1 from this year, 1 from the year 1985, and 2 others from randomly in between. The documents are exactly as I had described. I don't know why my cut'n'paste does not show the data aligned correctly. I can provide a file with my reproducer example, which I will do after I finish this post.

Edit:

I completely agree with you jchd, I really was hoping, praying, I could find help with the pattern.

I generally use Notepad++ to read my text files... and I just opened this file with the normal Notepad that comes with Windows, and noticed that it does not align correctly there, but does in Notepad++. The script I wrote last night, which reads line by line, relies on the fixed lengths to extract the four initial fields before it extracts and converts the rest of the data. The last field may not be exact or uniform; however, the first three are, in order: 51c, 10c, 10c.

Edited by Realm


Meanwhile may I ask the maximum number of supervisors (saying 20 is no big problem, but it has to be upper-bounded).

If it works, get ready for a really insane and ugly regexp, albeit fairly simple!



At this moment, they don't want/care about the supervisors' names. They did originally, but later in the day changed their minds: all that was needed was the total number of employees under those supervisors. However, in the 2012 files there are 7 different supervisors; in 2011 there were 6. So even if the names mattered, they would be inconsistent across 30 years' worth of documents.

So for the data 'Anderson[2] Hoskins[1] Tillman[3]' in the first record, they only need '6', the total of the employees that worked/are working on the project. However, and I could be wrong, I don't think SRE can add those for me, but if I could just pull that field as a string, I can loop through it to extract and add those figures later.
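For the "loop through and add" step, a small sketch could look like this. _SumEmployees is a hypothetical helper name; it assumes the supervisor field has already been isolated as a string and that every count appears as digits inside square brackets.

```autoit
; Hypothetical helper: adds up every bracketed number in the supervisor field.
Func _SumEmployees($sField)
    ; Flag 3 returns all captured counts in one flat array.
    Local $aCounts = StringRegExp($sField, '\[(\d+)\]', 3)
    If @error Then Return 0 ; no supervisors listed on this record
    Local $iTotal = 0
    For $i = 0 To UBound($aCounts) - 1
        $iTotal += Number($aCounts[$i])
    Next
    Return $iTotal
EndFunc

; 2 + 1 + 3 = 6, the total wanted for the first record
ConsoleWrite(_SumEmployees('Anderson[2] Hoskins[1] Tillman[3]') & @CRLF)
; An empty field simply yields 0
ConsoleWrite(_SumEmployees('') & @CRLF)
```

This keeps the arithmetic out of the regexp entirely: the pattern only finds the numbers, and plain AutoIt adds them.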



#13 ·  Posted (edited)

I understood that names were irrelevant. Just wanted to know how many (?:[^[]+\[(\d+)\])? were needed! So let's keep it safe and say 10. Yes, the employees total will have to be computed elsewhere (but that's easy).

Also, it seems that both Estimate and Total values could be optional. Is that correct?

Edited by jchd


Ah, sorry I misunderstood what you were asking. Each record is different, I don't recall seeing more than 5 or 6 supervisors on a shared project across the files. Most commonly there is 1 to 4.

Yes, for the most part records have both. However, there are a few projects per 1000 that do not have estimate totals. Almost all the old files have actual totals, but the 2012 and 2011 files appear to have some of those figures missing, probably for unfinished or yet-to-be-started projects.



OK so far. Now you'll need a bit of patience as I put all this together. (Just noticed the fixed-length confirmation you edited above.)

So if it's 51, 10 & 10, let's go for that...



Patience I have, attitude to boot. However, they are rarely found hand in hand. jj

I appreciate the help, jchd. This boggled me for a couple of hours last night, and a few today as well. There are only a few members of these forums that I know of who can pull off the really tough SRE patterns; you, sir, are most definitely one of them. I'm pretty sure it was you that helped me with one two years back that darn near made me go insane!



#17 ·  Posted (edited)

Fairly possible!

I'm at it, but still need some sewing up.

Here's an unstable preview of the very start:

Edit: but don't rely on it: it's already wrong!!!

Edit2: so wrong that I remove that shit. Below is much better.

Edited by jchd


Yikes, this is more complex than I thought, and is very much above my skill level. Thanks for providing the comments, I really enjoy learning about these complex procedures.

Correct me if I am off base, but (?(DEFINE) ;SRE code) is kind of like a user function but assigned like an html variable or script, to be called upon later post-definition? Or is this being put to use where it is defined?



#19 ·  Posted (edited)

It's just a named subroutine, a subpattern which you decide will have a name. That's handy in complex expressions to avoid repeating series of hard to read patterns.
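A tiny sketch of such a named subpattern, for readers who have not met the construct. The name "num" and the sample string are illustrative only, and this assumes the PCRE version bundled with your AutoIt supports (?(DEFINE)...) and (?&name) calls.

```autoit
; Sketch of a named subpattern; "num" is an illustrative name, not from the thread.
; The DEFINE block matches nothing where it stands; it only declares "num".
; Each (?&num) then calls that subpattern, like calling a function.
Local $sPattern = '(?x)' & _
        '(?(DEFINE) (?<num> \d+ (?: \. \d+ )? ) )' & _
        '^ ((?&num)) \h+ ((?&num)) $'

Local $aPair = StringRegExp('136.25 145.5', $sPattern, 1)
If @error Then
    ConsoleWrite('No match' & @CRLF)
Else
    ; Print every returned slot; the group declared inside DEFINE also counts
    ; as a capturing group, so expect a slot for it before the two values.
    For $i = 0 To UBound($aPair) - 1
        ConsoleWrite($i & ': "' & $aPair[$i] & '"' & @CRLF)
    Next
EndIf
```

The payoff is readability: the number syntax is written once and reused, instead of being pasted into every place a number may appear.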

Forget the above crap code: it can't work this way. Gonna remove it.

Now I need to fix THAT:

EDIT

Try this and tell us where it bursts on your actual data. Add error code ad lib.

Note that you may need less than 8 supervisors' employee count, but it won't change runtime much.

Local $text = 'Company                                          Est    Actual   Supervisor[Employees] ' & @CRLF _
      & '=== April 2012 ===================================================================================================================================' & @CRLF _
      & 'Henderson, Smith and Thurman Project [HST]      136.25 145.5    Anderson[2] Hoskins[1] Tillman[3] ' & @CRLF _
      & 'Acme Inc. [ACM]                                    41.65    193       Anderson[99] ' & @CRLF _
      & 'Disney Software [DSN]                                      488    ' & @CRLF _
      & 'Donald Duck [DND]                                                    ' & @CRLF _
      & 'Tillman Remod [TR]                              71.5     70        Anderson[3] ' & @CRLF _
      & 'Baker Automotive Supplies Project [BAS]            212              ' & @CRLF _
      & 'Miller Sports Arena Project [MSA]                          451.1833  Baker2[14] ' & @CRLF

Local $SRE = StringRegExp($text, _
    "(?mx) (?# PCRE comment example)" & _
    "     (?# the m option is 'multi-line')" & _
    "     (?# We use option x to allow for unsignificant whitespaces in regexp [much easier to read])" & _
    "^   (?# anchor here at line break or start of subject)" & _
    "(?|   (?# this requests for capturing pattern number re-use in alternation)" & _
    "   ([^[]  {1}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{44}     (?#  1-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[]  {2}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{43}     (?#  2-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[]  {3}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{42}     (?#  3-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[]  {4}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{41}     (?#  4-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[]  {5}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{40}     (?#  5-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[]  {6}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{39}     (?#  6-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[]  {7}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{38}     (?#  7-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[]  {8}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{37}     (?#  8-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[]  {9}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{36}     (?#  9-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {10}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{35}     (?# 10-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {11}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{34}     (?# 11-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {12}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{33}     (?# 12-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {13}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{32}     (?# 13-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {14}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{31}     (?# 14-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {15}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{30}     (?# 15-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {16}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{29}     (?# 16-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {17}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{28}     (?# 17-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {18}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{27}     (?# 18-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {19}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{26}     (?# 19-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {20}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{25}     (?# 20-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {21}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{24}     (?# 21-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {22}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{23}     (?# 22-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {23}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{22}     (?# 23-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {24}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{21}     (?# 24-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {25}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{20}     (?# 25-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {26}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{19}     (?# 26-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {27}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{18}     (?# 27-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {28}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{17}     (?# 28-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {29}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{16}     (?# 29-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {30}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{15}     (?# 30-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {31}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{14}     (?# 31-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {32}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{13}     (?# 32-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {33}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{12}     (?# 33-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {34}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{11}     (?# 34-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {35}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{10}     (?# 35-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {36}) s [ (?| ([A-Z]) ] ss | ([A-Z]{2}) ] s | ([A-Z]{3}) ] ) s{9}   (?# 36-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {37}) \s \[ (?| ([A-Z]) \] \s\s | ([A-Z]{2}) \] \s | ([A-Z]{3}) \] ) \s{8}   (?# 37-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {38}) \s \[ (?| ([A-Z]) \] \s\s | ([A-Z]{2}) \] \s | ([A-Z]{3}) \] ) \s{7}   (?# 38-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {39}) \s \[ (?| ([A-Z]) \] \s\s | ([A-Z]{2}) \] \s | ([A-Z]{3}) \] ) \s{6}   (?# 39-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {40}) \s \[ (?| ([A-Z]) \] \s\s | ([A-Z]{2}) \] \s | ([A-Z]{3}) \] ) \s{5}   (?# 40-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {41}) \s \[ (?| ([A-Z]) \] \s\s | ([A-Z]{2}) \] \s | ([A-Z]{3}) \] ) \s{4}   (?# 41-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {42}) \s \[ (?| ([A-Z]) \] \s\s | ([A-Z]{2}) \] \s | ([A-Z]{3}) \] ) \s{3}   (?# 42-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {43}) \s \[ (?| ([A-Z]) \] \s\s | ([A-Z]{2}) \] \s | ([A-Z]{3}) \] ) \s{2}   (?# 43-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {44}) \s \[ (?| ([A-Z]) \] \s\s | ([A-Z]{2}) \] \s | ([A-Z]{3}) \] ) \s{1}   (?# 44-char company name followed by company code)" & _
    "   | (?# or )" & _
    "   ([^[] {45}) \s \[ (?| ([A-Z]) \] \s\s | ([A-Z]{2}) \] \s | ([A-Z]{3}) \] )          (?# 45-char company name followed by company code)" & _
    ")   (?# end of renumbering alternation)" & _
    "     (?# now pick up the two values or whitespaces)" & _
    "( \d+\.\d+ \s{1,7}| \d+ \s{1,9} | \s{10} )" & _
    "( \s{10,} | \s* \d+\.\d+ | \s* \d+ )" & _
    "     (?# now grab as many employees number as possible; let's say 8. Force a capture in all cases)" & _
    "(?| [^[]* \[ (\d+) \] \s | () )" & _
    "(?| [^[]* \[ (\d+) \] \s | () )" & _
    "(?| [^[]* \[ (\d+) \] \s | () )" & _
    "(?| [^[]* \[ (\d+) \] \s | () )" & _
    "(?| [^[]* \[ (\d+) \] \s | () )" & _
    "(?| [^[]* \[ (\d+) \] \s | () )" & _
    "(?| [^[]* \[ (\d+) \] \s | () )" & _
    "(?| [^[]* \[ (\d+) \] \s | () )" & _
    "\s?   (?# possible final whitespace)" & _
    "$   (?# anchor here at line break or end of subject)" & _
    "     (?# and that's it)", 3)
_ArrayDisplay($SRE)
Local $j
For $i = 0 To (UBound($SRE) - 1) / 12
    $j = 12 * $i
    ConsoleWrite('"' & $SRE[$j] & '",' & _
                 '"' & $SRE[$j + 1] & '",' & _
                 Number($SRE[$j + 2])& ',' & _
                 Number($SRE[$j + 3])& ',' & _
                 Number($SRE[$j + 4]) + Number($SRE[$j + 5]) + Number($SRE[$j + 6]) + Number($SRE[$j + 7]) + _
                 Number($SRE[$j + 8]) + Number($SRE[$j + 9]) + Number($SRE[$j + 10]) + Number($SRE[$j + 11]) & @LF)
Next

PLEASE, people, be kind and don't laugh at the mistakes I forcibly made in the above!

Time for debugging now...

Geez, I've been repeatedly editing ad nauseam a copy of the regexp which was steadily passivated outside actual code...

I needed some sleep, and it proved useful.

Edited by jchd

#20 ·  Posted (edited)

And another example.

#include <Array.au3>

Local $text = 'Company.............................................Est.......Actual....Supervisor[Employees]' & @CRLF _
            & '=== April 2012 ===================================================================================================================================' & @CRLF _
            & 'Henderson, Smith and Thurman Project [HST]__________136.25____145.5_____Anderson[2] Hoskins[1] Tillman[3]    ' & @CRLF _
            & 'Tillman Remod [TR]__________________________________71.5______70________Anderson[3]                    ' & @CRLF _
            & 'Baker Automotive Supplies Project [BAS]_____________212                                              ' & @CRLF _
            & 'Miller Sports Arena Project [MSA]_____________________________451.1833  Baker2[14] ' & @CRLF
;Local $text = FileRead("RealmTestFile.txt")

Local $SRE = StringRegExp($text, '(?m)^(.{0,51})(.{0,10})(.{0,10})(.+)$', 3)
Local $sNewFile = "=========================================================================================" & @LF
$sNewFile &= StringReplace(StringRegExpReplace($text, "^(.*?\v+.*?\v+)(?s)(?:.*$)", "$1"), "=", " ")
;_ArrayDisplay($SRE)

For $i = 8 To UBound($SRE) - 1 Step 4
    $sNewFile &= StringFormat("%-51s%-10s%-10s%-6s\n", StringStripWS(StringRegExpReplace($SRE[$i], "(\[.*?\])", ""), 3), $SRE[$i + 1], $SRE[$i + 2], Execute(StringTrimRight(StringRegExpReplace($SRE[$i + 3], "\s*(?:.*?\[(\d*?)\].*?)\s*", "$1+"), 1)))
Next

ConsoleWrite($sNewFile & @LF)

#cs
Local $file, $sNewFileName = "NewRealmTestFile.txt"
If FileExists($sNewFileName) = 0 Then FileOpen($sNewFileName, 0)
$file = FileOpen($sNewFileName, 1) ; Append
FileWrite($file, $sNewFile)
FileClose($file)
ShellExecute($sNewFileName)
#ce

Edit: It appears the number of spaces in the data gets altered when posting to this forum, so after many edits I replaced the padding spaces with underscores.
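
For anyone who would rather not build an expression string and pass it to Execute, the employee total can also be computed with a plain loop over the bracketed numbers. This is only a minimal sketch, assuming the supervisor field looks like the sample data (names followed by a count in square brackets); _SumBracketed is a hypothetical helper name, not something from the scripts posted in this thread.

```autoit
; Sketch only: _SumBracketed is a hypothetical helper, not part of the scripts above.
; It adds up every integer that appears between square brackets in $sField.
Func _SumBracketed($sField)
    ; Flag 3 returns an array of all global matches (here: the captured digits).
    Local $aNums = StringRegExp($sField, '\[(\d+)\]', 3)
    If @error Then Return 0 ; no bracketed number found, e.g. an empty supervisor field

    Local $iTotal = 0
    For $i = 0 To UBound($aNums) - 1
        $iTotal += Number($aNums[$i])
    Next
    Return $iTotal
EndFunc   ;==>_SumBracketed

ConsoleWrite(_SumBracketed('Anderson[2] Hoskins[1] Tillman[3]') & @LF) ; 2 + 1 + 3
ConsoleWrite(_SumBracketed('Baker2[14]') & @LF) ; the 2 in "Baker2" is not bracketed, so it is ignored
ConsoleWrite(_SumBracketed('') & @LF) ; empty field gives 0
```

Because only digits inside brackets are captured, nothing from the input file is ever evaluated as code, which is the main reason to prefer this over Execute when the file contents are not fully trusted.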

Edited by Malkey
