RegEx help requested

saywell

Hi all.

I'm hopeless at regexes and I don't use them enough to get skilled. Perhaps when I retire [less than a year now!!] I can find some time to learn systematically. But meanwhile this is just a cop-out plea for help - sorry.

I'm trying to find a way to get a set of document properties out of RTF documents created in MS Word. I can do it using the Word UDF, but our IT department has just upgraded to Office 2010 on machines installed in the mid-90s, so it doesn't run at all fast [especially with network delays included, as the files are on a server], and repetitive operations keep throwing up errors.

The particular property is set from within my program in a structured form, so should be regex-able. If I open the file with Notepad I can see it amongst the garbage after the text.

This is an example :

446-431-1070, 509663, SAYWELL, William, LOS, JSB, Obstetrics, 2012

The first group is the NHS number [this one altered to anonymise] which is in the format ddd-ddd-dddd [though a few patients don't have NHS numbers and it might be entered as 000-000-0000 or "unavailable"]

The second is the hospital number, always 6 digits

The third is the surname, so letters, but it may have a hyphen, apostrophe or white space. Not case-specific.

Fourth is the first name - letters

Fifth and sixth are the secretary and author codes. Upper case; usually 3 letters but may be more than 3

7th is Specialty name which is free text and may include spaces and characters like ampersand. Eg "Obs and Gynae"; "Trauma & Orthopaedics"

The last is the year of creation.

Each is separated by a comma.
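
As a rough sketch only, the field list above might translate into a pattern like the one below. It assumes the word "unavailable" appears literally in place of the NHS number, that the codes are at least two letters long, and that no field contains a comma. The variable names ($sSep, $sPattern, $sSample) are made up for illustration, and the pattern has only been reasoned through against the example string, not run against real files.

```autoit
; Hypothetical pattern assembled field by field from the description above.
Local $sSep = "\s*,\s*" ; comma with optional white space on either side
Local $sPattern = "(?i)" ; case-insensitive, since the surname is not case-specific
$sPattern &= "(\d{3}-\d{3}-\d{4}|unavailable)" & $sSep ; NHS number, or the word "unavailable"
$sPattern &= "(\d{6})" & $sSep ; hospital number, always 6 digits
$sPattern &= "([a-z][a-z' \-]*)" & $sSep ; surname: letters, apostrophe, space, hyphen
$sPattern &= "([a-z]+)" & $sSep ; first name: letters only
$sPattern &= "([a-z]{2,})" & $sSep ; secretary code
$sPattern &= "([a-z]{2,})" & $sSep ; author code
$sPattern &= "([^,\r\n]+?)" & $sSep ; specialty: free text up to the next comma
$sPattern &= "(\d{4})" ; year of creation

Local $sSample = "446-431-1070, 509663, SAYWELL, William, LOS, JSB, Obstetrics, 2012"

; Flag 1 returns the capture groups of the first match as a 0-based array.
Local $aFields = StringRegExp($sSample, $sPattern, 1)
If @error Then
    ConsoleWrite("No match" & @CRLF)
Else
    For $i = 0 To UBound($aFields) - 1
        ConsoleWrite($i & ": " & $aFields[$i] & @CRLF)
    Next
EndIf
```

Building the pattern one field at a time makes it easier to loosen a single field later, for example if a first name turns out to contain a hyphen.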

If any of you RegEx gurus can come up with something that matches that lot, I'd be most grateful!!

Regards,

William

UEZ

Try this:

#include <Array.au3>
$sString = "446-431-1070, 509663, SAYWELL, William, LOS, JSB, Obstetrics, 2012"
$aTokens = StringRegExp($sString, "(\d{3}-\d{3}-\d{4})\s*,\s*(\d{6})\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*(\d{4})", 3)

_ArrayDisplay($aTokens)

The regex can be shortened, but this form is easier to read.

Br,

UEZ


Please don't send me any personal message and ask for support! I will not reply!

Selection of finest graphical examples at Codepen.io

The own fart smells best!
Her 'sikim hıyar' diyene bir avuç tuz alıp koşma!
¯\_(ツ)_/¯  ٩(●̮̮̃•̃)۶ ٩(-̮̮̃-̃)۶ૐ

jchd

OTOH you could simply try StringSplit if the field structure is consistent across entries.

Just make sure you specify the correct separator (comma or comma + blank) and the corresponding option (see help file).
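
A minimal sketch of that suggestion, assuming the record has already been isolated and that no field contains a comma of its own. The variable names are illustrative only.

```autoit
Local $sRecord = "446-431-1070, 509663, SAYWELL, William, LOS, JSB, Obstetrics, 2012"

; Flag 1 treats ", " as a single delimiter string instead of splitting
; on the comma and on the space separately.
Local $aFields = StringSplit($sRecord, ", ", 1)

; Element 0 holds the number of fields found.
If $aFields[0] = 8 Then
    For $i = 1 To $aFields[0]
        ; Flag 3 strips leading and trailing white space from each field.
        ConsoleWrite($i & ": " & StringStripWS($aFields[$i], 3) & @CRLF)
    Next
Else
    ConsoleWrite("Unexpected field count: " & $aFields[0] & @CRLF)
EndIf
```

If some entries are separated by a bare comma without the blank, splitting on "," with the default flag and trimming each field afterwards is probably the safer choice.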


saywell

Thanks.

Unfortunately the regex didn't work on a couple of 'real world' examples, which I got by opening the Word RTF document in Notepad and copying the whole thing.

StringSplit will be the next step, once the string to split has been isolated by the regex. At present, it nestles amongst a load of other apparently random characters, some of which Notepad can only resolve as a little square.
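
The two-step idea (isolate with a regex, then split) could be sketched roughly as below. The surrounding junk in $sBlob is invented for illustration, and the pattern assumes exactly eight comma-separated fields with no commas inside any of them.

```autoit
; Made-up surrounding text; only the record in the middle comes from the example above.
Local $sBlob = "xx@@ 446-431-1070, 509663, SAYWELL, William, LOS, JSB, Obstetrics, 2012 yy##"

; One capture group around the whole record: NHS number, hospital number,
; five free-form fields, then a four-digit year.
Local $sIsolate = "(\d{3}-\d{3}-\d{4}\s*,\s*\d{6}(?:\s*,[^,]+){5}\s*,\s*\d{4})"
Local $aHit = StringRegExp($sBlob, $sIsolate, 1)

If IsArray($aHit) Then
    ; Split the isolated record on commas, then trim each field.
    Local $aFields = StringSplit($aHit[0], ",")
    For $i = 1 To $aFields[0]
        ConsoleWrite($i & ": " & StringStripWS($aFields[$i], 3) & @CRLF)
    Next
Else
    ConsoleWrite("Record not found" & @CRLF)
EndIf
```

Keeping the isolating pattern loose and leaving the field-level checks to the code after the split tends to be easier to debug than one long pattern with eight groups.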

William

UEZ

Can you post a real example and we can check for a better solution? Otherwise it is hard to find a solution which will fit your needs.

Br,

UEZ


saywell

Here is a snippet from a word-created RTF opened in notepad, suitably anonymised:

Ô à

$ 0 8 @ H ä 4 S:ClinicalDocumentsReal_Patient_Data08013308 < Letter from Clinic: FRACTURE CLINIC dated: 17 May 2012 Samantha Jones X 123-456-7890, 123456, VEGAS, Jonny, JRG, HJ, Department of Trauma & Orthopaedics, 2012 Clinic_letter.dot IT Services 3 Microsoft Office Word @ ´V @ Zzö5Í@ v Õã4Í@ Zzö5Í û

__________________________________________________________________________________________________

and another:

I T S e r v i c e s þÿ à…ÿòùOh«‘ +'³ù0 0 ˜ Ü , | ˆ ´ à

à ì ø

{µùJì º n

u´ tV» ùJ½ `¿ ÔÇ hÉ 5Ê ½Ê ±

Ë nÐ vDÚ 7kÝ z=â Ïqã bsä Eè ÚyÉ )Ð yAö Ô(÷ zyù pû ÿ , - 4 5 > ? H I c d ~ ‘ ’ Ë ì q Ž Ž Ž Ž Ž Ž Ž Ž † ÿ@€ Ç Ç ¨˜³ Ç Ç p ` @ ÿÿ U n k n o w n ÿÿ ÿÿ ÿÿ ÿÿ ÿÿ ÿÿ G †z € ÿ T i m e s N e w R o m a n 5 € S y m b o l 3& †z € ÿ A r i a l 5& †z a € ÿ T a h o m a " qˆ ÐÐ h ÊË÷&ÏË÷&ÏË÷& º n 6 ƒW º n 6 W ! Ð ŠŠx £ ‚€24 d ò ò 3ƒ Ðßß HX )Ðÿ ? ä ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿîH, 2 ÿÿ U : C l i n _ D o c s C l i n i c _ l e t t e r . d o t 1 S : C l i n i c a l D o c u m e n t s R e a l _ P a t i e n t _ D a t a 0 8 0 2 1 8 0 8 0 L e t t e r f r o m C l i n i c : J A G / J D d a t e d : 2 5 J u l y 2 0 1 1 G 4 8 2 - 4 I T S e r v i c e s þÿ à…ÿòùOh«‘ +'³ù0 0 ˜ Ü , | ˆ ´ à

à ì ø

( ä 4 S:ClinicalDocumentsReal_Patient_Data08021808 4 Letter from Clinic: JAG/JD dated: 25 July 2011 Mary Moneypenny H 433-999-4466, 221333, Parker-Bowles, Camilla, JAG, HS, OBS & GYNAE, 2011 Clinic_letter IT Services 3 Microsoft Office Word @ ^в @

{µùJì@ ¬ªùJì@

{µùJì º n

Meanwhile, I'm trying to find another approach to the problem, not just because of this, but because, even automated, the Notepad workaround is kludgy, and FileRead doesn't work [only in binary mode, which I don't know how to deal with thereafter].

William

PS to generate more examples, create an RTF from Word, add some properties and a bit of dummy text, and save it.

Malkey

Maybe something like this.

#include <GuiRichEdit.au3>
#include <Array.au3>

Local $aArray1 = Main("RTF_FullPath_FileName.rtf")
_ArrayDisplay($aArray1)


Func Main($sFileName)
    Local $hGui, $hRichEdit, $sFile
    $hGui = GUICreate("")
    $hRichEdit = _GUICtrlRichEdit_Create($hGui, FileRead($sFileName), 10, 10)
    $sFile = _GUICtrlRichEdit_GetText($hRichEdit)
    _GUICtrlRichEdit_Destroy($hRichEdit)
    GUIDelete($hGui)
    Return StringRegExp($sFile, "(?i)(\d{3}-\d{3}-\d{4})\s*,\s*(\d{6})\s*,\s*([\w' \-]+)\s*,\s*([\w\- .]+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*([^,]+)\s*,\s*(\d{4})", 3)
EndFunc   ;==>Main

Edit: Added GUIDelete($hGui) as per UEZ's good suggestion in next post.

UEZ

That's very clever Malkey! ;) I was also searching for a way to use the _GUICtrlRichEdit_* functions. Something like _GUICtrlRichEdit_LoadFromFile()

But I would do a GUIDelete($hGui) before the Return :)

Br,

UEZ


saywell

I've tried that - it won't read in from Word-created RTFs.

If you open them in WordPad, then save, the rich text control will open them - but they lose the document properties in the process.

Word puts loads of extraneous crud throughout its files!

William

UEZ

I tried it with a dummy RTF created by Word and it worked properly.

Can you attach the RTF?

Br,

UEZ


saywell

OK - here's one of the files. Your script gets the text, as in $sContent = $oWordApp.ActiveDocument.Range.Text, but not the properties, unfortunately.

However, much of what I need is in the letter headers, too, so perhaps a bit of regex magic there might do the trick!

William

saywell

Tried it on some others, and they don't even let me read the text - eg

ÐÏࡱá

was all I got when I used ClipPut to copy $sFile to the clipboard!

William

UEZ

Where is the text like 446-431-1070, 509663, SAYWELL, William, LOS, JSB, Obstetrics, 2012 in the RTF?

Br,

UEZ


saywell

It's in the document properties. You only see it if you open the file in Notepad, and then it's almost lost towards the end in the melee of Microsoft code crud.

Sorry - I didn't make it clear in the original post.

W.

saywell

PS I've just looked at the file I posted, after I'd opened it to change some names, and found that the saved version, which is what I posted, has a different format from the original when opened in Notepad - it looks more like a 'pure' RTF file with curly brackets and keywords.

This will now open to display its text in your RTE code [though not the keywords]!

I despair of bl**dy MS Word!

William

Malkey

I cannot see the "document properties" in Word, but this worked for me.

#include <Array.au3>

Local $sREPattern = "(?i)(\d{3}-\d{3}-\d{4})\s*,\s*(\d{6})\s*,\s*([\w' \-]+)\s*,\s*([\w\- .]+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*([^,]+)\s*,\s*(\d{4})"
Local $aArray = StringRegExp(FileRead("edited_BLANK_Nellie_2012_April_26_152043.rtf"), $sREPattern, 3)
_ArrayDisplay($aArray)

czardas

I don't have Office 2010 so I can't really comment on this issue or your proposed solution.

... our IT have just upgraded to Office 2010 on machines installed in the mid-90s,

This seems rather extreme. I doubt that avoiding a hardware upgrade (any longer) is going to be realistic.

Someone once told me:

Old software for old machines

Intuitively this sounds right to me.

Word puts loads of extraneous crud throughout its files!

I agree that MS Office apps are generally pretty annoying.

A thought just occurred to me, but I would only try this as a last resort: if you open and save each file with the previous version of Word (which was working with your program), you may be able to solve your problem temporarily until you have discovered a better solution. I can't guarantee that this will work.

saywell

Thanks, everyone.

czardas - I totally agree. This is the NHS and there is no money this F/Y for upgrades!!

Unfortunately, Office 97 is now in that great trash can in the sky, so far as this organisation is concerned.

I've come to the conclusion that Word does funny things to these files.

I create them programmatically from a .dot template and save as RTF. But I have a suspicion that they are actually .doc files with an .rtf extension, hence not readable by the rich text edit control.

If I open them in the Word program, and save them again, they seem to become something like 'proper' RTFs - and are readable in RTE!

In use, I have no knowledge as to whether a given file has been previously opened and saved [eg for editing] so I need a consistent approach.

Last evening I found, from another forum post here, a free program called 'antiword'.

This reads the text from the word docs, and I have subsequently been able to parse the text for most of the variables I was looking for from the file metadata. So I've changed tack and will proceed with this method.

Thanks everyone for your help along the way.

Regards,

William

czardas

You're a good man to continue working under such constraints. Good luck.
