saywell Posted May 31, 2012

Hi all. I'm hopeless at regexes and I don't use them enough to get skilled. Perhaps when I retire [less than a year now!!] I can find some time to learn systematically. But meanwhile this is just a cop-out plea for help, sorry.

I'm trying to find a way to get a set of document properties out of RTF documents created in MS Word. I can do it using the Word UDF, but our IT have just upgraded to Office 2010 on machines installed in the mid-90s, so it doesn't run at all fast [especially with network delays included, as the files are on a server] and repetitive operations keep throwing up errors.

The particular property is set from within my program in a structured form, so it should be regex-able. If I open the file with Notepad I can see it amongst the garbage after the text. This is an example:

446-431-1070, 509663, SAYWELL, William, LOS, JSB, Obstetrics, 2012

The first group is the NHS number [this one altered to anonymise], which is in the format ddd-ddd-dddd [though a few patients don't have NHS numbers and it might be entered as 000-000-0000 or "unavailable"].
The second is the hospital number, always 6 digits.
The third is the surname, so letters, but it may contain a hyphen, apostrophe or white space. Not case-specific.
The fourth is the first name: letters.
The fifth and sixth are the secretary and author codes. Upper case; usually 3 letters, but may be more than 3.
The seventh is the specialty name, which is free text and may include spaces and characters like ampersand, e.g. "Obs and Gynae"; "Trauma & Orthopaedics".
The last is the year of creation.
Each is separated by a comma.

If any of you RegEx gurus can come up with something that matches that lot, I'd be most grateful!!

Regards, William
UEZ Posted May 31, 2012

Try this:

    #include <Array.au3>

    $sString = "446-431-1070, 509663, SAYWELL, William, LOS, JSB, Obstetrics, 2012"
    ; One capture group per field; flag 3 returns all captured groups as an array.
    $aTokens = StringRegExp($sString, "(\d{3}-\d{3}-\d{4})\s*,\s*(\d{6})\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*(\d{4})", 3)
    _ArrayDisplay($aTokens)

The regex can be shortened, but this form is easier to read.

Br, UEZ
jchd Posted May 31, 2012

OTOH you could simply try StringSplit if the field structure is consistent across entries. Just make sure you specify the correct separator (comma or comma + blank) and the corresponding option (see help file).
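The StringSplit idea can be sketched roughly as follows. This is only an illustration using the sample string from the first post; it assumes the property string has already been isolated and that no field contains a comma itself (the free-text specialty field could break that assumption).

```autoit
#include <Array.au3>

; Sample property string taken from the first post in this thread.
Local $sProps = "446-431-1070, 509663, SAYWELL, William, LOS, JSB, Obstetrics, 2012"

; Split on commas. Flag 2 returns a 0-based array without the count in element 0.
Local $aFields = StringSplit($sProps, ",", 2)

; Trim leading and trailing white space from every field (flag 3 = leading + trailing).
For $i = 0 To UBound($aFields) - 1
    $aFields[$i] = StringStripWS($aFields[$i], 3)
Next

_ArrayDisplay($aFields)
```

Splitting on the comma alone and trimming afterwards avoids having to decide between "comma" and "comma + blank" as the separator.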
saywell Posted May 31, 2012

Thanks. Unfortunately the regex didn't work on a couple of 'real world' examples, which I got by opening the Word RTF document in Notepad and copying the whole thing.

StringSplit will be the next step, once the string to split has been isolated by the regex. At present, it nestles amongst a load of other apparently random characters, some of which Notepad can only render as a little square.

William
UEZ Posted May 31, 2012

Can you post a real example so we can look for a better solution? Otherwise it is hard to find something that will fit your needs.

Br, UEZ
saywell Posted May 31, 2012 (edited)

Here is a snippet from a Word-created RTF opened in Notepad, suitably anonymised:

Ô à $ 0 8 @ H ä 4 S:ClinicalDocumentsReal_Patient_Data08013308 < Letter from Clinic: FRACTURE CLINIC dated: 17 May 2012 Samantha Jones X 123-456-7890, 123456, VEGAS, Jonny, JRG, HJ, Department of Trauma & Orthopaedics, 2012 Clinic_letter.dot IT Services 3 Microsoft Office Word @ ´V @ Zzö5Í@ v Õã4Í@ Zzö5Í û

__________________________________________________________________________________________________

and another:

I T S e r v i c e s þÿ à…ÿòùOh«‘ +'³ù0 0 ˜ Ü , | ˆ ´ à à ì ø {µùJì º n u´ tV» ùJ½ `¿ ÔÇ hÉ 5Ê ½Ê ± Ë nÐ vDÚ 7kÝ z=â Ïqã bsä Eè ÚyÉ )Ð yAö Ô(÷ zyù pû ÿ , - 4 5 > ? H I c d ~ ‘ ’ Ë ì q Ž Ž Ž Ž Ž Ž Ž Ž † ÿ@€ Ç Ç ¨˜³ Ç Ç p ` @ ÿÿ U n k n o w n ÿÿ ÿÿ ÿÿ ÿÿ ÿÿ ÿÿ G †z € ÿ T i m e s N e w R o m a n 5 € S y m b o l 3& †z € ÿ A r i a l 5& †z a € ÿ T a h o m a " qˆ ÐÐ h ÊË÷&ÏË÷&ÏË÷& º n 6 ƒW º n 6 W ! Ð ŠŠx £ ‚€24 d ò ò 3ƒ Ðßß HX )Ðÿ ? ä ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿîH, 2 ÿÿ U : C l i n _ D o c s C l i n i c _ l e t t e r . d o t 1 S : C l i n i c a l D o c u m e n t s R e a l _ P a t i e n t _ D a t a 0 8 0 2 1 8 0 8 0 L e t t e r f r o m C l i n i c : J A G / J D d a t e d : 2 5 J u l y 2 0 1 1 G 4 8 2 - 4 I T S e r v i c e s þÿ à…ÿòùOh«‘ +'³ù0 0 ˜ Ü , | ˆ ´ à à ì ø ( ä 4 S:ClinicalDocumentsReal_Patient_Data08021808 4 Letter from Clinic: JAG/JD dated: 25 July 2011 Mary Moneypenny H 433-999-4466, 221333, Parker-Bowles, Camilla, JAG, HS, OBS & GYNAE, 2011 Clinic_letter IT Services 3 Microsoft Office Word @ ^в @ {µùJì@ ¬ªùJì@ {µùJì º n

Meanwhile, I'm trying to find another approach to the problem, not just because of this, but because even when automated, the Notepad workaround is kludgy, and FileRead doesn't work [only in binary mode, which I don't know how to deal with thereafter].

William

PS To generate more examples, create an RTF from Word, add some properties and a bit of dummy text, and save it.
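On the point about not knowing what to do with a binary read: one possible route, sketched below under several assumptions, is to read the raw bytes, decode them as ANSI text with BinaryToString, and run the regex over the result. The file name is a placeholder, the pattern is the one Malkey posts further down the thread, and the sketch assumes the property string is stored somewhere in the file as single-byte text (as it appears to be in the snippets above). How the embedded null bytes in such files interact with the string functions has not been checked here, so treat this as a starting point rather than a tested solution.

```autoit
#include <Array.au3>

; Placeholder path: replace with a real file.
Local $sPath = "example.rtf"

; Mode 16 opens the file for binary (byte) reading.
Local $hFile = FileOpen($sPath, 16)
If $hFile = -1 Then
    ConsoleWrite("Could not open " & $sPath & @CRLF)
    Exit 1
EndIf
Local $bData = FileRead($hFile)
FileClose($hFile)

; Flag 1 decodes the bytes as ANSI, one character per byte.
Local $sRaw = BinaryToString($bData, 1)

; Pattern for the eight comma-separated fields, as posted by Malkey in this thread.
Local $sPattern = "(?i)(\d{3}-\d{3}-\d{4})\s*,\s*(\d{6})\s*,\s*([\w' \-]+)\s*,\s*([\w\- .]+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*([^,]+)\s*,\s*(\d{4})"

; Flag 1 returns the captured groups of the first match only.
Local $aFields = StringRegExp($sRaw, $sPattern, 1)
If @error Then
    ConsoleWrite("No property string found" & @CRLF)
Else
    _ArrayDisplay($aFields)
EndIf
```

Note that the pattern does not cover the "unavailable" NHS number case mentioned in the first post.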
Edited May 31, 2012 by saywell
Malkey Posted May 31, 2012 (edited)

Maybe something like this.

    #include <GuiRichEdit.au3>
    #include <Array.au3>

    Local $aArray1 = Main("RTF_FullPath_FileName.rtf")
    _ArrayDisplay($aArray1)

    Func Main($sFileName)
        Local $hGui, $hRichEdit, $sFile
        $hGui = GUICreate("")
        ; Load the RTF into a rich edit control (the GUI is never shown), then read back the plain text.
        $hRichEdit = _GUICtrlRichEdit_Create($hGui, FileRead($sFileName), 10, 10)
        $sFile = _GUICtrlRichEdit_GetText($hRichEdit)
        _GUICtrlRichEdit_Destroy($hRichEdit)
        GUIDelete($hGui)
        ; Flag 3 returns all captured groups as an array.
        Return StringRegExp($sFile, "(?i)(\d{3}-\d{3}-\d{4})\s*,\s*(\d{6})\s*,\s*([\w' \-]+)\s*,\s*([\w\- .]+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*([^,]+)\s*,\s*(\d{4})", 3)
    EndFunc   ;==>Main

Edit: Added GUIDelete($hGui) as per UEZ's good suggestion in the next post.

Edited May 31, 2012 by Malkey
UEZ Posted May 31, 2012

That's very clever, Malkey! I was also searching for a way to use the _GUICtrlRichEdit_* functions, something like _GUICtrlRichEdit_LoadFromFile().

But I would do a GUIDelete($hGui) before the Return.

Br, UEZ
saywell Posted May 31, 2012

I've tried that - it won't read in from Word-created RTFs. If you open them in WordPad and then save, the rich edit control will open them - but they lose the document properties in the process.

Word puts loads of extra-aneous crud throughout its files!

William
UEZ Posted May 31, 2012

I tried it with a dummy RTF created by Word and it worked properly. Can you attach the RTF?

Br, UEZ
saywell Posted May 31, 2012

OK - here's one of the files.

Your script gets the text, as in $sContent = $oWordApp.Activedocument.Range.Text, but not the properties, unfortunately.

However, much of what I need is in the letter headers, too, so perhaps a bit of regex magic there might do the trick!

William
saywell Posted May 31, 2012

Tried it on some others, and they don't even let me read the text - e.g. ÐÏࡱá was all I got when I ClipPut $sFile to the clipboard!

William
UEZ Posted May 31, 2012

Where in the RTF is the text like 446-431-1070, 509663, SAYWELL, William, LOS, JSB, Obstetrics, 2012?

Br, UEZ
saywell Posted May 31, 2012

It's in the document properties. You only see it if you open the file in Notepad, and then it's almost lost towards the end in the melee of Microsoft code crud.

Sorry - I didn't make it clear in the original post.

W.
saywell Posted May 31, 2012

PS I've just looked at the file I posted, after I'd opened it to change some names, and found that the saved version (which is what I posted) has a different format from the original when opened in Notepad - it looks more like a 'pure' RTF file with curly brackets and keywords. This will now open to display its text in your RTE code [though not the keywords!].

I despair of bl**dy MS Word!

William
Malkey Posted May 31, 2012

I cannot see the "document properties" in Word, but this worked for me.

    #include <Array.au3>

    Local $sREPattern = "(?i)(\d{3}-\d{3}-\d{4})\s*,\s*(\d{6})\s*,\s*([\w' \-]+)\s*,\s*([\w\- .]+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*([^,]+)\s*,\s*(\d{4})"
    Local $aArray = StringRegExp(FileRead("edited_BLANK_Nellie_2012_April_26_152043.rtf"), $sREPattern, 3)
    _ArrayDisplay($aArray)
czardas Posted May 31, 2012 (edited)

I don't have Office 2010 so I can't really comment on this issue or your proposed solution.

"... our IT have just upgraded to Office 2010 on machines installed in the mid-90s,"

This seems rather extreme. I doubt that avoiding a hardware upgrade (any longer) is going to be realistic. Someone once told me: "Old software for old machines." Intuitively this sounds right to me.

"Word puts loads of extra-aneous crud throughout its files!"

I agree that MS Office apps are generally pretty annoying.

A thought just occurred to me, but I would only try this as a last resort: if you open and save each file with the previous version of Word (which was working with your program), you may be able to solve your problem temporarily until you have discovered a better solution. I can't guarantee that this will work.

Edited May 31, 2012 by czardas
saywell Posted June 1, 2012

Thanks, everyone.

czardas - I totally agree. This is the NHS and there is no money this F/Y for upgrades!! Unfortunately, Office 97 is now in that great trash can in the sky, so far as this organisation is concerned.

I've come to the conclusion that Word does funny things to these files. I create them programmatically from a .dot template and save as RTF. But I have a suspicion that they are actually .doc files with an .rtf extension, and hence not readable by the rich text edit control. If I open them in Word itself and save them again, they seem to become something like 'proper' RTFs - and are readable in RTE! In use, I have no way of knowing whether a given file has previously been opened and saved [e.g. for editing], so I need a consistent approach.

Last evening I found, from another forum post here, a free program called 'antiword'. This reads the text from the Word docs, and I have subsequently been able to parse the text for most of the variables I was looking for from the file metadata.

So I've changed tack and will proceed with this method.

Thanks everyone for your help along the way.

Regards, William
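The suspicion that some of these files are really .doc files with an .rtf extension can be checked from the first few bytes of the file. Binary Word documents are OLE compound files, which start with the bytes D0 CF 11 E0 A1 B1 1A E1 (the "ÐÏࡱá" seen on the clipboard earlier in the thread is how those bytes display as text), whereas genuine RTF starts with the literal text {\rtf. A minimal sketch follows; the function name _GuessFileType and the file name are made up for illustration.

```autoit
; Hypothetical helper: reports whether a file looks like an OLE compound file
; (binary .doc), genuine RTF, or something else, based on its first bytes.
Func _GuessFileType($sPath)
    Local $hFile = FileOpen($sPath, 16) ; 16 = binary mode
    If $hFile = -1 Then Return SetError(1, 0, "")
    Local $bHead = FileRead($hFile, 8) ; read the first 8 bytes
    FileClose($hFile)

    ; OLE compound files start with D0 CF 11 E0 A1 B1 1A E1.
    If Hex($bHead) = "D0CF11E0A1B11AE1" Then Return "doc"

    ; Genuine RTF starts with the literal text {\rtf
    If BinaryToString(BinaryMid($bHead, 1, 5)) == "{\rtf" Then Return "rtf"

    Return "unknown"
EndFunc   ;==>_GuessFileType

; Placeholder file name: replace with a real path.
ConsoleWrite(_GuessFileType("example.rtf") & @CRLF)
```

A check like this would let a script choose between the rich edit route (for real RTF) and an external converter such as antiword (for disguised .doc files) without knowing whether a file has been re-saved.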
czardas Posted June 1, 2012

You're a good man to continue working under such constraints. Good luck.