GimK

Organize a text file by comparing it to another

15 posts in this topic

#1 ·  Posted (edited)

This is my first post, so first of all, hello everyone!

I have already looked around for an answer to my question, but if I missed it, please redirect me :)

I am trying to organize a text file into an array by comparing it to another file. One of my files is a messy document full of text, spaces, and tabs that I got by copying a form. The other file is a list of every title in the form.

To make it clear, here is an example:

File 1 :

Name:Antony    Lastname : Kob
Age    15       height   :1.95      Hobbies  football, tennis, autoit

File 2 : 

Name
Lastname
Age
Height
Hobbies

This is of course much simpler than what I actually have, but the principle is the same. In the end, I would like an array with all the content of file 1 organised like this:

Name
Antony
Lastname
Kob
Age
15
Height
1.95
Hobbies
football
tennis
autoit

How can I do that? Thanks!

EDIT: I forgot to say that sometimes the title is composed of multiple words, like "Owned By :" for example, and the following text can be empty.

Edited by GimK


#2 ·  Posted (edited)

Bump!

I actually managed to write part of the code.

#include <MsgBoxConstants.au3>
#include <StringConstants.au3>
#include <AutoItConstants.au3>
#include <FileConstants.au3>
#include <Array.au3>
#include <File.au3>

HotKeySet("{END}", "Terminate")

Local $formTitlesPath = @ScriptDir & "\FormTitles.txt"
Local $formTitles
Local $all
Local $current


If NOT (_FileReadToArray($formTitlesPath, $formTitles)) Then
  fileMsgBox(@error, "FormTitles.txt")
  Terminate()
EndIf

Local $testFile = FileOpen(@ScriptDir & "\test.txt")
If ($testFile == -1)  Then
  MsgBox(0, "Oops, there's an error", "Can't open test file")
  Terminate()
EndIf

$all = FileRead($testFile)


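; Build an array that alternates each title with the position just past it in the text
Local $charPos
Local $finalSize = 2 * UBound($formTitles)
Local $finalArray[$finalSize]

; $formTitles[0] holds the line count, so the titles start at index 1
For $i = 1 To UBound($formTitles) - 1
    $current = $formTitles[$i]
    $finalArray[2 * $i] = $current

    ; StringInStr returns the 1-based start of the first match (0 if none);
    ; adding the title length gives the position right after the title
    $charPos = StringInStr($all, $current) + StringLen($current)
    $finalArray[(2 * $i) + 1] = $charPos
Next
_ArrayDisplay($finalArray) ; display the result once
Terminate()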


Func Terminate()
  Exit
EndFunc   ;==>Terminate

;File opening error function
Func fileMsgBox($error, $file)
  MsgBox(0, "Oops, there's an error type " & $error, "Can't open the '" & $file & "' file.")
EndFunc

This should only create a copy of the $formTitles array with a slot after each title that holds, I believe, the starting position of the text that follows that title.

However, judging by the result, the positions seem wrong, and I can't figure out how to extract the text in between.

Edited by GimK

#3 ·  Posted

Just a try

#Include <Array.au3>

$txt = "  Owned By :        Name:Antony    Lastname : Kob" & @crlf & _
    "Age    15       height   :1.95      Hobbies  football, tennis, autoit"

$ref = "Owned By|Name|Lastname|Age|Height|Hobbies"

$txt = StringReplace(StringStripWS($txt, 3), @crlf, @TAB)
$txt1 = StringRegExpReplace($txt, '(?i)(?<!^|\w)(?=' & $ref & ')|(?<=' & $ref & ')\h*:?', @crlf) 
; Msgbox(0,"1", $txt1)

$res = StringSplit($txt1, @crlf, 3)
Local $array[UBound($res)/2][2]
For $i = 0 to UBound($res)-1 step 2
    $array[$i/2][0] = $res[$i]
    $array[$i/2][1] = StringStripWS($res[$i+1], 3)
Next
_ArrayDisplay($array)

 

#4 ·  Posted

Hi! Thank you for the answer.

Sorry, I'm pretty new to AutoIt, so I don't understand everything. Could you explain roughly what you are doing? Even with the function reference for StringRegExp, I don't really understand your pattern, nor what follows it.

Thanks for your help!

 

#5 ·  Posted

Regular Expressions aren't easy to understand until you work with them on a daily basis. That's at least my impression.

#6 ·  Posted (edited)

water, your impression is sooo correct  :)

GimK,
The first String* functions are easy to understand.
Here is an explanation of the StringRegExpReplace pattern:

'(?i)(?<!^|\w)(?=' & $ref & ')|(?<=' & $ref & ')\h*:?'

(?i)    : case insensitive
(?<! )  : negative lookbehind, means 'not preceded by'
    ^|\w  : beginning of string OR a word char
(?=' & $ref & ')  : positive lookahead, means 'followed by' (by the content of the $ref variable)
|    : or (alternation)
(?<=' & $ref & ')  : positive lookbehind, means 'preceded by' (by the content of the $ref variable)
\h*:?  : 0 or more horizontal whitespace + an optional colon


$ref = "Owned By|Name|Lastname|Age|Height|Hobbies"  :
    This string contains the subpattern with the keywords alternation
    It means ("Owned By" OR "Name" OR "Lastname" ... etc )

So in plain language this regex says:

" Find
- positions not preceded by (the beginning of the string OR a word char)  ; because "name" must not match inside "Lastname"
                 and (followed by a keyword)
or
- some horizontal whitespace (or none) with a colon (or not), preceded by a keyword

And replace them with a @crlf "
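
To see the pattern at work, here is a minimal sketch on the first line of the sample from post #1. The variable names ($sText, $sKeys, $sOut) are made up for this illustration, and the "<CRLF>" marker is only there to make the inserted delimiters visible in the console.

```autoit
; Illustration only: $sText, $sKeys and $sOut are hypothetical names
Local $sText = "Name:Antony    Lastname : Kob"
Local $sKeys = "Name|Lastname"

; Same pattern as explained above, applied to a two-keyword alternation
Local $sOut = StringRegExpReplace($sText, '(?i)(?<!^|\w)(?=' & $sKeys & ')|(?<=' & $sKeys & ')\h*:?', @CRLF)

; Make each inserted @CRLF visible as the literal text <CRLF>
ConsoleWrite(StringReplace($sOut, @CRLF, "<CRLF>") & @CRLF)
```

Each keyword and each value should come out separated by a marker, with the values still carrying their surrounding spaces, which is why the loop in post #3 runs StringStripWS on them afterwards.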

 

Edited by mikell

#7 ·  Posted

Alright, thanks!

I think I understand; the rest is clear now! (Sorry for the delay, I couldn't work on it this weekend.)

water, you must be right, because this looks a little bit like Brainfuck to me at the moment ;)

#8 ·  Posted

If NOT (_FileReadToArray($formTitlesPath, $formTitles)) Then
  fileMsgBox(@error, "FormTitles.txt")
  Terminate()
EndIf

Local $testFile = FileOpen(@ScriptDir & "\test.txt")
If ($testFile == -1)  Then
  MsgBox(0, "Oops, there's an error", "Can't open test file")
  Terminate()
EndIf

$all = FileRead($testFile)
Local $fAll
Local $res

$formTitles = _ArrayToString($formTitles, "|")
$all = StringReplace(StringStripWS($all, 3), @crlf, @TAB)
$fAll = StringRegExpReplace($all, '(?i)(?<!^|\w)(?=' & $formTitles & ')|(?<=' & $formTitles & ')\h*:?', @crlf)
Msgbox(0,"1", $formTitles)
MsgBox(0,"1", $fAll)

$res = StringSplit($fAll, @crlf, 3)
_ArrayDisplay($res)

Local $array[UBound($res)/2][2]
For $i = 0 to UBound($res)-1 step 2
    $array[$i/2][0] = $res[$i]
    $array[$i/2][1] = StringStripWS($res[$i+1], 3)
Next
_ArrayDisplay($array)

Terminate()

Well, I still have an issue. The list of titles seems okay, as does the $fAll string (the equivalent of $txt1). But the $res array has only its first column filled, with all the titles and answers and no whitespace. I guess that is why I get this error:

"Array variable has incorrect number of subscripts or subscript dimension range exceeded": $array[$i/2][0] = $res[$i]
^ ERROR

I don't see where it's coming from.

 

#9 ·  Posted

Probably because you have not disabled the count return in element 0 when using _FileReadToArray

If NOT (_FileReadToArray($formTitlesPath, $formTitles, $FRTA_NOCOUNT)) Then
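
To illustrate why the count matters here, a small sketch (the scratch file name FormTitlesDemo.txt and the variable names are assumptions made for the example): with the default flag, element 0 holds the number of lines, and _ArrayToString then turns that number into one more alternative of the regex.

```autoit
#include <Array.au3>
#include <File.au3>
#include <FileConstants.au3>

; Hypothetical scratch file, used only for this illustration
Local $sPath = @TempDir & "\FormTitlesDemo.txt"
Local $hFile = FileOpen($sPath, $FO_OVERWRITE)
FileWrite($hFile, "Name" & @CRLF & "Lastname" & @CRLF & "Age")
FileClose($hFile)

Local $aWithCount, $aNoCount
_FileReadToArray($sPath, $aWithCount)              ; default flag: element 0 holds the line count
_FileReadToArray($sPath, $aNoCount, $FRTA_NOCOUNT) ; element 0 is the first line

ConsoleWrite(_ArrayToString($aWithCount, "|") & @CRLF) ; 3|Name|Lastname|Age
ConsoleWrite(_ArrayToString($aNoCount, "|") & @CRLF)   ; Name|Lastname|Age

FileDelete($sPath)
```

With the count included, a digit inside a value (the "5" in "15", for example) could be treated as a keyword, which would plausibly leave $res with an odd number of elements.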

 

#10 ·  Posted (edited)

I still have the same error.

I changed this line

$res = StringSplit($fAll, @crlf, 3)

to this

$res = StringSplit($fAll, @TAB, 3)

Now I have a readable array in $res, even though it contains a lot of blank lines, but I still get the same dimension error with $array.

I have to admit I don't understand what is happening, since this

$all = StringReplace(StringStripWS($all, 3), @crlf, @TAB)
$fAll = StringRegExpReplace($all, '(?i)(?<!^|\w)(?=' & $formTitles & ')|(?<=' & $formTitles & ')\h*:?', @crlf)

should put @crlf between each item, not @TAB, right? Unless StringRegExpReplace() isn't working correctly.

Edited by GimK

#11 ·  Posted

Hmm, regex needs accuracy.
The pattern in post #3 was intended to work on your sample text 'File 1' from post #1.
So if you are currently using a different text, could you please post an exact copy of the current content of "test.txt"?

BTW the regex uses @crlf as the delimiter for its output, so if one or more @crlf already exist in the original text, they must be removed first (which is why I replaced them with a tab).
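
The clean-up step can be checked on its own. A minimal sketch, assuming the input uses @CRLF line endings ($sRaw and $sFlat are names made up for the example):

```autoit
#include <StringConstants.au3>

; Illustration only: $sRaw and $sFlat are hypothetical names
Local $sRaw = "  Name:Antony    Lastname : Kob" & @CRLF & "Age    15  "

; Trim both ends, then turn every existing line break into a tab
Local $sFlat = StringReplace(StringStripWS($sRaw, $STR_STRIPLEADING + $STR_STRIPTRAILING), @CRLF, @TAB)

; StringInStr returns 0 when the substring is absent
ConsoleWrite("Remaining @CRLF position: " & StringInStr($sFlat, @CRLF) & @CRLF)
```

If the file used bare @LF line endings instead, this replacement would not catch them; that case is not covered in this thread.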

#12 ·  Posted

Alright, here are the text file and the titles, attached.

Sorry I didn't post them before; I thought there would be a general solution to the problem!

Thanks a lot for your time.

test.txt

FormTitles.txt

#13 ·  Posted (edited)

OMG
I dreaded something like this.
Where does this text come from? A web page? If so, there is certainly a better, easier, and more reliable way to go.


Edit
OK, the problem was in the file "FormTitles.txt", where some titles contained either special characters or typos.
Please use the one below, as is, with this code:

#include <Array.au3>
#include <File.au3>

Local $formTitlesPath = @ScriptDir & "\FormTitles.txt"
Local $formTitles
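If NOT (_FileReadToArray($formTitlesPath, $formTitles)) Then
  ; MsgBox needs a flag, a title and a text
  MsgBox(0, "Oops, there's an error type " & @error, "Can't open the 'FormTitles.txt' file.")
  Terminate()
EndIf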

Local $titles 
For $i = 1 to $formTitles[0]
   $titles &= "\Q" & $formTitles[$i] & "\E|"
Next
$titles = StringTrimRight($titles, 1)
; Msgbox(0,"1", $titles)

Local $testFile = FileOpen(@ScriptDir & "\test.txt")
If ($testFile == -1)  Then
  MsgBox(0, "Oops, there's an error", "Can't open test file")
  Terminate()
EndIf

$all = FileRead($testFile)
$all = StringReplace(StringStripWS($all, 3), @crlf, @TAB)
$fAll = StringRegExpReplace($all, '(?i)(?<!^|\w)(?=' & $titles & ')|(?<=' & $titles & ')\h*:?', @crlf)
; MsgBox(0,"1", $fAll)

$res = StringSplit($fAll, @crlf, 3)
; _ArrayDisplay($res)

Local $array[UBound($res)/2][2]
For $i = 0 to UBound($res)-1 step 2
    $array[$i/2][0] = $res[$i]
    $array[$i/2][1] = StringStripWS($res[$i+1], 3)
Next
_ArrayDisplay($array)


Func Terminate()
  Exit
EndFunc   ;==>Terminate

FormTitles.txt

Edited by mikell

#14 ·  Posted (edited)

Nope, it comes from IBM Notes, a collaboration platform. I looked for a COM interface or any other way to gather the data, but I didn't succeed.

Thanks a lot!

The FormTitles.txt you gave me looks the same as the one I had; maybe it is the wrong one? I still get the same error as before.

EDIT: Oh, my bad, I forgot to change the parameters of _FileReadToArray. This is working perfectly! Thank you so much, I don't know what I would have done without your help.

 

Edited by GimK

#15 ·  Posted

Glad I could help  :)   (© M23)

BTW FormTitles.txt looks the same but is not exactly the same.
Example: there was a missing space in "Drawing Title : (Match Drawing Title)", and since regex requires perfect accuracy, such a typo is enough to make the whole thing fail...
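
Special characters are the other half of the problem, which is what the \Q...\E quoting in post #13 takes care of. A minimal sketch, where the text in $sText is invented for the illustration (only the title comes from this thread):

```autoit
; Illustration only: $sText is a made-up sample, not taken from test.txt
Local $sTitle = "Drawing Title : (Match Drawing Title)"
Local $sText = "Drawing Title : (Match Drawing Title)   A-101"

; Unquoted, the parentheses are read as a capturing group, so the literal "(" in the text is never matched
ConsoleWrite(StringRegExp($sText, $sTitle) & @CRLF)               ; 0 = no match

; Between \Q and \E every character is taken literally
ConsoleWrite(StringRegExp($sText, "\Q" & $sTitle & "\E") & @CRLF) ; 1 = match
```

This assumes no title contains the sequence \E itself, since that would end the quoting early.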
