PnD

String Search with Multiple Scenarios


Dear all

I currently have a text file like this:

Description
REMOTE
Order No :         1028263
Date Range:  04/19/2021
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Appointment

 

What I am trying to do is extract only "1028263", which I can do easily with the script below:

 

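#include <MsgBoxConstants.au3>
#include <File.au3>
#include <Array.au3>

Global $file = @ScriptDir & "\test.txt"
; Join all lines with commas so the regex can stop at the next line break
Global $sList = StringReplace(FileRead($file), @CRLF, ",")

; Flag 3 returns an array of all captures: the text after the label, up to the next comma
Global $ITIorder = StringRegExp($sList, "ITI Order No(.*?),", 3)
; The order number is the last 7 characters of the first capture
Global $RealITIOrder = StringRight($ITIorder[0], 7)

MsgBox(0, 0, $RealITIOrder)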

However, sometimes our text has a different layout, such as:

Description
REMOTE
Order No :
1028263
Date Range:  04/19/2021
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Appointment

in which the "1028263" moves to the next line. I am stuck on how to incorporate a new condition into the script above to handle this layout.

I would very much appreciate your feedback.

Thank you all.

 

 


Surely the use of regexp is more elegant; however, some time ago I created two functions, based on the string functions, that find the word that follows a known word, or the word that precedes it. If you are interested, you can find them here: https://www.autoitscript.com/forum/topic/155726-searching-for-a-string-after-the-string/?do=findComment&comment=1174545

 


small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

8 hours ago, jguinch said:

or just Order No\D*(\d+)

Thank you jguinch, I tried your suggestion and it works perfectly as:

$ITIorder = StringRegExp(FileRead($File), "(?s)Order No\D*(\d+)", 1)[0]
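A minimal sketch of how that one-liner could be wrapped so that it does not fail when nothing matches (indexing `[0]` directly on a failed `StringRegExp` call would fail). The function name `_GetOrderNo` is made up for this example; the pattern is jguinch's.

```autoit
; Hypothetical helper (not from the thread): returns the digits that follow
; "Order No", whether they sit on the same line or on the next one.
Func _GetOrderNo($sText)
    ; \D* skips everything that is not a digit (spaces, ":" and line breaks)
    Local $aMatch = StringRegExp($sText, "Order No\D*(\d+)", 1)
    If @error Then Return SetError(1, 0, "") ; no match: return an empty string
    Return $aMatch[0]
EndFunc   ;==>_GetOrderNo

; Both layouts from the first post
ConsoleWrite(_GetOrderNo("Order No :         1028263") & @CRLF)
ConsoleWrite(_GetOrderNo("Order No :" & @CRLF & "1028263") & @CRLF)
```

Both calls print 1028263. The caller can check `@error` or test for an empty string to detect a missing order number.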

 

8 hours ago, Chimp said:

Surely the use of regexp is more elegant; however, some time ago I created two functions, based on the string functions, that find the word that follows a known word, or the word that precedes it. If you are interested, you can find them here: https://www.autoitscript.com/forum/topic/155726-searching-for-a-string-after-the-string/?do=findComment&comment=1174545

 

Thank you Chimp! I will definitely try out your suggestion, as it may be helpful for other scenarios where regexp is not needed!

26 minutes ago, PnD said:

Thank you JockoDundee!

I tried your suggestion but it gave me the error message "_StringBetween(): undefined function".

#include <String.au3>
$ITIOrder=_StringBetween(StringStripWS(FileRead($sFile),8), "OrderNo:", "DateRange:")

PnD, please try it now; I left out the include...


Code hard, but don’t hard code...


@PnD :

_StringBetween returns an array (not a string), so the result should be presented e.g. as follows :

#include <String.au3>
$sFile= @ScriptDir & "\test.txt"
$aArr =_StringBetween(StringStripWS(FileRead($sFile),8), "OrderNo:", "DateRange:")
If Not @error Then
    $ITIOrder = $aArr[0]
    MsgBox(0, "Order No. :", $ITIOrder)
Else
    MsgBox(0, "Order No. :", "no match found")
EndIf

 



"In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move."

2 minutes ago, Musashi said:

@PnD :

_StringBetween returns an array (not a string), so the result should be presented e.g. as follows :

 

I was just mirroring @PnD's own code, which returns an array:

Global $ITIorder = StringRegExp($sList, "ITI Order No(.*?),",3)

@Nine actually started the trend of returning the first match.



18 minutes ago, Musashi said:

@PnD :

_StringBetween returns an array (not a string), so the result should be presented e.g. as follows :

#include <String.au3>
$sFile= @ScriptDir & "\test.txt"
$aArr =_StringBetween(StringStripWS(FileRead($sFile),8), "OrderNo:", "DateRange:")
If Not @error Then
    $ITIOrder = $aArr[0]
    MsgBox(0, "Order No. :", $ITIOrder)
Else
    MsgBox(0, "Order No. :", "no match found")
EndIf

 

Thank you both, Musashi and JockoDundee, for following up on this thread. I tried your code and it worked perfectly as well. The only disadvantage of this method is we have to rely on OrderNo and DateRange for the script to work. If DateRange is changed to !!!!! or something else, then it will not work.

In my opinion, the solutions from Nine and jguinch work best regardless of scenario.
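One way the dependence on both labels might be softened is to try the two-label lookup first and fall back to the regex when the closing label is missing. This is only a sketch: the function name `_OrderNoBetweenOrRegex` is invented here, and the label strings are the ones used earlier in this thread.

```autoit
#include <String.au3>

; Hypothetical helper (not from the thread): two-label lookup with a regex fallback.
Func _OrderNoBetweenOrRegex($sText)
    ; First attempt: strip all whitespace (flag 8), then take the text between the labels
    Local $aBetween = _StringBetween(StringStripWS($sText, 8), "OrderNo:", "DateRange:")
    If Not @error Then
        ; Accept the result only if it consists of digits
        If StringIsDigit($aBetween[0]) Then Return $aBetween[0]
    EndIf
    ; Fallback: first run of digits after the "Order No" label
    Local $aMatch = StringRegExp($sText, "Order No\D*(\d+)", 1)
    If @error Then Return SetError(1, 0, "") ; neither approach found a number
    Return $aMatch[0]
EndFunc   ;==>_OrderNoBetweenOrRegex

; "Date Range:" has been replaced by OCR noise, so the fallback is used
ConsoleWrite(_OrderNoBetweenOrRegex("Order No : 1028263" & @CRLF & "!!!!!") & @CRLF)
```

The digit check is a small sanity filter. Without it, a damaged closing label further down the page could make the first attempt return a long stretch of unrelated text.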

13 hours ago, Deye said:

Get the first explicit row with numbers, don't have to mention any respective previous row.

$ION = StringRegExpReplace($s, "[^\w\d].*|\D{0,}", "")

 

Hi Deye

Thank you for your solution. I tried it and it worked great! However, it only works in this particular scenario, where none of the lines before "Order No :" contains a number.

If, for example, the text contains "Remote 15", then 15 will be captured instead of 1028263.

I still think the solutions from Nine and jguinch are perfect for all scenarios.


One more thing worth knowing: if the sequence of digits you are looking for is at least 6 digits long, and you know for a fact that it is always longer than any other digit sequence elsewhere in the data, then it could also be done this way:

Local $s = 'Description' _
         & @LF & 'REMOTE: 12' _
         & @LF & 'Order No :         1028263' _
         & @LF & 'Date Range:  04/19/2021'

Local $a = StringRegExp($s, "\d{6,}", 3)
MsgBox(0, "", (IsArray($a) ? $a[0] : "None Found"))
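If the exact minimum length is not known, the same assumption (the wanted number is the longest run of digits in the data) could be applied directly. A sketch, with `_LongestDigitRun` being a name invented for this example:

```autoit
; Hypothetical helper (not from the thread): returns the longest run of digits in $sText.
Func _LongestDigitRun($sText)
    ; Flag 3 collects every run of digits into an array
    Local $aRuns = StringRegExp($sText, "\d+", 3)
    If @error Then Return SetError(1, 0, "") ; no digits at all
    Local $sBest = ""
    For $i = 0 To UBound($aRuns) - 1
        ; Keep the first run that is strictly longer than the best one so far
        If StringLen($aRuns[$i]) > StringLen($sBest) Then $sBest = $aRuns[$i]
    Next
    Return $sBest
EndFunc   ;==>_LongestDigitRun

Local $sSample = 'Description' _
         & @LF & 'REMOTE: 12' _
         & @LF & 'Order No :         1028263' _
         & @LF & 'Date Range:  04/19/2021'
ConsoleWrite(_LongestDigitRun($sSample) & @CRLF)
```

With the sample above this prints 1028263, because the other runs (12, 04, 19, 2021) are shorter. It shares the weakness of the \d{6,} approach: any longer number elsewhere in the text, such as a phone number, would win.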

 

13 hours ago, Deye said:

One more thing worth knowing: if the sequence of digits you are looking for is at least 6 digits long, and you know for a fact that it is always longer than any other digit sequence elsewhere in the data, then it could also be done this way:

Local $s = 'Description' _
         & @LF & 'REMOTE: 12' _
         & @LF & 'Order No :         1028263' _
         & @LF & 'Date Range:  04/19/2021'

Local $a = StringRegExp($s, "\d{6,}", 3)
MsgBox(0, "", (IsArray($a) ? $a[0] : "None Found"))

 

Thank you Deye for this quick solution! This also works great and the \d{6,} is fantastic!


@PnD, one thing that I was going to point out earlier when you commented about my code:

On 5/1/2021 at 11:54 PM, PnD said:

The only disadvantage of this method is we have to rely on OrderNo and DateRange 

was that, IMHO, this dependence may actually be an advantage, depending on factors that only you know.

However, looking at the data-set you provided:

Description

REMOTE
Order No :
1028263
Date Range:  04/19/2021
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Appointment

it appears to be some sort of order/invoice/service memo, possibly with multiple free-text fields, e.g. “Description”. That allows for the possibility of a part #, a P.O. number, a phone number without punctuation, or a previous order # cut and pasted from another record, complete with the words “Order No:”.

So relying on just digits, or even on the single token “Order No:”, might lead to misinterpretation.

Moreover, this entered text could be hard to predict, if free-entry is allowed.

On the other hand, if it can be confirmed that DateRange does indeed follow OrderNo, this would not change, barring a modification of the program or template that creates it.

IMO, if this is not a one-off report, but something that runs periodically, you may want to use even more filters and sanitizers.
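As one example of such an extra filter, the pattern could refuse to skip over arbitrary text between the label and the number, and could insist on a plausible length. This is a sketch only: the 6 to 8 digit range and the function name `_GetOrderNoStrict` are assumptions made for illustration, not facts about the real data.

```autoit
; Hypothetical stricter variant (not from the thread).
; Only whitespace and an optional ":" may sit between the label and the number,
; and the number must be 6 to 8 digits long (an assumed range).
Func _GetOrderNoStrict($sText)
    Local $aMatch = StringRegExp($sText, "Order No\s*:?\s*(\d{6,8})\b", 1)
    If @error Then Return SetError(1, 0, "") ; nothing that looks like an order number
    Return $aMatch[0]
EndFunc   ;==>_GetOrderNoStrict

; Accepted: number on the next line
ConsoleWrite("[" & _GetOrderNoStrict("Order No :" & @CRLF & "1028263") & "]" & @CRLF)
; Rejected: free text between the label and the digits
ConsoleWrite("[" & _GetOrderNoStrict("Order No: see note 1234567") & "]" & @CRLF)
```

The first call prints [1028263] and the second prints [], whereas the looser `Order No\D*(\d+)` pattern would have returned 1234567 for the second input. Whether that strictness helps or hurts depends on how noisy the OCR output is around the label.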

 

 



On 5/3/2021 at 8:24 PM, JockoDundee said:

@PnD, one thing that I was going to point out earlier when you commented about my code:

was that, IMHO, this dependence may actually be an advantage, depending on factors that only you know.

However, looking at the data-set you provided:

Description

REMOTE
Order No :
1028263
Date Range:  04/19/2021
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Appointment

it appears to be some sort of order/invoice/service memo, possibly with multiple free-text fields, e.g. “Description”. That allows for the possibility of a part #, a P.O. number, a phone number without punctuation, or a previous order # cut and pasted from another record, complete with the words “Order No:”.

So relying on just digits, or even on the single token “Order No:”, might lead to misinterpretation.

Moreover, this entered text could be hard to predict, if free-entry is allowed.

On the other hand, if it can be confirmed that DateRange does indeed follow OrderNo, this would not change, barring a modification of the program or template that creates it.

IMO, if this is not a one-off report, but something that runs periodically, you may want to use even more filters and sanitizers.

 

 

Hi @JockoDundee, you are absolutely correct, and you are a very good observer with very good logical thinking. (I wonder how you ended up here instead of being a detective 😄.) The data I provided is just a small sample of the real data, which is too sensitive to post here publicly.

It is actually from an Ntag that holds all kinds of information about a product. I extracted the text from the scanned PDF document using OCR software, which is why you can see !!!!!!!!!!!! and many more weird characters, most of which I tried to leave out of my text sample.

One thing is certain: the "Order No:" label is fixed, and the 6 digit number can be either on the same line or on the second line, depending on how the OCR converts the image to text. The rest of the data changes dynamically.

Based on my real data, and after a lot of testing of the solutions that all you geniuses provided, StringRegExp is still the best solution to get the result.

However, that does not mean your stringbetween solution is not great. I actually used it for another project and it works beautifully as well.

Programming is not my expertise; I only started learning to write code in the last couple of weeks. AutoIt accidentally opened the gateway for me to explore, and I am learning a lot through your generosity in helping me and others in this forum.

Once again, thank you all very much.

Another day and another learning. Life is beautiful.

 

15 minutes ago, PnD said:

Based on my real data, and after a lot of testing of the solutions that all you geniuses provided, StringRegExp is still the best solution to get the result. However, that does not mean your stringbetween solution is not great...

Just to be clear, I'm not necessarily advocating for stringbetween; I only used it to mix things up :) Had I been a first responder, I might have regex'ed myself.

Rather, I'm speaking about the business logic, in that, IF accuracy is paramount, then maybe you can find another filter, for instance a line-number range, etc.

But since you're ok with it, then happy data mining!
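For what it's worth, the line-number filter mentioned above might look something like the sketch below. The function name `_FindOrderNoInLines` and the idea that the order number always falls inside a known window of lines are assumptions for illustration only.

```autoit
; Hypothetical helper (not from the thread): search for the order number only
; within lines $iFirst..$iLast (1-based) of $sText.
Func _FindOrderNoInLines($sText, $iFirst, $iLast)
    ; Drop carriage returns, then split on line feeds; $aLines[0] holds the line count
    Local $aLines = StringSplit(StringStripCR($sText), @LF)
    If $iFirst < 1 Then $iFirst = 1
    If $iLast > $aLines[0] Then $iLast = $aLines[0]
    ; Rebuild just the window of lines we trust
    Local $sWindow = ""
    For $i = $iFirst To $iLast
        $sWindow &= $aLines[$i] & @LF
    Next
    Local $aMatch = StringRegExp($sWindow, "Order No\D*(\d+)", 1)
    If @error Then Return SetError(1, 0, "") ; no order number inside the window
    Return $aMatch[0]
EndFunc   ;==>_FindOrderNoInLines

; A stray "Order No:" pasted into a free-text field on line 1 is ignored
Local $sText = "Description Order No: 111111" & @CRLF & "REMOTE" & @CRLF & "Order No :" & @CRLF & "1028263"
ConsoleWrite(_FindOrderNoInLines($sText, 3, 4) & @CRLF)
```

This prints 1028263, because only lines 3 and 4 are searched. With OCR output the line positions may drift, so a window like this would need to be generous enough to cover that.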


