Return Position of RegEx in String


Hello,
I am reading a text file line by line and trying to fix errors in it.
How do I find the position of a RegEx search match?

Input File :
659855424638 Michelle Heidt 978-240-0653 214-585-8297 michellemheidt@gustr.com Maxillofacial radiologist "Michelle Heidt, 2095 Pearlman Avenue, Franklin, Massachusetts, United States, 2038"
659855424639 Emilee Akins 904-724-3260 502-463-3665 emileerakins@armyspy.com Forest and conservation worker "Emilee Akins, 2054 Boundary Street, Jacksonville, Florida, United States, 32211"
659855424640 Lori Girouard 512-963-1160 413-772-3313 lorilgirouard@teleworm.us Agricultural and food science technician "Lori Girouard, 4603 Short Street, Austin, Texas, United States, 78741"
659855424628 Samantha Richardson 407-856-8677 973-447-6977 samanthatrichardson@example.com Budget officer "Samantha Richardson, 4599 McDonald Avenue, McDonald, McDonald, United States, 12345"

 

 

1. CAPITALsmallsmallCAPITALsmallsmall - first occurrence of this pattern: insert a space before the second capital.
Error example:
659855424628 SamanthaRichardson 407-856-8677 973-447-6977 samanthatrichardson@example.com Budget officer "Samantha Richardson, 4599 McDonald Avenue, McDonald, McDonald, United States, 12345"

2. Alphabetnumbernumber - a letter directly followed by a number: insert a space before the number.
Error example:
659855424628 Samantha Richardson407-856-8677 973-447-6977 samanthatrichardson@example.com Budget officer "Samantha Richardson, 4599 McDonald Avenue, McDonald, McDonald, United States, 12345"

3. Quote touching a letter (note: this is the second quote from the right): insert a space before the quote.
Error example:
659855424628 Samantha Richardson 407-856-8677 973-447-6977 samanthatrichardson@example.com Budget officer"Samantha Richardson, 4599 McDonald Avenue, McDonald, McDonald, United States, 12345"

4. Email too long: insert a space after the .com
Error example:
659855424628 Samantha Richardson 407-856-8677 973-447-6977 samanthatrichardson@example.comBudget officer "Samantha Richardson, 4599 McDonald Avenue, McDonald, McDonald, United States, 12345"

 

Can you please help with how to search with a regex pattern and get the position of the match?
StringRegExp - only returns true or false
StringInStr - requires a specific string and cannot search for a regex

Also, some guidance on how to read a line and overwrite it while keeping its position in the file would help,
along with a brief snippet on how to read the entire file.
Maintaining a 2D array would help.

 


$segments = StringSplit($theLine, " ")
; Examples of RegEx
If StringRegExp($segments[2], "[0-9]") Then
If StringRegExp($segments[2], "([A-Z])\w+([A-Z])\w+") Then
If StringRegExp($segments[4], "\d{3}-\d{3}-\d{4}") Then
If StringRegExp($segments[6],'^[\_]*([a-z0-9]+(\.|\_*)?)+@([a-z][a-z0-9\-]+(\.|\-*\.))+[a-z]{2,6}$',0) = 1 Then

 


I'm not sure I understand... maybe this will work:

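Local $string = "abcdefghijklmnopqrstuvmxyz"

MsgBox(0, "", _StringPos($string, "z") )


Func _StringPos($sString, $sSearch)
    ; Capture everything before the first occurrence of the literal search string
    Local $aReplace = StringRegExp($sString, "(?s)(.*?)\Q" & $sSearch & "\E", 1)
    If @error Then Return 0
    ; 1-based position = length of the captured prefix + 1
    Return StringLen($aReplace[0]) + 1
EndFunc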

 


StringRegExp can return more than true or false. Check the optional Flag.

https://www.autoitscript.com/autoit3/docs/functions/StringRegExp.htm

If you look at the 4th parameter, "offset", it sets the character position in the string where matching starts. You could read the file line by line in a For loop:

  • get the lines in the file
  • for each line, run the match
  • if it matches, print the line or the position

Just an idea. There may be a more elegant way to do it: check reply #2 from jguinch. I haven't used this function in a while.


Hello mrflibblehat, thanks for the reply.

jguinch - I am trying to understand your function. Can you please explain the regular expression and what is being returned?
Thanks for the assistance. I am a newbie.

 

I am not very good with RegEx.
Can you please advise on these cases:

 

; $theLine holds the current line
$segments = StringSplit($theLine, " ")

 

1. SamanthaRichardson - find this pattern, i.e. two capitals in a single word.
Insert a space before the second capital letter.

2. dson407 - a word followed by numbers without a space.
Insert a space before the number.

3. cer"S - no space before the double quote.
Insert a space before the double quote.

4. ple.comBu - letters after .com
Insert a space after .com

 

Please suggest how I can modify the segments and also the entire line.

 

 


(?s)     Single-line or DotAll: . matches anything including a newline sequence
.        Matches any single character except, by default, a newline sequence. Matches newlines as well when option (?s) is active.
*?       0 or more, lazy (takes the smallest match)
\Q...\E  Verbatim sequence: metacharacters lose their special meaning between \Q and \E (useful if you have special characters in your search string)

 


4 hours ago, adityaparakh said:

How do I find the position of RegEx search match

Use @extended

Local $string = "abcdefghijklmnopqrstuvmxyz"

MsgBox(0, "", _StringPos($string, "d") )

Func _StringPos($sString, $sSearch)
    StringRegExp($sString, "\Q" & $sSearch & "\E", 1)
    ; @extended is the offset just past the match, so subtract the search length
    Return @error ? 0 : @extended - StringLen($sSearch)
EndFunc

 


I didn't remember this... (it's in the 1st example on the help page)  :)

Edit: instead of using StringLen:

Local $string = "abcdefghijklmnopqrstuvmxyz"

MsgBox(0, "", _StringPos($string, "a") )

Func _StringPos($sString, $sSearch)
    ; The look-ahead matches zero characters, so @extended is the position of the match itself
    StringRegExp($sString, "(?=\Q" & $sSearch & "\E)", 1)
    Return @error ? 0 : @extended
EndFunc

^_^
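
The functions above search for a literal string (the \Q...\E part). For the original question, the position of a real regex match, the same @extended idea might be extended roughly like this. This is only a sketch: _RegExPos is a made-up name, and it assumes @extended holds the offset just past the match, as in the posts above.

```autoit
; Hypothetical helper (not a built-in): 1-based position of the first
; match of a regex pattern in $sString, or 0 if there is no match.
Func _RegExPos($sString, $sPattern)
    ; Flag 2 returns the full match in element [0] of the array.
    Local $aMatch = StringRegExp($sString, $sPattern, 2)
    ; Copy both macros straight away, before another call can reset them.
    Local $iError = @error, $iNext = @extended
    If $iError Then Return 0
    ; Start of match = offset just past the match minus the match length.
    Return $iNext - StringLen($aMatch[0])
EndFunc

; Usage: position of a capital letter that directly follows a lowercase letter.
ConsoleWrite(_RegExPos("SamanthaRichardson", "(?<=[a-z])[A-Z]") & @CRLF)
```

Flag 2 is used instead of flag 1 so that the full match is always in element [0], even when the pattern contains capturing groups.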


I will try now. Thank you for the reply.

I am finding this one difficult, as locating the position of the second capital is challenging.
Can you please help? (Two uppercase letters in a single word.)

$inputString = "MichaelJackson"
$outputString = "Michael Jackson"

 

I will use the answer for this one and try it on the rest of the patterns.
I am trying with "([A-Z])\w+([A-Z])\w+" but getting confused with the positioning.

 

This pattern has to be searched for in the entire text file and action taken.


Local $sInputstring = "MichaelJackson"
; Insert a space at every point where a word character is followed by an uppercase letter
Local $sOutputstring = StringRegExpReplace($sInputstring, "\w\K(?=[[:upper:]])", " ")
ConsoleWrite($sOutputstring)

 

\w        Matches any "word" character: any digit, any letter or underscore "_"
\K        Resets start of match at the current point in subject string (the character before the matching upper letter won't be part of the replacement)
(?=X)     Positive look-ahead: matches when the subpattern X matches starting at the current position.
[:upper:] ASCII uppercase letters (same as [A-Z]).

 


Try this:

#include <Array.au3> ; needed for _ArrayToString

; read file with FileReadToArray
Local $s = [ _
    '659855424638 Michelle Heidt 978-240-0653 214-585-8297 michellemheidt@gustr.com Maxillofacial radiologist "Michelle Heidt, 2095 Pearlman Avenue, Franklin, Massachusetts, United States, 2038"', _
    '659855424639 Emilee Akins 904-724-3260 502-463-3665 emileerakins@armyspy.com Forest and conservation worker "Emilee Akins, 2054 Boundary Street, Jacksonville, Florida, United States, 32211"', _
    '659855424640 Lori Girouard 512-963-1160 413-772-3313 lorilgirouard@teleworm.us Agricultural and food science technician "Lori Girouard, 4603 Short Street, Austin, Texas, United States, 78741"', _
    '659855424628 SamanthaRichardson407-856-8677 973-447-6977 samanthatrichardson@example.comBudget officer "Samantha Richardson, 4599 McDonald Avenue, McDonald, McDonald, United States, 12345"', _
    '659855424628 Samantha Richardson 407-856-8677 973-447-6977 samanthatrichardson@example.com Budget officer "Samantha Richardson, 4599 McDonald Avenue, McDonald, McDonald, United States, 12345"' _
]

Local $a
For $i = 0 To UBound($s) - 1
    $a = StringRegExp($s[$i], '(\d+\s[A-Z][a-z]+)\s?([A-Z][a-z]+)\s?([\d -]+[^@]+@[-a-z_]+\.[-a-z_]+(?:\.[-a-z_]+)?)\s?([A-Z][^"]+[a-z])\s?(".*)', 3)
    ConsoleWrite(@error & '   ' & @extended & @LF)
    If Not IsArray($a) Then ContinueLoop ; no match: leave the line unchanged
    $s[$i] = _ArrayToString($a, ' ')
    ConsoleWrite($s[$i] & @LF)
Next

; delete input file and write with FileWriteFromArray (be sure you have a safe copy of input first!)

Exit
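
The two comments in the script above (read with FileReadToArray, write back from the array) could be filled in along these lines. This is a minimal sketch, not a tested tool: the file names are placeholders, and the single replace inside the loop just stands in for whatever fixes you want to apply. The array index is the line number, so each line keeps its position.

```autoit
#include <File.au3> ; for _FileWriteFromArray

; Hypothetical file names, not from the thread.
Local $sInFile = @ScriptDir & "\input.txt"
Local $sOutFile = @ScriptDir & "\output.txt"

; FileReadToArray returns a 0-based array, one element per line.
Local $aLines = FileReadToArray($sInFile)
If @error Then
    ConsoleWrite("Could not read " & $sInFile & @CRLF)
    Exit 1
EndIf

For $i = 0 To UBound($aLines) - 1
    ; Example fix: insert a space between a letter and a phone number.
    $aLines[$i] = StringRegExpReplace($aLines[$i], "[A-Za-z]\K(?=\d{3}-\d{3}-\d{4})", " ")
Next

; Write to a separate file so the input stays untouched.
If Not _FileWriteFromArray($sOutFile, $aLines) Then
    ConsoleWrite("Could not write " & $sOutFile & @CRLF)
EndIf
```

Writing to a second file instead of overwriting the input is a deliberate choice here, in line with the warning about keeping a safe copy.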

 


;~ 1. SamanthaRichardson - Find this pattern, i.e. two capitals in a single word
;~ Insert space before second capital letter.
$sInputstring = "SamanthaRichardson"
$sOutputstring = StringRegExpReplace($sInputstring, "\w\K(?=[[:upper:]])", " ")
ConsoleWrite($sInputstring & " => " & $sOutputstring & @CRLF)

;~ 2. dson407 - word continued by numbers without space
;~ Insert space before number
$sInputstring = "dson407"
$sOutputstring = StringRegExpReplace($sInputstring, "[A-Za-z]\K(?=\d)", " ")
ConsoleWrite($sInputstring & " => " & $sOutputstring & @CRLF)

;~ 3. cer"S - No space before double quote
;~ Insert space before double quote
$sInputstring = "cer""S"
$sOutputstring = StringRegExpReplace($sInputstring, "[A-Za-z]\K(?="")", " ")
ConsoleWrite($sInputstring & " => " & $sOutputstring & @CRLF)


;~ 4. ple.comBu - letters after .com
;~ Insert space after .com (match ".com" first, then look ahead for a letter)
$sInputstring = "ple.comBu"
$sOutputstring = StringRegExpReplace($sInputstring, "\.com\K(?=[A-Za-z])", " ")
ConsoleWrite($sInputstring & " => " & $sOutputstring & @CRLF)
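
One caveat when moving from these short test strings to a whole line: the first pattern would also split "McDonald" in the address part. A possible way around it is to anchor each pattern to its context in the line. The sketch below assumes the layout from the first post (numeric id, first name, last name, phones, email, job title, quoted address); _FixLine is a made-up name and the anchored patterns are my own guesses, not something tested on the full file.

```autoit
; Hypothetical helper: apply the four fixes to one whole line.
Func _FixLine($sLine)
    ; 1. Split first and last name: only right after the leading id and one capitalised word.
    $sLine = StringRegExpReplace($sLine, '^\d+\s[A-Z][a-z]+\K(?=[A-Z])', ' ')
    ; 2. Letter directly followed by a phone number.
    $sLine = StringRegExpReplace($sLine, '[A-Za-z]\K(?=\d{3}-\d{3}-\d{4})', ' ')
    ; 3. Letter touching the opening quote of the quoted address at the end of the line.
    $sLine = StringRegExpReplace($sLine, '[A-Za-z]\K(?="[^"]*"$)', ' ')
    ; 4. Capital letter glued to the end of a .com email address.
    $sLine = StringRegExpReplace($sLine, '@\S+?\.com\K(?=[A-Z])', ' ')
    Return $sLine
EndFunc

; Usage with a line that contains all four errors:
Local $sBad = '659855424628 SamanthaRichardson407-856-8677 973-447-6977 samanthatrichardson@example.comBudget officer"Samantha Richardson, 4599 McDonald Avenue, McDonald, McDonald, United States, 12345"'
ConsoleWrite(_FixLine($sBad) & @CRLF)
```

The patterns are in single quotes so the double quote in fix 3 does not need doubling. A first name that itself contains a second capital would still need special handling.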

 


On 2/23/2018 at 7:54 PM, jguinch said:

Thank you @jguinch,

It was really helpful and inspiring.
Having the knowledge can save hours of work.


@jguinch

$lineInput =
798678 168165 TANGOSOLINC T-1480240304 4 August 2004 Randy Johnston "May 9, 2004" 11:11:00 AM GREGGORY DAY offthewallsl@home.com LEONARD ALLENSTEIN 1935 N PROSPECT ST WALDORF VA 58575 (865) 932-7685 VGCC |47855444555| 2-3 4-4 1-3 5

Fields: refId-InvoiceNumber-CompanyName-CourierNumber-CourierDate-PersonName-Date-Time-PersonName-Email-PersonName-Address-City-State-Zip-Phone-Group-GroupCode-ProductList

Can you please assist with the usage of StringRegExpReplace?
I wish to enclose each of the following patterns with a pipe symbol | before and after:

1. 10 July 2004 or 01 April 2014 or 1 May 2014
2. "June 17, 2004" or "November 30, 1999"   ;quotes exist in the original data
3. 05:16:00 PM or 07:44:00 AM or 04:11:00 PM
4. Email address
5. (856) 845-5184 or (860) 667-1874
6. 4-4 or 12-4 or 18-18 or 4-18
7. WALDORF VA 58575 (865) 932-7685 or FONTANA IN 59094 (859) 689-2150 or LONG BEACH ma 58886 (860) 741-0435 or SAN BERNARDINO fl 59138 (858) 488-3780
This one is City | State | Zip | Phone.
Once we find the phone, we can work from right to left: the first number is the zip, then comes the two-character state code.
For the city I have a different approach: I am planning to have a text file with a list of cities, and if the first word to the left of the state matches an entry, use it.
Otherwise try two words (e.g. Los Angeles, San Francisco etc.) and use that. But to be able to do this, I need to segment the line with "|".
8. E-1480240304 or T-1516958759 or W-055373 or W-055373373

 

I hope you can help. I have been trying various combinations but am really struggling.
I will be thankful for your help.
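
As a possible starting point for some of the items above (the day-month-year date, the time, the email and the phone), the whole match can be wrapped in pipes by using $0 in the replacement. This is a rough sketch only: the patterns are my own guesses and the test line is a shortened piece of the sample, so the other items would need their own patterns.

```autoit
; Shortened piece of the sample line, for illustration only.
Local $sLine = 'T-1480240304 4 August 2004 Randy Johnston "May 9, 2004" 11:11:00 AM GREGGORY DAY offthewallsl@home.com LEONARD ALLENSTEIN (865) 932-7685 VGCC'

; Assumed patterns: day-month-year date, time, email, phone.
Local $aPatterns[4] = [ _
    '\b\d{1,2} [A-Z][a-z]+ \d{4}\b', _
    '\b\d{2}:\d{2}:\d{2} [AP]M\b', _
    '\S+@\S+\.[A-Za-z]{2,}', _
    '\(\d{3}\) \d{3}-\d{4}' _
]

For $sPattern In $aPatterns
    ; $0 in the replacement stands for the whole matched text.
    $sLine = StringRegExpReplace($sLine, $sPattern, '|$0|')
Next

ConsoleWrite($sLine & @CRLF)
```

Each pattern runs over the whole line, so a pattern that is too loose can also wrap text in other fields. It is worth testing on real data before overwriting anything.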

 

 
