LondonNDIB

Bloody regex... stuck again :(

12 posts in this topic

#1 ·  Posted (edited)

I've read so many regex tutorials... I guess my brain just isn't wired for this beast.  Just as I think I'm getting a handle...

Can someone help me out?  Sample text below

<tbody>
                    <tr class="currentOrders_bluepanel">
                        <td class="currentOrders_bluetext">D135069445</td>
                        <td class="currentOrders_bluetext">Tracked Packet USA</td>
                        <td class="currentOrders_bluetext" nowrap="nowrap">
<script type="text/javascript" src="/esto/app/javax.faces.resource/jsf.js?ln=javax.faces"></script>
<a id="esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder" href="#" onclick="mojarra.jsfcljs(document.getElementById('esto_currentorders_form'),{'esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder':'esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder'},'reprintArtifact');return false">Shipping Label</a>
                        </td>
                        <td class="currentOrders_bluetext">LM026058444CA</td>
                      <td class="currentOrders_bluetext" align="right">7244353</td>
                        <td class="currentOrders_bluetext" align="right">95336</td>
                    </tr>
                    <tr class="currentOrders_greypanel">
                        <td class="currentOrders_bluetext">D135064462</td>
                        <td class="currentOrders_bluetext">Small Packet International Air</td>
                        <td class="currentOrders_bluetext" nowrap="nowrap"><a id="esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder" href="#" onclick="mojarra.jsfcljs(document.getElementById('esto_currentorders_form'),{'esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder':'esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder'},'reprintArtifact');return false">Shipping Label</a>
                        </td>
                        <td class="currentOrders_bluetext"></td>
                      <td class="currentOrders_bluetext" align="right">7244353</td>
                        <td class="currentOrders_bluetext" align="right">35018</td>
                    </tr>
            </tbody>

I have a known order number (e.g. D135069445) and I want to grab the tracking number associated with it (for the same example: LM026058444CA).

So basically I want to say: in the text after "D135069445", look for something that looks like LM026058444CA.  But here's the kicker... sometimes it doesn't look like LM026058444CA.  Sometimes it is all numeric, something like this: 7244353444343422.  And it isn't safe to assume the number of characters will always be the same (although it should be close).  Let's say "between 12 and 18" should be good.

So I guess I want "a bunch of alphanumeric characters between > and < at some point following D135069445".

If it helps, I know the tracking number will always be on the line immediately preceding a line containing "7244353".  But there will be multiple instances of that, i.e. the 7244353 is not unique but the D135069445 is unique.

 

Heeeeeeeellllppppppppp!

Edited by LondonNDIB

#2 ·  Posted (edited)

I came up with this pattern that does match... but I'm worried: is it succinct enough?  I'm afraid I just don't understand this stuff enough to know WHY this worked, or to be comfortable that it won't false-match or fail to match some of the time:

(?s)(?:D135069445)(?:.*)(?:>)([[:alnum:]]{12,18})
Edited by LondonNDIB

Curious.  Just to test my pattern, I doubled up on my sample text (copied and pasted so everything was there twice) and I expected it to return an array of two matches... but it doesn't, still just one.  Why?

Being more specific, here is another way

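#Include <Array.au3>

$txt = FileRead("1.txt")   ; HTML source saved to a file
$var = "D135069445"        ; the known order number
; Find the order's cell, skip lazily past the "Shipping Label" link (</a>), then capture the contents of the next bluetext"> cell
$res = StringRegExp($txt, '(?s)bluetext">' & $var & '.*?</a>.*?bluetext">([^<]*)', 3)
_ArrayDisplay($res)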

Thanks!

Mine didn't work.  When I tested with more orders and tracking numbers.... it would always grab the LAST tracking number, rather than the one immediately following the order number.

I don't get it :(

 

I tested yours and expected that you had misunderstood my question because, TO ME, it looks like you're trying to match the order number.  I can't see ANYTHING in your example that indicates it will show what I'm looking for... but it does.

That's GREAT that you helped give me an answer.  I wish I understood.

Suggestion: use XML parsers, or html parsers.

Wanna point me in the right direction?

Last night I came across the same advice on several other forums.  So I googled "html parser" and only came across things that were even more complicated!  They also seemed like they were geared toward grabbing massive amounts of data (site mining, etc).  What am I missing?

#8 ·  Posted (edited)

The regex anchors on markers from the html code: bluetext"> and </a>, joined by lazy '0 or more chars' runs (.*?)

([^<]*) means '0 or more non-< chars', so it captures everything up to the next tag

:)
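
For what it's worth, the greedy versus lazy difference is easier to see on a cut-down example. The sketch below assumes a made-up two-row string ($sHtml and the second tracking number are invented for illustration, not taken from the real page) and runs the pattern from post #2 once with .* and once with .*?

```autoit
; Two made-up rows, shortened from the sample HTML (assumption: illustration only)
Local $sHtml = '<td>D135069445</td><td>LM026058444CA</td>' & _
        '<td>D135064462</td><td>7244353444343422</td>'

; Greedy: .* runs to the end of the text, then backs up to the LAST ">"
; that is followed by 12-18 alphanumeric characters
Local $aGreedy = StringRegExp($sHtml, '(?s)D135069445.*>([[:alnum:]]{12,18})', 1)

; Lazy: .*? grows one character at a time and stops at the FIRST such ">"
Local $aLazy = StringRegExp($sHtml, '(?s)D135069445.*?>([[:alnum:]]{12,18})', 1)

ConsoleWrite('greedy: ' & $aGreedy[0] & @CRLF) ; 7244353444343422
ConsoleWrite('lazy:   ' & $aLazy[0] & @CRLF)   ; LM026058444CA
```

That is why the greedy pattern came back with the LAST tracking number once more orders were in the text. It is also why doubling the sample text still gave a single match: the greedy .* swallowed the second copy as part of the first match.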

Edit

Try this one on your html

$res = StringRegExp($txt, '(?s)bluetext">([^<]+).*?</a>.*?bluetext">([^<]*)', 3)

This should grab every order number / tracking number pair
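
If only one order is needed at a time, the same idea can be wrapped in a small function. This is only a sketch: _GetTrackingNumber is a hypothetical name, the 12 to 18 character check is taken from the range suggested in the first post, and 1.txt is assumed to hold the saved HTML.

```autoit
; Hypothetical helper: returns the tracking number for $sOrder, or "" with @error set
Func _GetTrackingNumber($sHtml, $sOrder)
    ; \Q...\E makes the order number literal, in case it ever contains regex metacharacters
    Local $sPattern = '(?s)bluetext">\Q' & $sOrder & '\E</td>.*?</a>.*?bluetext">([^<]*)'
    Local $aMatch = StringRegExp($sHtml, $sPattern, 1)
    ; @error is set when the order number (or the cell after the link) is not found
    If @error Or Not IsArray($aMatch) Then Return SetError(1, 0, "")
    ; Reject empty cells and anything that is not 12-18 alphanumeric characters
    If Not StringRegExp($aMatch[0], '^[[:alnum:]]{12,18}$') Then Return SetError(2, 0, "")
    Return $aMatch[0]
EndFunc   ;==>_GetTrackingNumber

; Usage, assuming the HTML was saved to 1.txt
Local $sTracking = _GetTrackingNumber(FileRead("1.txt"), "D135069445")
If @error Then
    ConsoleWrite("No usable tracking number found" & @CRLF)
Else
    ConsoleWrite($sTracking & @CRLF)
EndIf
```

On the sample html this should give LM026058444CA for D135069445, and report nothing usable for D135064462 because its tracking cell is empty.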

Edited by mikell

I think my brain kinda fills in with elevator music when I get to the terms "lazy" and "greedy".

I'm grateful for your help.  Some day, I may start to understand.

#10 ·  Posted (edited)

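If the line count and naming convention stay the same between the known value and the target string, this works:

#Include <Array.au3>
#Include <File.au3>

local $aArray = 0
$txt = _FileReadToArray("1.txt" , $aArray) ; one array element per line
$index = _ArrayFindAll($aArray , "D135069445" , 0  , 0, 0 , 1) ; partial-match search for the order line
; the tracking number is 6 lines further down; strip all whitespace, then trim the 34-character opening tag and the 5-character </td>
msgbox(0, '' , (stringtrimright(stringtrimleft(stringstripws($aArray[$index[0] + 6] , 8) , 34) , 5)))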

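Here it is using the rule that the tracking number is on the line above the next 7244353:

#Include <Array.au3>
#Include <File.au3>

local $aArray = 0
$txt = _FileReadToArray("1.txt" , $aArray) ; one array element per line
$index = _ArrayFindAll($aArray , "D135069445" , 0  , 0, 0 , 1) ; partial-match search for the order line

; start searching at the order line itself ($index[0], not the whole $index array)
$aTarget = _ArrayFindAll($aArray , "7244353" , $index[0] , 0 , 0 , 1)

; take the line just above the first 7244353 and trim the tags as before
msgbox(0, '' , (stringtrimright(stringtrimleft(stringstripws($aArray[$aTarget[0] - 1] , 8) , 34) , 5)))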
Edited by boththose
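
A slightly more defensive take on the same line-based idea, as a sketch only. Two things here are assumptions rather than part of the code above: searching for ">7244353<" instead of "7244353" (so an all-numeric tracking number that happens to contain 7244353, like the example in the first post, is not mistaken for the account cell), and pulling the cell text out with a small regex instead of fixed trim lengths.

```autoit
#include <Array.au3>
#include <File.au3>

Local $aLines
; _FileReadToArray returns 0 on failure; on success $aLines[0] holds the line count
If Not _FileReadToArray("1.txt", $aLines) Then Exit 1

; Partial-match search (last argument = 1) for the line holding the order number
Local $aOrder = _ArrayFindAll($aLines, "D135069445", 1, 0, 0, 1)
If @error Then Exit 2

; First account-number cell at or after the order line
Local $aAccount = _ArrayFindAll($aLines, ">7244353<", $aOrder[0], 0, 0, 1)
If @error Then Exit 3

; Capture whatever sits between ">" and "</td>" on the line just above it
Local $aCell = StringRegExp($aLines[$aAccount[0] - 1], '>([^<]*)</td>', 1)
If Not @error Then MsgBox(0, '', $aCell[0])
```

This still depends on the tracking cell sitting on its own line directly above the 7244353 cell, so it will break if the page layout changes.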

And another view of the problem.

#include <IE.au3>
#include <Array.au3>

; ------ Get html source  --------
Local $sHTML = StringRegExpReplace(FileRead(@ScriptFullPath), "(?s)^.*#cs\s*(.+)\s*#ce.*$", "\1") ; Get HTML text from this script that is between #cs and #ce.
;Local $sHTML = FileRead("1.txt")  ; or,  Get HTML text from text file.
;ConsoleWrite($sHTML & @LF)
; -------------------------------

$sText = _IE_InnerhtmlToOutertext($sHTML) ; Get the text that is displayed in browser.
;ConsoleWrite( $sText & @LF)

$var = "D135069445" ; "D135064462" ;

$res = StringRegExpReplace($sText, '(?is)^.*' & $var & '.*?Shipping Label\s+([A-Z0-9]+)(?=\s*7244353).*$', "\1")
$res = @extended ? $res : "" ; If @extended <> 0 Then $res = $res Else $res = "".  Used because this returns nothing instead of the whole string returning, when there is no RE match, or zero replacements.
ConsoleWrite($res & @LF)
;or

$res = StringRegExp($sText, '(?is)' & $var & '.*?Shipping Label\s+([A-Z0-9]+)(?=\s+7244353)', 3)
_ArrayDisplay($res)


; From HTML source, "innerhtml", get "outertext", the text that is displayed in browser.
Func _IE_InnerhtmlToOutertext($sSource)
    Local $oIE = _IECreate("about:blank", 0, 0) ; Hidden
    _IEPropertySet($oIE, "innerhtml", $sSource)
    Local $sRet = _IEPropertyGet($oIE, "outertext")
    _IEQuit($oIE)
    Return $sRet
EndFunc   ;==>_IE_InnerhtmlToOutertext

#cs
    <tbody>
    <tr class="currentOrders_bluepanel">
    <td class="currentOrders_bluetext">D135069445</td>
    <td class="currentOrders_bluetext">Tracked Packet USA</td>
    <td class="currentOrders_bluetext" nowrap="nowrap">
    <script type="text/javascript" src="/esto/app/javax.faces.resource/jsf.js?ln=javax.faces"></script>
    <a id="esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder" href="#" onclick="mojarra.jsfcljs(document.getElementById('esto_currentorders_form'),{'esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder':'esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder'},'reprintArtifact');return false">Shipping Label</a>
    </td>
    <td class="currentOrders_bluetext">LM026058444CA</td>
    <td class="currentOrders_bluetext" align="right">7244353</td>
    <td class="currentOrders_bluetext" align="right">95336</td>
    </tr>
    <tr class="currentOrders_greypanel">
    <td class="currentOrders_bluetext">D135064462</td>
    <td class="currentOrders_bluetext">Small Packet International Air</td>
    <td class="currentOrders_bluetext" nowrap="nowrap"><a id="esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder" href="#" onclick="mojarra.jsfcljs(document.getElementById('esto_currentorders_form'),{'esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder':'esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder'},'reprintArtifact');return false">Shipping Label</a>
    </td>
    <td class="currentOrders_bluetext"></td>
    <td class="currentOrders_bluetext" align="right">7244353</td>
    <td class="currentOrders_bluetext" align="right">35018</td>
    </tr>
    </tbody>
#ce

Just wanted to post another "thanks" for the help in here guys!  

After working WAY too long on this, I just ran through a job of 100+ labels and it went PERFECTLY!  Yay :)   Every tracking number was accurately grabbed.  OK, that was just a minor part of this, but it was a headache for me!

 

Incidentally, along the way I found a nice command line tool for marking up PDF files programmatically.  I only found it after a lot of searching and wasting time on more complicated and/or less stable methods.  I wonder if this kind of tool is something I should start a thread about here?
