
Bloody regex... stuck again :(



I've read so many regex tutorials... I guess my brain just isn't wired for this beast.  Just as I think I'm getting a handle...

Can someone help me out?  Sample text below

<tbody>
                    <tr class="currentOrders_bluepanel">
                        <td class="currentOrders_bluetext">D135069445</td>
                        <td class="currentOrders_bluetext">Tracked Packet USA</td>
                        <td class="currentOrders_bluetext" nowrap="nowrap">
<script type="text/javascript" src="/esto/app/javax.faces.resource/jsf.js?ln=javax.faces"></script>
<a id="esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder" href="#" onclick="mojarra.jsfcljs(document.getElementById('esto_currentorders_form'),{'esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder':'esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder'},'reprintArtifact');return false">Shipping Label</a>
                        </td>
                        <td class="currentOrders_bluetext">LM026058444CA</td>
                      <td class="currentOrders_bluetext" align="right">7244353</td>
                        <td class="currentOrders_bluetext" align="right">95336</td>
                    </tr>
                    <tr class="currentOrders_greypanel">
                        <td class="currentOrders_bluetext">D135064462</td>
                        <td class="currentOrders_bluetext">Small Packet International Air</td>
                        <td class="currentOrders_bluetext" nowrap="nowrap"><a id="esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder" href="#" onclick="mojarra.jsfcljs(document.getElementById('esto_currentorders_form'),{'esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder':'esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder'},'reprintArtifact');return false">Shipping Label</a>
                        </td>
                        <td class="currentOrders_bluetext"></td>
                      <td class="currentOrders_bluetext" align="right">7244353</td>
                        <td class="currentOrders_bluetext" align="right">35018</td>
                    </tr>
            </tbody>

I have a known order (eg: D135069445) and I want to grab the tracking number associated with it (same eg: LM026058444CA)

So basically I want to say: in the text after "D135069445", look for something that looks like LM026058444CA.  But here's the kicker... sometimes it doesn't look like LM026058444CA.  Sometimes it is all numeric, something like this:  7244353444343422.  And it isn't safe to assume the number of characters will always be the same (although it should be close).  Let's say "between 12 and 18" should be good.

So I guess I want "a bunch of alphanumeric characters between > and <, at some point following D135069445".

If it helps, I know the tracking number will always be on the line immediately preceding a line containing "7244353".  But there will be multiple instances of that.  Ie. The 7244353 is not unique but the D135069445 is unique.

 

Heeeeeeeellllppppppppp!

Edited by LondonNDIB

I came up with this pattern that does match... but I'm worried about whether it's succinct enough.  I'm afraid I just don't understand this stuff well enough to know WHY it worked, or to be comfortable that it won't false-match or fail to match some of the time:

(?s)(?:D135069445)(?:.*)(?:>)([[:alnum:]]{12,18})
Edited by LondonNDIB

Thanks!

Mine didn't work.  When I tested with more orders and tracking numbers.... it would always grab the LAST tracking number, rather than the one immediately following the order number.

I don't get it :(

 

I tested yours and expected that you had misunderstood my question, because TO ME it looks like you're trying to match the order number.  I can't see ANYTHING in your example that indicates it will return what I'm looking for... but it does.

That's GREAT that you helped give me an answer.  I wish I understood.


Suggestion: use XML parsers, or html parsers.

Wanna point me in the right direction?

Last night I came across the same advice on several other forums.  So I googled "html parser" and only came across things that were even more complicated!  They also seemed to be geared toward grabbing massive amounts of data (site mining, etc).  What am I missing?


The regex works with markers from the HTML code: bluetext"> and </a>, with some lazy '0 or more chars' (.*?) in between

([^<]*) means '0 or more non-< chars'

:)
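
To see what "lazy" changes, here is a minimal sketch comparing a greedy .* with a lazy .*? on two cut-down rows. The sample string is a simplification (attributes stripped) and the second tracking number, RX123456789012, is made up for illustration; it is not from the real page.

```autoit
; Two simplified rows. Assumption: attributes removed, second tracking number invented.
Local $sSample = '<td>D135069445</td><td>LM026058444CA</td><td>7244353</td>' & @CRLF & _
        '<td>D135064462</td><td>RX123456789012</td><td>7244353</td>'

; Greedy: .* first runs to the end of the text, then backtracks,
; so it settles on the LAST place where the rest of the pattern fits.
Local $aGreedy = StringRegExp($sSample, '(?s)D135069445.*>([[:alnum:]]{12,18})', 1)

; Lazy: .*? grows one character at a time,
; so it settles on the FIRST place where the rest of the pattern fits.
Local $aLazy = StringRegExp($sSample, '(?s)D135069445.*?>([[:alnum:]]{12,18})', 1)

ConsoleWrite("greedy: " & $aGreedy[0] & @LF) ; greedy: RX123456789012
ConsoleWrite("lazy:   " & $aLazy[0] & @LF)   ; lazy:   LM026058444CA
```

That would explain why a greedy pattern keeps returning the last tracking number in the table. Note that even the lazy version will run on into the next row when an order has an empty tracking cell, which is one reason to anchor on the surrounding markup as well.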

Edit

Try this one on your html

$res = StringRegExp($txt, '(?s)bluetext">([^<]+).*?</a>.*?bluetext">([^<]*)', 3)

This should grab all the order/tracking pairs
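
Since flag 3 returns one flat array (order, tracking, order, tracking, ...), picking out the tracking number for a known order is a matter of walking it two elements at a time. A rough sketch follows; _TrackingForOrder is a hypothetical helper name and the inline sample is a shortened copy of the HTML from the first post, so treat it as an illustration rather than a drop-in solution.

```autoit
; Shortened sample (assumption: link attributes trimmed, structure otherwise as posted).
Local $sHtml = '<tr><td class="currentOrders_bluetext">D135069445</td>' & @CRLF & _
        '<td class="currentOrders_bluetext">Tracked Packet USA</td>' & @CRLF & _
        '<td class="currentOrders_bluetext" nowrap="nowrap"><a href="#">Shipping Label</a></td>' & @CRLF & _
        '<td class="currentOrders_bluetext">LM026058444CA</td>' & @CRLF & _
        '<td class="currentOrders_bluetext" align="right">7244353</td></tr>' & @CRLF & _
        '<tr><td class="currentOrders_bluetext">D135064462</td>' & @CRLF & _
        '<td class="currentOrders_bluetext">Small Packet International Air</td>' & @CRLF & _
        '<td class="currentOrders_bluetext" nowrap="nowrap"><a href="#">Shipping Label</a></td>' & @CRLF & _
        '<td class="currentOrders_bluetext"></td>' & @CRLF & _
        '<td class="currentOrders_bluetext" align="right">7244353</td></tr>'

ConsoleWrite("[" & _TrackingForOrder($sHtml, "D135069445") & "]" & @LF) ; [LM026058444CA]
ConsoleWrite("[" & _TrackingForOrder($sHtml, "D135064462") & "]" & @LF) ; [] (no tracking number yet)

; Hypothetical helper: returns the tracking number paired with $sOrder, or "" if there is none.
Func _TrackingForOrder($sSource, $sOrder)
    ; Same pattern as above; flag 3 returns every (order, tracking) pair in one flat array.
    Local $aPairs = StringRegExp($sSource, '(?s)bluetext">([^<]+).*?</a>.*?bluetext">([^<]*)', 3)
    If @error Then Return "" ; no match at all
    ; Even indexes hold order numbers, the following odd index holds the tracking number.
    For $i = 0 To UBound($aPairs) - 2 Step 2
        If $aPairs[$i] == $sOrder Then Return $aPairs[$i + 1]
    Next
    Return ""
EndFunc   ;==>_TrackingForOrder
```

Comparing against the captured order number, instead of embedding it in the pattern, keeps the regex fixed and makes an empty tracking cell show up as an empty string.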

Edited by mikell

If the line count and naming convention stay the same between the known value and the target string, you can simply count lines from the order number:

#Include <Array.au3>
#Include <File.au3>

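Local $aArray = 0
$txt = _FileReadToArray("1.txt", $aArray) ; read the saved HTML, one line per array element
; Partial-match search ($iCompare = 1) for the line holding the known order number
$index = _ArrayFindAll($aArray, "D135069445", 0, 0, 0, 1)
; Tracking number is 6 lines further down: strip all whitespace, then trim the 34-char opening tag and the 5-char </td>
MsgBox(0, '', StringTrimRight(StringTrimLeft(StringStripWS($aArray[$index[0] + 6], 8), 34), 5))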

Here it is under the rule that the tracking number is on the line just above the next line containing 7244353:

#Include <Array.au3>
#Include <File.au3>

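Local $aArray = 0
$txt = _FileReadToArray("1.txt", $aArray)
$index = _ArrayFindAll($aArray, "D135069445", 0, 0, 0, 1)

; Start searching at the order's own line; the start index must be a number, so pass $index[0], not the $index array
$aTarget = _ArrayFindAll($aArray, "7244353", $index[0], 0, 0, 1)

; The tracking number is on the line just above the first "7244353" found after the order
MsgBox(0, '', StringTrimRight(StringTrimLeft(StringStripWS($aArray[$aTarget[0] - 1], 8), 34), 5))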
Edited by boththose


And another view of the problem.

#include <IE.au3>
#include <Array.au3>

; ------ Get html source  --------
Local $sHTML = StringRegExpReplace(FileRead(@ScriptFullPath), "(?s)^.*#cs\s*(.+)\s*#ce.*$", "\1") ; Get HTML text from this script that is between #cs and #ce.
;Local $sHTML = FileRead("1.txt")  ; or, get HTML text from a text file.
;ConsoleWrite($sHTML & @LF)
; -------------------------------

$sText = _IE_InnerhtmlToOutertext($sHTML) ; Get the text that is displayed in browser.
;ConsoleWrite( $sText & @LF)

$var = "D135069445" ; "D135064462" ;

$res = StringRegExpReplace($sText, '(?is)^.*' & $var & '.*?Shipping Label\s+([A-Z0-9]+)(?=\s*7244353).*$', "\1")
$res = @extended ? $res : "" ; If @extended <> 0 Then $res = $res Else $res = "".  Used because this returns nothing instead of the whole string returning, when there is no RE match, or zero replacements.
ConsoleWrite($res & @LF)
;or

$res = StringRegExp($sText, '(?is)' & $var & '.*?Shipping Label\s+([A-Z0-9]+)(?=\s+7244353)', 3)
_ArrayDisplay($res)


; From HTML source, "innerhtml", get "outertext", the text that is displayed in browser.
Func _IE_InnerhtmlToOutertext($sSource)
    Local $oIE = _IECreate("about:blank", 0, 0) ; Hidden
    _IEPropertySet($oIE, "innerhtml", $sSource)
    Local $sRet = _IEPropertyGet($oIE, "outertext")
    _IEQuit($oIE)
    Return $sRet
EndFunc   ;==>_IE_InnerhtmlToOutertext

#cs
    <tbody>
    <tr class="currentOrders_bluepanel">
    <td class="currentOrders_bluetext">D135069445</td>
    <td class="currentOrders_bluetext">Tracked Packet USA</td>
    <td class="currentOrders_bluetext" nowrap="nowrap">
    <script type="text/javascript" src="/esto/app/javax.faces.resource/jsf.js?ln=javax.faces"></script>
    <a id="esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder" href="#" onclick="mojarra.jsfcljs(document.getElementById('esto_currentorders_form'),{'esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder':'esto_currentorders_form:j_idt65:0:j_idt66:0:esto_currentorder'},'reprintArtifact');return false">Shipping Label</a>
    </td>
    <td class="currentOrders_bluetext">LM026058444CA</td>
    <td class="currentOrders_bluetext" align="right">7244353</td>
    <td class="currentOrders_bluetext" align="right">95336</td>
    </tr>
    <tr class="currentOrders_greypanel">
    <td class="currentOrders_bluetext">D135064462</td>
    <td class="currentOrders_bluetext">Small Packet International Air</td>
    <td class="currentOrders_bluetext" nowrap="nowrap"><a id="esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder" href="#" onclick="mojarra.jsfcljs(document.getElementById('esto_currentorders_form'),{'esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder':'esto_currentorders_form:j_idt65:1:j_idt66:0:esto_currentorder'},'reprintArtifact');return false">Shipping Label</a>
    </td>
    <td class="currentOrders_bluetext"></td>
    <td class="currentOrders_bluetext" align="right">7244353</td>
    <td class="currentOrders_bluetext" align="right">35018</td>
    </tr>
    </tbody>
#ce

Just wanted to post another "thanks" for the help in here guys!  

After working WAY too long on this, I just ran through a job of 100+ labels and it went PERFECTLY!  Yay :)   Every tracking number was accurately grabbed.  OK, that was just a minor part of this, but it was a headache for me!

 

Incidentally, along the way I found a nice command line tool for marking up PDF files programmatically.  I only found it after a lot of searching and wasting time on more complicated and/or less stable methods.  I wonder if this kind of tool is something I should start a thread about here?
