Robinson1

Use RegExp on binary data

Robinson1

Well, the plan is to use the power of AutoIt's regular expression engine for patching binary data.
Something like this: StringRegExp( $BinaryData,  "(?s)\x55\x8B.." )
 

<cut> ... Okay straight to question/problem

As an introduction, here's a working (if somewhat pointless) example:
~it just matches the first 4 bytes of notepad.exe~

#include <FileConstants.au3>
; = 1a.= Get Data
$BinaryData = FileRead( FileOpen( @SystemDir & "\notepad.exe" , $FO_Binary), 0x1000)

ConsoleWrite('$BinaryData ' & @TAB & '= ' & $BinaryData & @CRLF )

; = 1b.= Convert
;~ $BinaryData = BinaryToString( $BinaryData )

#include "StringConstants.au3"
; = 2.= seek
$pat = "(?s) "

$pat &= "4D 5A.."

; Nice looking => working RE-Pattern
$pat = StringReplace( $pat," ",     ""  )
$pat = StringReplace( $pat,".",     ".." )


$match = StringRegExp( $BinaryData, _
            $pat, _
            $STR_REGEXPARRAYFULLMATCH _
        )
; = 3.= Output
$Pos = @extended                ; offset just past the end of the match
$Pos -= StringLen( $match [0] ) ; seek to start of match

$Pos -= 2                       ; to skip '0x...'
$Pos = BitShift($Pos,1)         ; divide by 2 (via right shift) to turn hex digits into a byte offset


ConsoleWrite('$Pos ' & @TAB & @TAB &'= ' & hex(  $Pos    ) & @CRLF )
ConsoleWrite('$match[0] ' & @TAB  & '= ' &       $match[0] & @CRLF )


;~ Expected OUTPUT:
;~ $BinaryData  = 0x4D5A900...
;~ $Pos         = 00000000
;~ $match[0]    = 4D5A9000

You can try again with 55 8B to match the start of some function:

55            PUSH    EBP
8Bxx          MOV     r32, r/m32   (typically 8BEC = MOV EBP, ESP)

The problem: done like this, it's really slow and wastes a lot of memory.
So instead of working with a 'number string monster' that looks like this:
"0x4D5A90..."

it would be really awesome to work with the real binary data.
So here we go:

#include <FileConstants.au3>
; = 1a.= Get Data
$BinaryData = FileRead( FileOpen( @SystemDir & "\notepad.exe" , $FO_Binary), 0x1000)

; = 1b.= Convert
$BinaryData = BinaryToString( $BinaryData ) ;Mod #1  line added

ConsoleWrite('$BinaryData ' & @TAB & '= ' & $BinaryData )
ConsoleWrite( @CRLF)

#include "StringConstants.au3"
; = 2.= seek
$pat = "(?s)"

$pat &= " 4D 5A.."

; Nice looking => working RE-Pattern
$pat = StringReplace( $pat," ",     "\x"    )   ;Mod #2 ""  => "\x"
;~ $pat = StringReplace( $pat,".",  ".." )      ;Mod #3  line commented out
ConsoleWrite('pat ' & @TAB  & @TAB  & '= ' & $pat & @CRLF )


$match = StringRegExp( $BinaryData, _
            $pat, _
            $STR_REGEXPARRAYFULLMATCH _
        )
;3 Output
$Pos = @extended
$Pos -= StringLen( $match [0] ) ; seek to start of match

;~ $Pos -= 2                      ;Mod #4  line commented out  ; to skip '0x...'
;~ $Pos = BitShift($Pos,1)        ;Mod #5  line commented out  ; divide by 2 (via rightshift) to


ConsoleWrite('$Pos ' & @TAB & @TAB &'= ' & hex(  $Pos    ) & @CRLF )
ConsoleWrite('$match[0] ' & @TAB  & '= ' &       $match[0] & @CRLF )


;~ Expected OUTPUT:
;~ $BinaryData  = MZ...
;~ pat      = (?s)\x4D\x5A..
;~ $Pos

Wow, that seems to work. BUT ...

... certain bytes in the range from 0x80 to 0x9F won't match. :'(

Hmm, this seems to be a character encoding problem. In detail, these 27 bytes are affected: 0x80, 0x82~0x8C, 0x8E, 0x91~0x9C, 0x9E, 0x9F

Here's a small code snippet to explore / explain this problem:

#include "StringConstants.au3"

$TestData = BinaryToString("0x7E7F808182")

;Okay
$match = StringRegExp( $TestData ,'\x7E' ,$STR_REGEXPARRAYFULLMATCH)
ConsoleWrite('@extended = ' & @extended & '  $match = ' & $match & @CRLF)

;Okay
$match = StringRegExp( $TestData ,'\x7F' ,$STR_REGEXPARRAYFULLMATCH)
ConsoleWrite('@extended = ' & @extended & '  $match = ' & $match & @CRLF)

;Error no match
$match = StringRegExp( $TestData ,'\x80' ,$STR_REGEXPARRAYFULLMATCH)
ConsoleWrite('@extended = ' & @extended & '  $match = ' & $match & @CRLF)

;Okay
$match = StringRegExp( $TestData ,'\x81' ,$STR_REGEXPARRAYFULLMATCH)
ConsoleWrite('@extended = ' & @extended & '  $match = ' & $match & @CRLF)

;Error no match
$match = StringRegExp( $TestData ,'\x82' ,$STR_REGEXPARRAYFULLMATCH)
ConsoleWrite('@extended = ' & @extended & '  $match = ' & $match & @CRLF)

;~ output:
;~ @extended = 2  $match = 
;~ @extended = 3  $match = 
;~ @extended = 0  $match = 1
;~ @extended = 5  $match = 
;~ @extended = 0  $match = 1

Hmm, what to do? Go back to the 'number string monster' implementation, or just avoid that range of 'unsafe bytes'? What is the root of this problem?

Any idea how to fix this?
 

Update: Okay, I know a byte is not a character.
But StringRegExp operates on strings, and therefore at the character level.
As long as you stay with ANSI encoding and only use \x00 - \x7F in the search pattern, StringRegExp works well for searching binary data.

Which bytes in the range \x7F - \xFF can be matched also depends on the code page.
So the advice to avoid searching for bytes in the range 0x80-0xA0 only applies to a German locale.
I just changed this country setting:

[screenshot: vollbildaufzeichnung1p8uaa.jpg]

to Thai, and now nearly all bytes from \x7F - \xFF fail to match.
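
A quick way to see the code-page dependence is to convert each byte from 0x80 to 0x9F with BinaryToString() and compare the resulting character code with the original byte value. This is only a sketch: it assumes the default ANSI flag ($SB_ANSI), and which bytes it reports depends on the system's ANSI code page. On a Windows-1252 system, for example, byte 0x80 becomes the euro sign U+20AC, so a pattern of \x80 has nothing left to match.

```autoit
; Sketch: list the bytes whose character code changes during ANSI conversion.
; Assumes BinaryToString()'s default flag ($SB_ANSI); output varies by code page.
For $i = 0x80 To 0x9F
    Local $sChar = BinaryToString(Binary("0x" & Hex($i, 2)))
    Local $iCode = AscW($sChar)
    If $iCode <> $i Then
        ConsoleWrite("byte 0x" & Hex($i, 2) & " -> U+" & Hex($iCode, 4) & @CRLF)
    EndIf
Next
```

Bytes that are not listed keep their value as a code point, which would explain why \x81 still matched in the test above.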

czardas

Well I don't know what you're trying to do, but binary is quite meaningless if not interpreted the same way as encoded. Perhaps you should consider trying the other encoding options for BinaryToString(), if you haven't done that already. Sorry I misread your code.

Edit: Try adding (*UCP) to the start of the regular expression and see if that helps with UTF-8 encoding. Perhaps it won't. It's a mystery!

czardas

How about this?

#include "StringConstants.au3"

$TestData = BinaryToString("7E7F808182")

;Okay
$match = StringRegExp( $TestData ,'\x7E' ,$STR_REGEXPARRAYFULLMATCH)
ConsoleWrite('@extended = ' & @extended & '  $match = ' & $match & @CRLF)

;Okay
$match = StringRegExp( $TestData ,'\x7F' ,$STR_REGEXPARRAYFULLMATCH)
ConsoleWrite('@extended = ' & @extended & '  $match = ' & $match & @CRLF)

;Error no match
$match = StringRegExp( $TestData ,'\x80' ,$STR_REGEXPARRAYFULLMATCH)
ConsoleWrite('@extended = ' & @extended & '  $match = ' & $match & @CRLF)

;Okay
$match = StringRegExp( $TestData ,'\x81' ,$STR_REGEXPARRAYFULLMATCH)
ConsoleWrite('@extended = ' & @extended & '  $match = ' & $match & @CRLF)

;Error no match
$match = StringRegExp( $TestData ,'\x82' ,$STR_REGEXPARRAYFULLMATCH)
ConsoleWrite('@extended = ' & @extended & '  $match = ' & $match & @CRLF)

No, that's not it. :whistle:

The \x escape in the regular expression is trying to match a Unicode code point. This is most unlikely to coincide with an ASCII character. I'm beginning to think that is the problem.

Robinson1

Cool, thanks for your reply.

Okay, what I want to do now is patch away this nag screen

"Gesponserte Sitzung" (Sponsored session)
"Dies war eine kostenlose Sitzung mit Unterstützung von www.teamviewer.com" (This was a free session sponsored by www.teamviewer.com)

that pops up after each remote session.
 


To do so I need to

  1. locate 'ShowSponsoredSessionDialog' via some specific vectors/unique values.
  2. seek to the start of this function and put a 'Return' there to disable it.
  3. null the CRC check - same procedure as 1. & 2. ...

Well, the specifications are already there. I just thought it would be nice to apply them with AutoIt using a RegExp patch pattern.

 


Well yes, BinaryToString seems to be the critical point,
and in particular its flags, which specify how the binary data is converted/decoded.
Of those, only
$SB_ANSI (1) = binary data is ANSI (default) makes some sense here.

However, what is not in the AutoIt documentation is that this encoding/decoding depends on the country settings. I used Python 3 before, and there the string encoding/decoding issue is handled well. It's nice to learn and get practical experience on that topic.

 

So far I've ended up creating a function called BinRegExp() that wraps StringRegExp. It

  1. does a pre-search by replacing all \x7F-\xFF in the pattern with . (any char)
  2. checks each match via the slower but more reliable regex version that uses hex number strings
  3. loops if needed (to filter out match artefacts)

I may post it here sooner or later, but it's not really a solution, more of a workaround for the problem.

 

But let's get the focus back on this:

StringRegExp( $TestData ,'\x80'...

Why is it not working?

A.) all 0x80 bytes inside $TestData somehow got messed up by BinaryToString
B.) \x80 is somehow not interpreted by StringRegExp as intended
C.) something else

czardas

Using \x with extended ascii is a futile exercise. The number of bytes may be incorrect or the binary might refer to meaningless code points. This has to be the reason it doesn't work. I still need to try and understand it properly myself.

jchd

Short answer:
PCRE isn't well suited to match binary data.

Long answer:
Wait a minute guys. You supply a string and call BinaryToString()?
If you want binary input, then performing conversion to binary would be a good idea, perhaps?

$TestData = Binary("0x7E7F808182")
ConsoleWrite(_vardump($TestData) & @LF)

Gives:
Binary       (5) 0x7E7F808182

Then, please realize that
ConsoleWrite('@extended = ' & @extended & '  $match = ' & $match & @CRLF)
isn't going to tell you much, as $STR_REGEXPARRAYFULLMATCH returns an array of matches. $match being an array, ConsoleWrite-ing it is nonsensical.
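
To inspect a result from an array-returning flag, it helps to check @error first and then print the array element rather than the array itself. A minimal sketch, with a made-up test string and variable names that are not from the thread:

```autoit
#include <StringConstants.au3>

Local $sTest = "abc7Exyz" ; made-up sample input
Local $aMatch = StringRegExp($sTest, "7E", $STR_REGEXPARRAYFULLMATCH)
Local $iError = @error, $iNext = @extended ; save both before another call resets them

If $iError Then
    ConsoleWrite("no match, @error = " & $iError & @CRLF)
Else
    ; $aMatch[0] holds the full match; $iNext is the offset to continue searching from
    ConsoleWrite("match = " & $aMatch[0] & ", next offset = " & $iNext & @CRLF)
EndIf
```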

Then, the changelog of AutoIt warns us:
3.3.10.0 (23rd December, 2013) (Release)
...
Added: Regular expressions (PCRE engine) now using the new native 16bit mode and also compiled with full UCP support. Prefix patterns with (*UCP) to enable.

PCRE works character per character. Strings supplied to StringRegExp[Replace] were previously converted from native AutoIt UTF16-LE (actually just UCS-2 in fact) to UTF8 and the 8-bit PCRE engine was used. Now in 16-bit mode PCRE matches UCS-2 codepoints (using 16-bit encoding units).

As a user of the StringRegExp[Replace] wrappers, you can't control which engine (8- or 16-bit) is linked with AutoIt core and then used.
Note that you can't use \C to tell PCRE to match individual bytes regardless of character encoding, since \C works in current encoding units size (16-bit since AutoIt v3.3.10.0).
In theory you could bypass the hurdle by first converting your input binary to UTF16, but that would complicate things further, and doing so doesn't remove the final issue below.

Finally, the last problem --even with 8-bit PCRE-- with random binary data is input containing \x00 which is a string stop.

All of this results in binary not being the best food for StringRegExp. You're still not out of business. Forget binary and work on its raw hex representation!

All you have to do then is ensure that your regexp always groups pairs of characters, each of them representing one input byte.

#include "StringConstants.au3"
#include <Array.au3> ; needed for _ArrayDisplay

$TestData = "7E7F8081823031323300006162637E507F518052815382548200"

$match = StringRegExp($TestData, "(?:..)*?(7E..)", $STR_REGEXPARRAYGLOBALMATCH)
_ArrayDisplay($match)

$match = StringRegExp($TestData, "(?:..)*?(7F..)", $STR_REGEXPARRAYGLOBALMATCH)
_ArrayDisplay($match)

$match = StringRegExp($TestData, "(?:..)*?(80..)", $STR_REGEXPARRAYGLOBALMATCH)
_ArrayDisplay($match)

$match = StringRegExp($TestData, "(?:..)*?(81..)", $STR_REGEXPARRAYGLOBALMATCH)
_ArrayDisplay($match)

$match = StringRegExp($TestData, "(?:..)*?(82..)", $STR_REGEXPARRAYGLOBALMATCH)
_ArrayDisplay($match)
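
Building on the hex approach above, the byte offset of a match can be derived from @extended. The helper below is a hypothetical sketch (the name _HexFind and its parameters are not from the thread). It anchors the pattern at the start of the string so that a match always begins on a byte boundary, and it assumes the pattern is written with uppercase hex digits and two dots per wildcard byte.

```autoit
#include <StringConstants.au3>

; Hypothetical helper: zero-based byte offset of the first byte-aligned match
; of $sHexPattern in $dBinary, or -1 (with @error = 1) if nothing matches.
Func _HexFind($dBinary, $sHexPattern)
    ; String() renders binary as "0x..." with uppercase hex digits; drop the prefix
    Local $sHex = StringTrimLeft(String($dBinary), 2)
    ; ^(?:..)*? skips whole bytes only, so the capture starts on a byte boundary
    Local $aMatch = StringRegExp($sHex, "^(?:..)*?(" & $sHexPattern & ")", $STR_REGEXPARRAYMATCH)
    Local $iError = @error, $iNext = @extended ; $iNext: 1-based position just past the match
    If $iError Then Return SetError(1, 0, -1)
    ; convert the character position of the match start into a byte offset
    Return Int(($iNext - 1 - StringLen($aMatch[0])) / 2)
EndFunc   ;==>_HexFind

ConsoleWrite(_HexFind(Binary("0x7E7F808182"), "80..") & @CRLF)
```

In the five-byte test value the first 0x80 byte sits at offset 2, so that is what the call should report. On large inputs the lazy (?:..)*? prefix may be slow; that trade-off is not measured here.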

 


czardas

I did notice and tried converting the string to binary, but it didn't solve the problem. Since it had been reported as working with ANSI, I dismissed that as being the issue. The Help File states that \x applies to unicode (not ASCII). I was also quite tired. Thanks for the explanation.

Jury
"(*UCP)\x{0102}"

As an example for UTF-8 encoding note the code point is enclosed in {}  - or am I missing your problem? 

jchd

The only effect of (*UCP) is to enable [Unicode] character properties, so you can use \p and \P specifications. See PCRE documentation for more details.

Again, current implementation of PCRE in AutoIt is UTF16 only, not UTF8 as before 12/2013.

The OP wants matching on the byte basis, that's why one needs to use the hex representation to match binary because "our" PCRE will never match bytes, only UTF16.


Robinson1

Okay, so in the end I decided on a hybrid implementation:

1. Run a test to find out which chars RegExp cannot match, and store them.
2. Do a search at char level (in the match pattern, replace all non-working chars with '.', any char).
3. Check each match by converting the match to a hex number string (and converting the match pattern as well, so that it matches a hex number string).
 

Func init_NotWorkingBytes()
        ;Create TestData
        local $TestData = "0x"
        for $i=00 to 0xff
            $TestData &= StringFormat( "%02X", $i)
        Next
        $TestData = BinaryToString ($TestData)

        Global $RegExpNotWorkingBytes    =    "(?|\\x80)"

        for $i=0x0 to 0xFF
            $pat = StringFormat( "\x%02X", $i)
            $match = StringRegExp( $TestData ,$pat  ,$STR_REGEXPARRAYFULLMATCH)
            if @error<>0 then
;~             ConsoleWrite('$match = ' & _
;~                 $pat& ' - ' & $match & '  > ' & chr($i) ) ;### Debug Console

                $RegExpNotWorkingBytes &= "|(?|\" & $pat & ")"
;~                 ConsoleWrite( @CRLF)

            EndIf
        Next
    Return $RegExpNotWorkingBytes
EndFunc


; #FUNCTION# ====================================================================================================================
; Name ..........: BinRegExp
; Description ...: Use RegExp with binary data
; Syntax ........: BinRegExp($test, $pattern[, $flag = 0[, $offset = 1]])
; Parameters ....: $test                - the binary data, already converted to a string with BinaryToString().
;                  $pattern             - the search pattern, written with \xXX escapes and . wildcards.
;                  $flag                - [optional] StringRegExp flag. Default is 0; the code indexes $RetVal[0],
;                                         so pass a flag that returns an array (e.g. $STR_REGEXPARRAYFULLMATCH).
;                  $offset              - [optional] string position where the search starts. Default is 1.
; Return values .: The StringRegExp result of the last search; @error and @extended are passed through.
; Remarks .......: This is a workaround, a kind of hybrid: for \x00-\x7F it uses StringRegExp on the binary data and
;                  checks each match again with the slower StringRegExp on the hex number string of the data.
; Related .......:
; Link ..........:
; Example .......: No
; ===============================================================================================================================
Func BinRegExp($test, $pattern, $flag = 0, $offset = 1)

    $RegExpNotWorkingBytes = init_NotWorkingBytes()
;~     ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : $RegExpNotWorkingBytes = ' & $RegExpNotWorkingBytes & @CRLF & '>Error code: ' & @error & @CRLF) ;### Debug Console
;~     $RegExpNotWorkingBytes = '\\x[7-9A-Fa-f][0-9A-Fa-f]'

    ;Replace not working in Range of /x7F-/xFF with .
    $SafePattern = StringRegExpReplace( $pattern, _
            $RegExpNotWorkingBytes, _
            '.')

;~     ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : $SafePattern = ' & $SafePattern & @CRLF & '>Error code: ' & @error & @CRLF) ;### Debug Console

    for $Round = 1 to 0x7FFFFFFF

        Local $RetVal         = StringRegExp($test, $SafePattern, $flag, $offset)
        Local $RetError     = @error
        Local $RetExtended     = @extended

        If $RetError = 0 Then

            $MatchData      = $RetVal[0]
            $MatchLength = StringLen($MatchData)


                $MatchStart         =  $RetExtended
                $MatchStart     -=  $MatchLength
                $MatchStart     -= 1

            $RetVal2 = _BinRegExp($MatchData, $pattern, $flag)
            If @error = 0 Then
                ; Match is valid
                ExitLoop

            ElseIf @error = 3 Then
              ; the match was too big - apply delta; seek back from end of current match

                $offset = $MatchStart + @extended


            else
                ; ... was not a real match - look for more
                $offset = $RetExtended
            EndIf


        Else
            ExitLoop
        EndIf

        ConsoleWrite('.')
;~         myLog(@ScriptLineNumber ,"$offset = " & hex($offset) )
    Next

    Return SetError($RetError, $RetExtended, $RetVal)

EndFunc   ;==>BinRegExp

Func _BinRegExp($test, $pattern, $flag = 0, $offset = 1)
        const $xdigit = "." ;"[0-9A-Fa-f]"

    ; Replace \xXX with .
    $Numberstring = StringReplace($pattern, '\x', '')
    $Numberstring = StringReplace($Numberstring, ".", "(?:" & $xdigit & $xdigit & ")")


    $test = StringToBinary($test)


    Local $RetVal = StringRegExp( $test, $Numberstring, $STR_REGEXPARRAYMATCH  )
    Local $RetError     = @error
    Local $RetExtended     = @extended


    If $RetError = 0 Then
        $MatchData      = $RetVal[0]

        $MatchLength = StringLen($MatchData)


        $testLength = StringLen( $test ) - 2 ; no '0x'

        $delta = $testLength - $MatchLength
        if $delta >= 2 then
            ; the match was too big - set error 3 and return adjustment delta

;~             $delta = $MatchLength - $delta ; set delta to how many bytes to seek back from end of current match
            $delta = DivBy2($delta)


            Return SetError(3, $delta)

        EndIf


    endif


    Return SetError($RetError, $RetExtended, $RetVal)
EndFunc   ;==>_BinRegExp


Func DivBy2($Divident)
    Return BitShift($Divident, 1)
EndFunc   ;==>DivBy2

Full sample using this is here:
http://bit.do/TeamViewerNA 
