
Use RegExp on binary data


Strange. I don't know what is the problem with your machine.

My results:

  • Microsoft Windows x64 [Version 10.0.18363.959]
  • AutoIt v3.3.14.5
D:\AutoIt\BinFind>BinFind test.bin "\x80.."
Filename: test.bin
Regex pattern: \x80..
Offset: 0x00000080  Length: 3  Bytes: 0x80 0x81 0x82    Char: [?ü?]

D:\AutoIt\BinFind>BinFind test.bin "\x81.."
Filename: test.bin
Regex pattern: \x81..
Offset: 0x00000081  Length: 3  Bytes: 0x81 0x82 0x83    Char: [ü??]

D:\AutoIt\BinFind>BinFind test.bin "[\x80-\x9F]"
Filename: test.bin
Regex pattern: [\x80-\x9F]
Offset: 0x00000080  Length: 1  Bytes: 0x80      Char: [?]
Offset: 0x00000081  Length: 1  Bytes: 0x81      Char: [ü]
Offset: 0x00000082  Length: 1  Bytes: 0x82      Char: [?]
Offset: 0x00000083  Length: 1  Bytes: 0x83      Char: [?]
Offset: 0x00000084  Length: 1  Bytes: 0x84      Char: [?]
Offset: 0x00000085  Length: 1  Bytes: 0x85      Char: [?]
Offset: 0x00000086  Length: 1  Bytes: 0x86      Char: [?]
Offset: 0x00000087  Length: 1  Bytes: 0x87      Char: [?]
Offset: 0x00000088  Length: 1  Bytes: 0x88      Char: [?]
Offset: 0x00000089  Length: 1  Bytes: 0x89      Char: [?]
Offset: 0x0000008A  Length: 1  Bytes: 0x8A      Char: [?]
Offset: 0x0000008B  Length: 1  Bytes: 0x8B      Char: [?]
Offset: 0x0000008C  Length: 1  Bytes: 0x8C      Char: [?]
Offset: 0x0000008D  Length: 1  Bytes: 0x8D      Char: [ì]
Offset: 0x0000008E  Length: 1  Bytes: 0x8E      Char: [?]
Offset: 0x0000008F  Length: 1  Bytes: 0x8F      Char: [Å]
Offset: 0x00000090  Length: 1  Bytes: 0x90      Char: [É]
Offset: 0x00000091  Length: 1  Bytes: 0x91      Char: [?]
Offset: 0x00000092  Length: 1  Bytes: 0x92      Char: [?]
Offset: 0x00000093  Length: 1  Bytes: 0x93      Char: [?]
Offset: 0x00000094  Length: 1  Bytes: 0x94      Char: [?]
Offset: 0x00000095  Length: 1  Bytes: 0x95      Char: [?]
Offset: 0x00000096  Length: 1  Bytes: 0x96      Char: [?]
Offset: 0x00000097  Length: 1  Bytes: 0x97      Char: [?]
Offset: 0x00000098  Length: 1  Bytes: 0x98      Char: [?]
Offset: 0x00000099  Length: 1  Bytes: 0x99      Char: [?]
Offset: 0x0000009A  Length: 1  Bytes: 0x9A      Char: [?]
Offset: 0x0000009B  Length: 1  Bytes: 0x9B      Char: [?]
Offset: 0x0000009C  Length: 1  Bytes: 0x9C      Char: [?]
Offset: 0x0000009D  Length: 1  Bytes: 0x9D      Char: [¥]
Offset: 0x0000009E  Length: 1  Bytes: 0x9E      Char: [?]
Offset: 0x0000009F  Length: 1  Bytes: 0x9F      Char: [?]
Press any key to continue . . .

I suspect your antivirus software is blocking access to binary files.

BTW, the latin1 string is in fact a UCS-2 wide character string, converted from ISO Latin-1 to UCS-2. For example, byte 0x80 from the input binary data becomes the code unit 0x0080 (stored little-endian as bytes 80 00) in the mirror UCS-2 string.

The \x{hhhh} regex token matches a 16-bit code unit (not a byte) with the given hexadecimal value. The trick here is that for values up to 0xFF you can use the short form \xhh, so \x80 is the same as \x{0080} and matches the code unit (wchar) 0x0080.
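To make the mirror-string idea concrete, here is a minimal sketch in Python (not AutoIt, and not the BinFind source). It assumes the test buffer holds the byte values 0x00 to 0xFF in order, which is what the BinFind output above suggests for test.bin. The helper name find_in_mirror is made up for this illustration.

```python
import re

def find_in_mirror(data: bytes, pattern: str):
    """Search raw bytes through a Latin-1 'mirror' string.

    Yields (offset, matched_bytes) pairs. find_in_mirror is a hypothetical
    helper for this sketch, not part of BinFind.
    """
    # Latin-1 decoding maps byte N to code point N, so string offsets
    # are identical to byte offsets in the original data.
    mirror = data.decode("latin-1")
    # DOTALL lets "." match every byte value, including 0x0A.
    for m in re.finditer(pattern, mirror, flags=re.DOTALL):
        yield m.start(), m.group().encode("latin-1")

# Assumed test buffer: the 256 byte values in order.
data = bytes(range(256))

for offset, matched in find_in_mirror(data, r"\x80.."):
    print(f"Offset: 0x{offset:08X}  Length: {len(matched)}  Bytes: {matched.hex(' ')}")
```

Because every byte becomes exactly one code unit, the match position in the string is also the file offset, with no extra bookkeeping.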

If the ANSI code page (Windows-1252, for most users) were used for the conversion instead, the 0x80 byte would be encoded as the Euro sign (code point 0x20AC), and the regex search for \x80 would then fail to find the expected 0x0080.

Because the AutoIt function BinaryToString() uses the ANSI code page by default, it does not preserve bytes in the C1 control range (0x80 - 0x9F) when converting the input binary data.
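The difference between the two conversions can be checked on a single byte. This is a small Python sketch using the standard "latin-1" and "cp1252" codecs as stand-ins for the verbatim and ANSI conversions. It illustrates the encoding behaviour only and makes no claim about how BinaryToString() is implemented internally.

```python
import re

raw = b"\x80"

mirror = raw.decode("latin-1")   # verbatim mapping: U+0080
ansi = raw.decode("cp1252")      # Windows-1252 mapping: U+20AC, the Euro sign

# Code point and its UTF-16 little-endian byte layout for each conversion.
print(hex(ord(mirror)), mirror.encode("utf-16-le").hex(" "))
print(hex(ord(ansi)), ansi.encode("utf-16-le").hex(" "))

# The same pattern finds the byte only in the verbatim mirror string.
print(re.search(r"\x80", mirror) is not None)
print(re.search(r"\x80", ansi) is not None)
```

The first search succeeds and the second does not, which is the failure mode described above.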

You could also double check with the attached script HexFind.au3 that I wrote to validate the results of BinFind.

The script uses a linear search algorithm, instead of regular expressions.

 

HexFind.au3

#Region ;**** Directives created by AutoIt3Wrapper_GUI ****
#AutoIt3Wrapper_Change2CUI=y
#AutoIt3Wrapper_Run_Tidy=y
#EndRegion ;**** Directives created by AutoIt3Wrapper_GUI ****

AutoItSetOption("MustDeclareVars", 1)

;~ A demonstration to show how to perform search over binary files from command line.
;~ https://www.autoitscript.com/forum/topic/188564-use-regexp-on-binary-data

;~ Examples:
;~ HexFind "C:\Windows\System32\notepad.exe" "0x4D5A"
;~ HexFind "C:\Windows\System32\notepad.exe" "0x8984"

#include <FileConstants.au3>
#include <StringConstants.au3>

If $CmdLine[0] <> 2 Then
    ConsoleWrite("Wrong command line arguments." & @CRLF & @CRLF & "Usage: HexFind <filename> <0xFFFF...>" & @CRLF)
    Exit
EndIf

Local Const $sFilePath = $CmdLine[1]
Local Const $dSequence = Binary($CmdLine[2])

If Not FileExists($sFilePath) Then
    ConsoleWrite("File not found: " & $sFilePath & @CRLF)
    Exit
EndIf

ConsoleWrite("Filename: " & $sFilePath & @CRLF)
ConsoleWrite("Hexadecimal sequence: " & String($dSequence) & @CRLF)

; Get the binary data
Local $hFileOpen = FileOpen($sFilePath, $FO_READ + $FO_Binary)
If $hFileOpen = -1 Then
    ConsoleWrite("An error occurred when reading the file." & @CRLF)
    Exit
EndIf
Local $BinaryData = FileRead($hFileOpen)
FileClose($hFileOpen)

; Perform a linear search over the binary data.
Local $iOffset = 1, _
        $iMatches = 0
While 1
    $iOffset = _HexFind($BinaryData, $dSequence, $iOffset)
    If @error Then ExitLoop

    $iMatches += 1
    ConsoleWrite("Offset: 0x" & Hex($iOffset - 1) & "  ")   ; convert to zero-based file offset
    ConsoleWrite("Length: " & BinaryLen($dSequence) & "  ")
    ConsoleWrite("Bytes: ")
    For $j = 1 To BinaryLen($dSequence)
        Local $iByte = BinaryMid($dSequence, $j, 1)
        ConsoleWrite("0x" & Hex($iByte, 2) & " ")
    Next
    ConsoleWrite(@CRLF)
    $iOffset += BinaryLen($dSequence)   ; seek to end of match
WEnd

If $iMatches = 0 Then
    ConsoleWrite("No matches could be found." & @CRLF)
EndIf

; #FUNCTION# ====================================================================================================================
; Name ..........: _HexFind
; Description ...: Search for a byte sequence in a binary data and return the position.
; Syntax ........: _HexFind($dBinaryData, $dSequence[, $iStart = 1])
; Parameters ....: $dBinaryData         - The binary data to search.
;                  $dSequence           - The byte sequence to search for.
;                  $iStart              - [optional] The starting position of the search. Default is 1.
; Return values .: Success:               The position of the byte sequence.
;                  Failure:               0 and sets the @error flag to non-zero.
; Remarks .......: The first binary position is 1.
; Related .......:
; Link ..........:
; Example .......: No
; ===============================================================================================================================
Func _HexFind($dBinaryData, $dSequence, $iStart = 1)
    Local $iBinaryLength = BinaryLen($dBinaryData), _
            $iSeqLength = BinaryLen($dSequence)

    If $iBinaryLength = 0 Or _
            $iSeqLength = 0 Or _
            $iStart < 1 Or _
            $iStart > $iBinaryLength - $iSeqLength + 1 Then

        Return SetError(2, @extended, 0)
    EndIf

    For $iPosition = $iStart To ($iBinaryLength - $iSeqLength + 1)
        For $i = 1 To $iSeqLength
            Local $iTemp1 = BinaryMid($dBinaryData, $iPosition + $i - 1, 1)
            Local $iTemp2 = BinaryMid($dSequence, $i, 1)
            If $iTemp1 <> $iTemp2 Then
                ContinueLoop 2
            EndIf
        Next
        Return SetError(0, @extended, $iPosition)
    Next

    Return SetError(1, @extended, 0)
EndFunc   ;==>_HexFind

Expected output:

D:\AutoIt>HexFind "C:\Windows\System32\notepad.exe" "0x4D5A"
Filename: C:\Windows\System32\notepad.exe
Hexadecimal sequence: 0x4D5A
Offset: 0x00000000  Length: 2  Bytes: 0x4D 0x5A
Offset: 0x00012279  Length: 2  Bytes: 0x4D 0x5A
Offset: 0x000156D0  Length: 2  Bytes: 0x4D 0x5A
Offset: 0x00015D27  Length: 2  Bytes: 0x4D 0x5A
Offset: 0x00019555  Length: 2  Bytes: 0x4D 0x5A
Offset: 0x00023474  Length: 2  Bytes: 0x4D 0x5A
Offset: 0x00023C62  Length: 2  Bytes: 0x4D 0x5A

D:\AutoIt>HexFind "C:\Windows\System32\notepad.exe" "0x8984"
Filename: C:\Windows\System32\notepad.exe
Hexadecimal sequence: 0x8984
Offset: 0x000004A9  Length: 2  Bytes: 0x89 0x84
Offset: 0x00000D92  Length: 2  Bytes: 0x89 0x84
Offset: 0x000010AA  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000170F  Length: 2  Bytes: 0x89 0x84
Offset: 0x00001BA0  Length: 2  Bytes: 0x89 0x84
Offset: 0x00005806  Length: 2  Bytes: 0x89 0x84
Offset: 0x000077E4  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000AED0  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000B6F0  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000B7B4  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000E54E  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000E5D2  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000E5E9  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000E6C1  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000E6ED  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000E7F4  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000E896  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000EB15  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000EBDB  Length: 2  Bytes: 0x89 0x84
Offset: 0x0000F1F4  Length: 2  Bytes: 0x89 0x84
Offset: 0x0001DB88  Length: 2  Bytes: 0x89 0x84

Note: Results may vary depending on your Windows version.
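For anyone who wants a second cross-check outside AutoIt, the same non-overlapping linear search can be sketched in Python. The function name hex_find_all and the sample buffer are invented for this example. Like the loop in HexFind.au3, it resumes after the end of each match, and it reports zero-based offsets.

```python
def hex_find_all(data: bytes, sequence: bytes) -> list:
    """Return zero-based offsets of non-overlapping occurrences of sequence.

    hex_find_all is a hypothetical helper mirroring the HexFind.au3 loop.
    """
    offsets = []
    if not sequence:
        return offsets  # nothing to search for
    pos = data.find(sequence)
    while pos != -1:
        offsets.append(pos)
        # Seek to the end of the match before searching again.
        pos = data.find(sequence, pos + len(sequence))
    return offsets

# Made-up sample buffer, not real notepad.exe content.
sample = b"MZ\x90\x00MZMZ\x89\x84"
print([hex(o) for o in hex_find_all(sample, bytes.fromhex("4D5A"))])
```

To compare against a real file, read it with open(path, "rb").read() and pass the result as data.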

 

HexFind.au3 test.cmd

Edited by AmrAli
Upload .au3 file

6 hours ago, AmrAli said:

Strange. I don't know what is the problem with your machine.

The problem was the clock here! It was too late for me to think clearly, sorry.

6 hours ago, AmrAli said:

I suspect your antivirus software is blocking access to binary files.

😁 No thank you, forget that.

6 hours ago, AmrAli said:

BTW, the latin1 string is in fact a UCS-2 wide character string, converted from ISO Latin-1 to UCS-2. For example, byte 0x80 from the input binary data becomes the code unit 0x0080 (stored little-endian as bytes 80 00) in the mirror UCS-2 string.

If you remove references to Latin1, your statement is correct. It's precisely because we DON'T convert to Latin1 that the conversion is verbatim. I correct my previous (too-late-to-be true) post! Your uses of AscW and ChrW were correct, my mistake.

7 hours ago, AmrAli said:

The script uses a linear search algorithm, instead of regular expressions.

Yes, but regexes are all linear as well, albeit done in optimized low-level compiled C, faster than interpreted AutoIt.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


23 minutes ago, jchd said:

The problem was the clock here! It was too late for me to think clearly, sorry.

😁 No thank you, forget that.

If you remove references to Latin1, your statement is correct. It's precisely because we DON'T convert to Latin1 that the conversion is verbatim. I correct my previous (too-late-to-be true) post! Your uses of AscW and ChrW were correct, my mistake.

Yes, but regexes are all linear as well, albeit done in optimized low-level compiled C, faster than interpreted AutoIt.

I appreciate your help.

See you,


Again sorry for my mistake, it was > 3:30


I was playing with the 'test.bin' file you posted before to check different encodings.

I used PowerShell to encode the file to UTF-16, because I tried Notepad++ first and found it buggy.

function EncodeToUtf16($InFile, $Charset, $OutFile)  {
    $Encoding = [System.Text.Encoding]::GetEncoding($Charset)
    $BinaryText = [System.IO.File]::ReadAllText($InFile, $Encoding)
    $Utf16LE = New-Object System.Text.UnicodeEncoding -ArgumentList $False, $False
    [System.IO.File]::WriteAllText($OutFile, $BinaryText, $Utf16LE)
}

EncodeToUtf16  -InFile "$pwd\test.bin"  -Charset "Windows-1252"  -OutFile "$pwd\ansi.bin"
EncodeToUtf16  -InFile "$pwd\test.bin"  -Charset "iso-8859-1"    -OutFile "$pwd\iso_8859-1.bin"

Then I examined the hex differences in Beyond Compare.

This definitely shows that the first 256 Unicode code points are identical to the byte values of the ISO 8859-1 (Latin-1) charset, whereas Windows-1252 differs in the 0x80 - 0x9F range.

Capture.PNG

Wikipedia links:

https://en.wikipedia.org/wiki/Windows-1252

https://en.wikipedia.org/wiki/ISO/IEC_8859-1
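The claim about the first 256 code points can also be checked without a hex diff tool. Here is a minimal Python sketch, assuming Python's built-in "latin-1" and "cp1252" codecs are acceptable stand-ins for the two charsets. Note that Python's cp1252 codec leaves a few bytes undefined, so the sketch decodes with errors="replace" to avoid an exception on those.

```python
# Latin-1 maps every byte value to the Unicode code point of the same number.
latin1_identity = all(
    bytes([b]).decode("latin-1") == chr(b) for b in range(256)
)

# Bytes where Windows-1252 disagrees with that identity mapping.
# errors="replace" covers the bytes Python's cp1252 codec leaves undefined.
cp1252_diff = [
    b for b in range(256)
    if bytes([b]).decode("cp1252", errors="replace") != chr(b)
]

print(latin1_identity)
print(hex(min(cp1252_diff)), hex(max(cp1252_diff)))
```

Every differing byte falls between 0x80 and 0x9F, so outside that range the two charsets agree byte for byte.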

Edited by AmrAli