Jump to content

Looking for feedback on function to do a binary-like search on a 21GB file of hashes from haveibeenpwned.com


Recommended Posts

Posted (edited)

So I noticed the other day that haveibeenpwned.com provides a downloadable list of the password hashes they have so that you can more securely check a password, obviously if you use the their website to check a password you would then have to consider that password less secure compared to If you check it locally on your machine.

The password list is about 21GB of SHA1 hashes along with what I believe is the count of breaches the password has been found in and then a line break, looking like this:

0ADE7C2CF97F75D009975F4D720D1FA6C19F4897:9
1B6453892473A467D07372D45EB05ABC2031647A:4
356A192B7913B04C54574D18C28D46E6395428AB:1
77DE68DAECD823BABBB58EDB1C8E14D7106E83BB:3
902BA3CDA1883801594B6E1B452790CC53948FDA:7
AC3478D69A3C81FA62E60F5C3696165A4E5E6AC4:5
B1D5781111D84F7B3FE45A0852E59758CD7A87E5:10
C1DFD96EEA8CC2B62785275BCA38AC261256E278:6
DA4B9237BACCCDF19C0760CAB7AEC4A8359010B0:2
FE5DBBCEA5CE7E2988B8C69BCFDFDE8904AABC1F:8

I was surprised I couldn't find a function to efficiently search a large file like this, general googling for how to best search a large file of ordered data turned up nothing ... and when i searched for methods to search the haveibeenpwned.com text file specifically all I got was a number of linear search methods which again surprised me. I'm probably just missing something and I'd love for anyone to share what they know of for doing this.

So mostly for fun I've created a function that I believe you could say is just a basic binary search and it is extremely fast and would be even on a file that was 22TB...

I ran in to difficulty related to the precision of moving the file pointer and how much data to read from the file at once but so far my solutions have worked and all my tests have returned perfect accuracy. I'd love if someone can give me feedback on it so I can make improvements and feel more confident in it's accuracy, I'm also trying to adapt it to work with other types of data, especially where the individual data points being searched through are variable length but i haven't figured that out yet. Test data (hashes of 1-10 ) can be copied from above but is also attached to the post (script expects testdata.txt).

#NoTrayIcon
#Region ;**** Directives created by AutoIt3Wrapper_GUI ****
#AutoIt3Wrapper_Change2CUI=y
#EndRegion ;**** Directives created by AutoIt3Wrapper_GUI ****
#include <FileConstants.au3>
#include <Crypt.au3>
Opt("TrayIconHide", 1)

; This is a test script of a way to do a binary search on a large file, specificly made for the file of sha1 hashes (hash ordered) provided by https://haveibeenpwned.com/Passwords
; The loop below provides an input to enter a password that will then be hashed and passed to the main function _FileBinarySearch where it will check pwned.txt for the hash
; You can also leave the input blank to loop through a set of test data which by default will check testdata.txt

$Message = ""

While 1
    $SearchText = InputBox("Pwned File Search",$Message&@CRLF&@CRLF&"What do you want to search for? (leave blank for test data)","")
    If @error Then
        Exitloop
    endif

    If $SearchText = "" Then ;loop through $TestSamples[] and do a search for each
        ConsoleWrite(@CRLF&"Testing")

        Local $TestSamples = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "0", "11"]
        $sFilePath = "testdata.txt"

        ;Local $TestSamples = ["passwd1", "passwd1234", "123456", "1234567", "12345678", "safdsadg3423547fgjh45", "sddsj439j23hkjh3gh4KJH3", "sdfsdkjh38saklj1jh7jdh"]
        ;$sFilePath = "pwned.txt"

        For $i = 0 To  UBound($TestSamples)-1
            $Message = $Message & _FileBinarySearch($sFilePath, StringTrimLeft(_Crypt_HashData($TestSamples[$i], $CALG_SHA1),2)) & ","
        next

    Else ;use the provided text and do a search
        $Message = _FileBinarySearch("pwned.txt", StringTrimLeft(_Crypt_HashData($SearchText, $CALG_SHA1),2))

    EndIf

Wend



Func _FileBinarySearch($sFilePath, $Search, $Delimiter = @CRLF)
    Local $Size = FileGetSize($sFilePath)
    Local $hFileOpen = FileOpen($sFilePath, $FO_READ)
    Local $RangeStart = 0
    Local $RangeEnd = $Size
    Local $EstimatedSearchSize = StringLen($Search) + StringLen($Delimiter) + 4
    Local $FileReadSize = $EstimatedSearchSize*2
    Local $Iterations = 1
    Local $IterationsLimit = Floor(Log($Size) / Log(2)) ;the maximum expected difficulty to search the given data
    Local $Return = False

    ;ConsoleWrite(@CRLF&@CRLF&"Search: "&$Search&" ReadSize: "&$FileReadSize&" IterationsLimit: "&$IterationsLimit)
    while 1
        ;Get pointer position for middle of given range offset by our read size
        $PointerPos = Floor(($RangeEnd - $RangeStart) / 2 + $RangeStart - ($FileReadSize / 2))
        If $PointerPos < 0 Then $PointerPos = 0

        ;Read the file
        FileSetPos ($hFileOpen, $PointerPos, 0)
        $Data = FileRead ($hFileOpen,$FileReadSize)
        If @extended < $FileReadSize Then $Data = $Data&$Delimiter ;end of file? add delimiter to make it easy
        If $PointerPos = 0 Then $Data = $Delimiter&$Data ;start of file? add delimiter to make it easy

        ;ConsoleWrite(@CRLF&@CRLF&"Raw Read: "&StringReplace(StringReplace($Data,@LF,"@"),@CR,"%"))

        ;Clean up the data we read
        ; an issue will arise in other sets of data with variable length where the data read might not contain delimiters
        $DataStart = StringInStr($Data,@CRLF) + StringLen($Delimiter)
        $Data = StringMid($Data, $DataStart, StringInStr($Data, $Delimiter, 0, 2) - $DataStart)

        ;Update pointer pos with precise start point of the data we are comparing
        $PointerPos = $PointerPos + $DataStart 

        ;ConsoleWrite(@CRLF&"Data: "&$Data&" Range: "&$RangeStart&"/"&$RangeEnd&" Pointer: "&$PointerPos&" Iterations: "&$Iterations)

        If StringInStr($Data, $Search) Then ;found the string string
            ConsoleWrite(@CRLF&"Match Found At "&$PointerPos)
            $Return = $PointerPos
            Exitloop

        ElseIf $RangeEnd - $RangeStart < $EstimatedSearchSize Then ;nothing left to search
            ConsoleWrite(@CRLF&"No Match")
            Exitloop

        ElseIf $Iterations > $IterationsLimit Then ;exceeded the expected difficulty
            ConsoleWrite(@CRLF&" Error With Too Many Iterations")
            SetError(1)
            Exitloop

        ElseIf $Data < $Search Then ;the found string is lesser
            ;ConsoleWrite(" Result: Under")
            $RangeStart = $PointerPos + StringLen($Data)

        ElseIf $Data > $Search Then ;the found string is greater
            ;ConsoleWrite(" Result: Over")
            $RangeEnd = $PointerPos

        EndIf

        $Iterations += 1
    WEnd

    FileClose($hFileOpen)
    Return $Return
EndFunc

 

pwned.au3

testdata.txt

Edited by JohnMC

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...