Jump to content

Comparing Large Files using SQLite (Not sure I'm doing this right)


Go to solution Solved by jchd,

Recommended Posts

Hi all!

I have two files. One of which, I read into a SQLite database and the other I read line by line and search the SQLite database for the lines from the file. It works, but it's really slow, so I'm wondering if it's my process or just the nature of the beast.

File2 (FSTEST.dat) is 8mb and File1 (ESTEST.dat) is 1.5mb. These are the smaller of the sets of files (the other sets can be upwards of 50mb or so).

Is there a way to make this run faster? Thanks for the help as always!

#include <File.au3>
#include <SQLite.au3>
#include <SQLite.dll.au3>
Global $hQuery, $aRow, $fl_array2
Global $sTempDB = @ScriptDir & "\TempDB.db"
Global $iTotalCount = 0, $iTotalSoFar = 0
Global $iKey1_a = 1
Global $iKey1_b = 12
Global $iKey2_a = 18
Global $iKey2_b = 25
Global $st = TimerInit()
Global $iOrphans, $iDupes, $iMatches

; ---------------------------------------------
; Start SQLite
; ---------------------------------------------
_SQLite_Startup()
If @error Then
 ConsoleWrite('error loading sqlite.dll' & @LF)
 Exit
EndIf
If FileExists($sTempDB) Then FileDelete($sTempDB)
_SQLite_Open($sTempDB)
If @error Then
 ConsoleWrite('Unable to open DB' & @LF)
 _Exit()
EndIf
; ---------------------------------------------
; Get the files to compare
; ---------------------------------------------
$sFile1 = FileOpenDialog("Select EXTEST.DAT", @ScriptDir, "Dat files (*.dat)", 1 + 2, "")
If @error Then _Exit()
$sFile2 = FileOpenDialog("Select FSTEST.DAT", @ScriptDir, "Dat files (*.dat)", 1 + 2, "")
If @error Then _Exit()
; ---------------------------------------------
; Load the data from File2 into the db
; ---------------------------------------------
_ReLoad()
; ---------------------------------------------
; Compare the fields from File1 to those in File2
; ---------------------------------------------
CompareItems($sFile1)
; ---------------------------------------------
Func CompareItems($sFileContents)
 Dim $aSplit
 _FileReadToArray($sFileContents, $aSplit)
 If IsArray($aSplit) Then
  If $aSplit[1] = "" Then
   _ArrayDelete($aSplit, 1)
   $aSplit[0] -= 1
  EndIf
  $iTotalCount += $aSplit[0]
  For $i = 1 To $aSplit[0]
   $sPart1 = StringMid($aSplit[$i], $iKey1_a, ($iKey1_b - $iKey1_a) + 1)
   $sPart2 = StringMid($aSplit[$i], $iKey2_a, ($iKey2_b - $iKey2_a) + 1)
   If $sPart1 <> "" Then
    $iCount = 0
    _SQLite_Query(-1, 'select * from compare where PART1 = "' & $sPart1 & '" and PART2 = "' & $sPart2 & '";', $hQuery)
    While _SQLite_FetchData($hQuery, $aRow) = $SQLite_OK
     If $aRow[0] <> "" Then $iCount += 1
    WEnd
    ConsoleWrite($sPart1 & " $iCount = " & $iCount & @CRLF)
    If $iCount = 0 And $i <> $aSplit[0] Then
     $iOrphans += 1
    EndIf
    If $iCount = 1 Then
     $iMatches += 1
    EndIf
    If $iCount > 1 Then
     $iDupes += 1
    EndIf
   EndIf
   $iTotalSoFar += 1
  Next
  ConsoleWrite("---------------------------" & @CRLF)
  ConsoleWrite("Orphans: " & $iOrphans & @CRLF)
  ConsoleWrite("Matches: " & $iMatches & @CRLF)
  ConsoleWrite("Dupes: " & $iDupes & @CRLF)
  ConsoleWrite("Total: " & $iTotalCount & @CRLF)
 EndIf
EndFunc   ;==>CompareItems
Func _ReLoad()
 ; drop "parts" table if it exists, re-define and reload it
 If _SQLite_Exec(-1, 'drop table if exists compare;') <> $SQLite_OK Then
  ConsoleWrite('Drop table failed' & @LF)
  _Exit()
 Else
  ConsoleWrite('Compare table dropped for refresh' & @LF)
 EndIf
 If _SQLite_Exec(-1, 'create table compare (PART1, PART2);') <> $SQLite_OK Then
  ConsoleWrite('Create Table Failed' & @LF)
  _Exit()
 EndIf
 _FileReadToArray($sFile2, $fl_array2)
 Switch @error
  Case 1
   ConsoleWrite('Input file failed to open' & @LF)
   _Exit()
  Case 2
   ConsoleWrite('Unable to split file' & @LF)
   _Exit()
 EndSwitch
 Local $aLine, $sql
 ProgressOn('Loading Compare Table', 'Please Wait', '', Default, Default, 2 + 16); 2 - without always on top / 16 - Window can be moved
 _SQLite_Exec(-1, "begin immediate;")
 For $1 = 1 To $fl_array2[0]
  ProgressSet(($1 / $fl_array2[0]) * 100)
  $sql = 'insert into compare values ("' & _
    StringMid($fl_array2[$1], $iKey1_a, ($iKey1_b - $iKey1_a) + 1) & '", "' & _
    StringMid($fl_array2[$1], $iKey2_a, ($iKey2_b - $iKey2_a) + 1) & '");'
  If _SQLite_Exec(-1, $sql) <> $SQLite_OK Then
   ConsoleWrite('Table insert failed STMT = ' & $sql & @LF)
   _Exit()
  EndIf
 Next
 _SQLite_Exec(-1, "commit;")
 ProgressOff()
 ConsoleWrite('Table loaded with ' & UBound($fl_array2) - 1 & ' records in ' & Round(TimerDiff($st) / 1000, 3) & ' seconds' & @LF)
EndFunc   ;==>_ReLoad
Func _Exit()
 _SQLite_Close()
 _SQLite_Shutdown()
 Exit
EndFunc   ;==>_Exit

(I don't know why the code box always loses my tab structure...weird)

Test Files.zip

Edited by buymeapc
Link to comment
Share on other sites

Is that possible with AutoIt? I wouldn't even know how to begin to create a CRC.

My ultimate goal of this script is to read each line of File1 and see if characters 1 - 12 exist in File2. Pretty simple, but the amount of data is huge which is making this challenging.

Link to comment
Share on other sites

Well 50 MiB is far from huge by today's standards.

First, you should escape (using _SQLite_FastEscape()) string in SQLite commands.

Then a quick search will turn out very similar topics, like:

Some examples will do what you want directly, some after minor changes.

Store only the head of strings in the DB if you need to compare of chars 1..12.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

Ok, here's an odd thing I noticed.

I changed my script to use an autoincrementing primary key, which is great and all. But why would these two 'create' function make a huge speed difference in querying files?? When using the lower one, querying is blazing fast, while the top one runs incredibly slow. Why is that?

If _SQLite_Exec(-1, 'CREATE TABLE Strings (id integer primary key autoincrement, String1 CHAR CONSTRAINT ksString1 NOT NULL, String2 CHAR CONSTRAINT ksString2 NOT NULL);') <> $SQLite_OK Then ; SLOW (about 2 queries/second)
If _SQLite_Exec(-1, 'CREATE TABLE Strings (id integer primary key autoincrement, String1 CHAR CONSTRAINT ksString1 UNIQUE ON CONFLICT IGNORE, String2);') <> $SQLite_OK Then ; VERY FAST (about 100 queries/second)
Link to comment
Share on other sites

The second schema uses a unique index on ksString1 while the first doesn't have any index on ksString1.

That will make a very significant difference in speed in queries involving this column in some way. I bet this is the effect you notice.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

Ah, ok. That makes sense. Is there any way to keep the speed without using the unique constraint since it's possible to have an item more than once? Well, to be technical, the first string could use the unique constraint, but the second string might have multiple values associated with the first.

How would I be able to accomodate that?

Link to comment
Share on other sites

  • Solution

Either REPLACE (alias for INSERT OR REPLACE) or INSERT OR IGNORE could do the trick.

Anyway be sure to enclose your insert loop(s) within a transaction (BEGIN; ... COMMIT;) to speed up bulk inserts.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

Thanks for the help! With your suggestions, I used the following line and I got my speed back. It works much better now.

If _SQLite_Exec(-1, "CREATE TABLE Strings (String1 TEXT, String2 TEXT, PRIMARY KEY (String1, String2) ON CONFLICT IGNORE);") <> $SQLite_OK Then
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...