Suggestions for searching large datasets


snap

Hello,

I am new here, but I have been using AutoIt very successfully for automation and testing at the company I work for. First, I would like to thank the developers and community for all the hard work they have put into making such a useful automation language. I am moderately familiar with programming, but no expert, so I joined these forums to seek advice from those with more experience and knowledge. Currently I am facing an issue concerning rather large sets of data, and querying for information from those sets.

For example, I have a set of data that contains ~8,000 rows with 6 columns in each row, so it is basically an 8000x6 table. The first column is sorted, so I could use a binary search to quickly locate a row and then read the data from its columns. I need to perform this search nearly as many times as there are rows (~8,000). My first thought was an Excel file, since a spreadsheet would hold this information nicely; however, I also have a case with over 1,000,000 rows and 6 columns that I need to search over 1,000,000 times, and I am worried that an Excel file might be slow. So I am looking for other options, and I am asking for the advice of anyone here willing to give an opinion.

Here is an example of what the "format" of the data looks like:

File: Export Test - 510.csv

MD5 Hash|File Name|File Path|File Size|File Extension|Folder Structure
File1|
File2|
.
.
.
FileN

Thanks for your wisdom and time,

Edit: Another thought that has just now occurred to me is creating many smaller arrays, each array basically being a folder and containing information for the files in that folder.


First, welcome to the AutoIt community.

About your question: definitely search the forum for SQLite examples!

Thanks for the suggestion. I have been playing around with SQLite for the last couple of days and I think it will work for what I need. Keep in mind that I knew nothing about SQL before this adventure. In any case, I want to post a snippet of what I have so far, if you wouldn't mind reviewing it. I apologize in advance for any code conventions I'm not following. Please comment and tell me if I could make any improvements or if I'm doing something wrong.

#include <SQLite.au3>

Func _CreateSQLdatabase($db_path, $file_path, $table_name)

    _SQLite_Startup()
    _SQLite_Open($db_path)
    _SQLite_Exec(-1, "BEGIN;") ; one transaction around all inserts is much faster

    Local $file = FileOpen($file_path)
    Local $string

    _SQLite_Exec(-1, "CREATE TABLE " & $table_name & " (key INTEGER PRIMARY KEY AUTOINCREMENT, name, path, ext, size, tree, md5);")
    _SQLite_Exec(-1, "CREATE INDEX ix1 ON " & $table_name & " (tree);")
    _SQLite_Exec(-1, "CREATE INDEX ix2 ON " & $table_name & " (md5);")

    While 1
        $string = _FileReadLine($file)

        If @error Then ExitLoop

        ToolTip("(" & $string & ")", 0, 0, "Adding to Database:")
        _SQLite_Exec(-1, "INSERT INTO " & $table_name & " (name, path, ext, size, tree, md5) VALUES (" & $string & ");")

        If @error Then MsgBox(0, "Insert failed", "INSERT INTO " & $table_name & " VALUES (" & $string & ");")

    WEnd

    FileClose($file)

    _SQLite_Exec(-1, "COMMIT;")
    _SQLite_Close()
    _SQLite_Shutdown()

EndFunc

Func _QueryDatabase($db_path, $table_name, $search_term, $column, $search_term2 = "", $column2 = "")

    Local $aResult, $iRow, $iColumn
    Local $return_value[3] = ["", "", ""]

    _SQLite_Startup()
    _SQLite_Open($db_path)

    If ($search_term2 == "" And $column2 == "") Then
        _SQLite_GetTable2d(-1, "SELECT * FROM " & $table_name & " WHERE " & $column & " = " & _SQLite_Escape($search_term) & ";", $aResult, $iRow, $iColumn)
    Else
        _SQLite_GetTable2d(-1, "SELECT * FROM " & $table_name & " WHERE " & $column & " = " & _SQLite_Escape($search_term) & " AND " & $column2 & " = " & _SQLite_Escape($search_term2) & ";", $aResult, $iRow, $iColumn)
    EndIf

    _SQLite_Close()
    _SQLite_Shutdown()

    $return_value[0] = $iRow
    $return_value[1] = $iColumn
    $return_value[2] = $aResult

    Return $return_value

EndFunc
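For reference, a call would look something like this (the database path, table name, and hash below are just placeholders):

```autoit
; Look up a record by MD5 (placeholder values throughout).
Local $aRet = _QueryDatabase(@ScriptDir & "\files.db", "files", "0123456789abcdef0123456789abcdef", "md5")
Local $iRows = $aRet[0]  ; number of data rows found
Local $iCols = $aRet[1]  ; number of columns
Local $aTable = $aRet[2] ; 2D result array; row 0 holds the column names
If $iRows > 0 Then MsgBox(0, "Match", "First matching file name: " & $aTable[1][1])
```

(The inner 2D array has to be copied out of `$aRet[2]` into its own variable before it can be indexed.)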

Func _FileReadLine($file_handle)

    ; Reads one tab-delimited 6-field record and returns it as an
    ; escaped, comma-separated list ready to drop into an INSERT.
    Local $array[6] = ["", "", "", "", "", ""]
    Local $buffer = "", $temp
    Local $count = 0

    While $count < 6
        $temp = FileRead($file_handle, 1)

        If @error = -1 Then Return SetError(-1, 0, "End of File")

        If $temp == @TAB Then
            $array[$count] = $buffer
            $count += 1
            $buffer = ""
        ElseIf $temp == @CR Or $temp == @LF Then
            ; skip line-break characters between records
        Else
            $buffer &= $temp
        EndIf
    WEnd

    Return _SQLite_Escape($array[0]) & ", " & _SQLite_Escape($array[1]) & ", " & _SQLite_Escape($array[2]) & ", " & _SQLite_Escape($array[3]) & ", " & _SQLite_Escape($array[4]) & ", " & _SQLite_Escape($array[5])

EndFunc

One thing that'll kill you is calling _SQLite_Startup(), _SQLite_Open(), _SQLite_Close() and _SQLite_Shutdown() repeatedly.

Just call startup once at the beginning of your script or when your script is about to access the database, and call shutdown once when you're finished. The open and close statements would also only need to be executed once for each database accessed.
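Something like this, structurally (a sketch only; the variable names are just for illustration):

```autoit
#include <SQLite.au3>

_SQLite_Startup()                    ; once, at script start
Local $hDB = _SQLite_Open($db_path)  ; once per database

; ... run all of your queries here against the open handle ...

_SQLite_Close($hDB)                  ; once, when finished with this database
_SQLite_Shutdown()                   ; once, at script end
```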

Just curious...

Is this a "throw-away" database? You're creating it temporarily just to enable one-time processing, after which it will be deleted?

How and when is the flat file from which you're loading the database built?


I build the database from some .txt files that I have already created. These .txt files contain the information I need to verify that my program is exporting files correctly. I want the database built in advance (and portable), so I can easily take it from one machine to the next without having to rebuild it each time I run the script. So to answer your question, no, it is not a "throw-away" database. (Some of these .txt files are quite large, and building the database each time would be very time-consuming.)


It sounds like portability is an issue...

You're going to install SQLite on each of these machines?

I don't think AutoIt has any problem with 1,000,000-element arrays. I'm not sure the overhead involved in using SQLite is worthwhile. It might be easier/faster/cleaner to just make your own pseudo-database files to carry around.

Your "file creation" script could create your main file with the 6 fields delimited by a comma or semicolon like: "path;name;ext;size;date;md5", and you could also build additional presorted files for your needed search indexes at the same time. Something like:

#include <File.au3>
#include <Array.au3>

Global $tmp
_FileReadToArray("Main.dat", $tmp) ; load main file (if necessary); $tmp[0] = line count
Global $aIdxName[$tmp[0] + 1][2]
Global $aIdxMD5[$tmp[0] + 1][2]
For $x = 1 To $tmp[0] ; load index arrays
    $filedata = StringSplit($tmp[$x], ";", 2) ; flag 2 = 0-based result array
    $aIdxName[$x][0] = $filedata[1] ; "name" field of path;name;ext;size;date;md5
    $aIdxName[$x][1] = $x
    $aIdxMD5[$x][0] = $filedata[5] ; "md5" field
    $aIdxMD5[$x][1] = $x
Next
_ArraySort($aIdxName, 0, 1)
_ArraySort($aIdxMD5, 0, 1)
$IdxName = FileOpen("Name.idx", 2) ; mode 2 = erase existing file and write
$IdxMD5 = FileOpen("MD5.idx", 2)
For $x = 1 To $tmp[0] ; dump index arrays to disk
    FileWriteLine($IdxName, $aIdxName[$x][0] & ";" & $aIdxName[$x][1])
    FileWriteLine($IdxMD5, $aIdxMD5[$x][0] & ";" & $aIdxMD5[$x][1])
Next
FileClose($IdxName)
FileClose($IdxMD5)
(Except this example loads the main data from a file, which you wouldn't need to do if this were the same script that creates that main file; you could create the indices as you create the main file.)

Then you could use a second compiled script, on say a flash drive, to load the prebuilt main and index files into arrays and then do your processing fairly efficiently against those. The indices would at least already be built and sorted.

In this second "processing" script, you would need to use one of the 2D versions of the binary array search function to search the index on column 1 and return the key from column 2. BrewManNH just posted one a couple days ago in the Examples forum.
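A minimal sketch of that kind of search (not BrewManNH's code; it assumes the index array is sorted ascending on column 0 and that the comparison matches the sort order, here case-insensitive via StringCompare):

```autoit
; Binary-search a 2D index array sorted on column 0.
; Returns the row number stored in column 1, or -1 (with @error set) if not found.
Func _BinarySearch2D(ByRef $aIdx, $sKey)
    Local $lo = 1, $hi = UBound($aIdx) - 1, $mid, $cmp
    While $lo <= $hi
        $mid = Int(($lo + $hi) / 2)
        $cmp = StringCompare($aIdx[$mid][0], $sKey)
        If $cmp = 0 Then
            Return $aIdx[$mid][1] ; key back into the main array
        ElseIf $cmp < 0 Then
            $lo = $mid + 1
        Else
            $hi = $mid - 1
        EndIf
    WEnd
    Return SetError(1, 0, -1)
EndFunc
```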

Edit: Plan B.


@snap,

Yes, you can have an SQLite DB either in memory or on disk, and back up either form into the other (even while it's being used!). Search for SQLite backup in the Examples forum.

SQLite DBs are single files which are 100% binary portable on _any_ system able to compile SQLite, from your smartphone to datacenters.

You can FileInstall your DB (this makes the DB travel along with the AutoIt executable).
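For instance (the file names here are placeholders):

```autoit
; At compile time, FileInstall embeds files.db inside the compiled executable;
; at run time it extracts a copy next to the script (flag 1 = overwrite existing).
FileInstall("files.db", @ScriptDir & "\files.db", 1)
```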

SQLite doesn't need any installation, administration or maintenance.

Using a third party SQLite manager (SQLite Expert is my all-time favorite) will save you days and make early/advanced experiments easy.

If you need help getting up to speed with SQLite, just ask.



MD5 Hash|File Name|File Path|File Size|File Extension|Folder Structure

Just wondering: what data does 'Folder Structure' contain? (Assuming "File Path" is for something like '\dir1\dir2' or 'x:\dir1\dir2'.)

