File existence checker

cmcb

Hi

I'm working on a script to do the following:

1) query a SQL database for a filename and id

2) store the results somewhere

3) use the filename obtained in 1 and check it exists

4) if file doesn't exist, take the id of that file and obtain the corresponding details

End product will be in the form of a GUI but struggling with basic functionality just now.

What's the best way to store the 2 database table columns (ID & Filename)? 2D array? How would I then use only the ID or only the filename? If 2D array is not the way to go, I'm open to suggestions. Think I'm OK with the SQL, using variables etc.

Appreciate any help with this. Thanks!! :mellow:

1) Get the files and IDs.

$file = @ScriptDir & "\outputfile.txt"
$hFile = FileOpen($file, 2) ; open once for writing (2 = overwrite existing content)

$conn = ObjCreate("ADODB.Connection")
$DSN = "DRIVER={SQL Server};SERVER=myserver;DATABASE=testdb;UID=myuser;PWD=mypass;"
$conn.Open($DSN)

$RS = ObjCreate("ADODB.RecordSet")
$RS.Open("SELECT ID, Filename 'file' FROM Table", $conn)

If $RS.RecordCount Then
    While Not $RS.EOF
        FileWriteLine($hFile, "\\myserver\myshare\" & $RS.Fields("file").value) ; just testing
        $RS.MoveNext
    WEnd
EndIf

$conn.close
FileClose($hFile)

#include <File.au3>

$file_to_read = @ScriptDir & "\outputfile.txt"
$file_to_write = @ScriptDir & "\docs_out.txt"

Dim $aRecords

If Not _FileReadToArray($file_to_read, $aRecords) Then
    MsgBox(4096, "Error", "Nothing found for specified date range!")
    Exit
EndIf
For $x = 1 To $aRecords[0]
    ;MsgBox(0, 'Record:' & $x, $aRecords[$x])
    If Not FileExists($aRecords[$x]) Then
        FileWriteLine($file_to_write, $aRecords[$x])
    EndIf
Next
Edited by cmcb

If you store the info in a simple AutoIt 2D array you would always have to loop through all values to find specific ones. I think what you should look for is an "associative array", for which no AutoIt standard exists. However, there are some UDFs for this in the Examples forum; also do a search for the 'Scripting.Dictionary' object, which is my personal favourite for such tasks.

Create two different associative arrays, one with the ID as the key and one with the filename as the key. This way you can poll the values with either array(id) = file or array(file) = id.
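
For reference, a minimal sketch of the Scripting.Dictionary approach (the ID and path below are made up for illustration):

$oDict = ObjCreate("Scripting.Dictionary")
$oDict.Add(42, "\\myserver\myshare\report.pdf") ; key = document ID, item = filename
If $oDict.Exists(42) Then ConsoleWrite($oDict.Item(42) & @CRLF) ; direct lookup by key, no looping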


Thanks KaFu. That was very helpful. Not had much time today, but I have queried my database and checked file existence. I write the missing filenames out to a text file and build an array of document IDs. Not sure if I need to write out the document IDs or just keep them in an array.

For $vKey In $oDictionary
    If Not FileExists($oDictionary.Item($vKey)) Then
        $filesFile = FileWriteLine($file_to_write2, $oDictionary.Item($vKey)) ; FileWriteLine appends its own @CRLF
        $bArray = $bArray + $vKey
        $docIDsFile = FileWriteLine($file_to_write1, $bArray)
    Else
        ;ConsoleWrite("exists" & $oDictionary.Item($vKey))
    EndIf
Next

The next bit is more complicated. I need to use the document IDs stored in $bArray as a variable in SQL.

SQL would be something like $RS.Open("SELECT * FROM TABLE WHERE DOCID IN '" & $bArray & "'", $conn)

I'm not sure how to use the array elements for this. Not sure if I need to store them with a comma either, to format them for SQL, i.e. IN '1,2,3'?

Hope that makes sense. Appreciate any feedback :mellow:


By $bArray you mean $string_missing_IDs? Maybe something like this:

$string_missing_IDs = ""

if not ... then

$string_missing_IDs &= $vKey & ","

And I guess the SQL call should be something like

if $string_missing_IDs then $string_missing_IDs = StringTrimRight($string_missing_IDs,1) ; to kill the trailing comma

$RS.Open("SELECT * FROM TABLE WHERE DOCID IN (" & $string_missing_IDs & ")", $conn)
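
Put together, the missing-ID collection might look roughly like this (a sketch, reusing the dictionary and connection from the earlier snippets):

$string_missing_IDs = ""
For $vKey In $oDictionary
    If Not FileExists($oDictionary.Item($vKey)) Then $string_missing_IDs &= $vKey & ","
Next
If $string_missing_IDs Then
    $string_missing_IDs = StringTrimRight($string_missing_IDs, 1) ; kill the trailing comma
    $RS.Open("SELECT * FROM TABLE WHERE DOCID IN (" & $string_missing_IDs & ")", $conn)
EndIf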

Edited by KaFu

Ah... it was the SQL part that was throwing the error. I hadn't handled the commas correctly so I had a trailing one. The StringTrimRight($string_missing_IDs,1) sorted it.

Fantastic, thanks :mellow:


Below is my code so far.

This will:

1) Get list of documents from DB

2) Check that they exist

3) Build array of document IDs

4) Get details of documents that don't exist

My main concern with this is speed. I've only tested with around 200 docs and that takes over 20 seconds. Going by the code below, are there any obvious performance tweaks? For instance, should I just be opening one SQL connection and re-using it, or am I right to open that second connection? Is there a more efficient method of counting? I'll be running this on a production environment using date filters; however, I may try a one-off search with no dates, at which point it could be looking at 100,000 files.

Thanks again.

; Declare variables
Global $vKey, $sItem, $sMsg, $oDictionary
Local $filesFile, $docIDsFile
Dim $string_missing_IDs
Local $missingCount, $totalCount

$missingCount = 0
$totalCount = 0

; Output filenames
;$file_to_write1 = @ScriptDir & "\my_docids.txt"
;$file_to_write2 = @ScriptDir & "\my_filenames.txt"
$docDetails = @ScriptDir & "\my_DocumentDetails.txt"

;Delete files if already present
;FileDelete($file_to_write1)
;FileDelete($file_to_write2)
FileDelete($DocDetails)

; Declare and create scripting dictionary
$oDictionary = ObjCreate("Scripting.Dictionary")

; begin first SQL connection - get documents and ids
$conn = ObjCreate( "ADODB.Connection" )
$DSN = "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;UID=testuser;PWD=testpass;"
$conn.Open($DSN)

;Create record set and run query
$RS = ObjCreate("ADODB.RecordSet")
$RS.Open("SELECT DocID 'docid', filename 'file' FROM mytable WHERE service = '1'", $conn)

If $RS.RecordCount Then
    While Not $RS.EOF
        $docid = $RS.Fields("DocID").value
        $totalCount = $totalCount + 1
        $filename = "\\myserver\c$\Interfaces\" & $RS.Fields("file").value

        ; Add to dictionary (key = DocID, item = full path)
        $oDictionary.Add($docid, $filename)
        $RS.MoveNext
    WEnd
EndIf

$conn.close
; end first SQL connection

For $vKey In $oDictionary
    $sItem &= $vKey & ": " & $oDictionary.Item($vKey) & @CRLF

    ; EXISTENCE CHECK - if missing, build doc id string for the second SQL query
    If Not FileExists($oDictionary.Item($vKey)) Then
        ;$filesFile = FileWriteLine($file_to_write2, $oDictionary.Item($vKey))
        $string_missing_IDs &= $vKey & ","
        $missingCount = $missingCount + 1
    Else
        ;ConsoleWrite("exists" & $oDictionary.Item($vKey))
    EndIf
Next

; begin 2nd SQL connection - get details for missing documents
$conn2 = ObjCreate("ADODB.Connection")
$DSN2 = "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;UID=testuser;PWD=testpass;"
$RS2 = ObjCreate("ADODB.RecordSet")
$conn2.Open($DSN2)

$string_missing_IDs = StringTrimRight($string_missing_IDs, 1) ; drop the trailing comma

$RS2.Open("SELECT docid 'DocID', name 'name', doctitle 'title', docsubject 'subject', docdate 'date' FROM mytable WHERE docid IN (" & $string_missing_IDs & ")", $conn2)

If $RS2.RecordCount Then
    While Not $RS2.EOF
        FileWrite($docDetails, $RS2.Fields("DocID").value & ", " & $RS2.Fields("name").value & ", " & $RS2.Fields("title").value & ", " & $RS2.Fields("subject").value & ", " & $RS2.Fields("date").value & @CRLF)
        $RS2.MoveNext
    WEnd
EndIf

$conn2.close

Msgbox(0, "Summary", "Total: " & $totalCount & @CRLF & "Missing: " & $missingCount)

Why not change the first query to "Select * From ...", so you'll have all your columns available, and then process your exceptions in one loop? Throw out the second and third loops, get rid of Scripting.Dictionary and the second query.

Edit: Something like:

$oADO = ObjCreate("ADODB.Connection")
If Not IsObj($oADO) Then MsgBox(1, "ADO Error", "Unable to connect to database")
$oADO.Open("DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;UID=testuser;PWD=testpass;") ; same DSN as before
$oRS = ObjCreate("ADODB.RecordSet")
$oRS.Open("SELECT * FROM mytable WHERE service = '1'", $oADO)
While Not $oRS.EOF
    $totalCount += 1
    $filename = $oRS("filename").value
    If Not FileExists("\\myserver\c$\Interfaces\" & $filename) Then
        FileWriteLine($docDetails, $oRS("DocID").value & ", " & $filename & ", " & $oRS("title").value & ", " & $oRS("subject").value & ", " & $oRS("date").value)
    EndIf
    $oRS.MoveNext
WEnd
$oRS.close
$oADO.close
Edited by Spiff59

Thanks for your suggestion Spiff59. I see your point here, but I just checked and for a particular service I have nearly 2.5 million documents! Looks like I missed a 0 from my initial estimate.

Select * FROM table WHERE service = '1' would return all the documents as well as their details, which I would not necessarily need. That's why I'm going back to the database to get details for only the missing documents.

If anything, would this not slow me down? :mellow:



I don't know how many total columns are in your table; if most are used in the $docDetails report then I'd think "Select *" is fine. If there are a bunch of fields in the table that you're not using, then specifying just the ones you want in the Select would probably be a better idea.

I do know that queries that use the "IN" operator are notoriously slow.

It's not a lot of code to play with; maybe try running a benchmark both ways?
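
A rough benchmark only needs AutoIt's built-in timer around each variant, something like:

$timer = TimerInit()
; ... run one variant of the script here ...
ConsoleWrite("Elapsed: " & Round(TimerDiff($timer) / 1000, 2) & " seconds" & @CRLF)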



Interesting Spiff59. I removed the second query and grabbed the document details at the same time as the ID/filename. Original app took 16 seconds, the second attempt took 7 seconds.

I'll try and get a test server / database setup and see how this runs against a very large amount of data.


I've tested this against a fair number of files (40,000). It took 7 minutes, while the SQL part alone takes only 7 seconds via Management Studio. I then tried against 160,000 files; the SQL took 29 seconds. My SQL execution plan suggests the query is as optimised as possible, using indexes and so on.

I think the part taking the longest is either checking existence or writing details to file. On a live system there will be few (if any) files missing, so this might be faster. Perhaps there are quicker means of either 1) the existence check or 2) the file writing that I could consider.

I'm keen to optimise if possible. Any advice or suggestions appreciated. Thanks! :mellow:

Here's my code (I just call the function using a GUI).

Func myfunc()

Local $filesFile, $docIDsFile
Global $missingCount, $totalCount
Global $startDate, $endDate

$missingCount = 0
$totalCount = 0

$docDetails = @ScriptDir & "\DocumentDetails.txt"
$Summary = @ScriptDir & "\Summary.txt"
FileDelete($DocDetails)
FileDelete($Summary)

; begin first SQL connection - get documents and ids
$conn = ObjCreate( "ADODB.Connection" )
$DSN = "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;UID=testuser;PWD=testpass;"
$conn.Open($DSN)

;Create record set and run query
$RS = ObjCreate("ADODB.RecordSet")
$RS.Open("SELECT DocID 'docid', filename 'file', title 'title', subject 'subject', date 'date', name 'name'  FROM mytable WHERE service = '1'", $conn)

If $RS.RecordCount Then
    While Not $RS.EOF
            $totalCount += 1
            $filename = "\\myserver\c$\Interfaces\" & $RS.Fields("file").value
    
            If Not FileExists($filename) Then
                $missingcount += 1
                FileWrite($docDetails, $RS.Fields("DocID").value & ", " & $RS.Fields("name").value & ", " & $RS.Fields("title").value & ", " & $RS.Fields("subject").value & ", " & $RS.Fields("date").value & @CRLF)
            EndIf

            $RS.MoveNext
        WEnd
EndIf


$conn.close

FileWrite($Summary, "Total: " & $totalCount & @CRLF & "Missing: " & $missingCount)

EndFunc
Edited by cmcb

Maybe this is a little faster?

Local $sBuffer_docDetails
While Not $RS.EOF
    $totalCount += 1
    If Not FileExists("\\myserver\c$\Interfaces\" & $RS.Fields("file" ).value) Then
        $missingcount += 1
        $sBuffer_docDetails &= $RS.Fields("DocID" ).value & ", " & $RS.Fields("name" ).value & ", " & $RS.Fields("title" ).value & ", " & $RS.Fields("subject" ).value & ", " & $RS.Fields("date" ).value & @CRLF
    EndIf
    $RS.MoveNext
WEnd
FileWrite($docDetails, $sBuffer_docDetails)

Not speed related, but you might like to use* "Static Local" for those "Global ..." vars (if you're not using those vars in other parts of your code).

*and test, as Static still has an experimental status. (It's working pretty well though.)
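
For illustration, a minimal sketch of what that could look like (only worthwhile if the counts aren't read elsewhere, as noted):

Func myfunc()
    Static Local $missingCount = 0, $totalCount = 0 ; function-scoped; the initial assignment only runs on the first call
    ; ... rest of the function unchanged ...
EndFunc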

Edited by iEvKI3gv9Wrkd41u

Argh! KaFu beat me to it! :mellow:

But, on another note... Your FileWrite was opening, appending to, and closing the file each time you added a line. Not good practice. As shown in the help file, you can pass FileWrite() and most I/O routines either a filename or a file handle. For multiple actions against a file, using the handle method (without all the automatic opens and closes) is much faster. You should open the file once up-front to get a file handle, use that handle in the read/write functions, and at the end close the file (again by referencing the handle).
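
A minimal sketch of the handle approach (the output filename is borrowed from your earlier post):

$hDetails = FileOpen(@ScriptDir & "\DocumentDetails.txt", 2) ; open once for writing (2 = overwrite)
If $hDetails = -1 Then Exit ; bail out if the file could not be opened
; ... inside the record loop ...
FileWriteLine($hDetails, $RS.Fields("DocID").value & ", " & $RS.Fields("name").value)
; ... after the loop ...
FileClose($hDetails)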

Alternately, it might be quicker to just build a huge string($Str &= ...) in the loop, and dump it to a file all at once at the end.

PS - There may be sneaky things you can do to save time in the FileExists() area. Are all the files being processed in the one "\\myserver\c$\Interfaces\" folder, with nothing in subfolders? What are the formats of the filenames being processed?

Edited by Spiff59

Ahh, I get ya. I understand the file write suggestion. Makes perfect sense, so I'll get that implemented ASAP.

I ran the app on my file server and performance was much improved using a local filepath instead of a UNC path.

The files are stored under year -> month -> day subfolders. Perhaps that rules out your next suggestion? :-0

Thanks again for the input folks. :-)



I declare the counts as Globals here and then use them in my main GUI file.

Thanks.


There is a sneaky way to gain a lot of speed by discarding all of the FileExists() calls. It's playing dirty, and wouldn't be allowed in a shop that enforced standards, but it does work, and is very fast.

An adaptation of a Yashied trick:

#include <File.au3>

Global $FileList, $SearchList[101] = [100]

For $x = 1 to 5000 ; Create test files
    $str = StringRight("0000" & $x, 4)
    FileWrite("test" & $str & ".log", "")
Next
$FileList = _FileListToArray(@ScriptDir, "test*.log"); Load files into array
 
For $x = 1 to UBound($SearchList) - 1 ; Create search list
    $rnd = Random(1,$FileList[0],1)
    $SearchList[$x] = $FileList[$rnd]
Next
$SearchList[25] = "xyzzy" ; insert bad apples
$SearchList[50] = "plugh"
$SearchList[75] = "plover"

$result1 = Traditional() ; Using FileExists()
$result2 = Sneaky() ; Using Assign()/IsDeclared()
MsgBox(1, "5000 files, 500000 searches", $result1 & @CRLF & $result2)

FileDelete("test*.log") ; delete test files
Exit

;===================================================================================================================================
Func Traditional()
    $timer = TimerInit()
    For $x = 1 to 5000
        $cnt = 0
        For $y = 1 to $SearchList[0]
;           If Not FileExists($SearchList[$y]) Then ConsoleWrite("T > $SearchList[" & $y & "]: " & $SearchList[$y] & " not found" & @CRLF)
            If Not FileExists($SearchList[$y]) Then $cnt += 1
        Next
    Next
    Return "T > " & $cnt & " files missing, found in " & Round(TimerDiff($timer)/1000, 2) & " seconds"
EndFunc

;===================================================================================================================================
Func Sneaky()
    $timer = TimerInit()
    $FileList = _FileListToArray(@ScriptDir, "test*.log"); Load files into array (again, to make benchmark more fair)
    For $x = 1 to $FileList[0]
        Assign($FileList[$x], 0 ,1) ; Create variables from array
    Next
    For $x = 1 to 5000
        $cnt = 0
        For $y = 1 to $SearchList[0] ;  search for missing files
;           If Not IsDeclared($SearchList[$y]) Then ConsoleWrite("S > $SearchList[" & $y & "]: " & $SearchList[$y] & " not found" & @CRLF)
            If Not IsDeclared($SearchList[$y]) Then $cnt += 1
        Next
    Next
    Return "S > " & $cnt & " files missing, found in " & Round(TimerDiff($timer)/1000, 2) & " seconds"
EndFunc

If not for (disk) caching I'm betting the FileExists() method would be miserably slower, but still a 350% performance gain ain't too bad! Besides being "dirty" code, it does have the requirement that the filenames be fairly standard, as filenames containing certain special characters will blow the Assign() statement out of the water. But if you are dealing with a million searches, and your filenames are vanilla, then I might consider throwing standards out the window.

Edit: From another test it appears the performance gap widens considerably (in favor of the "cheating" method) as the number of files and searches increase.

Edited by Spiff59

I don't think it's sneaky/cheating. If the work situation allows for it, it's just a potentially fast alternative.

ComSpec with 'dir /b' comes to mind as another one that might shave some time off.

- With the (assumed) benefit (over AutoIt) that any consecutive dir calls for the same location (if needed) are lightning fast due to Windows cache use.

+ If the dir-call return data is used as-is (a string block), no array is needed either.
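
A sketch of that idea, capturing the listing once and then checking names against it (the share path is reused from the earlier posts, the filename is made up):

#include <Constants.au3>

$iPid = Run(@ComSpec & ' /c dir /b /s "\\myserver\c$\Interfaces"', "", @SW_HIDE, $STDOUT_CHILD)
$sListing = ""
While 1
    $sListing &= StdoutRead($iPid)
    If @error Then ExitLoop
WEnd

; any filename present in the captured listing is assumed to exist
If Not StringInStr($sListing, "somefile.pdf") Then ConsoleWrite("missing: somefile.pdf" & @CRLF)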
