
Find and count occurrences of a string in many files at once



Hi guys :)

I'm in need of a script that will read all files in a folder and look for a string in those files; the files will have multiple lines.

At the end, the script will display how many times the string was found across all files.

There will be around 300 files, and the expected number of string occurrences across those files is around 300-400k.

I did write some code for this, and it works in a test environment with a couple of files that have a couple of lines each.

What I don't know is how efficient this will be in the case mentioned earlier.

So my request here is for you guys to look at my code and let me know whether this will be OK,

or whether there are other, more efficient ways to do this.

Here's the code:

#include <File.au3>
#include <Array.au3>

Global $bArray, $found = 0, $stringToLookFor = "PB11"

look4stringInManyFiles()

Func look4stringInManyFiles()

    $where2look4files = "C:\temp\test\"
    
    $aArray = _FileListToArray($where2look4files, "*.txt", 0, True)

    For $i = 1 To UBound($aArray) - 1
        $path2file = $aArray[$i]
        _FileReadToArray($path2file, $bArray)
        For $a = 0 To UBound($bArray) - 1
            If StringInStr($bArray[$a], $stringToLookFor) Then $found = $found + 1
        Next
    Next

    MsgBox(0, "", $found&" instances of the string were found in all files")

EndFunc   ;==>look4stringInManyFiles

Thanks! :)


@BigDaddyO thanks for the suggestion.

I ran both versions of the code a couple of times on my small-scale test, and your version actually always took longer to complete.

Arrays did it in around 1.7 where opening and closing files needed around 2.4.

I don't want to be a smartass, as I'm here asking for help, but this makes me think that my initial approach will be better in this scenario. Anyone?


Why don't you just use one of the numerous GREP-for-Windows command-line programs? Most have a switch that will provide just a count of matches. That would be much faster than any grep-like logic that you could create in AutoIt.
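For example, with a grep port (the exact binary name varies by port; the string "PB11" and the .txt pattern are just the values from this thread), a per-occurrence count across all files looks roughly like this:

```shell
# Count every occurrence of the string across all .txt files in the folder.
# -o prints each match on its own line; -i ignores case, matching the
# default (case-insensitive) behaviour of AutoIt's StringInStr().
grep -o -i "PB11" *.txt | wc -l
```

Note that `grep -c` counts matching *lines*, not occurrences, so the `-o | wc -l` combination is what reproduces a per-occurrence total.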


Hi sakej
It seems that a RegExp approach gives a faster result.
But one has to be very careful when launching several tests one after the other, as the file cache may still be filled with the previous test's data.

 

#include <File.au3>
#include <StringConstants.au3> ; needed for $STR_REGEXPARRAYGLOBALMATCH

Global $bArray, $found = 0, $stringToLookFor = "PB11"
look4stringInManyFiles()

Func look4stringInManyFiles()
    $where2look4files = "C:\temp\test\"
    $aArray = _FileListToArray($where2look4files, "*.txt", 1, True)

    For $i = 1 To UBound($aArray) - 1
        $path2file = $aArray[$i]

;~      _FileReadToArray($path2file, $bArray)
;~      For $a = 0 To UBound($bArray) - 1
;~          If StringInStr($bArray[$a], $stringToLookFor) Then $found = $found + 1
;~      Next

        $sFileContent = FileRead($path2file)
        $cArray = StringRegExp($sFileContent, '(?i)' & $stringToLookFor, $STR_REGEXPARRAYGLOBALMATCH)
        If @error = 0 Then $found = $found + UBound($cArray)

    Next

    MsgBox(0, "", $found & " instances of the string were found in all files")
EndFunc   ;==>look4stringInManyFiles

 

Some remarks:
=> (?i) in the RegExp makes the matching case-insensitive (to match your StringInStr() call, which is case-insensitive by default)

=> changed one parameter from 0 to 1 in _FileListToArray() to return files only, not files + folders

=> in case the RegExp way brings a few more results: the explanation should be that "PB11" was found more than once on a line, where StringInStr() ignored a 2nd occurrence of "PB11" in the same line.
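That per-line vs per-occurrence difference can be reproduced with grep on a throwaway file (the sample contents below are made up for illustration):

```shell
# A line containing "PB11" twice: per-line counting sees 1 match,
# per-occurrence counting sees 2.
printf 'PB11 and PB11 again\nno match\n' > sample.txt
grep -c "PB11" sample.txt           # counts matching lines  -> 1
grep -o "PB11" sample.txt | wc -l   # counts occurrences     -> 2
```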

One should probably add timers to the script and launch it (with, then without, the RegExp way) at different times of the day, but certainly not test both ways one right after the other.

Good luck :)

 


Try this version; I think it should be fast:

...

$stringToLookFor = StringUpper($stringToLookFor)

For $i = 1 to UBound($aArray) - 1
    $hFile = FileOpen($aArray[$i], 0) ; mode=read
    $sContent = StringUpper(FileRead($hFile))
    FileClose($hFile)
    StringReplace($sContent, $stringToLookFor, "", 0, 1) ; casesense=1
    $found += @extended
Next

 


15 hours ago, sakej said:

@BigDaddyO thanks for the suggestion.

I ran both versions of the code a couple of times on my small-scale test, and your version actually always took longer to complete.

Arrays did it in around 1.7 where opening and closing files needed around 2.4.

I don't want to be a smartass, as I'm here asking for help, but this makes me think that my initial approach will be better in this scenario. Anyone?

As the help file says, you will only see the improvement from doing a FileOpen on larger files. Perhaps you should try testing on a few real files if you can. I assumed they were large, as you expect 300-400k finds per file.
