jurco

Help with a searching machine

16 posts in this topic

Friends:

The work to be performed is as follows: I have a huge text file (around 600 pages), and I have to find some "strings" in it and copy each complete matching line to a new text file.

Right now I do it with two Notepad windows: I use Notepad's Find feature in one and copy the results to the other, all by emulating keystrokes. It works fairly well, but (there is always a catch) it is slow, I can't do anything else with my computer until the process is done, and it can make mistakes.

OK, now I have been trying the file and string functions.

This is my first approach:

; CHECKING THE FILE, LINE BY LINE.
Do
    $line = FileReadLine(0, $linenum)
    $result = StringInStr($line, $schtxt)
    If $result <> 0 Then FileWriteLine(1, $line)
    $linenum = $linenum + 1
Until $linenum = 10000 ;$schfilect + 1

PROBLEM: Terribly slow, even slower than emulating keystrokes.
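The slowness is consistent with how reading by line number tends to work: asking for line N usually means scanning from the top of the file again on every call, so the total work grows roughly with the square of the line count. Below is a minimal sketch of the single-pass alternative, written in Python purely for illustration rather than in AutoIt. The function names, the UTF-8 encoding and the sample lines are assumptions, not something from this thread.

```python
def matching_lines(lines, needle):
    """Yield each line that contains needle; the input is consumed once."""
    for line in lines:
        if needle in line:
            yield line


def copy_matching_lines(src_path, dst_path, needle):
    """Stream src_path once and write every matching line to dst_path.

    UTF-8 is an assumption here; use whatever encoding the file really has.
    """
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(matching_lines(src, needle))


# Small in-memory demonstration with made-up lines.
sample = ["first line\n", "has TARGET here\n", "nothing\n", "TARGET again\n"]
found = list(matching_lines(sample, "TARGET"))
```

The key point is that the file is opened once and read front to back, so each line is looked at once, no matter how long the file is.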

Here is my second approach:

; CHECKING THE FILE AS ONE LONG STRING; THE RESULT IS A CHARACTER POSITION.
$filesch = FileRead($filesch1, 1000000)
$result1 = StringInStr($filesch, $schtxt, 0, 1)
$result2 = StringInStr($filesch, $schtxt, 0, 2)
$result3 = StringInStr($filesch, $schtxt, 0, 3)
$result4 = StringInStr($filesch, $schtxt, 0, 4)
$result5 = StringInStr($filesch, $schtxt, 0, 5)
$result6 = StringInStr($filesch, $schtxt, 0, 6)
$result7 = StringInStr($filesch, $schtxt, 0, 7)
$result8 = StringInStr($filesch, $schtxt, 0, 8)
$result9 = StringInStr($filesch, $schtxt, 0, 9)
$result10 = StringInStr($filesch, $schtxt, 0, 10)
$result11 = StringInStr($filesch, $schtxt, 0, 11)
$result12 = StringInStr($filesch, $schtxt, 0, 12)
MsgBox(0, "The resulting lines are:", $result1 & "-" & $result2 & "-" & $result3 & "-" & $result4 & "-" & $result5 & "-" & $result6 & "-" & $result7 & "-" & $result8 & "-" & $result9 & "-" & $result10 & "-" & $result11 & "-" & $result12)

As you can see, I am not very wise on the matter, but anyway, the idea is to search the whole text file as one long, long string.

PROBLEM: It is incredibly fast, but instead of the line that contains the searched text, I get its character position.

ALTERNATIVES:

One alternative is to use string formatting to give every line a fixed length, so that converting a character position to a line number is just a simple division.

My problem is that I have been trying to understand how to use the string format, without any luck so far.
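The fixed-length idea can be sketched as follows, in Python for illustration only. The width of 20, the function names and the sample lines are made up, and note that this sketch truncates any line longer than the chosen width, which may not be acceptable for real data.

```python
WIDTH = 20  # hypothetical fixed line width, chosen only for this example


def pad_lines(lines, width=WIDTH):
    """Pad (or truncate) every line to width characters and end it with LF."""
    return "".join(line[:width].ljust(width) + "\n" for line in lines)


def line_number(pos, width=WIDTH):
    """Map a 1-based character position to a 1-based line number.

    Every record occupies width + 1 characters (the text plus its LF).
    """
    return (pos - 1) // (width + 1) + 1


text = pad_lines(["alpha", "beta TARGET", "gamma"])
pos = text.find("TARGET") + 1   # convert Python's 0-based index to 1-based
line_no = line_number(pos)
```

With every record the same length, the division replaces any searching for line boundaries, at the cost of rewriting the whole text first.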

The second alternative is to find the line feed characters around the resulting character position and copy the text between them. I have no idea how to do this; maybe I would have to step back one character at a time, comparing each one until I reach Chr(10), and then do the same going forward.
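This second alternative does not need a character-by-character loop if the language can search backwards for a character. A minimal sketch in Python (not AutoIt; the function names and the sample text are hypothetical), assuming the search string itself contains no line break:

```python
def line_at(text, pos):
    """Return the complete line that contains the 0-based index pos."""
    start = text.rfind("\n", 0, pos) + 1   # one past the previous LF, or 0
    end = text.find("\n", pos)             # the next LF, or -1 on the last line
    if end == -1:
        end = len(text)
    return text[start:end].rstrip("\r")    # drop the CR of a CR+LF ending


def lines_containing(text, needle):
    """Return every line containing needle, each line reported once."""
    found = []
    pos = text.find(needle)
    while pos != -1:
        found.append(line_at(text, pos))
        end = text.find("\n", pos)
        if end == -1:
            break
        pos = text.find(needle, end)       # resume after the current line
    return found


sample = "alpha\r\nbeta TARGET x\r\ngamma\r\nTARGET TARGET\r\n"
hits = lines_containing(sample, "TARGET")
```

Resuming the search after the end of the current line means a line with several matches is copied only once.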

The third alternative is to run a program that stores the first and last character position of each line in an array, and then looks up the resulting character position to find the matching line. (I tried this, but again, it is terribly slow.)
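The third alternative need not be slow if the array of line starts is built in one pass and then searched with a binary search instead of a linear scan. A sketch under those assumptions, in Python for illustration (the names and sample text are invented):

```python
from bisect import bisect_right


def line_starts(text):
    """Return the 0-based index at which each line of text begins."""
    starts = [0]
    pos = text.find("\n")
    while pos != -1:
        starts.append(pos + 1)
        pos = text.find("\n", pos + 1)
    return starts


def line_index(starts, pos):
    """Return the 0-based index of the line that contains position pos."""
    return bisect_right(starts, pos) - 1


sample = "one\ntwo TARGET\nthree\n"
starts = line_starts(sample)
target_line = line_index(starts, sample.find("TARGET"))
```

Because the starts are already sorted, each lookup takes a number of steps proportional to the logarithm of the line count rather than to the line count itself.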

Any help will be highly appreciated.




PLEASE DON'T FLAME ME !!!!!! :)

OK, if your file is 600 pages and you have 60 lines per page, then you have 36,000 records, and if your file keeps growing you will kill your system trying to load it into a $variable. ENTER GAWK.

GAWK is a free programming language, and if you want speed, then how about 56,500 records per second on an 866 MHz Pentium III?

So here is the code; save it as "script.awk":

###############################################
# THIS IS AWK OR GAWK FOR WINDOWS IF YOU ARE BRAVE !!!
# DOWNLOAD GAWK AT http://sourceforge.net/project/showfiles.p...ackage_id=16431
# INSTALL IT THEN RUN IT
# C:\GAWK\g31\BIN\gawk.exe -fC:\SCRIPT.AWK
# (PUT THE NAME OF THE FILE TO SEARCH AT THE END OF THAT COMMAND LINE)
BEGIN {
    IGNORECASE = 1 # THIS WILL ALLOW YOU TO IGNORE CASE FOR MATCH
    print "BEG TIME......... " strftime("%H:%M:%S") # JUST A TIMER
}
{
    # $0 IS THE RECORD READ FROM "FILENAME"
    if ($0 ~ /TO MATCH/) # REPLACE THE TEXT TO BE MATCHED HERE
    {
        print $0 > "c:/output.txt"
        cntsel++
    }
}
END {
    print "INPUT FILENAME... " FILENAME
    print "SELECTED RECORDS. " cntsel+0
    print "RECORDS ........ " NR
    print "END TIME ........ " strftime("%H:%M:%S")
}
###############################################

Please, people, I love AUTOIT3, but when you match 6,000,000 records you need speed. Gawk is free and it is easy to learn. The previous code will do the match and create the new file in seconds.

If you have any questions about the code or awk, send me an EMAIL so that people don't get offended.

again; to automate anything use AUTOIT.


When I first read the post, I thought perl (windows port):

#!perl

open(outFILE, ">c:\\outfile.txt") || die "Cant open file"; #open output file
open(FILE, "C:\\fileToSearch") || die "Cant open file"; #open file to search through

#loop through file
while (<FILE>) {
  chomp( $LINE = $_ ); #remove trailing newline
  if ( $LINE =~ /searchString/ ) {
    print outFILE "$LINE\n"; #put it in output file
  }
}

But, I think Larry is spot-on with the find.exe example.

Def


Larry is ALWAYS RIGHT

My Gawk sample can be reduced to:

{if ($0 ~ /MATCH THIS/){print $0 > "c:/output.txt"}}

replace MATCH THIS and also output.txt as needed

notice the usage of "/" instead of "\"

GAWK is fast

Perl is powerful

but AUTOIT IS KING!!!


I'm not positive, but I think he said he used the find feature, didn't he? Or was he talking about the find inside Notepad?

Is the find in Notepad that much slower than the one outside?

JS


AutoIt Links

File-String Hash Plugin Updated! 04-02-2008 Plugins have been discontinued. I just found out.

ComputerGetInfo UDF's Updated! 11-23-2006

External Links

Vortex Revolutions Engineer / Inventor (Web, Desktop, and Mobile Applications, Hardware Gizmos, Consulting, and more)


he said find.exe and gave an example.

"I'm not even supposed to be here today!" -Dante (Hicks)


If the file got big, I would just load it into any real text editor that supports regular expressions.

You can use regular expressions to find the lines that contain the text you want and mark them, or delete all the lines that don't contain it and save the result under a new name. There are Windows ports of Unix grep that can do this from the command line as well.

Gawk reminds me of grep a bit.


AutoIt3, the MACGYVER Pocket Knife for computers.


To all you friends:

You have been really kind to take a look at this topic.

Thanks, JSThePatriot and ScriptKitty, for your comments.

For Normeus and Def: thanks very much for your code, but as a newbie I am trying to understand ONE program before jumping to other ones. I am sure you share with me the idea that we spend a lot of time on this because we are passionate about it.

For LARRY, wise as always: I tried the find.exe example, but I could not make it run (surely because of my lack of knowledge, especially of @ComSpec, which I don't really understand yet).

I would like to share the solution, and some questions that might be of public interest:

First the solution: I used LARRY's first script and modified it a little, after running into the following issues:

When I tried to replace @CRLF with @LF, the process took too much time; instead, I used @LF directly in StringSplit, and it works perfectly.

As I had to perform several searches (I mean several strings to search for), I generated a second file with the search strings on separate lines. Then, following Larry's instructions, I used StringSplit on this new text to get a second array, and then just ran one loop inside a second loop.

TO THE FACTS: when I used StringSplit on the search strings, each element of the second array kept the @LF character, so the search failed every time. After 5 hours I noticed this and changed the text from one string per line to strings separated by "/", and that was it.
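The loop-inside-a-loop search, together with the kind of leftover line-ending character that made every search fail, can be sketched like this. It is written in Python for illustration, not AutoIt; the names and sample data are hypothetical, and it assumes the files use CR+LF endings, so that splitting on LF alone leaves a CR on each element.

```python
def split_lines(text):
    """Split on LF and strip a leftover CR from the end of each element."""
    return [line.rstrip("\r") for line in text.split("\n")]


def search_all(text, terms_text):
    """Return (term, line) pairs for every line that contains a term."""
    lines = split_lines(text)
    terms = [t for t in split_lines(terms_text) if t]   # skip empty entries
    hits = []
    for term in terms:          # outer loop: one pass per search string
        for line in lines:      # inner loop: every line of the big text
            if term in line:
                hits.append((term, line))
    return hits


big_text = "Smith v. Jones\r\nAcme hearing\r\nnothing here\r\n"
terms_text = "Smith\r\nAcme\r\n"

naive_terms = terms_text.split("\n")        # elements keep a trailing CR
naive_found = naive_terms[0] in big_text    # "Smith\r" never matches
hits = search_all(big_text, terms_text)
```

Cleaning each search string once, right after the split, avoids having to change the separator in the terms file.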

PLEASE, I DON'T WANT TO FILL THIS SPACE WITH GARBAGE. IF ANYONE IS INTERESTED IN THE CODE, LET ME KNOW. IT IS SIMPLE, BUT MAYBE GOOD FOR SOMEONE ELSE.

RESULTS FOR THE CODE: 20 SEARCHES ON 10,000 LINES IN 2.8 MINUTES (on a lousy computer, by the way).

OK, NOW, THE QUESTIONS:

Is a character necessarily equal to a byte? LARRY, in your script you use FileGetSize (which returns bytes) as the limit for FileRead (which asks for a number of characters). I just want to be sure this is a rule.

What is the difference between @CRLF and @LF, and when does a text file use each of them?

As I assume, the data in the arrays is lost when the script finishes running. Is there any way to save those arrays for future use?

Can I make an array of more than 2 dimensions, maybe 3 or even 4? It would be just great to have something like this.

About a huge text: can I use StringSplit twice on it? Will the result be a two-dimensional array?
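On the last question, splitting twice generally means splitting the text into lines and then splitting each line again, which gives a list of rows where every row is itself a list of fields. A sketch of that idea in Python, not AutoIt (where a single StringSplit call returns a one-dimensional array); the "/" separator and the sample records are invented for the example:

```python
def split_twice(text, field_sep="/"):
    """Split text into lines, then split every line into its fields."""
    return [line.split(field_sep) for line in text.splitlines() if line]


# Made-up records: judge number / date / client name.
sample = "12/2004-11-03/Acme\n7/2004-11-04/Smith\n"
rows = split_twice(sample)
second_client = rows[1][2]
```

Each element is then addressed with two indexes, row first and field second, which is the same access pattern as a two-dimensional array.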

I hope that with your help I will be able to keep learning about AutoIt, and a little bit about programming.

THANKS TO ALL AGAIN.


I know arrays in most languages can have multiple dimensions, but I am not positive about AutoIt. I will check up on that.

When creating arrays you are limited to up to 64 dimensions and/or a total of 16 million elements.

I am pretty sure that one character isn't equal to one byte, but I could be wrong on that.
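Whether one character equals one byte depends on the text encoding of the file: in a single-byte encoding they coincide, while in multi-byte encodings they do not. Which encoding the script in this thread was reading is not stated, so the following is only a general illustration in Python, using a phrase that appears earlier in the thread:

```python
text = "Las líneas"

n_chars = len(text)                        # number of characters
n_latin1 = len(text.encode("latin-1"))     # single-byte encoding
n_utf8 = len(text.encode("utf-8"))         # the accented "í" takes two bytes
n_utf16 = len(text.encode("utf-16-le"))    # two bytes per character here
```

For plain single-byte text the file size in bytes is a safe upper limit for the number of characters to read, which is presumably why using the file size as the read limit works in practice.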

Yes, if you don't leave the script running, the data in the arrays will be lost. I am also pretty sure that if you read the help file some more, you will find out how to save the data in an array to a .txt document that you could read back the next time you need to run the script, or something like that. That way it builds on itself. Keep in mind that eventually your document would become too large to open.
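Saving an array to a text file and reading it back on the next run can be as simple as writing one element per line. A minimal sketch in Python rather than AutoIt, assuming that no element contains a line break; the file name and the sample values are made up:

```python
import os
import tempfile


def save_array(path, items):
    """Write one array element per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(item + "\n")


def load_array(path):
    """Read the elements back, removing the line ending from each."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


original = ["judge 12", "2004-11-03", "Acme hearing"]
path = os.path.join(tempfile.mkdtemp(), "array.txt")   # throwaway location
save_array(path, original)
restored = load_array(path)
```

The array still has to be rebuilt from the file on every run, but that is a single sequential read and is normally fast.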

I am interested to know the difference between @CRLF and @LF as well so we shall see...
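For reference, @CRLF is a carriage return followed by a line feed (Chr(13) and Chr(10)), while @LF is the line feed alone (Chr(10)). Text files created on Windows typically end each line with CR+LF, and files from Unix-like systems with LF only. The practical effect on splitting can be seen in this small Python illustration (the sample strings are made up):

```python
crlf_text = "one\r\ntwo\r\n"    # Windows-style endings: CR followed by LF
lf_text = "one\ntwo\n"          # Unix-style endings: LF only

pieces = crlf_text.split("\n")                   # splitting on LF leaves the CR behind
normalized = crlf_text.replace("\r\n", "\n")     # convert CR+LF to LF
clean = normalized.split("\n")
```

Splitting a CR+LF file on LF alone works, but every element keeps an invisible trailing CR that has to be stripped before comparing strings.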

JS


JS:

Thanks for your comments.

About the arrays: I know it would be quite simple to save them as a text file. I was wondering whether there is any way to save them as an array, to avoid "rearraying" the text each time I need it. Maybe my explanation was kind of messy.

I am working for a law firm. Daily, we have to pull down the results from each of 74 judges, "flag" them with the judge number and the date, and then run a search against the firm's client list to see whether there is any movement in the lawsuits.

Right now, I generate a report for each day and save it. If we could save it as a huge array of 4 dimensions, we would be able to search by judge or by client without having to pull the data again. THAT IS THE GOAL. Maybe it is hard because I am basically asking to save the data to a database instead of a text file.
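Looking up the same records by judge or by client is the kind of query that a database table with indexes handles directly. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in; SQLite is not a tool anyone in this thread uses, and the table name, column names and sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory database, for the example only
conn.execute(
    "CREATE TABLE movements (judge INTEGER, day TEXT, client TEXT, line TEXT)"
)
conn.execute("CREATE INDEX idx_judge ON movements (judge)")
conn.execute("CREATE INDEX idx_client ON movements (client)")

# Made-up rows: judge number, date, client, and the matching line of text.
rows = [
    (12, "2004-11-03", "Acme", "Acme hearing scheduled"),
    (12, "2004-11-04", "Smith", "Smith motion filed"),
    (7, "2004-11-04", "Acme", "Acme ruling issued"),
]
conn.executemany("INSERT INTO movements VALUES (?, ?, ?, ?)", rows)
conn.commit()


def by_client(client):
    """Return (judge, day, line) for one client, oldest first."""
    cur = conn.execute(
        "SELECT judge, day, line FROM movements "
        "WHERE client = ? ORDER BY day, judge",
        (client,),
    )
    return cur.fetchall()


def by_judge(judge):
    """Return (day, client, line) for one judge, oldest first."""
    cur = conn.execute(
        "SELECT day, client, line FROM movements "
        "WHERE judge = ? ORDER BY day, client",
        (judge,),
    )
    return cur.fetchall()


acme = by_client("Acme")
judge_12 = by_judge(12)
```

Each daily report becomes a batch of inserted rows, and both kinds of search read the stored data without pulling it again.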


What is stopping you from using a database? I would use an Access database for what you are talking about. I say Access because you probably already have it if you have Office, and it's not that hard to learn the VBA and such that it would take to get it working properly.

JS



A good forum for databases is www.dbforums.com. It's an excellent forum. I get on there as much as possible to help people, just like on here.

JS



LARRY: Thanks for your comments. If you have any idea how to save the arrays other than in a database, I would be grateful.

JS: Thanks for your comments too. Maybe I should start working on a database. If you know of any tutorial besides www.dbforums.com, I would be grateful for that.

Again, thanks to all.


I would have to say that if you are using a lot of data like that and want to perform searches, you should use a database.

Databases can search through a million records like a knife through butter. You can use a Microsoft Access database, or one of the thousands of cheaper ones.

For ease of use, I would have to say FileMaker is the easiest.

"My Database" is one of the cheapest (I don't recommend it).

Access is the most popular.

SQL is the best, and the most widely used (industry-wide and on the internet).

The key things about databases are indexing and relational structure.

Databases like structure, and they are all about speed and accuracy.

I hate to say "learn a new program", but although I have been known to use a knife as a screwdriver, I like to use a screwdriver at times when I need a screwdriver. :)

I do use AutoIt to do nice little conversions and edits to get text data ready for importing into a database, and also to automate a lot of database tasks. :)



Well, I think your point is right. As you and JS recommend, I think now is the time to start studying Access.

Thanks again to all for your comments.

Jurco.


Jurco, I will be looking for you on DBForums... they are very helpful, and I will help you in any way I can. Playing with a database using AutoIt isn't that hard, but it's not as easy as I thought it was going to be.

JS

