# Extract a big block of text out of a big file

## 10 posts in this topic

Hi,

The file I have is about 6 MB. I'd like to extract about 3000 lines, starting at one keyword in the file and ending at another. If I do it the traditional way, with a While loop that calls FileReadLine and searches for the keyword with a regex, it takes a bit long. I wonder whether it is possible to use a regex to extract such a big block of text from the string returned by FileRead?

I tried the patterns I have learnt so far from Malky, Mikell, Melba and some others, but they don't work for a big block.

I attached an example file. Maybe I'd like to take out the data block from test1000 to test4000, or wherever you can help with a 3000-line block.

##### Share on other sites

What about trying _StringBetween()? (It uses regex internally.)
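A minimal sketch of that suggestion, assuming the data has already been read into a string. The sample text below is a made-up stand-in for big.txt, not the attached file. Note that _StringBetween() returns only the text between the two keywords, not the keywords themselves.

```autoit
#include <String.au3>

; Small stand-in for the contents of big.txt (assumption for illustration only).
Local $sData = "test999" & @CRLF & "test1000" & @CRLF & "line A" & @CRLF & "line B" & @CRLF & "test4000" & @CRLF & "test4001"

; _StringBetween returns a 0-based array of every substring found between the two delimiters.
Local $aBetween = _StringBetween($sData, "test1000", "test4000")
If @error Then
    ConsoleWrite("No block found" & @CRLF)
Else
    ConsoleWrite($aBetween[0] & @CRLF)
EndIf
```

For the real file, `$sData` would come from `FileRead('big.txt')` instead of the literal above.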

##### Share on other sites

#3 ·  Posted (edited)

_StringBetween() isn't always great...

#include <Array.au3>
#include <Constants.au3>

Local $sData = FileRead('big.txt')
Local $aSRE = StringRegExp($sData, '(?s)^test1\R(.*)\Rtest1000\R', $STR_REGEXPARRAYGLOBALMATCH)
If Not @error Then
    ConsoleWrite($aSRE[0])
EndIf

Edited by guinness

##### Share on other sites

Try this:

$sResult = StringRegExpReplace(FileRead(@ScriptDir & "\big.txt"), "(?si)(?:.*)test999\s+(.*?)test4001\s+(.*)", "$1")
ConsoleWrite($sResult & @CRLF)

Br,

UEZ

##### Share on other sites

#5 ·  Posted (edited)

$sResult = _Extract("big.txt", "test1000", "test4000")
FileWrite("1.txt", $sResult)

Func _Extract($file, $from, $to)
    Return StringRegExpReplace(FileRead($file), '(?is).*?(' & $from & '\W.*?' & $to & ')\W?.*', "$1")
EndFunc

Not included: what to do if the keywords contain special characters. And how will you manage this if there are several occurrences of the keyword(s) in the text?

Edited by mikell

##### Share on other sites

Thank you all, most of your code works. I found out that StringRegExp only works on up to about 4 million characters (around 4500000). I used StringLen to count the string returned by FileRead; my file has len:19129246 chars.

$s = FileRead($p2)
$l = StringLen($s)
ConsoleWrite('len:' & $l & @CRLF)

If I use StringLeft to take only about 4 million characters, the regex works, but with more than that it won't find the block; it dumps out the whole main file.

$sl = StringLeft($s, 4500000) ;min is 150000 for one unit
$sResult = StringRegExpReplace($sl, "(?si)(?:.*)(\d_unit123\s+.*?)\d_unit456\w+\s+(.*)", "$1")

So I guess I have to cut the main string into several parts to search for a certain unit.

Mikell,

the file I parse uses the unit IDs as keywords. The keywords look something like d_id_unit123.
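One way to cut the block out by position, without any regex, could look like the sketch below. This is a swapped-in technique that nobody in the thread proposed; it uses only StringInStr and StringMid, and the function name `_ExtractByPosition` and the sample text are assumptions for illustration.

```autoit
; Hypothetical helper: returns the text from $sFrom up to and including $sTo.
Func _ExtractByPosition($sText, $sFrom, $sTo)
    ; Position of the first occurrence of the start keyword (0 = not found).
    Local $iStart = StringInStr($sText, $sFrom)
    If $iStart = 0 Then Return SetError(1, 0, "")

    ; Look for the end keyword only after the start keyword.
    Local $iEnd = StringInStr($sText, $sTo, 0, 1, $iStart + StringLen($sFrom))
    If $iEnd = 0 Then Return SetError(2, 0, "")

    ; Copy everything from the start keyword through the end keyword.
    Return StringMid($sText, $iStart, $iEnd - $iStart + StringLen($sTo))
EndFunc

; Made-up sample text standing in for the real file contents.
Local $sSample = "header" & @CRLF & "d_id_unit123" & @CRLF & "data 1" & @CRLF & "d_id_unit456" & @CRLF & "footer"
ConsoleWrite(_ExtractByPosition($sSample, "d_id_unit123", "d_id_unit456") & @CRLF)
```

Because the keywords are treated as plain text here, special characters in them need no escaping.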

##### Share on other sites

#7 ·  Posted (edited)

Beware that \d in a pattern represents a digit. If the bounding "keywords" actually contain \d, then escape the backslash (\\d) or enclose the keywords in \Q ... \E, as in \Q\d_id_unit123\E, which instructs the engine to treat whatever is inside verbatim.

You can try placing (*LIMIT_MATCH=9999999) at the very start of your pattern, with 9999999 being a large enough value.

Lastly, if you still hit a brick wall, I suggest using StringRegExp (not *Replace).
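A rough sketch of the last two suggestions combined: the keywords are wrapped in \Q ... \E and the block is captured with StringRegExp rather than StringRegExpReplace. The function name `_ExtractBlock` and the sample keywords are assumptions for illustration, and the quoting breaks if a keyword itself contains \E.

```autoit
#include <StringConstants.au3>

; Hypothetical helper: captures the first block from $sFrom through $sTo.
Func _ExtractBlock($sText, $sFrom, $sTo)
    ; \Q...\E makes the engine treat each keyword verbatim; (?s) lets . match newlines.
    Local $sPattern = '(?s)(\Q' & $sFrom & '\E.*?\Q' & $sTo & '\E)'
    Local $aMatch = StringRegExp($sText, $sPattern, $STR_REGEXPARRAYMATCH)
    If @error Then Return SetError(1, 0, "")
    Return $aMatch[0]
EndFunc

; Made-up sample with keywords that contain regex metacharacters.
Local $sSample = "junk" & @CRLF & "unit(1)" & @CRLF & "data" & @CRLF & "end[1]" & @CRLF & "more junk"
ConsoleWrite(_ExtractBlock($sSample, "unit(1)", "end[1]") & @CRLF)
```

If the engine still gives up on a very large input, the (*LIMIT_MATCH=...) prefix mentioned above can be prepended to `$sPattern`.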

Edited by jchd

##### Share on other sites

#9 ·  Posted (edited)

Hi jchd, I tried guinness's StringRegExp code, and it works without even putting any limit in it.

As for the StringRegExpReplace code from Mikell and Malky, it doesn't do anything with the limit set to 9999999 or any other number.

I tried to extend the StringRegExp code so that it repeats the search for the same unit through the rest of the file, but it doesn't work so far. I also don't know how to set the end keyword for a unit block that is at the end of the file. It would be cool if I could get help with that.

Edited by iahngy

##### Share on other sites

So I changed the end keyword for the block to "test result", which is always included in the block, instead of using another unit ID.

It works great. It is able to search again for the same unit. I added Mikell's trick, the question mark ?, in the middle and at the end of the pattern.
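For anyone landing here later, the repeated search can be sketched as below. It assumes the end keyword "test result" described above and also lets a block end at the end of the file via \z. The function name `_ExtractAllBlocks` and the sample text are made up for illustration, not the actual script.

```autoit
#include <StringConstants.au3>

; Hypothetical helper: returns an array with every block that starts at $sFrom
; and ends at $sTo, or at the end of the text if $sTo never follows.
Func _ExtractAllBlocks($sText, $sFrom, $sTo)
    ; .*? is lazy, so each block stops at the nearest end keyword; \z matches end of text.
    Local $sPattern = '(?s)(\Q' & $sFrom & '\E.*?(?:\Q' & $sTo & '\E|\z))'
    Local $aBlocks = StringRegExp($sText, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
    If @error Then Return SetError(1, 0, 0)
    Return $aBlocks
EndFunc

; Made-up sample: the same unit appears twice, the second time at the end of the text.
Local $sSample = "unit123" & @CRLF & "a" & @CRLF & "test result" & @CRLF & "unit456" & @CRLF & "x" & @CRLF & "test result" & @CRLF & "unit123" & @CRLF & "b"
Local $aFound = _ExtractAllBlocks($sSample, "unit123", "test result")
If Not @error Then
    For $i = 0 To UBound($aFound) - 1
        ConsoleWrite("--- block " & $i & " ---" & @CRLF & $aFound[$i] & @CRLF)
    Next
EndIf
```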