What's the largest thing you can StringSplit? (or the largest array?)

sebgg · November 30, 2011

I have a 112 million character file.

I tried to StringSplit it by "" to get an array of each character.

AutoIt died.

I'm guessing it exceeded the max limit for either StringSplit or arrays?

Does anyone know what the limit of each is?

Seb.

Zedna · November 30, 2011

From the helpfile, FAQ #15: What are the current technical limits of AutoIt v3?

Maximum string length: 2,147,483,647 characters

Arrays: A maximum of 64 dimensions and/or a total of 16 million elements
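
A quick way to see why the split fails is to compare the character count against the array limit before calling StringSplit. The sketch below is illustrative only: the 16000000 figure is the rounded number from the FAQ quoted above, and $sData is a stand-in for the contents of the real file.

```autoit
; Sketch: check whether splitting a string into characters would exceed
; the array element limit. $iMaxElements is the rounded FAQ figure (assumption).
Local $iMaxElements = 16000000
Local $sData = "atgatgatg" ; stand-in for FileRead() of the real file

; StringSplit with an empty delimiter yields one element per character,
; plus element [0] holding the count.
If StringLen($sData) + 1 > $iMaxElements Then
    ConsoleWrite("Too large to split into a single array" & @LF)
Else
    Local $aChars = StringSplit($sData, "")
    ConsoleWrite("Split into " & $aChars[0] & " characters" & @LF)
EndIf
```

With 112 million characters the first branch is taken, which matches the failure described in the opening post.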

Edited November 30, 2011 by Zedna

guinness · November 30, 2011

Please search the Help file for 'AutoIt limits'; it will give you more of an idea. Though if I'm honest, the limitations are very difficult to reach. Touch wood, I've never required a 64-dimension array!

Edited November 30, 2011 by guinness

kylomas · November 30, 2011

sebgg,

Let's see the code...

kylomas

kylomas · December 1, 2011

sebgg,

If you REALLY require a 112mil element array then consider using a structure and populating it in a loop from your flat file.

kylomas

Edited December 1, 2011 by kylomas

kylomas · December 1, 2011

sebgg,

This may be of some use to you...

Local $s10 = "", $st
; generate a 112 million random-character string and write it to a file
$st = TimerInit()
For $i = 1 To 112000000
    $s10 &= Chr(Random(32, 128, 1))
Next
Local $ofl = FileOpen("c:\tmp\struct array test.txt", 2)
If $ofl = -1 Then MsgBox(0, "open error", "")
FileWrite($ofl, $s10)
FileClose($ofl)
ConsoleWrite('Total time to create 112 mil random char string = ' & TimerDiff($st) / 1000 & ' seconds' & @LF)
; define and populate array structure from file
$st = TimerInit()
Local $a10 = DllStructCreate("char[112000000]")
If @error <> 0 Then MsgBox(0, "Struct Create Error", @error)
$s10 = FileRead("c:\tmp\struct array test.txt")
; StringMid positions and struct element indexes are both 1-based
For $i = 1 To StringLen($s10)
    DllStructSetData($a10, 1, StringMid($s10, $i, 1), $i)
Next
ConsoleWrite('Total time to create 112 mil random char array = ' & TimerDiff($st) / 1000 & ' seconds' & @LF)
; display 1st 50 chars
ConsoleWrite("1st 50 chars = ")
For $i = 1 To 50
    ConsoleWrite(DllStructGetData($a10, 1, $i))
Next
ConsoleWrite(@LF)
; display last 50 chars, in their original order
ConsoleWrite("Last 50 chars = ")
For $i = DllStructGetSize($a10) - 49 To DllStructGetSize($a10)
    ConsoleWrite(DllStructGetData($a10, 1, $i))
Next
ConsoleWrite(@LF)

I am NOT a C/C++ programmer, so there may be a way to populate this struct using assignment rather than this long loop (it runs for about 350 seconds on my machine).
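
On the question of populating the struct by assignment: DllStructSetData can set a whole char[n] element from a string when the index argument is omitted, which avoids the per-character loop. The sketch below uses a short stand-in string rather than the 112 million character file; the one extra byte allocated for a terminator is a cautious assumption, not something stated in the thread.

```autoit
; Sketch: fill a char[] struct element in one call instead of a loop.
Local $sData = "ACGTACGTAC" ; stand-in for the FileRead() result

; Allocate one extra char so a terminating null always fits (assumption).
Local $tChars = DllStructCreate("char[" & (StringLen($sData) + 1) & "]")
If @error Then
    MsgBox(0, "Struct Create Error", @error)
    Exit
EndIf

; Omitting the index sets the entire element from the string.
DllStructSetData($tChars, 1, $sData)

; Individual characters are still addressable with a 1-based index.
ConsoleWrite("First char: " & DllStructGetData($tChars, 1, 1) & @LF)
ConsoleWrite("Whole element: " & DllStructGetData($tChars, 1) & @LF)
```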

Perhaps one of the experts could render an opinion?

kylomas

jchd · December 1, 2011

Just out of curiosity: care to explain why you need such a large array of chars?

sebgg · December 1, 2011

Arrays: A maximum of 64 dimensions and/or a total of 16 million elements

ahh so i guess this is the problem

sebgg,
Let's see the code...
kylomas

code was simple

$x = FileRead("contig.txt")
$y = stringsplit ($x, "")

but it dies

sebgg,
If you REALLY require a 112mil element array then consider using a structure and populating it in a loop from your flat file.
kylomas

What I've done is split the 112 million char list into 100 lists of 1.12 million chars each, and split these instead.

using the code:

for $i = 1 to 100
 $origional = fileread ("cleaned contig99.txt")
 $string = fileread("cleaned contig99.txt" ,1121224)
 filewrite ("section " & $i & ".txt", $string)
 $new = stringreplace ($origional , $string, "")
 filedelete ("cleaned contig99.txt")
 filewrite ("cleaned contig99.txt", $new )
Next

This starts slow but gets faster of course; it was done in a few minutes.
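
The same chunking can be sketched as a single pass that reads the file once and slices it with StringMid, instead of re-reading and rewriting the source file on every iteration. The file names and chunk size are taken from the post above; the variable names are illustrative.

```autoit
; Sketch: split one large text file into fixed-size sections in one pass.
Local $sAll = FileRead("cleaned contig99.txt")
If @error Then
    ConsoleWrite("Could not read the input file" & @LF)
    Exit
EndIf

Local $iChunk = 1121224 ; characters per section, as in the post above
Local $iPart = 0

; Step through the string one chunk at a time; StringMid simply returns
; fewer characters for the final, shorter chunk.
For $iPos = 1 To StringLen($sAll) Step $iChunk
    $iPart += 1
    ; Note: FileWrite with a file name appends if the file already exists.
    FileWrite("section " & $iPart & ".txt", StringMid($sAll, $iPos, $iChunk))
Next
ConsoleWrite("Wrote " & $iPart & " sections" & @LF)
```

Unlike the loop above, this never modifies the source file, so it can be rerun safely after deleting the old section files.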

Just out of curiosity: care to explain why you need such a large array of chars?

Just for fun, I'm translating a whole transcriptome (transcribed genome) and wanted to see the longest English word spelled out. Just for fun, mind. So the next code translates each piece into the 3 different frames, so 100 files become 300 files, each about a third of the length (I'll do the 3 reverse frames later), using this (running currently):

now this is a really slow one, but i expected it to be,

#include <Array.au3>
; Script Start - Add your code below here
;$q7 = FILEREAD("1st frame.txt")
$codontable = fileread("fasta codon.txt")
;$w7 = STRINGSPLIT ($q7 , @CR )
$1stframe = ""
$2ndframe = ""
$3rdframe = ""
$x1 = 0
$1x = 0
$1xx = 0
$5x = 0
;_arraydisplay ($w)
;$line1 = filereadline ("fasta codon.txt",$i+1)
;msgbox(1,"yesy", $line1)
;$x = stringsplit ($line1, " ")
;_arraydisplay ($x)
global $arr[68][2]
;_arraydisplay ($arr)
for $i = 0 to 67
 $line1 = filereadline ("fasta codon.txt",$i+1)
 ;msgbox(1,"yesy", $line1)
 $x = stringsplit ($line1, " ")
 $arr[$i][1] = $x[2]
 $arr[$i][0] = $x[1]
Next
for $o = 1 to 100
 
$string = fileread("section " & $o &".txt")
$split = stringsplit ($string , "")
$frame = 0
$start =timerinit ()
for $i = 1 to $split[0]-5
 $frame = $frame + 1
$codon = $split[$i] & $split[$i+1] & $split[$i+2]
 
if $frame = 1 then
 for $u= 0 to 67
  if $arr[$u][0] = $codon then
   $1stframe = $1stframe & $arr[$u][1]
  EndIf
  next
elseif $frame = 2 then
 for $u= 0 to 67
  if $arr[$u][0] = $codon then
   $2ndframe = $2ndframe & $arr[$u][1]
  EndIf
  next
elseif  $frame = 3 then
 for $u= 0 to 67
  if $arr[$u][0] = $codon then
   $3rdframe = $3rdframe & $arr[$u][1]
  EndIf
 next
  $frame = 0
EndIf
if $i >$split[0]/1000 and $x1 = 0 then
 
 tooltip("section " & $o & "   0.1% complete   time : " & (timerdiff($start)/1000),0,0)
 $x1 = 1
EndIf
if $i >$split[0]/100 and $1x = 0 then
 tooltip("section " & $o & "   1% complete   time : " & (timerdiff($start)/1000),0,0)
 $1x = 1
EndIf
if $i >$split[0]/10 and $1xx = 0 then
 tooltip("section " & $o & "   10% complete   time : " & (timerdiff($start)/1000),0,0)
 $1xx = 1
EndIf
if $i >$split[0]/2 and $5x = 0 then
 tooltip("section " & $o & "   50% complete   time : " & (timerdiff($start)/1000),0,0)
 $5x = 1
EndIf
 Next
$x1 = 0
$1x = 0
$1xx = 0
$5x = 0
  filewrite ("translate/section " & $o & " - 1stframe.txt", $1stframe)
 filewrite ("translate/section " & $o & " - 2ndframe.txt", $2ndframe)
 filewrite ("translate/section " & $o & " - 3rdframe.txt", $3rdframe)
$3rdframe = ""
$2ndframe = "" 
$1stframe = ""
Next

Mat · December 1, 2011

You shouldn't really be using StringSplit and "" to get individual characters... StringMid would mean you would not need to use any arrays and avoid that limitation.

Basically:

$aString[0] ==> StringLen($sString)

$aString[n] ==> StringMid($sString, n, 1)

You can pretty much do that replacement automatically (short regexReplace call).
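
The mapping above can be sketched as a plain loop that never builds an array. The input string is a short stand-in for the real file contents.

```autoit
; Sketch: walk a string character by character without StringSplit.
Local $sString = "atgatgatg" ; stand-in for the FileRead() result
Local $iLen = StringLen($sString) ; takes the place of $aString[0]

For $n = 1 To $iLen
    Local $sChar = StringMid($sString, $n, 1) ; takes the place of $aString[$n]
    ConsoleWrite($n & ": " & $sChar & @LF)
Next
```

Since no array is created, the 16 million element limit never comes into play; only the string length limit applies.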

jchd · December 1, 2011

You've taken the least efficient path at every step.

sebgg · December 1, 2011

You shouldn't really be using StringSplit and "" to get individual characters... StringMid would mean you would not need to use any arrays and avoid that limitation.

Basically:
$aString[0] ==> StringLen($sString)
$aString[n] ==> StringMid($sString, n, 1)

You can pretty much do that replacement automatically (short regexReplace call).

well wow this looks like it makes a big difference

The other translation was about 36% complete by now; this one is probably going to overtake it soon.

You've taken the least efficient path at every step.

And yeah, I'm a total rookie at this whole scripting thing. But thanks for the useful advice and help with this problem. You're why I like this community.

jchd · December 1, 2011

Advice to what? Manipulating useless arrays much larger than possible? What were you asking in the first place: see your own title. You got quick answers (in 6 & 8 minutes!) that 112M elements were too much for AutoIt arrays. BTW these limitations are found in due place in the help file. Subsequent answers showed that experienced members were highly suspicious of the usefulness of char arrays that large.

I was the first one to ask why you insisted doing it that way, in an attempt to help you (despite what you seem to believe), as I strongly suspected you embarked on a wrong track. Splitting the file is just the wrong band-aid which is going to make things even more difficult. Doing things for fun doesn't necessarily imply making wrong choices and not asking around. There is nothing wrong in not knowing something but trying surgeon tools randomly chosen, even on a dead corpse, isn't going to help you much understanding correct surgery practices.

Should you have asked "How can I find word patterns in a huge string", you probably would be done by now, and us as well.

Since we have to ask -- and answer -- the last question ourselves, here's one easy magic wand among others: StringRegExp[Replace]. Carefully read helpfile and start thinking how this will help you tremendously.

Before you start investigating what and how in this direction, realize that a 112 MB file will eat at least twice as much memory because AutoIt strings are Unicode (one 8-bit character on file = one 16-bit word in memory). So keeping things simple and efficient is certainly a good idea.

Then to the beef: can you provide a short extract (a few KB) of sample input, your codon file, any other detail required by people not versed in DNA stuff, and a precise description of what you're going to search for?

By this time you can expect valid, useful hints (or even code if someone is brave and good enough) to achieve what you want. That would teach you how to deal with similar situations in the future and a bit of AutoIt knowledge.

As a sidenote, I'm using a terrible wifi hotspot setup which drops the link randomly. I happened to hit POST because the reconnection window disappeared at the moment I was clicking there. My intention was to explain to you how to proceed. You misinterpret easily.

sebgg · December 1, 2011

Advice to solve the problem once I stated what I'm up to, just like everyone else was giving.

I know this project is just for fun, but I don't think the arrays I'm manipulating are useless (as the only ones I can manipulate are the ones not too large for AutoIt).

I haven't been choosing functions at random. This code is the best I can do with all of my knowledge about AutoIt. To repeat, my choices were absolutely not random; I just don't know a lot.

Having read up on StringRegExpReplace, I think it really will be useful for finding the words, thanks for that advice. I'm not sure whether I can use it to do all the translations, but I'll look into it.

Really, the main question was not "How can I find word patterns in a huge string", but more the best ways to manipulate very large strings. For that I have some nice answers and have learned a lot. Apologies if I phrased it badly in the beginning, or not at all.

Hope this has cleared everything up, and thanks for the advice.

Oh, and at the time I read your reply, I took everything in it at face value, which may or may not have been a mistake. It clearly was a misinterpretation, but I can't assume that everything posted on here is a work in progress and wait for additional edits.

Mat · December 1, 2011

You have some kind of input with english and non-english words, say

$sInput = "wae#IsWorlddf.nkj1Helloegfi.sdfrhef"

And you have a dictionary of some description:

$sWords = "a aardvark ... zygospore zygote"

You want to find the longest sequence of characters in $sInput that is in $sWords.

Now we have defined the problem we can approach it.

The words in $sWords could be written as a tree structure. Each branch is another letter and each leaf is a word. Lets take the following set of words: "a abort aborted analogue animal"

a --- E
   |- b --- o --- r --- t --- E
   |                       |- e --- d
   |- n --- a --- l --- o --- g --- u --- e
         |- i --- m --- a --- l

Where E stands for END, in practice being '0' (the terminating null)

You can see if a sequence of characters conform by descending the tree. The length of the word will be the depth that it reaches, and it will be valid only if it can end on an E.

Now you need an efficient way to store a ridiculously large tree. Memory is not the answer. Perhaps memory for the first 3 levels, as you can be sure they will be common enough, but after that you need to think about how you store it.

Basic program:

For each start position L
    n = 0
    lastEnd = 0
    node = root of tree
    while node has a branch for s[L+n]
        descend that branch
        n++

        if node has END
            lastEnd = n
        end if
    end while

    if lastEnd > longest
        longest = lastEnd
        longestWord = s[L .. L+lastEnd-1]
    end if
end for

Perhaps there are better ways, but there is improvement there.
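
A minimal sketch of the idea in AutoIt, under some loud assumptions: instead of a real linked tree it emulates one with a Scripting.Dictionary (a Windows COM object) that holds every prefix of every word, with the value True marking prefixes that are complete words (the END nodes). The word list and input string are tiny stand-ins, and holding all prefixes of a 240k word list in memory this way has not been measured.

```autoit
; Sketch: emulate the word tree with a dictionary of prefixes.
; Key = prefix, value = True when that prefix is itself a complete word.
Local $oTree = ObjCreate("Scripting.Dictionary")
Local $aWords = StringSplit("a abort aborted analogue animal", " ")
For $w = 1 To $aWords[0]
    For $n = 1 To StringLen($aWords[$w])
        Local $sPrefix = StringLeft($aWords[$w], $n)
        If Not $oTree.Exists($sPrefix) Then $oTree.Add($sPrefix, False)
    Next
    $oTree.Item($aWords[$w]) = True ; mark the END node
Next

; Scan: from each start position, descend while the prefix is still in the tree.
Local $sInput = "xxanimalxabortedzz" ; stand-in input
Local $iLen = StringLen($sInput)
Local $sLongest = ""
For $L = 1 To $iLen
    Local $n = 0, $iLastEnd = 0
    While $L + $n <= $iLen And $oTree.Exists(StringMid($sInput, $L, $n + 1))
        $n += 1
        If $oTree.Item(StringMid($sInput, $L, $n)) Then $iLastEnd = $n
    WEnd
    If $iLastEnd > StringLen($sLongest) Then $sLongest = StringMid($sInput, $L, $iLastEnd)
Next
ConsoleWrite("Longest word found: " & $sLongest & @LF)
```

The dictionary keys are case-sensitive by default, so the word list and the input should be normalised to the same case first.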

jchd · December 1, 2011

Let's brush away unrelated stuff.

FileRead the whole thing.

You most probably can do codon replacement in place (in fact it's not in place, but appears to be) using $text = StringRegExpReplace($text, $codon[$i], $codontoken[$i]). That's not actual code, just the idea.

Filter out from a word list (there is one thread running about that) the words that can't be represented with codon tokens (I've no idea how you're going to represent them). How large is that short list?

Optionally sort that short list.

Then find out the presence of, say, 100 words at a time using alternation in a StringRegExp loop against the whole sequence. The RE can be sped up using careful use of prefixes along the idea of Mat in the previous post.
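
The alternation step can be sketched roughly as follows. The text and word batch are stand-ins, and the batch is assumed to contain only plain letters, so no regex escaping is done. One caveat with this approach: a global match reports non-overlapping hits only, so a word hidden inside or overlapping another hit can be missed.

```autoit
; Sketch: test a batch of words against the text in one StringRegExp call.
Local $sText = "mkvanimallqabortedrs" ; stand-in for the translated sequence
Local $aWords[3] = ["aborted", "animal", "zygote"] ; stand-in batch of words

; Build "aborted|animal|zygote" by joining the batch with "|".
Local $sPattern = ""
For $i = 0 To UBound($aWords) - 1
    If $i > 0 Then $sPattern &= "|"
    $sPattern &= $aWords[$i]
Next

; Flag 3 returns an array of all matches; @error is set when none are found.
Local $aHits = StringRegExp($sText, "(?i)(?:" & $sPattern & ")", 3)
If @error Then
    ConsoleWrite("No words from this batch found" & @LF)
Else
    For $i = 0 To UBound($aHits) - 1
        ConsoleWrite("Found: " & $aHits[$i] & @LF)
    Next
EndIf
```

Putting longer words first in each batch makes the alternation prefer them when two words start at the same position.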

sebgg · December 1, 2011

@ mat

Yep, that's the basic idea. I've pretty much compiled the 3 x 30 million character lists now, so now that's done it's just a case of taking words from my known words list to see how many are in there.

This tree idea looks really neat. I've never seen anything like it; will have a nosey.

@ jchd

So for codon replacement, StringRegExpReplace will maybe be tricky, as it won't keep frame?

so for the string

atgatgatgatgatg

frame 1 would be

atg atg atg atg atg

so if I did StringRegExpReplace for atg to "m"

id get m m m m m

perfect!

But if I did it for gat, it would totally lose frame.

So I think going through it base by base may be needed here?

I have actually done this now, but it would be nice to know of a faster method anyway, if you know one.
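
One way to keep frame without a character array is to step through the string three bases at a time with StringMid and look each codon up in a dictionary. This is only a sketch: the three-entry table is a stand-in for the full codon file, "*" for a stop codon is just a common convention, and Scripting.Dictionary is a Windows COM object used here as an assumed replacement for the 68-row array scan.

```autoit
; Sketch: translate all three forward frames by stepping 3 bases at a time.
Local $oCodon = ObjCreate("Scripting.Dictionary")
$oCodon.Add("atg", "M") ; stand-in entries; the real table comes from the codon file
$oCodon.Add("gat", "D")
$oCodon.Add("tga", "*") ; "*" marks a stop codon (convention, not from the thread)

Local $sDna = "atgatgatgatgatg" ; the example sequence from the post above
For $iFrame = 1 To 3
    Local $sProtein = ""
    ; Start at base 1, 2 or 3 and read whole codons only.
    For $i = $iFrame To StringLen($sDna) - 2 Step 3
        Local $sTriplet = StringMid($sDna, $i, 3)
        If $oCodon.Exists($sTriplet) Then $sProtein &= $oCodon.Item($sTriplet)
    Next
    ConsoleWrite("Frame " & $iFrame & ": " & $sProtein & @LF)
Next
```

Each codon costs one dictionary lookup instead of a scan over the whole table, and each frame only ever reads the codons that belong to it.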

regarding the searching words bit.

my shortlist is just 240,000 words

(thanks GEOSoft! http://dundats.mvps.org/autoit/, under miscellaneous)

So my plan here was to just populate an array with the 240k known words, and for each one run StringInStr against the 30 million character string.

If no error, then print the word.

Then I was planning to trim down this list later on.

Now this will be slow for sure, so a better way is probably using StringRegExp as you said; I'll try this with Mat's idea.

As for shortening the short list, that's possible for sure. I've run my script on just the first 0.3% of the file and got up to 6-7 character words already, so I could cut everything shorter than that from the list. Will do this now and have a look at the difference.

cheers,

seb

sebgg · December 1, 2011

removing all words <7 char brought it down from 255k to 214k

not really such a difference, so i may start with a very stringent list (only words >10-12 characters long) if i find nothing, then work my way down.

Seb

edit: yep >15 is just 10.8 k words, may start with this

Edited December 1, 2011 by sebgg

GEOSoft · December 1, 2011

You could very easily use a Reg Exp to separate that file into separate files, each containing only words of a given length, or even combine only words of a given range, for example all the 1 and 2 character words in one file.

Another method would be to alpha split the file.
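
A rough sketch of the length filter, assuming the word list is lower-case and whitespace-separated; the list below is a short stand-in and the minimum length of 15 is only an example value.

```autoit
; Sketch: pull out only the words of at least $iMinLen letters with one regex.
Local $sWordList = "a aardvark electroencephalograph zygote uncharacteristically"
Local $iMinLen = 15

; \b marks word boundaries; {15,} means "15 or more letters".
Local $aLong = StringRegExp($sWordList, "\b[a-z]{" & $iMinLen & ",}\b", 3)
If @error Then
    ConsoleWrite("No words of that length" & @LF)
Else
    For $i = 0 To UBound($aLong) - 1
        ConsoleWrite($aLong[$i] & @LF)
    Next
EndIf
```

Changing the quantifier to an exact count such as {7} or a range such as {1,2} gives the per-length or per-range files described above.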
