Best plan of attack


MFerris

I am working on a project where I need to do some basic statistics on a LARGE amount of data (stored as flat ASCII txt).

Most importantly, I'm looking to do frequencies, or 'counts', for any given column in the data that I'll define.

For instance:

123456

846515

966452

A freq on column 1 would return:

1 - 1

8 - 1

9 - 1

A freq on column 3 would return:

3 - 1

6 - 2

Now, take my little example and make it thousands of columns wide and thousands (if not tens of thousands) of rows long (that's worst case, but entirely possible). Excel would not be my friend here.

While I think I can program this using a menagerie of arrays, 'charAt()', and liberal use of "count += 1", I suspect there may be a better way that I'm not seeing - not to mention that what I have in mind would be rather processor- and time-intensive. I also don't want to limit myself to a single column: there may be instances where I want to do a frequency of columns 5-6 (which would return 56, 15, and 52 using my example).
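To make that concrete, here's roughly the brute-force pass I have in mind - an untested AutoIt sketch where "data.txt" and the column positions are placeholders, and the parallel value/count arrays stand in for the 'menagerie' above:

; Brute-force frequency count: parallel arrays of values and counts
Local $aValues[0], $aCounts[0]
Local $hFile = FileOpen("data.txt", 0) ; 0 = read mode
While 1
    Local $sLine = FileReadLine($hFile)
    If @error Then ExitLoop ; end of file
    Local $sVal = StringMid($sLine, 5, 2) ; e.g. columns 5-6
    Local $iFound = -1
    For $i = 0 To UBound($aValues) - 1 ; linear scan for the value
        If $aValues[$i] = $sVal Then
            $iFound = $i
            ExitLoop
        EndIf
    Next
    If $iFound >= 0 Then
        $aCounts[$iFound] += 1
    Else ; first time we've seen this value
        ReDim $aValues[UBound($aValues) + 1]
        ReDim $aCounts[UBound($aCounts) + 1]
        $aValues[UBound($aValues) - 1] = $sVal
        $aCounts[UBound($aCounts) - 1] = 1
    EndIf
WEnd
FileClose($hFile)
For $i = 0 To UBound($aValues) - 1 ; print "value - count" pairs
    ConsoleWrite($aValues[$i] & " - " & $aCounts[$i] & @CRLF)
Next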

I then thought of SQLite - but I haven't used databases since dBase III was in vogue, and in my googling I can't find anything indicating whether SQLite can do this (and since this is all in a txt file, I'd have to import it into SQLite, define potentially hundreds of fields, etc.).

So what do you think? Stick with my original concept of just grinding through each row, comparing against an array and keeping track of duplicates? Can SQLite accomplish what I'm looking to do (keeping in mind I'm relatively clueless with SQLite so far), or can you think of some other method of attack that I haven't considered yet?

I'm not expecting lightning-fast speeds, but if it's going to take hours to run then I don't know that I want to put the time into it.

Thanks for your suggestions!

Edit: I also meant to mention that I'd like the ability to do averages and means; however, these are not dealbreakers - being able to do a frequency is my highest priority.
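Thinking about it, an average ought to fall out of the same read loop as the frequency sketch above - just keep a running sum alongside the counts. A fragment, with the same placeholder columns:

; Running mean, added to the same read loop as the frequency sketch
Local $nSum = 0, $iRowCount = 0
; ... per line, after extracting the field:
$nSum += Number(StringMid($sLine, 5, 2)) ; treat cols 5-6 as a number
$iRowCount += 1
; ... and after the loop:
If $iRowCount > 0 Then ConsoleWrite("Mean: " & $nSum / $iRowCount & @CRLF)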


Skip SQLite, go straight for MySQL.

It will hold the data, but I'm not sure whether a single SQL statement can retrieve the info you're after; you may still have to write that part manually.

How often is the data going to change?

If it is going to be static (entered and never edited), then the effort would be worth it: make the program that sorts and generates the stats do everything in one hit, park it on a machine, and let it go. Have it store all of its findings in another database, then poll that for what you need - it won't have to crunch the numbers again, it will only have to spit out the results.

It is an intriguing and challenging idea, though; I may attempt it on a small scale for the sheer fun of it (say, a 10 x 10 grid of numbers).



Thanks... one thing I forgot to mention is portability - I'm creating this for others who will have no idea what SQL is, never mind how to use it. My aim is something I can send them that they open, select a file, and click "go". My thinking was that SQLite could run quietly 'under the hood' and no one would ever know - as opposed to MySQL, which would require an installation, etc.
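For the record, this is the sort of thing I was picturing with SQLite under the hood - an untested sketch using AutoIt's SQLite.au3 UDF, with the file name, table name, and columns all placeholders. The dodge (if it works the way I read the docs) is importing each row as a single TEXT column, so I never have to define hundreds of fields:

#include <SQLite.au3>

; Untested sketch: import each row as one TEXT column, let SQL do the counting
_SQLite_Startup() ; loads sqlite3.dll
Local $hDB = _SQLite_Open() ; in-memory database, nothing to install
_SQLite_Exec($hDB, "CREATE TABLE raw (line TEXT);")

Local $hFile = FileOpen("data.txt", 0) ; placeholder file name
_SQLite_Exec($hDB, "BEGIN;") ; one transaction makes the bulk insert far faster
While 1
    Local $sLine = FileReadLine($hFile)
    If @error Then ExitLoop
    _SQLite_Exec($hDB, "INSERT INTO raw VALUES (" & _SQLite_FastEscape($sLine) & ");")
WEnd
_SQLite_Exec($hDB, "COMMIT;")
FileClose($hFile)

; Frequency of columns 5-6 without defining per-column fields
Local $aResult, $iRows, $iCols
_SQLite_GetTable2d($hDB, "SELECT substr(line, 5, 2) AS value, COUNT(*) AS freq " & _
        "FROM raw GROUP BY value ORDER BY value;", $aResult, $iRows, $iCols)
For $i = 1 To $iRows ; row 0 holds the column names
    ConsoleWrite($aResult[$i][0] & " - " & $aResult[$i][1] & @CRLF)
Next

_SQLite_Close($hDB)
_SQLite_Shutdown()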

I did play around with this using my original concept - simply going through each row, using StringMid() to get the value, and looping through an array looking for the value, incrementing a counter if found (else adding it to the array). It's not as slow as I thought it would be.

I ran this on one of my bigger files - 160,000 records. I had it check a location 4 columns wide (which ultimately contained 103 unique values), and it took 30 seconds to go through all 160,000 records. I think this will do what I need it to.
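If a wider field with far more unique values ever bogs it down (the inner scan makes the pass roughly rows x unique values), I gather swapping the array for a Scripting.Dictionary COM object cuts that to one hashed lookup per row. An untested sketch of that variant, same placeholder file and columns:

; Same counting pass, but with a Scripting.Dictionary instead of the array scan
Local $oCounts = ObjCreate("Scripting.Dictionary")
Local $hFile = FileOpen("data.txt", 0)
While 1
    Local $sLine = FileReadLine($hFile)
    If @error Then ExitLoop ; end of file
    Local $sKey = StringMid($sLine, 5, 2)
    If $oCounts.Exists($sKey) Then
        $oCounts.Item($sKey) = $oCounts.Item($sKey) + 1
    Else
        $oCounts.Add($sKey, 1)
    EndIf
WEnd
FileClose($hFile)
For $sKey In $oCounts ; the dictionary enumerates its keys
    ConsoleWrite($sKey & " - " & $oCounts.Item($sKey) & @CRLF)
Next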

(Just for context, I work in research, and I'm writing this to check survey data files. There are research-specific applications that do what I need, but they're not free, and the licenses typically aren't given to people in my department.)

