Compression Thoughts...

seandisanti · August 20, 2010

the problem with this type of compression, even if you cut out some overhead creating the bmp without using paint; is that you're still storing 3 values for every value you're storing.... if you're just looking to save space you can use your same algorithm for condensing the text, but then encode it into HEX, binary etc. Then you may end up saving some space, depending of course on the number of duplicated characters. if you're labeling characters that are not duplicated with a 1 for example you're going to end up increasing the number of characters and thus the size, no matter how efficiently you store the info.

Rather than grouping things the way they are displayed, perhaps take a whole text file, chop it into segments, (3-6 characters based on frequency) then use your system to map the locations of those segments. example, hit the original text file with StringRegExp mode 3 to chop up the file based on 3 character chunks, then remove dupes from that array. do that again for 4, 5, and 6(or up to however many you want to compress) . and whichever comes up with the fewer number of indexes after removing dupes is the one to go with. Then you create a header, which has each of the segments, with a number to associate with each, then you can just list the hex code number of the segments as they appear and you've effectively chopped up to 6 characters down to 2 every time regardless of duplication so the larger the file, the better the compression.

does that make any sense?

***edit***

assign codes for punctuations and white space too, but ignore them when creating chunks to reduce the number of unique chunks and raise efficiency of the compression

***edit 2***

working on a formula for dynamically deciding the best chunk size right now if you have your code check the formula

(Text file size - (chunk size * chunk count)) - (number of substitutions * (chunk size - coded chunk size))

after each pass, you can figure what size to make the chunks based on which length returns the lowest result Edited August 20, 2010 by cameronsdad

Drifter · August 20, 2010

a combination of bases.... .... interesting idea. Would this be something simple to figure out what base to use? or would it be trial and error / while loop? lol....

So once we determine what base to be in we would need some pattern finding algorithm... hmmmm i understand the logic behind it a bit, but the code itself and whether or not it would be that great a compression is still a little fuzzy.

Drifter · August 20, 2010

Cameronsdad: Your idea seems best so far. Unfortunately i dont FULLY see how this works, but it does make sense to chop into pieces and represent each piece in a more simple manner. In your mind can you pretty much see how this would lay out? If so is it possible to ask for a short example? Perhaps "Hello world!" or something? I'm not asking you to sit here and pound out all my code for me, I will try my best but I need to understand more clearly first.

seandisanti · August 20, 2010

a combination of bases.... .... interesting idea. Would this be something simple to figure out what base to use? or would it be trial and error / while loop? lol....
So once we determine what base to be in we would need some pattern finding algorithm... hmmmm i understand the logic behind it a bit, but the code itself and whether or not it would be that great a compression is still a little fuzzy.

i tried to lay out all of the logic, because i'm not able to code it up right now (listening to someone work right now, so can't do any real coding concurrently), using regex and arrays to identify the optimal the chunk size to use may take a couple of seconds of processing. i'd skip straight to 6, 8, and 10 character chunks to see the biggest difference quick. figure one regex test each will create your arrays, then a quick couple of lines per array to remove dupes from each, then use the formula in edit 2 up there to identify which is best for the file... your header in the new file will consist of x number of chunks and codes, delimited however you like, then the rest of the new file would be a straight list of codes. i'll write it up if you want but it will take a little while and i want a cut of the profits if you decide to sell it

Drifter · August 20, 2010

i'll write it up if you want but it will take a little while and i want a cut of the profits if you decide to sell it

Not sure where I want to go there. I definitely appreciate your willingness to help, but i also want to learn how its actually done; my intent is not to leech code but to learn in the process. "Teach a man to fish..."

I feel a little uncertain on where to set the boundary but some mix of helping/writing/me thinking it out for a while might be okay. This of course is entirely dependent on what time you have available, as maybe the giving me nudges here and there would take longer.

If its a system that works well, maybe we could have it do a few passes and try to achieve a certain ratio, or "time out" and flush the compressed data to file after so many "Max passes" if compression ratio isn't reached.... just some thoughts.

seandisanti · August 20, 2010

Not sure where I want to go there. I definitely appreciate your willingness to help, but i also want to learn how its actually done; my intent is not to leech code but to learn in the process. "Teach a man to fish..."
I feel a little uncertain on where to set the boundary but some mix of helping/writing/me thinking it out for a while might be okay. This of course is entirely dependent on what time you have available, as maybe the giving me nudges here and there would take longer.
If its a system that works well, maybe we could have it do a few passes and try to achieve a certain ratio, or "time out" and flush the compressed data to file after so many "Max passes" if compression ratio isn't reached.... just some thoughts.

i'm happy to help for free as long as the result remains free.

Drifter · August 20, 2010

i'm happy to help for free as long as the result remains free.

Okay, so is the result of this an "I add you, you add me" and we use the forums message system to carry on? Or is there some other thing that could use public view? I would be quite happy to spend the time (and my brain juices!) to learn on this topic cameronsdad.

seandisanti · August 20, 2010

Okay, so is the result of this an "I add you, you add me" and we use the forums message system to carry on? Or is there some other thing that could use public view? I would be quite happy to spend the time (and my brain juices!) to learn on this topic cameronsdad.

i don't mind working publicly, the people here are great contributors to ideas

Drifter · August 20, 2010

i don't mind working publicly, the people here are great contributors to ideas

I agree entirely. The question then is where we should operate. Should I create a thread in the "Showing off scripts" sub-forum? I could pass it off as "In development". I'm not so sure it belongs here... i do need "General Help" but its sort of a different kind of thing, I'm not asking how something functions in AutoIt or how many arguments it takes.

Ahhh, ambiguity....

seandisanti · August 20, 2010

I agree entirely. The question then is where we should operate. Should I create a thread in the "Showing off scripts" sub-forum? I could pass it off as "In development". I'm not so sure it belongs here... i do need "General Help" but its sort of a different kind of thing, I'm not asking how something functions in AutoIt or how many arguments it takes.
Ahhh, ambiguity....

lol, you're getting a bit ahead of yourself. initially i was thinking you maybe should have posted in chat since you weren't really asking for help, but it doesn't matter to me at all. when it needs to be moved somewhere else, i've no doubt smoke_n or one of the others will get it where it needs to go. i can probably do up a quick little proof of concept tonight after i get off (another 1.5 hours left)was already starting to write up the algo

Drifter · August 20, 2010

Valid point on the forum mods. I had no idea this community had a chat to be honest!

Edit: my bad, misread you, possibly because its a bit late and I've been thinking about various AutoIt projects I'm working on most of the day. Yeah "Chat" sub-forum seems appropriate.

Edited August 20, 2010 by Drifter

seandisanti · August 20, 2010

here's what i'm thinking. thinking multiple functions for scalability...

;compression algo
;read in the file
;use a regular expression to create an array of all punctuation and nonprintable characters
;save contents of file in on variable, and the result of stripping all whitespace and punctuation in another, $fulltext
;function to be called with 3 parameters; 1 chunk length to try, 2 $fulltext, 3 a globally defined array to store the resulting chunks
;function builds array based on chunks, and stores an array of chunks (non unique array at this point)
;should be called for every chunk length you want to try, i'd say 3,6,8,and 10 (won't often run into enough repeats on the 10+ char strings to justify trying more
;another function that takes is called with an array by reference
;removes duplicates and puts chunks in alpha order
;3rd function is passed each array, amd the length of $fulltext
;function returns an compression ratio, based on the length of the chunks, and the total savings in bytes
;running lowest total size is stored in global variable after function is called, so after loop is done calling it
;the most efficient chunk size will be stored
;all necessary punctuation and non printable characters are added to the correct array, and a second dimension
;is added for code numbers, which are HEX values (or could go with a higher base up to 36 to keep it at 2 characters
;if there are more than 225 unique chunks
;Quick replace with regex or even StringReplace($array[$i][0],$array[$i][1]) to create layer one map.

;at this point there is a file to be saved that should be a LOT smaller than the original, with a compression
;ratio that is more and more efficient depending on the length of the plaintext
;BUT at this point you still have a text file that could then be run through again for further compression.

czardas · August 21, 2010

a combination of bases.... .... interesting idea. Would this be something simple to figure out what base to use? or would it be trial and error / while loop? lol....
So once we determine what base to be in we would need some pattern finding algorithm... hmmmm i understand the logic behind it a bit, but the code itself and whether or not it would be that great a compression is still a little fuzzy.

As this is a theoretical debate, I was simply suggesting some thoughts which came into my mind. The concept I had in mind involves combining different bases in the output. So the output would contain several symbols each representing a repeating string in a designated base. How practical or efficient such a system would be is not clear to me, and I don't imagine that it would be easy to impliment.

To give you an idea. Let's say that a 20 digit binary string is found at 12 different locations. We can call this pattern A, and assign a symbol to represent it. The characters between pattern A could then be tested for repeats in base 3, base 4, base 5 etc. Each time a large enough repeat is discovered we assign a new symbol to represent it. Perhaps it would be more appropriate to start with a higher base and work your way down towards binary, or to determine the best starting base by trial and error.

It seems long winded and complicated, and I am uncertain as to the outcome. There may be some time saving methods such as only determining the longest, or most common repeat (in each base) prior to making a choice of action. There would have to be some imposed limits as the comparisons could become too numerous. However it is the idea of acheiving maximum compression that appeals to me.

Edited August 21, 2010 by czardas

Drifter · August 21, 2010

Czardas, I agree. "Max compression" is quite alluring, that's why I'm in this too, as well as learning stuff of course

Cameronsdad: This is all still a little hard for me to digest. So far what i understand is you are separating the punctuation and white space from the rest of the string, leaving you with the other string being mostly alpha-numerical? You then compare lengths of different chunk sizes after "redundancy" is removed and select the best candidate, putting the punctuation back into it somehow?

Sorry if I'm wrong, this is a little over my head and I'm just trying to summarize to see if I understand correctly.

Looks like a fun project though, I haven't used RegEx before

Drifter · August 22, 2010

I made a simple little container for the algorithm.

#include <EditConstants.au3>
#include <WindowsConstants.au3>
#include <GUIConstantsEx.au3>

Global $index = 1
Global $edit, $editLen, $label

AdlibRegister("checkEdit",1000)

GUICreate("Text Compressor",600,600)
$edit = GUICtrlCreateEdit("",20,40,560,460,BitOr($ES_MULTILINE,$ES_WANTRETURN,$WS_VSCROLL, $ES_AUTOVSCROLL))
GUICtrlSetLimit($edit,750000)
GUICtrlCreateLabel("Letters Typed:",20,510,70,20)
$label = GUICtrlCreateLabel("0 of 750000 (0%)",100,510,300,20)
$button = GUICtrlCreateButton("Compress",270,540,60,40)

GUISetState()


While 1
    $msg = GUIGetMsg()

    Select
        Case $msg = $button
            AdlibUnRegister("checkEdit")
            compress()
            AdlibRegister("checkEdit",1000)
        Case $msg = $GUI_EVENT_CLOSE
            Exit
    EndSelect

WEnd

Func compress()

    ;compression method goes here. Method determines the file size the string 
    ;that was entered would be in a real text file, and compares to the compressed 
    ;size, then asks user if they want to save the compressed data.

EndFunc


Func checkEdit()

    If Not (StringLen(GUICtrlRead($edit)) = $editLen) Then
        $editLen = StringLen(GUICtrlRead($edit))
        GUICtrlSetData($label, $editLen & " of 750000 (" & Round($editLen * 100 / 750000,3) & "%)")
    EndIf

EndFunc

Edited August 22, 2010 by Drifter

Richard Robertson · August 22, 2010

I'm just curious, why do you guys keep talking about numbers in different bases? The width the number takes up only changes in string form.

Drifter · August 22, 2010

yeah thats part of the problem that I had in doing some of my earlier calculations. "01001011" as binary takes 1 byte, where as in a string it is 8 bytes. However, we cant change the fact that computers run on switches, which can only have one of two states. >_>

Edit: not sure if thats what you meant btw....

Edited August 22, 2010 by Drifter

seandisanti · August 22, 2010

I'm just curious, why do you guys keep talking about numbers in different bases? The width the number takes up only changes in string form.

i thought we were talking about compressing text files? hence reducing the string width would be reducing the size...

Richard Robertson · August 23, 2010

i thought we were talking about compressing text files? hence reducing the string width would be reducing the size...

If you're converting the bytes of a character into any base less than base 256, you will only inflate the string.

czardas · August 23, 2010

If you're converting the bytes of a character into any base less than base 256, you will only inflate the string.

Of course you are right. However it may enable you to dispense with 90 percent of the string length by the use of pattern recognition and formulas for compression. Then you can store expansion instructions in bytes. Or am I missing something? Edited August 23, 2010 by czardas

Sign In

Compression Thoughts...

Recommended Posts

seandisanti

Top Posters In This Topic

Top Posters In This Topic

Posted Images

Drifter

Drifter

seandisanti

Drifter

seandisanti

Drifter

seandisanti

Drifter

seandisanti

Drifter

seandisanti

czardas

Drifter

Drifter

Richard Robertson

Drifter

seandisanti

Richard Robertson

czardas

Create an account or sign in to comment

Create an account

Sign in

Browse

AutoIt Resources

Release

Beta