Compression Thoughts...

Drifter · August 19, 2010

First off I want to say that to the best of my knowledge I am posting in the appropriate forum for this. I do not have a finished script to share, and I do not have a question about any AutoIt functions or how AutoIt works. However, since this forum is labeled "General Help...", I think my posting here is the best place.

My Thoughts:

I remember reading or hearing in my high-school networking class that there was a consideration to add color to the light pulses sent down fiber optic lines. The idea behind this was that since the current light pulses only had a single color, information could be sent down the line "faster" because adding the color would allow more data to be sent in each pulse.

My Hypothesis:

I thought about this information and reasoned that one could do similar things with data stored on a hard drive. A program could be made to bring the data to its original form, and theoretically the larger the encode/decode program in disk space, the more compression that could be made for the data itself.

Thought Experiment:

I decided to test this idea by using bitmap information to compress data written in a text file. Attached is a little drawing work i did on this, showing the math. Please allow me to walk through it in case this isn't clear.

In each pixel of color in a picture, you have a Red, Green, and Blue component, each of which holds 256 colors. therefore each pixel has roughly 16.8 million possible values.

I tried to figure out how many possible characters there would be that i would want to represent. There is a-z both lower and upper case, 0-9 both normal and shift (meaning !@#$% etc), the 20 characters near the enter key (-_=+[{]}\|;: etc), ~ and `, and also space and enter keys. This adds to 96 which i reasoned i could reserve 128 for, filling the other 32 slots with things like ¿ or § or other common multi-lingual characters.

Since 16,777,216 = 128^3 * 8 this means that each pixel holds 3 characters of data. However as you can see, this type of compression does not work. I tried saving the pixel in different formats like JPG, PNG using IrfanView, but it seems Ms Paints 24 bit format is the smallest, and what i get is inflation of data, not compression.

Conclusion:

I'm not sure if it's just the method I am using that is incorrect, or if it is my hypothesis. Is my hypothesis wrong? Does anyone have any ideas how i could compress data in a way that if my compressing program gets larger and larger the files that i want to compress can be smaller and smaller? The math isn't working for me so far...

Thanks all for any ideas or thoughts on this matter. I really appreciate it!

Drifter · August 19, 2010

Additional comment: I just tried 16 pixels vs 16*3 = 48 characters. In this case the picture vs text ratio was 102 bytes/48 bytes. This is a marked improvement from the ratio of 58 bytes / 3 or 9 bytes as done in the original experiment.

Also at 100 pixels vs 100*3 characters, i got a ratio of 374/300.

Maybe this WILL work if i use longer messages? No idea why Mspaint can save 16 pixels without the file being 16 times as large but hey, works for me?

Edited August 19, 2010 by Drifter

Drifter · August 19, 2010

grah, tried 10k pixels vs 30k text and i get 29.3kb / 29.2kb..... I'm GUESSING that:

as the data size approaches infinity, the ratio of picture size to text size = 1.

This STILL means my idea doesn't work. I would need to see values less than 1 for it to be working yeah? And now I sit and wait for replies.

Edited August 19, 2010 by Drifter

AndyG · August 19, 2010

I´am trying to explain with my bad english^^

Compression only works which MUCH Data. The more data, the better!

Example for Compression "Bitmaps":

Lets say, we have a 24bit-Bitmap RGB, some "Pixel"-Colours named A,B,C,D,E.F....

The Bitmap-Data could be as following:

ABCDDDDDEFGHHHHHHHHHHHHHDEFDEFDEFDEFABC...

The "easiest" Compression Method is to count the seqentially Pixels of the same colour.

ABC4DEFG13HDEFDEFDEFDEFABC...

Therefor you could make a "Stringformat", a ~~Dword~~ Byte could be the "Informationcenter" of the following Bytes:

Byte = 8 Bits

Bit 7 is a Flag which tells us if the following pixels are all the same colour or not

Bit 6-0 (7 Bit) is the number of "Pixels" following the Byte

Lets show this "control"-Byte with # in the Bitmap. So what we have is:

#ABC#D#EFG#H#DEFDEFDEFDEFABC...

which is always shorter than the original "Bitmap"

Now we see, that there is a Sequence of DEF DEF DEF in the Bitmap. We could compress it if we expand our #-Byte to a Word with the Information of the following Bytes..

##ABC##D##EFG##H ##DEF##ABC....which is not much longer^^

Edited August 19, 2010 by AndyG

Drifter · August 19, 2010

AndyG:

Thanks for your insight. I think I am only partly understanding you though. The Information Center part was confusing to me, but that's possibly because I've never been down this road before

I do see what you mean though in the way of showing "5D" instead of "DDDDD" but i doubt you would have lots of places in text, for example, where the same letter repeats.

Proof: looking at this post i count (roughly) 8 times where this happens, and that's only a double repeat, so no text compression for something like "2S".

I will read through your information though and see if it makes more sense after thinking about it a few times.

Edited August 19, 2010 by Drifter

kaotkbliss · August 19, 2010

Since this is all theoretical...

What if it was converted to binary? then there would be lots of repeats...

AndyG · August 19, 2010

If you have "plain" text (Ansi), the easiest way to "compress" ist to use only 7 Bit (can hold 127 different chars)

7-Bit Ascii....good ol´times If you are a hardliner, 6 Bit are enough.....although....5 bit=32 chars

What if it was converted to binary? then there would be lots of repeats...

No, if you mean "binary" like 0x0A0B0CFFFAFFFFFFFFFF0A0B0C... where is the Problem?

If you think binary in the sense of "many Bytes of an i.e. EXE-File", yes...some data compress well, others not.

Edited August 19, 2010 by AndyG

kaotkbliss · August 19, 2010

I mean binary like 0101011101 lol

Drifter · August 19, 2010

theoretically there should be some repeats there, but mainly in ASCII whose values are either consecutively low or high (allowing lots of 0's or 1's respectively.)

I have a feeling that my idea of having some massive 100 MB compression program to cut down files like crazy isnt going to happen?

czardas · August 19, 2010

I have also had some ideas about compression (other than following). Somebody mentioned the idea of factorization to me, and I thought the idea was ingenious until I realized that it was only useful for a limited set of possible values. For example:

1,000,000,000,001 = 10^12+1
999,999,999,999 = 10^12-1

Problems arise with both selecting appropriate factors and with the less than satisfactory results I was achieving. The idea has since been ditched. I now think the only reasonable approach to compression is pattern recognition as suggested by AndyG. I think the pixels idea is quite novel, but I doubt it can amount to anything better than a 1/1 ratio. Although I have no proof of this.

Edited August 19, 2010 by czardas

Drifter · August 19, 2010

interesting thoughts there czardas! It is interesting to me as well although yes it is more efficient at listing numbers close to those powers of 10, or whatever your base would be.

Yes I know my pixel idea is a bit of a flop, but if nothing else, its a neat way of encoding data without much inflation yeah?

czardas · August 19, 2010

Yes I know my pixel idea is a bit of a flop, but if nothing else, its a neat way of encoding data without much inflation yeah?

I wouldn't say the idea was a flop, if it worked it would be brilliant. Besides, it may yet have other applications or lead you to a solution you wouldn't have thought of otherwise.

Drifter · August 19, 2010

I would give it a go with sound (mp3 file?) but it seems near impossible. The idea would be to encode using different lengths and frequencies of notes. If there was a really accurate way to "listen" for the computer, i could have notes that vary in length by a single millisecond. However i never heard of any language which can analyze speaker tones in this manner.... just sending my thoughts

czardas · August 19, 2010

Audio-visual data compression.

Okay now you have a lot more permutations. You could assign a colour and a musical frequency to all words, or common patterns, longer than a minimum string length. Perhaps leaving the remaining values unassigned. Then we upload a secret government database, compressed into a 5 minute video clip, on YouTube.

Cool!

And if you separate the audio from the visual, you have the world's most entertaining encryption.

Edited August 19, 2010 by czardas

Drifter · August 19, 2010

Great idea, a bunch of what appears to be visual "Noise"/"Static" accompanied by what sounds like half assed 80's computer generated music.

Again though sadly that's out of the range of what i know how to do or what AutoIt (and anything else i can think of) allows.

Edited August 19, 2010 by Drifter

czardas · August 19, 2010

Call it something like Data Punk.

Who knows, one day it might become a new form of music.

Edited August 19, 2010 by czardas

Drifter · August 19, 2010

czardas: you have an imagination as vivid as i do, only in a different way somehow... its awesome!

Back onto topic though.... so really the only realistic way of compression is predicting possible repeating patterns that are proven by some sort of statistical research or etc?

There's no way of cramming the data into some other format which mathematically is proven to be able to store more data in less hard disk space? Or making some sort of huge program that compresses, but results in very small compressed files?

Edited August 19, 2010 by Drifter

czardas · August 19, 2010

Okay, back on track.

I have not read up on this, so don't take my words as gospel. I guess there are many ingenious algorithms for compression. I think the main problem is that they are generally less effective than desired. I'm not sure, but I have a feeling that this may be partly due to processing power.

To illustrate my reason for thinking this, consider again the factorization idea above. Testing all possible factorizations takes too long and there is no guarantee that the resulting formulas will be any smaller than the original data. Pattern recognition can also be time consuming.

The idea I have been playing around with involves attempting to match the first half the string with the second half. Then testing smaller lengths until you have identified all possible repeat patterns. After that, and by some mathematical method, assertain the most effective compression and generate a key to uncompress the data. These searches will be long winded and the concept requires further thought. I don't think it will be practical for general use because of the processing time. However there are some circumstances where data storage is more important than the cost of a few main frames. Plus, with a key, uncompression will be much faster than compression.

I am also convinced that the use of numerical base conversions will allow greater compression. And that just about sums up everything I can think of at the moment.

Edited August 19, 2010 by czardas

Drifter · August 19, 2010

I was thinking about that, starting out basic and having say, base 36 (a-z) + (0-9)... but the problem there is that you would have to represent the compressed data as a string, thus each piece of the string takes up a byte. Does this work better at higher bases? base 72? (a-z) + (A-Z) + (0-9) + (!@#$% etc)

czardas · August 19, 2010

No, I don't mean it like that. I mean there may be repeating patterns, which allow greater compression, in a different numerical base (or combination of bases).

Edited August 19, 2010 by czardas

Sign In

Compression Thoughts...

Recommended Posts

Drifter

Top Posters In This Topic

Top Posters In This Topic

Posted Images

Drifter

Drifter

AndyG

Drifter

kaotkbliss

AndyG

kaotkbliss

Drifter

czardas

Drifter

czardas

Drifter

czardas

Drifter

czardas

Drifter

czardas

Drifter

czardas

Create an account or sign in to comment

Create an account

Sign in

Browse

AutoIt Resources

Release

Beta