
Compression Thoughts...



Messy Er...?

That is something I can't deny. It might also not work at all, but without trying, I won't know the answer. Unless of course I find an obvious refutation.

Here's how it might work (or not).

Say this is the start of some binary data.
1011 1111 1010 0010 1111 1101 0111 1110 1011 1111 0111

I may find multiple repetitions of a long section. (scaled down here for simplicity)
101(1 1111 101)0 0010 (1111 1101) 0(111 1110 1)0(11 1111 01)11

A = 11111101

So we can create a new representation.
101 A 00010 A 0 A 0 A 11

In this case there won't be any compression, but imagine that such patterns happen to occur over much longer sections when searched for in a different base. I don't know what possibilities will arise. It may be highly unpredictable.
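
To make the idea a bit more concrete, here is a rough sketch in Python of the substitution above. It assumes the data is handled as a plain string of '0'/'1' characters and that only one repeated pattern of a fixed 8-bit width is searched for; the function names and the fixed width are illustrative choices, not part of any finished scheme.

from collections import Counter

def most_common_pattern(bits: str, width: int = 8) -> str:
    # Count every width-bit window, starting at every bit offset,
    # and return the one that occurs most often.
    windows = Counter(bits[i:i + width] for i in range(len(bits) - width + 1))
    return windows.most_common(1)[0][0]

def substitute(bits: str, pattern: str, symbol: str = "A") -> str:
    # Replace non-overlapping occurrences of the pattern with a single symbol.
    return bits.replace(pattern, symbol)

data = "1011 1111 1010 0010 1111 1101 0111 1110 1011 1111 0111".replace(" ", "")
pattern = most_common_pattern(data)   # '11111101' for this input
print("A =", pattern)
print(substitute(data, pattern))      # prints 101A00010A0A0A11

Of course the symbol "A" would itself need an unambiguous binary encoding for this to be a real compression step; the sketch only shows the pattern search and the substitution.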


Searching for the matched patterns requires a lot of processing time. As your window gets bigger, it'll take longer and longer.

For instance, you would need to make 666 (scary coincidence) checks in "1011 1111 1010 0010 1111 1101 0111 1110 1011 1111 0111" using one-byte-wide blocks. The 44-bit string contains 37 byte-wide windows, one starting at each bit position. The first 36 checks would compare "1011 1111" against every window that starts after it, the next 35 would compare "011 1111 1" against every window after that, and so on, down to the single check between the last two windows, for 36 + 35 + ... + 1 = 666 comparisons in total.

666 checks for five and a half bytes isn't going to scale well.
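
To make the scaling concrete, here is a quick Python snippet that counts those pairwise window comparisons for a brute-force search; the all-pairs comparison scheme and the sample sizes are assumptions used purely for illustration.

def comparison_count(total_bits: int, window_bits: int = 8) -> int:
    # A window can start at any of n = total_bits - window_bits + 1 bit offsets;
    # comparing every window against every later one takes n*(n-1)/2 checks.
    n = total_bits - window_bits + 1
    return n * (n - 1) // 2

for bits in (44, 8 * 1024, 8 * 1024 * 1024):   # 5.5 bytes, 1 KiB, 1 MiB
    print(bits, "bits ->", comparison_count(bits), "comparisons")

For the 44-bit example that gives exactly 666; for a single mebibyte it is already on the order of 10^13 checks, which is the quadratic blow-up being described.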


I see the point that you are making. Designing an efficient search algorithm is the challenge, and that part is what confuses me. I think some form of guesswork will be required: only searching a fraction of the possible patterns and then guessing again, or some kind of fuzzy logic - whatever that means? ;)


  • 3 weeks later...

cameronsdad:

Still willing to assist here? I have started university, so unfortunately I can't spend as much time here as I'd like, but I might take all of what you said to a professor if I get the time.

I did talk to a professor who works with compression, and they said you can't get a much better system than the common algorithms already in use, unless you are willing to throw a lot of extra computation at it, in which case you might be able to squeeze out another 1% of compression.

I do wonder if there really is any good new way to compress. ;)



Yes, still willing to help; I've actually just finished a couple of big projects at work that were keeping me pretty busy. As for the assertion that it's as good as it's going to get, that's just silly. I'm not claiming I have any profound insight on how to accomplish the goal, BUT every new way to do something was thought impossible or not worth trying until someone actually did it, and then it just became the way it's done.


I agree with that, but I do think that any further compression will come at the cost of processing time, and since hard drives can now hold 1 TB and are cheap, processing time is seen as more valuable than storage space. So we are possibly going against the grain of what most compression aims for (being fast).

However, I will again add that MAYBE this isn't entirely true either; I'm still interested in what your method is. Unfortunately I still have trouble understanding what you've posted so far on the issue.



Perhaps a better way to describe what I was talking about before is as a simple cipher. Think of a substitution cipher, where typically you have one or more replacement letters for each plain (uncompressed) letter - except that instead of increasing the length of the replacements, you're increasing the length of the plaintext segments that each replacement stands for... maybe that doesn't clear it up much... Look at the ASCII characters: each is represented by a code, a direct 1-to-1 relationship. What if, instead of having one ASCII code per letter, you had one code for each 3-character permutation of letters? Then the longer the file, the more it would benefit from switching to that scheme.

Now, assuming you're still with me to this point: because text varies so much, 3 may not be the optimal chunk size for any given text; maybe a book can be broken up into fewer unique 4-character combinations than 3-character ones, or 5 characters, and so on. Using regular expressions, or any other set-based algorithm, finding the ideal chunk length and mapping out the replacements should be trivial even for large files...
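
As a sketch of that chunk-substitution idea in Python (the fixed chunk length, the one-byte codes, and the helper names are all illustrative assumptions, and the codebook would still have to be stored alongside the output for decompression):

def build_codebook(data: bytes, chunk_len: int):
    # Map each distinct chunk of length chunk_len to a one-byte code,
    # provided there are few enough distinct chunks to fit in one byte (256 here).
    chunks = [data[i:i + chunk_len] for i in range(0, len(data), chunk_len)]
    unique = sorted(set(chunks))
    if len(unique) > 256:
        return None
    return {chunk: bytes([code]) for code, chunk in enumerate(unique)}

def compress(data: bytes, chunk_len: int):
    codebook = build_codebook(data, chunk_len)
    if codebook is None:
        return None   # this chunk length doesn't fit the scheme
    body = b"".join(codebook[data[i:i + chunk_len]]
                    for i in range(0, len(data), chunk_len))
    return codebook, body

sample = b"abcabcabcxyzxyzabc" * 10
result = compress(sample, chunk_len=3)
if result is not None:
    codebook, body = result
    print(len(sample), "->", len(body), "bytes, plus a", len(codebook), "entry codebook")

Trying several chunk lengths and keeping whichever produces the fewest distinct chunks would then just be a loop around compress.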

Perhaps an example is in order... In the English alphabet, if you ignore case and white space, you've got 26 (1A in hex) different letters. There are only 676 (2A4 in hex) possible 2-character combinations, though any given file will probably contain fewer than 255 (FF in hex) different 2-character combinations. If you can find a chunk length that lets the file be broken into 255 or fewer unique chunks (the fewer the better), then replacing each chunk with the number that stands for it would save space and give you lossless compression, because you'd be storing 3 or more characters in the two hex characters (one byte) each code takes up... I fear I may have started repeating myself a bit now; did I add ANY clarity to the idea?
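
A quick way to test that counting argument on a real file is to try several chunk lengths and see how many distinct chunks each produces. Here is a small Python check; the sample text and the range of lengths tried are just placeholder choices.

def unique_chunk_counts(text: str, max_len: int = 6):
    # For each candidate chunk length, count the distinct chunks and note
    # whether the count would fit within one byte of code space.
    for chunk_len in range(2, max_len + 1):
        chunks = {text[i:i + chunk_len] for i in range(0, len(text), chunk_len)}
        fits = "fits in a one-byte code" if len(chunks) <= 255 else "too many"
        print(f"chunk length {chunk_len}: {len(chunks)} unique chunks ({fits})")

unique_chunk_counts("the quick brown fox jumps over the lazy dog " * 50)

The sample here is artificially repetitive, so the counts stay small; how real text behaves at longer chunk lengths is what would decide whether the scheme pays off.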


A little, in that a file most likely won't contain every theoretically possible pattern. This is becoming easier to grasp now. ;)

Exactly. So by replacing the patterns with single- or double-character replacements, the longer the patterns or chunks we can replace, the greater the compression ratio.

This is correct. I learned this from a compression method I just tried that uses an English dictionary for the replacements. Since my method needs at most 3 characters to represent any word in my list, the best compression comes from the longer words. ;)
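
For what it's worth, here is a minimal Python sketch of that kind of dictionary substitution, not the actual script: every word in an assumed word list gets a code of "~" plus a base-36 index (at most 3 characters for lists of up to 36 x 36 = 1296 words). The word list, the "~" marker, and the code format are all illustrative assumptions.

import string

DIGITS = string.digits + string.ascii_lowercase   # base-36 alphabet

def to_base36(n: int) -> str:
    # Encode a non-negative index compactly; 0..1295 needs at most two digits.
    out = ""
    while True:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
        if n == 0:
            return out

# Hypothetical word list; a real one would favour the longest common words.
WORDS = ["compression", "dictionary", "replacement", "algorithm", "character"]
ENCODE = {w: "~" + to_base36(i) for i, w in enumerate(WORDS)}
DECODE = {code: w for w, code in ENCODE.items()}

def compress(text: str) -> str:
    return " ".join(ENCODE.get(word, word) for word in text.split(" "))

def decompress(text: str) -> str:
    return " ".join(DECODE.get(word, word) for word in text.split(" "))

sample = "dictionary compression needs only one short replacement per word"
packed = compress(sample)
print(packed)                       # ~1 ~0 needs only one short ~2 per word
assert decompress(packed) == sample

As the post says, the saving comes from the longer words: "compression" drops from 11 characters to 2, while short words are left alone.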

