
Best non-alpha/numeric characters for compression?

Posted: Thu May 22, 2008 12:27 pm
by JAB Creations
What are the best single characters for compression?

For example, I'll be imploding a lot of property+value pairs together, separated by a character, and then compressing the string. Since I can choose the separator, I'm wondering which non-alpha/numeric character would yield the greatest compression level?

With the same type and level of compression, take for example five hundred periods versus five hundred question marks. My half-educated guess is that it takes fewer 1s and more 0s to define a dot than all the dots that make up a question mark character, and so the periods would yield a smaller size/greater level of compression?

Thoughts?

Re: Best non-alpha/numeric characters for compression?

Posted: Thu May 22, 2008 1:29 pm
by VladSun
Generally speaking, less entropy leads to a bigger compression ratio. Entropy measures how "chaos"-like the information is.
In your example, 500x. and 500x? will give the same compression ratio, because the information is handled in "byte"-sized blocks.
So I think the choice of delimiter wouldn't affect the compression ratio.
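
The claim above is easy to check. Here is a minimal sketch in Python, assuming zlib (DEFLATE) as a stand-in for whatever compressor is actually used; `compressed_size` is a hypothetical helper name, and other compressors may behave slightly differently.

```python
import zlib


def compressed_size(data: bytes, level: int = 9) -> int:
    """Return how many bytes zlib needs for `data` at the given level."""
    return len(zlib.compress(data, level))


# Assumption: plain single-byte ASCII input, as in the example above.
periods = b"." * 500
question_marks = b"?" * 500

size_periods = compressed_size(periods)
size_question_marks = compressed_size(question_marks)

print("500 x '.':", size_periods, "bytes")
print("500 x '?':", size_question_marks, "bytes")
```

Both inputs are a single byte value repeated 500 times, so the compressor sees the same structure in each; any difference in output size should come from encoding details rather than from the "shape" of the character.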

Re: Best non-alpha/numeric characters for compression?

Posted: Thu May 22, 2008 1:41 pm
by JAB Creations
Would I be correct to presume the following...

1.) That not all characters can be represented by a single byte (eight bits)?

2.) The bits in each byte can be directly compressed by some/all compression algorithms? For example, I presume the character '0' as a byte would be '00000000' in bits, which may compress better than a byte such as '011000111'?

Re: Best non-alpha/numeric characters for compression?

Posted: Thu May 22, 2008 2:24 pm
by VladSun
Entropy-based compression algorithms have no prior knowledge of where the data comes from (e.g. whether it is English text, a picture or a video), so they use 1-byte block coding.
And for your example:
00000000 -> entropy = 0 (no chaos!) => compression ratio would be very big
011000111 -> entropy > 0 => compression ratio would be lower.
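
To put a number on "entropy", here is a minimal Python sketch that estimates Shannon entropy over the byte values in a stream. It assumes entropy is measured across the bytes of the input (how varied they are), not across the bits inside one byte; `byte_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of -p * log2(p) over every distinct byte value in the input.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


uniform = b"\x00" * 64    # one byte value repeated: no "chaos"
mixed = bytes(range(64))  # 64 different byte values, each used once

print("uniform:", byte_entropy(uniform), "bits/byte")
print("mixed:  ", byte_entropy(mixed), "bits/byte")
```

By this measure a run of any single repeated byte has zero entropy, whichever byte it is, which is why the particular character matters so little.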

I've tried to RAR two files, one with 710x. and one with 710x?. The second one was 2 bytes smaller ;), but that is only 1-2 percent smaller.
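
A similar experiment can be run on something closer to the original use case. This is a rough sketch in Python, assuming zlib instead of RAR and using made-up property=value pairs (`pairs`, `size_with_delimiter` and the candidate list are all hypothetical); real data may give different numbers.

```python
import zlib

# Hypothetical sample data; real pairs would come from the application.
pairs = [f"prop{i % 10}=value{i % 7}" for i in range(200)]


def size_with_delimiter(delimiter: str) -> int:
    """Join the pairs with `delimiter` and return the zlib-compressed size."""
    joined = delimiter.join(pairs).encode("ascii")
    return len(zlib.compress(joined, 9))


candidates = [".", "?", ";", "|", ","]
sizes = {d: size_with_delimiter(d) for d in candidates}

# The uncompressed length is the same for every single-character delimiter.
raw_size = len(";".join(pairs).encode("ascii"))

for delimiter, size in sizes.items():
    print(repr(delimiter), "->", size, "bytes (raw:", raw_size, "bytes)")
```

One thing worth checking with real data is whether the delimiter already occurs inside the values, since that changes how often the byte appears and could shift the result by a little.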