Best non-alpha/numeric characters for compression?

JAB Creations
DevNet Resident
Posts: 2341
Joined: Thu Jan 13, 2005 6:44 pm
Location: Sarasota Florida

Best non-alpha/numeric characters for compression?

Post by JAB Creations »

What are the best single characters for compression?

For example, I'll be imploding a lot of property+value pairs together, separated by a single character, and then compressing the resulting string. Since I can choose the separator, which non-alphanumeric character would yield the greatest level of compression?

With the same type and level of compression, take for example five hundred periods versus five hundred question marks. On a half-educated guess, I would presume that it takes fewer 1s and more 0s to define a dot than all the dots that make up a question mark character, and that the periods would therefore yield a smaller size/greater level of compression?

Thoughts?
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Best non-alpha/numeric characters for compression?

Post by VladSun »

Generally speaking, less entropy leads to a bigger compression ratio. Entropy is a measure of how "chaotic" (unpredictable) the information is.
In your example, 500x. and 500x? will give the same compression ratio, because the compressor works on byte-sized blocks of information and not on the shape of the character as drawn.
So I think the choice of delimiter wouldn't affect the compression ratio.
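This is easy to check with a quick experiment. Below is a minimal sketch in Python, whose zlib module uses the same DEFLATE method as PHP's gzcompress(). The property/value pairs are made up for illustration (they stand in for whatever you are really imploding), and the list of delimiters is just a sample; real data may behave a little differently.

```python
import zlib

# Hypothetical property/value pairs, standing in for the real data.
pairs = [f"prop{i}=value{i % 7}" for i in range(200)]


def compressed_size(delimiter: str) -> int:
    """Join the pairs with the delimiter and return the zlib-compressed size in bytes."""
    joined = delimiter.join(pairs).encode("ascii")
    return len(zlib.compress(joined, 9))


# Compare a handful of candidate delimiters on the same data.
sizes = {d: compressed_size(d) for d in [".", "?", "|", ";", ","]}

for delimiter, size in sizes.items():
    print(repr(delimiter), size)
```

If the delimiter does not otherwise occur in the data, I would expect the sizes to come out the same or within a byte or two of each other, but run it on your own strings before trusting that.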
There are 10 types of people in this world, those who understand binary and those who don't
JAB Creations
DevNet Resident
Posts: 2341
Joined: Thu Jan 13, 2005 6:44 pm
Location: Sarasota Florida

Re: Best non-alpha/numeric characters for compression?

Post by JAB Creations »

Would I be correct to presume the following...

1.) That not all characters can be represented by a single byte (eight bits)?

2.) The bits in each byte can be directly compressed by some/all compression algorithms? For example, I presume the character '0' byte would be '00000000' in bits, which may compress better than a byte such as '01100011'?
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Best non-alpha/numeric characters for compression?

Post by VladSun »

Entropy-based compression algorithms have no prior knowledge of where the data comes from (i.e. whether it is English text, a picture or a video), so they use 1-byte block coding.
As for your example:
00000000 -> entropy = 0 (no chaos!) => compression ratio would be very big
01100011 -> entropy > 0 => compression ratio would be lower.

I tried RAR on two files, one with 710x. and one with 710x?. The second one was 2 bytes smaller ;), but that is only a 1-2 percent difference.
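To put numbers on the entropy idea, here is a rough sketch in Python (not PHP, and using zlib rather than RAR, so the sizes will not match my RAR test). The byte_entropy helper is my own illustrative function, not a library call. It computes the Shannon entropy from byte frequencies only, which is a simplification of what real compressors model.

```python
import math
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, based on byte frequencies only."""
    counts = Counter(data)
    total = len(data)
    # Sum of p * log2(1/p) over every byte value that occurs.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


dots = b"." * 710    # 710 periods
marks = b"?" * 710   # 710 question marks
mixed = b".?" * 355  # same length, two alternating characters

for label, data in [("dots", dots), ("marks", marks), ("mixed", mixed)]:
    print(label, byte_entropy(data), len(zlib.compress(data, 9)))
```

A file made of one repeated character has entropy 0 whichever character it is, so the bit pattern of the byte itself makes no difference. The alternating string has 1 bit of entropy per byte by this measure, yet it should still compress very well, because DEFLATE also exploits repeated sequences and not just byte frequencies.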