Get file encoding

Posted: Sat Sep 02, 2006 11:52 am
by Ollie Saunders
Is it possible using bash or PHP to get the details of the character encoding of a bunch of files?
What resources are there for getting encoding information out of a file in either bash or PHP?

Posted: Sat Sep 02, 2006 11:56 am
by Chris Corbyn
mb_detect_encoding() will work on the data inside the file.

Posted: Sat Sep 02, 2006 12:22 pm
by Ambush Commander
It's a tricky problem, since there's (somewhat of) a chicken-and-egg problem: if the encoding is specified inside the file, how do you know which encoding to use to read that specification in the first place?

If you're dealing with XML or HTML, try to detect the encoding from the <?xml declaration or a <meta> tag.
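
For the XML case, a rough bash sketch of that sniffing step might look like the following. The function name xml_declared_encoding is made up for illustration, and it assumes the file begins with an ASCII-compatible declaration (no BOM, not UTF-16), which sidesteps the chicken-and-egg problem only for the common encodings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the encoding named in an XML declaration, if any.
# Prints nothing when the first line has no <?xml ... encoding="..."?> declaration.
xml_declared_encoding() {
    # Only the first 200 bytes are read; the declaration must start the file.
    head -c 200 "$1" |
        LC_ALL=C sed -n "1s/^<?xml[^>]*encoding=[\"']\([A-Za-z0-9._-]*\)[\"'].*/\1/p"
}
```

Calling xml_declared_encoding on a file is enough to try it. An empty result means the declaration was missing or unreadable, not that the file is in any particular encoding.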

If you're dealing with plaintext, you're out of luck. You can try using mb_detect_encoding(), but 1) it isn't always right and 2) it's not always available, since it needs the mbstring extension.
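
On the bash side, one thing that can be checked reliably is whether a file is at least valid UTF-8. This is only a sketch: it assumes an iconv binary is on the PATH, and is_valid_utf8 is a hypothetical name. Passing the check shows the bytes are consistent with UTF-8, not that the author actually wrote the file in it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the file decodes cleanly as UTF-8.
# Assumes iconv is installed; it exits non-zero on an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

# Report on every file named on the command line.
for f in "$@"; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8 (or plain ASCII)"
    else
        echo "$f: not UTF-8, encoding unknown"
    fi
done
```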

Posted: Sat Sep 02, 2006 12:26 pm
by Ollie Saunders
d11wtq wrote: mb_detect_encoding() will work on the data inside the file.
Thanks d11.

Ambush Commander wrote: If you're dealing with plaintext, you're out of luck.
Yeah, I'm realising that now, not that it was particularly surprising.

Ambush Commander wrote: You can try using mb_detect_encoding() but 1) it isn't always right and 2) it's not always available.
It seems to report ASCII for everything, but once you put a non-ASCII character in the file it reports the encoding correctly. Better than nothing.

Oh, does anyone have a recursive version of glob() by any chance?

Posted: Sat Sep 02, 2006 12:27 pm
by Ambush Commander
Well, you *should* be converting them to UTF-8 anyway, and ASCII is a strict subset of UTF-8, so if it reports ASCII, you can treat the file as UTF-8 and be done with it.
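
If you'd rather confirm the "it's only ASCII" case from bash, a minimal sketch is to count the bytes with the high bit set. The name is_ascii_only is hypothetical; the idea is just that a file with no bytes above 0x7F is byte-for-byte identical in ASCII and UTF-8.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the file contains only 7-bit ASCII bytes.
is_ascii_only() {
    local high_bytes
    # Delete every byte in 0x00-0x7F and count whatever is left over.
    high_bytes=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c)
    (( high_bytes == 0 ))
}
```

Files that pass need no conversion at all; only the ones that fail have to be identified and converted.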