Identifying text vs. binary files
Posted: Thu Feb 15, 2007 7:05 pm
by benwei
Hello there,
I am trying to find a simple way of differentiating plaintext from binary files without resorting to either the Fileinfo PECL module or the (now deprecated) mime_content_type() function. Does anybody know of one?
Thanks,
Ben
Posted: Fri Feb 16, 2007 1:30 am
by ryuuka
use the
function to get the extension of the file
but this is a very insecure way to do this
Re: Identifying text vs. binary files
Posted: Fri Feb 16, 2007 3:09 am
by shwanky
benwei wrote:Hello there,
I am trying to find a simple way of differentiating plaintext from binary files without resorting to either the Fileinfo PECL module or the (now deprecated) mime_content_type() function. Does anybody know of one?
Thanks,
Ben
If exec isn't disabled on your server you can do exec("file -b " . escapeshellarg($filename), $result); ^.^
Posted: Fri Feb 16, 2007 3:14 am
by Chris Corbyn
Just look for NULL bytes.
Code: Select all
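function is_binary_file($path)
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }
    $handle = fopen($path, "rb");
    if ($handle === false) {
        return false;
    }
    $binary = false;
    // fread() returns "" at EOF, not false, so loop on feof() and read in chunks
    while (!$binary && !feof($handle)) {
        $chunk = fread($handle, 8192);
        if ($chunk === false) {
            break;
        }
        $binary = strpos($chunk, "\0") !== false;
    }
    fclose($handle);
    // true only if a NULL byte was found
    return $binary;
}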
Posted: Fri Feb 16, 2007 8:09 am
by feyd
I'm not sure all binary files will have NULL bytes. While the probability that they do is high, it's not guaranteed.
"Standard" text files use a maximum of 7 bits per character. Scanning the file for bytes outside that range may give more reliable results. However, that depends on what you consider text files.
Posted: Fri Feb 16, 2007 8:26 am
by Chris Corbyn
Unicode is still text. I would certainly say a text file contains no NULL bytes, but I can't say for sure that a binary file WILL contain NULL bytes. It's unlikely that a binary file won't have one, however.
I saw another app (haven't got the foggiest where now!) which took a substr() of the first 1000 bytes then checked that for strpos() of \0. I'm not sure if they had a reason for only taking the first 1000 bytes other than memory/speed issues.
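A rough sketch of that idea, for illustration only. The function name looks_binary_by_prefix() and the $length parameter are made up here, and instead of substr() on the whole file it asks file_get_contents() for just the first bytes, which is a swapped-in shortcut rather than what that app did:

```php
<?php
// Hypothetical helper: checks only the first $length bytes for a NULL byte.
function looks_binary_by_prefix($path, $length = 1000)
{
    if (!is_file($path) || !is_readable($path)) {
        return false; // this sketch treats unreadable paths as "not binary"
    }
    // Offset 0 plus a length limit reads only the prefix, not the whole file
    $head = file_get_contents($path, false, null, 0, $length);
    if ($head === false) {
        return false;
    }
    return strpos($head, "\0") !== false;
}
```

The trade-off is the obvious one: a NULL byte that first appears after the prefix goes unnoticed, so this is quicker but less certain than scanning the whole file.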
Posted: Fri Feb 16, 2007 8:31 am
by feyd
d11wtq wrote:Unicode is still text.
That's why I put "standard" in quotes and added that what the user considers text may or may not fall into it.
Since all files are binary, I would actually suggest analyzing the file data to determine if it is what the user considers as text.
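As one possible way of analyzing the data, here is a minimal sketch of the 7-bit scan mentioned earlier in the thread. The name has_high_bit_bytes() and the 8192-byte chunk size are assumptions for the example. Note that it will flag UTF-8 or Latin-1 text as well, so whether a hit means "binary" depends on what you count as text:

```php
<?php
// Hypothetical helper: true if any byte in the file has the high bit set (0x80-0xFF).
function has_high_bit_bytes($path)
{
    if (!is_file($path) || !is_readable($path)) {
        return false; // this sketch treats unreadable paths as "nothing found"
    }
    $handle = fopen($path, "rb");
    if ($handle === false) {
        return false;
    }
    $found = false;
    while (!$found && !feof($handle)) {
        $chunk = fread($handle, 8192);
        if ($chunk === false) {
            break;
        }
        // No /u modifier, so the pattern is matched byte by byte
        $found = preg_match('/[\x80-\xFF]/', $chunk) === 1;
    }
    fclose($handle);
    return $found;
}
```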
Posted: Fri Feb 16, 2007 9:30 am
by the DtTvB
I would check for any characters with a character code between 0 and 31.
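Something like the sketch below, perhaps. The function name has_control_chars() and the $allowWhitespace flag are invented for the example. One caveat: tab (9), line feed (10) and carriage return (13) also fall in the 0 to 31 range and turn up in ordinary text files, so by default this sketch lets them through. That default is an assumption on top of the suggestion, and passing false checks the full range:

```php
<?php
// Hypothetical helper: true if $data contains control characters (codes 0-31).
function has_control_chars($data, $allowWhitespace = true)
{
    $pattern = $allowWhitespace
        ? '/[\x00-\x08\x0B\x0C\x0E-\x1F]/' // 0-31 minus tab (9), LF (10), CR (13)
        : '/[\x00-\x1F]/';                 // the full 0-31 range
    return preg_match($pattern, $data) === 1;
}
```

It takes a string rather than a path, so it can be run on a whole file from file_get_contents() or on a chunk of one.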