Identifying text vs. binary files

Posted: Thu Feb 15, 2007 7:05 pm
by benwei
Hello there,

I am trying to find a simple way of differentiating plaintext from binary files without resorting to either the Fileinfo PECL module or the (now deprecated) mime_content_type() function. Does anybody know of one?

Thanks,
Ben

Posted: Fri Feb 16, 2007 1:30 am
by ryuuka
Use the explode() function to get the file's extension, but be aware that this is a very insecure way to do it.
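
A minimal sketch of the extension check suggested here, for illustration only. The function name has_text_extension() and the default whitelist are hypothetical, and as the post says the result is easy to fool, since anyone can rename a file.

```php
<?php
// Guess text vs. binary from the file name alone (easily fooled by renaming).
// $textExtensions is a hypothetical whitelist, not an authoritative list.
function has_text_extension($filename, array $textExtensions = array('txt', 'csv', 'log', 'html', 'xml'))
{
    $parts = explode('.', basename($filename));
    // A name without a dot has no extension to inspect
    if (count($parts) < 2) {
        return false;
    }
    $extension = strtolower(end($parts));
    return in_array($extension, $textExtensions, true);
}
```

With the default list, has_text_extension('notes.TXT') is true and has_text_extension('photo.jpeg') is false. The file contents are never read.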

Re: Identifying text vs. binary files

Posted: Fri Feb 16, 2007 3:09 am
by shwanky
benwei wrote:Hello there,

I am trying to find a simple way of differentiating plaintext from binary files without resorting to either the Fileinfo PECL module or the (now deprecated) mime_content_type() function. Does anybody know of one?

Thanks,
Ben
If exec isn't disabled on your server you can do exec("file -b " . escapeshellarg($filename), $result); ^.^
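
A hedged sketch of how the file(1) approach might be wrapped. It assumes a Unix-like server with the file command installed. The function name is_text_according_to_file() is hypothetical, and the test for the word "text" relies on the way file(1) usually describes text files, so treat it as a heuristic.

```php
<?php
// Ask file(1) about a path. Returns true when the description mentions "text",
// false when it does not, or null when exec() is unavailable or the command fails.
function is_text_according_to_file($filename)
{
    // function_exists() is false when exec() is missing or disabled
    if (!function_exists('exec')) {
        return null;
    }
    $output = array();
    $status = 1;
    // escapeshellarg() keeps spaces and shell metacharacters in the name from being interpreted
    exec('file -b ' . escapeshellarg($filename) . ' 2>/dev/null', $output, $status);
    if ($status !== 0 || count($output) === 0) {
        return null;
    }
    // Descriptions such as "ASCII text" contain the word "text"; "data" does not
    return stripos($output[0], 'text') !== false;
}
```

Returning null keeps "could not tell" separate from "not text", so the caller can fall back to another check.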

Posted: Fri Feb 16, 2007 3:14 am
by Chris Corbyn
Just look for NULL bytes.

Code: Select all

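function is_binary_file($path)
{
    if (!is_file($path) || !is_readable($path)) return false;
    $handle = fopen($path, "rb");
    if ($handle === false) return false;
    $found = false;
    //fread() returns "" (not false) at EOF, so loop on feof() instead
    while (!$found && !feof($handle))
    {
        $chunk = fread($handle, 8192);
        if ($chunk === false) break;
        //A NULL byte anywhere in the chunk marks the file as binary
        $found = strpos($chunk, "\0") !== false;
    }
    fclose($handle);
    //Not binary if no NULLs were detected
    return $found;
}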

Posted: Fri Feb 16, 2007 8:09 am
by feyd
I'm not sure all binary files will have NULL bytes. While the probability that they do is high, it's not guaranteed.

"Standard" (ASCII) text files use a maximum of 7 bits per byte. Scanning the file for bytes outside that range (values above 127) may detect binary files more reliably. However, that depends on what you consider text files.
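
A minimal sketch of the 7-bit scan described above. The function name is_seven_bit_file() and the 8192-byte chunk size are illustrative choices. Note that it also rejects UTF-8 or Latin-1 text containing accented characters, which is exactly the "what you consider text" caveat.

```php
<?php
// True when every byte in the file is 7-bit (0-127); false otherwise
// or when the file cannot be opened.
function is_seven_bit_file($path)
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $sevenBit = true;
    while ($sevenBit && !feof($handle)) {
        $chunk = fread($handle, 8192);
        if ($chunk === false) {
            break;
        }
        // Any byte in the range 0x80-0xFF has its eighth bit set
        $sevenBit = preg_match('/[\x80-\xFF]/', $chunk) !== 1;
    }
    fclose($handle);
    return $sevenBit;
}
```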

Posted: Fri Feb 16, 2007 8:26 am
by Chris Corbyn
Unicode is still text. I would certainly say a text file contains no NULL bytes (at least in ASCII or UTF-8; UTF-16 and UTF-32 text does contain them), but I can't say for sure that a binary file WILL contain NULL bytes. It's unlikely that you won't find a NULL byte in a binary file, however.

I saw another app (haven't got the foggiest where now!) which took a substr() of the first 1000 bytes then checked that for strpos() of \0. I'm not sure if they had a reason for only taking the first 1000 bytes other than memory/speed issues.
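
A sketch of that sampling idea, not the code of the app mentioned. The function name has_null_in_head() is hypothetical, and instead of substr() it uses the offset and length arguments of file_get_contents(), so only the sample is read into memory.

```php
<?php
// Look for a NULL byte in only the first $sampleSize bytes of the file.
function has_null_in_head($path, $sampleSize = 1000)
{
    // Offset 0 plus a maximum length reads just the head of the file
    $head = @file_get_contents($path, false, null, 0, $sampleSize);
    if ($head === false) {
        return false;
    }
    return strpos($head, "\0") !== false;
}
```

The trade-off is that a NULL byte beyond the sample goes unnoticed, so a larger $sampleSize buys accuracy at the cost of speed and memory.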

Posted: Fri Feb 16, 2007 8:31 am
by feyd
d11wtq wrote:Unicode is still text.
That's why I put "standard" in quotes and added that what the user considers text may or may not fall within it.


Since all files are binary, I would actually suggest analyzing the file data to determine if it is what the user considers as text.

Posted: Fri Feb 16, 2007 9:30 am
by the DtTvB
I would check for any characters with a character code between 0 and 31 (other than tab, line feed and carriage return, which appear in ordinary text files).
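
A minimal sketch of this check. It assumes that tab (9), line feed (10) and carriage return (13) should be allowed, since nearly every text file contains them; that exemption and the function name has_control_chars() are assumptions rather than part of the suggestion.

```php
<?php
// True when the string contains a control character (codes 0-31) other than
// tab (9), line feed (10) and carriage return (13).
function has_control_chars($data)
{
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $data) === 1;
}
```

To test a file, pass it the contents, or just a sample of them, for example has_control_chars(file_get_contents($path)).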