In late October of 2006, the Hardened-PHP Project (www.hardened-php.net) found
a buffer overflow vulnerability in the htmlentities() and htmlspecialchars()
functions that are built into PHP. Those two functions were written on the assumption that
HTML entities are never more than eight characters long. Most of the time this is
true. Unfortunately, if you use UTF-8 encoding with Greek characters, this assumption
fails.
UTF-8 is a variable-length character-encoding scheme that allows for characters
outside the typical Roman alphabet. The benefit of using UTF-8 is that it allows your
application to handle international data, such as text in Asian or Middle Eastern languages.
To handle these characters, UTF-8 uses between 1 and 4 bytes for each character:
• The 128 ASCII characters require only 1 byte to encode. These are the characters
most commonly used online because for much of the existence of computing,
work was done primarily in English. Politically incorrect? Possibly. But computer
programmers—especially those who deal with low-level operating system functions
like character encodings—aren’t widely known for their social graces. UTF-8 was
created to solve the limitations of English-based ASCII while maintaining backward
compatibility. The UTF-8 encoding of the letter A, for example, would be
only 1 byte long, just like its ASCII equivalent.
• Accented Latin letters and the Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana
alphabets require 2 bytes to encode. This is where htmlentities() runs into trouble,
because the entities for some of these characters are longer than it expects.
• Three bytes are required for the rest of the Basic Multilingual Plane in Unicode, which
includes virtually all characters in common use today.
• Four bytes are reserved for other Unicode planes, which are rarely used. This,
however, doesn’t mean you can assume the fourth byte in a UTF-8-encoded character
is empty or harmless.
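The byte counts in the list above can be checked directly, because PHP's strlen() counts bytes rather than characters. The following is a minimal sketch, assuming PHP 7 or later for the "\u{...}" escape syntax. The sample characters and the $samples and $byte_lengths names are chosen here purely for illustration.

```php
<?php
// Sample characters, one per UTF-8 length class (illustrative choices).
$samples = [
    'A'      => "A",          // U+0041, plain ASCII
    'alpha'  => "\u{03B1}",   // Greek small letter alpha
    'euro'   => "\u{20AC}",   // Euro sign, inside the Basic Multilingual Plane
    'gothic' => "\u{10330}",  // Gothic letter ahsa, outside the BMP
];

// strlen() returns the number of bytes, not the number of characters.
$byte_lengths = [];
foreach ($samples as $name => $char) {
    $byte_lengths[$name] = strlen($char);
    echo $name, ': ', $byte_lengths[$name], " byte(s)\n";
}
```

Each sample is a single character on screen, yet the byte length grows from 1 to 4, which is why code that equates "one character" with "one byte" breaks on non-English input.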
The htmlentities() and htmlspecialchars() functions assume an 8-character
entity. Most of the time this isn’t a problem. As we noted above, the vast majority of
computing is done in English, although this is changing as the Internet becomes
more widely available outside of North America and Western Europe. What happens
when a user (or a hacker, depending purely upon motivations) inserts a Greek UTF-8-
encoded character into your Web form, which you then pass to htmlentities() for
sanitization before displaying it in the browser? When the HTML entity encoder in
PHP encounters this Greek HTML entity that is larger than the current 8-character
buffer, PHP will simply increase the size of the buffer by 2 characters. Unfortunately, if
the HTML entity is 11 characters long, the buffer will overflow and allow for arbitrary
code to be executed. Figure 4.4 shows how PHP handles a normal, English-language
HTML entity. Figure 4.5 shows how this vulnerability is exploited with a Greek character.
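To see why an 8-character assumption is fragile, you can measure the entities that PHP itself emits. This is only a sketch: it inspects the translation table of whatever PHP version runs it (which should be a patched one), not the vulnerable code. The $assumed_max value simply mirrors the 8-character figure described above, and the variable names are illustrative.

```php
<?php
// The 8-character assumption described in the text (illustrative value).
$assumed_max = 8;

// Map of characters to named entities for the HTML 4.01 entity set, UTF-8 input.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Collect every entity whose text is longer than the assumed maximum.
$too_long = [];
foreach ($table as $char => $entity) {
    if (strlen($entity) > $assumed_max) {
        $too_long[] = $entity;
    }
}

echo count($too_long), " entities exceed {$assumed_max} characters\n";
echo implode(' ', $too_long), "\n";
```

Greek-derived names such as &thetasym; appear in the resulting list, which is the kind of input the scenario above describes.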
There are two important points to take from this exploit:
• First, buffer overflows do happen in PHP. The only solution to the htmlentities()
and htmlspecialchars() exploit is to upgrade PHP to version 5.2.0 or greater,
so it’s crucial to keep PHP (and its underlying libraries, and the operating system)
up to date.
• Second, if a buffer overflow vulnerability occurred once, it can, and will, occur
again. Just because one hole was closed does not imply that no other holes exist, nor
does it imply that new holes won’t be introduced in the next version of the language,
or its underlying libraries. Before the htmlentities() and htmlspecialchars()
buffer overflow vulnerability was discovered, the same vulnerability was found
and fixed in the wordwrap() function. There will certainly be vulnerabilities discovered
in the future. You simply can’t assume that because one vulnerability was
found and resolved, another doesn’t exist or won’t be introduced later.
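The first point can be enforced at startup with a version check. The sketch below is an assumption about how you might do it, not a prescribed method: php_is_at_least() is a hypothetical helper name, and the 5.2.0 threshold is the version named above. It uses the built-in PHP_VERSION constant and version_compare() function.

```php
<?php
// Hypothetical helper: true when the running interpreter is at least $minimum.
function php_is_at_least($minimum)
{
    return version_compare(PHP_VERSION, $minimum, '>=');
}

if (!php_is_at_least('5.2.0')) {
    // Refuse to run on an interpreter known to contain the overflow.
    exit("PHP 5.2.0 or later is required\n");
}
```

A check like this only guards against the one known hole. It says nothing about vulnerabilities that have not been found yet, which is the second point.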
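One defensive habit is to check the length of incoming data before passing it to a built-in function:
if (strlen($incoming_html_char) > 10) {
    // Reject the data
} else {
    // Otherwise continue processing
    $safe_html .= htmlentities($incoming_html_char);
}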
In this case, we’ve chosen an arbitrary length that we assume is smaller than the
underlying buffer. As application programmers, we are several layers removed from
the actual buffer code, so unfortunately we don’t usually know exactly how large the
limit is, at least not until it’s been exploited. Sure, we could dig through the code for
the PHP interpreter and all its built-in functions and libraries to figure out if there are
assumed variable size limits (as in the htmlentities() and htmlspecialchars()
vulnerability). Once we’re done there, we’d have to do the same thing with all the
C libraries that PHP is built on. If you have that kind of time, go for it. For most of us,
that’s just not realistic, so we make educated guesses about what a reasonable length
for our variables is.
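The same educated guess can be wrapped in a reusable function. What follows is a minimal sketch, assuming PHP 7.1 or later for the nullable return type. The encode_bounded() name, the MAX_INPUT_BYTES constant, and the limit of 10 are all hypothetical; the limit is an arbitrary guess in the spirit of the text, not a value taken from PHP's source.

```php
<?php
// Hypothetical limit, in bytes. An educated guess, not a documented buffer size.
const MAX_INPUT_BYTES = 10;

// Returns the entity-encoded string, or null when the input is too long.
function encode_bounded(string $input, int $max_bytes = MAX_INPUT_BYTES): ?string
{
    // strlen() counts bytes, so multibyte UTF-8 characters use up the limit faster.
    if (strlen($input) > $max_bytes) {
        return null; // Reject the data
    }

    // Name the encoding explicitly instead of relying on the configured default.
    return htmlentities($input, ENT_QUOTES, 'UTF-8');
}

$safe_html = '';
$encoded = encode_bounded('<b>');
if ($encoded !== null) {
    $safe_html .= $encoded;
}
```

Returning null on rejection forces the caller to decide what to do with oversized input, rather than silently passing it along to the built-in function.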