Filtering html; html parsing.
Posted: Thu Apr 30, 2009 2:16 pm
by JellyFish
Hi, I'm looking for the best and most flexible way to filter HTML elements and attributes out of a string of text with PHP. I'm looking to do this with built-in PHP <= 5 methods/functions/classes, with no installation of extensions.
I need a way to white-list specific HTML elements and attributes, and delete HTML tags (when I say tags I mean <begin> and </end> tags, not the element and its content) that aren't in the white-list. I figure a white-list is the best way to go about this, because with a black-list I'd have to keep up with new HTML features as they enter the industry. I also need the ability to filter attribute values, such as the value of a style attribute, so that I can filter the CSS rules of an element.
Are there any PHP built-in classes or libraries (not extensions, I don't have root access on my server) that could help me do this? Or should I build my own API and use str_replace and the like to do so? If I have to build my own API, what advice would you have? What would you advise as the most secure way to use str_replace to remove elements, tags, and attributes without hackable workarounds?
Any help on this is much appreciated. Thanks for reading.
Cheers!
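On the str_replace question above, a minimal sketch of why literal string replacement is easy to work around. The function name naive_strip and the sample inputs are made up for illustration; the point is only that str_replace makes a single, case-sensitive pass.

```php
<?php
// Naive blacklist (hypothetical helper): delete the literal strings
// "<script>" and "</script>" wherever they occur.
function naive_strip($html)
{
    return str_replace(array('<script>', '</script>'), '', $html);
}

// 1. Nested input: removing the inner match splices the outer pieces
//    back together into a working tag, and str_replace does not rescan.
$nested = '<scr<script>ipt>alert(1)</scr</script>ipt>';
$nestedResult = naive_strip($nested);

// 2. Different case: str_replace is case-sensitive, so this passes untouched.
$upper = '<SCRIPT>alert(1)</SCRIPT>';
$upperResult = naive_strip($upper);

echo $nestedResult, "\n"; // <script>alert(1)</script>
echo $upperResult, "\n";  // <SCRIPT>alert(1)</SCRIPT>
```

Both inputs come out as working script tags, which is the kind of workaround a white-list that rebuilds the markup is meant to avoid.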

Re: Filtering html; html parsing.
Posted: Thu Apr 30, 2009 2:20 pm
by Reviresco
strip_tags:
string strip_tags ( string $str [, string $allowable_tags ] )
http://us.php.net/strip-tags
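A short usage sketch for the signature above. The sample string is invented; the second argument lists the tags to keep.

```php
<?php
$input = '<p>Hello <b>world</b> <script>alert(1)</script></p>';

// Keep only <p> and <b>. Every other tag is removed,
// but the text between the removed tags stays in place.
$clean = strip_tags($input, '<p><b>');

echo $clean, "\n"; // <p>Hello <b>world</b> alert(1)</p>
```

Note that only the tags are deleted, not the element's content, which matches the behaviour asked for in the first post.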
Re: Filtering html; html parsing.
Posted: Thu Apr 30, 2009 2:22 pm
by requinix
While strip_tags won't handle attributes, there are some functions given in the user comments that do. But I don't think any of them go into as much detail as filtering CSS in an inline style.
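To illustrate the attribute limitation, a small sketch with a made-up input: tags listed as allowable are passed through with all of their attributes intact.

```php
<?php
$input = '<b onclick="alert(1)" style="position:fixed">hi</b><i>there</i>';

// <b> is allowed, so it survives together with its onclick and style
// attributes; <i> is not allowed, so only its text remains.
$clean = strip_tags($input, '<b>');

echo $clean, "\n"; // <b onclick="alert(1)" style="position:fixed">hi</b>there
```

So strip_tags on its own cannot white-list attributes or filter inline CSS.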
Re: Filtering html; html parsing.
Posted: Thu Apr 30, 2009 6:08 pm
by JellyFish
Hmm, I seem to have found something called HTML Purifier, created by someone on the DevNetwork forums, actually! It looks fantastic, but I'm having difficulty getting started with it. It's a library, which is perfect, but it contains a LOT of files.
If anyone is familiar with HTML Purifier, which files would I need to include from the HTML Purifier zip (everything in the library folder, maybe)? Also, how do I go about using the classes, and which classes do I use (because there's a smurf load!)? The documentation is kinda sketchy for someone who has little experience with PHP libraries (I'm more of a JavaScript guy), so any help or info would be great. Of course, I'm going to be hacking around with the library and maybe I can figure it out.
[EDIT]
I seem to have gotten HTML Purifier up and working on my site. But now I'm looking to configure it a bit. But I don't know how this is done really. Let me see if I'm doing this right:
Code:
$config = HTMLPurifier_Config::createDefault();
$config->set("CSS", "AllowedProperties", array("display"));
I'm trying to allow the CSS display property. Am I doing this wrong?
I'd like to know what are all the methods and properties that I can use for both the HTMLPurifier_Config class and a HTMLPurifier object and how to use them. I don't seem to understand the documentation for HTML Purifier very well.
[EDIT] Turns out HTML Purifier doesn't support the CSS display property. What I'm wondering is why HTML Purifier doesn't leave it up to the developer to create the whitelist?
Re: Filtering html; html parsing.
Posted: Fri May 01, 2009 12:09 am
by php_east
you can use a combination of DOM HTML for most of the HTML and attributes, and then DOM XML for the style sheets and inline styles, but i suspect very much there is a lot of work to be done to achieve what you want.
if it is all worth it, that would be the way i would go.
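A rough sketch of what the DOM approach could look like for elements and attributes, using PHP's DOMDocument. Everything here is an assumption for illustration: the function names filter_html and filter_node, the white-list contents and the sample input are made up, and saveHTML() with a node argument needs PHP 5.3.6 or later. It does not check attribute values (for example javascript: URLs in href) and does not touch inline CSS, so it is only a starting point.

```php
<?php
// Recursively clean the children of $node against the white-list.
function filter_node(DOMNode $node, array $allowed)
{
    // Copy the child list first, because the loop below modifies the tree.
    $children = array();
    foreach ($node->childNodes as $child) {
        $children[] = $child;
    }
    foreach ($children as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            continue;                          // plain text is kept as-is
        }
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            $node->removeChild($child);        // drop comments and the like
            continue;
        }
        filter_node($child, $allowed);         // clean descendants first
        $name = strtolower($child->nodeName);
        if (!isset($allowed[$name])) {
            // Not white-listed: remove the tag but keep its (cleaned) content.
            while ($child->firstChild) {
                $node->insertBefore($child->firstChild, $child);
            }
            $node->removeChild($child);
            continue;
        }
        // White-listed element: remove every attribute not listed for it.
        $attrNames = array();
        foreach ($child->attributes as $attr) {
            $attrNames[] = $attr->nodeName;
        }
        foreach ($attrNames as $attrName) {
            if (!in_array(strtolower($attrName), $allowed[$name], true)) {
                $child->removeAttribute($attrName);
            }
        }
    }
}

// Parse an HTML fragment, filter it, and return the cleaned markup.
function filter_html($html, array $allowed)
{
    libxml_use_internal_errors(true);          // silence warnings on sloppy markup
    $doc = new DOMDocument();
    // Wrap the fragment so the cleaned nodes are easy to find again.
    $doc->loadHTML('<div>' . $html . '</div>');
    $root = $doc->getElementsByTagName('div')->item(0);
    filter_node($root, $allowed);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Hypothetical white-list: element name => allowed attribute names.
$allowed = array('p' => array(), 'b' => array(), 'a' => array('href'));

$dirty = '<p onclick="evil()">Hi <span style="color:red"><b>there</b></span>, '
       . '<a href="http://example.com/" onmouseover="evil()">link</a></p>';
$clean = filter_html($dirty, $allowed);
echo $clean, "\n";
```

Because the output is re-serialized from the parsed tree rather than edited as a string, nested-tag tricks that defeat string replacement do not apply here, though attribute values would still need their own checks.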
Re: Filtering html; html parsing.
Posted: Fri May 01, 2009 1:32 am
by JellyFish
php_east wrote: you can use a combination of DOM HTML for most of the HTML and attributes, and then DOM XML for the style sheets and inline styles, but i suspect very much there is a lot of work to be done to achieve what you want.
if it is all worth it, that would be the way i would go.
Hey, thanks for posting. If you look at the edits in my previous post you'll see that I've decided to use HTML Purifier to filter HTML content for my site. The only thing is I need some help understanding how to use the library, so any info on this would be much appreciated. I guess I'll need to sign up on the HTML Purifier forums instead, maybe.
Re: Filtering html; html parsing.
Posted: Fri May 01, 2009 3:22 am
by Benjamin
JellyFish, I think you're going to have to break down and use some tight regex in order to whitelist html tags.
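For what a regex white-list might look like, here is a minimal sketch. The function name regex_whitelist, the pattern and the sample input are assumptions made for illustration, and the closure needs PHP 5.3 or later. It rebuilds each allowed tag from its name alone, so all attributes are discarded; it is fragile by nature and does not catch things like an unterminated tag with no closing ">".

```php
<?php
// Hypothetical helper: keep only tags whose name is in $allowedTags,
// and re-emit them without any attributes.
function regex_whitelist($html, array $allowedTags)
{
    return preg_replace_callback(
        '~<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>~',
        function ($m) use ($allowedTags) {
            $name = strtolower($m[2]);
            if (!in_array($name, $allowedTags, true)) {
                return '';                     // drop the tag, keep surrounding text
            }
            return '<' . $m[1] . $name . '>';  // rebuilt tag, attributes discarded
        },
        $html
    );
}

$dirty = '<B onclick="x()">bold</B> <i>it</i> <blink>no</blink>';
$out = regex_whitelist($dirty, array('b', 'i'));

echo $out, "\n"; // <b>bold</b> <i>it</i> no
```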
Re: Filtering html; html parsing.
Posted: Sat May 02, 2009 4:14 pm
by JellyFish
astions wrote: JellyFish, I think you're going to have to break down and use some tight regex in order to whitelist html tags.
But why, when there is HTML Purifier?
Re: Filtering html; html parsing.
Posted: Sat May 02, 2009 4:21 pm
by Benjamin