Filtering html; html parsing.

JellyFish
DevNet Resident
Posts: 1361
Joined: Tue Feb 14, 2006 7:18 pm
Location: San Diego, CA

Filtering html; html parsing.

Post by JellyFish »

Hi, I'm looking for the best and most flexible way to filter HTML elements and attributes out of a string of text with PHP. I'd like to do this with built-in PHP <= 5 functions and classes, without installing any extensions.

I need a way to white-list specific HTML elements and attributes, and to delete any HTML tags that aren't in the white-list (when I say tags I mean the <begin> and </end> tags themselves, not the element and its content). I figure a white-list is the best way to go about this, because with a black-list I'd have to keep up with every new HTML feature as it enters the industry. I also need to be able to filter attribute values, such as the value of a style attribute, so that I can filter the CSS rules of an element.

Are there any built-in PHP classes or libraries (not extensions, I don't have root access on my server) that could help me do this? Or should I build my own API using str_replace and the like? If I have to build my own API, what advice would you have? What would you advise as the most secure way to use str_replace to remove elements, tags, and attributes without leaving exploitable workarounds?

Any help on this is VERY appreciated. Thanks for reading.

Cheers! :D
Reviresco
Forum Contributor
Posts: 172
Joined: Tue Feb 19, 2008 4:18 pm
Location: Milwaukee

Re: Filtering html; html parsing.

Post by Reviresco »

strip_tags:

string strip_tags ( string $str [, string $allowable_tags ] )

http://us.php.net/strip-tags
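
A minimal sketch of what strip_tags does with a whitelist, for anyone skimming the thread. The input string is made up for illustration. Note that only the tags are removed: the text inside a stripped tag stays, and attributes on allowed tags are left untouched.

```php
<?php
// Untrusted input: an allowed tag carrying an event handler, plus a disallowed tag.
$dirty = '<p onclick="evil()">Hello <b>world</b><script>alert(1)</script></p>';

// Keep only <p> and <b>. Other tags are removed, but their text content stays.
$clean = strip_tags($dirty, '<p><b>');

echo $clean, "\n";
// <p onclick="evil()">Hello <b>world</b>alert(1)</p>
```

The surviving onclick attribute is why strip_tags alone is not enough for untrusted input.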
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: Filtering html; html parsing.

Post by requinix »

While strip_tags won't handle attributes, there are some functions given in the user comments that do. But I don't think any of them go into as much detail as filtering CSS in an inline style.
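
On the inline style point, here is a rough sketch of one way to whitelist CSS declarations in a style value. The function name filter_inline_style and its parameters are hypothetical, not part of PHP or any library. It is deliberately conservative: any value containing characters outside a small safe set (so anything with parentheses, quotes, or backslashes) is dropped rather than parsed.

```php
<?php
/**
 * Keep only whitelisted CSS declarations from an inline style value.
 * Hypothetical helper for illustration only.
 */
function filter_inline_style($style, array $allowedProperties)
{
    $kept = array();
    foreach (explode(';', $style) as $declaration) {
        $parts = explode(':', $declaration, 2);
        if (count($parts) !== 2) {
            continue; // not a "property: value" pair
        }
        $property = strtolower(trim($parts[0]));
        $value = trim($parts[1]);
        if (!in_array($property, $allowedProperties, true)) {
            continue; // property is not on the whitelist
        }
        // Conservative value check: letters, digits, spaces, and # % . , - only.
        // This rejects url(...), expression(...), quotes, and escapes outright.
        if (!preg_match('/^[a-zA-Z0-9#%., -]+$/', $value)) {
            continue;
        }
        $kept[] = $property . ': ' . $value;
    }
    return implode('; ', $kept);
}

$style = 'color: red; position: fixed; background: url(javascript:alert(1)); font-weight: bold';
echo filter_inline_style($style, array('color', 'font-weight', 'background')), "\n";
// color: red; font-weight: bold
```

The trade-off is that legitimate values such as url(...) backgrounds or quoted font names are rejected too. Loosening the value pattern safely takes a real CSS parser.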
JellyFish
DevNet Resident
Posts: 1361
Joined: Tue Feb 14, 2006 7:18 pm
Location: San Diego, CA

Re: Filtering html; html parsing.

Post by JellyFish »

Hmm, I seem to have found something called HTML Purifier, created by someone on the DevNetwork forums, actually! It looks fantastic, but I'm having difficulty getting started with it. It's a library, which is perfect, but it contains a LOT of files.

If anyone is familiar with HTML Purifier, which files would I need to include from the HTML Purifier zip (everything in the library folder, maybe)? Also, how do I go about using the classes, and which classes do I use (because there's a smurf load!)? The documentation is kinda sketchy for someone who has little experience with PHP libraries (I'm more of a JavaScript guy ;)), so any help would be great. Of course, I'm going to be hacking around with the library, so maybe I can figure it out.

[EDIT]
I seem to have gotten HTML Purifier up and working on my site. But now I'm looking to configure it a bit. But I don't know how this is done really. Let me see if I'm doing this right:

Code:

$config = HTMLPurifier_Config::createDefault();
$config->set("CSS", "AllowedProperties", array("display"));
 
I'm trying to allow the CSS display property. Am I doing this wrong?

I'd like to know all the methods and properties I can use on both the HTMLPurifier_Config class and an HTMLPurifier object, and how to use them. I don't seem to understand the documentation for HTML Purifier very well.

[EDIT] Turns out HTML Purifier doesn't support the display CSS property. What I'm wondering is why HTML Purifier doesn't leave it up to the developer to create the whitelist?
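
For reference, a minimal configuration sketch for HTML Purifier. It assumes a 4.x release, where directives are set with a single dotted key (CSS.AllowedProperties) instead of the separate namespace and key arguments used in the snippet above, and it assumes the library is unpacked under htmlpurifier/library/ (a hypothetical path). As far as I know, display is treated as a "tricky" property and is only accepted when CSS.AllowTricky is enabled. Treat that as an assumption to check against the configuration docs for your version.

```php
<?php
// Path is an assumption: adjust it to wherever the library folder lives.
require_once dirname(__FILE__) . '/htmlpurifier/library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// Element and attribute whitelist: tag names, with allowed attributes in brackets.
$config->set('HTML.Allowed', 'p,b,i,a[href],span[style]');

// Restrict inline CSS to these properties.
$config->set('CSS.AllowedProperties', array('color', 'font-weight', 'display'));

// Assumption: display counts as "tricky" and needs this switch to be accepted.
$config->set('CSS.AllowTricky', true);

$purifier = new HTMLPurifier($config);

// Made-up untrusted input for illustration.
$dirty = '<span style="display:none; position:fixed" onclick="x()">hi</span><script>alert(1)</script>';
$clean = $purifier->purify($dirty);

echo $clean, "\n";
```

Including HTMLPurifier.auto.php registers an autoloader, so the other files in the library folder are loaded on demand instead of being included one by one.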
php_east
Forum Contributor
Posts: 453
Joined: Sun Feb 22, 2009 1:31 pm
Location: Far Far East.

Re: Filtering html; html parsing.

Post by php_east »

you can use a combination of DOM HTML for most of the HTML and attributes, and then DOM XML for the style sheets and inline styles, but i suspect very much that there is a lot of work to be done to achieve what you want.

if it is all worth it, that would be the way i would go.
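
A rough sketch of the DOM route, using the bundled DOMDocument class. The names whitelist_html and clean_children are hypothetical helpers, not library functions. Assumptions: the input is ASCII (loadHTML guesses the encoding otherwise), script and style elements are dropped together with their content, and href/src values are kept only when they start with http(s)://, mailto:, / or #.

```php
<?php
// Hypothetical helpers for illustration; the names are not from any library.
function clean_children(DOMNode $parent, array $allowed, array $dropped)
{
    // Snapshot the child list first, because the loop mutates the tree.
    $children = array();
    foreach ($parent->childNodes as $child) {
        $children[] = $child;
    }
    foreach ($children as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            continue; // plain text is always kept
        }
        $tag = strtolower($child->nodeName);
        if ($child->nodeType !== XML_ELEMENT_NODE || in_array($tag, $dropped, true)) {
            $parent->removeChild($child); // comments, script, style: drop with content
            continue;
        }
        clean_children($child, $allowed, $dropped); // clean descendants first
        if (!isset($allowed[$tag])) {
            // Unwrap: keep the already cleaned children, remove only the tag.
            while ($child->firstChild) {
                $parent->insertBefore($child->firstChild, $child);
            }
            $parent->removeChild($child);
            continue;
        }
        $names = array();
        foreach ($child->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            $value = trim($child->getAttribute($name));
            $isUrl = ($name === 'href' || $name === 'src');
            if (!in_array($name, $allowed[$tag], true)
                || ($isUrl && !preg_match('~^(https?://|mailto:|/|#)~i', $value))) {
                $child->removeAttribute($name);
            }
        }
    }
}

function whitelist_html($html, array $allowed, array $dropped = array('script', 'style'))
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence malformed-markup warnings
    $doc->loadHTML('<div>' . $html . '</div>');   // wrapper keeps bare text out of an implicit <p>
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    clean_children($root, $allowed, $dropped);

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child); // node argument needs PHP 5.3.6 or later
    }
    return $out;
}

// $allowed maps each permitted tag name to its permitted attribute names.
$allowed = array('p' => array(), 'b' => array(), 'a' => array('href'));
$dirty = '<p onclick="x()">Hi <font color="red"><b>there</b></font> '
       . '<a href="javascript:alert(1)" title="t">link</a><script>alert(1)</script></p>';
echo whitelist_html($dirty, $allowed), "\n";
```

In this example the font tag is unwrapped but its bold text survives, the onclick and title attributes and the javascript: href are removed, and the script element disappears entirely. Inline styles would still need their own value filter on top of this.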
JellyFish
DevNet Resident
Posts: 1361
Joined: Tue Feb 14, 2006 7:18 pm
Location: San Diego, CA

Re: Filtering html; html parsing.

Post by JellyFish »

php_east wrote:you can use a combination of DOM HTML for most of the HTML and attributes, and then DOM XML for the style sheets and inline styles, but i suspect very much that there is a lot of work to be done to achieve what you want.

if it is all worth it, that would be the way i would go.
Hey, thanks for posting. If you look at the edits in my previous post, you'll see that I've decided to use HTML Purifier to filter HTML content for my site. The only thing is that I need some help understanding how to use the library, so any info on this would be very much appreciated. I guess I may need to sign up on the HTML Purifier forums instead.
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Re: Filtering html; html parsing.

Post by Benjamin »

JellyFish, I think you're going to have to break down and use some tight regex in order to whitelist HTML tags.
JellyFish
DevNet Resident
Posts: 1361
Joined: Tue Feb 14, 2006 7:18 pm
Location: San Diego, CA

Re: Filtering html; html parsing.

Post by JellyFish »

astions wrote:JellyFish, I think you're going to have to break down and use some tight regex in order to whitelist HTML tags.
But why, when there is HTML Purifier?
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Re: Filtering html; html parsing.

Post by Benjamin »

Not sure how I missed that. Have a look here:

http://htmlpurifier.org/docs/dev-advanced-api.html