
Using regular expressions to remove text from attribute

Posted: Thu Aug 27, 2009 2:03 pm
by SpankMarvin
Hello all

I am creating a project in WordPress in which a listing, stored as a string, ends up with tags inside the title attribute of its <a> tags. I know this isn't ideal, but it is a by-product of a necessary and otherwise-working hack. So, my question is: can I use regular expressions to remove ONLY the tags from the title attribute of each <a> tag?

So far, I have the following:

Code: Select all

$title_expr = '/title="(.*?)"/';
$cat_list = preg_replace($title_expr, "", $cat_list);
echo $cat_list;
which removes the title attributes entirely. This is a good enough worst-case scenario, but if it's possible to find all title attributes and then remove everything between and including the < and > from each, I'd be interested in knowing how to go about it...

Thank you!

John

Re: Using regular expressions to remove text from attribute

Posted: Thu Aug 27, 2009 3:15 pm
by requinix
So you want to remove all <...title="?"...>s?

Code: Select all

preg_replace('/<[^>]*?title\\s*=\\s*"[^"]*"[^>]*>/', "", $text)
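For illustration, here is roughly what that pattern does when run on a made-up sample string (the $text value is an assumption, not from the original listing): any tag carrying a title attribute is removed whole, while its inner text and closing tag are left behind.

```php
<?php
// Hypothetical sample input for illustration only.
$text = '<a href="/link" title="Title">link</a> <b>bold</b>';

// Removes every opening tag that contains a title="..." attribute.
$out = preg_replace('/<[^>]*?title\\s*=\\s*"[^"]*"[^>]*>/', "", $text);

echo $out, "\n"; // link</a> <b>bold</b>
```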

Re: Using regular expressions to remove text from attribute

Posted: Thu Aug 27, 2009 4:28 pm
by SpankMarvin
Thank you, but this is not quite what I'm after. Here's what I'd like to do for each <a> tag in the string:

- Find title attribute.
- Where there is a title attribute, within the title's contents, remove all tags

I know there should not be any tags inside the title attribute, which is why I'm attempting to do this. The automatically-generated HTML is producing the tags within the attribute. I want to keep as much of the title attribute as possible, but remove any <> characters and everything contained within them.

My original code gets rid of the entirety of the title attribute, which is overkill, but at least gets the document validating correctly. This, however, is as far as I can get in the regular expressions before my head explodes.

Re: Using regular expressions to remove text from attribute

Posted: Thu Aug 27, 2009 5:30 pm
by requinix
Okay, so for each title in an <a> tag, strip HTML tags from it?

Code: Select all

<a href="/link" title="<b>Title</b>"> becomes
<a href="/link" title="Title">

Code: Select all

preg_replace('/(<a\\b[^>]+?title\\s*=\\s*")([^"]+)("[^>]*>)/ie', '"$1" . strip_tags("$2") . "$3"', $text)

Re: Using regular expressions to remove text from attribute

Posted: Thu Aug 27, 2009 6:44 pm
by SpankMarvin
Absolutely fantastic. Thank you so much. This does precisely what I was after.

Would you mind briefly explaining the symbols you have used here? I have tried a couple of books and various online tutorials to get my head around this, but was unable to work it out. I'd really appreciate a little breakdown (or if you know of a good tutorial!) so I can learn from this.

Thanks again for your wonderful help.

John

EDIT: P.S. I'm thinking of making a small WP tutorial about what I was trying to achieve, and will credit you for this last stage when I do. If you have a website or something for me to link to, PM me or pop it in here, and I'll link to you with my credit!

Re: Using regular expressions to remove text from attribute

Posted: Thu Aug 27, 2009 9:16 pm
by requinix
Ah, sorry, playing Crysis.

Code: Select all

preg_replace('/(<a\\b[^>]+?title\\s*=\\s*")([^"]+)("[^>]*>)/ie', '"$1" . strip_tags("$2") . "$3"', $text)
Here goes:

Code: Select all

(<a\b[^>]+?title\s*=\s*")([^"]+)("[^>]*>)
 
Three parts:
(<a\b[^>]+?title\s*=\s*")
- \b is a word boundary - matches a location (between a \w and a not-\w), so <a matches but <abbr does not
- [^>]+? is at least one character that isn't a > but matches as few as possible
- \s is whitespace (space, tab, newline, etc)
 
([^"]+)
- [^"]+ is at least one character that isn't a " and matches as many as it possibly can
 
("[^>]*>)
- [^>]* is some number (could be zero) of not-> characters. Also matches as many as possible
All together:
$1 is <a, some number of not-> characters, title, maybe some spaces, =, maybe some more spaces, and a ".
$2 is some number of not-" characters.
$3 is a ", any number of not-> characters, and a >.
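
A quick way to see those three groups, as a sketch only: run the same pattern through preg_match() on a sample link and print what each group captured. The $html string is made up for illustration, and the /e flag is left off because nothing is being replaced here.

```php
<?php
// Hypothetical sample input, not from the original listing.
$html = '<a href="/link" title="<b>Title</b>">';

// Same pattern as above, minus the /e flag (nothing is evaluated here).
$pattern = '/(<a\\b[^>]+?title\\s*=\\s*")([^"]+)("[^>]*>)/i';

if (preg_match($pattern, $html, $m)) {
    echo $m[1], "\n"; // <a href="/link" title="
    echo $m[2], "\n"; // <b>Title</b>
    echo $m[3], "\n"; // ">
}
```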

The /i flag means to do a case-insensitive search (so it works with <A TITLE=""> too) and /e means that the replacement text is actually PHP code to evaluate.

Code: Select all

"$1" . strip_tags("$2") . "$3"
Before evaluating the code, preg_replace will replace each $X with the matching captured subpattern (after slashes have been added to escape quotes), so the evaluated code could look something like

Code: Select all

"<a href=\"http://forums.devnetwork.net\" title=\"" . strip_tags("<b>DevNetwork Forums</b>") . "\" rel=\"external\">"
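
One caveat worth flagging: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7, so the preg_replace call above will not run on current PHP versions. A rough equivalent, as a sketch only, uses preg_replace_callback() with the same pattern minus /e. The function name strip_title_tags and the sample $text are made up for illustration.

```php
<?php
// Hypothetical helper name; same pattern as above but without the /e flag.
function strip_title_tags($text)
{
    $pattern = '/(<a\\b[^>]+?title\\s*=\\s*")([^"]+)("[^>]*>)/i';

    return preg_replace_callback(
        $pattern,
        function ($m) {
            // $m[1] and $m[3] are kept as-is; only the title's contents are stripped.
            return $m[1] . strip_tags($m[2]) . $m[3];
        },
        $text
    );
}

// Made-up sample input for illustration.
$text = '<a href="/link" title="<b>Title</b>">link</a>';
echo strip_title_tags($text), "\n"; // <a href="/link" title="Title">link</a>
```

The callback receives the captured text unescaped, so there is no added-slashes step to account for, and no code is built from the input string.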

Re: Using regular expressions to remove text from attribute

Posted: Thu Aug 27, 2009 11:09 pm
by SpankMarvin
Thanks so much for taking the time to explain. Really useful stuff.