Using regular expressions to remove text from attribute

SpankMarvin
Forum Newbie
Posts: 17
Joined: Mon Jul 27, 2009 1:08 am

Using regular expressions to remove text from attribute

Post by SpankMarvin »

Hello all

I am creating a project in WordPress in which a listing, stored as a string, ends up with HTML tags inside the title attribute of its <a> tags. I know this isn't ideal, but it is a by-product of a necessary and otherwise-working hack. So my question is: can I use regular expressions to remove ONLY the tags from the title attributes within each <a> tag?

So far, I have the following:

Code: Select all

// Matches title="..." (non-greedy) and removes the whole attribute.
$title_expr = '/title="(.*?)"/';

$cat_list = preg_replace($title_expr, '', $cat_list);
echo $cat_list;
This removes the title attributes entirely. That is a good enough worst-case scenario, but if it's possible to find all title attributes and then remove every < ... > (the angle brackets and everything between them) from each one, I'd be interested in knowing how to go about it...

Thank you!

John
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: Using regular expressions to remove text from attribute

Post by requinix »

So you want to remove all <...title="?"...>s?

Code: Select all

preg_replace('/<[^>]*?title\\s*=\\s*"[^"]*"[^>]*>/', "", $text)
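
To make the effect of that pattern concrete, here is a minimal sketch run against a made-up string (the sample markup is an assumption, not taken from the original listing). It suggests that any tag carrying a double-quoted title attribute is deleted whole, from < to >, while the text and closing tag after it stay behind.

```php
<?php
// Hypothetical sample markup, not taken from the thread.
$text = '<a href="/x" title="Home">Home</a> <p>text</p>';

// Any tag that carries a double-quoted title attribute is matched
// from its opening < to its closing > and replaced with nothing.
$result = preg_replace('/<[^>]*?title\s*=\s*"[^"]*"[^>]*>/', '', $text);

echo $result, "\n"; // the opening <a ...> tag is gone; its text and </a> remain
```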
SpankMarvin
Forum Newbie
Posts: 17
Joined: Mon Jul 27, 2009 1:08 am

Re: Using regular expressions to remove text from attribute

Post by SpankMarvin »

tasairis wrote:So you want to remove all <...title="?"...>s?

Code: Select all

preg_replace('/<[^>]*?title\\s*=\\s*"[^"]*"[^>]*>/', "", $text)
Thank you, but this is not quite what I'm after. Here's what I'd like to do for each <a> tag in the string:

- Find title attribute.
- Where there is a title attribute, within the title's contents, remove all tags

I know there should not be any tags inside the title attribute, which is why I'm attempting to do this. The automatically-generated HTML is producing the tags within the attribute. I want to keep as much of the title attribute as possible, but remove any <> characters and everything contained within them.

My original code gets rid of the entirety of the title attribute, which is overkill, but at least gets the document validating correctly. This, however, is as far as I can get in the regular expressions before my head explodes.
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: Using regular expressions to remove text from attribute

Post by requinix »

Okay, so for each title in an <a> tag, strip HTML tags from it?

Code: Select all

<a href="/link" title="<b>Title</b>"> becomes
<a href="/link" title="Title">

Code: Select all

preg_replace('/(<a\\b[^>]+?title\\s*=\\s*")([^"]+)("[^>]*>)/ie', '"$1" . strip_tags("$2") . "$3"', $text)
SpankMarvin
Forum Newbie
Posts: 17
Joined: Mon Jul 27, 2009 1:08 am

Re: Using regular expressions to remove text from attribute

Post by SpankMarvin »

tasairis wrote:Okay, so for each title in an <a> tag, strip HTML tags from it?

Code: Select all

<a href="/link" title="<b>Title</b>"> becomes
<a href="/link" title="Title">

Code: Select all

preg_replace('/(<a\\b[^>]+?title\\s*=\\s*")([^"]+)("[^>]*>)/ie', '"$1" . strip_tags("$2") . "$3"', $text)
Absolutely fantastic. Thank you so much. This does precisely what I was after.

Would you mind briefly explaining the symbols you have used here? I have tried a couple of books and various online tutorials to get my head around this, but was unable to work it out. I'd really appreciate a little breakdown (or if you know of a good tutorial!) so I can learn from this.

Thanks again for your help, wonderful.

John

EDIT: P.S. I'm thinking of making a small WP tutorial about what I was trying to achieve, and will credit you for this last stage when I do. If you have a website or something for me to link to, PM me or pop it in here, and I'll link to you with my credit!
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: Using regular expressions to remove text from attribute

Post by requinix »

Ah, sorry, playing Crysis.

Code: Select all

preg_replace('/(<a\\b[^>]+?title\\s*=\\s*")([^"]+)("[^>]*>)/ie', '"$1" . strip_tags("$2") . "$3"', $text)
Here goes:

Code: Select all

(<a\b[^>]+?title\s*=\s*")([^"]+)("[^>]*>)
 
Three parts:
(<a\b[^>]+?title\s*=\s*")
- \b is a word boundary - matches a location (between a \w and a not-\w)
- [^>]+? is at least one character that isn't a > but matches as few as possible
- \s is whitespace (space, tab, newline, etc)
 
([^"]+)
- [^"]+ is at least one character that isn't a " and matches as many as it possibly can
 
("[^>]*>)
- [^>]* is some number (could be zero) of not-> characters. Also matches as many as possible
All together:
$1 is <a, some number of not-> characters, title, maybe some spaces, =, maybe some more spaces, and a ".
$2 is some number of not-" characters.
$3 is a ", any number of not-> characters, and a >.
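
One way to see the three parts for yourself is to run the pattern through preg_match and print the captures. This is only a sketch: the sample tag is made up for illustration, and the /e flag is left off because nothing is being replaced here.

```php
<?php
// Hypothetical sample tag, chosen so that all three groups are non-trivial.
$html = '<a href="/link" title="<b>Title</b>" rel="external">';

// Same pattern as above, without /e, used only to inspect the captures.
$pattern = '/(<a\b[^>]+?title\s*=\s*")([^"]+)("[^>]*>)/i';

if (preg_match($pattern, $html, $m)) {
    echo $m[1], "\n"; // the tag up to and including the opening quote of title
    echo $m[2], "\n"; // the raw title contents, tags included
    echo $m[3], "\n"; // the closing quote through the end of the tag
}
```

With this input, $m[2] holds the title text with its tags still in it, which is the piece that gets handed to strip_tags().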

The /i flag means to do a case-insensitive search (so it works with <A TITLE=""> too) and /e means that the replacement text is actually PHP code to evaluate.
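
A note for anyone reading this on a newer PHP: the /e flag was deprecated in PHP 5.5 and removed in PHP 7.0, and because it evaluates text taken from the input as code it was always a risky tool. A rough equivalent, assuming the same pattern and the same three captured parts, is to pass a callback to preg_replace_callback. The function name strip_title_tags below is made up for this sketch.

```php
<?php
// Hypothetical helper name; the pattern is the one from this thread minus /e.
function strip_title_tags(string $html): string
{
    $pattern = '/(<a\b[^>]+?title\s*=\s*")([^"]+)("[^>]*>)/i';

    $result = preg_replace_callback(
        $pattern,
        // $m[1], $m[2] and $m[3] are the three captured parts described above.
        function (array $m): string {
            return $m[1] . strip_tags($m[2]) . $m[3];
        },
        $html
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo strip_title_tags('<a href="/link" title="<b>Title</b>">Title</a>'), "\n";
// prints: <a href="/link" title="Title">Title</a>
```

The callback receives the captures as plain strings, so no escaping step is involved and nothing from the input is ever executed. The description of the evaluated replacement string that follows applies to the older /e approach only.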

Code: Select all

"$1" . strip_tags("$2") . "$3"
preg_replace will replace each $X with the corresponding captured subpattern (after escaping the quotes and backslashes in it with backslashes), so the evaluated code could look something like

Code: Select all

"<a href=\"http://forums.devnetwork.net\" title=\"" . strip_tags("<b>DevNetwork Forums</b>") . "\" rel=\"external\">"
SpankMarvin
Forum Newbie
Posts: 17
Joined: Mon Jul 27, 2009 1:08 am

Re: Using regular expressions to remove text from attribute

Post by SpankMarvin »

Thanks so much for taking the time to explain. Really useful stuff.