remove a link in html code when the href is a given tag

Posted: Mon Jan 10, 2011 7:43 am
by trucmuche
Hello !

I have a PHP variable $htmlmessage which contains the HTML code of a webpage, including meta tags, images, links and all kinds of other stuff.

In this variable $htmlmessage, I would like to remove the 'a' links whose href is set to "[MYPERSONALTAG]".

I tried this regex:

Code:

$htmlmessage = preg_replace('/<a href\=[\"]\[MYPERSONALTAG\][\"]\>(.*?)<\/a>/si','',$htmlmessage);
but the problem is that I can't be sure the href attribute always comes first: it could come after another attribute in the <a> tag...
The tag could be
<a href="[MYPERSONALTAG]"> ... </a>
but it could also be
<a href="[MYPERSONALTAG]" id="..."> ... </a>
<a href="[MYPERSONALTAG]" class="..."> ... </a>
<a id="..." href="[MYPERSONALTAG]"> ... </a>
<a class="..." href="[MYPERSONALTAG]"> ... </a>
or many other possibilities...

We know neither the number of attributes in the 'a' tag nor their order, but if an 'a' tag contains the attribute 'href="[MYPERSONALTAG]"', I would like to remove it.

I don't know how to do this with a regex, and I don't know how to do it any other way...

Could you help me?

Many thanks in advance for your help !!

trucmuche
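
To make the limitation described above concrete, here is a small check of the regex from this post against three of the listed tag variants. The sample strings and the $results variable are only illustrative assumptions, not part of the original question.

```php
<?php
// The regex exactly as posted above: it expects href to be the only attribute.
$naive = '/<a href\=[\"]\[MYPERSONALTAG\][\"]\>(.*?)<\/a>/si';

// Three of the variants listed in the post (link text is made up).
$samples = [
    '<a href="[MYPERSONALTAG]">x</a>',
    '<a href="[MYPERSONALTAG]" id="a">x</a>',
    '<a id="a" href="[MYPERSONALTAG]">x</a>',
];

$results = [];
foreach ($samples as $html) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$html] = preg_match($naive, $html) === 1;
    printf("%-42s => %s\n", $html, $results[$html] ? 'matched' : 'missed');
}
```

Only the first variant is matched; any extra attribute before or after the href makes the pattern miss the link.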

Re: remove a link in html code when the href is a given tag

Posted: Mon Jan 10, 2011 8:42 am
by jankidudel
Maybe you will find help here: http://api.jquery.com/remove/

Re: remove a link in html code when the href is a given tag

Posted: Mon Jan 10, 2011 10:40 am
by trucmuche
Nice... But is there a way to achieve this without JavaScript (with PHP only)?
Actually, I would like to embed this into a newsletter program which is written entirely in PHP, so I think a PHP-only solution would be better... and this is precisely why I asked my question on this PHPdn forum :)

Could you help me with that?

Thank you very much for your help! :-)

Trucmuche

Re: remove a link in html code when the href is a given tag

Posted: Mon Jan 10, 2011 11:07 am
by jankidudel
Sorry, no. Handling nested tags with regex is very hard. I had this problem myself recently and spent probably 3-4 days searching forums about it. I found a couple of functions, but after reading comments saying they fail in one or more cases, I decided not to do it with PHP.
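
For readers who want a PHP-only route that avoids regex altogether, one possibility is PHP's DOM extension. The following is only a minimal sketch, assuming the DOM extension is enabled and that $htmlmessage holds a complete page; the function name removeTaggedLinksWithDom is hypothetical and nobody in this thread proposed it.

```php
<?php
// Sketch: drop every <a> element whose href is exactly "[MYPERSONALTAG]".
// Assumes ext-dom is available; the function name is made up for illustration.
function removeTaggedLinksWithDom(string $html, string $tag = '[MYPERSONALTAG]'): string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so copy it before removing nodes.
    $links = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($links as $link) {
        if ($link->getAttribute('href') === $tag && $link->parentNode !== null) {
            $link->parentNode->removeChild($link);
        }
    }

    // saveHTML() re-serializes the whole document, so whitespace and quoting may change.
    return $doc->saveHTML();
}
```

Because the parser normalizes attributes, their order and quoting style do not matter here. The trade-off is that the output is a re-serialized page rather than the original text with one piece cut out, and character encoding may need extra care when the page does not declare its charset.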

Re: remove a link in html code when the href is a given tag

Posted: Mon Jan 10, 2011 12:21 pm
by trucmuche
Hello !

Maybe this would be possible without regex... I can imagine an algorithm which should work, I think:
  • Detect a tag which begins with "<a" and store its position in the string (in an integer variable $begin).
  • Detect the position of the corresponding closing tag "</a>": this is the first "</a>" after the $begin-th character of the string. Store its position in a variable $end (also an integer).
  • Extract the substring which begins with the "<a" and ends with the "</a>" into a variable called $link, say: this is easy since we know the position of the first character ($begin) and of the last one ($end).
  • Scan the $link variable to see whether it contains the substring "[MYPERSONALTAG]". If so, the function outputs the original string without the $link.
This would do the trick, wouldn't it?

Thank you very much for your advice!

Trucmuche
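
The four steps above can be sketched with plain string functions as follows. This is only a rough illustration under a few assumptions: the function name removeTaggedLinksByScan is hypothetical, closing tags are written exactly as "</a>", and an extra check (not in the steps above) skips tags such as <abbr> that merely start with "<a".

```php
<?php
// Sketch of the scan-and-cut algorithm; the name is made up for illustration.
function removeTaggedLinksByScan(string $html, string $tag = '[MYPERSONALTAG]'): string
{
    $offset = 0;
    // Step 1: find the next "<a" (case-insensitive) and remember its position.
    while (($begin = stripos($html, '<a', $offset)) !== false) {
        // Added assumption: only accept "<a" followed by whitespace or ">",
        // so that <abbr>, <address>, <area> ... are not mistaken for links.
        $next = substr($html, $begin + 2, 1);
        if (!in_array($next, [' ', "\t", "\r", "\n", '>'], true)) {
            $offset = $begin + 2;
            continue;
        }

        // Step 2: the first "</a>" after $begin closes the link.
        $close = stripos($html, '</a>', $begin);
        if ($close === false) {
            break; // unclosed link: leave the rest untouched
        }
        $end = $close + strlen('</a>');

        // Step 3: extract the whole link, opening tag through closing tag.
        $link = substr($html, $begin, $end - $begin);

        // Step 4: cut the link out if it contains the tag anywhere
        // (this also triggers when the tag appears in the link text).
        if (strpos($link, $tag) !== false) {
            $html = substr($html, 0, $begin) . substr($html, $end);
            $offset = $begin;
        } else {
            $offset = $end;
        }
    }
    return $html;
}
```

As step 4 is written, a link is also removed when "[MYPERSONALTAG]" shows up in its text or in another attribute, so this is looser than checking the href value itself.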

Re: remove a link in html code when the href is a given tag

Posted: Wed Jan 12, 2011 12:57 pm
by ridgerunner
This script does what you are asking using one regex:

Code:

<?php // test.php Rev:2011-01-12_11:00
// Regular expression goodness:
$re = '% # Match <A>..</A> tag having magic attribute=value pair.
    <a\b                               # Opening literal chars for <A> tag.
    # Match non-magic attribute/value pairs before magic attribute:
    (?>                                # Zero or more non-"MY_HREF" attributes
      \s++                             # Attributes separated by whitespace
      (?!                              # Look ahead to make sure attribute not magic.
        href\s*+=\s*+                  # Magic attribute is named HREF.
        (?:([\'"]))?+                  # $1: Optional value open quote delimiter.
        \[MYPERSONALTAG\]              # Magic attribute value is "[MYPERSONALTAG]".
        (?(1)\1)                       # If there was an open quote, match closing quote.
      )                                # If this is not the magic attribute, proceed.
      [\w\-.:]++                       # Required non-magic attribute name (HTML).
      (?>                              # Optional attribute value
        = \s*+
        (?>                            # Value can be one of three quoted alternatives:
          \'[^\']*+\'                  # Single quoted value
        | "[^"]*+"                     # or double quoted value
        | [\w\-.:]++                   # or un-quoted value.
        )                              # End group of attribute value alternatives.
      )?+                              # End optional attribute value.
    )*+                                # Zero or more non-magic attribute/values.
    # Match the magic attribute/value pair:
    \s++                               # whitespace before magic attribute name
    href\s*+=\s*+                      # Magic attribute name.
    (?>([\'"]))?+                      # $2: Magic value optional open quote delimiter.
    \[MYPERSONALTAG\]                  # Magic attribute value = "[MYPERSONALTAG]".
    (?(2)\2)                           # If there was an open quote, match closing quote.
    # Match attribute/value pairs after magic attribute:
    (?>                                # Zero or more attributes.
      \s++                             # Attributes separated by whitespace
      [\w\-.:]++                       # Required non-magic attribute name.
      (?:                              # Optional attribute value
        = \s*+                         # Attribute value requires equals sign delimiter.
        (?:                            # Value can be one of three quoted alternatives:
          \'[^\']*+\'                  # Single quoted value
        | "[^"]*+"                     # or double quoted value
        | [\w\-.:]++                   # or un-quoted value.
        )                              # End group of attribute value alternatives.
      )?+                              # End optional attribute value.
    )*+                                # Zero or more non-magic attribute/values.
    \s*>                               # Whitespace and closing delimiter of <A> open tag.
    # Match <A> tag contents.
    (?>                                # Zero or more of one of the following two alternatives:
      [^<]++(?:(?!</a\s*+>)<[^<]*+)*+  # Either start on a non-"<", match one or more non-"<".
    |       (?:(?!</a\s*+>)<[^<]*+)++  # or start on a "<" (non-"</a>"), and zero or more non-"<".
    )*+                                #
    </a\s*+>                           # Closing literal chars.
    %ix'; // This regex uses: i="ignorecase" and x="free-spacing" modes.

// Test data.
$data = '
    VALID LINKS TO BE MATCHED:
    <a href=[MYPERSONALTAG]>test 001</a>
    <a href="[MYPERSONALTAG]">test 002</a>
    <a href=\'[MYPERSONALTAG]\'>test 003</a>
    <a class=before href=[MYPERSONALTAG]>test 004</a>
    <a class="before" href=[MYPERSONALTAG]>test 005</a>
    <a class=\'before\' href=[MYPERSONALTAG]>test 006</a>
    <a href=[MYPERSONALTAG] class=after>test 007</a>
    <a href=[MYPERSONALTAG] class="after">test 008</a>
    <a href=[MYPERSONALTAG] class=\'after\'>test 009</a>
    <a class=before href=[MYPERSONALTAG] class=after>test 010</a>
    <a class="before" href=[MYPERSONALTAG] class="after">test 011</a>
    <a class=\'before\' href=[MYPERSONALTAG] class=\'after\'>test 012</a>

    INVALID LINKS NOT TO BE MATCHED:
    <a href=[NOTMYPERSONALTAG]>test 001</a>
    <a href="[NOTMYPERSONALTAG]">test 002</a>
    <a href=\'[NOTMYPERSONALTAG]\'>test 003</a>
    <a class=before href=[NOTMYPERSONALTAG]>test 004</a>
    <a class="before" href=[NOTMYPERSONALTAG]>test 005</a>
    <a class=\'before\' href=[NOTMYPERSONALTAG]>test 006</a>
    <a href=[NOTMYPERSONALTAG] class=after>test 007</a>
    <a href=[NOTMYPERSONALTAG] class="after">test 008</a>
    <a href=[NOTMYPERSONALTAG] class=\'after\'>test 009</a>
    <a class=before href=[NOTMYPERSONALTAG] class=after>test 010</a>
    <a class="before" href=[NOTMYPERSONALTAG] class="after">test 011</a>
    <a class=\'before\' href=[NOTMYPERSONALTAG] class=\'after\'>test 012</a>
    ';
// Process the $data through the regex $re. Matches are placed in $matches.
$count = preg_match_all($re, $data, $matches);
if ($count > 0) {
    printf("There were %d matches:\n", $count);
    for ($i = 0; $i < $count; ++$i) { // $matches[0] is array of all complete matches.
        printf("Match %d of %d:\t\"%s\".\n", $i + 1, $count, $matches[0][$i]);
    }
} else echo("There were no matches.\n");
?>
Here is the output from the script:
There were 12 matches:
Match 1 of 12: "<a href=[MYPERSONALTAG]>test 001</a>".
Match 2 of 12: "<a href="[MYPERSONALTAG]">test 002</a>".
Match 3 of 12: "<a href='[MYPERSONALTAG]'>test 003</a>".
Match 4 of 12: "<a class=before href=[MYPERSONALTAG]>test 004</a>".
Match 5 of 12: "<a class="before" href=[MYPERSONALTAG]>test 005</a>".
Match 6 of 12: "<a class='before' href=[MYPERSONALTAG]>test 006</a>".
Match 7 of 12: "<a href=[MYPERSONALTAG] class=after>test 007</a>".
Match 8 of 12: "<a href=[MYPERSONALTAG] class="after">test 008</a>".
Match 9 of 12: "<a href=[MYPERSONALTAG] class='after'>test 009</a>".
Match 10 of 12: "<a class=before href=[MYPERSONALTAG] class=after>test 010</a>".
Match 11 of 12: "<a class="before" href=[MYPERSONALTAG] class="after">test 011</a>".
Match 12 of 12: "<a class='before' href=[MYPERSONALTAG] class='after'>test 012</a>".
Hope this helps!
:)

Re: remove a link in html code when the href is a given tag

Posted: Thu Jan 13, 2011 3:04 am
by trucmuche
Wow! Really impressive! It seems to work perfectly! :-) You're a regex magician ;-)
Thank you very very much !! :-)
I'm going to test it extensively now :-)

Thanks again ! :-)

Re: remove a link in html code when the href is a given tag

Posted: Fri Jan 14, 2011 11:37 pm
by ridgerunner
This improved version also matches whitespace between an attribute name and its equals sign. It passes the new test case 13.

Code:

<?php // test.php Rev:2011-01-14_22:00
// Regular expression goodness:
$re = '% # Match <A>..</A> tag having magic attribute=value pair.
    <a\b                               # Opening literal chars for <A> tag.
    # Match non-magic attribute/value pairs before magic attribute:
    (?>                                # Zero or more non-"MY_HREF" attributes
      \s++                             # Attributes separated by whitespace
      (?!                              # Look ahead to make sure attribute not magic.
        href\s*+=\s*+                  # Magic attribute is named HREF.
        (?:([\'"]))?+                  # $1: Optional value open quote delimiter.
        \[MYPERSONALTAG\]              # Magic attribute value is "[MYPERSONALTAG]".
        (?(1)\1)                       # If there was an open quote, match closing quote.
      )                                # If this is not the magic attribute, proceed.
      [\w\-.:]++                       # Required non-magic attribute name (HTML).
      (?>                              # Optional attribute value
        \s*+ = \s*+                    # Attribute value requires equals sign delimiter.
        (?>                            # Value can be one of three quoted alternatives:
          \'[^\']*+\'                  # Single quoted value
        | "[^"]*+"                     # or double quoted value
        | [\w\-.:]++                   # or un-quoted value.
        )                              # End group of attribute value alternatives.
      )?+                              # End optional attribute value.
    )*+                                # Zero or more non-magic attribute/values.
    # Match the magic attribute/value pair:
    \s++                               # whitespace before magic attribute name
    href\s*+=\s*+                      # Magic attribute name.
    (?>([\'"]))?+                      # $2: Magic value optional open quote delimiter.
    \[MYPERSONALTAG\]                  # Magic attribute value = "[MYPERSONALTAG]".
    (?(2)\2)                           # If there was an open quote, match closing quote.
    # Match attribute/value pairs after magic attribute:
    (?>                                # Zero or more attributes.
      \s++                             # Attributes separated by whitespace
      [\w\-.:]++                       # Required non-magic attribute name.
      (?:                              # Optional attribute value
        \s*+ = \s*+                    # Attribute value requires equals sign delimiter.
        (?:                            # Value can be one of three quoted alternatives:
          \'[^\']*+\'                  # Single quoted value
        | "[^"]*+"                     # or double quoted value
        | [\w\-.:]++                   # or un-quoted value.
        )                              # End group of attribute value alternatives.
      )?+                              # End optional attribute value.
    )*+                                # Zero or more non-magic attribute/values.
    \s*>                               # Whitespace and closing delimiter of <A> open tag.
    # Match <A> tag contents.
    (?>                                # Zero or more of one of the following two alternatives:
      [^<]++(?:(?!</a\s*+>)<[^<]*+)*+  # Either start on a non-"<", match one or more non-"<".
    |       (?:(?!</a\s*+>)<[^<]*+)++  # or start on a "<" (non-"</a>"), and zero or more non-"<".
    )*+                                #
    </a\s*+>                           # Closing literal chars.
    %ix'; // This regex uses: i="ignorecase" and x="free-spacing" modes.

// Test data.
$data = '
    VALID LINKS TO BE MATCHED:
    <a href=[MYPERSONALTAG]>test 001</a>
    <a href="[MYPERSONALTAG]">test 002</a>
    <a href=\'[MYPERSONALTAG]\'>test 003</a>
    <a class=before href=[MYPERSONALTAG]>test 004</a>
    <a class="before" href=[MYPERSONALTAG]>test 005</a>
    <a class=\'before\' href=[MYPERSONALTAG]>test 006</a>
    <a href=[MYPERSONALTAG] class=after>test 007</a>
    <a href=[MYPERSONALTAG] class="after">test 008</a>
    <a href=[MYPERSONALTAG] class=\'after\'>test 009</a>
    <a class=before href=[MYPERSONALTAG] class=after>test 010</a>
    <a class="before" href=[MYPERSONALTAG] class="after">test 011</a>
    <a class=\'before\' href=[MYPERSONALTAG] class=\'after\'>test 012</a>

  <a class ="before" href=[MYPERSONALTAG] class=\'after\'>test 013</a>

    INVALID LINKS NOT TO BE MATCHED:
    <a href=[NOTMYPERSONALTAG]>test 001</a>
    <a href="[NOTMYPERSONALTAG]">test 002</a>
    <a href=\'[NOTMYPERSONALTAG]\'>test 003</a>
    <a class=before href=[NOTMYPERSONALTAG]>test 004</a>
    <a class="before" href=[NOTMYPERSONALTAG]>test 005</a>
    <a class=\'before\' href=[NOTMYPERSONALTAG]>test 006</a>
    <a href=[NOTMYPERSONALTAG] class=after>test 007</a>
    <a href=[NOTMYPERSONALTAG] class="after">test 008</a>
    <a href=[NOTMYPERSONALTAG] class=\'after\'>test 009</a>
    <a class=before href=[NOTMYPERSONALTAG] class=after>test 010</a>
    <a class="before" href=[NOTMYPERSONALTAG] class="after">test 011</a>
    <a class=\'before\' href=[NOTMYPERSONALTAG] class=\'after\'>test 012</a>
    ';
// Process the $data through the regex $re. Matches are placed in $matches.
$count = preg_match_all($re, $data, $matches);
if ($count > 0) {
    printf("There were %d matches:\n", $count);
    for ($i = 0; $i < $count; ++$i) { // $matches[0] is array of all complete matches.
        printf("Match %d of %d:\t\"%s\".\n", $i + 1, $count, $matches[0][$i]);
    }
} else echo("There were no matches.\n");
?>
And here is the new output with the extra test case:
There were 13 matches:
Match 1 of 13: "<a href=[MYPERSONALTAG]>test 001</a>".
Match 2 of 13: "<a href="[MYPERSONALTAG]">test 002</a>".
Match 3 of 13: "<a href='[MYPERSONALTAG]'>test 003</a>".
Match 4 of 13: "<a class=before href=[MYPERSONALTAG]>test 004</a>".
Match 5 of 13: "<a class="before" href=[MYPERSONALTAG]>test 005</a>".
Match 6 of 13: "<a class='before' href=[MYPERSONALTAG]>test 006</a>".
Match 7 of 13: "<a href=[MYPERSONALTAG] class=after>test 007</a>".
Match 8 of 13: "<a href=[MYPERSONALTAG] class="after">test 008</a>".
Match 9 of 13: "<a href=[MYPERSONALTAG] class='after'>test 009</a>".
Match 10 of 13: "<a class=before href=[MYPERSONALTAG] class=after>test 010</a>".
Match 11 of 13: "<a class="before" href=[MYPERSONALTAG] class="after">test 011</a>".
Match 12 of 13: "<a class='before' href=[MYPERSONALTAG] class='after'>test 012</a>".
Match 13 of 13: "<a class ="before" href=[MYPERSONALTAG] class='after'>test 013</a>".
:)
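
The test script above only lists the matches. To actually remove the links, as the original question asked, the same pattern can be handed to preg_replace(). The sketch below uses a compacted copy of the Rev:2011-01-14 regex with the comments stripped; the function name stripTaggedLinks is hypothetical, and falling back to the untouched input when PCRE reports an error is an assumption about the desired behaviour.

```php
<?php
// Compacted copy of the regex above (still in free-spacing mode, comments removed).
$re = '%<a\b
    (?> \s++
        (?! href\s*+=\s*+ (?:([\'"]))?+ \[MYPERSONALTAG\] (?(1)\1) )
        [\w\-.:]++
        (?> \s*+ = \s*+ (?> \'[^\']*+\' | "[^"]*+" | [\w\-.:]++ ) )?+
    )*+
    \s++ href\s*+=\s*+ (?>([\'"]))?+ \[MYPERSONALTAG\] (?(2)\2)
    (?> \s++ [\w\-.:]++
        (?: \s*+ = \s*+ (?: \'[^\']*+\' | "[^"]*+" | [\w\-.:]++ ) )?+
    )*+
    \s*>
    (?> [^<]++ (?:(?!</a\s*+>)<[^<]*+)*+
      |        (?:(?!</a\s*+>)<[^<]*+)++
    )*+
    </a\s*+>%ix';

// Hypothetical helper: delete every matching link, including its text.
function stripTaggedLinks(string $html, string $re): string
{
    $result = preg_replace($re, '', $html);
    // preg_replace() returns null on a PCRE error; keep the input in that case.
    return $result ?? $html;
}

$htmlmessage = 'Keep <a href="http://example.com">this</a>, '
             . 'drop <a class ="x" href=\'[MYPERSONALTAG]\'>that</a>.';
echo stripTaggedLinks($htmlmessage, $re), "\n";
```

Replacing with an empty string removes the link text as well, which is what the very first attempt in this thread did too.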

Re: remove a link in html code when the href is a given tag

Posted: Sat Jan 22, 2011 10:46 pm
by fwycruiser118
Hey ridgerunner, nice regex. Why do you add two repetition quantifiers here?
[\w\-.:]++

Re: remove a link in html code when the href is a given tag

Posted: Wed Jan 26, 2011 11:34 am
by ridgerunner
fwycruiser118 wrote: Hey ridgerunner, nice regex. Why do you add two repetition quantifiers here?
[\w\-.:]++
The second plus sign is a "possessive" modifier applied to the single "one-or-more" plus quantifier. It says: "Match one or more of the preceding, but don't give any back (for backtracking)". This possessive plus modifier can be applied to any quantifier to make it "possessive", i.e. ?+, *+, ++, {1,5}+. The much more difficult question of how and why to make best use of the possessive quantifier is certainly beyond the scope of a single forum post! Jeffrey Friedl covers this and other advanced regex topics in his masterpiece: Mastering Regular Expressions (3rd Edition).

The short answer: possessive quantifiers and atomic groups can be used to improve the efficiency of a regex and can also help avoid catastrophic super-linear backtracking in certain expressions.
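
A tiny illustration of "don't give any back", using a made-up subject string and patterns that are not from the regex above:

```php
<?php
$subject = '12345';

// Greedy: \d+ first grabs "12345", then backtracks one digit so the literal 5 can match.
$greedy = preg_match('/\d+5/', $subject);

// Possessive: \d++ grabs "12345" and refuses to give the 5 back, so the match fails.
$possessive = preg_match('/\d++5/', $subject);

// An atomic group behaves the same way as the possessive quantifier here.
$atomic = preg_match('/(?>\d+)5/', $subject);

printf("greedy=%d possessive=%d atomic=%d\n", $greedy, $possessive, $atomic);
// prints "greedy=1 possessive=0 atomic=0"
```

In the big regex the possessive forms never change which links match; they only stop the engine from retrying alternatives that cannot succeed.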