
Finding a word outside of HTML

Posted: Wed Nov 18, 2009 6:20 am
by gillykid
Hi

I am pretty new to this and can't solve this problem. I want to find all instances of a word that occur outside of any containing HTML tag. For example, consider the following sentence:

Code: Select all

 
This is a before a link, <a href="">this is inside a link tag</a>, this is a after link tag
 
Say we are searching for the word "link": I would want it to match only the first and the last occurrence.

Any help :( ?

Re: Finding a word outside of HTML

Posted: Wed Nov 18, 2009 6:50 am
by Apollo

Code: Select all

$bla = 'This is a before a link, <a href="">this is inside a link tag</a>, this is a after link tag';
 
$bla = preg_replace('#((^|</[^>]*>)[^<]*)link#',"\\1click-thingy",$bla);
 
// Hooray, $bla is now: 'This is a before a click-thingy, <a href="">this is inside a link tag</a>, this is a after click-thingy tag'

Re: Finding a word outside of HTML

Posted: Wed Nov 18, 2009 7:21 am
by gillykid
Amazing! You got it first time right out of the box! Thank you! :D

Re: Finding a word outside of HTML

Posted: Wed Nov 18, 2009 8:14 am
by gillykid
Ah, I have come across a bug. The regex only matches the last occurrence before a tag. So if there are two or more occurrences before a tag, or no tags at all, it will only ever replace the last one. For example:

Code: Select all

 
 
$bla = 'This is a link before a link, <a href="">this is inside a link tag</a>, this is a after link tag';
 
$bla = preg_replace('#((^|</[^>]*>)[^<]*)link#',"\\1click-thingy",$bla);
 
// Doh, $bla is now: 'This is a link before a click-thingy, <a href="">this is inside a link tag</a>, this is a after click-thingy tag'
 
 
Any thoughts?

Re: Finding a word outside of HTML

Posted: Sun Nov 22, 2009 8:15 pm
by ridgerunner
To match a word outside an XHTML tag, one must first be able to match what is inside an XHTML tag, and this is complicated by the fact that the tag may itself contain nested tags. Keeping track of nested structures like this is difficult to do using regex alone, but it can be done by crafting a regex which takes advantage of the recursive expression abilities of the PCRE library (which is used by PHP's preg_* family of regex functions).
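
As a warm-up for the recursion idea, here is a minimal sketch on a simpler nested structure: balanced parentheses. It is only an illustration, not part of the XHTML solution; the variable name $balanced and the sample string are made up for this example. The (?R) construct tells PCRE to re-enter the whole pattern at that point.

```php
<?php
// Illustration only: match one balanced pair of parentheses,
// including any pairs nested inside it.
// (?R) recurses into the entire pattern, so every "(" that is
// opened must be closed by its own ")".
$balanced = '%\((?:[^()]++|(?R))*\)%';

preg_match($balanced, 'f(a(b)c) and (d', $m);
echo $m[0], "\n"; // prints "(a(b)c)"
```

The unclosed "(d" at the end of the sample string is never matched, which is the same behaviour the XHTML pattern shows for tags that are opened but not closed.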

Here is a commented script that matches and lists all well-formed XHTML tags (each of which may contain nested tags).

Code: Select all

<?php // File: test.php
$XHTMLpatternLong = '%
(?P<TAG>  # capture outermost XHTML in named group: "TAG"
  # XHTML tag can have 2 forms. First is: <open>contents<close>
  <(?P<TAGNAME>\w++)    # capture tag name in "TAGNAME" group
  [^>]*+                # match any/all opening tag attributes
  (?<!/)>               # make sure opening tag is not self closing
  (?:                   # begin non-capture group for tag contents
    [^<]++              # either a string of regular non-tag text
  |                     # or...
    (?P>TAG)            # recursive expression for nested XHTML tag!
  )*                    # grab * between outer open and close tags
  </(?P=TAGNAME)>       # match outermost closing tag (named group)
| # or... second XHTML tag form is: <self closing tag />
  <\w[^>]*/>            # a self closing tag
)  # end "TAG" named capture group
%six';
 
$XHTMLpatternShort = '%(?P<TAG><(?P<TAGNAME>\w++)[^>]*+(?<!/)>(?:[^<]++|(?P>TAG))*</(?P=TAGNAME)>|<\w[^>]*/>)%si';
 
$data = file_get_contents('XHTML.txt');
if (($matchcount = preg_match_all($XHTMLpatternShort, $data, $matches)) > 0) {
    printf("%d matches found.\n", $matchcount);
    for ($i = 0; $i < $matchcount; $i++) {
        printf("TAG[%d]:\n%s\n", $i +1, $matches["TAG"][$i]);
    }
} else {
    echo("BAD! No XHTML tags found");
}
?>
Note that this regex will fail miserably if your code is not well-formed XHTML. That is, every tag that is opened must have a matching closing tag, and tags that have no closing tag must be self-closing (i.e. must be <br /> not <br>).

Now that we can match tags and grab everything inside, we can build one that will match a word outside:

Code: Select all

Out of time for now. Coming soon...
Hope this helps!
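
The follow-up regex does not appear in this thread. What follows is only a sketch of one possible way to finish the idea, not ridgerunner's own solution: put the tag pattern and the search word into a single alternation and let preg_replace_callback decide what to do with each match. The function name replaceOutsideTags and the group name WORD are made up for this example, and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper (not from the thread): replace $word wherever it
// occurs outside a well-formed XHTML element.
function replaceOutsideTags($text, $word, $replacement)
{
    // First alternative: the recursive TAG pattern from the script above.
    // Second alternative: the search word as a whole word, in group WORD.
    // Note: the "i" flag also makes the word match case-insensitively.
    $pattern = '%(?P<TAG><(?P<TAGNAME>\w++)[^>]*+(?<!/)>'
             . '(?:[^<]++|(?P>TAG))*</(?P=TAGNAME)>|<\w[^>]*/>)'
             . '|(?P<WORD>\b' . preg_quote($word, '%') . '\b)%si';

    return preg_replace_callback(
        $pattern,
        function ($m) use ($replacement) {
            // WORD is only non-empty when the second alternative matched.
            if (isset($m['WORD']) && $m['WORD'] !== '') {
                return $replacement;
            }
            return $m[0]; // a whole element: put it back unchanged
        },
        $text
    );
}

$bla = 'This is a link before a link, <a href="">this is inside a link tag</a>, this is a after link tag';
echo replaceOutsideTags($bla, 'link', 'click-thingy'), "\n";
// This is a click-thingy before a click-thingy, <a href="">this is inside a link tag</a>, this is a after click-thingy tag
```

Because the regex engine tries the TAG alternative first at every position, a complete element is consumed in one piece and the word inside it is never seen by the second alternative. The same caveat as above applies: if the markup is not well-formed XHTML, the TAG alternative fails and words inside the broken element will be replaced too.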