To match a word outside an XHTML tag, one must first be able to match what is inside an XHTML tag, and this is complicated by the fact that the tag may itself contain nested tags. Keeping track of nested structures like this is difficult using ordinary regex alone, but it can be done by crafting a regex that takes advantage of the recursive expression abilities of the PCRE library (which is used by PHP's preg_* family of regex functions).
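As a warm-up, here is a minimal sketch of PCRE recursion on a simpler nested structure: balanced parentheses. The pattern and sample string are made up for illustration; the only point is that (?R) re-enters the whole expression, so each opening parenthesis has to be closed by its own partner.

```php
<?php
// Illustrative only: match outermost balanced parenthesised groups.
// \(            an opening parenthesis
// (?: ... )*    followed by any number of:
//   [^()]++       runs of non-parenthesis characters, or
//   (?R)          a nested balanced group (recursion into the whole pattern)
// \)            and finally the matching closing parenthesis
$balanced = '/\((?:[^()]++|(?R))*\)/';

$text = 'a (b (c) d) e (f)';
preg_match_all($balanced, $text, $m);

foreach ($m[0] as $group) {
    echo $group, "\n";
}
```

The XHTML regex uses the same idea, except that it recurses into a named group with (?P>TAG) rather than into the whole pattern.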
Here is a commented script that matches and lists all well-formed XHTML tags (each of which may contain nested tags).
Code:
<?php // File: test.php
$XHTMLpatternLong = '%
(?P<TAG> # capture outermost XHTML in named group: "TAG"
# XHTML tag can have 2 forms. First is: <open>contents<close>
<(?P<TAGNAME>\w++) # capture tag name in "TAGNAME" group
[^>]*+ # match any/all opening tag attributes
(?<!/)> # make sure opening tag is not self closing
(?: # begin non-capture group for tag contents
[^<]++ # either a string of regular non-tag text
| # or...
(?P>TAG) # recursive expression for nested XHTML tag!
)* # grab * between outer open and close tags
</(?P=TAGNAME)> # match outermost closing tag (named group)
| # or... second XHTML tag form is: <self closing tag />
<\w[^>]*/> # a self closing tag
) # end "TAG" named capture group
%six';
$XHTMLpatternShort = '%(?P<TAG><(?P<TAGNAME>\w++)[^>]*+(?<!/)>(?:[^<]++|(?P>TAG))*</(?P=TAGNAME)>|<\w[^>]*/>)%si';
$data = file_get_contents('XHTML.txt');
if (($matchcount = preg_match_all($XHTMLpatternShort, $data, $matches)) > 0) {
printf("%d matches found.\n", $matchcount);
for ($i = 0; $i < $matchcount; $i++) {
printf("TAG[%d]:\n%s\n", $i +1, $matches["TAG"][$i]);
}
} else {
echo("BAD! No XHTML tags found");
}
?>
Note that this regex will fail miserably if your code is not well-formed XHTML: every tag that is opened must have a matching closing tag, and any tag that has no closing tag must be self-closing (i.e. it must be <br /> not <br>).
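To make that concrete, here is a small sketch that runs the same expression as $XHTMLpatternShort against two made-up strings that differ only in how the line break is written. The sample strings and variable names are assumptions for illustration.

```php
<?php
// Same expression as $XHTMLpatternShort in the script above.
$pattern = '%(?P<TAG><(?P<TAGNAME>\w++)[^>]*+(?<!/)>(?:[^<]++|(?P>TAG))*</(?P=TAGNAME)>|<\w[^>]*/>)%si';

$wellFormed = '<p>line one<br />line two</p>'; // self-closing <br />
$malformed  = '<p>line one<br>line two</p>';   // <br> is never closed

$goodCount = preg_match_all($pattern, $wellFormed, $good);
$badCount  = preg_match_all($pattern, $malformed, $bad);

printf("well-formed: %d match(es)\n", $goodCount);
printf("malformed:   %d match(es)\n", $badCount);
```

In the second string the unclosed <br> is treated as an opening tag that expects a </br>, so neither it nor the surrounding <p> can be matched.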
Now that we can match tags and grab everything inside them, we can build a regex that will match a word outside:
Out of time for now. Coming soon...
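In the meantime, here is a rough sketch of one way it might be approached; treat it as an assumption rather than the finished regex. It takes "outside a tag" to mean outside every element the tag pattern can match, and it works by offering the regex two alternatives: a whole tag (which is matched and then ignored) or the word itself. The function name wordsOutsideTags, the WORD group and the sample text are all hypothetical.

```php
<?php
// Hypothetical helper, not part of the original script: returns every
// occurrence of $word that is not inside a well-formed XHTML element.
function wordsOutsideTags(string $word, string $text): array
{
    // The tag expression from $XHTMLpatternShort, without delimiters.
    $tag = '(?P<TAG><(?P<TAGNAME>\w++)[^>]*+(?<!/)>(?:[^<]++|(?P>TAG))*</(?P=TAGNAME)>|<\w[^>]*/>)';

    // Either consume a complete tag, or capture the word in the "WORD" group.
    // Assumes $word starts and ends with a word character, so \b behaves as expected.
    $pattern = '%' . $tag . '|(?P<WORD>\b' . preg_quote($word, '%') . '\b)%si';

    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $found = [];
    foreach ($matches as $m) {
        // Keep only the matches where the WORD alternative was the one used.
        if (isset($m['WORD']) && $m['WORD'] !== '') {
            $found[] = $m['WORD'];
        }
    }
    return $found;
}

$sample = 'foo <b>foo <i>foo</i></b> Foo <br /> foo';
$outside = wordsOutsideTags('foo', $sample);
printf("%d occurrence(s) outside tags\n", count($outside));
```

Because a tag is consumed in one piece whenever the match starts at its opening bracket, any copy of the word inside it is never offered to the second alternative. The same well-formedness caveat applies here as for the tag regex itself.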
Hope this helps!