PERL compatible Regex problem
Posted: Fri Dec 03, 2004 9:44 am
by EricS
I'm trying to write a function that will search a string for all occurrences of & and replace them with &amp;. The kicker is that I want to ignore all HTML entities in the process. So when the function runs across &acute; or &#700;, for example, it doesn't replace that ampersand with &amp;.
I wrote a regular expression pattern that can detect if something is an html entity.
Code:
$htmlEntityPattern = '/&[#|\w]\w{1,6};/';
I can't seem to get to the next step which is detecting all &'s but ignoring anything that is an html entity.
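As a rough illustration of what that detection pattern does and does not catch, here is a minimal sketch. The sample string is made up for this example, and the pattern is assumed to read `[#|\w]` at the start of the character class.

```php
<?php
// Pattern from the post above: "&", then "#", "|" or a word character
// (inside brackets the "|" is a literal pipe), then 1-6 word characters, then ";".
$htmlEntityPattern = '/&[#|\w]\w{1,6};/';

// Hypothetical sample text mixing bare ampersands with entities.
$sample = 'Tom & Jerry &acute; &#700; AT&T';

// Collect every substring the pattern recognises as an entity.
preg_match_all($htmlEntityPattern, $sample, $matches);

print_r($matches[0]); // lists "&acute;" and "&#700;" only
```

The two bare ampersands are not matched, which is the right behaviour for detection. The open question is how to invert it so that only the bare ones get replaced.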
Do any of you regular expression gurus have an idea?
Thanks for your time.
EricS
Posted: Fri Dec 03, 2004 9:54 am
by protokol
Maybe just match every & that has a space after it?
Posted: Fri Dec 03, 2004 10:14 am
by EricS
I had certainly thought about that. I just worry about occurrences where people accidentally forget to put a space before and after the &.
That is a temporary fix, but I'm hoping for something that is a little more concrete. I appreciate the help though.
Posted: Fri Dec 03, 2004 10:49 am
by timvw
I think you're better off with a lexer/parser to analyze your HTML code first instead of blindly using regular expressions to replace stuff...
Btw, what is wrong with
http://www.php.net/htmlentities ?
Posted: Fri Dec 03, 2004 12:12 pm
by EricS
htmlentities() doesn't correctly handle all the HTML entities that I run into. For instance, I run into the numeric entity for • quite often. It's not actually a named HTML entity but a numbered one, and htmlentities() converts the & at the start of it into &amp;, which causes an incorrect interpretation.
I'm not familiar with lexers/parsers. And why would writing a regular expression that can identify HTML entities and numbered entities be a "blind" solution?
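A small sketch of the double-encoding behaviour described here. The input string and the choice of `&#8226;` are assumptions for illustration only. The second call assumes a PHP version in which htmlentities() accepts the fourth `$double_encode` argument (PHP 5.2.3 and later), which would not have been available when this thread was written.

```php
<?php
// Hypothetical input that already contains a numeric entity and a bare ampersand.
$input = 'Item &#8226; A & B';

// Default behaviour: every "&" is encoded, including the one that starts the entity.
$double = htmlentities($input, ENT_QUOTES, 'UTF-8');

// With $double_encode set to false, existing entities are left untouched.
$single = htmlentities($input, ENT_QUOTES, 'UTF-8', false);

echo $double, "\n"; // Item &amp;#8226; A &amp; B
echo $single, "\n"; // Item &#8226; A &amp; B
```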
Posted: Fri Dec 03, 2004 12:55 pm
by rehfeld
Have you looked at the user-contributed notes for htmlentities?
I bet there's something in there about this and a solution to it.
Posted: Fri Dec 03, 2004 1:58 pm
by redmonkey
You could try....
(not tested)
Code:
<?php
// Replace every & that is NOT the start of a numeric or named entity.
$text = preg_replace('/&(?!(?:#\d{1,4}|\w{1,6});)/', '&amp;', $text);
?>
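To see how a pattern of this shape behaves, here is a quick sketch run against a made-up sample string. It assumes the replacement text is meant to be `&amp;`.

```php
<?php
// Hypothetical sample: two bare ampersands and three existing entities.
$text = 'Tom & Jerry &amp; friends &#149; AT&T &copy;';

// "&" is replaced only when it is not followed by "#" + 1-4 digits + ";"
// or by 1-6 word characters + ";".
$text = preg_replace('/&(?!(?:#\d{1,4}|\w{1,6});)/', '&amp;', $text);

echo $text, "\n"; // Tom &amp; Jerry &amp; friends &#149; AT&amp;T &copy;
```

Note that this sketch accepts anything entity-shaped, so a made-up name such as `&foo;` would also be left alone.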
Posted: Fri Dec 03, 2004 3:11 pm
by EricS
Thanks to everyone for your assistance. The following is the function I settled on to solve my problem.
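Code:
<?php
function htmlentities2($string) {
    // Start from PHP's own entity table, including both quote characters.
    $translationTable = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
    // Leave & untouched in the strtr() pass; the regex below handles it.
    $translationTable[chr(38)] = '&';
    // Encode only the ampersands that do not already begin an entity.
    return preg_replace("/&(?![A-Za-z]{0,4}\w{2,3};|#[0-9]{2,4};|#x[0-9a-fA-F]{2,4};)/", "&amp;", strtr($string, $translationTable));
}
?>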
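A usage sketch for the function above, repeated here so the example runs on its own. The sample input is hypothetical, and the replacement string is assumed to be `&amp;`.

```php
<?php
function htmlentities2($string) {
    $translationTable = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
    // Keep & as-is during strtr() so existing entities are not broken up.
    $translationTable[chr(38)] = '&';
    return preg_replace(
        "/&(?![A-Za-z]{0,4}\w{2,3};|#[0-9]{2,4};|#x[0-9a-fA-F]{2,4};)/",
        "&amp;",
        strtr($string, $translationTable)
    );
}

// Hypothetical input: quotes, a bare &, a named entity, a numeric entity, a tag.
$result = htmlentities2('"AT&T" &copy; &#8226; <b>');

echo $result, "\n"; // &quot;AT&amp;T&quot; &copy; &#8226; &lt;b&gt;
```

The quotes and angle brackets are converted by strtr(), the existing entities pass through unchanged, and only the bare ampersand in "AT&T" is encoded by the regular expression.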
I understand almost everything except for two characters in the regular expression pattern.
The ?! doesn't make any sense to me. I thought the ? was there to make quantifiers match the fewest characters possible, but I thought you put the ? after the quantifier you are limiting. In this case I see it at the beginning of the group. I also don't know what the ! does. Is that a negator?
Anyway, the function works beautifully and I was just wondering if someone could help me demystify those two characters of the regular expressions pattern. I'm really trying to get a good grasp of regular expressions but this one has stumped me. Thanks a million.
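For reference, a minimal sketch of the two different uses of ? in PCRE. A ? after a quantifier makes it lazy, while (?! at the start of a group opens a negative lookahead, which succeeds only when the enclosed pattern does not match at that position and consumes no characters. The sample strings are made up for illustration.

```php
<?php
// 1) "?" after a quantifier: lazy matching.
preg_match('/<.+>/', '<b>bold</b>', $greedy);  // greedy: runs to the last ">"
preg_match('/<.+?>/', '<b>bold</b>', $lazy);   // lazy: stops at the first ">"

// 2) "(?!...)" at the start of a group: negative lookahead.
// Match "q" only when it is NOT immediately followed by "u".
$count = preg_match_all('/q(?!u)/', 'Iraq quit qat', $hits, PREG_OFFSET_CAPTURE);

echo $greedy[0], "\n"; // <b>bold</b>
echo $lazy[0], "\n";   // <b>
echo $count, "\n";     // 2 (the "q" in "Iraq" and the "q" in "qat")
```

In the entity pattern, the same construct means "match & only if what follows is not an entity body ending in ;".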
Posted: Fri Dec 03, 2004 3:18 pm
by rehfeld
Posted: Fri Dec 03, 2004 3:48 pm
by EricS
Yes Danielson, it all makes sense now.
Thanks rehfeld for the help. I'll figure these blasted regular expressions out one day!