PERL compatible Regex problem



EricS
Forum Contributor
Posts: 183
Joined: Thu Jul 11, 2002 12:02 am
Location: Atlanta, Ga


Post by EricS »

I'm trying to write a function that will search a string for all occurrences of & and replace them with &amp;. The kicker is that I want to ignore all HTML entities in the process. So when the function runs across &acute; or &#700;, for example, it doesn't replace that ampersand with &amp;.

I wrote a regular expression pattern that can detect if something is an html entity.

Code:

$htmlEntityPattern = '/&[#|\w]\w{1,6};/';
I can't seem to get to the next step which is detecting all &'s but ignoring anything that is an html entity.

Do any of you regular expression gurus have an idea?

Thanks for your time.
EricS
protokol
Forum Contributor
Posts: 353
Joined: Fri Jun 21, 2002 7:00 pm
Location: Cleveland, OH

Post by protokol »

Maybe just match every & that has a space after it?
EricS
Forum Contributor
Posts: 183
Joined: Thu Jul 11, 2002 12:02 am
Location: Atlanta, Ga

Post by EricS »

I had certainly thought about that. I just worry about cases where people accidentally forget to put a space before and after the &.

That is a temporary fix, but I'm hoping for something that is a little more concrete. I appreciate the help though.
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

I think you're better off using a lexer/parser to analyze your HTML first instead of blindly using regular expressions to replace stuff...

Btw, what is wrong with http://www.php.net/htmlentities ?
EricS
Forum Contributor
Posts: 183
Joined: Thu Jul 11, 2002 12:02 am
Location: Atlanta, Ga

Post by EricS »

htmlentities() doesn't correctly handle all the HTML entities that I run into. For instance, I run into &#8226; quite often. It's not actually a named HTML entity but a numbered one, and htmlentities() changes the &#8226; into &amp;#8226;, which causes it to be displayed incorrectly.

I'm not familiar with lexers/parsers. And why would writing a regular expression that can identify HTML entities and numbered entities be a "blind" solution?
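
The double encoding described above is easy to reproduce. The sketch below assumes PHP 5.2.3 or later, where htmlentities() accepts a fourth `double_encode` argument. The sample string and variable names are made up for illustration. It shows one possible way around the problem, not necessarily what anyone in this thread used.

```php
<?php
// Hypothetical sample input: a bare ampersand plus an existing numeric entity.
$input = 'Tom & Jerry &#8226; done';

// Default behaviour: every & is encoded, so the existing entity gets double-encoded.
$doubleEncoded = htmlentities($input, ENT_QUOTES, 'UTF-8');

// With double_encode set to false, existing entities are left alone.
$preserved = htmlentities($input, ENT_QUOTES, 'UTF-8', false);

echo $doubleEncoded, "\n";
echo $preserved, "\n";
```

The first call turns `&#8226;` into `&amp;#8226;`, while the second keeps it intact and still escapes the bare `&`.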
rehfeld
Forum Regular
Posts: 741
Joined: Mon Oct 18, 2004 8:14 pm

Post by rehfeld »

Have you looked at the user-contributed notes for htmlentities?

I bet there's something in there about this and a solution to it.
redmonkey
Forum Regular
Posts: 836
Joined: Thu Dec 18, 2003 3:58 pm

Post by redmonkey »

You could try....
(not tested)

Code:

<?php
$text = preg_replace('/&(?!(?:#\d{1,4}|\w{1,6});)/', '&amp;', $text);
?>
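
A quick way to see what this pattern does is to run it on a string that mixes bare ampersands with entities. This is only a sketch: the sample text and the helper name `escapeBareAmpersands` are made up here, and the replacement is assumed to be `&amp;`.

```php
<?php
// Hypothetical helper wrapping the pattern above; the name is illustrative only.
function escapeBareAmpersands($text)
{
    // Match & only when it is NOT followed by "#digits;" or "word;".
    return preg_replace('/&(?!(?:#\d{1,4}|\w{1,6});)/', '&amp;', $text);
}

$sample = 'A & B &amp; C &#8226; D&E';
echo escapeBareAmpersands($sample), "\n";
// prints: A &amp; B &amp; C &#8226; D&amp;E
```

Note that `\w{1,6};` also accepts anything that merely looks like an entity, such as `&foo;`, so those are left untouched as well.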
EricS
Forum Contributor
Posts: 183
Joined: Thu Jul 11, 2002 12:02 am
Location: Atlanta, Ga

Post by EricS »

Thanks to everyone for your assistance. The following is the function I settled on to solve my problem.

Code:

<?php
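function htmlentities2($string) {
	// Start from PHP's standard entity table (quotes included).
	$translationTable = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
	// Map & to itself so strtr() leaves ampersands alone at this stage.
	$translationTable[chr(38)] = '&';
	// Then escape only the ampersands that do not start a named, decimal, or hex entity.
	return preg_replace("/&(?![A-Za-z]{0,4}\w{2,3};|#[0-9]{2,4};|#x[0-9a-fA-F]{2,4};)/", "&amp;", strtr($string, $translationTable));
}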
?>
I understand almost everything except for two characters in the regular expression pattern.

The ?! doesn't make any sense to me. I thought the ? made quantifiers match the fewest characters possible, but I thought you put the ? after the quantifier you are limiting. In this case I see it at the beginning of the group. I also don't know what the ! does. Is it a negator?

Anyway, the function works beautifully, and I was just wondering if someone could help me demystify those two characters of the regular expression pattern. I'm really trying to get a good grasp of regular expressions, but this one has stumped me. Thanks a million.
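
On the `?!` question: in PCRE, `(?!` is a single token that opens a negative lookahead, so the `?` there is not a quantifier modifier at all. The group asserts that the text ahead does not match, without consuming any of it. Below is a small sketch contrasting the two uses of `?`; the sample strings and variable names are made up for illustration.

```php
<?php
// Negative lookahead: "foo" matches only where it is NOT followed by "bar".
preg_match_all('/foo(?!bar)/', 'foobar foobaz', $lookahead, PREG_OFFSET_CAPTURE);
// $lookahead[0] holds one match: "foo" at offset 7 (the one before "baz").

// Lazy quantifier: a ? placed AFTER a quantifier makes it match as little as possible.
preg_match('/<.+?>/', '<a><b>', $lazy);   // $lazy[0] is "<a>"
preg_match('/<.+>/', '<a><b>', $greedy);  // $greedy[0] is "<a><b>"

print_r($lookahead[0]);
echo $lazy[0], ' ', $greedy[0], "\n";
```

In the function above, this means each `&` is replaced only when the text right after it does not look like the rest of an entity.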
rehfeld
Forum Regular
Posts: 741
Joined: Mon Oct 18, 2004 8:14 pm

Post by rehfeld »

EricS
Forum Contributor
Posts: 183
Joined: Thu Jul 11, 2002 12:02 am
Location: Atlanta, Ga

Post by EricS »

Yes Danielson, it all makes sense now.

Thanks rehfeld for the help. I'll figure these blasted regular expressions out one day!