
unicode unprintable character for regex

Posted: Fri Jul 06, 2007 11:11 am
by pppswing
Hi,

I need to get the Norwegian interwiki link from a Wikipedia page.
So I made a pattern to use with preg_match().

At first I used this:
$no_pattern='/<li class="interwiki-no"><a href="(.*)">Norsk \(b/u';


But it doesn't work: there is an unprintable UTF-8 character between ">" and "Norsk" in the Wikipedia markup :?

The hex codes are E2 80 AA, so it is three bytes, which is the UTF-8 encoding of U+202A, Left-to-Right Embedding.
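
A quick way to check the byte sequence, as a minimal sketch assuming PHP 7 or later (for the "\u{...}" escape). The variable name $lre is only for illustration. It also shows how a single '.' behaves on this character with and without the /u modifier:

```php
<?php
// U+202A (Left-to-Right Embedding) written as a code point escape (PHP 7+).
$lre = "\u{202A}";

echo bin2hex($lre), "\n";   // the UTF-8 bytes in hex
echo strlen($lre), "\n";    // length in bytes, not characters

// Without /u the dot matches a single byte; with /u it matches a whole code point.
var_dump(preg_match('/^.$/',  $lre));
var_dump(preg_match('/^.$/u', $lre));
```

So the character is one code point but three bytes, and a one-character wildcard can only cover it when the pattern runs in UTF-8 mode.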

I don't know how to complete my pattern to make it work properly.

Thanks.
:D

Posted: Fri Jul 06, 2007 12:07 pm
by superdezign
Put in a one-character wildcard.

Posted: Fri Jul 06, 2007 12:27 pm
by pppswing
I tried a '.',
but it didn't work.

Posted: Fri Jul 06, 2007 12:30 pm
by volka
Can we see this <a> element somewhere live and in action?

Posted: Fri Jul 06, 2007 12:37 pm
by pppswing
I just fixed it: since the three bytes described correspond to an extended Unicode sequence, I put \X in the pattern to match it.

It's in the PHP pattern syntax documentation:


Unicode character properties

Since PHP 4.4.0 and 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. They are:

\p{xx}
a character with the xx property
\P{xx}
a character without the xx property
\X
an extended Unicode sequence
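
For anyone landing here later, a rough sketch of what the fixed pattern could look like. The sample markup, URL and variable names are made up for illustration and are not the real Wikipedia output. The second pattern is an alternative that names the code point explicitly with \x{202A} instead of using \X. PHP 7 or later is assumed for the "\u{...}" escapes:

```php
<?php
// Hypothetical sample markup; the URL and link text are invented for this example.
$html = '<li class="interwiki-no"><a href="http://no.wikipedia.org/wiki/Eksempel">'
      . "\u{202A}Norsk (bokm\u{E5}l)\u{202C}"
      . '</a></li>';

// The original pattern with \X added where the invisible character sits.
$with_X = '/<li class="interwiki-no"><a href="(.*)">\XNorsk \(b/u';

// Alternative: match U+202A by code point, optional in case it is absent.
$with_codepoint = '/<li class="interwiki-no"><a href="(.*)">\x{202A}?Norsk \(b/u';

foreach ([$with_X, $with_codepoint] as $pattern) {
    if (preg_match($pattern, $html, $m)) {
        echo $m[1], "\n";   // the captured href value
    }
}
```

Both patterns need the /u modifier, since \X and \x{...} values above FF only work in UTF-8 mode. The explicit code point version has the small advantage of not swallowing some other, unexpected character in that position.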