unicode unprintable character for regex

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

pppswing
Forum Commoner
Posts: 33
Joined: Thu Jun 10, 2004 2:04 am
Location: Tallinn, Estonia

Post by pppswing »

Hi,

I need to get the Norwegian interwiki link from a Wikipedia page,
so I wrote a pattern to use with preg_match().

At first I tried this:
$no_pattern='/<li class="interwiki-no"><a href="(.*)">Norsk \(b/u';


But it doesn't work: on Wikipedia there is an unprintable UTF-8 character between ">" and "Norsk" :?

The hex bytes are E2 80 AA, so it is three bytes, which correspond to a single character: Left-to-Right Embedding, U+202A.
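
As a quick check of that byte sequence, here is a minimal sketch that decodes the three bytes by hand and confirms that PCRE in UTF-8 mode treats them as one character. The variable names are only illustrative, not part of the original pattern.

```php
<?php
// The three bytes reported in the post.
$bytes = "\xE2\x80\xAA";

// Decode a 3-byte UTF-8 sequence by hand: 1110xxxx 10xxxxxx 10xxxxxx.
$codePoint = ((ord($bytes[0]) & 0x0F) << 12)
           | ((ord($bytes[1]) & 0x3F) << 6)
           |  (ord($bytes[2]) & 0x3F);

printf("U+%04X\n", $codePoint); // prints U+202A

// With the /u modifier PCRE sees one character, not three bytes.
$matchesOneChar   = preg_match('/^.$/u', $bytes);
$matchesCodePoint = preg_match('/^\x{202A}$/u', $bytes);
```

Both preg_match() calls should return 1 here, since the subject is exactly one code point in UTF-8 mode.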

I don't know how to complete my pattern to make it work properly.

Thanks.
:D
Last edited by pppswing on Fri Jul 06, 2007 12:23 pm, edited 1 time in total.
superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Post by superdezign »

Put in a one character wild card.
pppswing
Forum Commoner
Posts: 33
Joined: Thu Jun 10, 2004 2:04 am
Location: Tallinn, Estonia

Post by pppswing »

I tried a '.', but that didn't work.
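
It is not known from the thread why the single '.' failed. One possible cause, shown here only as an assumption, is a pattern running without the /u modifier: in that case '.' matches a single byte, so it cannot cover the three bytes of U+202A. The sample subject string below is made up for illustration.

```php
<?php
// Hypothetical subject: ">" followed by U+202A (bytes E2 80 AA) and "Norsk".
$subject = ">\xE2\x80\xAANorsk";

// Byte mode (no /u): '.' consumes one byte only, so the match fails.
$byteMode = preg_match('/>.Norsk/', $subject);

// UTF-8 mode (/u): '.' consumes the whole three-byte character.
$utf8Mode = preg_match('/>.Norsk/u', $subject);

// Byte mode again, but with three wildcards, one per byte.
$threeBytes = preg_match('/>...Norsk/', $subject);

var_dump($byteMode, $utf8Mode, $threeBytes);
```

Under these assumptions the first call returns 0 and the other two return 1.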
volka
DevNet Evangelist
Posts: 8391
Joined: Tue May 07, 2002 9:48 am
Location: Berlin, ger

Post by volka »

Can we see this <a> element somewhere live and in action?
pppswing
Forum Commoner
Posts: 33
Joined: Thu Jun 10, 2004 2:04 am
Location: Tallinn, Estonia

Post by pppswing »

I just fixed it. Since the three bytes described correspond to an extended Unicode sequence, I put \X in the pattern to match them.

It's in the PHP pattern syntax documentation:


Unicode character properties

Since PHP 4.4.0 and 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. They are:

\p{xx}
a character with the xx property
\P{xx}
a character without the xx property
\X
an extended Unicode sequence
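
For reference, here is a sketch of what the fixed pattern might look like. The exact final pattern was not posted, so the placement of \X is an assumption, and the sample HTML and URL are invented for the example. A second variant names the character explicitly with \x{202A}, which is stricter than \X because \X would accept any character in that position.

```php
<?php
// Hypothetical sample of the interwiki markup; the URL is made up.
// U+202A (E2 80 AA) precedes "Norsk", U+202C (E2 80 AC) follows the label.
$html = '<li class="interwiki-no"><a href="http://no.wikipedia.org/wiki/Eksempel">'
      . "\xE2\x80\xAA" . "Norsk (bokm\xC3\xA5l)" . "\xE2\x80\xAC"
      . '</a></li>';

// Variant 1: \X (extended Unicode sequence) in place of the hidden character.
$patternX = '/<li class="interwiki-no"><a href="(.*)">\XNorsk \(b/u';
$foundX = preg_match($patternX, $html, $mX);

// Variant 2: match Left-to-Right Embedding explicitly by code point.
$patternExplicit = '/<li class="interwiki-no"><a href="(.*)">\x{202A}Norsk \(b/u';
$foundExplicit = preg_match($patternExplicit, $html, $mExplicit);

echo $mX[1], "\n";        // the captured href
echo $mExplicit[1], "\n"; // same href
```

Both variants need the /u modifier, as in the original pattern. If several links share one line, replacing (.*) with [^"]* would keep the capture from running past the closing quote of the href.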