
PCRE and japanese unicode

Posted: Thu Apr 02, 2009 9:46 pm
by tr3online
Hi there,

I am fairly new to scripting and regex. I already have a thread on this forum, but I figured I would post separately since this is a different question.

A forewarning, there is going to be unicode-japanese text in this message! If you see boxes, sorry!

First off, I am trying to strip data using regex from a URL, $url.

The contents of this $url are as follows :
LYRICS=<p align="center"><b>????/????</b><br><br>???????/???????<br><br>??????????<br>???????<br>??????<br>???????<br>??????
Where I am trying to use regex to select the points in blue.

So far, I have been able to get the data up until (marked in red): LYRICS=<p align="center"><b>????/????</b><br><br>???????/???????<br><br>??????????<br>???????<br>??????<br>???????<br>??????

My script is as follows:

Code:

 
<?php
$page = file_get_contents($url);
 
preg_match_all('@LYRICS=<p align="center"><b>([^/]+)\X{3}/@', $page, $match1);
 
preg_match_all('@/\X{3}([^/]+)</b><br><br>@',$page,$match2);
 
preg_match_all('@</b><br><br>\X{12}([^/]+)\X{3}/@', $page, $match3);
 
echo "<br />";
echo $match1[1][0];
echo "<br />";
echo $match2[1][0];
echo "<br />";
echo $match3[1][0];
 
?>
 
One Japanese character is 3 bytes wide in UTF-8, so you will see me using \X{3} to skip over Japanese characters / spaces.
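To illustrate the "width of 3" observation, here is a minimal sketch. It assumes the file and the data are UTF-8 encoded, and the sample character is only a stand-in, not text from the actual page. Without the u modifier PCRE walks the subject byte by byte, while with it one Japanese character counts as a single character.

```php
<?php
// Assumption: this source file is saved as UTF-8.
$s = '曲'; // U+66F2, a sample character only

// Byte length of the string.
$byteLen = strlen($s);

// Without /u, "." matches one byte at a time.
$bytes = preg_match_all('/./s', $s, $m1);

// With /u, "." matches one whole UTF-8 character.
$chars = preg_match_all('/./su', $s, $m2);

echo $byteLen, " ", $bytes, " ", $chars, "\n";
```

If that holds for the real data, adding the u modifier should make counts such as {3} unnecessary, because each Japanese character is then matched as one unit.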

The issue is, once I get to LYRICS=<p align="center"><b>????/????</b><br><br>???????/??????? I am unable to use \X anymore due to the fact that there is already a " / " with unicode characters following it (LYRICS=<p align="center"><b>????/????</b><br><br>???????/???????)

I was reading about PCRE and Unicode, and it says you should be able to use \x{utf-code} to match individual characters. The only issue is that ? (or alternatively ?), when used in a regex such as:

Code:

 
preg_match_all('@\x{66F2}([^/]+)<br><br>@',$page,$match4);
 
returns:
Warning: preg_match_all() [function.preg-match-all]: Compilation failed: character value in \x{...} sequence is too large at offset 7
I notice on fileformat.info that both Unicode characters are listed as "not a valid unicode character."

Can anyone recommend what I should do in order to grab the data I need?

Thanks!

Re: PCRE and japanese unicode

Posted: Thu Apr 02, 2009 10:01 pm
by tr3online
Also, just using :

Code:

 
preg_match_all('@???([^/]+)<br><br>@',$page,$match4);
 
does not work.

Thanks!

Re: PCRE and japanese unicode

Posted: Fri Apr 03, 2009 1:33 am
by Apollo
tr3online wrote: The contents of this $url are as follows :
LYRICS=<p align="center"><b>????/????</b><br><br>???????/???????<br><br>??????????<br>???????<br>??????<br>???????<br>??????
Where I am trying to use regex to select the points in blue.
If I understand correctly, you need only stuff between html tags (i.e. <tag>bla</tag> should yield bla), and if there's "bla1 / bla2" you need those separately, and if there's "bla1: bla2" you only need bla2. Right?

This should do it:

Code:

preg_match_all('#[>:/] *([^:<>/$]+?) *(?:(?=[</])|$)#',$page,$matches);
print_r($matches[1]);
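As a rough illustration of how the pattern above behaves, here is a sketch run against a short made-up line (the sample text and the $line variable are assumptions, not the real page). Each match starts at ">", ":" or "/", lazily captures a run of characters that are none of : < > / $, and stops just before the next "<" or "/". Because all the delimiters are ASCII, it works on UTF-8 text even without the u modifier.

```php
<?php
// Hypothetical sample line; assumes this file is saved as UTF-8.
$line = '<b>歌/BB</b><br><br>x: CC<br>';

preg_match_all('#[>:/] *([^:<>/$]+?) *(?:(?=[</])|$)#', $line, $matches);

// "歌" and "BB" come from the "bla1/bla2" case, "CC" from "x: CC".
print_r($matches[1]);
```

Note that a "/" inside a value is treated as a separator too, so a value such as a/b/c would come back as three separate elements.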

Re: PCRE and japanese unicode

Posted: Fri Apr 03, 2009 12:58 pm
by tr3online
Hmmmm...

That's an interesting idea! I didn't even think of it like that at all.

Thanks for the fresh perspective!

I get an array of 32 elements, but I can manipulate each element before I export to CSV etc.

This should make everything a lot more manageable (if I can do the code, haha). I'll give it a shot.

Thanks again.

Re: PCRE and japanese unicode

Posted: Sat Apr 04, 2009 3:03 pm
by tr3online
Resurrecting this thread because I ran into an issue with the regex.

Apparently, the data sometimes contains "/" characters inside the listed values, which splits them into many more array elements. It also means I can no longer rely on each value landing at a predictable index.

The current line I am trying to read from is:
LYRICS=<p align="center"><b>????/????</b><br><br>??????/a/b/c/d/e?/???????/a/b/c/d/e<br><br>??????????<br>???????<br>??????<br>???????<br>??????
Using the following regex:

Code:

 
preg_match_all('#[>:/] *([^:<>/$]+?) *(?:(?=[</])|$)#',$page,$matches);
 
Is there a way to get just the stuff highlighted in blue into one array value? The ideal way would be to use the Unicode characters to mark the beginning and end of a string, but I've been unable to do this so far (i.e. getting everything from start: ?? to end: ?? would yield ??/a/b/c/d/e).

With the current regex, I am getting:
[2] => ??????
[3] => a
[4] => b?
[..] => etc.
Thanks!
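One possible way to anchor on the Unicode labels, sketched under stated assumptions: the page is UTF-8, and the labels are 作詞 and 作曲 followed by an ASCII colon. Those labels and the sample line are guesses, because the original Japanese text was lost from the post, so they would need to be swapped for the real ones.

```php
<?php
// Hypothetical line; labels and values are assumptions, not the real data.
// Assumes this file is saved as UTF-8.
$page = 'LYRICS=<p align="center"><b>Title/Artist</b><br><br>'
      . '作詞:AA/a/b/c/d/e /作曲:BB/a/b/c/d/e<br><br>line one<br>line two';

// /u makes PCRE treat pattern and subject as UTF-8, so \x{...} can name
// the label characters. Each capture runs lazily from its label to the
// next known delimiter, so a "/" inside the value is kept.
$pattern = '@\x{4F5C}\x{8A5E}:(.+?)\s*/\x{4F5C}\x{66F2}:(.+?)<br>@u';

$found = preg_match($pattern, $page, $m);
if ($found) {
    echo $m[1], "\n"; // value after the first label
    echo $m[2], "\n"; // value after the second label
}
```

The design choice here is to match on the fixed labels rather than on "/", since "/" can also appear inside the values.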