I am newish to scripting and regex. I already have a thread on this forum, but I figured I would post again since this is a different question.
A forewarning: there is going to be Japanese Unicode text in this message. If you see boxes, sorry!
First off, I am trying to extract data using regex from the page at a URL, $url.
The contents of this $url are as follows, where I am trying to use regex to select the points in blue:
LYRICS=<p align="center"><b>????/????</b><br><br>???????/???????<br><br>??????????<br>???????<br>??????<br>???????<br>??????
So far, I have been able to get the data up to the point marked in red:
LYRICS=<p align="center"><b>????/????</b><br><br>???????/???????<br><br>??????????<br>???????<br>??????<br>???????<br>??????
My script is as follows:
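<?php
// Fetch the page contents ($url is set elsewhere).
$page = file_get_contents($url);
// First field: text after <b>, leaving three \X units before the "/".
preg_match_all('@LYRICS=<p align="center"><b>([^/]+)\X{3}/@', $page, $match1);
// Second field: skip three \X units after a "/", capture up to </b><br><br>.
preg_match_all('@/\X{3}([^/]+)</b><br><br>@', $page, $match2);
// Third field: skip twelve \X units after </b><br><br>, leaving three units before the next "/".
preg_match_all('@</b><br><br>\X{12}([^/]+)\X{3}/@', $page, $match3);
// Print the first capture of each pattern on its own line.
echo "<br />";
echo $match1[1][0];
echo "<br />";
echo $match2[1][0];
echo "<br />";
echo $match3[1][0];
?>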
The issue is that once I get to LYRICS=<p align="center"><b>????/????</b><br><br>???????/??????? I can no longer rely on \X, because there is already a "/" followed by Unicode characters earlier in the string (LYRICS=<p align="center"><b>????/????</b><br><br>???????/???????), so the pattern matches there instead.
I was reading about PCRE and Unicode, and it says you should be able to use \x{utf-code} to match individual characters. The only issue is that the character ? (which can be seen here, or alternatively ?) fails when used in a regular expression such as:
preg_match_all('@\x{66F2}([^/]+)<br><br>@',$page,$match4);
This produces the following warning:
Warning: preg_match_all() [function.preg-match-all]: Compilation failed: character value in \x{...} sequence is too large at offset 7
I also notice on fileformat.info that both Unicode characters are listed as "not a valid unicode character."
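One thing I have not ruled out yet: as far as I understand the PHP manual, PCRE only accepts \x{...} values above FF when the pattern carries the u (PCRE_UTF8) modifier, and without it you get exactly this "too large" compilation error. Below is a minimal sketch of what I think that would look like. The sample string in $page is made up for illustration (it is not the real page), the "/" and the "<" exclusion in the pattern are adjustments I made so it fits that sample, and it assumes both the script file and the page are UTF-8.

```php
<?php
// Hypothetical sample standing in for the real page contents (UTF-8 assumed).
$page = '<b>作詞/山田</b><br><br>作曲/田中<br><br>';

// The trailing "u" makes PCRE treat both pattern and subject as UTF-8,
// so \x{66F2} is read as the single code point U+66F2.
$matched = preg_match('@\x{66F2}/([^/<]+)<br><br>@u', $page, $m);

if ($matched === 1) {
    echo $m[1], "\n"; // prints 田中 for the sample above
} elseif ($matched === false) {
    // With the u modifier, a subject that is not valid UTF-8 makes preg_match() return false.
    echo "preg_match failed (is the page really UTF-8?)\n";
}
```

If the real page turns out to be in another encoding such as Shift_JIS or EUC-JP, I assume it would have to be converted first (for example with mb_convert_encoding) before a pattern with the u modifier can match anything.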
Can anyone recommend what I should do in order to grab the data I need?
Thanks!