preg_match[_all]($regex) question

Posted: Wed Apr 01, 2009 8:21 pm
by tr3online
Hey guys, I'm pretty new to regex and could use some help.
I've tried to find some examples of what I can use before coming here to ask, but they all seem far too verbose for me to grasp.
In any event, here's my issue:

I am trying to extract some data from a fairly long string of the following form (it's in another language, so I'll just use random English):

KASI=<p align="center"><b>Hi there / What's up</b><br><br>kakusi: this / sakkyoku: that<br><br>lots of data

Ideally, I want to take that string and pull out "Hi there", "What's up", "this", "that", and "lots of data", which runs to the end of the string with no line break.
I was trying to use preg_match with a regex to extract them. I imagine preg_match_all may be more suitable?

In any event, I'd like some help coming up with the regex to help me isolate things from a string like that.

Thanks!

Code snippet:

$page = file_get_contents($url);
$pattern = 'regex here'; // placeholder for the pattern I need
preg_match($pattern, $page, $match); // same variable names as defined above
var_dump($match);
echo $match[1];

Re: preg_match[_all]($regex) question

Posted: Thu Apr 02, 2009 3:48 am
by prometheuzz
You need to be more precise about your requirements before being able to construct a regex (or get help constructing one). What happened to the word "sakkyoku"? Why did you leave it out?

Re: preg_match[_all]($regex) question

Posted: Thu Apr 02, 2009 1:56 pm
by tr3online
Sakkyoku is more like a tag.
Ideally the regex would search along the lines of:

Starts with : KASI=<p align="center"><b> , ends with /
So (*) in KASI=<p align="center"><b>(*) / would be selected

then starts with / and ends with </b><br><br>
so (*) in / (*)</b><br><br> would be selected

kakusi: (*) /
sakkyoku: (*) /
<br><br>(*) [end]
would be selected

Most of the data in (*) won't be ANSI, if that matters. It will be UTF-8 characters, as it's in a foreign language.

Does that help at all?

Re: preg_match[_all]($regex) question

Posted: Thu Apr 02, 2009 1:57 pm
by tr3online
Or is regex not the best way to go about doing that?

Re: preg_match[_all]($regex) question

Posted: Thu Apr 02, 2009 6:49 pm
by semlar

preg_match_all('@KASI=<p align="center"><b>([^/]+)/@', $page, $matches);
Matches all occurrences of [KASI=<p align="center"><b>] followed by stuff and then [/] in $page, returns array as $matches.

I have no idea how regex handles foreign characters (hiragana/kanji).
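
On the foreign-character question: PHP's PCRE functions work on raw bytes unless the pattern carries the `u` modifier, which makes the pattern and subject be treated as UTF-8. A negated class like `[^/]+` should still behave on UTF-8 text without it, because `/` is a single ASCII byte that never occurs inside a multibyte sequence. The modifier starts to matter once `.` or a character class has to match one whole character. A small sketch, assuming the file is saved as UTF-8 and using a made-up sample string:

```php
<?php
// Hypothetical sample; the real page content is not shown in the thread.
$page = 'KASI=<p align="center"><b>ふたり / うた</b><br><br>';

// Without /u, "." matches a single byte.
preg_match('@<b>(.)@', $page, $bytes);

// With /u, "." matches a single UTF-8 character.
preg_match('@<b>(.)@u', $page, $chars);

echo strlen($bytes[1]), "\n"; // 1: one byte, not a complete character
echo $chars[1], "\n";         // ふ
```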

I don't know whether your example is supposed to be one continuous string. If it is, you would do something like this:

$pattern = '@KASI=<p align="center"><b>([^/]+)/((?:[^<]|<(?!/b><br><br>))+)</b><br><br>(.*)@';
preg_match($pattern, $page, $match);
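
The pattern above yields three captures. To get all five pieces listed earlier in the thread, one option is to spell out the `kakusi:` and `sakkyoku:` labels as well. The following is only a sketch under assumptions: it uses the English sample from the first post as the subject, assumes the labels appear literally and in that order, and adds the `s` and `u` modifiers so the final group can span line breaks and the text is read as UTF-8.

```php
<?php
// Sample string taken from the first post; real pages may differ.
$page = 'KASI=<p align="center"><b>Hi there / What\'s up</b><br><br>'
      . 'kakusi: this / sakkyoku: that<br><br>lots of data';

// Lazy groups plus \s* trim the spaces around each captured piece.
$pattern = '@KASI=<p align="center"><b>\s*([^/]+?)\s*/\s*(.+?)\s*</b><br><br>'
         . 'kakusi:\s*([^/]+?)\s*/\s*sakkyoku:\s*(.+?)\s*<br><br>(.*)@su';

if (preg_match($pattern, $page, $match)) {
    // $match[0] is the whole match; the captures start at index 1.
    print_r(array_slice($match, 1));
}
```

If any of the labels or tags vary between pages, the pattern would need loosening, and an HTML parser may be the more robust route.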

Re: preg_match[_all]($regex) question

Posted: Thu Apr 02, 2009 7:05 pm
by tr3online
Thanks for the input. I tried to run what you said with the following code:

 
<?php
$page = file_get_contents('$url');
preg_match_all('@LYRICS=<p align="center"><b>([^/]+)/@', $page, $matches);
var_dump(matches);
echo $matches;
?>
 
which returns
string(7) "matches" Array
Maybe this isn't working right?

An exact example with unicode of what I'm trying to parse is:
LYRICS=<p align="center"><b>????/????</b><br><br>???????/???????<br><br>??????????<br>???????<br>??????
Where blue marks the spots I want to strip.

Thanks in advance!
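
A note on the `string(7) "matches" Array` output: `var_dump(matches)` has no `$`, so older PHP versions read `matches` as an undefined constant and fell back to the string "matches" (PHP 8 raises an error instead), and `echo` on an array only prints the word `Array`. A corrected sketch, using a hypothetical stand-in for the downloaded page:

```php
<?php
// Hypothetical stand-in for the page; the real content is not shown here.
$page = 'LYRICS=<p align="center"><b>ふたり / うた</b><br><br>';

preg_match_all('@LYRICS=<p align="center"><b>([^/]+)/@', $page, $matches);

// With the $ in place, var_dump shows the nested result array.
var_dump($matches);

// $matches[1] holds the first capture group of every match.
echo $matches[1][0], "\n";
```

The captured text keeps its trailing space because `[^/]+` runs right up to the slash; `trim()` would remove it.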

Re: preg_match[_all]($regex) question

Posted: Thu Apr 02, 2009 7:11 pm
by tr3online
Oh, I tried running print_r($matches); without the var_dump, which returned:
Array
(
[0] => Array
(
[0] => LYRICS=<p align="center"><b>ふたり /
)

[1] => Array
(
[0] => ふたり 
)

)
So I guess that worked ;) Thanks a lot. I just need to grab the other info now :)

Re: preg_match[_all]($regex) question

Posted: Thu Apr 02, 2009 7:14 pm
by semlar
If you know it's only going to be on the page once, use the preg_match function, since it stops after the first match.

If you need to match the same pattern multiple times on the page use preg_match_all.
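
The difference can be seen in a small sketch. The sample string and the shortened pattern here are made up for illustration:

```php
<?php
// Hypothetical sample containing the same marker twice.
$page = 'KASI=<b>one /</b> KASI=<b>two /</b>';
$pattern = '@KASI=<b>([^/]+)/@';

// preg_match stops at the first match; $first[1] is its capture.
$found = preg_match($pattern, $page, $first);

// preg_match_all keeps going; $all[1] lists the capture from every match.
$count = preg_match_all($pattern, $page, $all);

echo $found, ' ', trim($first[1]), "\n";                          // 1 one
echo $count, ' ', implode(',', array_map('trim', $all[1])), "\n"; // 2 one,two
```

Note the different shapes: `$first[1]` is a string, while `$all[1]` is an array with one entry per match.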

I've been trying to learn how to read Japanese online (basically started last week), and katakana and hiragana are pretty simple, but kanji are really confusing for me -.-

Re: preg_match[_all]($regex) question

Posted: Thu Apr 02, 2009 7:20 pm
by tr3online
If you need any help with Japanese let me know ;)

Appreciate the regex help. I'm so lost with it :| New to scripting.