Regular expressions problem
Posted: Wed Jul 24, 2002 12:02 pm
by RandomEngy
Hey, been messing around with regular expressions with some success; they're pretty cool.
What I've been trying to do is get a bunch of publication entries into a database from an HTML file. The HTML file looks something like this:
Code:
<h3 align="center">2002</h3>
<p>Here's a sample publication</p>
<p>Publication number 1</p>
<p>Publication with <b>HTML tags</b> in it</p>
<h3 align="center">2001</h3>
...
I wrote a script that successfully returns the contents of the paragraphs without the <p> and </p> tags:
Code:
preg_match_all("/<p>(.*)<\/p>/", $message, $results);
However, I'd like to find the year of the publication as well, so I thought finding all the paragraphs that are between ">2002<" and ">2001<" would match all publications from 2002. All my tries at expressions have returned an empty result array.
This is my latest attempt:
Code:
preg_match_all("/>2002<.*(<p>(.*)<\/p>.*)+>2001</", $message, $results);
Anyone handy with regular expressions know how to do this?
Posted: Wed Jul 24, 2002 2:34 pm
by twigletmac
Posted: Wed Jul 24, 2002 2:46 pm
by RandomEngy
Yeah, I've read that one and a number of other articles, but it doesn't help me with the problem I have now.
I made a little step forward though. By default the dot doesn't match line breaks, so you have to add the s modifier after the closing delimiter for it to work:
Code:
preg_match_all("/>2002<.*(<p>(.*)<\/p>.*)+.*>2001</s",$message,$results);
That will return the last item in the section I want to take from, but none of the others. :/
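For what it's worth, getting only the last item is how repeated groups behave in PCRE: when a capturing group sits inside a quantifier such as (...)+, each repetition overwrites the previous capture, so only the final one is reported. A minimal sketch, assuming made-up sample data in a hypothetical $sample variable rather than the real file:

```php
<?php
// Hypothetical sample data standing in for the real HTML file.
$sample = ">2002<\n<p>first</p>\n<p>second</p>\n<p>third</p>\n>2001<";

// The group (<p>(.*?)<\/p>\s*) repeats three times, but each
// repetition overwrites what the group captured the time before.
preg_match('/>2002<\s*(<p>(.*?)<\/p>\s*)+>2001</s', $sample, $m);

echo $m[2], "\n"; // prints "third"
```

That is why a single pattern with a repeated group can't hand back every paragraph on its own.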
Posted: Wed Jul 24, 2002 3:23 pm
by RandomEngy
Well, I was able to make a workaround with two calls to preg_match_all, but it's not very pretty. I get what's between the >2002< and >2001<, then I apply my working "/<p>(.*?)<\/p>/s" regular expression to it. It still would be cool to see it done in one expression.
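A rough sketch of what such a two-step workaround might look like (not necessarily the exact code used here). The $message contents, the $section and $publications names, and the use of preg_match for the first step are all assumptions made for illustration:

```php
<?php
// Hypothetical stand-in for the contents of the HTML file.
$message = <<<END
<h3 align="center">2002</h3>
<p>Here's a sample publication</p>
<p>Publication number 1</p>
<p>Publication with <b>HTML tags</b> in it</p>
<h3 align="center">2001</h3>
<p>An older publication</p>
END;

$publications = array();

// Step 1: isolate everything between ">2002<" and ">2001<".
if (preg_match('/>2002<(.*?)>2001</s', $message, $section)) {
    // Step 2: pull each paragraph out of that section only.
    preg_match_all('/<p>(.*?)<\/p>/s', $section[1], $results);
    $publications = $results[1];
}

print_r($publications);
```

With this sample, $publications ends up holding the three 2002 paragraphs and skips the 2001 one.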

Posted: Wed Jul 24, 2002 3:36 pm
by gnu2php
You could try something like this. If you don't understand parts of it, let me know--
Suppose you have some data:
Code:
$data = <<<END
<h3 align="center">2002</h3>
<p>1Here's a sample publication</p>
<p>1Publication number 1</p>
<p>1Publication with <b>HTML tags</b> in it</p>
<h3 align="center">2001</h3>
<p>2Here's a sample publication</p>
<p>2Publication number 2</p>
<p>2Publication with <b>HTML tags</b> in it</p>
<h3 align="center">2000</h3>
<p>3Here's a sample publication</p>
<p>3Publication number 3</p>
<p>3Publication with <b>HTML tags</b> in it</p>
END;
You can grab each of the headings and paragraphs like this:
Code:
$array = array(); // All the items found
$curr_array = array(); // Used in callback function
// Use this regular expression
preg_replace_callback('/(<h3[^>]+>(.*)<\/h3>|<p>(.*)<\/p>)/iUs', 'callback_func', $data);
// And this callback function
function callback_func($matches)
{
    // A heading is found:
    if (preg_match('/<h3[^>]+>(.*)<\/h3>/iUs', $matches[0]))
    {
        if (!empty($GLOBALS['curr_array']))
        {
            array_push($GLOBALS['array'], $GLOBALS['curr_array']);
        }
        $GLOBALS['curr_array'] = array($matches[2]);
    }
    // A paragraph is found:
    else array_push($GLOBALS['curr_array'], $matches[3]);
    return $matches[0]; // So we don't alter $data
}
// Then you'll need to add one more to the end
if (!empty($curr_array)) array_push($array, $curr_array);
// And print the items found
print_r($array);
Posted: Wed Jul 24, 2002 3:58 pm
by RandomEngy
That's really cool; thanks for posting that. But why do you need to return $matches[0] for it not to lose data?
Posted: Wed Jul 24, 2002 5:33 pm
by gnu2php
It's because the return value is what each match is replaced with. If you returned "foo", every heading and paragraph in the string that preg_replace_callback gives back would be replaced with foo.
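A small sketch of that behaviour, assuming a two-paragraph sample and two hypothetical callbacks named return_foo and return_match. Note that preg_replace_callback returns the new string and leaves the variable passed in untouched:

```php
<?php
// Hypothetical two-paragraph sample.
$data = "<p>one</p>\n<p>two</p>";

// Hypothetical callback: every match is replaced with a fixed string.
function return_foo($matches)
{
    return 'foo';
}

// Hypothetical callback: every match is put back unchanged.
function return_match($matches)
{
    return $matches[0];
}

$replaced  = preg_replace_callback('/<p>.*<\/p>/U', 'return_foo', $data);
$unchanged = preg_replace_callback('/<p>.*<\/p>/U', 'return_match', $data);

echo $replaced, "\n";            // prints "foo", a newline, then "foo"
var_dump($unchanged === $data);  // bool(true)
```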
I use preg_replace_callback because it's the only preg function that lets you do a "callback."
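As an aside, the same grouping can probably be done without a callback. This is a different technique from the one posted above: a sketch using preg_match_all with the PREG_SET_ORDER flag and a plain loop. The shortened sample data and the $pattern, $by_year and $year names are assumptions made for illustration:

```php
<?php
// Hypothetical shortened sample in the same shape as the data in this thread.
$data = <<<END
<h3 align="center">2002</h3>
<p>First 2002 publication</p>
<p>Publication with <b>HTML tags</b> in it</p>
<h3 align="center">2001</h3>
<p>First 2001 publication</p>
END;

// One pattern with two alternatives: group 1 is a heading's text,
// group 2 is a paragraph's text.
$pattern = '/<h3[^>]*>(.*?)<\/h3>|<p>(.*?)<\/p>/is';
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

$by_year = array(); // year => list of publications
$year = null;
foreach ($matches as $m) {
    if (isset($m[2])) {
        // Paragraph: file it under the most recent heading, if any.
        if ($year !== null) {
            $by_year[$year][] = $m[2];
        }
    } else {
        // Heading: group 2 did not take part, so start a new year.
        $year = $m[1];
        $by_year[$year] = array();
    }
}

print_r($by_year);
```

Keying the result by year avoids the global arrays, at the cost of assuming each year heading appears only once.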