Regular expressions problem

RandomEngy
Forum Contributor
Posts: 173
Joined: Wed Jun 26, 2002 3:24 pm

Post by RandomEngy »

Hey, been messing around with regular expressions with some success; they're pretty cool.

What I've been trying to do is get a bunch of publication entries into a database from an HTML file. The HTML file looks something like this:

Code: Select all

<h3 align="center">2002</h3>

<p>Here's a sample publication</p>

<p>Publication number 1</p>

<p>Publication with <b>HTML tags</b> in it</p>

<h3 align="center">2001</h3>

...
I wrote a script that successfully returns the contents of the paragraphs without the <p> and </p> tags:

Code: Select all

preg_match_all("/<p>(.*)<\/p>/", $message, $results);
However, I'd like to find the year of the publication as well, so I thought finding all the paragraphs that are between ">2002<" and ">2001<" would match all publications from 2002. All my tries at expressions have returned an empty result array.

This is my latest attempt:

Code: Select all

preg_match_all("/>2002<.*(<p>(.*)<\/p>.*)+>2001</", $message, $results);
Anyone handy with regular expressions know how to do this?
twigletmac
Her Royal Site Adminness
Posts: 5371
Joined: Tue Apr 23, 2002 2:21 am
Location: Essex, UK

Post by twigletmac »

This has been quite useful for me in the past:
http://www.evolt.org/article/Regular_Ex ... index.html

Mac
RandomEngy
Forum Contributor
Posts: 173
Joined: Wed Jun 26, 2002 3:24 pm

Post by RandomEngy »

Yeah, I've read that one and a number of other articles, but it doesn't help me with the problem I have now.

I made a little step forward though. By default the dot doesn't match line breaks, so you have to add the s modifier (the trailing /s) for it to work:

Code: Select all

preg_match_all("/>2002<.*(<p>(.*)<\/p>.*)+.*>2001</s",$message,$results);
That will return the last item in the section I want to take from, but none of the others. :/
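A small sketch of why only the last item comes back, assuming a cut-down input in the same shape as the HTML above (the `$message` contents here are made up for illustration): a capturing group that sits under a quantifier is overwritten on every repetition, so only one paragraph can ever end up in it. With the greedy `.*` in front, that turns out to be the last one.

```php
<?php
// Hypothetical sample input, modelled on the HTML shown earlier in the thread.
$message = <<<HTML
<h3 align="center">2002</h3>
<p>First</p>
<p>Second</p>
<p>Third</p>
<h3 align="center">2001</h3>
HTML;

// Same pattern as above. Group 2 sits inside the repeated group, so it can
// hold only one paragraph no matter how many paragraphs the section contains.
preg_match("/>2002<.*(<p>(.*)<\/p>.*)+.*>2001</s", $message, $m);

echo $m[2], "\n"; // Third
```

The first greedy `.*` swallows as much as it can before the group gets a chance to match, which is why the surviving paragraph is the last one in the section.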
RandomEngy
Forum Contributor
Posts: 173
Joined: Wed Jun 26, 2002 3:24 pm

Post by RandomEngy »

Well, I was able to make a workaround with 2 calls to preg_match_all, but it's not very pretty. I get what's between the >2002< and >2001<, then I apply my working "/<p>(.*?)<\/p>/s" regular expression to it. It still would be cool to see it done in one expression. ;)
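The two-step workaround might look roughly like the sketch below. The post doesn't show the actual code, so the first pattern, the variable names, and the sample `$message` are assumptions; only the paragraph expression is the one quoted above.

```php
<?php
// Hypothetical sample input, modelled on the HTML shown earlier in the thread.
$message = <<<HTML
<h3 align="center">2002</h3>
<p>Here's a sample publication</p>
<p>Publication with <b>HTML tags</b> in it</p>
<h3 align="center">2001</h3>
<p>An older publication</p>
HTML;

$publications = array();

// Step 1: capture everything between the two year markers (lazy, with /s
// so the dot also matches line breaks).
if (preg_match('/>2002<(.*?)>2001</s', $message, $section)) {
    // Step 2: pull each paragraph out of that section only.
    preg_match_all('/<p>(.*?)<\/p>/s', $section[1], $paragraphs);
    $publications = $paragraphs[1];
}

print_r($publications);
```

Only the two 2002 paragraphs end up in `$publications`; the 2001 paragraph lies outside the captured section.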
gnu2php
Forum Contributor
Posts: 122
Joined: Thu Jul 11, 2002 2:53 am

Post by gnu2php »

You could try something like this. If you don't understand parts of it, let me know.

Suppose you have some data:

Code: Select all

$data = <<<END
<h3 align="center">2002</h3>
<p>1Here's a sample publication</p>
<p>1Publication number 1</p>
<p>1Publication with <b>HTML tags</b> in it</p>

<h3 align="center">2001</h3>
<p>2Here's a sample publication</p>
<p>2Publication number 2</p>
<p>2Publication with <b>HTML tags</b> in it</p>

<h3 align="center">2000</h3>
<p>3Here's a sample publication</p>
<p>3Publication number 3</p>
<p>3Publication with <b>HTML tags</b> in it</p>
END;
You can grab each of the headings and paragraphs like this:

Code: Select all

$array = array(); // All the items found
$curr_array = array(); // Used in callback function


// Use this regular expression
preg_replace_callback('/(<h3[^>]+>(.*)<\/h3>|<p>(.*)<\/p>)/iUs', 'callback_func', $data);


// And this callback function
function callback_func($matches)
{
	// A heading is found:
	if (preg_match('/<h3[^>]+>(.*)<\/h3>/iUs', $matches[0]))
	{
		if (!empty($GLOBALS['curr_array']))
		{
			array_push($GLOBALS['array'], $GLOBALS['curr_array']);
		}

		$GLOBALS['curr_array'] = array($matches[2]);
	}
	// A paragraph is found:
	else array_push($GLOBALS['curr_array'], $matches[3]);

	return $matches[0]; // So we don't alter $data
}


// Then you'll need to add one more to the end
if (!empty($curr_array)) array_push($array, $curr_array);


// And print the items found
print_r($array);
RandomEngy
Forum Contributor
Posts: 173
Joined: Wed Jun 26, 2002 3:24 pm

Post by RandomEngy »

That's really cool; thanks for posting that. But why do you need to return $matches[0] for it not to lose data?
gnu2php
Forum Contributor
Posts: 122
Joined: Thu Jul 11, 2002 2:53 am

Post by gnu2php »

It's because the return value is what the match is replaced with. If you returned "foo", it would replace all your headings and paragraphs in $data with foo.
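A quick way to see this for yourself, as a minimal sketch: the input string is made up, and the anonymous functions assume PHP 5.3 or later (on older versions a named function passed as a string behaves the same way).

```php
<?php
// Hypothetical one-line input for illustration.
$data = '<h3>2002</h3><p>One</p><p>Two</p>';

// Returning a fixed string replaces every match with that string.
$replaced = preg_replace_callback('/<p>.*?<\/p>/s', function ($m) {
    return 'foo';
}, $data);

// Returning the whole match puts the original text back unchanged.
$unchanged = preg_replace_callback('/<p>.*?<\/p>/s', function ($m) {
    return $m[0];
}, $data);

echo $replaced, "\n";  // <h3>2002</h3>foofoo
echo $unchanged, "\n"; // <h3>2002</h3><p>One</p><p>Two</p>
```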

I use preg_replace_callback because it's the only preg function that lets you do a "callback."
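If the callback is only there to collect matches, a loop over `preg_match_all` results may do the same job without `$GLOBALS`. This is a sketch under assumptions, not a drop-in replacement: the sample `$data`, the `$byYear` layout, and the slightly different pattern (lazy quantifiers instead of the U modifier) are all choices made for this example.

```php
<?php
// Hypothetical input in the same shape as $data above.
$data = <<<HTML
<h3 align="center">2002</h3>
<p>Publication A</p>
<p>Publication with <b>HTML tags</b> in it</p>

<h3 align="center">2001</h3>
<p>Publication B</p>
HTML;

// Group 1 is filled for a heading, group 2 for a paragraph.
$pattern = '/<h3[^>]*>(.*?)<\/h3>|<p>(.*?)<\/p>/is';

// PREG_SET_ORDER yields one array per match, in document order.
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

$byYear = array();
$year = null; // text of the most recent heading seen

foreach ($matches as $m) {
    if ($m[1] !== '') {
        // Heading: start a new list keyed by the heading text.
        $year = $m[1];
        $byYear[$year] = array();
    } elseif ($year !== null) {
        // Paragraph: file it under the current heading.
        $byYear[$year][] = $m[2];
    }
}

print_r($byYear);
```

Because the matches arrive in document order, each paragraph is filed under whichever heading came before it, which is the grouping the callback version builds with `$curr_array`. An empty heading would be mistaken for a paragraph here, so this only suits markup like the sample.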