
Regex for EVERYTHING before or after an item

Posted: Thu May 21, 2009 9:11 pm
by GeoBear
I'm modifying pages from the CIA's World Factbook for use on my website. I'd like to make several copies of each file, then reduce each copy to a single topic.

For example, imagine a page for France that looks something like this:

<h2>Introduction</h2>
(Some text)

<h2>Geography</h2>
(Some text)

<h2>History</h2>
(Some text)

Let's say I'm working on the Geography section. So I want to select a folder filled with files and do two search and replace operations with Dreamweaver.

First, I need to replace EVERYTHING on each file up to but not including <h2>Geography, leaving me with this...

<h2>Geography</h2>
(Some text)

<h2>History</h2>
(Some text)

Next, I would simply replace <h2>History and EVERYTHING following it on each file.

So can someone tell me how to make a regex that will delete everything on an entire file preceding <h2>Geography and a second regex that will delete everything after and including <h2>History?

I worked through some regex tutorials, but my experiments haven't been going too well. The following identifies <h2>Geography as the beginning of my search and replace target:

^<h2>Geography

However, it doesn't seem to match unless it comes first in the text. If the text contains...

<h2>Introduction</h2>
<h2>Geography</h2>

...then there's no match. Also, when I try to use $ to indicate an ending anchor point, that doesn't work, either.
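
A likely explanation, sketched here in PHP since that is the language used later in the thread: without the multiline flag, ^ and $ anchor only to the start and end of the whole text, not to each line. The sample string is made up for illustration, and Dreamweaver's own regex engine may behave differently from PHP's PCRE functions.

```php
<?php
// Hypothetical sample text standing in for one Factbook page.
$html = "<h2>Introduction</h2>\n<h2>Geography</h2>\n";

// Without the m flag, ^ matches only at the very start of the string,
// so a heading on the second line is not found.
$plain = preg_match('/^<h2>Geography/', $html);

// With the m (multiline) flag, ^ also matches right after each newline.
$multiline = preg_match('/^<h2>Geography/m', $html);

var_dump($plain, $multiline);
```

If the heading does not have to sit at the start of a line, dropping the ^ altogether is the simpler fix.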

There is a third regex that would be very handy...

Some files don't include certain topics, which means I'll have to perform multiple search and replace operations, targeting one remaining topic at a time. I'd like to know how to do that - delete everything from <h1 Whatever to the end of the file.

However, it would also be useful to select a beginning anchor point associated with the topic I want to keep. So I came up with a plan:

First I replace every instance of <h2> with STOP<h2>. Then, if I'm focusing on the Geography section, I'd simply replace everything following the first instance of STOP that occurs after <h2>Geography...

* * * * *

<h2>Introduction</h2>
(Some text)
STOP

<h2>Geography</h2>
(Some text)
STOP

(Delete everything from here to the end of the file.)

<h2>History</h2>
(Some text)
STOP

* * * * *

I tried the following regex, with no luck:

^<h2>Geography STOP$

Sorry for the long post. In summary, I'm looking for three regexes:

1. One where the starting anchor is the beginning of the file and the end is <h2>Geography.

2. A second regex where the starting anchor is <h2>History and the ending anchor is the end of the file.

3. A third regex where the starting anchor is the first instance of STOP following <h2>Geography and the ending anchor is the end of the file.

Even if you only know how to do one of these, it would be a big help.
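
For reference, here is one way the three patterns could look with PHP's preg_replace(). This is a sketch under assumptions, not a tested Dreamweaver recipe: the sample page is invented, the headings are assumed to appear literally as <h2>Geography and <h2>History, and the s flag (so that . also matches newlines) is PCRE syntax that Dreamweaver may not share.

```php
<?php
// Hypothetical sample page; real Factbook markup will differ.
$page = "<h2>Introduction</h2>\n(Some text)\n\n"
      . "<h2>Geography</h2>\n(Some text)\n\n"
      . "<h2>History</h2>\n(Some text)\n";

// 1. Delete from the start of the file up to, but not including, <h2>Geography.
//    \A is the start of the text, .*? is a lazy match, (?=...) is a lookahead.
$step1 = preg_replace('/\A.*?(?=<h2>Geography)/s', '', $page);

// 2. Delete <h2>History and everything after it. \z is the very end of the text.
$step2 = preg_replace('/<h2>History.*\z/s', '', $step1);

// 3. Mark every heading with STOP, then delete from the first STOP that
//    follows <h2>Geography to the end of the text, keeping the captured part.
$marked = str_replace('<h2>', 'STOP<h2>', $page);
$step3  = preg_replace('/(<h2>Geography.*?)STOP.*\z/s', '$1', $marked);

echo $step2;
```

When a pattern finds no match (for example, a file with no Geography section), preg_replace() returns the text unchanged. The STOP markers in pattern 3 are not strictly needed: matching up to the next <h2> instead of STOP would have the same effect.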

Thanks.

Re: Regex for EVERYTHING before or after an item

Posted: Thu May 21, 2009 9:47 pm
by Christopher
Why not first explode the file on '<h2>' to separate the sections, and then explode each section on '</h2>' to separate the title and content?
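
A minimal sketch of that idea, assuming every heading is written exactly as <h2>Title</h2> with no attributes. The sample page and the $sections variable are made up for illustration.

```php
<?php
// Hypothetical sample page.
$page = "<h2>Introduction</h2>\n(Intro text)\n\n<h2>Geography</h2>\n(Geo text)\n";

$sections = array();
foreach (explode('<h2>', $page) as $chunk) {
    // Chunks without a closing </h2> (such as anything before the
    // first heading) are not sections, so skip them.
    if (strpos($chunk, '</h2>') === false) {
        continue;
    }
    // A limit of 2 means only the first </h2> splits title from content.
    list($title, $content) = explode('</h2>', $chunk, 2);
    $sections[trim($title)] = trim($content);
}

print_r($sections);
```

The result is an array keyed by section title, so $sections['Geography'] holds the Geography text.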

Re: Regex for EVERYTHING before or after an item

Posted: Thu May 21, 2009 9:55 pm
by GeoBear
arborint wrote: Why not first explode the file on '<h2>' to separate the sections, and then explode each section on '</h2>' to separate the title and content?
OK, I'm using PHP, but I don't have much experience with the explode function. Are you saying I can actually split a file into several smaller files, each with a portion of the content from the original file? If so, that would be awesome.

Thanks.

Re: Regex for EVERYTHING before or after an item

Posted: Thu May 21, 2009 10:15 pm
by Christopher
Check the manual for file_get_contents() and explode(). (Hint: the online manual has a search.) You might also want to read the sections on arrays and the foreach statement.
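
Put together, those pieces can split one page into one file per section. The following is only a sketch: split_sections() is a hypothetical helper, the file-naming scheme is invented, and it assumes plain <h2>Title</h2> headings and an existing, writable output folder.

```php
<?php
// Hypothetical helper: writes each <h2> section of $sourceFile to its own
// file in $outDir and returns the list of files written.
function split_sections($sourceFile, $outDir)
{
    $written = array();
    $html = file_get_contents($sourceFile);
    if ($html === false) {
        return $written; // unreadable file: nothing to do
    }
    $base = basename($sourceFile, '.html');
    foreach (explode('<h2>', $html) as $chunk) {
        if (strpos($chunk, '</h2>') === false) {
            continue; // text before the first heading
        }
        list($title, $content) = explode('</h2>', $chunk, 2);
        // Turn the heading into a safe file name part, e.g. "geography".
        $slug = strtolower(preg_replace('/[^A-Za-z0-9]+/', '-', trim($title)));
        $target = $outDir . '/' . $base . '-' . trim($slug, '-') . '.html';
        // Put the heading back in front of its content.
        file_put_contents($target, '<h2>' . $title . '</h2>' . $content);
        $written[] = $target;
    }
    return $written;
}
```

Calling split_sections('france.html', 'out') on a page with Introduction and Geography headings would write out/france-introduction.html and out/france-geography.html. Wrapping the call in a foreach over the folder's files would process them all.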

Re: Regex for EVERYTHING before or after an item

Posted: Thu May 21, 2009 10:26 pm
by GeoBear
arborint wrote: Check the manual for file_get_contents() and explode(). (Hint: the online manual has a search.) You might also want to read the sections on arrays and the foreach statement.
Thanks.

Re: Regex for EVERYTHING before or after an item

Posted: Fri May 22, 2009 7:45 am
by prometheuzz
arborint wrote: Why not first explode the file on '<h2>' to separate the sections, and then explode each section on '</h2>' to separate the title and content?
Indeed. Many people think they need one complicated regex when a solution that is easier to understand (and maintain!) is far better, and quite often that solution uses no regex at all.

@OP:
Or perhaps better: use an html parser. Parsing (x)html files with regex is not advisable, to put it mildly.
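
As a rough illustration of the parser route, here is a sketch using PHP's bundled DOMDocument class (it needs the DOM extension, which is usually enabled). The sample markup is invented, and it assumes each section's content sits in sibling elements directly after its <h2>.

```php
<?php
// Hypothetical sample page.
$page = '<html><body><h2>Introduction</h2><p>Intro text</p>'
      . '<h2>Geography</h2><p>Geo text</p><p>More geo</p>'
      . '<h2>History</h2><p>History text</p></body></html>';

$doc = new DOMDocument();
// Real-world pages are rarely valid, so collect parse warnings quietly.
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();

$geography = '';
foreach ($doc->getElementsByTagName('h2') as $heading) {
    if (trim($heading->textContent) !== 'Geography') {
        continue;
    }
    // Keep the heading itself, then every following sibling
    // until the next <h2> (or the end of the parent element).
    $geography .= $doc->saveHTML($heading);
    for ($node = $heading->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeName === 'h2') {
            break;
        }
        $geography .= $doc->saveHTML($node);
    }
    break;
}

echo $geography;
```

Unlike the string-splitting approach, this keeps working when the headings carry attributes or the markup is not perfectly regular.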