Regex for EVERYTHING before or after an item
Posted: Thu May 21, 2009 9:11 pm
I'm modifying pages from the CIA's World Factbook for use on my website. I'd like to make several copies of each file, then reduce each copy to a single topic.
For example, imagine a page for France that looks something like this:
<h2>Introduction</h2>
(Some text)
<h2>Geography</h2>
(Some text)
<h2>History</h2>
(Some text)
Let's say I'm working on the Geography section. So I want to select a folder filled with files and do two search and replace operations with Dreamweaver.
First, I need to remove EVERYTHING in each file up to, but not including, <h2>Geography, leaving me with this...
<h2>Geography</h2>
(Some text)
<h2>History</h2>
(Some text)
Next, I would simply remove <h2>History and EVERYTHING following it in each file.
So can someone tell me how to make a regex that will delete everything on an entire file preceding <h2>Geography and a second regex that will delete everything after and including <h2>History?
I worked through some regex tutorials, but my experiments haven't been going too well. The following identifies <h2>Geography as the beginning of my search and replace target:
^<h2>Geography
However, it doesn't seem to match unless it comes first in the text. If the text contains...
<h2>Introduction</h2>
<h2>Geography</h2>
...then there's no match. Also, when I try to use $ to indicate an ending anchor point, that doesn't work, either.
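The behaviour described above is consistent with how ^ works in most regex flavours: without a multi-line option it matches only at the very start of the input, not at the start of each line. Here is a minimal Python sketch of that difference. It uses Python's re module rather than Dreamweaver's own engine, so the flag name is Python-specific, and the sample string is made up for illustration.

```python
import re

# Hypothetical two-line sample: Geography is NOT the first thing in the text.
page = "<h2>Introduction</h2>\n<h2>Geography</h2>\n"

# Without flags, ^ only matches at the very start of the whole string.
no_flag = re.search(r"^<h2>Geography", page)

# With re.MULTILINE, ^ also matches immediately after each newline.
with_flag = re.search(r"^<h2>Geography", page, flags=re.MULTILINE)

# Dropping the anchor entirely finds the heading wherever it sits.
unanchored = re.search(r"<h2>Geography", page)

print(no_flag)             # None
print(with_flag.start())   # 22
print(unanchored.start())  # 22
```

If the goal is only to locate the heading, leaving the ^ off is probably the simplest fix. Whether Dreamweaver exposes a multi-line option is something I have not confirmed.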
There is a third regex that would be very handy...
Some files don't include certain topics, which means I'll have to perform multiple search and replace operations, targeting one remaining topic at a time. I'd like to know how to do that: delete everything from <h2>Whatever to the end of the file.
However, it would also be useful to select a beginning anchor point associated with the topic I want to keep. So I came up with a plan:
First I replace every instance of <h2> with STOP<h2>. Then, if I'm focusing on the Geography section, I'd simply replace everything following the first instance of STOP that occurs after <h2>Geography...
* * * * *
<h2>Introduction</h2>
(Some text)
STOP
<h2>Geography</h2>
(Some text)
STOP
(Delete everything from here to the end of the file.)
<h2>History</h2>
(Some text)
STOP
* * * * *
I tried the following regex, with no luck:
^<h2>Geography STOP$
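One likely reason that pattern fails is that it asks for the literal text "<h2>Geography STOP" with exactly one space in between, all at the very start and end of the input. The STOP plan can be sketched in Python as follows. This assumes Python's re module rather than Dreamweaver, a made-up sample page, and that the STOP marker itself should be deleted along with everything after it.

```python
import re

# Hypothetical sample page with three sections.
page = (
    "<h2>Introduction</h2>\n(Some text)\n"
    "<h2>Geography</h2>\n(Some text)\n"
    "<h2>History</h2>\n(Some text)\n"
)

# Step 1: put a STOP marker in front of every <h2>.
marked = page.replace("<h2>", "STOP<h2>")

# Step 2: capture from <h2>Geography up to the first STOP after it (group 1),
# then match that STOP and everything following, and keep only group 1.
# [\s\S] matches any character including newlines; *? is lazy, so the
# capture ends at the nearest STOP rather than the last one.
trimmed = re.sub(r"(<h2>Geography[\s\S]*?)STOP[\s\S]*", r"\1", marked)

print(trimmed)
```

If Geography happens to be the last section, there is no STOP after it, so the pattern does not match and the file is left unchanged.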
Sorry for the long post. In summary, I'm looking for three regexes:
1. One where the starting anchor is the beginning of the file and the end is <h2>Geography.
2. A second regex where the starting anchor is <h2>History and the ending anchor is the end of the file.
3. A third regex where the starting anchor is the first instance of STOP following <h2>Geography and the ending anchor is the end of the file.
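As a rough illustration of items 1 and 2, here is a Python sketch. It is not a confirmed Dreamweaver recipe: it uses Python's re module and a made-up sample page. The \A anchor is Python's start-of-input marker; in a JavaScript-style engine, which I believe Dreamweaver uses (an assumption), a plain ^ without the multi-line option would play the same role.

```python
import re

# Hypothetical sample page with three sections.
page = (
    "<h2>Introduction</h2>\n(Some text)\n"
    "<h2>Geography</h2>\n(Some text)\n"
    "<h2>History</h2>\n(Some text)\n"
)

# 1. From the start of the input up to, but not including, <h2>Geography.
#    [\s\S]*? lazily matches any characters, newlines included;
#    (?=...) is a lookahead, so the heading itself is not consumed.
before_geography = re.compile(r"\A[\s\S]*?(?=<h2>Geography)")

# 2. From <h2>History through the end of the input.
from_history = re.compile(r"<h2>History[\s\S]*")

# Replace both matches with nothing.
result = from_history.sub("", before_geography.sub("", page))

print(result)
```

If a file has no Geography or no History heading, the corresponding pattern does not match and that replacement leaves the text as it was.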
Even if you only know how to do one of these, it would be a big help.
Thanks.