Match all text but a specific word?

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

cardician
Forum Newbie
Posts: 3
Joined: Wed Aug 13, 2008 12:09 pm

Match all text but a specific word?

Post by cardician »

Hello all,

First, let me say that the programming language doesn't matter. I'm just looking for a basic regex and I'll work on stuffing it into a language at a later point.

What I'm trying to do is roughly what the subject says, but a little more than that. I'm trying to run a regex on a single file that contains the text of thousands of files. These files all follow a basic structure, though there can be differences. Here is a small example of what I mean.

1234567
TITLE: News Headline
SOURCE: USA Today
TEXT: News article text goes here.

04654888
TITLE: News Headline 2
SOURCE: Newsweek
TEXT: News article text goes here.

So in this file you see the basic structure of two different articles. The articles always tend to start with a numeric code that can vary in length, and they always seem to be followed by the TITLE: section. What I want is a regex that can separate these and return them individually. So the first result would be the whole first section:

1234567
TITLE: News Headline
SOURCE: USA Today
TEXT: News article text goes here.

Hopefully that makes sense. If anyone can offer any thoughts I would greatly appreciate it. Thank you.
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Match all text but a specific word?

Post by GeertDD »

Trying a simple solution first. Based on the example text you posted, couldn't you split the text on double newlines? In PHP you can use the explode() function for that.
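
A minimal sketch of that idea, written in Python since the language doesn't matter here (in PHP, explode("\n\n", $text) would be the rough equivalent if records are separated by exactly one empty line). The sample text is copied from the first post; the variable names are only illustrative, and the sketch assumes blank lines never occur inside a record.

```python
import re

# Sample data copied from the first post; the real file may look different.
text = """1234567
TITLE: News Headline
SOURCE: USA Today
TEXT: News article text goes here.

04654888
TITLE: News Headline 2
SOURCE: Newsweek
TEXT: News article text goes here.
"""

# Split wherever two newlines are separated only by optional whitespace,
# i.e. on one or more blank lines, then drop empty leftovers.
records = [chunk.strip() for chunk in re.split(r"\n\s*\n", text) if chunk.strip()]

for record in records:
    print(record)
    print("---")
```

If the records are not reliably separated by blank lines, this breaks down and a pattern anchored on the record header is the safer route.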
cardician
Forum Newbie
Posts: 3
Joined: Wed Aug 13, 2008 12:09 pm

Re: Match all text but a specific word?

Post by cardician »

Thank you for the response, but that won't quite work. The example I provided is just a small subset of example text that I formatted for ease of reading. The real file I'm trying to process has many more tags in it with a lot more data overall. Plus I can't necessarily rely on it being formatted in a specific way, or at all. The only thing I can count on is the general structure of each file shoved into the larger file. I could be wrong, maybe I can figure out an alternate solution, but ideally a regex would do it for me. Here is the regex I currently have though it doesn't get all the text..

[a-z]{0,2}[0-9]+\s+TITLE:(?!TITLE)

The lookahead negation bit at the end doesn't appear to do anything as all this does is get:

1234567
TITLE:

But I think it's along the right track...or maybe I'm way off.
Last edited by cardician on Wed Aug 13, 2008 1:21 pm, edited 1 time in total.
lukewilkins
Forum Commoner
Posts: 55
Joined: Tue Aug 12, 2008 2:42 pm

Re: Match all text but a specific word?

Post by lukewilkins »

It looks to me like you should stop working with a flat data file and move towards storing this in a database if you're going to have thousands of rows. Much more efficient ... and much easier to code.
cardician
Forum Newbie
Posts: 3
Joined: Wed Aug 13, 2008 12:09 pm

Re: Match all text but a specific word?

Post by cardician »

Unfortunately I can't dictate how I receive the data. Agreed there would be much smarter ways to acquire it, but I have to make do with what I get and make it work.
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Match all text but a specific word?

Post by GeertDD »

Well, if you wanted to extract the titles you could use something like

Code:

/^TITLE:\S*+(.+)/m
The final "m" is the multiline modifier, which makes ^ and $ match at the beginning and end of each line.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Match all text but a specific word?

Post by prometheuzz »

GeertDD wrote: Well, if you wanted to extract the titles you could use something like

Code:

/^TITLE:\S*+(.+)/m
The final "m" is the multiline modifier, which makes ^ and $ match at the beginning and end of each line.
Geert, shouldn't that be a lower case 's' in your regex?
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Match all text but a specific word?

Post by prometheuzz »

cardician wrote: Thank you for the response, but that won't quite work. The example I provided is just a small subset of example text that I formatted for ease of reading. The real file I'm trying to process has many more tags in it with a lot more data overall. Plus I can't necessarily rely on it being formatted in a specific way, or at all. The only thing I can count on is the general structure of each file shoved into the larger file. I could be wrong, maybe I can figure out an alternate solution, but ideally a regex would do it for me. Here is the regex I currently have though it doesn't get all the text..

[a-z]{0,2}[0-9]+\s+TITLE:(?!TITLE)

The lookahead negation bit at the end doesn't appear to do anything as all this does is get:

1234567
TITLE:

But I think it's along the right track...or maybe I'm way off.
What's "[a-z]{0,2}" doing in your regex? The part ":(?!TITLE)" in your regex will only match a colon that is not immediately followed by the string "TITLE", but it won't match any characters after that colon. Look-arounds are always what they call "zero width": they check the surrounding text but don't consume any characters.
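
To make the zero-width point concrete, here is a small sketch in Python (chosen only because the language doesn't matter in this thread). The pattern is the one quoted above, unchanged; the sample string is a shortened version of the example from the first post.

```python
import re

sample = "1234567\nTITLE: News Headline\nSOURCE: USA Today\n"

# The pattern from the quoted post, unchanged.
pattern = re.compile(r"[a-z]{0,2}[0-9]+\s+TITLE:(?!TITLE)")
match = pattern.search(sample)

# The match ends right after the colon: the lookahead consumed nothing.
print(repr(match.group(0)))

# Dropping the lookahead gives exactly the same matched text here,
# because the lookahead only tests what follows and adds no characters.
without_lookahead = re.search(r"[a-z]{0,2}[0-9]+\s+TITLE:", sample)
print(match.group(0) == without_lookahead.group(0))
```

So to get the rest of the record, the pattern itself has to consume it; the lookahead can only decide where that consumption is allowed to stop.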

What about such a regex:

Code:

$file = '123
TITLE: News Headline 1
SOURCE: USA Today 1
TEXT: 1 News article text goes here.
 
456
TITLE: News Headline 2
SOURCE: USA Today 2
TEXT: 2 News article text goes here.
 
789
TITLE: News Headline 3
SOURCE: USA Today 3
TEXT: 3 News article text goes here.
 
101112
TITLE: News Headline 4
SOURCE: USA Today 4
TEXT: 4 News article text goes here.
';
 
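// Capture groups: 1 = numeric code, 2 = title, 3 = source, 4 = first line of text.
// This assumes Unix line endings and exactly these three tags in this order.
$regex = '/(\d++)\nTITLE:\s++([^\n]+)\nSOURCE:\s++([^\n]+)\nTEXT:\s++([^\n]+)/';

// preg_match_all() returns the number of full matches, so 0 means nothing matched.
if (preg_match_all($regex, $file, $matches)) {
    print_r($matches);
}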
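
If the real file has more tags than these three, or tags in a varying order, a looser variant may fit better. The following is only a sketch in Python under stated assumptions: every record starts with a line holding nothing but a numeric code, the very next line starts with TITLE:, and no article body contains that same two-line combination. The EXTRA: tag in the sample is made up for illustration. Each match is one whole record, which is what the original question asked for.

```python
import re

text = """123
TITLE: News Headline 1
SOURCE: USA Today 1
TEXT: 1 News article text goes here.

456
TITLE: News Headline 2
SOURCE: Newsweek
EXTRA: some other tag
TEXT: 2 News article text goes here.
"""

# A record starts at a line holding only digits, directly followed by a
# TITLE: line. The lazy .*? then consumes text until the lookahead sees the
# start of the next record, or the end of the input (\Z).
record = re.compile(
    r"^\d+[ \t]*\r?\nTITLE:.*?(?=^\d+[ \t]*\r?\nTITLE:|\Z)",
    re.MULTILINE | re.DOTALL,
)

articles = [m.group(0).strip() for m in record.finditer(text)]

for article in articles:
    print(article)
    print("---")
```

If the codes can carry a short letter prefix, as the "[a-z]{0,2}" in the original attempt suggests, both occurrences of \d+ would need that prefix added.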
omniuni
Forum Regular
Posts: 738
Joined: Tue Jul 15, 2008 10:50 pm
Location: Carolina, USA

Re: Match all text but a specific word?

Post by omniuni »

I'm not sure regex is quite the way to go on this one. If I were you, I would use a function like strpos() to find the position of a particular keyword, and use that value with substr(). I would also definitely try using explode() to break the file up, and see what that does. Also, strpos() itself only takes plain strings, but preg_match_all() with the PREG_OFFSET_CAPTURE flag reports match positions, so you could perhaps use a regex to find all words that are in all caps and write those into an array. Find their positions, break up the string of the entire file, and write it into an array with the all-caps words as keys; now you have a very usable data structure that you can do what you want with.
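
The position-based idea can be sketched roughly like this, here in Python rather than PHP. It assumes every tag is an all-caps word at the start of a line followed by a colon, and that no tag repeats within one record; the sample record is taken from the first post.

```python
import re

record = """1234567
TITLE: News Headline
SOURCE: USA Today
TEXT: News article text goes here.
"""

# Locate every all-caps tag at the start of a line, e.g. "TITLE:".
tags = list(re.finditer(r"^([A-Z]+):", record, re.MULTILINE))

fields = {}
for i, tag in enumerate(tags):
    # A field's value runs from the end of its tag to the start of the
    # next tag, or to the end of the record for the last tag.
    end = tags[i + 1].start() if i + 1 < len(tags) else len(record)
    fields[tag.group(1)] = record[tag.end():end].strip()

print(fields)
```

Because each value runs up to the next tag, multi-line field text is kept together as long as none of its lines starts with an all-caps word and a colon.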

I hope some, or any, of that helps!

-OmniUni
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Match all text but a specific word?

Post by GeertDD »

prometheuzz wrote: Geert, shouldn't that be a lower case 's' in your regex?
Yeah, of course. My bad, thanks.
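
For reference, a small sketch of the corrected pattern in action, in Python since the language doesn't matter for this thread. re.MULTILINE plays the role of the trailing "m"; the possessive "*+" is written as a plain "*" here because older Python versions don't support possessive quantifiers, which makes no difference for this input. The sample text is adapted from the first post.

```python
import re

text = (
    "1234567\n"
    "TITLE: News Headline\n"
    "SOURCE: USA Today\n"
    "\n"
    "04654888\n"
    "TITLE: News Headline 2\n"
    "SOURCE: Newsweek\n"
)

# With re.MULTILINE, ^ matches at the start of every line, so each TITLE:
# line is found. \s* skips the whitespace after the colon and (.+) captures
# the rest of the line.
titles = re.findall(r"^TITLE:\s*(.+)", text, re.MULTILINE)

print(titles)
```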