Match all text but a specific word?

Posted: Wed Aug 13, 2008 12:16 pm
by cardician
Hello all,

First, let me say that the programming language doesn't matter. I'm just looking for a basic regex and I'll work on stuffing it into a language at a later point.

What I'm trying to do is basically what the subject says, but a little more than that. I'm trying to run a regex on a single file that contains the text of thousands of files. These files all follow a basic structure, though there can be differences. Here is a small example of what I mean.

1234567
TITLE: News Headline
SOURCE: USA Today
TEXT: News article text goes here.

04654888
TITLE: News Headline 2
SOURCE: Newsweek
TEXT: News article text goes here.

So in this file you see the basic structure of two different articles. The articles always tend to start with a numeric code that can vary in length, and then they always seem to be followed by the TITLE: section. What I want is a regex that can separate these and return them individually. So the first result would be the whole first section:

1234567
TITLE: News Headline
SOURCE: USA Today
TEXT: News article text goes here.

Hopefully that makes sense. If anyone can offer any thoughts I would greatly appreciate it. Thank you.

Re: Match all text but a specific word?

Posted: Wed Aug 13, 2008 1:04 pm
by GeertDD
Trying a simple solution first. Based on the example text you posted, couldn't you split the text on double newlines? In PHP you can use the explode() function for that.
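
As a rough illustration of the double-newline idea, here is a minimal sketch in Python (chosen only because the original poster said the language doesn't matter; PHP's explode() would play the same role). The sample text is the one from the first post, and the sketch assumes records really are separated by blank lines, which may not hold for the real file.

```python
import re

# Sample data copied from the first post; the real file may be laid out differently.
text = """1234567
TITLE: News Headline
SOURCE: USA Today
TEXT: News article text goes here.

04654888
TITLE: News Headline 2
SOURCE: Newsweek
TEXT: News article text goes here.
"""

# Split wherever a blank (or whitespace-only) line separates two blocks,
# and drop any empty chunks left over at the edges.
records = [chunk.strip() for chunk in re.split(r"\n\s*\n", text) if chunk.strip()]

for record in records:
    print(record)
    print("-----")
```

Using a regex split rather than a literal "\n\n" tolerates separator lines that contain stray spaces.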

Re: Match all text but a specific word?

Posted: Wed Aug 13, 2008 1:14 pm
by cardician
Thank you for the response, but that won't quite work. The example I provided is just a small subset of example text that I formatted for ease of reading. The real file I'm trying to process has many more tags in it with a lot more data overall. Plus I can't necessarily rely on it being formatted in a specific way, or at all. The only thing I can count on is the general structure of each file shoved into the larger file. I could be wrong, maybe I can figure out an alternate solution, but ideally a regex would do it for me. Here is the regex I currently have, though it doesn't get all the text:

[a-z]{0,2}[0-9]+\s+TITLE:(?!TITLE)

The lookahead negation bit at the end doesn't appear to do anything as all this does is get:

1234567
TITLE:

But I think it's along the right track...or maybe I'm way off.

Re: Match all text but a specific word?

Posted: Wed Aug 13, 2008 1:18 pm
by lukewilkins
It looks to me like you should stop working with a flat data file and move towards storing this in a database if you're going to have thousands of rows. Much more efficient ... and much easier to code.

Re: Match all text but a specific word?

Posted: Wed Aug 13, 2008 1:20 pm
by cardician
Unfortunately I can't dictate how I receive the data. Agreed there would be much smarter ways to acquire it, but I have to make do with what I get and make it work.

Re: Match all text but a specific word?

Posted: Wed Aug 13, 2008 1:39 pm
by GeertDD
Well, if you wanted to extract the titles you could use something like

Code:

/^TITLE:\S*+(.+)/m
The final "m" is the multiline modifier, which makes ^ and $ match at the beginning and end of each line.

Re: Match all text but a specific word?

Posted: Wed Aug 13, 2008 2:35 pm
by prometheuzz
GeertDD wrote: Well, if you wanted to extract the titles you could use something like

Code:

/^TITLE:\S*+(.+)/m
The final "m" is the multiline modifier, which makes ^ and $ match at the beginning and end of each line.
Geert, shouldn't that be a lower case 's' in your regex?
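
To make the difference between the two spellings concrete, here is a small sketch in Python. It uses plain `*` instead of the possessive `*+` from the post, since possessive quantifiers are only available in Python 3.11 and later; that substitution does not change the result on this sample. The sample string is an assumption based on the first post.

```python
import re

sample = "1234567\nTITLE: News Headline\nSOURCE: USA Today\n"

# Upper-case \S matches NON-whitespace, so it consumes nothing before the
# space after the colon and the captured title keeps that leading space.
upper = re.findall(r"^TITLE:\S*(.+)", sample, flags=re.MULTILINE)

# Lower-case \s matches whitespace, so the space after the colon is skipped.
lower = re.findall(r"^TITLE:\s*(.+)", sample, flags=re.MULTILINE)

print(repr(upper))  # [' News Headline']
print(repr(lower))  # ['News Headline']
```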

Re: Match all text but a specific word?

Posted: Wed Aug 13, 2008 2:39 pm
by prometheuzz
cardician wrote: Thank you for the response, but that won't quite work. The example I provided is just a small subset of example text that I formatted for ease of reading. The real file I'm trying to process has many more tags in it with a lot more data overall. Plus I can't necessarily rely on it being formatted in a specific way, or at all. The only thing I can count on is the general structure of each file shoved into the larger file. I could be wrong, maybe I can figure out an alternate solution, but ideally a regex would do it for me. Here is the regex I currently have, though it doesn't get all the text:

[a-z]{0,2}[0-9]+\s+TITLE:(?!TITLE)

The lookahead negation bit at the end doesn't appear to do anything as all this does is get:

1234567
TITLE:

But I think it's along the right track...or maybe I'm way off.
What's "[a-z]{0,2}" doing in your regex? The part ":(?!TITLE)" in your regex will only match a colon that is not immediately followed by the string "TITLE", but it won't match any characters after that colon. Look-arounds are always what they call "zero width": they check the surrounding text but don't consume any characters.
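
The zero-width behaviour is easy to see by running the posted pattern against a sample. This is only a sketch in Python, with a sample string assumed from the first post; the second pattern is a hypothetical variation that actually consumes the rest of the line.

```python
import re

sample = "1234567\nTITLE: News Headline\n"

# The pattern from the post: the lookahead only checks what follows the
# colon, so the match ends right after "TITLE:".
with_lookahead = re.search(r"[a-z]{0,2}[0-9]+\s+TITLE:(?!TITLE)", sample).group(0)

# To get more text, something after the colon has to consume characters.
# Here ".*" takes the rest of the title line (it stops at the newline).
with_rest = re.search(r"[0-9]+\s+TITLE:.*", sample).group(0)

print(repr(with_lookahead))  # '1234567\nTITLE:'
print(repr(with_rest))       # '1234567\nTITLE: News Headline'
```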

What about such a regex:

Code:

$file = '123
TITLE: News Headline 1
SOURCE: USA Today 1
TEXT: 1 News article text goes here.
 
456
TITLE: News Headline 2
SOURCE: USA Today 2
TEXT: 2 News article text goes here.
 
789
TITLE: News Headline 3
SOURCE: USA Today 3
TEXT: 3 News article text goes here.
 
101112
TITLE: News Headline 4
SOURCE: USA Today 4
TEXT: 4 News article text goes here.
';
 
$regex = '/(\d++)\nTITLE:\s++([^\n]+)\nSOURCE:\s++([^\n]+)\nTEXT:\s++([^\n]+)/';
 
if(preg_match_all($regex, $file, $matches)) {
    print_r($matches);
}
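
Since the real file reportedly has many more tags than TITLE, SOURCE and TEXT, a variation that does not list every tag may be useful. The following Python sketch is one possible approach, not a tested solution for the real data: it assumes every record starts on its own line with digits followed by "TITLE:", and it uses a lookahead to stop each match just before the next such header. The name `record_re` is made up for this example.

```python
import re

# Sample data from the first post; the real file has more tags per record.
text = (
    "1234567\nTITLE: News Headline\nSOURCE: USA Today\n"
    "TEXT: News article text goes here.\n"
    "\n"
    "04654888\nTITLE: News Headline 2\nSOURCE: Newsweek\n"
    "TEXT: News article text goes here.\n"
)

record_re = re.compile(
    r"^\d+\s+TITLE:"             # record header: numeric code, then TITLE:
    r".*?"                       # lazily consume the body (DOTALL lets . cross lines)
    r"(?=^\d+\s+TITLE:|\Z)",     # stop before the next header or at end of input
    re.MULTILINE | re.DOTALL,
)

records = [m.group(0).strip() for m in record_re.finditer(text)]

for record in records:
    print(record)
    print("-----")
```

Because the lookahead is zero width, the next header is not consumed and is available as the start of the following match.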

Re: Match all text but a specific word?

Posted: Wed Aug 13, 2008 5:49 pm
by omniuni
I'm not sure regex is quite the way to go on this one. If I were you, I would use a function like strpos() to find the position of the particular key word, and use this value to create a substr(). I would also definitely try using an explode() to break the file up, and see what that does. Also, strpos() itself doesn't take a regex, but preg_match_all() with the PREG_OFFSET_CAPTURE flag reports match positions, so you could perhaps use a regex to find all words that are in all caps, and write those into an array. Find their positions, break up the string of the entire file, write it into an array with the all-caps words being used for keys, and now you have a very usable data structure that you can do what you want with.

I hope some, or any, of that helps!

-OmniUni
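
The position-based idea in the post above could look roughly like this in Python. It is a sketch under assumptions: a single record has already been isolated, tags are all-caps words at the start of a line followed by a colon, and whatever precedes the first tag is the numeric code. The names `fields` and `code` are made up for illustration.

```python
import re

record = (
    "1234567\n"
    "TITLE: News Headline\n"
    "SOURCE: USA Today\n"
    "TEXT: News article text goes here."
)

# Locate every all-caps tag at the start of a line, e.g. "TITLE:".
tags = list(re.finditer(r"^([A-Z]+):", record, flags=re.MULTILINE))

fields = {}
for i, tag in enumerate(tags):
    # A field's value runs from the end of its tag to the start of the next tag.
    end = tags[i + 1].start() if i + 1 < len(tags) else len(record)
    fields[tag.group(1)] = record[tag.end():end].strip()

# Whatever precedes the first tag is treated as the numeric code.
code = record[:tags[0].start()].strip() if tags else record.strip()

print(code)
print(fields)
```

Slicing between tag positions means a value may span several lines, which a line-by-line split would not handle.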

Re: Match all text but a specific word?

Posted: Thu Aug 14, 2008 9:22 am
by GeertDD
prometheuzz wrote:Geert, shouldn't that be a lower case 's' in your regex?
Yeah, of course. My bad, thanks.