Regular expression needed. I'm working in PHP.
I'm trying to parse text that consists of an all-caps headline followed by one or more regular-case words. Each match should grab an all-caps headline plus the text after it, up to (but not including) the next all-caps headline, and so forth. A headline must contain at least one word of two or more capital letters, so the single capital letter that starts a sentence should not be treated as a headline. The regex should still be smart enough to handle a headline that starts with or contains a single capital letter: THIS IS A HEADLINE should match. Both the headline and the text that follows may contain whitespace characters.
Example:
THIS IS AN ALL CAPS HEADLINE followed by
some text like this. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. THEN ANOTHER ALL CAPS HEADLINE GOES HERE followed by even more text.
I want two matches:
(1) THIS IS AN ALL CAPS HEADLINE followed by
some text like this. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here.
(2) THEN ANOTHER ALL CAPS HEADLINE GOES HERE followed by even more text.
Thanks for any help!
Tim
Need help with regex for all caps followed by lower case
Re: Need help with regex for all caps followed by lower case
I think what you want is to use preg_match_all(), and a pattern like this:
Code: Select all
/([A-Z ]+)([a-z ]+)/
That will match (and capture) a run of capital letters and spaces, followed by a run of lowercase letters and spaces. You'll need to tweak the pattern of course, but hopefully this points you in the right direction.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
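For anyone who wants to see the shape of the result, here is a minimal sketch of calling preg_match_all() with a pattern of that kind. It assumes each character class is repeated with +, and the sample text and the $sections array are made up for illustration (no punctuation, no line breaks), so treat it as a starting point rather than a finished solution.

```php
<?php
// Hypothetical sample text: no punctuation, so the simple character classes suffice.
$text = 'FIRST HEADLINE some lower text SECOND ONE more text';

// Assumed variant of the suggested pattern, with + after each character class.
$pattern = '/([A-Z ]+)([a-z ]+)/';

// PREG_SET_ORDER groups the results per match:
// $m[1] is the run of capitals, $m[2] is the run of lowercase letters.
$count = preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$sections = [];
foreach ($matches as $m) {
    // trim() drops the spaces that both character classes also swallow.
    $sections[] = ['headline' => trim($m[1]), 'body' => trim($m[2])];
}

print_r($sections);
```

On real text this simple pattern breaks at punctuation, line breaks, and sentence-initial capitals, which is the tweaking mentioned above.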
- ridgerunner
Re: Need help with regex for all caps followed by lower case
Try this one out for size. It captures the "ALL-CAPS" portion in group 1 and the "Non-caps" portion in group 2. It allows dashes and underscores in words as well as sentence punctuation.
Code: Select all
$regex_short = '/([-_A-Z]+[,.!?]?(?:\s+[-_A-Z]+[,.!?]?)+)((?:[A-Z](?![A-Z]))?[-_a-z]*[,.!?]?(?:\s+(?:[A-Z](?![A-Z]))?[-_a-z]*[,.!?]?)+)/';
$regex_long = '/
    (                  # begin capture group 1 for "ALL CAPS" section
      [-_A-Z]+         # match first all caps word
      [,.!?]?          # followed by optional sentence punctuation
      (?:              # followed by one or more additional all caps words
        \s+            # all additional words preceded by some whitespace
        [-_A-Z]+       # match 2nd, 3rd, 4th... all caps word
        [,.!?]?        # followed by optional sentence punctuation
      )+               # must have one or more additional all caps words
    )                  # end capture group 1 with "ALL CAPS" section
    (                  # begin capture group 2 for "Non caps" section
      (?:              # first word may begin with an optional
        [A-Z]          # single CAP letter but only if it is
        (?![A-Z])      # not followed by another CAP letter
      )?               # first CAP letter of first word is optional
      [-_a-z]*         # match remainder of first non cap word
      [,.!?]?          # followed by optional sentence punctuation
      (?:              # followed by one or more additional non caps words
        \s+            # all additional words preceded by some whitespace
        (?:            # all the additional non caps words can begin
          [A-Z]        # with a single CAP letter but only if it is
          (?![A-Z])    # not followed by another CAP letter
        )?             # first CAP letter of additional words is optional
        [-_a-z]*       # match remainder of additional non cap word
        [,.!?]?        # followed by optional sentence punctuation
      )+               # must have one or more additional non caps words
    )                  # end capture group 2 with "Non caps" section
    /x';
if (preg_match($regex_long, $text, $matches)) {
    $result       = $matches[0];
    $text_caps    = $matches[1];
    $text_noncaps = $matches[2];
} else {
    $result = "";
}
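Since the original question asks for every headline/text pair and preg_match() stops after the first one, here is a rough sketch of the same pattern used with preg_match_all(). The sample text is a shortened copy of the one in the question, and the $regex and $sections names are my own for illustration. The pattern does not cover digits, apostrophes, or other punctuation, so real input may need a wider character set.

```php
<?php
// Shortened version of the sample text from the question.
$text = "THIS IS AN ALL CAPS HEADLINE followed by\n"
      . "some text like this. Just some more regular text here. "
      . "THEN ANOTHER ALL CAPS HEADLINE GOES HERE followed by even more text.";

// One-line form of the commented pattern (group 1 = caps, group 2 = non-caps).
$regex = '/([-_A-Z]+[,.!?]?(?:\s+[-_A-Z]+[,.!?]?)+)'
       . '((?:[A-Z](?![A-Z]))?[-_a-z]*[,.!?]?(?:\s+(?:[A-Z](?![A-Z]))?[-_a-z]*[,.!?]?)+)/';

$sections = [];
// PREG_SET_ORDER yields one array per headline/text pair.
if (preg_match_all($regex, $text, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Group 2 keeps its surrounding whitespace, so trim both parts.
        $sections[] = ['caps' => trim($m[1]), 'noncaps' => trim($m[2])];
    }
}

print_r($sections);
```

With this input the loop should produce two entries, one per headline, each paired with the text that follows it.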
Re: Need help with regex for all caps followed by lower case
The regular expression may be tricky in this case.
I put the following text in file C:/x.txt.
Code: Select all
THIS IS AN ALL CAPS HEADLINE followed by
some text like this. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. THEN ANOTHER ALL CAPS HEADLINE GOES HERE followed by even more text.
I ran the following script.
Code: Select all
# Script CAPSLower.txt
# Read file
var str text ; cat "C:/x.txt" > $text
# Get the first all caps.
var str caps ; stex -r "]^(a>z)^" $text > $caps
while ($caps <> "")
do
echo "DEBUG: CAPS=" $caps
# Now get the all lower case.
var str lower ; stex -r "]^(A>Z)(A>Z )^" $text > $lower
echo "DEBUG: lower=" $lower
# Get the next all caps.
stex -r "]^(a>z)^" $text > $caps
done
I got as output:
Code: Select all
DEBUG: CAPS=THIS IS AN ALL CAPS HEADLINE
DEBUG: lower=followed by
some text like this. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here.
DEBUG: CAPS=THEN ANOTHER ALL CAPS HEADLINE GOES HERE
DEBUG: lower=
Is this what you are looking for? The script is in biterscripting ( http://www.biterscripting.com ). I am using the regular expression (A>Z)(A>Z ). That means a first CAP char, followed by a second char that is either a CAP or a space. When I used just (A>Z), I got:
Code: Select all
DEBUG: CAPS=THIS IS AN ALL CAPS HEADLINE
DEBUG: lower=followed by
some text like this.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=THEN ANOTHER ALL CAPS HEADLINE GOES HERE
DEBUG: lower=
I think the first regular expression is correct.
Patrick
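Since the original poster is working in PHP, the same two-step idea (take everything up to the first lowercase letter, then everything up to the next capital that is followed by a capital or a space) could be approximated with a single PCRE pattern. This is only a sketch based on my reading of the script, not a verified translation of the biterscripting expressions; the pattern and the shortened sample text are assumptions for illustration.

```php
<?php
// Shortened version of the sample text from the question.
$text = "THIS IS AN ALL CAPS HEADLINE followed by\n"
      . "some text like this. Just some more regular text here. "
      . "THEN ANOTHER ALL CAPS HEADLINE GOES HERE followed by even more text.";

// Group 1: everything up to the first lowercase letter (the caps part).
// Group 2: everything up to the next capital that is followed by another
//          capital or a space (the lower part). The s flag lets the dot
//          cross line breaks.
$pattern = '/([^a-z]+)((?:(?![A-Z][A-Z ]).)*)/s';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo 'DEBUG: CAPS=', trim($m[1]), "\n";
    echo 'DEBUG: lower=', trim($m[2]), "\n";
}
```

Unlike the output shown above, this sketch keeps the text after the last headline, because the lower part is allowed to run to the end of the input. A one-letter word such as "I" followed by a space would still be mistaken for the start of a headline, so the lookahead would need tightening for real text.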