Need help with regex for all caps followed by lower case

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."


tbarmann
Forum Newbie
Posts: 3
Joined: Wed Apr 08, 2009 2:29 pm

Need help with regex for all caps followed by lower case

Post by tbarmann »

I need a regular expression, and I'm working in PHP.
I'm trying to parse text made up of all-caps headlines, each followed by one or more words in regular case. Each match should grab an all-caps headline plus the text after it, up to (but not including) the next all-caps headline, and so forth. An all-caps headline must contain at least one word of two or more capital letters, so the single capital letter that starts a sentence should not count as a headline. The regex should still handle a headline that starts with or contains a single-letter capitalized word: THIS IS A HEADLINE should match. The headline and the text that follows may also contain whitespace characters.

Example:

THIS IS AN ALL CAPS HEADLINE followed by
some text like this. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. THEN ANOTHER ALL CAPS HEADLINE GOES HERE followed by even more text.

I want two matches:
(1) THIS IS AN ALL CAPS HEADLINE followed by
some text like this. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here.

(2) THEN ANOTHER ALL CAPS HEADLINE GOES HERE followed by even more text.

Thanks for any help!
Tim
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Re: Need help with regex for all caps followed by lower case

Post by pickle »

I think what you want is to use preg_match_all(), and a pattern like this:

Code: Select all

/([A-Z ]+)([a-z ]+)/
That will match (and capture) a run of capital letters and spaces, followed by a run of lowercase letters and spaces. You'll need to tweak the pattern of course, but hopefully this points you in the right direction.
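
A minimal sketch of how that could look with preg_match_all(), assuming a `+` quantifier on each character class so that whole runs are captured rather than single characters. The sample string and variable names are made up for illustration. Punctuation and sentence-initial capitals are deliberately left out of the sample because this simple pattern does not cope with them.

```php
<?php
// Hypothetical sample text: no punctuation and no sentence-initial capitals,
// because this simple pattern cannot handle either.
$sample = 'FIRST HEADLINE some text here SECOND HEADLINE more text';

// Group 1: a run of capitals and spaces.
// Group 2: a run of lowercase letters and spaces.
$count = preg_match_all('/([A-Z ]+)([a-z ]+)/', $sample, $sets, PREG_SET_ORDER);

foreach ($sets as $set) {
    // trim() drops the space that each run picks up at its edges.
    echo trim($set[1]), ' => ', trim($set[2]), PHP_EOL;
}
```

PREG_SET_ORDER groups the results per match, so each `$set` holds one headline and its text together.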
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Need help with regex for all caps followed by lower case

Post by ridgerunner »

Try this one out for size:

Code: Select all

$regex_short = '/([-_A-Z]+[,.!?]?(?:\s+[-_A-Z]+[,.!?]?)+)((?:[A-Z](?![A-Z]))?[-_a-z]*[,.!?]?(?:\s+(?:[A-Z](?![A-Z]))?[-_a-z]*[,.!?]?)+)/';
 
$regex_long = '/
(                # begin capture group 1 for "ALL CAPS" section
  [-_A-Z]+       # match first all caps word
  [,.!?]?        # followed by optional sentence punctuation
  (?:            # followed by one or more additional all caps words
    \s+          # all additional words preceded by some whitespace
    [-_A-Z]+     # match 2nd, 3rd, 4th... all caps word
    [,.!?]?      # followed by optional sentence punctuation
  )+             # must have one or more additional all caps words
)                # end capture group 1 with "ALL CAPS" section
(                # begin capture group 2 for "Non caps" section
  (?:            # first word may begin with an optional
    [A-Z]        # single CAP letter but only if it is
    (?![A-Z])    # not followed by another CAP letter
  )?             # first CAP letter of first word is optional
  [-_a-z]*       # match remainder of first non cap word
  [,.!?]?        # followed by optional sentence punctuation
  (?:            # followed by one or more additional non caps words
    \s+          # all additional words preceded by some whitespace
    (?:          # all the additional non caps words can begin
      [A-Z]      # with a single CAP letter but only if it is
      (?![A-Z])  # not followed by another CAP letter
    )?           # first CAP letter of additional words is optional
    [-_a-z]*     # match remainder of additional non cap word
    [,.!?]?      # followed by optional sentence punctuation
  )+             # must have one or more additional non caps words
)                # end capture group 2 with "Non caps" section
/x';
if (preg_match($regex_long, $text, $matches)) {
    $result = $matches[0];
    $text_caps = $matches[1];
    $text_noncaps = $matches[2];
} else {
    $result = "";
}
 
It captures the "ALL-CAPS" portion in group 1 and the "Non-caps" portion in group 2. It allows dashes and underscores in words as well as sentence punctuation.
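
Since preg_match() stops at the first match, collecting every headline/text pair needs preg_match_all(). Here is a rough sketch of that, using a one-line form of the pattern above and a shortened version of the sample text from the first post. The variable names are just for illustration.

```php
<?php
// One-line form of the pattern: group 1 = all-caps headline, group 2 = following text.
$regex = '/([-_A-Z]+[,.!?]?(?:\s+[-_A-Z]+[,.!?]?)+)'
       . '((?:[A-Z](?![A-Z]))?[-_a-z]*[,.!?]?(?:\s+(?:[A-Z](?![A-Z]))?[-_a-z]*[,.!?]?)+)/';

// Shortened version of the sample text from the original question.
$text = "THIS IS AN ALL CAPS HEADLINE followed by\n"
      . "some text like this. Just some more regular text here. "
      . "THEN ANOTHER ALL CAPS HEADLINE GOES HERE followed by even more text.";

// PREG_SET_ORDER returns one array per match: [0] whole match, [1] caps, [2] non-caps.
$found = preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $i => $m) {
    // Group 2 keeps the whitespace around it, so trim it for display.
    echo '(', $i + 1, ') ', $m[1], ' | ', trim($m[2]), PHP_EOL;
}
```

Note that group 2 includes the whitespace that separates it from the headlines on either side, hence the trim().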
PM2008
Forum Newbie
Posts: 7
Joined: Mon Dec 29, 2008 10:47 am

Re: Need help with regex for all caps followed by lower case

Post by PM2008 »

The regular expression may be tricky in this case.

I put the following text in file C:/x.txt.
THIS IS AN ALL CAPS HEADLINE followed by
some text like this. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. THEN ANOTHER ALL CAPS HEADLINE GOES HERE followed by even more text.

I ran the following script.

Code: Select all

# Script CAPSLower.txt
# Read file
var str text ; cat "C:/x.txt" > $text
 
# Get the first all caps.
var str caps ; stex -r "]^(a>z)^" $text > $caps
while ($caps <> "")
do
    echo "DEBUG: CAPS=" $caps
    # Now get the all lower case.
    var str lower ; stex -r "]^(A>Z)(A>Z )^" $text > $lower
    echo "DEBUG: lower=" $lower
 
    # Get the next all caps.
    stex -r "]^(a>z)^" $text > $caps
done
I got as output:
DEBUG: CAPS=THIS IS AN ALL CAPS HEADLINE
DEBUG: lower=followed by
some text like this. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here. Just some more regular text here.
DEBUG: CAPS=THEN ANOTHER ALL CAPS HEADLINE GOES HERE
DEBUG: lower=

Is this what you are looking for? The script is written in biterscripting ( http://www.biterscripting.com ). I am using the regular expression (A>Z)(A>Z ), which means a first capital character followed by a second character that is either a capital or a space. When I used just (A>Z), I got:
DEBUG: CAPS=THIS IS AN ALL CAPS HEADLINE
DEBUG: lower=followed by
some text like this.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=J
DEBUG: lower=ust some more regular text here.
DEBUG: CAPS=THEN ANOTHER ALL CAPS HEADLINE GOES HERE
DEBUG: lower=
I think the first regular expression is correct.
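
For anyone staying in PHP, the same idea can be roughly illustrated with PCRE. This is only an analogue and does not reproduce how biterscripting's stex works. The sample text and variable names are made up. The first pattern accepts a lone capital, while the second requires the capital to be followed by at least one more capital or space.

```php
<?php
// Hypothetical sample with a sentence-initial capital ("Just") in the body text.
$body = 'FIRST HEADLINE some text. Just more text. SECOND HEADLINE more text.';

// A single capital is enough here, so the "J" of "Just" is picked up as well.
preg_match_all('/[A-Z][A-Z ]*/', $body, $loose);

// Here the capital must be followed by another capital or a space, so "J" is skipped.
preg_match_all('/[A-Z][A-Z ]+/', $body, $strict);

$loose_hits  = array_map('trim', $loose[0]);
$strict_hits = array_map('trim', $strict[0]);

echo implode(' | ', $loose_hits), PHP_EOL;
echo implode(' | ', $strict_hits), PHP_EOL;
```

One limitation of the stricter pattern: a one-letter word such as "A" at the start of a sentence is followed by a space, so it would still be picked up.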

Patrick