Matching single words

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

trotsky
Forum Newbie
Posts: 4
Joined: Sat Jul 31, 2010 2:16 pm

Matching single words

Post by trotsky »

I have a sentence (sometimes several sentences separated by line breaks) that I want to split into individual words and put into an array. Here is my code so far, but it doesn't pull whole words out of the example $sentence. I think the problem is in my regex: right now the program only matches single characters.



Code:

$sentence = "how are you 


doing today?";

$singleWordRegex = '#[A-Za-z]+#sU';
preg_match_all($singleWordRegex, $sentence, $singleWordArray);
superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Re: Matching single words

Post by superdezign »

You may be interested in simplifying the process by using preg_split(), or by using explode() in combination with preg_replace().

Code:

$words = explode(' ', trim(preg_replace('~\W+~', ' ', $sentence))); 
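The preg_split() route mentioned above could look something like this. It is only a minimal sketch: splitting on \W+ is an assumption carried over from the explode() example, and the PREG_SPLIT_NO_EMPTY flag is there so the trailing "?" does not leave an empty element at the end of the array.

```php
<?php
$sentence = "how are you \n\n\ndoing today?";

// Split on runs of non-word characters. PREG_SPLIT_NO_EMPTY drops the
// empty string that would otherwise follow the trailing "?".
$words = preg_split('~\W+~', $sentence, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);  // how, are, you, doing, today
```

Note that \W treats digits and underscores as word characters, so this keeps tokens like "2010" that the [A-Za-z]+ pattern would skip.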
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Matching single words

Post by ridgerunner »

Your regex matches only one character at a time because you have specified the 'U' pattern modifier. This (completely useless) modifier inverts the greedy/lazy behavior of quantifiers, i.e. it makes all greedy quantifiers lazy and all lazy quantifiers greedy. There is never, ever a need to use the 'U' modifier: if you need to make a quantifier lazy, just append a ? to it. Bottom line: Never use the U modifier!

See: www.regular-expressions for a discussion of greedy vs lazy quantifiers.

Your regex has only one quantifier (the '+'), and the U modifier makes this normally greedy quantifier behave lazily, so each match stops after a single character. You are also using the 's' ("single-line", dot-matches-all) modifier, which does absolutely nothing here because there is no dot in the expression. For more on PHP PCRE regex modifiers, see: PHP Pattern Modifiers.
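The effect is easy to check side by side. The snippet below is a small sketch using a shortened version of the example sentence; the variable names $lazy, $greedy and $explicitLazy are just for illustration.

```php
<?php
$sentence = "how are you\ndoing today?";

// With U, the + quantifier becomes lazy and stops after one letter.
preg_match_all('#[A-Za-z]+#U', $sentence, $lazy);

// Without U, + is greedy and consumes each whole word.
preg_match_all('#[A-Za-z]+#', $sentence, $greedy);

// An explicitly lazy quantifier (+?) gives the same result as the U version.
preg_match_all('#[A-Za-z]+?#', $sentence, $explicitLazy);

echo implode(' ', $lazy[0]), "\n";    // h o w a r e y o u d o i n g t o d a y
echo implode(' ', $greedy[0]), "\n";  // how are you doing today
var_dump($lazy[0] === $explicitLazy[0]);  // bool(true)
```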

Here is a script which divides text into paragraphs, sentences and words.

Code:

<?php
$text = file_get_contents('test.txt');

// here are regexes to match paragraphs, sentences and words
$re_paragraph = '/\s*+([^\r\n]++)\s*+/';         // Group 1 contains paragraph
$re_sentence = '/\s*+([^.?!\r\n]++[.?!]?)/';     // Group 1 contains sentence
$re_word = '/\b(\w\b|\w[\w\']*\w\b)/';           // Group 1 contains word

$paragraphs = array();
$sentences = array();
$words = array();

$paragraph_count = preg_match_all($re_paragraph, $text, $p_matches);
printf("The text has %d paragraphs:\n", $paragraph_count);
for ($i = 0; $i < $paragraph_count; $i++) {
    $paragraphs[] = $p_matches[1][$i];
    $sentence_count = preg_match_all($re_sentence, $p_matches[1][$i], $s_matches);
    printf("  Paragraph %d has %d sentences:\n", $i, $sentence_count);
    for ($j = 0; $j < $sentence_count; $j++) {
        $sentences[] = $s_matches[1][$j];
        $word_count = preg_match_all($re_word, $s_matches[1][$j], $w_matches);
        printf("    Sentence %d has %d words.\n", $j, $word_count);
        for ($k = 0; $k < $word_count; $k++) {
            $words[] = $w_matches[1][$k];
        }
    }
}
printf("The text contains a total of %d paragraphs, %d sentences and %d words.\n",
    count($paragraphs), count($sentences), count($words));
?>
Given the following test.txt file:

Code:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi.

Nam liber tempor cum soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum. Typi non habent claritatem insitam; est usus legentis in iis qui facit eorum claritatem. Investigationes demonstraverunt lectores legere me lius quod ii legunt saepius.

Claritas est etiam processus dynamicus, qui sequitur mutationem consuetudium lectorum. Mirum est notare quam littera gothica, quam nunc putamus parum claram, anteposuerit litterarum formas humanitatis per seacula quarta decima et quinta decima. Eodem modo typi, qui nunc nobis videntur parum clari, fiant sollemnes in futurum. Clarita's est etiam processus dynamicus, qui sequitur mutationem consuetudium lector'um.
Here is the script output:

Code:

The text has 3 paragraphs:
  Paragraph 0 has 2 sentences:
    Sentence 0 has 21 words.
    Sentence 1 has 43 words.
  Paragraph 1 has 3 sentences:
    Sentence 0 has 19 words.
    Sentence 1 has 14 words.
    Sentence 2 has 10 words.
  Paragraph 2 has 4 sentences:
    Sentence 0 has 10 words.
    Sentence 1 has 22 words.
    Sentence 2 has 13 words.
    Sentence 3 has 10 words.
The text contains a total of 3 paragraphs, 9 sentences and 162 words.
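To see how the word regex from the script treats apostrophes and one-letter words, it can be run on a short string. This is only an illustrative sketch; the sample string is made up for the example and is not part of test.txt.

```php
<?php
$re_word = '/\b(\w\b|\w[\w\']*\w\b)/';  // same word regex as in the script

// Hypothetical sample: an inner apostrophe, a quoted word, a one-letter word.
$sample = "Clarita's est 'etiam' a processus.";

preg_match_all($re_word, $sample, $w_matches);
print_r($w_matches[1]);  // Clarita's, est, etiam, a, processus
```

An apostrophe inside a word is kept, while the quotes around 'etiam' are dropped, because a match must begin and end on a \w character.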
Hope this helps!
:)
trotsky
Forum Newbie
Posts: 4
Joined: Sat Jul 31, 2010 2:16 pm

Re: Matching single words

Post by trotsky »

Thanks for the help!
What if I wanted to get every adjacent three-word combination? Would the regex be this?
$doubleWordRegex = '#[^A-Za-z]+\s[A-Za-z]+\s[A-Za-z]+#';


The code should put the following into an array:
how are you
are you doing
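A possible approach, sketched under the assumption that a word is a run of ASCII letters: collect the words first, then slide a three-word window over the array instead of trying to do it all in one regex. The names $n and $triples are hypothetical.

```php
<?php
$sentence = "how are you \n\n\ndoing today?";

// Collect the individual words first (greedy +, no U modifier).
preg_match_all('#[A-Za-z]+#', $sentence, $m);
$words = $m[0];

// Slide a window of $n words across the list.
$n = 3;
$triples = array();
for ($i = 0; $i + $n <= count($words); $i++) {
    $triples[] = implode(' ', array_slice($words, $i, $n));
}

print_r($triples);  // how are you, are you doing, you doing today
```

A single preg_match_all() call cannot return overlapping matches unless the pattern is wrapped in a lookahead, which is why the window is built in plain PHP here.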