Your regex matches only one character because you have specified the 'U' (PCRE_UNGREEDY) pattern modifier. This (completely useless) modifier reverses the greedy/lazy behavior of quantifiers, i.e. it makes all greedy quantifiers lazy and all lazy quantifiers greedy. There is never, ever a need to use the 'U' modifier:
if you need to make a quantifier lazy, just append a ? to it. Bottom line: Never use the U modifier!
See:
www.regular-expressions for a discussion of greedy vs lazy quantifiers.
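To make the greedy/lazy swap concrete, here is a minimal sketch. The sample string 'aaaa' and the variable names are made up purely for illustration; they are not taken from your regex.

```php
<?php
// Hypothetical sample string, chosen only for illustration.
$subject = 'aaaa';

// Greedy (the default): + takes as many characters as it can.
preg_match('/a+/', $subject, $greedy);

// Lazy: appending ? makes + take as few characters as it can.
preg_match('/a+?/', $subject, $lazy);

// With the U modifier the two behaviors are swapped: plain + becomes lazy...
preg_match('/a+/U', $subject, $u_plain);

// ...and +? becomes greedy.
preg_match('/a+?/U', $subject, $u_lazy);

printf("%s %s %s %s\n", $greedy[0], $lazy[0], $u_plain[0], $u_lazy[0]);
// prints: aaaa a a aaaa
```

The third call is the situation in your regex: a plain + under the U modifier stops after a single character.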
Your regex has only one quantifier (the '+'), and the U modifier makes this normally greedy quantifier behave lazily, so it matches only one character. You are also using the 's' ("single-line", dot-matches-all) modifier,
which does absolutely nothing here because there is no dot in the expression. For more on PHP PCRE regex modifiers, see:
PHP Pattern Modifiers.
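A quick way to convince yourself that the 's' modifier only affects the dot is a sketch like the following. The two-line sample string and the variable names are assumptions used just for the demonstration.

```php
<?php
// Hypothetical sample: two characters separated by a newline.
$subject = "a\nb";

// Without 's' the dot does not match a newline, so this fails.
$dot_without_s = preg_match('/a.b/', $subject);

// With 's' the dot matches any character, newline included.
$dot_with_s = preg_match('/a.b/s', $subject);

// A pattern with no dot behaves identically with or without 's'.
preg_match('/\w+/', $subject, $plain);
preg_match('/\w+/s', $subject, $with_s);

printf("%d %d %s %s\n", $dot_without_s, $dot_with_s, $plain[0], $with_s[0]);
// prints: 0 1 a a
```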
Here is a script which divides text into paragraphs, sentences and words.
<?php
$text = file_get_contents('test.txt');

// here are regexes to match paragraphs, sentences and words
$re_paragraph = '/\s*+([^\r\n]++)\s*+/';      // Group 1 contains paragraph
$re_sentence  = '/\s*+([^.?!\r\n]++[.?!]?)/'; // Group 1 contains sentence
$re_word      = '/\b(\w\b|\w[\w\']*\w\b)/';   // Group 1 contains word

$paragraphs = array();
$sentences = array();
$words = array();

$paragraph_count = preg_match_all($re_paragraph, $text, $p_matches);
printf("The text has %d paragraphs:\n", $paragraph_count);
for ($i = 0; $i < $paragraph_count; $i++) {
    $paragraphs[] = $p_matches[1][$i];
    $sentence_count = preg_match_all($re_sentence, $p_matches[1][$i], $s_matches);
    printf(" Paragraph %d has %d sentences:\n", $i, $sentence_count);
    for ($j = 0; $j < $sentence_count; $j++) {
        $sentences[] = $s_matches[1][$j];
        $word_count = preg_match_all($re_word, $s_matches[1][$j], $w_matches);
        printf(" Sentence %d has %d words.\n", $j, $word_count);
        for ($k = 0; $k < $word_count; $k++) {
            $words[] = $w_matches[1][$k];
        }
    }
}
printf("The text contains a total of %d paragraphs, %d sentences and %d words.\n",
    count($paragraphs), count($sentences), count($words));
?>
Given the following test.txt file:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi.
Nam liber tempor cum soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum. Typi non habent claritatem insitam; est usus legentis in iis qui facit eorum claritatem. Investigationes demonstraverunt lectores legere me lius quod ii legunt saepius.
Claritas est etiam processus dynamicus, qui sequitur mutationem consuetudium lectorum. Mirum est notare quam littera gothica, quam nunc putamus parum claram, anteposuerit litterarum formas humanitatis per seacula quarta decima et quinta decima. Eodem modo typi, qui nunc nobis videntur parum clari, fiant sollemnes in futurum. Clarita's est etiam processus dynamicus, qui sequitur mutationem consuetudium lector'um.
Here is the script output:
The text has 3 paragraphs:
Paragraph 0 has 2 sentences:
Sentence 0 has 21 words.
Sentence 1 has 43 words.
Paragraph 1 has 3 sentences:
Sentence 0 has 19 words.
Sentence 1 has 14 words.
Sentence 2 has 10 words.
Paragraph 2 has 4 sentences:
Sentence 0 has 10 words.
Sentence 1 has 22 words.
Sentence 2 has 13 words.
Sentence 3 has 10 words.
The text contains a total of 3 paragraphs, 9 sentences and 162 words.
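The last sentence of the test file contains "Clarita's" and "lector'um" to exercise the apostrophe handling in the word regex. Here is a small sketch of that one regex in isolation. The sample string (including "dogs'") and the variable names are assumptions added for illustration, not part of the test file.

```php
<?php
// Same word regex as in the script above.
$re_word = '/\b(\w\b|\w[\w\']*\w\b)/';

// Hypothetical sample: inner apostrophes, a trailing apostrophe, and a one-letter word.
$subject = "Clarita's dogs' a lector'um.";

$count = preg_match_all($re_word, $subject, $m);

// An apostrophe inside a word is kept; a trailing one is not part of the word.
printf("%d: %s\n", $count, implode(' | ', $m[1]));
// prints: 4: Clarita's | dogs | a | lector'um
```

The first alternative (\w\b) is what picks up one-letter words such as "a"; the second requires the word to both start and end with a word character, which is why the trailing apostrophe in "dogs'" is left out.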
Hope this helps!
