Catch all capped words and expressions made by them

Posted: Sun Apr 19, 2009 12:52 pm
by MeLight
The problem:
I go over a file line by line. In each line I need to catch ALL the expressions made of words starting with capitals and put them as a sub-array in an array.
For example if on line 2 I have the next text (the text is tokenized, ie all punctuation marks are padded with spaces):
Hello world ! Is there anybody named John Doe Spartacus here ?

I need the corresponding cell in the $results array to be (order doesn't really matter):
$results[2] =>
[0] => Hello
[1] => Is
[2] => John
[3] => John Doe
[4] => John Doe Spartacus
[5] => Doe
[6] => Doe Spartacus
[7] => Spartacus

So far the best I've got (using preg_match_all) is:
$results[2] =>
[0] => Hello
[1] => Is
[2] => John Doe Spartacus
[3] => Spartacus

As this isn't exactly what I was looking for, I moved to using preg_match with the offset flag, but still no good. I'm handling the array part fine; I need help with the regex. Here's the code:

Code:

 
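$inh = fopen($inf, "r");
$i = 0; // line counter (it has to be initialized before the loop)

// reading in the loop condition avoids an extra pass with $line === false at end of file
while(($line = fgets($inh)) !== false) {
    $matches = array($i => array());
    $tmatches = array();

    $offset = 0;
    do {
        $ret = preg_match("/([A-Z][A-Za-z\-]+)( [A-Z][A-Za-z\-]+)*/", $line, $tmatches, PREG_OFFSET_CAPTURE, $offset);
        if($ret == 1) {
            // continue the search right after the end of this match
            $offset = $tmatches[0][1] + strlen($tmatches[0][0]);
            $matches[$i][] = $tmatches[0][0];
        }
    }
    while($ret == 1);

    print_r($matches[$i]);

    $i++;
}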
 
Thanx!

edit: added "Is" to the arrays
2nd edit: added "Doe" as a singular match :D

Re: Catch all capped words and expressions made by them

Posted: Sun Apr 19, 2009 1:03 pm
by prometheuzz
Why is the word "Is" missing? It also starts with a capital.

Re: Catch all capped words and expressions made by them

Posted: Sun Apr 19, 2009 1:26 pm
by MeLight
ok, added "Is" to the arrays, thanx :D

Re: Catch all capped words and expressions made by them

Posted: Sun Apr 19, 2009 1:37 pm
by prometheuzz
MeLight wrote: ok, added "Is" to the arrays, thanx :D
Okay, and why isn't "Doe" a "singular match"?

But it seems you want some sort of combinatorial output from your matches. This isn't possible with regex alone. Once a regex has matched some text, it doesn't give up that match for another match. Yes, you could work around that with look-aheads, but I am positive the regex engine will not be able to produce the output you posted in your original post. Sure, you can probably use regex as part of your solution, but you will not be able to get it using a single preg_match_all(...).
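To illustrate the point, here is a small sketch (not code from the original post; the non-capturing groups and the look-ahead pattern are my own variation on the pattern above). The plain pattern consumes each run of capitalized words once. Wrapping it in a look-ahead with a capture group yields overlapping matches, but still only the longest run from each starting word, so "John", "John Doe" and "Doe" never show up.

```php
<?php
$line = "Hello world ! Is there anybody named John Doe Spartacus here ?";

// Plain pattern: every match consumes its text, so matches never overlap.
preg_match_all('/[A-Z][A-Za-z\-]+(?: [A-Z][A-Za-z\-]+)*/', $line, $plain);
print_r($plain[0]);
// Hello, Is, John Doe Spartacus

// Look-ahead variant: the match itself is empty, the text is taken from the
// capture group, so a new attempt can start inside a previous run.
// It still gives only ONE (the longest) run per starting word.
preg_match_all('/\b(?=([A-Z][A-Za-z\-]+(?: [A-Z][A-Za-z\-]+)*))/', $line, $overlap);
print_r($overlap[1]);
// Hello, Is, John Doe Spartacus, Doe Spartacus, Spartacus
```

The shorter sub-expressions have to be generated in PHP code after the regex has found the runs.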

Re: Catch all capped words and expressions made by them

Posted: Sun Apr 19, 2009 6:54 pm
by MeLight
Yeap, "Doe" is supposed to be there too :D
Anywayz, if it's not possible then I'll probably just loop through every match the regex finds and split it up myself. I'll post the solution when I finish it.

Re: Catch all capped words and expressions made by them

Posted: Mon Apr 20, 2009 3:37 am
by MeLight
Ok, if someone needs it, here's the solution:

Code:

 
<?php
// check the argument count before touching $argv[1]
if($argc != 2) {
    $script = $_SERVER['SCRIPT_NAME'];
    echo "usage: php $script <infile>\n";
    exit(1);
}

$inf = $argv[1];
$inh = fopen($inf, "r");
$i = 0; // line counter (it has to be initialized before the loop)

// reading in the loop condition avoids an extra pass with $line === false at end of file
while(($line = fgets($inh)) !== false) {
    $matches = array($i => array());
    $tmatches = array();

    $offset = 0;
    do {
        $ret = preg_match("/([A-Z][A-Za-z\-]+)( +[A-Z][A-Za-z\-]+)*/", $line, $tmatches, PREG_OFFSET_CAPTURE, $offset);
        if($ret == 1) { //if a match for our pattern was found
            $exp = $tmatches[0][0];
            $offset = $tmatches[0][1] + strlen($exp);
            //procExp() (implemented below) breaks the expression into an array of sub-expressions;
            //array_merge() keeps the result flat, as in the example from the first post
            $matches[$i] = array_merge($matches[$i], procExp($exp));
        }
    }
    while($ret == 1);

    print_r($matches[$i]);

    $i++;
}
 
/* gets an expression, returns it as combinatoric array (didn't find a better name :D)
For: $exp = Max Edipo Payne
Returns: Array (
[0] => Max
[1] => Max Edipo
[2] => Max Edipo Payne
[3] => Edipo
[4] => Edipo Payne
[5] => Payne
)
*/
function procExp($exp) {
    // split on runs of spaces, because the pattern allows " +" between words
    $break = preg_split('/ +/', trim($exp));
    $retArr = array();
    $num = count($break);

    for($i = 0; $i < $num; $i++) {           // index of the first word
        for($k = 1; $k <= $num - $i; $k++) { // number of words to take
            $retArr[] = implode(" ", array_slice($break, $i, $k));
        }
    }

    return $retArr;
}
?>
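
If someone wants to check the result against the example from the first post without a file, here is a compact sketch of the same idea. The function name cappedExpressions() is made up for this example, and it uses preg_match_all instead of the offset loop, so take it as one possible variant and not as the script above.

```php
<?php
// Hypothetical helper (name invented for this sketch): returns every run of
// consecutive capitalized words in $line, plus all of its sub-runs, as one flat list.
function cappedExpressions($line) {
    $result = array();
    // each match is a maximal run of capitalized words separated by spaces
    preg_match_all('/[A-Z][A-Za-z\-]+(?: +[A-Z][A-Za-z\-]+)*/', $line, $runs);
    foreach ($runs[0] as $run) {
        $words = preg_split('/ +/', $run);
        $num = count($words);
        for ($start = 0; $start < $num; $start++) {          // first word of the sub-run
            for ($len = 1; $len <= $num - $start; $len++) {   // number of words in it
                $result[] = implode(" ", array_slice($words, $start, $len));
            }
        }
    }
    return $result;
}

$line = "Hello world ! Is there anybody named John Doe Spartacus here ?";
print_r(cappedExpressions($line));
// Hello, Is, John, John Doe, John Doe Spartacus, Doe, Doe Spartacus, Spartacus
```

A run of n capitalized words gives n * (n + 1) / 2 entries, so very long runs blow up the output quickly.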