Regular expression turns invalid once it enters PHP

Posted: Thu Aug 20, 2009 1:06 pm
by dahwan

Code:

<?php
    $link = $_POST["link"];
    $foo = file($link);
    $data = "";
 
    $data = preg_grep("[\w-.]+?@([\w-]+?\.)+[\w]{2,}", $foo);
    
    print_r($data)
?>
This script is designed to extract the email addresses from any page. The expression is valid, and I have tested it in several regex testers, for instance http://gskinner.com/RegExr/

But when I try to use it in PHP, I get this error message:
Warning: preg_grep() [function.preg-grep]: Unknown modifier '+' in /home/dahwan/public_html/emailextractor/emailextractor.php on line 6
Am I using it wrong?

Any help appreciated

Re: Regular expression turns invalid once it enters PHP

Posted: Thu Aug 20, 2009 1:16 pm
by jackpf
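With the PCRE functions in PHP, the pattern has to be wrapped in a delimiter, which can be any non-alphanumeric, non-backslash, non-whitespace character, at the start and the end. Without one, PHP takes the leading [ as the opening delimiter and the first ] as the closing one, then tries to read the rest as modifiers, which is why it complains about an unknown modifier '+'.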

Like:

Code:

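<?php
    $link = $_POST["link"];
    $foo = file($link);   // one array element per line of the page

    // The slashes are the delimiters. The hyphen sits last in the character
    // class so that it is always read as a literal character.
    $data = preg_grep('/[\w.-]+?@([\w-]+?\.)+[\w]{2,}/', $foo);

    print_r($data);
?>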
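
In case it helps, here is a small sketch of the delimiter rule on its own. The sample lines and variable names are made up for illustration; the point is only that the delimiter does not have to be a slash, and that modifiers go after the closing delimiter.

```php
<?php
// Sample input; these lines are invented for the example.
$lines = array(
    'Contact: someone@example.com',
    'No address on this line',
);

// Pattern body without delimiters. The hyphen is last in each character
// class so that it is read as a literal hyphen.
$body = '[\w.-]+@([\w-]+\.)+\w{2,}';

// Any non-alphanumeric, non-backslash, non-whitespace character can act
// as the delimiter, as long as the same one closes the pattern.
$withSlashes = preg_grep('/' . $body . '/', $lines);
$withHashes  = preg_grep('#' . $body . '#', $lines);

// Modifiers follow the closing delimiter, e.g. "i" for case-insensitive.
$withModifier = preg_grep('~' . $body . '~i', $lines);

print_r($withSlashes);
```

A delimiter that does not occur inside the pattern itself saves some escaping, which is why # or ~ is often used for patterns that contain slashes.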

Re: Regular expression turns invalid once it enters PHP

Posted: Thu Aug 20, 2009 1:35 pm
by dahwan
Thanks for the quick reply. At least PHP doesn't crash now :P But I'm getting unexpected results. When I tried this in the regex tester, it found all the emails perfectly. But in PHP, it seems, it grabs the whole line. This is the result I got:
<td width="50">&nbsp;<a href="mailto:fjernlaan-nbo@nb.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:innlaan@nb.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:musikk-oslo@nb.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:samkat@nb.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:dag.t.henriksen@uis.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:filmbibliotek@nb.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:depot-fjernlaan@nb.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:bib-hald@hiof.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:bib.krsund@himolde.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:bib-sarp@hiof.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:bib-fred@hiof.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:bib-figur@hiof.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:post@ostfoldforskning.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:firmapost@ij.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:biblioteket@so-hf.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:info@frambu.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:len@sormarka.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:biblioteket@umb.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:karin.lyngmo@umb.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:library.noragric@umb.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:bibliotek@skogoglandskap.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:anne.ombustvedt@umb.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:liv.korslund@umb.no">E-post</a>&nbsp;</td>
<br /><td width="50">&nbsp;<a href="mailto:plantehelse.bibl@bioforsk.no">E-post</a>&nbsp;</td>
<br />
I was expecting a list with the emails only. Could anyone shed some light on this?

Thanks

EDIT: On second thought, the function of preg_grep is probably to return the array elements that contain a match. Any idea how I can cut the extra HTML out of there?

Re: Regular expression turns invalid once it enters PHP

Posted: Thu Aug 20, 2009 1:48 pm
by jackpf
Well, normally strip_tags(), but that would strip out the link, which includes the email address, which is what I presume you want to keep...

What does this output?

Code:

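// $data is an array, so join it into one string for preg_match_all().
// Note "<a" rather than "\a": in PCRE, \a matches the BEL control character.
preg_match_all('/<a href="mailto:(.*?)"/', implode("\n", $data), $matches);
 
print_r($matches);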
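
To make the difference between the two functions concrete, here is a rough sketch on made-up input shaped like the lines quoted above (the address is a placeholder). preg_grep() filters an array and hands back whole elements, while preg_match_all() collects the matched text, including whatever the capture group caught.

```php
<?php
// Two sample lines; the address is a placeholder.
$lines = array(
    '<td width="50">&nbsp;<a href="mailto:someone@example.com">E-post</a>&nbsp;</td>',
    '<td width="50">&nbsp;no address here&nbsp;</td>',
);

// preg_grep() returns the array elements that contain a match, unchanged.
$matchingLines = preg_grep('/mailto:/', $lines);

// preg_match_all() expects a single string, so the lines are joined first.
// $matches[0] holds the full matches, $matches[1] the first capture group.
preg_match_all('/<a href="mailto:(.*?)"/', implode("\n", $lines), $matches);

print_r($matchingLines); // whole lines, HTML included
print_r($matches[1]);    // only the text captured after "mailto:"
```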

Re: Regular expression turns invalid once it enters PHP

Posted: Thu Aug 20, 2009 2:03 pm
by dahwan
Actually, I got it working.

Code:

<?php
    $link = $_POST["link"];
    $foo = file($link);
    $pattern = '/[\w.-]+?@([\w-]+?\.)+[\w]{2,}/';

    // Keep only the lines that contain an address.
    $data = preg_grep($pattern, $foo);

    $formattedstring = "";

    foreach($data as $piece)
    {
        // $matches[0] holds the full matches found on this line.
        preg_match_all($pattern, $piece, $matches);

        // Only the first address on each line is used.
        $formattedstring .= $matches[0][0] . "<br />\n";
    }

    echo $formattedstring;
?>
I know it's a little messy, and it won't work if there are several email addresses per line, but I'll cross that bridge if I come to it, and return to this forum. Thanks for the priceless help!
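
For the several-addresses-per-line case, one possible approach is to keep every entry of $matches[0] instead of only the first. This is a minimal sketch, not tested against the real page: extract_emails() is a hypothetical helper name, the sample lines are invented, and the pattern is a slightly simplified form of the one above.

```php
<?php
/**
 * Hypothetical helper: collect every address found in an array of lines.
 */
function extract_emails(array $lines)
{
    $pattern = '/[\w.-]+@(?:[\w-]+\.)+\w{2,}/';
    $found = array();

    foreach ($lines as $line) {
        // preg_match_all() returns the number of full matches (0 if none).
        if (preg_match_all($pattern, $line, $matches)) {
            // $matches[0] lists every full match on this line.
            $found = array_merge($found, $matches[0]);
        }
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($found));
}

// Invented sample input with two addresses on one line and a repeat.
$sample = array(
    '<a href="mailto:post@example.com">A</a> <a href="mailto:info@example.com">B</a>',
    'nothing here',
    '<a href="mailto:post@example.com">A again</a>',
);

echo implode("<br />\n", extract_emails($sample)), "\n";
```

With the script above, the call would be something like extract_emails($foo), which also makes the separate preg_grep() pass unnecessary.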

Re: Regular expression turns invalid once it enters PHP

Posted: Thu Aug 20, 2009 2:30 pm
by jackpf
Cool, no problem.