Problems matching a pattern (RegExps)

Posted: Thu Jan 13, 2005 8:34 am
by visionmaster
Hello PHP and RegExps experts,


==> Question: Why is there no match with my RegExp '(.{50,250})56130\s+Bad Ems(.{50,250})'is ???
This has been driving me crazy for hours... I would really appreciate your help!

Code: Select all

[...]
function verifyAddress($arrDaten,$content) {
		
       ##### Function that checks whether the postal code/city combination exists
	   ##### checkPlzOrt();
	   
	   global $arrDaten;
   
   
	   // Find all places that contain the postal code, with at least 50 and at most 250 characters before and after it,
	   // i.e. blocks are produced that contain the postal code AND the city.
	   $arrParsedBlocks = getDataUsingRegexp("'(.{50,250})".preg_quote($arrDaten['Plz'])."\s+".preg_quote($arrDaten['Ort'])."(.{50,250})'is",$content); 	
	   
[...]

	   
[...]
function getDataUsingRegexp($strRegexp,$string)
{      
	global $arrDaten;

	preg_match_all($strRegexp, $string, $matches);
	
	
	### DEBUG BEGIN
	print "RegExp=".$strRegexp.'<br><br>';
	print "String=".$string.'<br><br>';
	
	if(count($matches[0])<= '0')
	{
		print "Match=0<br><br>";
	}
	else
	{
		print "Match=".$matches[0].'<br><br>';
	}
	### DEBUG END
		
	$arrListe = array();	

	for ($i=0; $i< count($matches[0]); $i++)
	{   
	   print "enter";
	   $strData = trim($matches[1][$i].$arrDaten['Plz'].' '.$arrDaten['Ort'].$matches[2][$i]);      	     	    
	     
	   $arrListe[] = $strData;        	               
	}

	return $arrListe;

}	   
[...]
My debug lines in the function getDataUsingRegexp() print the following in my browser. The HTML source is included below; it's a bit long, I know.

RegExp='(.{50,250})56130\s+Bad Ems(.{50,250})'is

String= Ihre Gestaltung und Produkte gefallen uns! Bitte senden Sie uns ausfuehrliche Informationen ueber: "> Kacheloefen
Wand & Boden
Wandbilder Outdoor-Bereich
Denkmalpflege Rufen Sie uns bitte an unter Unsere Adresse: Firma: Name: Straße: PLZ & Ort: Land:
Baukeramik Ebinger GmbH
Lindenbach 2
56130 Bad Ems
Tel. 02603 / 2196
Fax 02603 / 2993
info@baukeramik-ebinger.de Download der Prospekte als .pdf (Klicken Sie zum speichern die rechte Maustaste: "Ziel speichern unter")
Bruchmosaik
Saeulen
Ebinger
Wand & Boden
Pflaster
Kacheloefen
Dripping
Saeulenoefen
Outdoor

Match=0



Thanks!


<TITLE>Baukeramik Ebinger GmbH: Kontakt</TITLE>









Ihre
Gestaltung und Produkte gefallen uns!



Bitte
senden Sie uns ausfuehrliche Informationen ueber:


">

Kacheloefen<br>

Wand & Boden<br>

Wandbilder


Outdoor-Bereich<br>

Denkmalpflege












Rufen Sie uns bitte an unter














Unsere Adresse:




Firma:






Name:






Straße:







PLZ &
Ort:





Land:










<br>




Baukeramik Ebinger
GmbH<br>
Lindenbach 2<br>
56130 Bad Ems<br>
Tel. 02603 / 2196<br>
Fax 02603 / 2993<br>
info@baukeramik-ebinger.de







Download
der Prospekte als .pdf



(Klicken
Sie zum speichern die rechte Maustaste: "Ziel speichern unter")




<br>

Bruchmosaik
<br>
Saeulen
<br>
Ebinger


<br>
Wand & Boden
<br>
Pflaster
<br>
Kacheloefen


<br>

Dripping
<br>
Saeulenoefen
<br>
Outdoor

Posted: Thu Jan 13, 2005 8:40 am
by feyd
I'm a bit unsure from your example. What I can suggest is that you try tossing your pattern and text into the regex coach. It may point out what is happening.

Posted: Thu Jan 13, 2005 9:12 am
by visionmaster
Hi feyd, thanks for the quick reply!

I'll have to try out regex coach...

Here is another debug output; strangely, it works fine for this example.
As a short explanation: I'm fishing out blocks that contain postal code + city.

Maybe someone can explain why it works just fine for this example:


RegExp='(.{50,250})26123\s+Oldenburg(.{50,250})'is

String=


CMC
Conceptagentur fuer Marketing
und Communication GmbH
Nadorster Straße 222
26123 Oldenburg
Telefon 0441 93370-0
Telefax 0441 93370-11
E-Mail: info@cmc-concept.de
Kontakt
Ihr Name: Firma: Straße und Nr.: Ort und PLZ: Telefon: Telefax: e-Mail: Internet: Ihre Nachricht:
GF: Wolf D. Feuerlein
Amtsgericht Oldb., HRB 3191
UST-ID-Nr.: DE16203954

Unsere Verbundpartner im Dialog:
FFI fullservice foodmarketing institut
Heintel Marketing im Netzwerk
T.I.M. Team Impulse Marketing · Hartmut Witte






Match=Array

enter

array(1) {
[0]=>
string(411) "

















CMC

Conceptagentur fuer Marketing

und Communication GmbH

Nadorster Straße 222

26123 Oldenburg

Telefon 0441 93370-0

Telefax 0441 93370-11

E-Mail: info@cmc-concept.de






Kontakt










Ihr Name:



Firma:"
}


The source code of the above output follows:

<title>CMC-Concept</title>









<br>
<br>

<br>


CMC <br>
Conceptagentur fuer Marketing <br>
und Communication GmbH <br>
Nadorster Straße 222 <br>
26123 Oldenburg <br>
Telefon 0441 93370-0 <br>

Telefax 0441 93370-11 <br>
E-Mail: info@cmc-concept.de <br>





Kontakt





<br>



Ihr Name:



Firma:



Straße und Nr.:



Ort und PLZ:



Telefon:



Telefax:



e-Mail:



Internet:



Ihre Nachricht:




<br>











GF: Wolf D. Feuerlein <br>
Amtsgericht Oldb., HRB 3191 <br>

UST-ID-Nr.: DE16203954 <br>










<br>




Unsere Verbundpartner im Dialog: <br>
FFI fullservice foodmarketing institut <br>
Heintel Marketing im Netzwerk <br>

T.I.M. Team Impulse Marketing · Hartmut Witte <br>















<title>CMC-Concept</title>











<br>
<br>
<br>

















<br>

Posted: Thu Jan 13, 2005 9:40 am
by feyd
hmm.. there appears to be a potential that the space you think is between Bad & Ems is actually not a space, but some other whitespace character, like a carriage return.. try changing the pattern used with the first regex to use \s+ as the gap between the words.

Posted: Thu Jan 13, 2005 10:16 am
by visionmaster
feyd wrote:hmm.. there appears to be a potential that the space you think is between Bad & Ems is actually not a space, but some other whitespace character, like a carriage return.. try changing the pattern used with the first regex to use \s+ as the gap between the words.
Hey, I didn't think of that... But what else than \s+ could I use?

I actually tested the following RegExp:

(.{10,250})56130\s+Bad Ems(.{50,250})

against the following text:

Land:










<br>




Baukeramik Ebinger
GmbH<br>
Lindenbach 2<br>
56130 Bad Ems<br>
Tel. 02603 / 2196<br>
Fax 02603 / 2993<br>
info@baukeramik-ebinger.de







Download
der Prospekte als .pdf

=> The RegExp gives me a match (yellow highlighting). Or does something change when I copy the text from my browser's source view? I don't understand that one.

Thanks a lot!

Posted: Thu Jan 13, 2005 10:24 am
by feyd
hmm.. okay, try changing your original regex pattern to not use single quotes as the pattern start and end markers.. you should also make sure the calls to preg_quote escape the marker.. just in case it appears in the pattern.
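
A minimal sketch of the delimiter advice above, assuming `~` as the delimiter and made-up sample text (neither comes from the original script). `preg_quote()` takes an optional second argument naming the delimiter, which is then escaped along with the usual regex metacharacters:

```php
<?php
// Sample values standing in for $arrDaten['Plz'] and $arrDaten['Ort'].
$plz = '56130';
$ort = 'Bad Ems';

// "~" as delimiter is an arbitrary choice for this sketch.
$delimiter = '~';

// The optional second argument of preg_quote() names the delimiter,
// so that character is escaped too if it appears in the input.
$pattern = $delimiter
    . '(.{10,250})' . preg_quote($plz, $delimiter)
    . '\s+' . preg_quote($ort, $delimiter)
    . '(.{10,250})'
    . $delimiter . 'is';

// Made-up test text: 20 filler characters on each side of the address.
$content = str_repeat('x', 20) . "56130 \n Bad Ems" . str_repeat('y', 20);

$count = preg_match_all($pattern, $content, $matches);
print "Pattern=" . $pattern . "\n";
print "Matches=" . $count . "\n";
```

Note that `|` is itself a regex metacharacter, so `preg_quote()` escapes it even without the second argument; the argument matters for delimiters such as `'`, `/` or `~`.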

Posted: Thu Jan 13, 2005 3:51 pm
by visionmaster
feyd wrote:hmm.. okay, try changing your original regex pattern to not use single quotes as the pattern start and end markers.. you should also make sure the calls to preg_quote escape the marker.. just in case it appears in the pattern.

Hi feyd,

O.k., I replaced the single quotes with a |
That actually didn't change anything. How do I escape the | marker in the call to preg_quote?

Again the question: what else than \s+ could I use? It looks like the whitespace isn't really whitespace, as you suggested. I don't see any other reason why I don't get a match in my PHP script... :(

Code: Select all

[...]
function verifyAddress($arrDaten,$content) {
       
       ##### Function that checks whether the postal code/city combination exists
      ##### checkPlzOrt();
       
      global $arrDaten;
    
    
      // Find all places that contain the postal code, with at least 50 and at most 250 characters before and after it,
      // i.e. blocks are produced that contain the postal code AND the city.
      $arrParsedBlocks = getDataUsingRegexp("|(.{50,250})".preg_quote($arrDaten['Plz'])."\s+".preg_quote($arrDaten['Ort'])."(.{50,250})|is",$content);
       
[...]

       
[...]
function getDataUsingRegexp($strRegexp,$string) 
{
   global $arrDaten;

   preg_match_all($strRegexp, $string, $matches);
    
    
   ### DEBUG BEGIN
   print "RegExp=".$strRegexp.'<br><br>';
   print "String=".$string.'<br><br>';
    
   if(count($matches[0])<= '0')
   {
      print "Match=0<br><br>";
   }
   else
   {
      print "Match=".$matches[0].'<br><br>';
   }
   ### DEBUG END
       
   $arrListe = array();

   for ($i=0; $i< count($matches[0]); $i++)
   {
      print "enter";
      $strData = trim($matches[1][$i].$arrDaten['Plz'].' '.$arrDaten['Ort'].$matches[2][$i]);
         
      $arrListe[] = $strData;
   }

   return $arrListe;

}
[...]

Posted: Thu Jan 13, 2005 4:06 pm
by feyd
\s is the best thing to use for any form of whitespace.

I'm going to guess that $arrDaten['Ort'] is 'Bad Ems' in this case? I was suggesting that you use str_replace() to replace the space character in that, with a "\s+"

just for testing's sake, in your debugging code, place this:

Code: Select all

echo htmlentities($string) . '<br><br>';
instead of

Code: Select all

print "String=".$string.'<br><br>';
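
A rough sketch of the suggestion above, so that the gap inside the city name also accepts any whitespace. The helper name `buildAddressPattern`, the `~` delimiter and the `{10,250}` bounds are assumptions for illustration; it splits the city name and joins the quoted words with `\s+` instead of calling `str_replace()`, which has the same effect:

```php
<?php
// Hypothetical helper; not part of the original script.
function buildAddressPattern(string $plz, string $ort): string
{
    // Split the city name on whitespace and quote each word on its own.
    $words = preg_split('/\s+/', trim($ort));
    $quoted = array_map(function ($word) {
        return preg_quote($word, '~');
    }, $words);

    // Join the words with \s+ so spaces, tabs, CR and LF all fit in the gap.
    return '~(.{10,250})' . preg_quote($plz, '~') . '\s+'
        . implode('\s+', $quoted) . '(.{10,250})~is';
}

$pattern = buildAddressPattern('56130', 'Bad Ems');
print "RegExp=" . htmlentities($pattern) . "\n";
```

Printing the subject string through `htmlentities()` as well makes entities such as `&nbsp;` show up literally in the browser instead of being rendered as a space.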

Posted: Fri Jan 14, 2005 3:53 am
by visionmaster
Hi feyd,

Yes, $arrDaten['Ort'] is 'Bad Ems', that's a city name.
Hmm, this is really strange, I think my brain has difficulties dealing with this one... ;-) Allow me to get into the details once more.

Here is a snippet from http://www.baukeramik-ebinger.de/html_g ... ntakt.html :

Code: Select all

[...]
 <TABLE cellSpacing=0 cellPadding=0 width="100%" border=0>
        <TBODY>
        <TR>
          <TD vAlign=top><INPUT style="WIDTH: 150px" type=submit value="Info anfordern"><BR><INPUT style="WIDTH: 150px" type=reset value=Korrigieren...> 
            &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD>
          <TD>
            <DIV align=left><FONT face="Arial, Helvetica" 
            size=-1><STRONG>Baukeramik Ebinger GmbH<BR>Lindenbach 
            2<BR>56130&nbsp; Bad Ems<BR>Tel. 02603&nbsp;/&nbsp;2196<BR>Fax 
            02603&nbsp;/&nbsp;2993<BR><A 
            href="mailto:info@baukeramik-ebinger.de"><NOBR>info@baukeramik-ebinger.de</NOBR></A> 
            </STRONG></FONT></DIV></TD></TR></TBODY></TABLE></FORM></P></TD>
    <TD bgColor=#89432c><IMG height=10 
      src="Baukeramik Ebinger GmbH Kontakt-Dateien/transparent_pixel.gif" 
      width=1></TD>
[...]
My script turns the above into the following:

Code: Select all

<TITLE>Baukeramik Ebinger GmbH: Kontakt</TITLE>








  
     Ihre 
      Gestaltung und Produkte gefallen uns!
      
        
           
             Bitte 
              senden Sie uns ausfuehrliche Informationen ueber:
           
              
              ">
              
              Kacheloefen<br>
              
              Wand & Boden<br>
              
              Wandbilder 
            
            
            Outdoor-Bereich<br>
            
            Denkmalpflege 
            
            
           
              
           
        
        
	
	  
	    
	 
	  
	   Rufen Sie uns bitte an unter
	   
	   
	  
	 
	  
	    
	 
	  
	    
	  
	    
	 
	  
	    
	   Unsere Adresse: 
	  
	 
	  
	   
	  Firma:
	   
	   
	  
	 
	  
	   
	  Name:
	   
	   
	  
	 
	  
	    
	   Straße: 
	  
	   
	   
	  
	 
	  
	    
	   PLZ & 
	   Ort: 
	   
	    
	 
	  
	   
	  Land: 
	   
	  

	  
	 
	
         
        
           
             
              <br>

               
                   
            
 Baukeramik Ebinger 
		GmbH<br>
		Lindenbach 2<br>
		56130  Bad Ems<br>
		Tel. 02603 / 2196<br>
		Fax 02603 / 2993<br>
		info@baukeramik-ebinger.de 
		
          
        
      
    
  
    
   Download 
	der Prospekte als .pdf
   
	 
	 
(Klicken 
	   Sie zum speichern die rechte Maustaste: "Ziel speichern unter")
	  
	 
	
	 
	 <br>

	  Bruchmosaik 
	 <br>
	  Saeulen 
	 <br>
	  Ebinger
	
	 
	 <br>
	  Wand & Boden
	 <br>
	  Pflaster 
	 <br>
	  Kacheloefen
	
	 
	 <br>

	  Dripping 
	 <br>
	  Saeulenoefen 
	 <br>
	   Outdoor


=> My RegExp |(.{10,250})56130\s+Bad Ems(.{50,250})|is matches nothing; my debug output says 'Match=0'.

=> Next step: I came up with the idea of changing the html source code at one specific line from
56130 Bad Ems<br>
to
56130 Bad Ems<br>

Notice one space less between '56130' and 'Bad Ems'. (Oops, can't display that here in this posting.)

What I find very strange is that this time I get a match. I really don't understand why, since \s+ should take care of one or more whitespace characters.
To be sure, I tested the example with 'The Regex Coach'; it works fine there.

=> Question: What am I doing wrong? I really can't solve this problem. Something is going wrong in my script, but what? Please help somebody very frustrated... :-(

Best Regards,
visionmaster

Posted: Fri Jan 14, 2005 8:36 am
by feyd
this was the problem most likely

Code: Select all

56130&nbsp; Bad Ems
when you send it to the regex engine, &nbsp; isn't a whitespace character to it, it's 6 separate characters.
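
A small sketch of that point, with a hypothetical helper `normalizeSpaces` that turns non-breaking spaces into ordinary ones before matching. It assumes the content either still contains the literal `&nbsp;` entity or holds U+00A0 encoded as UTF-8; the thread does not say which encoding the script uses:

```php
<?php
// Hypothetical helper: replace non-breaking spaces with ordinary spaces.
function normalizeSpaces(string $text): string
{
    // '&nbsp;' is the literal entity; "\xC2\xA0" is U+00A0 in UTF-8
    // (assumption: the input is UTF-8, not ISO-8859-1).
    return str_replace(['&nbsp;', "\xC2\xA0"], ' ', $text);
}

// Made-up text with 12 filler characters around the address line.
$raw = str_repeat('a', 12) . '56130&nbsp; Bad Ems' . str_repeat('b', 12);
$pattern = '~(.{10,250})56130\s+Bad\s+Ems(.{10,250})~is';

$before = preg_match($pattern, $raw);                  // entity still present
$after  = preg_match($pattern, normalizeSpaces($raw)); // entity replaced by a space

print "before=" . $before . " after=" . $after . "\n";
```

With the entity in place the pattern finds nothing, because `\s+` has to start directly after `56130` and the next character is `&`.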

Posted: Fri Jan 14, 2005 2:24 pm
by visionmaster
feyd wrote:this was the problem most likely

Code: Select all

56130&nbsp; Bad Ems
when you send it to the regex engine, &nbsp; isn't a whitespace character to it, it's 6 separate characters.

Oops, be careful, my script makes a
56130 Bad Ems<br>
from
56130&nbsp; Bad Ems

So, &nbsp; can't be the problem. I'll check my code later on and post the solution to my problem as soon as I find it. Thanks for your hints!
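
One way to check what actually sits between the postal code and the city is to dump the raw bytes rather than rely on what the browser displays. The helper `gapBytesHex` below is hypothetical and only a debugging sketch. If the script decodes `&nbsp;` into a non-breaking space character (an assumption; the thread does not confirm it), that character looks like a space on screen but `\s` may not match it, depending on locale and modifiers:

```php
<?php
// Hypothetical debug helper: hex dump of the bytes between $plz and $ort.
function gapBytesHex(string $content, string $plz, string $ort): ?string
{
    $start = strpos($content, $plz);
    if ($start === false) {
        return null; // postal code not found
    }
    $gapStart = $start + strlen($plz);
    $end = strpos($content, $ort, $gapStart);
    if ($end === false) {
        return null; // city not found after the postal code
    }
    return bin2hex(substr($content, $gapStart, $end - $gapStart));
}

// Made-up sample: an ISO-8859-1 non-breaking space followed by a space.
$sample = "Lindenbach 2\n56130\xA0 Bad Ems\nTel. 02603 / 2196";
print "Gap=" . gapBytesHex($sample, '56130', 'Bad Ems') . "\n";
```

In the dump, `20` is an ordinary space, `0d` and `0a` are carriage return and line feed, `a0` is a non-breaking space in ISO-8859-1, and `c2a0` is the same character in UTF-8.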