
Few simple regular expression questions

Posted: Sun Oct 15, 2006 9:36 am
by mickd
Hi, I'm currently trying to understand regular expressions and write my own. However, I've run into a problem that I haven't been able to solve.

I have a simple script that I'm testing with at the moment, shown below:

Code:

<?php

error_reporting(E_ALL);

$string = '
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title test="test">Untitled Document</title>
</head>

<body></body>
</html>
';

preg_match_all('/(<([\w]+)([^>]*)>)(.*)(?:<\/\\2>)/', $string, $matches, PREG_SET_ORDER);

print_r($matches);

?>
The result of this is:

Code:

Array
(
    [0] => Array
        (
            [0] => <title test="test">Untitled Document</title>
            [1] => <title test="test">
            [2] => title
            [3] =>  test="test"
            [4] => Untitled Document
        )

    [1] => Array
        (
            [0] => <body></body>
            [1] => <body>
            [2] => body
            [3] => 
            [4] => 
        )

)
My first question: using ?: or an equivalent, how would I stop keys 0 and 1 from being captured in the two arrays? Whenever I try, the result becomes one empty array.

Secondly, what does \\2 do?

And lastly, the above regex only captures what's on a single line, for example <body></body>, but won't capture tags spread across multiple lines. How would I go about fixing this?

Thanks, any input is greatly appreciated.

Posted: Sun Oct 15, 2006 10:53 am
by feyd

Code:

<?php

error_reporting(E_ALL);

$string = '
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title test="test">Untitled Document</title>
</head>

<body></body>
</html>
';

//preg_match_all('/(<([\w]+)([^>]*)>)(.*)(?:<\/\\2>)/', $string, $matches, PREG_SET_ORDER);
preg_match_all('/(?:<(\w+)[^>]*>)(.*?)(?:<\/\\1>)/s', $string, $matches, PREG_SET_ORDER);

var_dump($matches);

?>
produces

Code:

array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(208) "<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title test="test">Untitled Document</title>
</head>

<body></body>
</html>"
    [1]=>
    string(4) "html"
    [2]=>
    string(158) "
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title test="test">Untitled Document</title>
</head>

<body></body>
"
  }
}
mickd wrote: My first question: using ?: or an equivalent, how would I stop keys 0 and 1 from being captured in the two arrays? Whenever I try, the result becomes one empty array.
If there's a match, the zero element will always be filled with the text matched by the whole pattern. Any elements after it are captured subpattern matches. In your pattern, the captured subpatterns are the opening tag, the tag's element name, any extra attributes in the tag, and finally the tag's contents.
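As a minimal sketch of that point (the one-line $html sample and the array_slice() step are assumptions added for illustration, not part of the original script): wrapping the opening and closing tags in (?:...) removes them from the captures, while element zero still holds the full match. If it is unwanted, it can be dropped after matching.

```php
<?php
// Hypothetical sample input, shortened from the document in the thread.
$html = '<title test="test">Untitled Document</title>';

// The opening and closing tags sit in non-capturing (?:...) groups,
// so only the element name and the tag contents are captured.
preg_match('/(?:<(\w+)[^>]*>)(.*?)(?:<\/\1>)/', $html, $m);

// $m[0] is always the full match; slice it off if only the captures matter.
$captures = array_slice($m, 1);

print_r($m);
print_r($captures);
```

Here $m has three elements (the full match, "title" and "Untitled Document"), and $captures keeps only the last two.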
mickd wrote: Secondly, what does \\2 do?
It's generally called a back-reference. It refers to a previously captured subpattern; in your pattern, that is the second capture (the tag's element name), so the closing tag must use the same name as the opening tag.
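A small sketch of a back-reference at work, using a simplified pattern and two made-up input strings (both are assumptions for illustration). Note that inside a single-quoted PHP string, \\2 and \2 both reach PCRE as \2, which is why either spelling works in the scripts above.

```php
<?php
// \1 must repeat whatever the first group (\w+) matched,
// so the closing tag has to carry the same element name.
$pattern = '/<(\w+)>.*?<\/\1>/';

$matched    = preg_match($pattern, '<b>bold</b>'); // names agree
$mismatched = preg_match($pattern, '<b>bold</i>'); // names differ

var_dump($matched, $mismatched);
```

The first call returns 1 and the second returns 0, because no closing tag with the captured name "b" exists in the second string.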
mickd wrote: And lastly, the above regex only captures what's on a single line, for example <body></body>, but won't capture tags spread across multiple lines. How would I go about fixing this?
By default, the dot metacharacter does not match newline characters, so .* cannot run past the end of a line. You can add a pattern modifier such as "s" (see my example), which tells PCRE to let the dot match every character, newlines included.
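The effect of the "s" modifier can be seen in a short sketch; the $text sample below is made up for illustration. The same pattern is run twice, once without and once with the modifier.

```php
<?php
// Hypothetical input whose tag contents span several lines.
$text = "<body>\nline one\nline two\n</body>";

// Without "s" the dot cannot cross a newline, so no match is found.
$withoutS = preg_match('/<body>(.*?)<\/body>/', $text, $a);

// With "s" the dot also matches newlines, so the contents are captured.
$withS = preg_match('/<body>(.*?)<\/body>/s', $text, $b);

var_dump($withoutS, $withS, $b[1]);
```

Only the second call matches, and $b[1] then contains both lines together with the surrounding newlines.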

Posted: Sun Oct 15, 2006 4:57 pm
by mickd
Thanks feyd!