Few simple regular expression questions

mickd
Forum Contributor
Posts: 397
Joined: Tue Jun 21, 2005 9:05 am
Location: Australia

Post by mickd »

Hi, I'm currently trying to understand regular expressions well enough to write my own. However, I've run into a problem that I haven't been able to solve.

I have a simple script that I'm testing with at the moment, shown below:

Code: Select all

<?php

error_reporting(E_ALL);

$string = '
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title test="test">Untitled Document</title>
</head>

<body></body>
</html>
';

preg_match_all('/(<([\w]+)([^>]*)>)(.*)(?:<\/\\2>)/', $string, $matches, PREG_SET_ORDER);

print_r($matches);

?>
The result of this is:

Code: Select all

Array
(
    [0] => Array
        (
            [0] => <title test="test">Untitled Document</title>
            [1] => <title test="test">
            [2] => title
            [3] =>  test="test"
            [4] => Untitled Document
        )

    [1] => Array
        (
            [0] => <body></body>
            [1] => <body>
            [2] => body
            [3] => 
            [4] => 
        )

)
My first question: using ?: or an equivalent, how would I stop keys 0 and 1 from being captured in the two arrays? Whenever I try, the result becomes one empty array.

Secondly, what does \\2 do?

And lastly, the above regex only captures what's on a single line, for example <body></body>, but won't capture tags spread across multiple lines. How would I go about fixing this?

Thanks, any input is greatly appreciated.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Code: Select all

<?php

error_reporting(E_ALL);

$string = '
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title test="test">Untitled Document</title>
</head>

<body></body>
</html>
';

//preg_match_all('/(<([\w]+)([^>]*)>)(.*)(?:<\/\\2>)/', $string, $matches, PREG_SET_ORDER);
preg_match_all('/(?:<(\w+)[^>]*>)(.*?)(?:<\/\\1>)/s', $string, $matches, PREG_SET_ORDER);

var_dump($matches);

?>
produces

Code: Select all

array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(208) "<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title test="test">Untitled Document</title>
</head>

<body></body>
</html>"
    [1]=>
    string(4) "html"
    [2]=>
    string(158) "
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title test="test">Untitled Document</title>
</head>

<body></body>
"
  }
}
mickd wrote: My first question: using ?: or an equivalent, how would I stop keys 0 and 1 from being captured in the two arrays? Whenever I try, the result becomes one empty array.
If there's a match, element 0 is always filled with the full match, and ?: cannot suppress it. Any elements after it are captured subpattern matches. In your case, the captured subpatterns are the opening tag, the tag's element name, any extra attributes in the tag, and finally the tag's contents. Wrapping a group in (?: ... ) makes it non-capturing, which is what my example does for the opening and closing tags.
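To make the difference concrete, here is a minimal sketch, assuming a shortened one-line test string rather than the full document above. The variable names $capturing and $nonCapturing are only for illustration.

```php
<?php
// Hypothetical shortened input, used only for illustration.
$string = '<title test="test">Untitled Document</title>';

// Every group captures: opening tag, tag name, attributes, contents.
preg_match('/(<(\w+)([^>]*)>)(.*)<\/\2>/', $string, $capturing);

// Opening tag wrapped in (?: ... ) and attributes left ungrouped:
// only the tag name and the contents are captured.
// The back-reference becomes \1 because the tag name is now group 1.
preg_match('/(?:<(\w+)[^>]*>)(.*)<\/\1>/', $string, $nonCapturing);

print_r($capturing);    // 5 elements: full match plus 4 captures
print_r($nonCapturing); // 3 elements: full match plus 2 captures
```

Element 0 is the full match in both arrays, so it stays no matter how many groups are made non-capturing.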
mickd wrote: Secondly, what does \\2 do?
It's generally called a back-reference. It refers to an earlier captured subpattern; in your case, the second capture (the tag's element name). Inside a single-quoted PHP string, \\2 reaches PCRE as \2.
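A small sketch of how a back-reference behaves, assuming two made-up test strings that are not from the original script:

```php
<?php
// \1 must repeat exactly what the first group captured.
$pattern = '/<(\w+)>.*<\/\1>/';

// Closing tag repeats the captured name "b", so this matches.
$matched = preg_match($pattern, '<b>bold</b>');

// Closing tag is "i", not "b", so the back-reference fails.
$mismatched = preg_match($pattern, '<b>bold</i>');

// In a single-quoted PHP string, '\\1' and '\1' are the same characters.
$sameString = ('/\\1/' === '/\1/');

var_dump($matched, $mismatched, $sameString);
```

preg_match returns 1 for the first string and 0 for the second, and the last comparison is true, which is why \\2 and \2 are interchangeable inside single quotes.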
mickd wrote: And lastly, the above regex only captures what's on a single line, for example <body></body>, but won't capture tags spread across multiple lines. How would I go about fixing this?
By default, the dot metacharacter does not match newline characters, so .* cannot run past the end of a line. You can add a pattern modifier such as "s" (see my example), which changes the dot so that it matches any character, including carriage returns and line feeds. I also made the quantifier lazy (.*?) so the match stops at the first closing tag with the same name.
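The effect of the "s" modifier can be seen in a minimal sketch, assuming a short three-line input instead of the full document. $withoutS and $withS are illustrative names.

```php
<?php
// Hypothetical multi-line input, used only for illustration.
$string = "<head>\n<title>Test</title>\n</head>";

// Without "s" the dot stops at "\n", so <head>...</head> cannot match.
// Only the single-line <title> element is found.
preg_match_all('/<(\w+)[^>]*>(.*?)<\/\1>/', $string, $withoutS, PREG_SET_ORDER);

// With "s" the dot also matches "\n", so the outer <head> element matches.
preg_match_all('/<(\w+)[^>]*>(.*?)<\/\1>/s', $string, $withS, PREG_SET_ORDER);

echo $withoutS[0][1], "\n"; // tag name of the first match without "s"
echo $withS[0][1], "\n";    // tag name of the first match with "s"
```

With "s" the outer element swallows the inner one, so nested tags no longer show up as separate matches. That is the same thing that happens with the html element in the output above.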
mickd

Post by mickd »

Thanks feyd!