Not bad, but the regex does have a few errors. The following is a detailed analysis...
- Firstly: When posting, it's best to present the regex in the source code context where it is being used. I assume you are using PHP's preg_*() functions, yes? And you are probably using the "i" ignore-case modifier, yes?
- Secondly: With a longish regex like this one, it is much better to write it with comments in free-spacing mode so you can see the nesting of the parentheses.
I've taken the liberty of applying both suggestions. The following listing is a verbose, commented version of your regex:
[text]if (preg_match(
'%# original regex from "Regex expression for complex dates" thread
( # group $1: has 2 alternatives 1-words, 2-digits only
( \b # group $2: Alternative 1: (word month day year)
( # group $3: word version of month
january|jan|february|feb|march|mar|april|apr|may|june|jun|
july|jul|august|aug|september|sep|october|oct|november|nov|december|dec
) # end group $3
\b\s # (redundant) word boundary, one whitespace
( 0? # group $4: for day of month alternatives. Either...
( # group $5: alternatives of 1-9 suffixes
1(st)? # either 1 or 01 and optional group $6 = "st"
| 2(nd)? # or... 2 or 02 and optional group $7 = "nd"
| 3(rd)? # or... 3 or 03 and optional group $8 = "rd"
| [4-9](th)? # or... 4-9 or 04-09 and optional group $9 = "th"
) # end group $5
| [12] # or... alternatives 11-19, 21-29
( # group $10: alternatives of 11-19, 21-29 suffixes
1(st)? # either 11 or 21 optional group $11 = "st" 11st=Error
| 2(nd)? # or... 12 or 22 optional group $12 = "nd" 12nd=Error
| 3(rd)? # or... 13 or 23 optional group $13 = "rd" 13rd=Error
| [4-9](th)? # or... 14-19 or 24-29 and optional group $14 = "th"
) # end group $10 (error: missed 10, 10th, 20 and 20th)
| 3 # or... alternatives 30-31
( # group $15: for 30-31 alternatives
0(th)? # either 30 and optional group $16 = "th"
| 1(st)? # or... 31 and optional group $17 = "st"
) # end group $15
) # end group $4
\b
( ,?\s # group $18: for (optional) year
(19|20)\d\d\b # group $19: for 19 or 20 century alternatives
)? # end group $18
) # end (unnecessary) group $2
| # or... Alternative 2: mm/dd/yyyy
( \b # group $20: (unnecessary)
( # group $21: (optional = Error!) whole mm/dd/yyyy
( 0?[1-9] | 1[012]) # group $22: mm month alternatives 1-12, 01-12
[- /.] # separator
( 0?[1-9] # group $23: dd day alternatives either 1-9
| [12][0-9] # or... 10-19 or 20-29
| 3[01] # or... 30 or 31
) # end group $23: day
( [- /.] # group $24: unnecessary group for separator and year
(19|20)?\d\d # group $25: for century alternatives
) # end (unnecessary) group $24
)? # end (unnecessary) group $21 and make it optional? Error!
\b # ending word boundary (can be factored out)
) # end (unnecessary) group $20
) # end group $1
%ix',
$str)) {
# Successful match
} else {
# Match attempt failed
}
[/text]
The above regex does have some problems and room for improvement as follows:
- Error: Fails to match 10, 10th, 20 and 20th day of month (with word month syntax variation)
- Error: The suffixes for 11 12 and 13 are erroneously matched as: 11st, 12nd and 13rd (should be 11th, 12th and 13th).
- Error: Capture group $21 is optional (but shouldn't be), so the overall regex erroneously matches (an empty string) at every word boundary position!
- Error: Regex matches mm-dd-yyyy text having non-matching separator char (e.g. 07/06-2010 or 07.06/10)
- Too many capturing parentheses. (25 of them - Eee-gads!)
- Unnecessary parentheses.
- \b word boundary conditions are common to all options. These can be factored out for efficiency.
As you can see, when you write out a complex regex in long commented format, you can more easily spot deficiencies and errors.
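To make the errors listed above concrete, here is a small sketch that runs a compacted copy of the original regex against a few inputs. It assumes the compacted pattern is the commented listing with only the comments and whitespace stripped; the helper name first_match and the sample strings are mine, not from the thread.

```php
<?php
// Compacted copy of the original (buggy) regex from the commented listing.
// Assumption: only comments and free-spacing whitespace were removed.
$months = 'january|jan|february|feb|march|mar|april|apr|may|june|jun|'
        . 'july|jul|august|aug|september|sep|october|oct|november|nov|december|dec';
$original = '%((\b(' . $months . ')\b\s'
          . '(0?(1(st)?|2(nd)?|3(rd)?|[4-9](th)?)'
          . '|[12](1(st)?|2(nd)?|3(rd)?|[4-9](th)?)'
          . '|3(0(th)?|1(st)?))\b'
          . '(,?\s(19|20)\d\d\b)?)'
          . '|(\b((0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])'
          . '([- /.](19|20)?\d\d))?\b))%i';

// Hypothetical helper: text of the first match, or null if nothing matches.
function first_match(string $re, string $str): ?string {
    return preg_match($re, $str, $m) === 1 ? $m[0] : null;
}

// Optional group $21: an empty match at the first word boundary.
var_dump(first_match($original, 'hello'));            // string(0) ""
// The 10th is missed, so only the empty match is found.
var_dump(first_match($original, 'July 10th, 2010'));  // string(0) ""
// Wrong suffix accepted.
var_dump(first_match($original, 'July 11st'));        // string(9) "July 11st"
// Mixed separators accepted.
var_dump(first_match($original, '07/06-2010'));       // string(10) "07/06-2010"
```

The empty-string results are the nasty ones: preg_match() still returns 1, so the "Successful match" branch runs even though no date was found.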
The following listing is an improved version which corrects the above-mentioned errors and is written more efficiently:
[text]if (preg_match(
'%# Fixed regex from "Regex expression for complex dates" thread
\b # date always begins on a word boundary
( # group $1: capture Date two alternatives
(?: # Date alt 1: (Jul 6th, 2010) month specified in words
january|jan|february|feb|march|mar|april|apr|may|june|jun|
july|jul|august|aug|september|sep|october|oct|november|nov|december|dec
) # end non-capture group
\s+ # one or more whitespace required between month and day
(?: # non-capture group for day of month alternatives
(?!32|33) # ensure this is not the (erroneous) 32nd or 33rd day of month
[023]? # Day alt 1: 1st, 2nd, 3rd, 21st, 22nd, 23rd, 31st
(?: # non-capture group for alternatives: 1st, 2nd and 3rd variations
1(?:st)? # either 1st, 01st, 21st, 31st
| 2(?:nd)? # or... 2nd, 02nd, 22nd
| 3(?:rd)? # or... 3rd, 03rd, 23rd
) # end non-capture group
| # or... Day alt 2: all the Nth variations
(?: # non-capture group for 4th-20th, 24th-30th variations
[012]?[4-9] # either 4th-9th, 14th-19th, 24th-29th
| [123]0 # or... 10th, 20th, 30th
| 1[123] # or... 11th, 12th, 13th (odd ball special case)
) (?:th)? # end non-capture group. match optional "th" suffix
) # end day of month alternatives group
(?: # non-capture group for (optional) year
,?\s+ # optional comma and one or more whitespace
(?:19|20)\d\d # non-capture group for 19 or 20 century alternatives
)? # end optional non-capture group
| # or... Date alt 2: digits with separators mm/dd/yyyy
(?: 0?[1-9] | 1[012]) # non-capture group for mm month alternatives 1-12, 01-12
([- /.]) # group $2: capture month-day-year separator char
(?: # non-capture group for dd day alternatives 1-31, 01-31
0?[1-9] # either 1-9 or 01-09
| [12][0-9] # or... 10-19 or 20-29
| 3[01] # or... 30 or 31
) # end non-capture group
\2 # match the same month-day-year separator char captured in group $2
(?:19|20)? # non-capture group for optional century 19 or 20 alternatives
\d\d # year
) # end group $1
\b # date always ends on a word boundary
%ix',
$str)) {
# Successful match
} else {
# Match attempt failed
}
[/text]
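Note that this regex is smaller, easier to read and more accurate. It should also be faster, since it has only 2 capture groups instead of 25 and the \b word boundaries are factored out.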
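If you want to check the fixed regex yourself, a quick sketch like the following should do. The pattern is the same as in the listing above with most comments trimmed; the sample strings and the $results array are just my own test scaffolding.

```php
<?php
// Same pattern as the fixed listing, with the comments trimmed for brevity.
$fixed = '%
    \b
    (                               # group $1: the whole date
      (?: january|jan|february|feb|march|mar|april|apr|may|june|jun|
          july|jul|august|aug|september|sep|october|oct|november|nov|december|dec
      )
      \s+
      (?:
        (?!32|33) [023]? (?: 1(?:st)? | 2(?:nd)? | 3(?:rd)? )
      | (?: [012]?[4-9] | [123]0 | 1[123] ) (?:th)?
      )
      (?: ,?\s+ (?:19|20)\d\d )?
    |
      (?: 0?[1-9] | 1[012] )
      ([- /.])                      # group $2: separator
      (?: 0?[1-9] | [12][0-9] | 3[01] )
      \2
      (?:19|20)? \d\d
    )
    \b
    %ix';

// Sample inputs (my own, chosen to hit each of the fixed errors).
$samples = [
    'July 6th, 2010', 'Jan 10', 'March 13th', 'March 13rd',
    '07/06/2010', '07/06-2010', '7.6.10',
];

// Map each sample to the matched text, or null when there is no match.
$results = [];
foreach ($samples as $s) {
    $results[$s] = preg_match($fixed, $s, $m) === 1 ? $m[0] : null;
}
var_export($results);
```

With these inputs, 'March 13rd' and '07/06-2010' should come back as null and the others should match in full.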
Note also that neither of the above regexes matches all common date formats. For example, they:
- Do not match "6th January" syntax (day before month)
- Do not match 2010-07-06 syntax (yyyy-mm-dd)
- Do not match the 2-digit year in "July 6th, 10" (only "July 6th" is matched)
Also, both of the above regexes match 01st, 02nd and 03rd. You may want to disallow this funky syntax in an improved version.
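As a starting point for the yyyy-mm-dd case, here is a rough sketch of a separate pattern built from the same pieces (19xx/20xx years, a captured separator and a backreference). It is only an illustration under those assumptions, not something from the original thread, and like the regexes above it does not check days per month (e.g. it accepts 2010-02-31).

```php
<?php
// Sketch only: a separate yyyy-mm-dd pattern reusing the ideas above.
$ymd = '%
    \b
    (?:19|20)\d\d                       # 4-digit year, 19xx or 20xx
    ([- /.])                            # group $1: capture the separator
    (?: 0?[1-9] | 1[012] )              # month 1-12 or 01-12
    \1                                  # same separator again
    (?: 0?[1-9] | [12][0-9] | 3[01] )   # day 1-31 or 01-31
    \b
    %x';

// Sample inputs (my own): one valid, one mixed separator, one bad month.
$results = [];
foreach (['2010-07-06', '2010-07/06', '2010-13-06'] as $s) {
    $results[$s] = preg_match($ymd, $s, $m) === 1 ? $m[0] : null;
}
var_export($results);
```

You could add this as a third alternative inside group $1 of the fixed regex, but then the separator group numbers shift, so the backreferences need to be renumbered accordingly.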
Hope this helps!
