Need to match single and double tag html elements

ViserExcizer
Forum Newbie
Posts: 24
Joined: Tue Nov 25, 2008 1:17 pm

Post by ViserExcizer »

Hi, my grasp of regex is a little tenuous since I learnt it years back. Please help me identify HTML form input and textarea tags.

For instance, I need a PHP regex that can parse this HTML block:

 
<tr>
<td>
<input type="submit" name="submit" value="submit" />
</td>
<td>
<input type="hidden" name="action" value="process_step2">
</td>
<td width="306">
<p><input type="text" name="url" value="http://" size="25"></p>
</td>
</tr>
 
and just return all 3 matches as

 
<input type="submit" name="submit" value="submit" />
<input type="hidden" name="action" value="process_step2">
<input type="text" name="url" value="http://" size="25">
 
meaning I don't mind what type= attribute the input has. As long as it's an input tag, it has to be matched, and some might end with the XHTML-style "[space]/>" instead of ">".


And please help me get another regex for extracting from this HTML block:

 
<table><tr><td><p>
<textarea name="testing">
 content
</textarea>
</p>
</td></tr></table></div>
 
that just returns the following for every match

 
<textarea name="testing">
 content
</textarea>
 
Thanks so much.
ViserExcizer
Forum Newbie
Posts: 24
Joined: Tue Nov 25, 2008 1:17 pm

Re: Need to match single and double tag html elements

Post by ViserExcizer »

I've pretty much figured it out.
To get input tags for HTML and XHTML, or any input tag variations that don't follow proper HTML standards:

 
'/\<input.+?type=".+?".+?[{\/*}|{\s?}]?>/i'
 
and for textarea

 
'/<textarea.+?[{\/*}|{\s*}]?>.*?<\/textarea>/i'
 
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Need to match single and double tag html elements

Post by ridgerunner »

Part of your expression, '[{\/*}|{\s?}]', does not do what you think it does.

Square Brackets:
Anything between square brackets is a character class, which always matches exactly one character. Your '[{\/*}|{\s?}]' character class is read by the PCRE regex engine as: "Match exactly one character that is a '{' or a '/' or a '*' or a '}' or a '|' or a '{' (redundant) or a whitespace character ('\s') or a '?' or a '}' (also redundant)". Also, many characters that are special metacharacters outside of a character class are not special inside one and don't need to be escaped there (e.g. [*+?(){}$] and others).
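
To see that reading in action, here is a small PHP sketch. The sample strings and variable names are made up for illustration; the class itself is copied from the pattern under discussion and anchored (my addition) so that the subject must be exactly one character.

```php
<?php
// The character class from the original pattern, anchored with ^ and $
// so the whole subject has to be a single character from the class.
$class = '/^[{\/*}|{\s?}]$/';

// Made-up samples: single characters plus one two-character string.
$samples = ['{', '/', '*', '}', '|', ' ', '?', 'a', '/>'];

$results = [];
foreach ($samples as $s) {
    $results[$s] = preg_match($class, $s) === 1;
}

foreach ($results as $s => $ok) {
    printf("%-5s %s\n", var_export($s, true), $ok ? 'matches' : 'no match');
}
```

Every listed punctuation character and the space match on their own, while 'a' and the two-character '/>' do not, which is why the class cannot stand in for "an optional slash followed by optional whitespace".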

Curly Brackets:
Curly brackets form a quantifier that specifies how many times the preceding token may be repeated. For example: '/X{3,5}/' matches XXX, XXXX and XXXXX but does not match XX.
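
A quick PHP sketch of the quantifier, for illustration only. I have added ^ and $ anchors (an assumption, not part of the example above) so that runs longer than five X's are rejected too; without anchors the pattern would simply find XXXXX inside a longer run.

```php
<?php
// Anchored so the whole subject must consist of 3 to 5 X's.
$pattern = '/^X{3,5}$/';

$counts = [];
foreach (['XX', 'XXX', 'XXXX', 'XXXXX', 'XXXXXX'] as $subject) {
    // preg_match returns 1 on a match and 0 on no match.
    $counts[$subject] = preg_match($pattern, $subject);
}

// Inside a character class the braces are not a quantifier: they are
// literal characters, so this class matches a lone '{'.
$braceIsLiteral = preg_match('/^[{3,5}]$/', '{');

print_r($counts);
var_dump($braceIsLiteral);
```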

Here are some examples of invalid tags that your regexes happily match:

<input type="text" {>
<input type="text" *>
<input type="text" }>
<input type="text" ?>
<input type="text" invalid arbitrary non-attribute stuff here>
<textarea rows="5" cols="10" />data</textarea>
<textarea rows="5" cols="10" {>data</textarea>
<textarea rows="5" cols="10" }>data</textarea>
<textarea rows="5" cols="10" |>data</textarea>
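
The list above can be checked with a short PHP script. This is only a sketch: the two patterns are copied unchanged from the earlier post, a few of the invalid tags are reused as test subjects, and the variable names are my own.

```php
<?php
// The two patterns from the earlier post, copied unchanged.
$inputPattern    = '/\<input.+?type=".+?".+?[{\/*}|{\s?}]?>/i';
$textareaPattern = '/<textarea.+?[{\/*}|{\s*}]?>.*?<\/textarea>/i';

// A few of the invalid tags listed above.
$badInputs = [
    '<input type="text" {>',
    '<input type="text" *>',
    '<input type="text" invalid arbitrary non-attribute stuff here>',
];
$badTextareas = [
    '<textarea rows="5" cols="10" />data</textarea>',
    '<textarea rows="5" cols="10" |>data</textarea>',
];

// Record, per tag, whether the corresponding pattern accepts it.
$matched = [];
foreach ($badInputs as $tag) {
    $matched[$tag] = preg_match($inputPattern, $tag) === 1;
}
foreach ($badTextareas as $tag) {
    $matched[$tag] = preg_match($textareaPattern, $tag) === 1;
}

foreach ($matched as $tag => $ok) {
    echo $ok ? 'matched:     ' : 'not matched: ', $tag, "\n";
}
```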

That said, here are two regexes which work more along the lines of what you are trying to do:

Match a self-closing XHTML tag:
'%<(\w+)(\s+\w+\s*=\s*("[^"]*"|\'[^\']*\'))*\s*/>%'
 
Match a normal non-self-closing XHTML tag (the 's' modifier lets '.' match newlines):
'%<(\w+)(\s+\w+\s*=\s*("[^"]*"|\'[^\']*\'))*\s*>.*?</\1>%s'

For a quick refresher course, check out: http://www.regular-expressions.info/
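
For the two HTML blocks in the original question, the general patterns above could be narrowed to input and textarea tags roughly as follows. This is a minimal sketch, not a robust HTML parser: it assumes every attribute value is quoted and every attribute name consists only of word characters. The variable names, the optional '/?' before '>', and the 'i' and 's' modifiers are my own adaptation rather than something stated in the thread.

```php
<?php
// The two HTML blocks from the question, joined into one subject string.
$html = <<<'HTML'
<tr>
<td>
<input type="submit" name="submit" value="submit" />
</td>
<td>
<input type="hidden" name="action" value="process_step2">
</td>
<td width="306">
<p><input type="text" name="url" value="http://" size="25"></p>
</td>
</tr>
<table><tr><td><p>
<textarea name="testing">
 content
</textarea>
</p>
</td></tr></table></div>
HTML;

// Input tags: any number of quoted attributes, then ">" or "/>".
// The quote inside [^\'] is escaped because the pattern sits in a
// single-quoted PHP string.
$inputRe = '%<input(\s+\w+\s*=\s*("[^"]*"|\'[^\']*\'))*\s*/?>%i';

// Textarea elements: the "s" modifier lets "." match newlines, so the
// lazy ".*?" can span the multi-line content up to the closing tag.
$textareaRe = '%<textarea(\s+\w+\s*=\s*("[^"]*"|\'[^\']*\'))*\s*>.*?</textarea>%is';

preg_match_all($inputRe, $html, $inputs);
preg_match_all($textareaRe, $html, $textareas);

// Index 0 of each result holds the full matches.
print_r($inputs[0]);
print_r($textareas[0]);
```

Attribute names containing hyphens, unquoted attribute values, or a nested textarea would need a looser pattern or a real HTML parser.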