HTML head+meta regex

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

Looking for some help improving a Regex. As a bit of background, I need to select all <meta> elements from any HTML document (valid HTML/XHTML is not a condition) that are nested within a <head> element. After some playing around I came up with the following:

Code: Select all

$metaRegex = "%<meta[^>]+http-equiv=([\"]{0,1})([^\"]*)([\"]{0,1})[^>]+content=([\"]{0,1})([^\"]*)([\"]{0,1})[^>]*>%i";
The original version was a lot stricter, and I had to loosen it on the assumption that real input would turn up with varying spacing. I'll admit I'm not incredibly great at Regex ;), so I'm posting it here in case someone more experienced has extra tips. I'm not sure how to go about ensuring the match is nested inside <head> tags. Maybe match the full <head> element first, then run the above regex across the match result?
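
The idea of matching <head> first and then running the meta regex over just that match could be sketched roughly like this. It is only a sketch under assumptions: extractHeadMarkup() is a made-up helper name, the sample HTML and example.com URLs are invented for illustration, and the meta regex is the one from the post, unchanged.

```php
<?php
/**
 * Hypothetical helper: return the inner markup of the first <head> element,
 * or null if no <head>...</head> pair is found.
 */
function extractHeadMarkup($html)
{
    // "s" lets "." span newlines; ".*?" stops at the first closing </head>.
    // "\b" keeps <header> from being mistaken for <head>.
    if (preg_match('%<head\b[^>]*>(.*?)</head\s*>%is', $html, $head)) {
        return $head[1];
    }
    return null;
}

// The regex from the post, unchanged.
$metaRegex = "%<meta[^>]+http-equiv=([\"]{0,1})([^\"]*)([\"]{0,1})[^>]+content=([\"]{0,1})([^\"]*)([\"]{0,1})[^>]*>%i";

// Illustrative input: one meta inside <head>, one stray meta inside <body>.
$html = '<html><head><title>t</title>'
      . '<meta http-equiv="X-XRDS-Location" content="http://example.com/xrds">'
      . '</head><body>'
      . '<meta http-equiv="X-XRDS-Location" content="http://example.org/ignored">'
      . '</body></html>';

$matches = array();
$head = extractHeadMarkup($html);
if ($head !== null) {
    // Only the <head> markup is searched, so the meta in <body> is never seen.
    preg_match_all($metaRegex, $head, $matches, PREG_PATTERN_ORDER);
}
// With this regex, $matches[2] holds the http-equiv values and
// $matches[5] holds the content values.
```

One limitation to keep in mind: a document that never closes its <head> yields null here, so sloppy input would need a fallback (for example, searching up to the first <body> tag instead).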
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Do you have a test string with various versions of tags?

Post by Maugrim_The_Reaper »

I have some (all passing), but the regex is for user-sourced input, i.e. I can't predict the level of HTML validity or even correctness. Short of searching the internet for actual examples (the next step after this post's responses, not that I expect that many to be honest), there are few deliberate examples I could come up with except the obvious. I'm likely so used to writing and working on XHTML that I'm worthless at knowing what the possibilities are for non-standard yet browser-usable elements ;).

I think a regex should do the job without having to resort to any messy tokenising and sorting.

Full code if it makes more sense...

Code: Select all

    /**
     * Assuming this user is hosting a third party sourced identity under an
     * alias personal URL, we'll need to check if the website's HTML body
     * has an http-equiv meta element with a content attribute pointing to where
     * we can fetch the XRD document.
     *
     * @param   Zend_Http_Response $response
     * @return  boolean
     * @throws  Zend_Service_Yadis_Exception
     */
    protected function _isMetaHttpEquiv(Zend_Http_Response $response)
    {
        if (!in_array($response->getHeader('Content-Type'), $this->_validHtmlContentTypes)) {
            return false;
        }
        /**
         * Find a match for a relevant <meta> element, then iterate through the
         * results to see if a valid http-equiv value and matching content URI
         * exist.
         * Todo: need to check this is located inside the <head> element too.
         */
        $metaRegex = "%<meta[^>]+http-equiv=([\"]{0,1})([^\"]*)([\"]{0,1})[^>]+content=([\"]{0,1})([^\"]*)([\"]{0,1})[^>]*>%i";
        $matches = null;
        $location = null;
        preg_match_all($metaRegex, $response->getBody(), $matches, PREG_PATTERN_ORDER);
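        // Groups 1/3 and 4/6 only capture the optional quotes; the http-equiv
        // value is group 2 and the content value is group 5.
        for ($i = 0; $i < count($matches[2]); $i++) {
            $httpEquiv = strtolower($matches[2][$i]);
            if ($httpEquiv == "x-xrds-location" || $httpEquiv == "x-yadis-location") {
                $location = $matches[5][$i];
            }
        }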
        if (empty($location)) {
            return false;
        } elseif (!Zend_Uri::check($location)) {
            require_once 'Zend/Service/Yadis/Exception.php';
            throw new Zend_Service_Yadis_Exception('The URI parsed from the HTML document appears to be invalid: ' . htmlentities($location, ENT_QUOTES, 'utf-8'));
        }
        /**
         * Should now contain the content value of the http-equiv type pointing
         * to an XRDS resource for the user's Identity Provider, as found by
         * passing the meta regex across the response body.
         */
        $this->_metaHttpEquivUrl = $location;
        return true;
    }
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Re: HTML head+meta regex

Post by Mordred »

Maugrim_The_Reaper wrote:

Code: Select all

$metaRegex = "%<meta[^>]+http-equiv=([\"]{0,1})([^\"]*)([\"]{0,1})[^>]+content=([\"]{0,1})([^\"]*)([\"]{0,1})[^>]*>%i";
Try this:

Code: Select all

<  meta hmm = " ooh?"  http-equiv                   =          'other quotes' hey="not fair!" content='could be before "http-equiv" as well'   >
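
A string like that trips the single-pattern approach on several counts: whitespace around "=", single quotes, extra attributes, and attribute order. One way to cope, sketched here under assumptions, is to match each meta tag loosely and then parse its attributes in a second pass, so order and quoting style stop mattering. parseMetaTags() and findYadisLocation() are made-up names, and the example.com URL is invented for illustration.

```php
<?php
/**
 * Hypothetical helper: parse every <meta ...> tag in $html into an array of
 * lower-cased attribute name => value pairs, one array per tag.
 */
function parseMetaTags($html)
{
    $tags = array();
    // Step 1: grab the raw attribute text of each meta tag. Whitespace after
    // "<" is tolerated only because the test string above uses it.
    preg_match_all('%<\s*meta\b([^>]*)>%i', $html, $metas);
    // Step 2: split that text into name=value pairs. The branch-reset group
    // (?|...) puts the value in group 2 whether it was double-quoted,
    // single-quoted, or unquoted.
    $attrRegex = '%([a-z][a-z0-9:_-]*)\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))%i';
    foreach ($metas[1] as $attrText) {
        $attrs = array();
        preg_match_all($attrRegex, $attrText, $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            $attrs[strtolower($pair[1])] = $pair[2];
        }
        $tags[] = $attrs;
    }
    return $tags;
}

/**
 * Hypothetical helper: return the content value of the first meta tag whose
 * http-equiv is one of the two Yadis names, or null if there is none.
 */
function findYadisLocation($html)
{
    $wanted = array('x-xrds-location', 'x-yadis-location');
    foreach (parseMetaTags($html) as $attrs) {
        if (isset($attrs['http-equiv'], $attrs['content'])
            && in_array(strtolower(trim($attrs['http-equiv'])), $wanted, true)) {
            return trim($attrs['content']);
        }
    }
    return null;
}

// The test string from this post (spacing shortened), plus a reordered tag.
$tricky = "<  meta hmm = \" ooh?\"  http-equiv  =  'other quotes' hey=\"not fair!\" content='could be before \"http-equiv\" as well'   >";
$reordered = "<meta content='http://example.com/xrds' http-equiv = 'X-XRDS-Location' />";

$parsed = parseMetaTags($tricky);
$location = findYadisLocation($reordered);
```

This still does nothing about the <head> requirement, and a literal ">" inside an attribute value would cut the tag short, so it is a starting point rather than a full answer.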