Help validating submitted URLs

Posted: Thu May 21, 2009 10:29 pm
by bschaeffer
I am creating a submissions app for a site called Tumblr. The basic premise is that a user submits the URL of a Tumblr blog, and then someone can review the submissions and either post them or delete them.

The part I am stuck on is validating the URL. I have already gotten past validating that the URL is an actual URL, but I would like to check that the submitted URL actually points to a blog on Tumblr.

A typical Tumblr URL looks like this: http://sometext.tumblr.com/. I have already written some code that checks URLs like this one and returns true or false depending on whether the blog exists, so I'm good there.

But some people choose to have a domain name point to their Tumblr blog. These can look like almost anything: http://blog.name.com/ or http://blogname.info/ and so on.

If a URL like this is submitted, the function always returns true, whether the URL points to a Tumblr blog or not.

Does anybody know how I can validate these URLs?

One thing that might be helpful is that a blog hosted on Tumblr's server would return JSON output, and I was thinking I might somehow check whether the output of the URL call is JSON or not.

Another thing is that I am getting 404 Not Found output on URLs that aren't official Tumblr blogs. Is there a way to capture/check for a 404 error and do something with that?

Here's my basic code with an idea of what I am trying to do:

Code:

$url = 'http://url.com';
 
$c = curl_init($url);
curl_setopt($c,CURLOPT_HEADER,1);
curl_setopt($c,CURLOPT_RETURNTRANSFER,1);
$output = curl_exec($c);
 
// Tumblr returns the following message if the URL points to Tumblr's server
// but is not a registered Tumblr blog.
$check = preg_match("/We couldn't find the page you were looking for\./", $output);
if ($check) {
        $this->error = 'Sorry...';
        return false;
}
// Here I'd like to check $output for either a 404 Not Found
// or JSON output (if my thinking is correct). Any ideas?
Thanks in advance for taking a look at this problem.

Re: Help validating submitted URLs

Posted: Thu May 21, 2009 11:43 pm
by atonalpanic
Do blogs at a URL like http://example.com that point to a Tumblr blog actually host the software themselves, or does Tumblr still host it?
If it always goes back to Tumblr, just check where the request ends up. Or search the source code for something like
"<!-- BEGIN TUMBLR CODE -->", which I saw in the source for a blog.
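A rough sketch of both ideas, in case it helps. The function names are made up for illustration, and two things are assumptions rather than known facts: that a custom domain actually redirects to a tumblr.com host, and that the marker comment appears in every blog's source.

```php
<?php
// Hypothetical helper: true if the host of $url is tumblr.com or a subdomain of it.
function hostIsTumblr(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host component, or a malformed URL
    }
    $host = strtolower($host);
    // '.tumblr.com' is 11 characters long
    return $host === 'tumblr.com' || substr($host, -11) === '.tumblr.com';
}

// Hypothetical helper: true if the page source contains the marker comment.
// Assumes the comment is present on every Tumblr-hosted page, which is unverified.
function hasTumblrMarker(string $html): bool
{
    return strpos($html, '<!-- BEGIN TUMBLR CODE -->') !== false;
}

// Fetches $url, following redirects. Returns [final URL, body], or null on failure.
function fetchFollowingRedirects(string $url): ?array
{
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true); // follow Location: redirects
    curl_setopt($c, CURLOPT_MAXREDIRS, 5);
    curl_setopt($c, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($c);
    $finalUrl = curl_getinfo($c, CURLINFO_EFFECTIVE_URL); // where the request ended up
    curl_close($c);
    return $body === false ? null : [$finalUrl, $body];
}
```

You would call fetchFollowingRedirects() on the submitted URL, then accept it if either hostIsTumblr() on the final URL or hasTumblrMarker() on the body returns true.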

Re: Help validating submitted URLs

Posted: Sun May 31, 2009 12:34 pm
by bschaeffer
When you request blog information using Tumblr's API, it doesn't return the actual HTML of the page, just the blog data as JSON or XML output, so a marker comment like that wouldn't show up.

I got it to work, but first I want to say it was because you suggested looking for valid output that would indicate it came from Tumblr's server. Here's the code:

Code:

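// Append the API path; rtrim() copes with a submitted URL with or without a trailing slash.
$url = rtrim($this->submit['url'], '/') . '/api/read/json';

$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, true);          // include the response headers in $output
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);  // return the response instead of printing it
$output = curl_exec($c);
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);
curl_close($c);

// Both checks are literal substrings, so strpos() is enough and nothing needs regex escaping.
$check_one = ($output !== false) && strpos($output, "We couldn't find the page you were looking for.") !== false;
$check_two = ($output !== false) && strpos($output, 'proxy.tumblr.com') !== false;
// If the status is not 200
// or $output contains "We couldn't find..."
// or $output does not contain "proxy.tumblr.com"
// then the URL is not a valid Tumblr blog.
if ($status != 200 || $check_one || !$check_two) {
        return false;
}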
The easy URLs to check were the ones formatted like this: http://someuser.tumblr.com/. These always pointed to the Tumblr server, so a false URL would always return a "We couldn't find the page..." message.

The real problem was with URLs that were formatted like any other URL, e.g. http://www.somesite.com/. I don't know why, but just checking for a status of 200 wasn't enough. Those URLs would pass that check whether or not the user had set up their own domain to point to Tumblr. It was returning true for URLs like apple.com or yahoo.com.

When Tumblr returned JSON output for valid URLs, it included some server data, so I just checked for the existence of that (i.e. proxy.tumblr.com).
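For anyone who wants to test the decision without making a network call, here is a minimal sketch that pulls the three checks out into a function. The name looksLikeTumblrBlog() is made up for illustration, and the "proxy.tumblr.com" marker is only what I observed in the output, not something I know Tumblr guarantees.

```php
<?php
// Hypothetical helper: applies the same three checks to an HTTP status code
// and the raw response text (headers plus body) from the cURL call.
function looksLikeTumblrBlog(int $status, string $output): bool
{
    // 1. Anything other than 200 is rejected outright.
    if ($status !== 200) {
        return false;
    }
    // 2. Tumblr's "not found" message means the blog is not registered.
    if (strpos($output, "We couldn't find the page you were looking for.") !== false) {
        return false;
    }
    // 3. Assumed marker: valid responses were observed to mention proxy.tumblr.com.
    return strpos($output, 'proxy.tumblr.com') !== false;
}
```

With $status and $output from the cURL call, the final check becomes if (!looksLikeTumblrBlog($status, (string) $output)) { return false; }. A 200 response from an unrelated site fails the third check, which is the case that a status test alone let through.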

That worked just fine, for now. It's still a beta type deal, so I'm sure I'll run into some problems in the future.

Thanks for the reply, sorry it took me so long to let you know I read it.