Comparing web pages

Post by onion2k

I have an idea for a side project that takes a URL and detects whether or not there's a mobile-specific version of it. Essentially the idea is to grab a page using cURL with the User-Agent string set to a mobile browser (e.g. iPhone), see if there's any redirection to mobile.domain.com or m.domain.com, test the resulting page for things like a "media=mobile" stylesheet, and so on. Easy stuff.
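A rough sketch of that first step, just to make the idea concrete. The function names (`fetchAs`, `redirectedToMobileHost`, `hasMobileStylesheet`) are made up for illustration, the timeout and redirect limits are arbitrary, and also checking for `media="handheld"` is my own assumption on top of the "media=mobile" test:

```php
<?php
// Fetch a URL while presenting a caller-supplied User-Agent string.
// Returns the final URL after any redirects, plus the response body.
function fetchAs(string $url, string $userAgent): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,    // arbitrary limit (assumption)
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_TIMEOUT        => 15,   // arbitrary timeout in seconds (assumption)
    ]);
    $body     = curl_exec($ch);
    $finalUrl = (string) curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

    return ['url' => $finalUrl, 'body' => $body === false ? '' : $body];
}

// True if the request ended up on a different host starting with "m." or "mobile.".
function redirectedToMobileHost(string $requestedUrl, string $finalUrl): bool
{
    $from = strtolower((string) parse_url($requestedUrl, PHP_URL_HOST));
    $to   = strtolower((string) parse_url($finalUrl, PHP_URL_HOST));
    if ($to === '' || $to === $from) {
        return false;
    }
    return preg_match('/^(m|mobile)\./', $to) === 1;
}

// True if any <link> tag has a media attribute mentioning "mobile"
// (or "handheld", which is an extra check assumed here).
function hasMobileStylesheet(string $html): bool
{
    $pattern = '/<link\b[^>]*\bmedia\s*=\s*["\']?[^"\'>]*\b(mobile|handheld)\b/i';
    return preg_match($pattern, $html) === 1;
}
```

Usage would be something like calling `fetchAs($url, $iphoneUserAgent)` and passing the returned `url` and `body` to the two checks. The regex test is crude and would miss stylesheets pulled in via `@import` or media queries inside the CSS itself.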

However...

I also want to compare the version of the page delivered to the mobile browser to a version of the page grabbed using a desktop browser User-Agent like Firefox. Easy, right? Not so. Comparing the pages is easy enough, but a problem arises if there's random content on the page. For example, if the website displays the time with seconds, then the two versions of the page could differ even though nothing meaningful has changed. I've solved that problem by stripping out all the content and only leaving the HTML tags, then comparing what is essentially the DOM tree between the two pages. This is good, but it's not quite enough.
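To show what I mean by stripping the content, here is a minimal sketch. It is a regex-based approximation, not a real HTML parser, and the function name `tagSkeleton` is just for illustration. It throws away text, attributes, comments and script/style bodies, and keeps an ordered list of tag names:

```php
<?php
// Reduce an HTML string to an ordered list of tag names, e.g.
// ['p', 'b', '/b', '/p']. Closing tags keep their leading slash.
// Text, attributes, comments and script/style contents are discarded.
function tagSkeleton(string $html): array
{
    // Remove comments so markup inside them is not counted.
    $html = preg_replace('/<!--.*?-->/s', '', $html) ?? $html;
    // Empty out script and style bodies, which may contain stray "<" characters.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '<$1></$1>', $html) ?? $html;

    // Capture the optional slash and the tag name of every remaining tag.
    preg_match_all('#<(/?)([a-z][a-z0-9]*)\b[^>]*>#i', $html, $matches, PREG_SET_ORDER);

    $tags = [];
    foreach ($matches as $match) {
        $tags[] = $match[1] . strtolower($match[2]);
    }
    return $tags;
}
```

With this, two copies of a page that differ only in a timestamp produce identical lists, so a plain `===` comparison of the two arrays is enough for that case.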

The problem I have is that the pages can contain random HTML. E.g. the mobile page might have "Here's an article about <span>cats</span>!" while the desktop version gets "Here's an article about <div>dogs</div>!". The DOM trees will be different, but there's no real difference between the pages; it's just editor content. I'm interested in the structural HTML and whether it has been tailored to mobile browsers.

I have an idea for a solution: grab the desktop version three times, compare the HTML between them, and keep only what appears in all three versions. Doing the same thing for the mobile version should leave only the structural stuff (unless the same random content comes up all three times, but that's not a big problem). Then I can compare the two reduced versions to see if they're the same.

The problem is, though... how do I take three similar strings and return only the stuff that's in all of them?
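One direction I can imagine, sketched under assumptions rather than as a settled answer: treat each page as a list of tag names and fold a longest-common-subsequence (LCS) over the three lists. The names `lcs` and `commonToAll` are made up for illustration. Folding pairwise always gives a sequence present in all inputs, but it isn't guaranteed to be the longest such sequence, and the table costs O(n*m) memory, so large pages might need something smarter.

```php
<?php
// Longest common subsequence of two lists, via the standard dynamic-programming table.
function lcs(array $a, array $b): array
{
    $a = array_values($a);
    $b = array_values($b);
    $n = count($a);
    $m = count($b);

    // $len[$i][$j] = LCS length of $a[$i..] and $b[$j..].
    $len = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $len[$i][$j] = $a[$i] === $b[$j]
                ? $len[$i + 1][$j + 1] + 1
                : max($len[$i + 1][$j], $len[$i][$j + 1]);
        }
    }

    // Walk the table from the start to rebuild one LCS.
    $out = [];
    $i = 0;
    $j = 0;
    while ($i < $n && $j < $m) {
        if ($a[$i] === $b[$j]) {
            $out[] = $a[$i];
            $i++;
            $j++;
        } elseif ($len[$i + 1][$j] >= $len[$i][$j + 1]) {
            $i++;
        } else {
            $j++;
        }
    }
    return $out;
}

// Keep only what survives an LCS against every sequence in turn.
function commonToAll(array ...$sequences): array
{
    $common = array_shift($sequences) ?? [];
    foreach ($sequences as $sequence) {
        $common = lcs($common, $sequence);
    }
    return $common;
}

// Three fetches of the "same" page with different editor content in the middle.
$first  = ['div', 'p', 'span', '/span', '/p', '/div'];
$second = ['div', 'p', 'div', '/div', '/p', '/div'];
$third  = ['div', 'p', '/p', '/div'];

print_r(commonToAll($first, $second, $third)); // div, p, /p, /div
```

Doing this once for the three desktop fetches and once for the three mobile fetches would give two reduced lists that can then be compared directly.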