Identifying Similar Contents

ajay600
Forum Newbie
Posts: 14
Joined: Tue Jan 19, 2010 8:54 am

Post by ajay600 »

Suppose I have two web pages, and one image appears in both of them.

Please post code that will help me find the image that is repeated in both pages, and a way to block that image.

My idea is to read the contents of each img tag (the file name given within quotes) on the first page and compare them with the img tags on the second page to find what is repeated.
I tried the code below, but it says the nesting level is too deep at this line: if(in_array($element_a,$images_b)) ...
Please help me find the images that appear on both pages and block them from being displayed.


$html_a = file_get_html('http://www.website.com/page_a.html');
$html_b = file_get_html('http://www.website.com/page_b.html');
// Find all images
$images_a = $html_a->find('img');
$images_b = $html_b->find('img');
foreach ($images_a as $element_a) {
    if (in_array($element_a, $images_b)) {
        echo $element_a->src . ' is on both sites<br>';
    }
}
AbraCadaver
DevNet Master
Posts: 2572
Joined: Mon Feb 24, 2003 10:12 am
Location: The Republic of Texas

Re: Identifying Similar Contents

Post by AbraCadaver »

file_get_html() and find() are not core PHP functions or methods, so who knows what they do or what they return? Maybe a var_dump() of $html_a and $images_a would help.
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: Identifying Similar Contents

Post by McInfo »

One solution:

Create two instances of DOMDocument, one for each file. Call DOMDocument::loadHTMLFile() to import an HTML file.

$doc = new DOMDocument(); // An instance of DOMDocument
$doc->loadHTMLFile('./source.html');
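
Real-world pages often contain markup that makes the parser emit warnings. The following is a minimal sketch, not part of the original answer, of loading both documents while keeping those warnings quiet. The helper name loadHtmlQuietly() and the inline markup are assumptions for illustration; it uses DOMDocument::loadHTML() on strings so that it runs without any files.

```php
<?php
// Hypothetical helper: parse an HTML string into a DOMDocument without
// letting libxml warnings about sloppy markup reach the output.
function loadHtmlQuietly(string $html): DOMDocument
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings internally
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();                        // discard the collected warnings
    libxml_use_internal_errors($previous);        // restore the earlier setting
    return $doc;
}

// Placeholder markup standing in for the two real pages.
$docA = loadHtmlQuietly('<html><body><img src="logo.png"><img src="a.png"></body></html>');
$docB = loadHtmlQuietly('<html><body><img src="logo.png"><img src="b.png"></body></html>');

echo $docA->getElementsByTagName('img')->length, "\n";
echo $docB->getElementsByTagName('img')->length, "\n";
```

With real files, the same pattern should work with loadHTMLFile() in place of loadHTML().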

Create two instances of DOMXPath, one for each document. XPath simplifies finding XML/HTML elements. Find all of the images with DOMXPath::query(). The expression is "//img". You could use DOMDocument::getElementsByTagName() for something simple like this, but you will need the extra power of XPath later to find specific image tags based on the src attribute, so you might as well test-drive XPath now.

$xpath = new DOMXPath($doc); // An instance of DOMXPath attached to a DOMDocument
$imgList = $xpath->query('//img'); // A DOMNodeList containing all <img> tags
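
To make the comparison between the two lookup styles concrete, here is a small sketch on made-up markup. It shows that "//img" and getElementsByTagName('img') find the same elements, and that an XPath predicate such as "//img[@src]" can narrow the result. The markup and variable names are assumptions for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><img src="a.png"><p><img alt="no source"></p></body></html>');
$xpath = new DOMXPath($doc);

$viaXPath = $xpath->query('//img');               // every <img>, at any depth
$viaTagName = $doc->getElementsByTagName('img');  // the same elements, without XPath
$withSrc = $xpath->query('//img[@src]');          // only images that have a src attribute

echo $viaXPath->length, ' ', $viaTagName->length, ' ', $withSrc->length, "\n";
```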

Loop through the DOMNodeLists returned by the queries. For each DOMNode (actually DOMElement) in each list, append the image's src attribute to an array. To get an element's attribute, use DOMElement::getAttribute().

$srcList = array(); // An array to hold src attribute strings
foreach ($imgList as $img) { // Loops through the DOMNodeList of images
    $srcList[] = $img->getAttribute('src'); // Stores the src attribute of each image
}
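
Since the same loop has to run once per document, it may help to wrap it in a function. This is a sketch under assumptions: the helper name collectImageSources() and the sample markup are made up, and it queries "//img[@src]" rather than "//img" so that images without a src attribute do not contribute empty strings.

```php
<?php
// Hypothetical helper: return the src value of every <img> that has one.
function collectImageSources(DOMDocument $doc): array
{
    $xpath = new DOMXPath($doc);
    $srcList = array();
    foreach ($xpath->query('//img[@src]') as $img) {
        $srcList[] = $img->getAttribute('src');
    }
    return $srcList;
}

// Placeholder markup standing in for the two pages.
$docA = new DOMDocument();
$docA->loadHTML('<html><body><img src="logo.png"><img src="a.png"></body></html>');
$docB = new DOMDocument();
$docB->loadHTML('<html><body><img src="b.png"><img src="logo.png"></body></html>');

$srcList = collectImageSources($docA);   // src strings from the first document
$srcList2 = collectImageSources($docB);  // src strings from the second document
print_r($srcList);
print_r($srcList2);
```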

Compare the src attributes in the two arrays to determine which are common to both documents. To do this, use array_intersect(), which returns an array of the elements in one array that also appear in another.

Before you loop through the resulting array and modify the documents, use array_unique() to remove any duplicate items that might have appeared because of the same image being used more than once in the same document. You will next be finding images based on the src attribute, and there is no need to look for the same src more than once.

Here, the last two steps are combined in a single statement.

$srcBoth = array_unique(array_intersect($srcList, $srcList2)); // $srcList2 is the src array built the same way from the second document
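
A short worked example with made-up file names shows what the combined statement produces. Both array_intersect() and array_unique() keep the keys of the first array, so array_values() is added here (an optional extra, not in the original statement) to get a cleanly indexed list.

```php
<?php
$srcList = array('logo.png', 'a.png', 'logo.png', 'b.png'); // first page, logo used twice
$srcList2 = array('b.png', 'logo.png', 'c.png');            // second page

$common = array_intersect($srcList, $srcList2); // keys 0, 2 and 3 survive
$srcBoth = array_unique($common);               // drops the second 'logo.png'
$srcBoth = array_values($srcBoth);              // reindex from 0

print_r($srcBoth);
```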

For each src string, find the image elements that match. For this, use the XPath expression shown in the next example. Loop through the DOMNodeList of matching images and do something with each image.

foreach ($srcBoth as $src) { // Loops through the src strings that are common to both documents
    $imgs = $xpath->query('//img[@src="'.$src.'"]'); // A DOMNodeList of images with a matching src attribute
    foreach ($imgs as $img) { // Loops through the images
        $img->setAttribute('src', ''); // Modifies the src attribute
        $img->setAttribute('alt', 'Deleted');
    }
}
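
One caveat with building the XPath expression by string concatenation: a src value that itself contains a double quote would most likely make the expression invalid. A possible alternative, sketched here with made-up markup and a placeholder $srcBoth, is to loop over all images and compare the attribute in PHP instead.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><img src="say-&quot;hi&quot;.png"><img src="other.png"></body></html>');

// Placeholder for the list of src strings common to both documents.
$srcBoth = array('say-"hi".png');

foreach ($doc->getElementsByTagName('img') as $img) {
    // Strict comparison of the attribute value; no XPath quoting involved.
    if (in_array($img->getAttribute('src'), $srcBoth, true)) {
        $img->setAttribute('src', '');
        $img->setAttribute('alt', 'Deleted');
    }
}

echo $doc->saveHTML();
```

This trades the targeted XPath lookup for a single pass over every image, which is usually fine for ordinary pages.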

In the previous example, I set the src attribute to an empty string and set the alternative text to "Deleted", but you could remove the image entirely with the removeChild method of the image's parent node.

$img->parentNode->removeChild($img); // Deletes the image element
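
A note on removing nodes inside a loop: the list returned by getElementsByTagName() is live, so deleting elements while iterating over it can cause some to be skipped. Results from DOMXPath::query() should not have this problem, as far as I know, but copying the list to an array first is a cheap safeguard either way. A sketch with made-up markup:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p><img src="a.png"><img src="b.png"><img src="c.png"></p></body></html>');

// Copy the live node list into a plain array before removing anything.
$images = iterator_to_array($doc->getElementsByTagName('img'));

foreach ($images as $img) {
    $img->parentNode->removeChild($img); // Deletes the image element
}

echo $doc->getElementsByTagName('img')->length, "\n";
```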

Finally, convert the modified document back to a string with DOMDocument::saveHTML() and output it, or save it with DOMDocument::saveHTMLFile().

echo $doc->saveHTML();
PHP Manual: DOM

Edit: This post was recovered from search engine cache.
Last edited by McInfo on Thu Jun 17, 2010 4:34 pm, edited 1 time in total.

Re: Identifying Similar Contents

Post by ajay600 »

Thanks, mate. That worked perfectly. Wow, I posted this in many forums, but none of them helped.

Re: Identifying Similar Contents

Post by ajay600 »

The above code for identifying images repeated in two pages and blocking them worked perfectly.
Now, could someone please modify this code so that links repeated in both pages are identified and not displayed on the web page (using the contents of the href attribute of the <a> tag)?
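
One way the image approach from McInfo's post might be adapted to links is sketched below. It swaps img/src for a/href and removes the repeated links from the first page. The helper name collectHrefs() and the sample markup are assumptions, and the hrefs are compared as literal strings, so "/home" and "http://example.com/home" would not count as the same link.

```php
<?php
// Hypothetical helper: collect the href of every <a> that has one.
function collectHrefs(DOMDocument $doc): array
{
    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

// Placeholder markup standing in for the two pages.
$docA = new DOMDocument();
$docA->loadHTML('<html><body><a href="/home">Home</a><a href="/a">A</a></body></html>');
$docB = new DOMDocument();
$docB->loadHTML('<html><body><a href="/home">Home</a><a href="/b">B</a></body></html>');

// href strings that appear in both documents, each listed once.
$hrefBoth = array_unique(array_intersect(collectHrefs($docA), collectHrefs($docB)));

// Copy the live list to an array first, then remove the repeated links from page A.
foreach (iterator_to_array($docA->getElementsByTagName('a')) as $a) {
    if (in_array($a->getAttribute('href'), $hrefBoth, true)) {
        $a->parentNode->removeChild($a);
    }
}

echo $docA->saveHTML();
```

Removing the <a> element also removes its link text; to keep the text and drop only the link, the element would need to be replaced with its child nodes instead.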