
Identifying Similar Contents

Posted: Tue Jan 19, 2010 9:00 am
by ajay600
Consider that I have two web pages and one image that appears in both of them.

Please post code that will help me find the image that is repeated in both pages, and a way to block that image.

My idea is to read the contents of each img tag (the file name given within quotes) on the first page and compare them with the img tags on the second page to find what is repeated.
I tried the code below, but it reports "nesting level too deep" at if(in_array($element_a,$images_b)).
Please help me find the matching images and block them from being displayed.


$html_a = file_get_html('http://www.website.com/page_a.html');
$html_b = file_get_html('http://www.website.com/page_b.html');
// Find all images
$images_a = $html_a->find('img');
$images_b = $html_b->find('img');
foreach($images_a as $element_a) {
if(in_array($element_a,$images_b)) {
echo $element_a->src . ' is on both sites<br>';
}
}

Re: Identifying Similar Contents

Posted: Tue Jan 19, 2010 9:13 am
by AbraCadaver
file_get_html() and find() are not core PHP functions/methods, so who knows what they do or what they return? Maybe a var_dump() of $html_a and $images_a would help.

Re: Identifying Similar Contents

Posted: Tue Jan 19, 2010 5:34 pm
by McInfo
One solution:

Create two instances of DOMDocument, one for each file. Call DOMDocument::loadHTMLFile() to import an HTML file.

Code:

$doc = new DOMDocument(); // An instance of DOMDocument
$doc->loadHTMLFile('./source.html');
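
A minimal sketch of this step applied to both pages. The sample markup and the names $htmlA, $htmlB, $docA and $docB are assumptions made for illustration; with files on disk you would call loadHTMLFile() as shown above instead of loadHTML(). The libxml_use_internal_errors() call is optional and only keeps the parser from printing warnings about imperfect markup.

```php
<?php
// Hypothetical sample markup standing in for the two pages.
$htmlA = '<html><body><img src="logo.png"><img src="a.png"></body></html>';
$htmlB = '<html><body><img src="logo.png"><img src="b.png"></body></html>';

// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely perfectly formed.
libxml_use_internal_errors(true);

$docA = new DOMDocument();
$docA->loadHTML($htmlA); // use loadHTMLFile($path) for a file on disk

$docB = new DOMDocument();
$docB->loadHTML($htmlB);

// Discard any warnings that were collected during parsing.
libxml_clear_errors();

// Quick sanity check: how many <img> tags did each document yield?
echo $docA->getElementsByTagName('img')->length, "\n";
echo $docB->getElementsByTagName('img')->length, "\n";
```
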

Create two instances of DOMXPath, one for each document. XPath simplifies finding XML/HTML elements. Find all of the images with DOMXPath::query(). The expression is "//img". You could use DOMDocument::getElementsByTagName() for something simple like this, but you will need the extra power of XPath later to find specific image tags based on the src attribute, so you might as well test-drive XPath now.

Code:

$xpath = new DOMXPath($doc); // An instance of DOMXPath attached to a DOMDocument
$imgList = $xpath->query('//img'); // A DOMNodeList containing all <img> tags

Loop through the DOMNodeLists returned by the queries. For each DOMNode (actually DOMElement) in each list, append the image's src attribute to an array. To get an element's attribute, use DOMElement::getAttribute().

Code:

$srcList = array(); // An array to hold src attribute strings
foreach ($imgList as $img) { // Loops through the DOMNodeList of images
    $srcList[] = $img->getAttribute('src'); // Stores the src attribute of each image
}
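
Because the same loop has to run once per document, it may be convenient to wrap it in a function. The sketch below does that; the function name imageSources() and the sample markup are assumptions, not part of the original code.

```php
<?php
// Hypothetical helper: returns the src attribute of every <img> in a document.
function imageSources(DOMDocument $doc)
{
    $xpath = new DOMXPath($doc);
    $srcList = array();
    foreach ($xpath->query('//img') as $img) {
        $srcList[] = $img->getAttribute('src');
    }
    return $srcList;
}

libxml_use_internal_errors(true); // keep parser warnings quiet

// Hypothetical sample markup standing in for one page.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><img src="logo.png"><p><img src="a.png"></p></body></html>');

$srcList = imageSources($doc);
print_r($srcList);
```

Calling the function once per document gives the two arrays needed for the comparison step.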

Compare the src attributes in the two arrays to determine which are common to both documents. To do this, use array_intersect(), which returns an array of the values in the first array that also appear in the second.

Before you loop through the resulting array and modify the documents, use array_unique() to remove any duplicate items that might have appeared because of the same image being used more than once in the same document. You will next be finding images based on the src attribute, and there is no need to look for the same src more than once.

Here, the last two steps are combined in a single statement.

Code:

$srcBoth = array_unique(array_intersect($srcList, $srcList2)); // $srcList2 holds the src strings from the second document
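
To see what the combined statement does, here is a small sketch with made-up file names (the array contents are assumptions). Both array_intersect() and array_unique() preserve the original keys, so array_values() is added here only to reindex the result for display.

```php
<?php
// Hypothetical src lists from the first and second document.
$srcList  = array('logo.png', 'a.png', 'logo.png', 'shared.gif');
$srcList2 = array('shared.gif', 'b.png', 'logo.png');

// Values of $srcList that also occur in $srcList2 (keys preserved).
$common = array_intersect($srcList, $srcList2);

// Drop the second 'logo.png', then reindex from 0.
$srcBoth = array_values(array_unique($common));

print_r($srcBoth);
```

The comparison is an exact string match, so "logo.png" and "./logo.png" (or an absolute URL to the same file) would count as different images unless the strings are normalized first.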

For each src string, find the image elements that match. For this, use the XPath expression shown in the next example. Loop through the DOMNodeList of matching images and do something with each image.

Code:

foreach ($srcBoth as $src) { // Loops through the src strings that are common to both documents
    $imgs = $xpath->query('//img[@src="'.$src.'"]'); // A DOMNodeList of images with a matching src attribute
    foreach ($imgs as $img) { // Loops through the images
        $img->setAttribute('src', ''); // Modifies the src attribute
        $img->setAttribute('alt', 'Deleted');
    }
}
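
One caveat with building the XPath expression by string concatenation: a src value that itself contains a double quote would break the expression. A possible alternative, sketched below, is a different technique from the one above: query every image that has a src attribute once, and do the comparison in PHP. The sample markup, the $lookup array and the $changed counter are assumptions for illustration.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings quiet

// Hypothetical page; the second image has a double quote in its src.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><img src="logo.png"><img src=\'we"ird.png\'><img src="keep.png"></body></html>');
$xpath = new DOMXPath($doc);

// Hypothetical list of src strings common to both documents.
$srcBoth = array('logo.png', 'we"ird.png');

// Flip values to keys so each membership test is a cheap isset().
$lookup = array_flip($srcBoth);

$changed = 0;
foreach ($xpath->query('//img[@src]') as $img) {
    if (isset($lookup[$img->getAttribute('src')])) {
        $img->setAttribute('src', '');
        $img->setAttribute('alt', 'Deleted');
        $changed++;
    }
}

echo $changed, "\n";
```
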

In the previous example, I set the src attribute to an empty string and set the alternative text to "Deleted", but you could remove the image entirely with the removeChild method of the image's parent node.

Code:

$img->parentNode->removeChild($img); // Deletes the image element
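
A short sketch of the removal variant in context. Copying the matches into a plain array with iterator_to_array() before removing anything is a defensive choice on my part, not something the original code requires: it guarantees the loop is not affected by the removals, and the same pattern also works with the live list that getElementsByTagName() returns. The sample markup and the hard-coded "logo.png" are assumptions.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings quiet

// Hypothetical page with the shared image used twice.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p><img src="logo.png"></p><img src="keep.png"><img src="logo.png"></body></html>');
$xpath = new DOMXPath($doc);

// Take a snapshot of the matching nodes before modifying the tree.
$matches = iterator_to_array($xpath->query('//img[@src="logo.png"]'));

foreach ($matches as $img) {
    $img->parentNode->removeChild($img); // Deletes the image element
}

// Only the image with src="keep.png" should be left.
$remaining = $doc->getElementsByTagName('img')->length;
echo $remaining, "\n";
```
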

Finally, convert the modified document back to a string with DOMDocument::saveHTML() and output it, or save it with DOMDocument::saveHTMLFile().

Code:

echo $doc->saveHTML();
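
Putting the steps together, the whole process might look roughly like the sketch below. It is one possible arrangement, not the only one: the helper names loadDocument() and collectImageSources() and the inline sample markup are assumptions, and only the first page is modified and printed.

```php
<?php
// Hypothetical sample markup standing in for the two pages.
$htmlA = '<html><body><img src="logo.png"><img src="only-a.png"><img src="logo.png"></body></html>';
$htmlB = '<html><body><img src="logo.png"><img src="only-b.png"></body></html>';

libxml_use_internal_errors(true); // keep parser warnings quiet

// Hypothetical helper: parses an HTML string into a DOMDocument.
function loadDocument($html)
{
    $doc = new DOMDocument();
    $doc->loadHTML($html); // use loadHTMLFile() for a file on disk
    return $doc;
}

// Hypothetical helper: collects the src attribute of every <img>.
function collectImageSources(DOMXPath $xpath)
{
    $srcList = array();
    foreach ($xpath->query('//img') as $img) {
        $srcList[] = $img->getAttribute('src');
    }
    return $srcList;
}

$docA = loadDocument($htmlA);
$docB = loadDocument($htmlB);
$xpathA = new DOMXPath($docA);
$xpathB = new DOMXPath($docB);

// src strings that appear in both documents, each listed once.
$srcBoth = array_unique(array_intersect(
    collectImageSources($xpathA),
    collectImageSources($xpathB)
));

// Blank out every matching image in the first document.
foreach ($srcBoth as $src) {
    foreach ($xpathA->query('//img[@src="' . $src . '"]') as $img) {
        $img->setAttribute('src', '');
        $img->setAttribute('alt', 'Deleted');
    }
}

$output = $docA->saveHTML();
echo $output;
```

To change the second page as well, the same inner loop can be run again with $xpathB and $docB.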

PHP Manual: DOM

Edit: This post was recovered from search engine cache.

Re: Identifying Similar Contents

Posted: Tue Jan 19, 2010 11:40 pm
by ajay600
Thanks mate, that worked perfectly. Wow, I submitted this in many forums but none helped.

Re: Identifying Similar Contents

Posted: Mon Jan 25, 2010 12:23 am
by ajay600
The above code to identify images repeated in two pages and block them worked perfectly.
Now can someone please modify this code so that links repeated in both pages are identified and not displayed on the web page (using the contents of the <a href> tag)?