Suppose I have two web pages, and one image appears on both of them.
Please post code that will help me find the image that is repeated on both pages, and a way to block that image.
My idea is to read the contents of each img tag (the file name given within quotes) on the first page and compare them with the img tags on the second page to find what is repeated.
I tried the code below, but it says "Nesting level too deep" at the line if(in_array($element_a,$images_b)).
Please help me find the repeated images and block them from being displayed.
$html_a = file_get_html('http://www.website.com/page_a.html');
$html_b = file_get_html('http://www.website.com/page_b.html');
// Find all images
$images_a = $html_a->find('img');
$images_b = $html_b->find('img');
foreach($images_a as $element_a) {
    if(in_array($element_a,$images_b)) {
        echo $element_a->src . ' is on both sites<br>';
    }
}
Identifying Similar Contents
- AbraCadaver
Re: Identifying Similar Contents
file_get_html() and the method find() are not core PHP classes/methods so who knows what they do or what they return? Maybe a var_dump() of $html_a and $images_a would help.
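A possible explanation for the error, offered as an assumption because the thread never shows what find() returns: if the elements are objects that hold references to their parents or children, in_array() compares them loosely, property by property, and the comparison can recurse until PHP stops with "Nesting level too deep". The sketch below uses a hypothetical ImageNode class as a stand-in for the library's element objects and compares plain src strings instead of objects.

```php
<?php
// Hypothetical stand-in for an element object returned by an HTML parsing library.
class ImageNode
{
    public $src;
    public $parent; // back-reference of the kind that makes loose object comparison recurse

    public function __construct(string $src)
    {
        $this->src = $src;
        $this->parent = $this; // self-reference, purely for illustration
    }
}

$images_a = [new ImageNode('logo.png'), new ImageNode('a.png')];
$images_b = [new ImageNode('b.png'), new ImageNode('logo.png')];

// Reduce each list to plain strings so that no objects are ever compared.
$getSrc = function (ImageNode $img): string {
    return $img->src;
};
$src_a = array_map($getSrc, $images_a);
$src_b = array_map($getSrc, $images_b);

// Collect and report every src that appears in both lists.
$common = [];
foreach ($src_a as $src) {
    if (in_array($src, $src_b, true)) { // strict comparison of strings only
        $common[] = $src;
        echo $src . ' is on both sites<br>';
    }
}
```

With the real library, the same idea would mean collecting $element->src from each list before comparing, rather than passing the element objects to in_array().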
Re: Identifying Similar Contents
One solution:
Create two instances of DOMDocument, one for each file. Call DOMDocument::loadHTMLFile() to import an HTML file.
Code: Select all
$doc = new DOMDocument(); // An instance of DOMDocument
$doc->loadHTMLFile('./source.html');
Create two instances of DOMXPath, one for each document. XPath simplifies finding XML/HTML elements. Find all of the images with DOMXPath::query(). The expression is "//img". You could use DOMDocument::getElementsByTagName() for something simple like this, but you will need the extra power of XPath later to find specific image tags based on the src attribute, so you might as well test-drive XPath now.
Code: Select all
$xpath = new DOMXPath($doc); // An instance of DOMXPath attached to a DOMDocument
$imgList = $xpath->query('//img'); // A DOMNodeList containing all <img> tags
Loop through the DOMNodeLists returned by the queries. For each DOMNode (actually DOMElement) in each list, append the image's src attribute to an array. To get an element's attribute, use DOMElement::getAttribute().
Code: Select all
$srcList = array(); // An array to hold src attribute strings
foreach ($imgList as $img) { // Loops through the DOMNodeList of images
    $srcList[] = $img->getAttribute('src'); // Stores the src attribute of each image
}
Compare the src attributes in the two arrays to determine which are common to both documents. To do this, use array_intersect(), which returns an array of the elements in one array that also appear in another.
Before you loop through the resulting array and modify the documents, use array_unique() to remove any duplicate items that might have appeared because of the same image being used more than once in the same document. You will next be finding images based on the src attribute, and there is no need to look for the same src more than once.
Here, the last two steps are combined in a single statement.
Code: Select all
// $srcList2 is the array of src strings built the same way from the second document
$srcBoth = array_unique(array_intersect($srcList, $srcList2));
For each src string, find the image elements that match. For this, use the XPath expression shown in the next example. Loop through the DOMNodeList of matching images and do something with each image.
Code: Select all
foreach ($srcBoth as $src) { // Loops through the src strings that are common to both documents
    $imgs = $xpath->query('//img[@src="'.$src.'"]'); // A DOMNodeList of images with a matching src attribute
    foreach ($imgs as $img) { // Loops through the images
        $img->setAttribute('src', ''); // Modifies the src attribute
        $img->setAttribute('alt', 'Deleted');
    }
}
In the previous example, I set the src attribute to an empty string and set the alternative text to "Deleted", but you could remove the image entirely with the removeChild method of the image's parent node.
Code: Select all
$img->parentNode->removeChild($img); // Deletes the image element
Finally, convert the modified document back to a string with DOMDocument::saveHTML() and output it, or save it with DOMDocument::saveHTMLFile().
Code: Select all
echo $doc->saveHTML();
PHP Manual: DOM
Edit: This post was recovered from search engine cache.
Last edited by McInfo on Thu Jun 17, 2010 4:34 pm, edited 1 time in total.
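For anyone who wants the steps above in one piece, here is a minimal sketch that assembles them into a runnable script. It is an illustration under assumptions, not the original poster's exact code: the helper names loadPage(), imageSources() and blockImages() are made up for this example, and the two pages are short inline strings loaded with DOMDocument::loadHTML() so the script runs without any files. The matching step also differs slightly: it compares src values in PHP rather than building an XPath expression from each src.

```php
<?php
// Hypothetical sample markup; real pages would be loaded with loadHTMLFile().
$htmlA = '<html><body><img src="logo.png"><img src="a.png"><img src="logo.png"></body></html>';
$htmlB = '<html><body><img src="b.png"><img src="logo.png"></body></html>';

/** Parses an HTML string and returns [DOMDocument, DOMXPath]. */
function loadPage(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return [$doc, new DOMXPath($doc)];
}

/** Collects the src attribute of every <img> in a document. */
function imageSources(DOMXPath $xpath): array
{
    $srcList = [];
    foreach ($xpath->query('//img') as $img) {
        $srcList[] = $img->getAttribute('src');
    }
    return $srcList;
}

/** Blanks every <img> whose src is listed in $srcBoth. */
function blockImages(DOMXPath $xpath, array $srcBoth): void
{
    foreach ($xpath->query('//img') as $img) {
        if (in_array($img->getAttribute('src'), $srcBoth, true)) {
            $img->setAttribute('src', '');
            $img->setAttribute('alt', 'Deleted');
        }
    }
}

[$docA, $xpathA] = loadPage($htmlA);
[$docB, $xpathB] = loadPage($htmlB);

// src strings common to both documents, without duplicates
$srcBoth = array_values(array_unique(
    array_intersect(imageSources($xpathA), imageSources($xpathB))
));

// Each document has its own DOMXPath, so the blocking step runs once per document.
blockImages($xpathA, $srcBoth);
blockImages($xpathB, $srcBoth);

echo $docA->saveHTML();
```

Comparing the attribute in PHP is a design choice for the sketch: an expression built as '//img[@src="'.$src.'"]' stops being valid XPath if a src value happens to contain a double quote, whereas in_array() on strings has no such limitation.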
Re: Identifying Similar Contents
Thanks, mate, that worked perfectly. Wow, I posted this in many forums, but none of them helped.
Re: Identifying Similar Contents
The above code for identifying images repeated on two pages and blocking them worked perfectly.
Now, can someone please modify this code so that links repeated on both pages are identified (using the href attribute of the <a> tags) and the repeated links are not displayed on the web page?
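One way the image approach might be adapted to links is sketched below. It is only an illustration under assumptions: the helper names linkTargets() and removeLinks() are hypothetical, the two pages are inline sample strings, and "not displayed" is interpreted as removing the whole <a> element with removeChild(), as suggested for images earlier in the thread.

```php
<?php
/** Returns the href of every <a> element that has one. */
function linkTargets(DOMXPath $xpath): array
{
    $hrefs = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

/** Removes every <a> element whose href is listed in $shared. */
function removeLinks(DOMXPath $xpath, array $shared): void
{
    // Copy the node list to a plain array first so removals cannot disturb the loop.
    foreach (iterator_to_array($xpath->query('//a[@href]')) as $a) {
        if (in_array($a->getAttribute('href'), $shared, true)) {
            $a->parentNode->removeChild($a);
        }
    }
}

// Hypothetical sample pages; real pages would be loaded with loadHTMLFile().
$docA = new DOMDocument();
$docA->loadHTML('<html><body><a href="/home">Home</a><a href="/a">Only A</a></body></html>');
$docB = new DOMDocument();
$docB->loadHTML('<html><body><a href="/home">Home</a><a href="/b">Only B</a></body></html>');

$xpathA = new DOMXPath($docA);
$xpathB = new DOMXPath($docB);

// href strings common to both documents, without duplicates
$hrefBoth = array_values(array_unique(
    array_intersect(linkTargets($xpathA), linkTargets($xpathB))
));

removeLinks($xpathA, $hrefBoth);
removeLinks($xpathB, $hrefBoth);

echo $docA->saveHTML();
```

Note that this compares href strings exactly as written, so a relative link on one page and the equivalent absolute URL on the other would not be treated as the same link.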