Checking for the duplicate strings

Posted: Fri Jan 07, 2011 1:35 am
by mudgil.gaurav
Dear All,

I am working on an application and want to prevent its users from posting duplicate content.

For example, if a user inserts content like:

this is test content this is test content.you can check the prof
this is test content this is test content.

Here the user repeats "this is test content" many times. If a user does this, I want to report the content as spam.

Any help or suggestions would be greatly appreciated.

Thanks
Gaurav

Re: Checking for the duplicate strings

Posted: Fri Jan 07, 2011 3:07 am
by spedula
Interesting idea. I'd suggest playing around with strpos().

I can't even begin to think about how to define the needle to search for in the haystack.

I'm going to subscribe to this thread. Would really like to know what others come up with. For now, going to play around with some code to figure this out.
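
One possible way around the needle problem, as a rough sketch only: instead of picking a needle up front, treat every run of N consecutive words as a candidate and count how often each one occurs. The function name repeated_phrases() and the default window size are made up for illustration, and the cleanup step assumes plain ASCII text.

```php
<?php
// Hypothetical helper (not from the thread): counts every run of $n
// consecutive words and returns the runs that occur at least twice.
function repeated_phrases($text, $n = 3)
{
    // Lowercase, then turn anything that is not a letter or digit into a
    // space, so "content.you" splits into two words (ASCII-only assumption).
    $clean = preg_replace('/[^a-z0-9]+/', ' ', strtolower($text));
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);

    // Slide a window of $n words across the text and count each phrase.
    $counts = array();
    for ($i = 0; $i + $n <= count($words); $i++) {
        $phrase = implode(' ', array_slice($words, $i, $n));
        $counts[$phrase] = isset($counts[$phrase]) ? $counts[$phrase] + 1 : 1;
    }

    // Keep only the phrases that were seen more than once.
    $repeated = array();
    foreach ($counts as $phrase => $c) {
        if ($c >= 2) {
            $repeated[$phrase] = $c;
        }
    }
    return $repeated;
}

// Usage: a 4-word window over the sample text from the first post.
$hits = repeated_phrases("this is test content this is test content", 4);
// $hits is array('this is test content' => 2)
```

Windows are allowed to overlap here, so a phrase repeated back to back many times will also produce repeated "shifted" phrases. Whether that is acceptable depends on how the counts are scored later.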

Re: Checking for the duplicate strings

Posted: Fri Jan 07, 2011 3:26 am
by spedula
Okay. I found an interesting script someone else wrote online:

Code:

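$str = "this is test content this is test content";
// trim leading and trailing whitespace
$str = trim($str);
// collapse runs of whitespace into one space (ereg_replace() is deprecated, so use preg_replace())
$str = preg_replace('/\s+/', ' ', $str);
// split the string into an array of words
$words = explode(' ', $str);
// count occurrences of each word, ignoring case
$wordstats = array();
foreach ($words as $w)
{
	$w = strtolower($w);
	// initialise the counter on first sight to avoid an undefined-index notice
	$wordstats[$w] = isset($wordstats[$w]) ? $wordstats[$w] + 1 : 1;
}
// print every word that occurs at least twice
foreach ($wordstats as $k => $v)
{
	if ($v >= 2)
	{
		print "$k \r\n";
	}
}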
You can base your function on this script. It will need a good deal of modification, but it's a start.

You'll have an int value, $v, for each word, giving the number of times it occurs. These counts can be used to determine a spam value for the input; the scoring algorithm itself would need to be tested very thoroughly.

Of course, there should be a dictionary of words like "the" and "an" that naturally occur multiple times and should be excluded from the spam value.
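
To make the spam value idea a bit more concrete, here is a minimal sketch, not a tested algorithm. It drops a small stop-word list and then reports what fraction of the remaining words are repeats of an earlier word. The name spam_score(), the stop-word list, and any cutoff you compare the score against are all placeholders that would need tuning on real posts.

```php
<?php
// Hypothetical scoring helper: returns a value between 0.0 (no repeated
// words) and just under 1.0 (almost every word is a repeat).
function spam_score($text, $stopWords = array('the', 'a', 'an', 'is', 'and', 'of', 'to'))
{
    // Lowercase and split on anything that is not a letter or digit
    // (ASCII-only assumption).
    $clean = preg_replace('/[^a-z0-9]+/', ' ', strtolower($text));
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);

    // Remove common words that repeat naturally in normal writing.
    $words = array_values(array_diff($words, $stopWords));
    if (count($words) === 0) {
        return 0.0;
    }

    // Every occurrence of a word beyond its first counts as one repeat.
    $repeats = 0;
    foreach (array_count_values($words) as $c) {
        $repeats += $c - 1;
    }
    return (float) $repeats / count($words);
}

// Usage: the sample text from the first post.
$score = spam_score("this is test content this is test content");
// $score is 0.5: after dropping "is", 3 of the 6 remaining words are repeats
```

A ratio is used rather than a raw count so that long, legitimate posts are not penalised just for being long. Short posts will still give noisy scores, so a minimum word count before scoring is probably worth adding.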

I leave you with this. PM me when you have an algorithm in place, I'm very interested in what can be done with this.