[SOLVED] removing duplicate records..

Posted: Thu Aug 25, 2005 11:39 am
by waskelton4
Hello all,

I have a database that I've accidentally created duplicate records in, identical even down to the timestamp field. There are probably 3000 of them, so it's too much to weed out by hand. Is there a way for me to delete the duplicates without deleting both rows?

thanks
Will

Posted: Thu Aug 25, 2005 11:48 am
by nielsene
What database are you using?

There are a large number of possible ways to fix this, but most require implementation-dependent workarounds.

Posted: Thu Aug 25, 2005 11:50 am
by waskelton4
Whoops..

mysql ver 4.1

Posted: Thu Aug 25, 2005 11:56 am
by nielsene
I don't know MySQL, but typically these tricks rely on the database having a hidden "row identifier" -- a sort of extra, hidden primary key -- that you can GROUP BY, taking MIN(row_id) per group, to delete everything but the lowest row_id match.
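As far as I know, MySQL has no such hidden row identifier, but for the sake of the idea, here's a sketch in a database that does expose one (SQLite's implicit rowid; the table and column names are hypothetical, untested):

```sql
-- SQLite sketch: every ordinary table has an implicit rowid column.
-- Keep the row with the lowest rowid in each group of duplicates,
-- delete the rest. t, col1 and col2 are hypothetical names.
DELETE FROM t
WHERE rowid NOT IN (
    SELECT MIN(rowid) FROM t GROUP BY col1, col2
);
```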

Posted: Thu Aug 25, 2005 1:30 pm
by josh
Using PHP, you could select all the primary keys along with whatever value you want to check for duplicates, build an array of records to delete, and then run something like:

Code:

delete from `table` where `key` in (1,5,100,400,700)
like that

Posted: Thu Aug 25, 2005 1:37 pm
by waskelton4
jshpro2 wrote:Using PHP, you could select all the primary keys along with whatever value you want to check for duplicates, build an array of records to delete, and then run something like:

Code:

delete from `table` where `key` in (1,5,100,400,700)
like that
But wouldn't that delete both of the records and not just one of the copies? The table doesn't currently have an enforced primary key. (I know... real bad.) After I get all the duplicate records out, I'm gonna create one.

will

Posted: Thu Aug 25, 2005 1:44 pm
by feyd
you should be able to find duplicates like so:

Code:

SELECT t1.*, COUNT(t1.`someField`) AS siblings FROM `table1` t1 NATURAL LEFT JOIN `table1` t2 GROUP BY `someField1`, `someField2` HAVING siblings > 1

(untested)

Posted: Thu Aug 25, 2005 3:38 pm
by waskelton4
feyd wrote:you should be able to find duplicates like so:

Code:

SELECT t1.*, COUNT(t1.`someField`) AS siblings FROM `table1` t1 NATURAL LEFT JOIN `table1` t2 GROUP BY `someField1`, `someField2` HAVING siblings > 1
(untested)
Hrmm..

I tried the above, and the value of 'siblings' came out to be the square of what it should have been (i.e. all the 2s were 4s and the 3 was 9).

Would there be a solution similar to something like this:

Code:

SELECT ID, Count(ID) as counter FROM `table` group by ID where counter > 1
or

Code:

SELECT ID, Count(ID) as counter FROM `table` group by ID where count(ID) > 1
Those both throw errors, but it seems to me that they should work.
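For what it's worth, those fail because WHERE is evaluated before grouping, so it can't test an aggregate like COUNT(); a filter on an aggregate belongs in a HAVING clause after the GROUP BY. Something along these lines should work (untested, same hypothetical ID column as the queries above):

```sql
-- HAVING filters whole groups after aggregation, which is what WHERE can't do.
SELECT ID, COUNT(ID) AS counter
FROM `table`
GROUP BY ID
HAVING counter > 1
```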

Thanks for the help all..

will

Posted: Thu Aug 25, 2005 4:01 pm
by feyd
Oops, change the > 1 to > 2. Yes, they will be squares, because the join matches each instance of a duplicate against every instance that exists ;)

Posted: Thu Aug 25, 2005 5:01 pm
by onion2k
If there's a unique bit of information in the data, such as an email address, select the lowest id for each group of that field and delete everything else. That keeps one of each pair:

delete from `table` where id not in (select min(id) from `table` group by email)

That's not tested, by the way.
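One caveat with any subquery approach here: MySQL 4.1 refuses a DELETE whose subquery selects from the same table being deleted from (error 1093, "You can't specify target table ... for update in FROM clause"). A workaround, assuming the table really does have an id column, is to stage the ids to keep in a temporary table first (untested):

```sql
-- Stage one id per email group (the lowest) in a scratch table.
CREATE TEMPORARY TABLE keepers
    SELECT MIN(id) AS id FROM `table` GROUP BY email;

-- Now the subquery reads from keepers, not from `table` itself,
-- so the error-1093 restriction no longer applies.
DELETE FROM `table` WHERE id NOT IN (SELECT id FROM keepers);

DROP TEMPORARY TABLE keepers;
```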

Posted: Fri Aug 26, 2005 7:57 am
by AGISB
Just read all the entries and put them in an array, filter out the duplicates, and write the result to a new table.

Then empty the whole table and rewrite it with the other table's data. This could be done in SQL or PHP.
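Since the duplicates here are identical in every column, the rewrite can be done entirely in SQL with SELECT DISTINCT; a sketch, assuming the table is called mytable (a hypothetical name, untested):

```sql
-- Copy exactly one instance of each row into a fresh table.
CREATE TABLE mytable_dedup SELECT DISTINCT * FROM mytable;

-- Atomically swap the old and new tables.
RENAME TABLE mytable TO mytable_old, mytable_dedup TO mytable;

-- After verifying the new table: DROP TABLE mytable_old;
```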

Posted: Fri Aug 26, 2005 8:35 am
by waskelton4
AGISB wrote:Just read all the entries and put them in an array, filter out the duplicates, and write the result to a new table.

Then empty the whole table and rewrite it with the other table's data. This could be done in SQL or PHP.
I've thought of similar ways, but I was hoping to do it with just SQL...

I think I will end up doing something similar to this.

Thanks for everyone's help!

will

Posted: Fri Aug 26, 2005 8:57 am
by waskelton4
feyd wrote:Oops, change the > 1 to > 2. Yes, they will be squares, because the join matches each instance of a duplicate against every instance that exists ;)
Actually, I think that > 1 works fine because the square of 1 is still 1; the lowest value returned for actual duplicates comes from the rows that appear twice, and that value is 4.

Thanks again for the help all..

Now I just have to shut down the application for a few minutes while I remove these things.. blech..

ws

Posted: Fri Aug 26, 2005 9:24 am
by bg
Use "LIMIT 1" so that only one record is deleted (assuming there is only one duplicate of each entry)
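A sketch of that, matching the doomed row on every column so the identical copy is left alone (hypothetical table, column names, and values; DELETE ... LIMIT is a MySQL extension, untested):

```sql
-- Both identical copies match the WHERE clause,
-- but LIMIT 1 stops after deleting one of them.
DELETE FROM mytable
WHERE col1 = 'foo' AND col2 = 'bar' AND ts = '2005-08-25 11:39:00'
LIMIT 1;
```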

Posted: Fri Aug 26, 2005 9:40 am
by waskelton4
bgzee wrote:Use "LIMIT 1" so that only one record is deleted (assuming there is only one duplicate of each entry)
That's the ticket...