Page 1 of 1

DISTINCT or GROUP BY - Dropping Duplicates

Posted: Tue Feb 23, 2010 9:34 am
by alex.barylski
I have an annoying problem here.

I have a series of tables (about 6-7) that make up the core of our application. The primary table, that drives all others is called a procedure. A procedure has many states (Draft, Archive, Approved, etc).

When a procedure is in Approved state, it cannotbe changed directly but must be cloned and flagged as Draft, where edits may be made and then approved. Upon approval, an procedure then replaces the previous approved procedure.

This is all well and good and most of the system works fine, however one of the tables is a called a sequence table, which is simple in nature:

Code: Select all

pkid, time_bench, time_machine, description
Because one or more procedures might reference the same sequence (via junction table) and these procedures may be under different states (approved, archive, draft, etc) the constraint I have implemented is to simply disallow any edits to any procedure (and thus sequences, etc) until it's cloned and flagged as 'Draft'.

What this means is, there are duplicates inside the sequences table, which normally would not pose an issue, but the sequences table is also shown indepently of any procedure association, as a master list, to allow procedures to be easily associated with ALL sequences in the system; via click and drag. Not to mention editing of a sequence which might be found via a search filter.

Here is the kicker and real problem I am facing.

The duplicates should not be shown in the sequence list which is currently populated with a default SELECT query:

[sql]SELECT * FROM sequences[/sql]

This will obviously show any duplicate entries, which in case I have no explained, are introduced when an procedure is cloned, all it's supporting table entries are also cloned. So if a procedure has 10 sequences and it's currently under active status and someone edits one sequence associated with it, those sequences are cloned in the table above, and thus you have two sequences with identical fields, except the primary key ID.

[sql]SELECT * FROM sequences GROUP BY description[/sql]

[sql]SELECT DISTINCT(description) FROM sequences[/sql]

Both of these will remove duplicates, based on description text. While this works, it also fails to meet my exact requirements.

Unfornately, DISTINCT and GROUP BY will stop duplicating on the first record found in the database, regardless of ORDER BY or other statements. What that means is that, the sequences that are associated with the 'Approved' procedures are shown, while the sequences just cloned, are not. Again not normally a problem, except that when populating my sequences master list, the ID's of the 'Approved' sequences are used and I actually need the ID's of the records just cloned, as 'Approved' sequences cannot be directly modified. The cloned sequences need to be listed, as they are direclty modified and only once approved, are the cloned entities bumped to 'Approved' and the original is deleted.

Adding a field to the sequences table as some sort of flag, is not an ideal solution. Without going into details, it feels hackish and I would rather find a SQL solution to meet my requirements.

Procedures, on the other hand, have a status field (id_status) which indicates whether they are in 'Draft' 'Approved' or whatever. So what I had yesterday was a little more complicated in terms of SQL queries.

[sql] SELECT * FROM procedures_sequencesLEFT JOIN proceduresON procedures.id_primary = procedure_sequences.id_procedureRIGHT JOIN sequencesON sequences.id_primary = procedure_sequences.id_sequenceWHERE procedures.id_status = 3 OR procedures.id_status IS NULL[/sql]

This essentially showed all sequences which were associated with Draft procedures and sequences which were NOT associated with any procedures. However this will NOT work because I also need to show all sequences associated with live procedures, just not duplicates, in which case the cloned version should be returned bot the original.

Lastly, one alternative option I considered is basically making the clone of a procedure, it's sequences, etc...and swapping the associations so the Approved procedure references the recently cloned sequences and the clone would reference the original sequences. This way DISTINCT and GROUP BY will always show the original sequences (which become clone vis-versa) and thus editing the sequences do not affect Approved procedures.

Cheers,
Alex

Re: DISTINCT or GROUP BY - Dropping Duplicates

Posted: Tue Feb 23, 2010 10:02 am
by Eran
I had a hard time following your requirements as its quite contrived (by the way - what the hell is click-and-drag? did you mean drag-and-drop?), but I think you should try and understand how GROUP BY and DISTINCT work. DISTINCT will return the DISTINCT values of a specific field (obviously) but they do not affect the rest of the row. This is usually useful in conjunction with aggregated functions that you do not want operating on duplicate values.

A GROUP BY statement folds down rows by a set of criteria, but you lose specific row details that have different values in the group. In the SQL standard, you cannot select columns that have different values in a group using a GROUP BY statement. MySQL deviates from that standards and populates the rest of the columns (which prevent an error from occurring), but you have no control on what row those values are taken from.

What you want to do in practice, if I understand correctly, is take the latest row in a group (or using some other unique criteria) and populate the non-unique columns in the group with it. You need to use GROUP BY and join it again to the same table using that field as a filter - take a look at this article which covers several approaches for achieving increasing difficult criteria for the merge - http://www.xaprb.com/blog/2006/12/07/ho ... up-in-sql/

Alternatively, if you know in advance the value you want to filer by (like an enumerated status field) you can simply add that to the where clause to get rid of all the irrelevant rows.

Re: DISTINCT or GROUP BY - Dropping Duplicates

Posted: Tue Feb 23, 2010 10:09 am
by alex.barylski
what the hell is click-and-drag? did you mean drag-and-drop?),
Yes I did...haha my bad :P
What you want to do in practice, if I understand correctly, is take the latest row in a group (or using some other unique criteria) and populate the results with it.
Sort of. This list by default will show a user every sequence in the system, regardless of where or how it's used. However to avoid introducing temp tables for every cloned record (5-6 tables) I simply copy the record internally and wish to hide it somehow, without introducing any context flag field to the existing tables (ideally).

GROUP BY seems to do the trick, however it is not returning the latest copy added but the first it finds. I assume the RDBMS probably just blindly accepts the first match and ignores others. What scares me about this assumption is the possibility for error when the table becomes fragmented and OPTIMIZE (or whatever) is executed on the table, reshuffling records and possibly skewing the result.

So ideally I would have preceise control over the results, I will look into that link you sent me, thank you. :)

Cheers,
Alex

Re: DISTINCT or GROUP BY - Dropping Duplicates

Posted: Tue Feb 23, 2010 4:35 pm
by VladSun
PCSpectra wrote:GROUP BY seems to do the trick, however it is not returning the latest copy added but the first it finds. I assume the RDBMS probably just blindly accepts the first match and ignores others.
Tell me about it : viewtopic.php?f=2&t=72439 ;)