
How far to go with normalization?

Posted: Thu Jan 05, 2006 10:52 am
by matthijs
After reading some tutorials I have a basic understanding of the why and how of database normalization. As an exercise I've designed a database schema with some tables. I created multiple tables during the normalization process and have been able to get rid of most redundant data and arrive at a good (?) relational model.

As my tables are now, there's still some data duplication, but I'm not sure whether making more separate tables is necessary or even wanted. I'll first show what I have.


Climbs
-------------------------------------------------------------------
climb_id(PK) route_id(FK) climber_id(FK) climb_date climb_style climb_tries climb_notes
  1               1            2         12/12/05      style1         2         blabla
  2               2            1         21/12/05      style2         6         blabla
  3               5            1         15/12/05      style2         4         blabla
  4               4            2         17/11/05      style1         2         blabla
  5               4            4         17/11/05      style1         6         blabla

..

Routes
------------------------------------------------------------------
route_id(PK)  area_id(FK)  route_name  route_grade
   1           2            somename1        4
   2           2            somename2        5b 
   3           1            somename3        3 
   4           3            somename4        7a
   5           5            somename5        7a 
   6           5            somename5        7a 
   7           5            somename6        6b 
..

Climbers
-----------------------------------------------------------------
climber_id(PK)  climber_nickname  climber_realname  climber_sex  climber_country
  1              nic                   Ann             F          USA
  2              red                   Chris           M          Spain
  3              boy                   Bart            M          USA
  4              bull                  Jan             M          Russia
..

Areas
-----------------------------------------------------------------
area_id(PK)  area_name  area_country  area_notes
  1            areaA         USA         blabla
  2            areaB         USA         blablabla
  3            areaC         Spain        blabla
..
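For reference, the tables above could be written out as DDL. This is only a sketch, using SQLite through Python's sqlite3 module rather than MySQL, and the column types are my guesses from the sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE areas (
        area_id      INTEGER PRIMARY KEY,
        area_name    TEXT NOT NULL,
        area_country TEXT NOT NULL,
        area_notes   TEXT
    );
    CREATE TABLE climbers (
        climber_id       INTEGER PRIMARY KEY,
        climber_nickname TEXT NOT NULL,
        climber_realname TEXT,
        climber_sex      TEXT,
        climber_country  TEXT
    );
    CREATE TABLE routes (
        route_id    INTEGER PRIMARY KEY,
        area_id     INTEGER NOT NULL REFERENCES areas(area_id),
        route_name  TEXT NOT NULL,
        route_grade TEXT
    );
    CREATE TABLE climbs (
        climb_id    INTEGER PRIMARY KEY,
        route_id    INTEGER NOT NULL REFERENCES routes(route_id),
        climber_id  INTEGER NOT NULL REFERENCES climbers(climber_id),
        climb_date  TEXT,
        climb_style TEXT,
        climb_tries INTEGER,
        climb_notes TEXT
    );
""")
# The schema defines exactly the four tables from the listing above.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(len(tables))  # 4
```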
As you can see, some columns have duplicate data.

In the table climbs, the climbing style is duplicated. How should I decide whether a new table for the available styles is better? I know in advance there are only a couple of styles and they won't change.

In the table routes the grades are sometimes duplicated. As there are only 30 different possible grades, which won't change any time soon, creating a new table for the grades seems like overkill to me.

In both tables climbers and areas there's a country column. A user of the system (searching or entering data) would select a country from a dropdown menu, so I'm fairly sure a valid country will be selected (preventing data inconsistencies/anomalies). But still, what if I want to change the name of a country in the future? For example, after a civil war the name of some country changes. In that case a separate table for countries would make updating easier.

I know that further normalization would mean creating more tables. However, the downside is that the queries become more complicated and (from what I understand) slower.

Any advice in this matter would be appreciated.
- Are there problems with this scheme?
- What could be improved?
- Are there general rules to decide how far to go with the normalization?

Posted: Thu Jan 05, 2006 11:34 am
by Chris Corbyn
I'd break your areas table down into separate tables for area and country. The areas table would have an FK to the country table.

The climbers table should also use an FK for the country column rather than duplicating the data too ;)

Other than that you've done it well.

Third normal form is as far as your average developer would take it, and there's rarely a need to go further unless you really are dealing with HUGE amounts of data :) You'd be at 3NF if you make those changes, AFAIK.

Posted: Thu Jan 05, 2006 12:00 pm
by matthijs
I'd break your areas table down into separate tables for area and country. The areas table would have an FK to the country table. The climbers table should also use an FK for the country column rather than duplicating the data too
Ok, I'll do that.
Other than that you've done it well
That's good to hear :)

It's not too hard to follow an example while reading a tutorial on db normalization, but it's something else to apply it to your own situation...
Therefore your suggestions are appreciated, thanks!

Re: How far to go with normalization?

Posted: Thu Jan 05, 2006 5:03 pm
by timvw
matthijs wrote: Climbs: climb_id(PK) route_id(FK) climber_id(FK) climb_date climb_style climb_tries climb_notes
Routes: route_id(PK) area_id(FK) route_name route_grade
Climbers: climber_id(PK) climber_nickname climber_realname climber_sex climber_country
Areas: area_id(PK) area_name area_country area_notes
The model seems to be ok.
matthijs wrote: As you can see, some columns have duplicate data.

In the table climbs, the climbing style is duplicated. When should I decide if a new table for the available styles is better? I know in advance there's only a couple of styles and they wont change.
Is it really duplicated? I suggest you use an ENUM for the style (if you're not already doing so), to enforce that users can only choose one of the available styles.
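MySQL's ENUM restricts a column to a fixed set of allowed values. As a sketch of the same idea, a CHECK constraint does the job in SQLite (the style names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# MySQL would declare: climb_style ENUM('redpoint', 'flash', 'onsight').
# SQLite has no ENUM type, so a CHECK constraint enforces the same rule.
conn.execute("""
    CREATE TABLE climbs (
        climb_id    INTEGER PRIMARY KEY,
        climb_style TEXT NOT NULL
                    CHECK (climb_style IN ('redpoint', 'flash', 'onsight'))
    )
""")
conn.execute("INSERT INTO climbs (climb_style) VALUES ('flash')")  # accepted

rejected = False
try:
    conn.execute("INSERT INTO climbs (climb_style) VALUES ('bicycle')")
except sqlite3.IntegrityError:
    rejected = True  # unknown styles are refused by the constraint
print(rejected)  # True
```

Either way the constraint lives in the database itself, so a bad value can't sneak in even if the application forgets to validate the dropdown input.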

Btw, the foreign keys are duplicate data too. But that's OK, because it's controlled redundancy (they are there for a reason).

matthijs wrote: In both tables climbers and areas there's a column country. A user using the system (searching or putting in data) would select a country from a dropdown menu, so I'm pretty sure a valid country will be selected (to prevent data inconsistencies/anomalies). But still, what if I want to change the name of a country in the future. For example, due to a civil war the name of some country is changed. In that case a seperate table for countries would be easier for updating.
With another table for countries, you could add an ON UPDATE clause to cascade the change into all referencing tables.
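A sketch of that idea, again using SQLite through Python's sqlite3 module (SQLite needs foreign key enforcement switched on per connection; MySQL/InnoDB syntax is essentially the same). Here the climbers table references the country by name, so renaming the country in one place propagates automatically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite: must enable FKs explicitly
conn.executescript("""
    CREATE TABLE countries (
        country_name TEXT PRIMARY KEY
    );
    CREATE TABLE climbers (
        climber_id       INTEGER PRIMARY KEY,
        climber_nickname TEXT,
        climber_country  TEXT REFERENCES countries(country_name)
                         ON UPDATE CASCADE
    );
    INSERT INTO countries VALUES ('Zaire');
    INSERT INTO climbers (climber_nickname, climber_country)
        VALUES ('nic', 'Zaire');
""")
# One UPDATE on the parent row; the cascade fixes every referencing row.
conn.execute("UPDATE countries SET country_name = 'DR Congo' "
             "WHERE country_name = 'Zaire'")
row = conn.execute("SELECT climber_country FROM climbers").fetchone()
print(row[0])  # DR Congo
```

Note that with a surrogate key (a numeric country_id as FK instead of the name), the rename would touch only one row and no cascade would be needed at all.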
matthijs wrote: I know that for a further normalization I would have to create more tables. However, the downside will be that the queries will be more complicated and (from what I understand) slower.
What is most important? A good data model or premature optimisation?

Posted: Fri Jan 06, 2006 3:07 am
by matthijs
I suggest you use an ENUM for the style (if you're not already doing so), to enforce that users can only choose one of the available styles
I hadn't thought about that yet, good idea.
With another table for countries, you could add an ON UPDATE clause to cascade the change into all referencing tables.
That's what d11wtq suggested as well. So I'll certainly make a separate country table.
What is most important? A good data model or premature optimisation?
timvw, I'm not sure if I completely understand this one. Do you mean one excludes the other? Or do you mean that in trying to get a good data model (optimization, more tables etc) one runs the risk of doing premature optimization? And that wouldn't be too wise, as "premature optimization is the root of all evil"?
(I read some discussions about (premature) optimization here, here or here).

Mike Hillyer:
Normalization is the process of removing redundant data from your tables in order to improve storage efficiency, data integrity and scalability. This improvement is balanced against an increase in complexity and potential performance losses from the joining of the normalized tables at query-time
So what I understand at this point - and please correct me - is that if I maximally normalize the data model, the advantages are:
- storage efficiency, data integrity, scalability
But the disadvantages are:
- increased complexity, performance loss

I like to keep things as simple as possible. However, putting in some time and energy up front to prevent headaches later is ok with me. So any thoughts, suggestions or good reading tips in this matter are welcome.

Posted: Fri Jan 06, 2006 7:40 am
by timvw
matthijs wrote: So what I understand at this point - and please correct me - is that if I maximally normalize the data model, the advantages are:
- storage efficiency, data integrity, scalability
But the disadvantages are:
- increased complexity, performance loss
Imho, those arguments (both pro and contra) aren't really about your data model. They are about the implementation of your DBMS.

I don't understand why a normalized model would mean more complexity. If your model isn't normalized you have to take care of redundant data. That is what I call unwanted complexity and performance loss.

Admittedly, with a normalized model you can end up with a more difficult path to solve a query, but that's an internal problem. And if you want to take care of it, you shouldn't do so by changing things at the external or conceptual level.

Posted: Sat Jan 07, 2006 5:01 pm
by matthijs
I don't understand why a normalized model would mean more complexity. If your model isn't normalized you have to take care of redundant data. That is what I call unwanted complexity and performance loss.
A normalized model probably looks more complex because instead of one big table there are several smaller ones. But I understand that dealing with redundant data, or even data anomalies, later on is more complex.
Admittedly, with a normalized model you can end up with a more difficult path to solve a query, but that's an internal problem. And if you want to take care of it, you shouldn't do so by changing things at the external or conceptual level.
How could one take care of the more difficult paths to solve the query then?

And about the performance issue: from what I have read so far I understand that after normalizing the data model, you have more tables to join in the queries and that could be a performance loss when dealing with large amounts of data/traffic. Or isn't that so? Or are there other ways to refactor the model to improve the performance?

[added:] I was thinking about performance. Say I have a table for the users of the system (in my case climbers). If I store a lot of information about them (profiles), maybe it would improve performance to split the table in two: one for the most often used (queried) data, and one for the less often used data (profile details). I'd couple them through a shared primary key.
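That kind of 1:1 split is sometimes called vertical partitioning. A minimal sketch of what it could look like, with invented column names, again in SQLite via Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "Hot" data: touched by almost every query.
    CREATE TABLE climbers (
        climber_id       INTEGER PRIMARY KEY,
        climber_nickname TEXT NOT NULL
    );
    -- "Cold" data: full profile details, 1:1 with climbers via the shared PK.
    CREATE TABLE climber_profiles (
        climber_id INTEGER PRIMARY KEY
                   REFERENCES climbers(climber_id),
        realname   TEXT,
        bio        TEXT
    );
    INSERT INTO climbers VALUES (1, 'nic');
    INSERT INTO climber_profiles VALUES (1, 'Ann', 'long bio text ...');
""")
# The detail table is only joined in when a profile page actually needs it.
row = conn.execute("""
    SELECT c.climber_nickname, p.realname
    FROM climbers c JOIN climber_profiles p USING (climber_id)
""").fetchone()
print(row)  # ('nic', 'Ann')
```

As timvw argues below, whether this split actually pays off is a storage-level (internal) concern, not something the logical model requires.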

Posted: Sat Jan 07, 2006 7:19 pm
by timvw
you have more tables to join in the queries and that could be a performance loss when dealing with large amounts of data/traffic. Or isn't that so?
Instead of "could be" I would even say "will be". Let's go back to the beginning. There are three levels: the external, the logical and the internal. Do you really want to give up that separation of concerns for a small performance gain (which you lose again later when you're updating the redundant data)?
Or are there other ways to refactor the model to improve the performance?
From a pragmatic POV it's true that you can get a performance boost by denormalizing your data (but it's often forgotten that it also costs resources to keep that redundant data up to date).

I tried to make clear that the model (the logical representation) is not concerned with performance. That is a task for the internal representation.
I was thinking about performance. Say I have a table for the users of the system (in my case climbers). If I store a lot of information about them (profiles), maybe it would improve performance to split the table in two: one for the most often used (queried) data, and one for the less often used data (profile details). I'd couple them through a shared primary key.
Do you really want to think about that when you model your data? Or do you want it to be done silently by the DBMS (the internal representation)? Of course it would be nice if you could hint to that layer which data you're going to use often, so it can optimize performance. It would be even nicer if the DBMS itself could analyze your queries and figure out which data it will need often... So I repeat once more: this should not be done at the logical representation of your data.

If you search the web you will find that many have gone through the debate for/against denormalizing the model in return for performance optimisation. I think I've made my POV clear, so I'll rest my case.

Posted: Sun Jan 08, 2006 4:15 am
by matthijs
Thanks for your response. I'm slowly starting to understand it better.

Again, thanks for the explanation! I'll go and do some more reading/homework on the issue now.

Matthijs

Posted: Sun Jan 08, 2006 4:40 pm
by matthijs
Ok, I've reread your post and would like to ask another question, if I may.
Do you really want to think about that when you model your data? Or do you want it to be silently done by the dbms (internal representation)?
I'm not quite sure what you mean here. What exactly is the difference between the model of the data and the internal representation? My data model is the way I order my data in the different database tables, right? But what's the internal representation supposed to be then?

Posted: Sun Jan 08, 2006 7:52 pm
by timvw
matthijs wrote:I'm not quite sure what you mean here. What exactly is the difference between the model of the data and the internal representation? My data model is the way I order my data in the different database tables, right? But what's the internal representation supposed to be then?
I believe the following website explains the different levels of responsibilities: http://db.grussell.org/section002.html.

The internal representation (implemented by your DBMS) decides which data is stored where, how index files are built, and how they are used. More complex DBMSs even have their own filesystem, etc., to improve performance...

Posted: Mon Jan 09, 2006 2:40 am
by matthijs
Thanks for your reply and the link. Most tutorials I've read so far on (MySQL) db design are quite basic, so the replies here are getting me going in the right direction. Thanks.