Normalization vs. Fewer Joins
Posted: Thu Nov 01, 2007 2:36 pm
Hi all,
I have a theory question about how I should design the database for a project I'm just starting on.
The project is an online questionnaire program. Users can create multiple unique questions, different answers, and different questionnaires. In the first version of this system, I had one table for questions, one table for questionnaire membership, and one table for the answers (I should clarify that "answers" aren't the respondents' answers, but rather the text of the answer choices that appear on the questionnaire). That's fine.
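For concreteness, here's a minimal sketch of what I mean by that first-revision layout, using SQLite. All table and column names here are my own placeholders, not necessarily what you'd use:

```python
import sqlite3

# Sketch of the normalized first-revision schema. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (
    question_id   INTEGER PRIMARY KEY,
    question_text TEXT NOT NULL
);

-- "answers" are the choices printed on the questionnaire,
-- not what respondents picked
CREATE TABLE answers (
    answer_id   INTEGER PRIMARY KEY,
    question_id INTEGER NOT NULL REFERENCES questions(question_id),
    answer_text TEXT NOT NULL
);

CREATE TABLE questionnaires (
    questionnaire_id INTEGER PRIMARY KEY,
    course TEXT,
    year   INTEGER,
    term   TEXT
);

-- membership: which questions appear on which questionnaire
CREATE TABLE questionnaire_questions (
    questionnaire_id INTEGER NOT NULL REFERENCES questionnaires(questionnaire_id),
    question_id      INTEGER NOT NULL REFERENCES questions(question_id),
    PRIMARY KEY (questionnaire_id, question_id)
);
""")
```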
This system also does a whack load of reporting on the response data, & this is where I come to my quandary. The client wants to run reports & be able to tell the app "Only run a report on questionnaires given for this course, in these years, in this term", etc., as well as restrict on what the respondents answered.
In the first revision of this app, the responses were in their own table & only linked by ID to an answer, a question, and a questionnaire. This made doing the reports difficult because I had to write huge queries joining the responses to the other tables in order to restrict on properties of the questionnaire, or on the question text. Reports take a while (tolerable, but I'd like them faster) right now, but when we open up this second revision to multiple institutions, the dataset will get much larger & so will the time to crunch through the report data.
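To show what those queries look like, here's a runnable sketch of the kind of multi-join report the normalized design forces. The schema and names are my own assumptions for illustration:

```python
import sqlite3

# Minimal normalized schema plus a few sample rows (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions      (question_id INTEGER PRIMARY KEY, question_text TEXT);
CREATE TABLE answers        (answer_id INTEGER PRIMARY KEY, question_id INTEGER, answer_text TEXT);
CREATE TABLE questionnaires (questionnaire_id INTEGER PRIMARY KEY, course TEXT, year INTEGER, term TEXT);
CREATE TABLE responses      (response_id INTEGER PRIMARY KEY,
                             questionnaire_id INTEGER, question_id INTEGER, answer_id INTEGER);

INSERT INTO questions      VALUES (1, 'Rate the instructor');
INSERT INTO answers        VALUES (1, 1, 'Excellent'), (2, 1, 'Poor');
INSERT INTO questionnaires VALUES (1, 'CS101', 2007, 'Fall');
INSERT INTO responses      VALUES (1, 1, 1, 1), (2, 1, 1, 1), (3, 1, 1, 2);
""")

# Every restriction on course/year/term or question text pulls in another join.
rows = conn.execute("""
    SELECT q.question_text, a.answer_text, COUNT(*) AS n
    FROM responses r
    JOIN answers        a  ON a.answer_id         = r.answer_id
    JOIN questions      q  ON q.question_id       = r.question_id
    JOIN questionnaires qn ON qn.questionnaire_id = r.questionnaire_id
    WHERE qn.course = 'CS101' AND qn.year = 2007 AND qn.term = 'Fall'
    GROUP BY q.question_text, a.answer_text
    ORDER BY n DESC
""").fetchall()
# rows -> [('Rate the instructor', 'Excellent', 2), ('Rate the instructor', 'Poor', 1)]
```

Before denormalizing, it may be worth checking that the foreign-key columns being joined on (and the columns in the WHERE clause) are actually indexed; missing indexes are a common cause of slow joins like this.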
I'm wondering if it would be wiser to de-normalize just the response data, so the response table doesn't just link to the questionnaire table & all its data, but rather includes all that data in the response table. This would completely remove the need for joins. Unfortunately it would duplicate some data - but not any data that can be modified.
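The denormalized alternative would look something like this: each response row carries copies of the immutable questionnaire/question/answer attributes, so a report filter becomes a single-table scan. Again, the column names are placeholders:

```python
import sqlite3

# Sketch of a denormalized response table. Each row duplicates the
# immutable attributes it would otherwise have to join for.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE responses_flat (
    response_id      INTEGER PRIMARY KEY,
    questionnaire_id INTEGER NOT NULL,
    course        TEXT,     -- copied from questionnaires
    year          INTEGER,  -- copied from questionnaires
    term          TEXT,     -- copied from questionnaires
    question_text TEXT,     -- copied from questions
    answer_text   TEXT      -- copied from answers
);
INSERT INTO responses_flat VALUES
    (1, 1, 'CS101', 2007, 'Fall', 'Rate the instructor', 'Excellent'),
    (2, 1, 'CS101', 2007, 'Fall', 'Rate the instructor', 'Excellent'),
    (3, 1, 'CS101', 2007, 'Fall', 'Rate the instructor', 'Poor');
""")

# The report query needs no joins at all now.
rows = conn.execute("""
    SELECT question_text, answer_text, COUNT(*) AS n
    FROM responses_flat
    WHERE course = 'CS101' AND year = 2007 AND term = 'Fall'
    GROUP BY question_text, answer_text
    ORDER BY n DESC
""").fetchall()
# rows -> [('Rate the instructor', 'Excellent', 2), ('Rate the instructor', 'Poor', 1)]
```

A middle ground would be keeping the normalized tables as the source of truth and populating a flat table like this only for reporting, so the duplication can always be rebuilt from the originals.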
Have I made my point clear? What does everyone think? I know conventional wisdom says to normalize as much as possible, but I'm wondering if the performance hits make that less wise.
Thanks for any & all advice.