
Normalization vs. Fewer Joins

Posted: Thu Nov 01, 2007 2:36 pm
by pickle
Hi all,

I have a theory question about how I should design the database for a project I'm just starting on.

The project is an online questionnaire program. Users can create multiple unique questions, different answers, and different questionnaires. In the first version of this system, I had one table for questions, one table for questionnaire membership, and one table for answers (I should clarify that "answers" aren't the respondents' answers, but rather the text of the answer choices that appear on the questionnaire). That's fine.

This system also does a whack load of reporting on the response data, and this is where I come to my quandary. The client wants to be able to run reports and tell the app "only report on questionnaires given for this course, in these years, in this term", etc., as well as restrict on what the respondents answered.

In the first revision of this app, the responses were in their own table & only linked by ID to an answer, a question, and a questionnaire. This made doing the reports difficult because I had to make huge queries to link up the answers that were given to other tables in order to restrict on properties relative to the questionnaire, or the question text. Reports take a while (tolerable, but I'd like them faster) right now, but when we open up this second revision to multiple institutions, the dataset will get much larger & so will the time to crunch through the report data.

I'm wondering if it would be wiser to de-normalize just the response data, so that the response table doesn't merely link by ID to the questionnaire table & all its data, but instead includes that data directly in each response row. This would completely remove the need for joins. Unfortunately it would duplicate some data - but only data that can't be modified once written.
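To make the trade-off concrete, here is a minimal sketch (SQLite via Python for portability; the table and column names like `course`, `term`, and `responses_flat` are assumptions for illustration, not the actual schema):

```python
import sqlite3

# Minimal sketch of normalized vs. de-normalized responses. Table and
# column names are invented for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
    -- Normalized: a response holds only foreign keys.
    CREATE TABLE questionnaires (id INTEGER PRIMARY KEY,
                                 course TEXT, year INTEGER, term TEXT);
    CREATE TABLE responses (id INTEGER PRIMARY KEY,
                            questionnaire_id INTEGER, answer_id INTEGER);
    -- De-normalized: immutable questionnaire attributes copied into each response.
    CREATE TABLE responses_flat (id INTEGER PRIMARY KEY, answer_id INTEGER,
                                 course TEXT, year INTEGER, term TEXT);
""")
db.execute("INSERT INTO questionnaires VALUES (1, 'CS101', 2007, 'Fall')")
db.execute("INSERT INTO responses VALUES (1, 1, 42)")
db.execute("INSERT INTO responses_flat VALUES (1, 42, 'CS101', 2007, 'Fall')")

# Normalized report: needs a join to restrict on course/year/term.
joined = db.execute("""
    SELECT COUNT(*) FROM responses r
    JOIN questionnaires q ON q.id = r.questionnaire_id
    WHERE q.course = 'CS101' AND q.year = 2007 AND q.term = 'Fall'
""").fetchone()[0]

# De-normalized report: same restriction, no join.
flat = db.execute("""
    SELECT COUNT(*) FROM responses_flat
    WHERE course = 'CS101' AND year = 2007 AND term = 'Fall'
""").fetchone()[0]
```

The flat table trades extra storage and insert-time copying for join-free reads, which is only safe because the copied columns never change after a response is recorded.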

Have I made my point clear? What does everyone think? I know conventional wisdom says to normalize as much as possible, but I'm wondering if the performance hit makes that less wise.

Thanks for any & all advice.

Posted: Thu Nov 01, 2007 2:44 pm
by RobertGonzalez
From a performance perspective, can the tables be indexed further, or in a way that allows for faster searching for reporting? I know what you want to do. Oddly, my associate has just written an employee review app that does what yours seems to do only without the scale.

I, personally, am not a fan of duplicate data and de-normalized architecture.

By the way, what database server is this and what version?

Posted: Thu Nov 01, 2007 2:50 pm
by pickle
It'll be MySQL 5.x. The current system runs on Postgres 7.3.4. The current system is foreign keyed all to heck, but I could probably index some of the other properties.

I'm also investigating using the InnoDB vs. MyISAM format.

Posted: Thu Nov 01, 2007 2:55 pm
by RobertGonzalez
Something that might work also is leaving the underlying structure alone (with proper indexing of course) and writing procedures that handle the reporting queries and joins in a sort of interface layer.

We do that quite a bit at work, where the underlying data is basically the source of the enterprise. So our DBA handles the interfacing through custom procs that we hit from our applications (whether it be PHP, ODBC in Access, or whatever).
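A related way to get that kind of interface layer without stored procs is to put the join behind a view (MySQL 5.x supports both views and procedures). A rough sketch, using SQLite via Python with invented names:

```python
import sqlite3

# Sketch of an "interface layer": the application queries a view that
# hides the join, so the join logic lives in exactly one place.
# Table and column names here are invented for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE questionnaires (id INTEGER PRIMARY KEY, course TEXT, year INTEGER);
    CREATE TABLE responses (id INTEGER PRIMARY KEY,
                            questionnaire_id INTEGER, answer_id INTEGER);
    -- The join is defined once; reports just SELECT from the view.
    CREATE VIEW report_responses AS
        SELECT r.id, r.answer_id, q.course, q.year
        FROM responses r JOIN questionnaires q ON q.id = r.questionnaire_id;
""")
db.execute("INSERT INTO questionnaires VALUES (1, 'CS101', 2007)")
db.execute("INSERT INTO responses VALUES (1, 1, 42)")

rows = db.execute(
    "SELECT course, year FROM report_responses WHERE answer_id = 42"
).fetchall()
```

The underlying tables stay normalized; only the query interface changes.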

Posted: Thu Nov 01, 2007 3:12 pm
by pickle
I am the DBA. I'm the designer, coder, DBA, administrator, maintainer, sysadmin, and everything else relative to the system that doesn't involve data entry.

I'm not sure I understand what you mean by procedures? No matter how I access the data, I'll still either have to use joins or not.

Posted: Thu Nov 01, 2007 3:47 pm
by RobertGonzalez
Right. We have, for example, a procedure that joins eight tables by use of several other smaller procs that handle smaller subqueries. Of course, this is in Sybase, but I am almost certain that you can use the output of a proc inside another proc.

What our DBA does is essentially joins data sets from proc returns inside one procedure. Does that make any sense?

Posted: Thu Nov 01, 2007 4:11 pm
by Christopher
I would first make sure that every column that is in a JOIN, WHERE, GROUP BY or ORDER BY is indexed.
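Concretely, you can verify the index is actually being used with EXPLAIN (sketched here with SQLite's EXPLAIN QUERY PLAN via Python; MySQL's EXPLAIN plays the same role, and the column names are assumptions):

```python
import sqlite3

# Create an index on a column used in WHERE/JOIN, then check the query
# plan to confirm it is used. Column names are invented for illustration.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE responses (id INTEGER PRIMARY KEY,
                                      questionnaire_id INTEGER,
                                      answer_id INTEGER)""")
db.execute("CREATE INDEX idx_responses_qnnaire ON responses (questionnaire_id)")

# The plan should report a SEARCH ... USING INDEX, not a full-table SCAN.
plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM responses WHERE questionnaire_id = 7"
).fetchall()
```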

The other thing I have found, in reporting particularly, is that sometimes you don't need to do the join at all. The problem with reports is that you fetch a lot of data back with joined columns -- lots of rows and lots of columns per row. I sometimes pre-fetch the report layout data needed for the report -- in your case, info about the questions and institutions -- into arrays. Then I fetch only the actual data, the responses in your case, with no join. The reason is that in reports you need to loop in PHP to generate the report anyway, so adding some array lookups to create headings, etc. is little additional overhead.
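The "join in code" idea above might look something like this (a Python sketch; the question text and IDs are invented to keep it self-contained):

```python
# Pre-fetch the small lookup tables into dicts keyed by id, then scan the
# big response set join-free. All data here is invented for illustration.
questions = {10: "Was the course well organized?", 11: "Rate the instructor."}
questionnaires = {1: {"course": "CS101", "term": "Fall"}}

# (question_id, questionnaire_id, answer) rows, as fetched with no join.
responses = [(10, 1, "Agree"), (11, 1, "5")]

report = []
for q_id, qn_id, answer in responses:
    # Two O(1) dict lookups replace what the SQL join would have done.
    report.append((questions[q_id], questionnaires[qn_id]["course"], answer))
```

The database ships only the narrow response rows; the wide joined columns are reconstructed from the pre-fetched lookups during the loop you were going to run anyway.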

Posted: Thu Nov 01, 2007 4:15 pm
by pickle
Thanks ~arborint. I have gone that route in other systems & it works fine. It seems dirty, though, having to lean on the language to speed up the database. When push comes to shove, I guess ugliness is secondary to performance.

Thanks.

Posted: Thu Nov 01, 2007 7:00 pm
by Christopher
Yeah, it may seem ugly in general, but I think reports are an exception because the program will be iterating over a large data set anyway -- so doing the "join" in code makes sense to reduce the complexity and size of the result data.

Posted: Thu Nov 01, 2007 7:26 pm
by s.dot
I do the same, arborint. I find joins are typically very slow, whereas creating arrays to store the data from two separate queries and cross-referencing them in loops is generally faster.

Of course, I guess it depends on how much data you're going through. But I do this on a forum with about 1.5 million posts and the entire page loads (not just the forum part) in .01 seconds.

Posted: Thu Nov 01, 2007 9:41 pm
by Kieran Huggins
Maybe you could add tables that act as a "join cache" of sorts? That would allow you to keep normalized tables and cache the results from previous joins in separate tables, thus speeding up subsequent reports.

Is that cheating?
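Roughly what I have in mind (a sketch in SQLite via Python; the table names and schema are invented):

```python
import sqlite3

# A "join cache": run the expensive join once, materialize the result into
# a flat table, and serve later reports from it. The cache must be rebuilt
# (or incrementally updated) when new responses arrive.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE questionnaires (id INTEGER PRIMARY KEY, course TEXT, year INTEGER);
    CREATE TABLE responses (id INTEGER PRIMARY KEY,
                            questionnaire_id INTEGER, answer_id INTEGER);
""")
db.execute("INSERT INTO questionnaires VALUES (1, 'CS101', 2007)")
db.execute("INSERT INTO responses VALUES (1, 1, 42)")

# The one expensive join, cached as a flat table.
db.execute("""
    CREATE TABLE report_cache AS
    SELECT r.id AS response_id, r.answer_id, q.course, q.year
    FROM responses r JOIN questionnaires q ON q.id = r.questionnaire_id
""")

# Subsequent reports hit the cache with no join at all.
cached = db.execute(
    "SELECT course, year FROM report_cache WHERE answer_id = 42"
).fetchall()
```

The source tables stay fully normalized; the cache table is essentially a hand-rolled materialized view.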