Normalization vs. Fewer Joins

Questions about MySQL, PostgreSQL, and most other databases, as well as using them with PHP, can be asked here.

Moderator: General Moderators

pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Normalization vs. Fewer Joins

Post by pickle »

Hi all,

I have a theory question about how I should design the database for a project I'm just starting on.

The project is an online questionnaire program. Users can create multiple unique questions, different answers, and different questionnaires. In the first version of this system, I had one table for questions, one table for questionnaire membership, and one table for the answers (I should clarify that "answers" aren't the respondents' answers, but rather the text of the answer options that appear on the questionnaire). That's fine.

This system also does a whack load of reporting on the response data, & this is where I come to my quandary. The client wants to be able to run reports & tell the app "only run a report on questionnaires given for this course, in these years, in this term", etc., as well as restrict on what the respondents answered.

In the first revision of this app, the responses were in their own table & linked only by ID to an answer, a question, and a questionnaire. This made the reports difficult because I had to build huge queries joining the given answers to other tables in order to restrict on properties of the questionnaire or the question text. Reports take a while right now (tolerable, but I'd like them faster), but when we open this second revision up to multiple institutions, the dataset will get much larger & so will the time to crunch through the report data.

I'm wondering if it would be wiser to de-normalize just the response data, so the response table doesn't merely link to the questionnaire table & all its data, but instead includes that data directly. This would completely remove the need for joins. Unfortunately it would duplicate some data - but not any data that can be modified.
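To make the trade-off concrete, here is a minimal sketch of the de-normalized idea using SQLite in Python. All table and column names (course, year, term, etc.) are invented for illustration, not the poster's actual schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Normalized: a response only carries foreign keys.
CREATE TABLE response_norm (
    id INTEGER PRIMARY KEY,
    questionnaire_id INTEGER,
    question_id INTEGER,
    answer_id INTEGER
);

-- De-normalized: immutable questionnaire attributes are copied in,
-- so report filters need no join.
CREATE TABLE response_denorm (
    id INTEGER PRIMARY KEY,
    questionnaire_id INTEGER,
    question_id INTEGER,
    answer_id INTEGER,
    course TEXT,
    year INTEGER,
    term TEXT
);
""")
db.execute(
    "INSERT INTO response_denorm VALUES (1, 10, 100, 1000, 'BIO 101', 2006, 'Fall')"
)

# The report restriction becomes a single-table scan, no joins at all:
rows = db.execute(
    "SELECT answer_id FROM response_denorm "
    "WHERE course = ? AND year = ? AND term = ?",
    ("BIO 101", 2006, "Fall"),
).fetchall()
print(rows)
```

The cost, as noted above, is the duplicated course/year/term data in every response row.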

Have I made my point clear? What does everyone think? I know conventional wisdom says to normalize as much as possible, but I'm wondering if the performance hit makes that less wise.

Thanks for any & all advice.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

From a performance perspective, can the tables be indexed further, or in a way that allows faster searching for reporting? I know what you want to do. Oddly, my associate has just written an employee review app that does what yours seems to do, only without the scale.

I, personally, am not a fan of duplicate data and de-normalized architecture.

By the way, what database server is this and what version?
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

It'll be MySQL 5.x. The current system runs on Postgres 7.3.4. The current system is foreign keyed all to heck, but I could probably index some of the other properties.

I'm also investigating InnoDB vs. MyISAM as the storage engine.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

Something else that might work is leaving the underlying structure alone (with proper indexing, of course) and writing procedures that handle the reporting queries and joins in a sort of interface layer.

We do that quite a bit at work, where the underlying data is basically the source of the enterprise. So our DBA handles the interfacing through custom procs that we hit from our applications (whether it be PHP, ODBC in Access, or whatever).
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

I am the DBA. I'm the designer, coder, DBA, administrator, maintainer, sysadmin, and everything else relative to the system that doesn't involve data entry.

I'm not sure I understand what you mean by procedures? No matter how I access the data, I'll still either have to use joins or not.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

Right. We have, for example, a procedure that joins eight tables by use of several other smaller procs that handle smaller subqueries. Of course, this is in Sybase, but I am almost certain that you can use the output of a proc inside another proc.

What our DBA does is essentially joins data sets from proc returns inside one procedure. Does that make any sense?
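Sybase's proc-calling-proc pattern doesn't port directly to MySQL 5, but views can play a similar layering role: each view encapsulates one sub-join, and the reporting query joins the views rather than the raw tables. A minimal sketch in SQLite (all names invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE questionnaire (id INTEGER PRIMARY KEY, course TEXT);
CREATE TABLE question (id INTEGER PRIMARY KEY, questionnaire_id INTEGER, text TEXT);
CREATE TABLE response (id INTEGER PRIMARY KEY, question_id INTEGER, answer TEXT);

-- The smaller "sub-procedure": one view encapsulating one join.
CREATE VIEW question_with_course AS
    SELECT q.id AS question_id, q.text, qn.course
    FROM question q JOIN questionnaire qn ON qn.id = q.questionnaire_id;

-- The reporting layer joins the view, not the raw tables.
CREATE VIEW report AS
    SELECT r.answer, qc.text, qc.course
    FROM response r JOIN question_with_course qc
        ON qc.question_id = r.question_id;
""")
db.execute("INSERT INTO questionnaire VALUES (1, 'BIO 101')")
db.execute("INSERT INTO question VALUES (1, 1, 'Was the course useful?')")
db.execute("INSERT INTO response VALUES (1, 1, 'Yes')")

# The application only ever queries the top-level 'report' view.
rows = db.execute("SELECT answer, course FROM report").fetchall()
print(rows)
```

The joins still happen, but they're defined once in the database layer instead of being rebuilt in every application query.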
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Post by Christopher »

I would first make sure that every column that is in a JOIN, WHERE, GROUP BY or ORDER BY is indexed.
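As a quick illustration of that advice, here is a sketch (hypothetical table and column names) showing an index on a filtered column and SQLite confirming it gets used:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE response ("
    "id INTEGER PRIMARY KEY, questionnaire_id INTEGER, answer_id INTEGER)"
)
# Index the foreign-key columns that reporting queries join and filter on.
db.execute("CREATE INDEX idx_response_questionnaire ON response (questionnaire_id)")
db.execute("CREATE INDEX idx_response_answer ON response (answer_id)")

# Ask the planner how it would run a typical report restriction.
plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM response WHERE questionnaire_id = 1"
).fetchall()
print(plan)  # the plan should mention idx_response_questionnaire
```

In MySQL the equivalent check is `EXPLAIN SELECT ...`; the point is the same: every column in a JOIN, WHERE, GROUP BY, or ORDER BY should be able to hit an index.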

The other thing I have found, in reporting particularly, is that sometimes you don't need to do the join at all. The problem with reports is that you fetch a lot of data back with joined columns -- lots of rows and lots of columns per row. I sometimes pre-fetch the report layout data needed for the report -- in your case, info about the questions and institutions -- into arrays. Then I fetch only the actual data, the responses in your case, with no join. The reason is that in reports you need to loop in PHP to generate the report anyway, so adding some array lookups to create headings, etc., is little additional overhead.
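A minimal sketch of that "join in code" approach, using Python dicts where the PHP version would use arrays (table and column names are illustrative, not the poster's schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE question (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE response (id INTEGER PRIMARY KEY, question_id INTEGER, answer TEXT);
""")
db.execute("INSERT INTO question VALUES (1, 'Was the course useful?')")
db.execute("INSERT INTO response VALUES (1, 1, 'Yes')")
db.execute("INSERT INTO response VALUES (2, 1, 'No')")

# 1) Pre-fetch the small layout data once, into a lookup dict.
question_text = dict(db.execute("SELECT id, text FROM question"))

# 2) Fetch only the big response data, join-free; the "join" happens
#    as a dict lookup inside the loop that builds the report anyway.
report = [
    (question_text[qid], answer)
    for qid, answer in db.execute("SELECT question_id, answer FROM response")
]
print(report)
```

The large result set stays narrow (two columns per row), and the repeated question text is never shipped from the database at all.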
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

Thanks ~arborint. I have gone that route in other systems & it works fine. It seems dirty, though, having to lean on the language to speed up the database. When push comes to shove, I guess ugliness is secondary to performance.

Thanks.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Post by Christopher »

Yeah, it may seem ugly in general, but I think reports are an exception because the program will be iterating over a large data set anyway -- so doing the "join" in code makes sense to reduce the complexity and size of the result data.
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

I do the same, arborint. I find joins are typically very slow, whereas building arrays from two separate queries and cross-referencing them in loops is generally faster.

Of course, I guess it depends on how much data you're going through. But I do this on a forum with about 1.5 million posts, and the entire page (not just the forum part) loads in .01 seconds.
Set Search Time - A google chrome extension. When you search only results from the past year (or set time period) are displayed. Helps tremendously when using new technologies to avoid outdated results.
Kieran Huggins
DevNet Master
Posts: 3635
Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada

Post by Kieran Huggins »

Maybe you could add tables that act as a "join cache" of sorts? That would allow you to keep normalized tables and cache the results from previous joins in separate tables, thus speeding up subsequent reports.
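That "join cache" idea is essentially a hand-rolled materialized view (which MySQL 5 lacks as a built-in), refreshed by the app or a cron job. A sketch with invented names:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE questionnaire (id INTEGER PRIMARY KEY, course TEXT);
CREATE TABLE response (id INTEGER PRIMARY KEY, questionnaire_id INTEGER, answer TEXT);
-- The cache table holds pre-joined rows; the normalized tables stay
-- the authoritative source of truth.
CREATE TABLE report_cache (response_id INTEGER, course TEXT, answer TEXT);
""")
db.execute("INSERT INTO questionnaire VALUES (1, 'BIO 101')")
db.execute("INSERT INTO response VALUES (1, 1, 'Yes')")

# "Refresh" step: rebuild the cache from the normalized tables.
db.execute("DELETE FROM report_cache")
db.execute("""
    INSERT INTO report_cache
    SELECT r.id, q.course, r.answer
    FROM response r JOIN questionnaire q ON q.id = r.questionnaire_id
""")

# Subsequent reports hit the pre-joined cache instead of re-running the join.
rows = db.execute("SELECT course, answer FROM report_cache").fetchall()
print(rows)
```

The join cost is paid once per refresh rather than once per report, and the normalized schema is untouched.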

Is that cheating?