Which design do you think runs faster on PostgreSQL?
A. Making a 15 column table of varchars and the like, but putting all TEXT columns in a separate table with an fkey link back to this table. And let's imagine you want to search for the record with ID of "4" but then pull all the rows back, including the stuff from the TEXT columns in the joined table. And let's imagine the tables have 500,000 rows.
B. Making a 15 column table of varchars and the like, and include your TEXT columns in the same table. Again, imagine the same as above -- grab record ID 4 and pull the full record, and there are 500,000 rows in the table.
I mean, in most databases, the way I understand it, when you go down to the physical layer of how those TEXT columns work, they keep a small ID actually in the table column on each row, and that ID goes to a separate, exclusive page block in the database. So, to me, it would seem that option B would run faster because there's no need for the overhead of the fkey join, and because the TEXT columns are not actually occupying any more than an integer space in that column in that given table -- and that integer is a key in the database to a page block somewhere else.
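To make the two layouts concrete, here's a minimal sketch in Python using SQLite (not PostgreSQL, so it won't show the out-of-line TEXT storage being discussed, but it illustrates the two schemas and the extra join option A requires; all table and column names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Option A: TEXT columns split into a child table with an fkey back.
cur.execute("""CREATE TABLE job_a (
    id    INTEGER PRIMARY KEY,
    title VARCHAR(100),
    city  VARCHAR(50))""")
cur.execute("""CREATE TABLE job_a_text (
    job_id      INTEGER REFERENCES job_a(id),
    description TEXT)""")

# Option B: everything, including the TEXT column, in one table.
cur.execute("""CREATE TABLE job_b (
    id          INTEGER PRIMARY KEY,
    title       VARCHAR(100),
    city        VARCHAR(50),
    description TEXT)""")

cur.execute("INSERT INTO job_a VALUES (4, 'Cook', 'Myrtle Beach')")
cur.execute("INSERT INTO job_a_text VALUES (4, 'Long description...')")
cur.execute("INSERT INTO job_b VALUES (4, 'Cook', 'Myrtle Beach', 'Long description...')")

# Option A needs a join to pull the full record for id = 4...
row_a = cur.execute("""SELECT j.title, j.city, t.description
                       FROM job_a j JOIN job_a_text t ON t.job_id = j.id
                       WHERE j.id = 4""").fetchone()

# ...while option B reads it straight from one table.
row_b = cur.execute("""SELECT title, city, description
                       FROM job_b WHERE id = 4""").fetchone()
```

Both queries return the same record; B just skips the join.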
Which Design Is Faster on PostgreSQL?
Re: Which Design Is Faster on PostgreSQL?
You are probably correct that the second option is faster. Regardless, I always advise leaving the optimizations to the database engine and going with the most normalized design. When you encounter real-world performance issues (not speculative ones), optimize to suit your requirements (so the second option is still best).
- volomike
Re: Which Design Is Faster on PostgreSQL?
pytrin wrote:...I always advise leaving the optimizations to the database engine and going with the most normalized design. When you encounter real-world performance issues (not speculative ones), optimize to suit your requirements...
Yeah, I usually follow that approach in my database designs. The way I was taught, there's a spectrum with denormalization on the left and normalization on the right. Denormalized tables mean more potential data redundancy and more integrity checks that have to be done in the app instead of the database, but they give you better read performance. Normalized tables mean little or no data redundancy and fewer required integrity checks (if you set up the table constraints properly), but they trade some read performance for better write performance.
If I know there's an obvious place where I'm going to be doing a lot of table writes, such as call center orders coming in, then I'll shoot for the normalized end of the spectrum.
For everything else, I'll lean more towards the denormalized side, but not a completely denormalized database where everything is in 2 or 3 tables! (Ha! Sounds like a funny xkcd cartoon.)
Then, I check performance. When I find bottlenecks, I move towards a normalized design on those things that need it. If I inherit a database design from someone, and the bottlenecks indicate poor read performance, then I'll do the inverse and move towards a denormalized design on the tables that actually need it.
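As a toy example of that kind of targeted denormalization (again in Python/SQLite, with invented table and column names): if a hot read path constantly joins orders to customers just to display a name, you can cache the name on the order row and accept that the app must keep it in sync on writes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
# Denormalized on purpose: customer_name is redundantly cached on the
# order row so the read-heavy path needs no join. The application now
# owns the job of keeping it consistent with the customers table.
cur.execute("""CREATE TABLE orders (
    id            INTEGER PRIMARY KEY,
    customer_id   INTEGER REFERENCES customers(id),
    customer_name TEXT,
    total         REAL)""")

cur.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
cur.execute("INSERT INTO orders VALUES (10, 1, 'Acme Corp', 99.50)")

# Read-heavy path: one table, no join.
name, total = cur.execute(
    "SELECT customer_name, total FROM orders WHERE id = 10").fetchone()
```

The trade is exactly the spectrum described above: faster reads, slower and more fragile writes.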
But I was on a contract where a guy had me second-guessing myself because he liked to see TEXT values in a separate table, so I thought I'd ask if someone else knew which was best.
I also checked on IRC, and a lot of people unanimously said that if you have, say, 500,000 job listings and a couple of TEXT columns, then yeah, stick them in the same table as the rest of the job listing columns, not in a separate table. One guy also said that PostgreSQL will, on its own, keep a small identifier in the row and move oversized column values out to separate storage when they get too big anyway (that's PostgreSQL's TOAST mechanism).
Re: Which Design Is Faster on PostgreSQL?
You are missing the point. Normalization is not for performance; it's for integrity and relationships. Normalized databases can perform very well for reads if optimized properly, and denormalizing for performance's sake is not very common on the web except for extreme cases -- think Facebook or Wikipedia scale. Even then, Wikipedia's database structure is pretty normalized.
You can get great performance with a normalized structure if you know how to build your indexes and structure your queries properly. In addition you get those other minor things such as data integrity, consistent design and flexibility.
By the way, 500,000 rows is not a large table at all... I have some databases with 10+ million records per table, with some queries joining 4-5 tables and still completing in the microsecond range.
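To illustrate the point about indexes, here's a small Python/SQLite sketch (PostgreSQL's planner is more sophisticated, but the principle is the same): a selective index turns a full-table scan into an index search, which is what keeps joins fast even on large tables. Table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, city TEXT, title TEXT)")
cur.executemany("INSERT INTO jobs VALUES (?, ?, ?)",
                [(i, f"city{i % 100}", f"title{i}") for i in range(10_000)])

# Without an index on city, the plan is a full-table scan.
plan_before = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM jobs WHERE city = 'city42'").fetchall()

cur.execute("CREATE INDEX idx_jobs_city ON jobs (city)")

# With the index in place, the planner searches the index instead.
plan_after = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM jobs WHERE city = 'city42'").fetchall()

print(plan_before[-1][-1])  # the detail string describes a table scan
print(plan_after[-1][-1])   # the detail string mentions idx_jobs_city
```

The exact wording of the plan strings varies by SQLite version, but the scan-versus-index-search distinction is the part that matters.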