Interestingly enough, database normalization comes instinctively to me. I knew about complex SQL joins and unions and subqueries before I read anything about normalization. As I read up on normalization, it was rather exciting to discover that my natural instinct during database design was to hit the fourth or fifth normal form without thinking about it. And since for most of my pre-Python career the number of records I dealt with was measured in the tens of thousands, normalization was a great tool. I was aware that my record sets were smallish, and good database design kept my stuff running fast.
Relational databases are not a panacea that lets you overcome bad code.
It surprises me how many developers I've encountered over the years who complained about the performance issues of normalized data but didn't understand normalization. Instead, they refused to follow any sort of standard: every table seemed to duplicate data, and every query required complex joins for trivial data calls. And usually with record sets in the tens of thousands, not millions or billions. The end result is projects that were, and are, unmaintainable and slow, with or without normalization.
NoSQL is not a panacea that lets you overcome bad code.
Which brings me to the current state of things. NoSQL is a big thing, with its advantages touted in the arenas of speed, reliability, flexible architecture, avoidance of the object-relational impedance mismatch, and just plain ease of development. I've spent a year spinning an XML database stapled on top of MS SQL Server, years using ZODB, and a woefully short time working on MongoDB projects. The sad truth is that XML databases, ZODB, and MongoDB have problems, just like relational databases. And just as with relational databases, the worst of it stemmed not from any issues with the data systems themselves, but from developers and engineers. As with any other tool, you can make terrible mistakes that lead to unmaintainable projects.
So for now, like most of the developers I know, what I like to do is as follows:
- Create a well-normalized database, preferably using PostgreSQL.
- Cache predicted slowdown areas in Redis.
- Use data analysis to spot database bottlenecks and break normalization via specific non-normalized tables.
- Use a queue system like Celery or even cron jobs to populate the non-normalized tables so the user never sees anything slow (see the sketch after this list).
- Cache the results of queries against the specific non-normalized tables in Redis.
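To make the last three steps concrete, here's a minimal sketch of how this can look in a Django project. Everything in it is illustrative rather than from the post: `Order` and `OrderSummary` are hypothetical models (`OrderSummary` being the non-normalized rollup table), the Celery task would be scheduled via Celery beat or cron, and the Django cache is assumed to be backed by Redis (for example via django-redis).

```python
# Sketch only: hypothetical models and names, Django cache assumed to be Redis-backed.
from celery import shared_task
from django.core.cache import cache  # assumed django-redis backend
from django.db.models import Count, Sum

from myapp.models import Order, OrderSummary  # hypothetical models

CACHE_KEY = "order-summary:{customer_id}"
CACHE_TIMEOUT = 60 * 15  # 15 minutes


@shared_task
def rebuild_order_summaries():
    """Populate the non-normalized table out-of-band (Celery beat or cron),
    so users never wait on the expensive aggregate queries."""
    rows = (
        Order.objects
        .values("customer_id")
        .annotate(order_count=Count("id"), total_spent=Sum("total"))
    )
    for row in rows:
        OrderSummary.objects.update_or_create(
            customer_id=row["customer_id"],
            defaults={
                "order_count": row["order_count"],
                "total_spent": row["total_spent"],
            },
        )


def get_order_summary(customer_id):
    """Read path: check Redis first, fall back to the non-normalized table."""
    key = CACHE_KEY.format(customer_id=customer_id)
    summary = cache.get(key)
    if summary is None:
        summary = (
            OrderSummary.objects.filter(customer_id=customer_id).values().first()
        )
        cache.set(key, summary, CACHE_TIMEOUT)
    return summary
```

The normalized tables stay the source of truth; the rollup table and the Redis keys are disposable and can be rebuilt from them at any time.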
The end result is something with the rigidity of a relational database but with the delivery speed of a key/value database. Since I work a lot in Django, this means I get the advantage of most of the Django Packages ecosystem (at this time you lose much of that ecosystem if you go pure NoSQL). You can do the same in Pyramid, Rails, or whatever. Maybe it's a bit conservative, but it works just fine.