[FoRK] Database Theory Query

J.Andrew Rogers andrew at ceruleansystems.com
Fri Aug 27 00:23:25 PDT 2004

On Aug 25, 2004, at 9:49 PM, johnhall wrote:
> So Question: Is there a general wisdom on the use of intelligent keys,
> combination keys, or auto-number fields?  Or does it depend on
> attributes of the problem?

This isn't a clear cut issue.  There used to be good reasons to use 
serial numbers for keys when constraint handling was clumsy even ten 
years ago, but most real RDMBS handle such things very smoothly now 
such that the old foreign key issues are moot.

Another reason to use serial numbers is that they make very fast 
indexes.  Searching an index of 64-bit fields is much faster than 
searching an index of varchars.  However, for tables with relatively 
small rows one can easily make the contrary argument that the reduced 
efficiency of block usage causes greater block churn in the block 
cache, thereby reducing performance, though it is probably negligible 
in most cases as a practical matter.  For validation tables, there is 
absolutely no reason to use serials and lots of reasons not to, though 
I've seen people do it out of apparent reflex.  Also, it is pretty lame 
to have a primary key index on the serial and another unique index on 
the actual key field and certainly not performant if you have a lot of 
writing going on, so the specifics of the table relations and access 
patterns factor into the decision.

These days, I tend to do it both ways in the same database depending on 
what is appropriate.  There are minor performance issues, but databases 
are smart enough these days as a practical matter that there is no 
inconvenience or difficulty by doing it either way.  It is mostly a 
matter of style, preference, and old habits.  I think that the argument 
for serials is a lot weaker than it used to be.

> Issue #2:
> Excessive (to me) normalization.  Professor W insists that ALL phone
> numbers must be split into area code, exchange, and number.  He also
> wants Social Security Numbers split in 3.  Now, I can understand
> splitting the phone number into area code vs. balance of the number, 
> but
> 3 columns?  Not unless I'm doing a phone system ...

Totally excessive normalization -- was this guy ever an experienced 
professional at this?

What he is doing is arguably correct, but broken in practical 
implementation.  For starters, it slows down the database to have 
dozens of tiny fields, particularly when most queries against SSN or 
phone numbers will be complete matches against the aggregate of the 
fields.  The only way one can remotely justify this kind of table 
structure is if you are working in a narrow application that routinely 
indexes queries against parts of the field.

The correct way to do this is a single field with a formatting 
constraint.  That way matches are fast and you can re-format the field 
to your heart's desire.  Note that if you do want to occasionally run 
queries against partial fields that you can use functional indexes that 
will chop up the column virtually to meet the requirements of some 
query -- just as fast as splitting the column into a bunch of little 
columns for searching, faster in the general case, and without the 
nuisance of a bunch of little columns.  Or at least that's how you'd do 
it in the real world.

> Note that students are also factoring out the phone number into a
> separate table (in case multiple people are room mates, for example, 
> and
> share the same phone number).  But I think that is a horrible idea, and
> if you have room mates their phone number is the same VALUE but
> different DATA (you do want to be able to update and change it at any
> time, right?).

You are correct.  I'm starting to suspect that Professor W is a bozo.

There is a progression when learning how to build real databases that 
many people never complete.  At some point, you learn all the theory of 
normalization and ideal database design, and apparently Professor W has 
taken it to heart.  Unfortunately, at this point you are still a total 
amateur because becoming a working professional is all about learning 
that the theory is broken in practice in that the model doesn't even 
consider runtime trade-offs and dynamics, which are far more important 
to business than having some ivory tower notion of the perfect schema.  
If "ideal" structure means integer factor reductions in performance, it 
isn't ideal by any metric that will fly in the real world, particularly 
if the "sub-optimal" data model is still reasonably clean.

When I interview DBAs and database folks, I don't test them on how well 
they normalize data but on if they can  break theoretically ideal 
designs to maximize the system and solve problems.  You can learn how 
to normalize from a book, but knowing how to do a smartly engineered 
denormalizations and parametric asymmetries to solve problems is 
usually something people learn from experience or from other highly 
experienced folks.  Its hard to fake that part.

> Does a 1 semester course go further than that?  What are the most
> sophisticated things you think a student should be (a) taught and (b)
> remember after taking a single undergraduate course in databases?

Realistically, they should know joins, aggregates, and how to build 
moderately complex queries with semi-optimal WHERE clauses.  That will 
get you most of what you need anyway.

j. andrew rogers

More information about the FoRK mailing list