[FoRK] Database Theory Query
andrew at ceruleansystems.com
Fri Aug 27 00:23:25 PDT 2004
On Aug 25, 2004, at 9:49 PM, johnhall wrote:
> So Question: Is there a general wisdom on the use of intelligent keys,
> combination keys, or auto-number fields? Or does it depend on
> attributes of the problem?
This isn't a clear-cut issue. There used to be good reasons to use
serial numbers for keys back when constraint handling was clumsy, even
ten years ago, but most real RDBMSs handle such things smoothly enough
now that the old foreign key issues are moot.
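To make that concrete, here is a minimal sketch (SQLite via Python's
sqlite3; the tables and names are invented for illustration) of the kind
of constraint handling that used to be clumsy: a natural foreign key
whose updates cascade automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in
conn.execute("CREATE TABLE dept (code TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE employee (
        name      TEXT PRIMARY KEY,
        dept_code TEXT NOT NULL
                  REFERENCES dept(code) ON UPDATE CASCADE
    )""")
conn.execute("INSERT INTO dept VALUES ('ENG')")
conn.execute("INSERT INTO employee VALUES ('alice', 'ENG')")

# Renaming the key used to mean manual fix-ups in every child table;
# ON UPDATE CASCADE makes the RDBMS do it for you.
conn.execute("UPDATE dept SET code = 'ENGR' WHERE code = 'ENG'")
print(conn.execute("SELECT dept_code FROM employee").fetchone()[0])  # ENGR
```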
Another reason to use serial numbers is that they make very fast
indexes. Searching an index of 64-bit fields is much faster than
searching an index of varchars. However, for tables with relatively
small rows one can easily make the contrary argument that the reduced
efficiency of block usage causes greater block churn in the block
cache, thereby reducing performance, though it is probably negligible
in most cases as a practical matter. For validation tables, there is
absolutely no reason to use serials and lots of reasons not to, though
I've seen people do it out of apparent reflex. Also, it is pretty lame
to have a primary key index on the serial and another unique index on
the actual key field, and it is certainly not performant if you have a
lot of writing going on, so the specifics of the table relations and
access patterns factor into the decision.
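As a hedged illustration of the validation-table point (SQLite via
Python's sqlite3; the schemas are invented): keying the lookup table on
its natural value makes child rows self-describing, while the
serial-keyed version maintains a redundant unique index and forces a
join on every read that wants the actual value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Natural key: the child row already carries the meaningful value.
    CREATE TABLE status (code TEXT PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         status TEXT REFERENCES status(code));
    INSERT INTO status VALUES ('shipped');
    INSERT INTO orders VALUES (1, 'shipped');

    -- Serial-keyed version: a redundant id, a unique index on the
    -- actual key field, and a join needed to recover the value.
    CREATE TABLE color (id INTEGER PRIMARY KEY, code TEXT NOT NULL UNIQUE);
    CREATE TABLE cars (id INTEGER PRIMARY KEY,
                       color_id INTEGER REFERENCES color(id));
    INSERT INTO color VALUES (1, 'red');
    INSERT INTO cars VALUES (1, 1);
""")

# No join needed with the natural key:
print(conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()[0])

# The serial key forces a join to get the same information:
print(conn.execute("""
    SELECT color.code FROM cars JOIN color ON cars.color_id = color.id
    WHERE cars.id = 1""").fetchone()[0])
```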
These days, I tend to do it both ways in the same database, depending on
what is appropriate. There are minor performance issues, but databases
are smart enough now that there is no practical inconvenience or
difficulty in doing it either way. It is mostly a matter of style,
preference, and old habits. I think the argument for serials is much
weaker than it used to be.
> Issue #2:
> Excessive (to me) normalization. Professor W insists that ALL phone
> numbers must be split into area code, exchange, and number. He also
> wants Social Security Numbers split in 3. Now, I can understand
> splitting the phone number into area code vs. balance of the number,
> 3 columns? Not unless I'm doing a phone system ...
Totally excessive normalization -- was this guy ever an experienced
professional at this?
What he is doing is arguably correct, but broken in practical
implementation. For starters, it slows down the database to have
dozens of tiny fields, particularly when most queries against SSN or
phone numbers will be complete matches against the aggregate of the
fields. The only way to remotely justify this kind of table structure
is if you are working in a narrow application that routinely runs
indexed queries against parts of the field.
The correct way to do this is a single field with a formatting
constraint. That way matches are fast and you can re-format the field
to your heart's desire. Note that if you occasionally want to run
queries against partial fields, you can use functional indexes that
virtually chop up the column to meet the requirements of a given query
-- just as fast as splitting the column into a bunch of little columns
for searching, faster in the general case, and without the nuisance of
a bunch of little columns. Or at least that's how you'd do it in the
real world.
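Here's a sketch of that approach (SQLite via Python's sqlite3; the
schema and phone format are invented, and SQLite has supported
expression indexes since 3.9 -- most serious RDBMSs offer an equivalent
functional index): one phone column with a formatting CHECK, plus an
expression index for the occasional area-code query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        name  TEXT PRIMARY KEY,
        phone TEXT NOT NULL
              CHECK (phone GLOB
                '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]')
    )""")

# Expression index: indexes the area code virtually, no extra column.
conn.execute("CREATE INDEX person_area ON person (substr(phone, 1, 3))")

conn.execute("INSERT INTO person VALUES ('alice', '206-555-0100')")
try:
    conn.execute("INSERT INTO person VALUES ('bob', '5550100')")
except sqlite3.IntegrityError:
    print("bad format rejected")  # the CHECK does the validation

# Partial-field query, served by the expression index:
rows = conn.execute(
    "SELECT name FROM person WHERE substr(phone, 1, 3) = '206'").fetchall()
print(rows[0][0])  # alice
```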
> Note that students are also factoring out the phone number into a
> separate table (in case multiple people are room mates, for example,
> share the same phone number). But I think that is a horrible idea, and
> if you have room mates their phone number is the same VALUE but
> different DATA (you do want to be able to update and change it at any
> time, right?).
You are correct. I'm starting to suspect that Professor W is a bozo.
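The VALUE-vs-DATA trap is easy to demonstrate. In this invented sketch
(SQLite via Python's sqlite3), two roommates point at one shared phone
row, so when one of them "changes their number" the other's number
silently changes too:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE phone (id INTEGER PRIMARY KEY, number TEXT);
    CREATE TABLE person (name TEXT PRIMARY KEY,
                         phone_id INTEGER REFERENCES phone(id));
    INSERT INTO phone VALUES (1, '206-555-0100');
    -- Roommates share the same VALUE, so the students point both
    -- rows at one shared phone row:
    INSERT INTO person VALUES ('alice', 1), ('bob', 1);
""")

# Alice moves out and updates "her" number...
conn.execute("UPDATE phone SET number = '425-555-0199' WHERE id = 1")

# ...and Bob's number changes out from under him.
print(conn.execute(
    "SELECT number FROM phone JOIN person ON phone.id = person.phone_id "
    "WHERE person.name = 'bob'").fetchone()[0])  # 425-555-0199
```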
There is a progression when learning how to build real databases that
many people never complete. At some point, you learn all the theory of
normalization and ideal database design, and apparently Professor W has
taken it to heart. Unfortunately, at this point you are still a total
amateur, because becoming a working professional is all about learning
that the theory breaks down in practice: the model doesn't even
consider runtime trade-offs and dynamics, which matter far more to a
business than some ivory tower notion of the perfect schema.
If "ideal" structure means integer factor reductions in performance, it
isn't ideal by any metric that will fly in the real world, particularly
if the "sub-optimal" data model is still reasonably clean.
When I interview DBAs and database folks, I don't test them on how well
they normalize data but on whether they can break theoretically ideal
designs to maximize the system and solve problems. You can learn how
to normalize from a book, but knowing how to do smartly engineered
denormalization and parametric asymmetries to solve problems is
usually something people learn from experience or from other highly
experienced folks. It's hard to fake that part.
> Does a 1 semester course go further than that? What are the most
> sophisticated things you think a student should be (a) taught and (b)
> remember after taking a single undergraduate course in databases?
Realistically, they should know joins, aggregates, and how to build
moderately complex queries with semi-optimal WHERE clauses. That will
get you most of what you need anyway.
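For a sense of that bar, here is a hedged example (SQLite via Python's
sqlite3, invented schema and data): a join plus an aggregate behind a
selective WHERE clause, i.e. the kind of moderately complex query a
student should be able to write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(id),
                         total REAL);
    INSERT INTO customer VALUES (1, 'west'), (2, 'east');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 99.0);
""")

# Per-region order counts and revenue, with a selective predicate that
# is applied before the aggregate runs:
rows = conn.execute("""
    SELECT c.region, count(*) AS n, sum(o.total) AS revenue
    FROM orders o JOIN customer c ON o.customer_id = c.id
    WHERE o.total < 50
    GROUP BY c.region
""").fetchall()
print(rows)  # [('west', 2, 25.0)]
```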
j. andrew rogers