Database Development/Talk

Here's a discussion about primary keys that happened in email that's relevant to this page:

Sorry, the point/question I had was why
have an 'id' when the 'id' is inferred from (tagname, pagename) —
e.g. (tagname, pagename) is the true primary key? once you slash out
the id, then the separate table 'tags' becomes useless, too, and you'd
be best w/ just cur_ and old_ ? — PhilipNeustrom

Most relational database design people say that you should use a meaningless primary key and they put all meaningful attributes in
other rows. Not that those guys are always right, and there is some
controversy about it, but it makes sense to me for a couple of
reasons.

Peformance: The most expensive relational operation is a join. When
you use a small integer key (4 bytes) vs a meaningful key (like 40-50
bytes) for tables that are going to be joined, then you minimize the
amount of data that needs to be examined when the join occurs.

Minimize cascading updates/deletes: If a user changes a tag name and
tag name is the primary key, then every table where that key is a
foreign key needs to be updated. If tag name is an attribute rather
than a key, then only the row where the tag is defined needs updating.

Jon said that there was a possibility that there would be some kind of
grouping or hierarchy of tags. I think he was interpreting something
you said earlier. Anyway, that's another reason to separate the tag
definition from the tag/page relation. I put "class" in the tag
definition. This allows the later use of a classification scheme to
arrange tags. If we wanted to go nuts, we could arrange tags into
some kind of parent-child hierarchy. By separating the tag from the
tag/page relationship, that gives us all kinds of flexibility to
express the relations between tags.
—RottenChester

> Most relational database design people say that you should use a > meaningless primary key and they put all meaningful attributes in
> other rows. Not that those guys are always right, and there is some
> controversy about it, but it makes sense to me for a couple of
> reasons.
>
> Peformance: The most expensive relational operation is a join. When
> you use a small integer key (4 bytes) vs a meaningful key (like 40-50
> bytes) for tables that are going to be joined, then you minimize the
> amount of data that needs to be examined when the join occurs.

It seems this would matter when doing a join with the entire primary
key. Aren't a lot of joins going to be on one indexed attribute or
another, e.g. orphaned pages / wanted pages and the outgoing and
incoming attributes of the links table?

>
> Minimize cascading updates/deletes: If a user changes a tag name and
> tag name is the primary key, then every table where that key is a
> foreign key needs to be updated. If tag name is an attribute rather
> than a key, then only the row where the tag is defined needs updating.

This is a good pioint.

>
> Jon said that there was a possibility that there would be some kind of
> grouping or hierarchy of tags. I think he was interpreting something
> you said earlier. Anyway, that's another reason to separate the tag
> definition from the tag/page relation. I put "class" in the tag
> definition. This allows the later use of a classification scheme to
> arrange tags. If we wanted to go nuts, we could arrange tags into
> some kind of parent-child hierarchy. By separating the tag from the
> tag/page relationship, that gives us all kinds of flexibility to
> express the relations between tags.
>
> I hadn't thought about tracking tag changes (cur versus old). I'll
> talk with Jon about that when we get together tonight.

When wouldn't you do this sort of fake primary key business? Note
that none of the current DB uses fake primary keys. Would curPages
have a fake primary key? So it'd be a number that corresponded to the
pagename. Then, each time you'd want information about a page you'd
need to go and figure out the page's ID? It seems like this could
easily go over-board, and you'd want to create a fake index for every
attribute you'd want an index on, so that the index would be smaller?

I'd like to hear more of your thoughts on this. I've seen this
fake-primary-key stuff before, but while designing the database
initially I didn't take it into account because I believed it was a
poor practice. Maybe it would be wise to use natural primary keys and
then figure out where else (based on your answer to the previous
paragraph) we ought to switch to fake keys, then go through the whole
database at the same time and convert things over? —PhilipNeustrom

Let me answer your last question first: Yes, I think a re-design of the Sycamore schema to use integer primary keys would be a good idea
and result in better performance. Yes, it is a bit of a pain for the
programmer, but if you're doing good oo design, the pain is confined
to the object that's mapped to the table and is transparent to the
rest of the application.

> It seems this would matter when doing a join with the entire primary
> key. Aren't a lot of joins going to be on one indexed attribute or
> another, e.g. orphaned pages / wanted pages and the outgoing and
> incoming attributes of the links table?

If the database is designed correctly, every query will have at least
one item in the where clause that is indexed, hopefully more. By
using meaningless primary keys, you give yourself flexibility in the
design of your indexes. For example, since you're using a 100 byte
page title as a primary key for curpages, you must create a 100 byte
wide index for curpages. If you had a synthetic primary key, you
could index the first 20 bytes of the page title and still get almost
the same lookup performance on curpages (since the first 20 bytes are
almost always unique).

> When wouldn't you do this sort of fake primary key business? Note
> that none of the current DB uses fake primary keys. Would curPages
> have a fake primary key? So it'd be a number that corresponded to the
> pagename. Then, each time you'd want information about a page you'd
> need to go and figure out the page's ID? It seems like this could
> easily go over-board, and you'd want to create a fake index for every
> attribute you'd want an index on, so that the index would be smaller?

You can use common sense - if the natural key is short and will never
change, use it. But if the natural key is significantly longer than
an int (4 bytes), or subject to change, then use an integer primary
key and a prefix index on the natural key. In practice, the dbs I
design sometimes have a code as a natural key, so I retain it as the
primary key. Example, NAICS or SIC code, which are short (6-8-digit)
codes for commodities. There's no reason to create a meaningless key
for a SIC code table.

> I'd like to hear more of your thoughts on this. I've seen this
> fake-primary-key stuff before, but while designing the database
> initially I didn't take it into account because I believed it was a
> poor practice. Maybe it would be wise to use natural primary keys and
> then figure out where else (based on your answer to the previous
> paragraph) we ought to switch to fake keys, then go through the whole
> database at the same time and convert things over?

Yep..the general principle is to "skinny up" indexes as much as
possible. Using a primary key that's a number generally helps to do
this, because it gives more flexibility in index design. The skinnier
the index, the faster it is to query and update. —RottenChester

> Let me answer your last question first: Yes, I think a re-design of > the Sycamore schema to use integer primary keys would be a good idea
> and result in better performance. Yes, it is a bit of a pain for the
> programmer, but if you're doing good oo design, the pain is confined
> to the object that's mapped to the table and is transparent to the
> rest of the application.

This should be our goal in the long-term, then!

Thanks for you well-written thoughts! —PhilipNeustrom