The importance of a technical ID as primary key for any entities

During my work in a big IT service company, I worked on several legacy projects with an old database. Not surprisingly, the model used on these databases were not flawlessly designed (to say the least). In this article, I want to write about one of the most common issues I encountered in these databases : the primary keys.

But before I talk about that, let me make a quick introduction about the concept of entities.

What exactly is an entity ?

First, it's important not to confuse the concept of an entity with the technical concept of a Java Entity Bean (which is something that was created to model an entity in a Java program, but is not exactly the same thing). An entity is a business concept, something that exists in the business domain. It may be a person, a bank account, a car, or any other object, really. What defines an entity is its identity, rather than its attributes.

The most straightforward example is the one of a person. What exactly makes a person herself? This may seem like a philosophical question, but it's still very important to ask ourselves this question when designing a software and implementing a model that represent a part of the real world. Is a person defined by its name? Its surname? Its social security number? What if two people have the same name? It turns out that there is no set of attribute that allows us to completely and safely define a person. Any attribute of a person may change, or be shared with someone else. And at the same time, it's very important to know which person is being manipulated when performing operations on an object. Are these person and that person the same? Is this person the owner of that bank account? A person is an entity, because it has an identity that extends beyond the life of an object representing it, and beyond the current value of its attribute at a given moment or in a given representation.

On the other hand, let's imagine an object that represents an amount of money. Two obvious attributes would be a number, and a symbol to represent the currency associated with this amount. In this case, it doesn't really matter if two objects are the same or not. Their identity doesn't matter : I have the amount and the currency, I can do whatever is needed of this amount of money. This object is not an entity because it's completely defined by its attributes and it has no identity beyond the values of those attributes.

With that clear, let's move on to the problem itself.

Storing entities in a relational database

Primary keys as the representation of the identity

A relational database is often the de facto standard for persisting entities in a software system. And one of the concept of a relational database is the concept of primary keys. A primary key is the perfect system for representing the identity of an entity because the properties of a primary key are exactly those of the identity of an entity :

  • Every row in the relation has a value for its primary key. This is normal because having a primary key (= an identity) is what defines an entity.
  • For each row, the value of the primary key is unique in the relation. Two rows cannot have the same primary key, which is good because we don't want two rows for one entity.
  • Every row has one and only one primary key. And it's important because that means that if two objects have different keys, they are not the same entity even if all other attributes are identical.

It is clear that primary keys are perfect to model the identity of an entity. Therefore, one of the key tasks when designing a model using a relational database is choosing the primary key to use in each relation (or table).

Defining the primary key : A matter of choosing the right columns?

Let's come back to the Person example. Imagine we have a table Person representing the entity. The table contains the following attributes:

First Name | Family Name | Social Security number | Phone Number | Email address | Postal Address

Which column(s) would you use as a primary key?

  • Some would make a multi-column key using First and Family names, but what about homonyms? I know two people that have the same first name and family name. You could add a middle name, but how sure are you that there are no two people with all their name identical?
  • The phone number is hard to use. Not only are there people that don't have one, there's also a lot of people sharing a phone. Given a phone number, I'm not sure you can find the right person.
  • Some would use the social security number. After all, no one has two social security number, and no one has the same as anyone else. Okay, but are you sure that everyone has a social security number? Are we talking only about the people in your country? What if a kid wants to register?

As you see, it's pretty hard to find the right fit. But a model is not the reality, and in many cases, one of these choices would be good enough. Or at least it would seem good enough. And a choice is made, one that works in the particular software for which the database model is being designed.

The problem is that this severely hinders the ability to refactor the software and make it evolve. If you chose the social security number for example, you make it very difficult to open the service to another country, or to people outside the initial assumption that you made when choosing your primary key.

Moreover, if the business domain model evolves (which is often the case as a project lives) the primary key could change. In that cases, it makes a mess of all your foreign keys. You have to prepare a complex migration procedure, which will cost money and create potential data corruption.

What you should do instead: Creating the identity column

The important thing about an entity, as said earlier, is that it cannot be defined by its representation or the values of its attribute. Therefore, it's pointless to try to find the right fit in the existing columns of a table.

Any entity should have a new technical attribute to define its identity. I call this column the technical ID. We can be sure that it's unique, because it's completely independent from the business domain, so we can generate it as we see fit. It will never change, and it will always be there, because there are no business rules that will put constraints on it.

This seem obvious to many people, but I'm not sure that everyone understands why there is this "ID" column with an auto_increment property. The important keyword here is *technical*. The customer doesn't have to know about this. It's only for the software, so that it can know the identity of an entity (is it a coincidence if identity is spelled *id entity* ?).

Yes, there will still be the need to find a person using whatever information we have at our disposal. Maybe a phone number, maybe a name and a city, whatever. It depends on the business domain. But this is not the identity of the entity, so it shouldn't be a primary key.

Using a technical ID has several advantages:

  • An entity is completely and uniquely defined by its technical ID. You are sure to have the right entity when searching by technical ID in the table.
  • The technical ID is purely technical, and free from any business rules. You are therefore sure that it will never disappear, change, or be manipulated in strange ways by a business rule or a management decision.
  • The technical id can always be a number. It's easier to manipulate and it simplifies the life of many developers.
  • Foreign keys are easier to create, because you're sure the primary key will never change. What a pain to create a foreign key to a table, only to find out later that you need to add a column to this primary key because it's no longer unique. This cannot happen when you use a technical ID. Your relationships always remain healthy and you avoid data corruption.
  • You will never need to use multi-column keys. Multi column keys can be a pain to use with some frameworks, forcing you to create "value objects" when you wouldn't need them otherwise. they can break the genericity of your model. With a technical ID, you don't need to use multi-column keys and all is well!

The case of a business domain ID

Often, the entities of the business domain already have an attribute called "ID" (for example, a customer ID). While it is tempting to use this attribute as the primary key for the table containing this entity, I would advise against it. The reason is that this "ID" column is part of the business domain. You never know when it could change or even disappear! Besides, it's not always an integer and you lose some of the advantages mentioned earlier. In that case, you'll have an ID column (part of the business domain model) and a technical ID used for the purpose of your software project only.

I hope I've convinced you that you should always use a technical ID column, separate from the business domain model, to represent the identity of your entities. For those that already did so, hopefully you now know why this is so important and you'll be able to convince others that it's the right course of action.

With that in mind, let me suggest a few naming rules that I've used several time and that I find to be very practical on most projects. These rules make a model very clear and easy to navigate when you have access only to the database and not to the code. They are very useful for non-technical people for example.

Bonus : naming conventions for primary keys and foreign keys

When designing a relational database model, I keep these rules in mind:

  • Each table corresponding to an entity should have a column named "tid" (for technical id). I don't use the name "ID" because most business domain entities already have an attribute that's called ID (see above). It's not necessary to put the name of the entity in the column. The context where it is used should always make the intended usage clear.
  • For one-to-many relationships, I use a column in the child table called "{parent_entity}_tid". For example, in a relation between a Customer and Orders, the Order table could contain a column called "customer_tid". Using this convention, I immediately know which columns of a table are one-to-many relationships, and to which table they refer to.
  • For many-to-many relationships, I name the association table "{entity_a}_{entity_b}_relationship". The "relationship" part is there to ensure that there's never a confusion between an association table and an entity that just happens to have a name containing other entities' name. In a association table, I dont create the tid column, the primary key is the combination of the two (or more) columns refering to the two sides of the many-to-many association. I name both these columns like normal foreign keys (see previous point).
  • Of course, if your association table has attributes and is part of the business domain model, then name it accordingly and treat it as an entity, creating the tid column.

Conclusion

There is not much more to say. I know I have a tendency to write way too many lines for a simple problem, but I like to be as comprehensive as possible about problems like this, which seem insignificant, but are not in reality. Creating primary keys without thinking can cause many problems of maintainability and evolutivity, while at the same time making the model less clear. If you truly understood the reason why this is necessary, it's really easy to follow this guideline and the benefit is great.

This article is my 5th oldest. It is 2029 words long