A Simple ISAPI Filter for Authentication on IIS
Symfony 2 Crash Course
Installing Xdebug for use with Eclipse or Netbeans on Linux
Native Linux Space Warfare: Freespace 2
Changing Mailman Python Scripts for Virtual Host Support
Using Multi-Byte Character Sets in PHP (Unicode, UTF-8, etc)

ENUMs, User Preferences, and the MySQL SET Datatype

Saturday, 6 March 10, 3:28 pm
The MySQL SET datatype is a very space-efficient way to store sets of binary flags. It's a little like the ENUM type, in that you define a range of possible string values, but unlike the ENUM, a column that holds a SET can have any number of the string values at one time.

It's useful to consider how MySQL stores SETs and ENUMs in order to appreciate what benefits they can bring.

Here's a simple table with an ENUM column and a SET column:
CREATE TEMPORARY TABLE critters ( species ENUM ('cat', 'dog', 'mouse', 'bear', 'elephant') NOT NULL DEFAULT 'cat', eats SET ('cat', 'dog', 'mouse', 'ant', 'bear') NOT NULL DEFAULT 'mouse' );
Any ENUM column always occupies a single byte in a table, and each string defined for the ENUM is mapped to one of the possible values that an 8-bit byte may hold. The string values can be arbitrarily long, and may even contain spaces if desired. Because the ENUM must fit in a single byte, you can only have as many string values as there are possible numbers that can be stored in 8 bits.

Thus in MySQL ENUMs may define no more than 255 different string values: each string value is represented by a number from 1 to 255 depending on where it was placed in the column definition (zero is used when the column gets set to an invalid value).

SETs on the other hand can occupy anywhere from 1 to 8 bytes, depending on how many string values are defined for them. In the simplest case, when we define a set with 8 or less elements (as in the above eats column), the column takes up just one byte. Each bit of this byte represents one of the string values given in the column definition, and thus you can see each string value can be toggled on or off by setting its corresponding bit.

In the above example for instance, we could define a couple of critters like so:
INSERT INTO critters SET species='cat', eats='mouse';
As you may expect, this creates the expected result in the database:
+---------+-------+ | species | eats | +---------+-------+ | cat | mouse | +---------+-------+
Great, this makes sense. We have a critter that is a cat species, and eats mice. Let's add another critter, the mouse. Mice we all know eat ants:
INSERT INTO critters SET species='mouse', eats='ant';
But what about when we want to add a more sophisticated critter, like a dog? Dogs eat cats, but they also eat mice if there aren't any cats available. Here's how we do that:
INSERT INTO critters SET species='dog', eats='cat,mouse';
Here's what we see in the database:
+---------+-----------+ | species | eats | +---------+-----------+ | cat | mouse | | mouse | ant | | dog | cat,mouse | +---------+-----------+
The cool thing here is that this table's data is only taking up 6 bytes on disk - each row is just 2 bytes. So we can store a lot of rows of data like this very efficiently. We can read and write more data at once, and cache more data; the speed-ups of this kind of efficiency are real.

A more valuable speed boost comes from the fact that ENUMs and SETs are both fixed width datatypes. It's always preferable to make a table using only fixed width datatypes when possible, especially for tables that are used frequently and contain large amounts of data. Because each row always occupies the exact same number of bytes, MySQL can very quickly work out where record boundaries are, which means that random access read/write operations are quicker, and it also greatly simplifies table indexing.

That said, the ENUM is by far the most useful of these two datatypes, and there are a couple of drawbacks with SETs which limits their utility.

SETs are basically just a bit finickerty and lack elegance. Firstly, they are not good columns to search on. If you expect to frequently use a column in WHERE clauses, SET is probably not the right choice. For instance, say we want to search our critters table for all critters which eat mice:
SELECT * FROM critters WHERE eats & 4=4;
Or all those which eat cats:
SELECT * FROM critters WHERE eats & 1=1;
Yeah, exactly. WTF? Not only do you need to be confident with bit-wise arithmetic but the order in which string elements are defined in your column definition matters. In the eats column definition, 'cat' is the first in the list, thus it is represented by 1. 'mouse' on the other hand was third in the list, so it is represented by 4, 22.

This is not great from a maintainability perspective. This kind of low-level link to your database implementation is dangerous ground. Say you're using your database in PHP. You could define a bunch of application-wide constants to hold the numerical value of each SET value. This would certainly help, but complex queries are still going to look a bit of a mess, and inserting a new value in the middle of the existing SET values would not be without headaches.

For storing some types of data though, such as boolean preference flags for users of a web site, the SET can fit the bill.

Please enter your comment in the box below. Comments will be moderated before going live. Thanks for your feedback!

Cancel Post

/xkcd/ Schwa