Karlsson on databases and stuff: September 2013

Friday, September 27, 2013

Optionutils 2.7 released

This is not a big deal, but I have published a new version of my generic option file and commandline handler. If you have used any of my C programs that are on sourceforge, I think you have seen that most of them use this, in one version or another. It has been a long time since I did a generic upgrade of this though, and each utility seems to have had its own version embedded, so I have now created a version with most of the new things in it, and I have also updated the documentation.

The cool thing with this library is that it is generic and largely follows the MySQL / MariaDB style of options and config files. Handling commandline options and config files in C is otherwise something you largely have to do yourself. This little library handles all that for you, and it is also quite advanced and has many cool options, such as support for integers and strings with proper type checking, arrays of values and a kind of associative array, as well as configuration files in the MySQL format.

Version 2.7 of optionutils, including full documentation, is now available for download at sourceforge.

/Karlsson

Using 4-byte UTF-8 (aka 3-byte UNICODE) in MariaDB and MySQL

As I wrote in a previous post, MariaDB / MySQL has some issues with the standard utf8 character set. This UTF-8 encoding limits us to 3 UTF-8 bytes, or 2 UNICODE bytes if you want to look at it that way. This is slightly limiting, but for languages it is usually pretty much OK, although there are some little-used languages in the 3-byte UNICODE range. But in addition to languages, you will be missing symbols, such as smileys!

Help is on the way though, in the utf8mb4 character set that is part of both MariaDB and MySQL. This is a character set that is just like the one just called utf8, except this one accepts all the UNICODE characters with up to 3 UNICODE bytes, or 4 bytes using the UTF-8 encoding.

This means that there are more limits to how long a column might be when using utf8mb4 compared to latin1 and even when compared to utf8. Let's try it:

First, I don't want MySQL to substitute any data types for me (why I would EVER want that is beyond me):
mysql> set sql_mode='strict_all_tables';
Query OK, 0 rows affected (0.00 sec)
Then we can create a table to hold a message using the latin1 character set:
mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(65528) CHARACTER SET latin1);
Now, we know this isn't terribly international, so let's go with UTF-8 instead:
mysql> DROP TABLE msg;
Query OK, 0 rows affected (0.12 sec)

mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(65528) CHARACTER SET utf8);
ERROR 1074 (42000): Column length too big for column 'msgtext' (max = 21845); use BLOB or TEXT instead
OK, that didn't work, so we'll reduce the size of the msgtext column and try again:
mysql> DROP TABLE msg;
Query OK, 0 rows affected (0.05 sec)

mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(21842) CHARACTER SET utf8);
Query OK, 0 rows affected (0.29 sec)
OK, so now we have a table with a UTF-8 column in it. Let's insert some data into it:
mysql> INSERT INTO msg VALUES(NULL, 'Some test message');
Query OK, 1 row affected (0.04 sec)

mysql> INSERT INTO msg VALUES(NULL, 'Some test message with utf8 åäö');
Query OK, 1 row affected (0.01 sec)
Right. That was OK. But what happens if we insert a 4-byte UTF-8 character into this table, say a "smiling face with horns", which is UNICODE 01F608, or F09F9888 as UTF-8? We do this using the MySQL UNHEX function:
mysql> INSERT INTO msg VALUES(NULL, CONCAT('Some test message with 3-byte UNICODE ', UNHEX('F09F9888')));
ERROR 1366 (22007): Incorrect string value: '\xF0\x9F\x98\x88' for column 'msgtext' at row 1
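
The bytes in that error message can be cross-checked outside the database. This is just a minimal sketch in Python (my choice for illustration, nothing in MySQL or MariaDB depends on it) that encodes UNICODE 01F608 as UTF-8 and prints the result as hex:

```python
# Encode U+1F608 ("smiling face with horns") as UTF-8.
devil = chr(0x1F608)
encoded = devil.encode("utf-8")

# The hex digits should match the argument passed to UNHEX() above.
print(encoded.hex().upper())
# The number of bytes in the encoding, one more than the utf8 character set accepts.
print(len(encoded))
```
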

Hmm, too bad, back to the drawing board and let's use the utf8mb4 character set:
mysql> DROP TABLE msg;
Query OK, 0 rows affected (0.08 sec)

mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(21842) CHARACTER SET utf8mb4);
ERROR 1074 (42000): Column length too big for column 'msgtext' (max = 16383); use BLOB or TEXT instead
Close, but no cigar I guess. We need to further reduce the size of the msgtext column, as each utf8mb4 character potentially takes up 1 more byte than when using utf8. So try this instead:
mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(16382) CHARACTER SET utf8mb4);
Query OK, 0 rows affected (0.29 sec)
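
The numbers in the two ERROR 1074 messages, and the column sizes that did work, look like plain integer division. The sketch below is in Python and rests on assumptions that are mine and not part of the session above: a maximum row size of 65535 bytes, 4 bytes for the INT column, a 2-byte length prefix for the VARCHAR and 1 byte for its NULL flag.

```python
ROW_LIMIT = 65535      # assumed maximum row size in bytes
INT_BYTES = 4          # assumed storage for the msgid INT column
LENGTH_PREFIX = 2      # assumed length prefix of a long VARCHAR
NULL_FLAG = 1          # assumed byte holding the NULL flag for msgtext

def max_varchar_chars(bytes_per_char, overhead=0):
    """Largest VARCHAR length, in characters, that fits in the row."""
    return (ROW_LIMIT - overhead) // bytes_per_char

overhead = INT_BYTES + LENGTH_PREFIX + NULL_FLAG
for name, width in (("latin1", 1), ("utf8", 3), ("utf8mb4", 4)):
    # First number: the "max =" value from the error message.
    # Second number: what is left once the other bytes in the row are counted.
    print(name, max_varchar_chars(width), max_varchar_chars(width, overhead))
```

Under those assumptions the first number per character set corresponds to the "max =" value in the error messages, and the second to the lengths used in the CREATE TABLE statements that succeeded.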
OK, this worked better, now let's try that insert again:
mysql> INSERT INTO msg VALUES(NULL, CONCAT('Some test message with 3-byte UNICODE ', UNHEX('F09F9888')));
Query OK, 1 row affected (0.00 sec)
Cool, now we have some data in there, let's then just check if this really is a 4-byte UTF-8 encoded character in there.
mysql> SELECT LENGTH(msgtext), CHAR_LENGTH(msgtext) FROM msg;
+-----------------+----------------------+
| LENGTH(msgtext) | CHAR_LENGTH(msgtext) |
+-----------------+----------------------+
|              43 |                   40 |
+-----------------+----------------------+
1 row in set (0.00 sec)
As can be seen, LENGTH (which returns the length of the column value in bytes) returns a value 3 larger than the number of characters, which is just right, as one of the characters takes up 4 bytes and all the others 1 byte each!
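
The same difference between bytes and characters can be reproduced outside MySQL. A minimal Python sketch, where the ASCII prefix is an arbitrary string of my own and only the difference between the two lengths matters:

```python
# An ASCII-only prefix followed by one character outside the 3-byte UTF-8 range.
text = "Some test message " + chr(0x1F608)

char_length = len(text)                   # counts characters, like CHAR_LENGTH()
byte_length = len(text.encode("utf-8"))   # counts UTF-8 bytes, like LENGTH()

# Every ASCII character is 1 byte, so the whole difference comes from the smiley.
print(byte_length - char_length)
```
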

So in conclusion, as utf8mb4 really is a superset of utf8, should I always use utf8mb4? No, you shouldn't, as in most cases you don't need those 3-byte UNICODE characters in strings for names, places, companies etc. But in this case, when it comes to some free format text that the user enters, then it might be useful. Also, as I have pointed out before, even though this is VARCHAR, which is variable length, there are situations where the full possible length is used, such as when the table data is sorted. And finally, as for collations, these are slightly different than those for utf8, but they are all there, like utf8mb4_bin, utf8mb4_general_ci, utf8mb4_unicode_ci etc.

I'm really bored with character sets and UNICODE and stuff like that now, so my next blog will NOT be about the fascinating subject of Byte Order Marks (BOM).

/Karlsson

How MariaDB and MySQL make life with UTF-8 a bit too easy. And how to fix it...

UNICODE is getting more and more traction and most new applications, at least web applications, support UNICODE. I have written about UNICODE and related stuff before in Character sets, Collations, UTF-8 and all that, but before I go into some more specifics, and some issues and fixes, let me tell you about UNICODE, UTF-8 and how MySQL interprets it. See the blogpost mentioned above for more information on the, surprisingly, even more boring subject of Collations.

So, let's begin with UNICODE. UNICODE is a character set that is very complete, and you should be able to make yourself understood in any language using the characters from this vast character set. This is not to say that all characters from all languages are in UNICODE; some are missing here and there, and sometimes new characters make their way into languages, like different currency signs and stuff like that. In general though, you can use the more than 100,000 characters in UNICODE to express text in any language.

So, now that we know UNICODE, what are UTF-8, UTF-16 and all that? Those are ways to represent UNICODE characters as bytes. The smartest, and most common, of these, by far, is UTF-8. To represent the highest value of a UNICODE character, you would need 3 bytes for every character, but this wastes space. Also, this would break all old program code that assumes that a character is 1 byte only (with 7 bits representing the good old US ASCII characters, although there are local variations of this, or interpreting up to 8 bits as latin1). Ken Thompson fixed this for us though with the UTF-8 representation.

UTF-8 is variable length, and the way this works is that if the first (highest) bit is 0, then this is a character in the range 0-127. If you stick with US ASCII, you can say this is actually UNICODE represented as UTF-8, as the "first" 128 characters in UNICODE are the same as 7-bit US ASCII. No issues there. When there is more than 1 byte to the character, then at least the first bit is 1, and the number of leading 1's, up to the first 0, tells how many bytes the character occupies in total. So the first byte either starts with 0, in which case there is 1 byte only, or with 110 or even more leading 1's (10 would mean 1 byte, but that case is already covered by a leading 0). This in turn means that the first byte never starts with the bit pattern 10, so 10 is used as the leading bit pattern for all of the following bytes. Using this scheme you can't represent as many bytes as you want, as you run out of 1's in the first byte, but there are enough bytes for UNICODE at least! And it's US ASCII compatible.
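
The leading-bit scheme is easy to see by printing the bits. Here is a minimal Python sketch; the function name sequence_length is made up for this illustration and the sample characters are my own picks:

```python
def sequence_length(lead_byte):
    """Total number of bytes announced by a first byte, or 0 for a following byte."""
    if lead_byte & 0b10000000 == 0:
        return 1                      # leading 0: a single-byte (US ASCII) character
    ones = 0
    mask = 0b10000000
    while mask and lead_byte & mask:  # count the leading 1's up to the first 0
        ones += 1
        mask >>= 1
    return 0 if ones == 1 else ones   # the pattern 10 marks a following byte

# "a", a-ring, the euro sign and the smiling face with horns.
for ch in ("a", "\u00e5", "\u20ac", "\U0001F608"):
    data = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in data)
    print(bits, sequence_length(data[0]))
```

Each printed line shows the bytes of one character in binary, followed by the length that the first byte announces.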

Potentially you should be able to have 8 bytes in total (all bits set to one in the first byte and then 7 following bytes). As the following bytes all start with 10, only 6 bits of each are significant, and as we have used up all the significant bits in the first byte, this means that we can now represent 7 * 6 = 42 bits of data, or 5 bytes + 2 "extra" bits. But the scheme was never intended to go this far. Originally it was envisioned that UNICODE might use all of the up to 6 bytes in the UTF-8 encoding (this doesn't mean that we would have that many characters; not all code points are used, as characters are organized in separate codepages). Eventually it was agreed that UNICODE would only use up to 21 bits, or slightly less than 3 bytes, which encoded as UTF-8 requires up to 4 bytes.
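
The bit counting above can be written down as a small formula. This Python sketch is only an illustration of that arithmetic, with a function name (payload_bits) that I made up; it is not how any real encoder is written:

```python
def payload_bits(total_bytes):
    """Significant bits available in a UTF-8 style sequence of the given length."""
    if total_bytes == 1:
        return 7                          # 0xxxxxxx
    # The first byte spends total_bytes leading 1's plus a terminating 0,
    # and every following byte carries 6 bits after its 10 prefix.
    first = max(0, 7 - total_bytes)
    return first + 6 * (total_bytes - 1)

for n in range(1, 9):
    print(n, payload_bits(n))

# The highest UNICODE code point and the number of bits it needs.
print(hex(0x10FFFF), (0x10FFFF).bit_length())
```

With 3 bytes this gives the 16 bits that the MySQL utf8 character set can hold, and with 4 bytes the 21 bits that UNICODE needs.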

So, now back to good old MySQL. MySQL does support UNICODE and it supports UTF-8, right, so we are all set then? Nope, I'm afraid not. You might have seen that MySQL makes room for 3 bytes per character in a CHAR or VARCHAR column, to make sure that a 3-byte UTF-8 character can fit. In a VARCHAR that might not be significant, as it is variable length anyway? Right? Nope, depending on the MySQL Storage Engine, VARCHAR is actually allocated to the full potential length. OK, but you use InnoDB anyway, which doesn't behave like this, so you are in the clear? Again, I have to disappoint you: unless you never sort on that column, you need to consider this, as sorting works on the full potential length of the data! Also, if you use the NDB Storage Engine and use Disk Data, you have the same issue (see: http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-disk-data-storage-requirements.htm).

But there is another thing that might have caught your attention in the above section. I write that MySQL allocates 3 bytes per UTF-8 encoded UNICODE character, but I also say that UNICODE, when encoded as UTF-8, might occupy up to 4 bytes? What is going on is that the MySQL UTF-8 encoding is limited to 3 bytes, which means that MySQL, when using UTF-8, cannot represent the whole UNICODE character set. Specifically, it is limited to 4 + 6 + 6 = 16 bits or 2 bytes. This might not be a big deal, but it depends on your situation. I was reading up on this on Wikipedia for example, and it says "Three bytes are needed for characters in the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts and mathematical symbols." This isn't altogether true, but whoever implemented the original MySQL UTF-8 support probably read something like this.

What one has to understand is that UNICODE isn't only about languages, it's also about symbols, such as currency signs etc. But again, most of these also live in the UTF-8 3-byte space (even new ones do, as there is unused space in UNICODE). But then Facebook and Twitter came around and screwed it all up, and nothing works again :-( Suddenly the use of smileys exploded. And this is where the trouble starts. The simple smileys, like :-) all fit within 2 bytes of UNICODE, but the more advanced ones do not! Oh my!
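
To see which side of the 3-byte limit a given smiley falls on, something like the following Python sketch can be used. The function name fits_three_byte_utf8 and the sample characters are my own, and the check simply assumes that "fits in 3 UTF-8 bytes" means "code point no higher than FFFF":

```python
def fits_three_byte_utf8(text):
    """True if every character fits in 16 bits, i.e. in at most 3 UTF-8 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in text)

samples = {
    "ASCII smiley": ":-)",
    "white smiling face": "\u263a",
    "smiling face with horns": "\U0001F608",
}

for name, text in samples.items():
    # Print the name, the number of UTF-8 bytes and whether a 3-byte limit is enough.
    print(name, len(text.encode("utf-8")), fits_three_byte_utf8(text))
```
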

There is help on the way though, as MySQL also supports a full UTF-8 encoding called (in MySQL lingo then) utf8mb4. Now you know what that is used for!

But I'm going to close here, and follow up real soon with a blog specifically on using utf8mb4. Fun, isn't it!
/Karlsson

Thursday, September 19, 2013

MariaDB Dynamic Columns

MariaDB has a feature called Dynamic Columns which is not in MySQL, and this feature requires some explanation. It is used for example by the Cassandra Storage Engine, which is also unique to MariaDB. Cassandra is a schema-less database, which means we need a way to handle the fact that on one end MariaDB has a fixed set of columns defined by the schema, and on the other end Cassandra provides any kind of attribute that the developer feels he wants to have for a particular "row" (which is a row in MariaDB but is not called a row in Cassandra).

But ignoring the Cassandra engine for a while, let's look at what us mere mortals, brought up on mother's milk, SQL and Relational Theory under the watching eyes of E.F. Codd, can use this for, and the fact is that it can be quite useful. All in all, what Dynamic Columns provide here is a means of adding non-schema data to a row in a structured way, you know, where you used to emulate an array or something in a VARCHAR or use full blown Anchor Modelling for properties.

Which is not to say that Dynamic Columns is the best thing since sliced bread, but it does have some good uses, so let me show you how they work. To begin with, a Dynamic Column contains structured data in the form of one or more pairs of <attribute name> = <attribute value>. In MariaDB 5.5, <attribute name> was numeric; in 10.x it is a proper name, as it should have been from the start. All examples here assume you are using MariaDB 10.x.

The attribute names are not predefined or in a schema or anything; each instance of a Dynamic Column (i.e. every Dynamic Column in every row) might have different attributes with different values and types. And yes, <attribute value> is typed, but typing is automatic per instance of a Dynamic Column (i.e. every Dynamic Column in every row of the table might have a different type for an <attribute value> for the same <attribute name>). The possible types are:

String (Binary or Normal)
Date
Datetime
Decimal
Double
Integer (Signed or Unsigned)
Time
Dynamic column

As can be seen from this, nesting is actually available, i.e. an <attribute value> might be in the form of another Dynamic Column. So let's dig in with some simple examples.

First, let's create a table that holds some useful attributes. In this case I have a table with my users, and for each user I want to store a range of mixed attributes that the application finds useful, but which are not in the normal TABLE schema, such as birthplace, car make, etc.
CREATE TABLE users(userid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, username VARCHAR(255), firstname VARCHAR(255), lastname VARCHAR(255), attrs BLOB);
Now we have to insert some data into this table. To create a Dynamic Column value you use the COLUMN_CREATE function, which takes a variable, but even, number of arguments, which are name / value pairs. For example:
INSERT INTO users VALUES(NULL, 'bloggs', 'Joe', 'Bloggs', COLUMN_CREATE('shirtsize', 'L', 'colorpref', 'red'));
INSERT INTO users VALUES(NULL, 'doe', 'John', 'Doe', COLUMN_CREATE('shirtsize', 'L', 'vegetarian', FALSE));
Now we hopefully have some data in the table, and as you can see, we are mixing schema data with schemaless data in the same row. Before I end for this time on this subject, let's get a look at that data in the attrs column. There are a few different means of doing that, either you can get a specific Dynamic Column value, like this:
SELECT COLUMN_GET(attrs, 'colorpref' AS CHAR) FROM users;
+----------------------------------------+
| COLUMN_GET(attrs, 'colorpref' AS CHAR) |
+----------------------------------------+
| red                                    |
| NULL                                   |
+----------------------------------------+
2 rows in set (0.00 sec)
Or I can get all the attributes of a dynamic column in one go, in which case I get JSON back:
SELECT COLUMN_JSON(attrs) FROM users;
+-------------------------------------+
| COLUMN_JSON(attrs)                  |
+-------------------------------------+
| {"colorpref":"red","shirtsize":"L"} |
| {"shirtsize":"L","vegetarian":0}    |
+-------------------------------------+
2 rows in set (0.00 sec)
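
As the result is JSON, an application can pick it apart with whatever JSON parser it has at hand. A minimal sketch in Python, where the two strings are simply copied from the COLUMN_JSON output above rather than fetched from a live server:

```python
import json

# The two values returned by COLUMN_JSON(attrs) above, copied in as strings.
rows = [
    '{"colorpref":"red","shirtsize":"L"}',
    '{"shirtsize":"L","vegetarian":0}',
]

parsed = [json.loads(raw) for raw in rows]

for attrs in parsed:
    # A missing attribute comes back as None, much like the NULL from COLUMN_GET.
    print(attrs.get("colorpref"), sorted(attrs))
```
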
That's it for now, but I'll be back with some more details and some cool things you can do with MariaDB Dynamic Columns!

/Karlsson