Friday, October 30, 2009

InnoDB Plugin (With some Windows focus) - Part 1

So InnoDB is now, as of 5.1.38, distributed with MySQL. It is still not enabled by default though. What this means is that there are 2 InnoDB implementations that come with your MySQL installation: the old-school standard one, which is compiled into the MySQL binary and is enabled by default, and the plugin.

The plugin comes as a DLL, and to load it, you have to do two things:
- Disable the builtin InnoDB Engine.
- Load the plugin.
To do this, on windows with a standard configuration, you edit the my.ini file and add the following two lines:
ignore_builtin_innodb
plugin_load=innodb=ha_innodb_plugin.dll

And that's about it. If you restart MySQL now, the plugin version of InnoDB will be used. So, what is new in the plugin then? Many things, actually: some rather cool features, a bunch of performance and scalability improvements, and some enhancements to the operational aspects.
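If you want to verify which InnoDB you actually ended up with after the restart, the plugin is easy to spot: it registers itself as a plugin rather than a built-in, and it introduces the innodb_version server variable. A quick check (a sketch; the exact values shown depend on your version):

```sql
-- With the plugin loaded, InnoDB is listed with a PLUGIN_LIBRARY,
-- whereas the builtin engine has none:
SELECT PLUGIN_NAME, PLUGIN_VERSION, PLUGIN_LIBRARY
  FROM information_schema.PLUGINS
 WHERE PLUGIN_NAME = 'InnoDB';

-- The plugin also introduces its own version variable:
SHOW VARIABLES LIKE 'innodb_version';
```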

I will write a few blog posts on the InnoDB plugin, and I will use Windows for what I am doing, mainly as I have noticed there is little in the way of documentation, blogs etc. on this subject with a Windows focus.

And let me tell you that although the performance and scalability improvements are what is most talked about, they are by far not the only changes. Even if you have OK performance today, use Windows and have little need for more scalability, there is stuff here for you. But before I go into today's subject, let me tell you one important thing: the InnoDB plugin, as it stands here, is not considered GA yet. That is planned for December this year, but right now it's considered Beta, despite the fact that MySQL 5.1.38 itself is GA. So be a bit careful here. That said, the plugin in and of itself actually is GA, so I don't think we'll see many issues here.

Now, today's lesson: Fast index creation
This is a feature that is waaay overdue, if you ask me, but it's available now. What it means is that in most cases, creating or dropping an index does not mean that the table needs to be rewritten anymore! This is a major advantage when it comes to managing a MySQL installation. To me, this is reason alone to leave MyISAM / Maria and what have you, and go InnoDB instead!

The interesting thing right now, in 5.1.38 and up, is that there are 2 InnoDB engines, and we can compare them, old and new, by just flipping a few parameters and restarting MySQL! So let's do that.

I create an InnoDB table:
CREATE TABLE t1(c1 INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
c2 VARCHAR(1024)) ENGINE=InnoDB;

The next step is to insert some data. I insert some semi-random data in the c2 column, 300 bytes long. It's not terribly random actually, but I wanted a simple test, so I go on and insert some 311296 rows (just some arbitrary number).
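I don't show the actual insert statements here, but one quick way to load a similar amount of semi-random data is to seed the table with one row and then repeatedly double it with INSERT ... SELECT. This is just a sketch of the idea; the exact data doesn't matter much for this test:

```sql
-- Seed row: 300 bytes in c2
INSERT INTO t1(c2) VALUES (REPEAT('x', 300));

-- Each run doubles the number of rows, so 18-19 runs gets you
-- into the 300000 row range. MD5(RAND()) gives 32 semi-random
-- characters, padded to 300 bytes total.
INSERT INTO t1(c2)
  SELECT CONCAT(MD5(RAND()), REPEAT('x', 268)) FROM t1;
```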

I then want to create an index on the first 10 bytes of the c2 column. This is what happened with the InnoDB plugin being used:
mysql> create index ix1 on t1(c2(10));
Query OK, 0 rows affected (21.75 sec)
Records: 0 Duplicates: 0 Warnings: 0

And then I drop the index:
mysql> drop index ix1 on t1;
Query OK, 0 rows affected (0.14 sec)
Records: 0 Duplicates: 0 Warnings: 0

At this point, I should say one thing. Before I created the index, I restarted MySQL, to make sure that I could "benchmark" this properly, without data in the cache.

OK, now I flip the configuration back to using the built-in InnoDB Engine, restart MySQL and run the same simple test again, and this is what happened:
mysql> create index ix1 on t1(c2(10));
Query OK, 311296 rows affected (1 min 6.28 sec)
Records: 311296 Duplicates: 0 Warnings: 0

mysql> drop index ix1 on t1;
Query OK, 311296 rows affected (43.27 sec)
Records: 311296 Duplicates: 0 Warnings: 0

Quite a difference, right? Using the InnoDB plugin, index creation is 3 times as fast! And this is not that big an index! And dropping the index is more than 300 times faster! That's the kind of performance improvement I like!

/Karlsson

Monday, October 12, 2009

Of UNICODE, UTF-8, Character sets part 2

Welcome to this second post in this series on UNICODE, character sets and what have you. In the first of these posts, I went through some of the history of character set support, and some anomalies, and finished around the mid-1990's, when we had a bunch of reasonably well standardized 8-bit character sets. And then something happened...

Jacques Chirac becomes president of France. Now wait, that wasn't it. No, what happened was the Internet, and suddenly the time of IT as isolated islands, where we could determine for ourselves how we wanted our computers to operate and what character set to use, came to an end. Suddenly, a user in France could view a webpage created in Japan. And now the issue with character sets becomes a real problem. Luckily, stuff had been going on since the late 1980's, more specifically UNICODE. Whoa, UNICODE comes to the rescue.

The first installments of UNICODE utilized a 16-bit character format. This later sneaked into operating systems, libraries and all over the place. The char datatype in C was supposed to be substituted for the wchar_t datatype (W as in Wide). This did not become very popular, but the scheme persisted and is still in use. Windows has many hooks for this, and most Windows API functions have wchar_t counterparts, and there are even portability macros (the _T() macro being one of them, working together with datatypes such as LPTSTR and TCHAR). This turned out to be more used in countries, mostly in Asia, with large character sets, which were present in the UNICODE character set. The Windows macros and types made it easier for far-east developers to make their code portable to more ISO-8859-1 friendly languages.

What made this not so popular was that it still was a headache creating portable code, and also, the UNICODE character set was moving on, and soon held more than 16 bits' worth of characters. So the 2-byte datatype we used for characters (wchar_t is usually mapped to a short) wasn't enough. Now, working with 2 bytes instead of one was bad enough, but working with 4, 5 or 6 per character was just too much.

The UNICODE character set now needs up to 6 bytes to cover the full representation. So much for all that hard work by Microsoft and all those macros and API rewrites. Just consider looking for the end of a C-style NULL-terminated string: even with 2 fixed bytes, this is much more difficult than it used to be. With 6 bytes, even more so!

So along come some schemes that allow you to "encode" a UNICODE character in some other, hopefully easier to manage, form. The most popular of these, by far, is UTF-8. This is a means of encoding a UNICODE character, up to 6 bytes long, into a variable-length format. The nice thing with UTF-8 is that the first 128 positions are encoded exactly like old-style 7-bit ASCII: one byte, highest bit being 0! The way UTF-8 works means that these 128 byte positions will never appear as part of any other UNICODE character, or in other words, the high-order bit is ALWAYS 1, except for these first 128 characters.

So, given all this, it seems like functions like strlen will work with a UNICODE string? Well, sort of, but it will give you the length in bytes, not in characters. But besides that, it will work. And so will strcpy, strcat etc.
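MySQL has this same byte-versus-character distinction built into its string functions: LENGTH() counts bytes, CHAR_LENGTH() counts characters. With a UTF-8 connection, something like this shows the difference (å is 2 bytes in UTF-8):

```sql
SET NAMES utf8;
SELECT LENGTH('på'), CHAR_LENGTH('på'), HEX('på');
-- LENGTH gives 3 and CHAR_LENGTH gives 2, and the hex dump
-- 70C3A5 shows the high bit set in both bytes of å (C3 A5),
-- while p (70) stays plain 7-bit ASCII
```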

So is all hunky-dory then? Nah! Let's look at a UNICODE transformation besides UTF-8, namely UTF-16. This is still a variable-length encoding, like UTF-8, and it's not that much used actually. But some people tend to THINK it is. What is being heavily used, as mentioned above, in Windows as well as in many Linux technologies, is UCS-2. So what is that? Well, UCS-2 is based on an ISO standard, ISO 10646. This preceded UNICODE slightly, and was, like early UNICODE, fixed 16-bit. UCS-2 means "Universal Character Set, 2 bytes"! When UNICODE came around, the (then) fixed encoding was UTF-16. These two, UCS-2 and UTF-16, are very often confused. But whereas UCS-2 is still a fixed 16-bit character set encoding, UTF-16 has developed and is now a variable-length encoding, but of course still very similar to UCS-2. Gosh, I wonder what these folks were smoking when they figured this one out.

There is one UNICODE encoding that is fixed length, and that is called UTF-32 (or, if you ask ISO, UCS-4). This is very seldom used in practice.

But UNICODE seems to persist, in particular UTF-8, which is, for example, standardized in Java. As for MySQL, it supports UTF-8 as well as classic ISO-8859-1 (which is called latin1 in MySQL, as you know by now, if you did last week's lesson) and several other character sets. One character set not well supported though is UCS-2. You can define it as a server character set, but a UCS-2 client is not accepted. And I think this may be difficult to implement; there are just too many places where MySQL regards a character as being 8 bits. And UTF-8 is much easier: as long as we don't care about the actual length, we can treat a UTF-8 string as a long ASCII string, and all special characters that we may be looking for in a UTF-8 string, such as CR/LF, ASCII NUL and semicolon, are in the 0-127 range.

Now, let's have a quick look at UTF-8 and MySQL. As UTF-8 is variable length, any string space, if defined by the # of characters, is bound to be also variable length, right? Ha ha, got you there! No, MySQL will allocate 3 bytes for each character in a UTF-8 string. What, 3 bytes, isn't UTF-8 up to 4 bytes? Yes, but MySQL only supports UNICODE "codepoints" that can be represented with up to 3 UTF-8 bytes.
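This 3-bytes-per-character rule shows up in practice wherever MySQL has to reserve space for a fixed number of characters, for example CHAR columns and index keys. A small illustration (the table and index names are just made up for the example):

```sql
-- A CHAR(10) column in utf8 reserves 10 * 3 = 30 bytes of storage
CREATE TABLE t_utf8 (
  c1 CHAR(10) CHARACTER SET utf8,
  c2 VARCHAR(1024) CHARACTER SET utf8
) ENGINE=InnoDB;

-- Index keys are also sized in bytes: this prefix of 255 characters
-- takes 255 * 3 = 765 bytes, just under InnoDB's 767 byte limit for
-- a single-column key, so a longer prefix would be rejected
CREATE INDEX ix1 ON t_utf8(c2(255));
```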

OK, that's enough for now. I'll finish off this series with a post on collations, some general ideas on character set development and testing, and a few beers.

/Karlsson

Thursday, October 8, 2009

Of UNICODE, UTF-8, Character sets part 1

Why would you care about UNICODE? Come on now, most people can read English, and English can be written using only 7-bit ASCII, so who needs more? Well, I think it's safe to say that the Internet (remember that? Netscape, WWW, .com booms, pet food on the net etc.) changed all that. Now applications can be found and run everywhere by anyone, more or less, so even if the application speaks English, and even if the user does, you may end up with users inputting data using some other character set.

For someone like myself, having grown up in a "beyond A-Z" part of the world (Sweden, which is one of the easy cases), I can tell you how annoying it is when I input my address on some webpage (this happens even on Swedish websites) using some Swedish characters (I have 2 of the 3 beyond-A-Z characters in the name of the street where I live), and it comes out looking like someone just smashed a fly right in the middle of the name of my street.

For a developer, this is difficult. Having someone test it is bad enough. And then we have things like localized keyboards (I got one of them), printers, OCR software etc. With that in mind, I plan to write a few blog posts on character sets, unicode and stuff like that.

Before I start this first post though, let me tell you that although I am pretty familiar with the subject, I'm far from an expert, so you may well catch an error or two. Please help me correct them if you find any!

So, that said, where shall we start? Well, let's begin with some basics, way back in the Reagan administration. The 7-bit ASCII character set was what was used all over the place when I started in this industry. The only competition was EBCDIC, but that was IBM mainframe only. This was in the early 1980's, but even then, we needed to use Swedish characters sometimes (I am a "swedish character" myself, I guess), and in the 7-bit ASCII world, this was handled by changing some lesser used punctuation marks to the 6 Swedish characters (å, ä, ö and the upper case versions Å, Ä and Ö). This was an issue for a C developer, as I was back then, as the punctuation marks changed were the pipe, backslash, and the curly and square brackets! Yikes! Usually, you could put your VT220 terminal in US ASCII mode, but when you printed, the printer was often shared with office workers, meaning that the printouts often looked like:
main(int argc, char *argvÄÅ)
ä
printf("Hello WorldÖn");
å
Well, you see what I mean, quite unreadable, looks even worse than a Python script. I was about to write that I might have gotten the above slightly wrong, as it was a long time ago that I last used this, but then I decided to look it up, and when I did, I actually had it all right, which goes to show that this was something you really had to learn if you were writing code in C here in Sweden back in the stone ages.

Now, time went on (well, actually, it didn't: 7-bit Swedish ASCII is still in use out there, quite a bit in homebrew ERP systems and stuff like that), and the next step was support for all (or most) of the western world's characters in one character set. And the 8-bit ASCII set was born. This was pretty good, actually, and was pioneered most notably in the DEC VT220 terminal, and then it spread. There were still some variations of the 8-bit character set, but they were much fewer. The most common, by far, is the ISO 8859-1 character set, which contains most characters used in major western world common languages.

Why do I use such weird language here, you ask: "major western world common languages", why don't I just say "western world languages"? Because that would be incorrect, that's why. Take my native Sweden for example. I think most Swedes will agree that 8859-1 contains all characters used in the official Swedish language, and that there is just one such language. And this just isn't true, I'm afraid. Neither 8859-1 nor any of the other 8859 variations covers any of the special characters in the 4 (I think there are 4, where 3 are sort-of common and used) Sami languages / dialects.

8859-1 has a few variations (I know, I know, this is getting boring. ALL these character sets have variations). One such is 8859-15, which, among other things, contains the Euro symbol. 8859-1 also has another name, which should be well known to you MySQLers: latin1! And what about Windows? Windows uses codepages (cp), and cp1252 is the one used by non-UNICODE Windows variations in most of the western world. And cp1252 is the same as 8859-1, right? Nope, it's not, but for all practical purposes, it can be treated as being so.

So what is the difference between cp1252 and ISO 8859-1, you ask? The difference lies in something that hardly anyone uses anymore: the control characters. CP1252 contains only the non-printable characters as used in 7-bit ASCII, in the range 0-31, whereas 8859-1 and -15 also have some control characters in the range 128-159. In the latter range, CP1252 has some real characters.
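The Euro sign makes a nice illustration of this difference, and MySQL happens to be a handy tool to show it with, since MySQL's latin1 is in reality cp1252 rather than strict ISO 8859-1. Byte 0x80, which would be a control character in 8859-1, comes out as a real character:

```sql
-- Interpret the single byte 0x80 as latin1 (really cp1252) and
-- convert it to UTF-8: out comes the Euro sign, U+20AC
SELECT CONVERT(_latin1 X'80' USING utf8) AS ch,
       HEX(CONVERT(_latin1 X'80' USING utf8)) AS utf8_hex;
-- utf8_hex is E282AC, the UTF-8 encoding of the Euro sign
```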

This difference is due to ISO 8859-1 being so much older, from days when we actually used control characters (do you youngsters reading this even know what these are? If not, ask your granddaddy). But besides this, they are the same. This means that web pages, which typically use 7-bit ASCII (very old pages do), 8859-1 or UTF-8 (other variations DO exist, but these are the most common ones), can be displayed on Windows using CP1252, as 1252 just adds characters in a control character range, and control characters aren't used on a web page (except the basic LF, CR/LF, LF/CR and ... NO, don't get me started on THAT for god's sake!).

So along comes 8859-15, which builds on 8859-1, but adds the Euro sign, among a few other things. And as CP1252 was already in wide use, and as 8859-1 was largely compatible with CP1252 for all practical uses, and because no one in their right mind uses many control characters anymore, the committee defining 8859-15 was smart enough to put the additional characters in the same places as the existing ones in CP1252 (the Euro sign is a good example; CP1252 contains the Euro sign in the upper control characters range). HA HA HA Got you there. This is ISO, a bunch of smart people, of course they would not put the Euro sign in 8859-15 in the same place as it is in CP1252! The effect is that, I think, most people who think they use 8859-15 actually use CP1252 (as the Euro sign is used more and more, and the 1252 encoding of it is probably more well known).

OK, so this is a mess. You understand that by now, I think; it's not just me who is a mess, the whole character set thing is. Luckily UNICODE will fix that, so more on that in the next post on this subject (and if you believe that UNICODE will fix this and stop the controversy, let me tell you about a New York bridge that I can get you a real good deal on). And also something on collations. What are those? And why? And what happened to the squirrel? We'll be right back, so don't touch that dial!

/Karlsson
AKA The Swedish character

Wednesday, October 7, 2009

MyQuery 3.2.1 ready for use!

If you need an ad-hoc query tool for MySQL on Windows, then MyQuery might be what you are looking for. MyQuery is an Open Source, free Windows tool, supporting colour syntax highlighting, multi-tab editing, several information, status and dictionary features and much more.

The latest release is MyQuery 3.2.1, and if you have followed the development of MyQuery, you know that the 3.x releases are focused on features and usability, and 3.2.1 has advances in both areas:
  • Much enhanced keyboard navigation. It is now possible to navigate the main window and most functions and dialogs using only the keyboard. Tabbing around in the main dialog is now supported, for example.
  • Enhanced keyboard accelerators, where accelerators for almost all functions have been added, and there is now also a non-modal dynamic help dialog, i.e. dynamic in the sense that it shows the current key assignments. This help may also be printed, if necessary.
  • Printing support. This was much more difficult than expected, but I set my goals high. I wanted to be able to print both script and results (the former was much easier than the latter, due to Scintilla handling most of it for me), and both at once, i.e. printing a query with its results on one page. Also, I wanted to add some nice touches, like headers, framing etc. I'm happy with the way it turned out.
MyQuery 3.2.1 is available for download from Sourceforge!

I will follow up this post with some more, detailing some of the features of MyQuery!

/Karlsson

What's the matter with Sourceforge?

Sourceforge used to be my preferred way of sharing and finding software. Sourceforge used to always be there, hosting loads of projects and stuff, some of them good, some bad, some dead, some very active. But many fine projects were there, including a big bunch of good MySQL projects, and most of my own projects are hosted there.

Now, Sourceforge has, somehow, turned dead slow. The interface is now nice and inviting, sort of, but some of the old interface is still there. But it is so sloooow. Gosh. Is Launchpad the name of the game? I have seen Launchpad more as a tool for co-developing software and stuff, but I have hardly looked at it, so I don't know. All I want is a means of sharing my projects. They are so small and insignificant that few are likely to want to step in and work on them with me, so although they are Open Source, I think there are few people reading the source (and maybe that is just as well).

Anyway, I need to look for something else than Sourceforge, because Sourceforge, at least for me, is just a joke now.

Or am I missing something? And if not, what's the best alternative these days?

/Karlsson