Karlsson on databases and stuff: 2008

Friday, December 26, 2008

You said you are a SALES engineer? What the heck is that?

I'm a Sales Engineer, I admit it. This blog is mostly about technical stuff, but I felt I should write a not-so-technical post for once. For a techie like myself, and to someone else out there, even the Sales Engineer title may look weird. Frankly Sun doesn't have Sales Engineers, they have System Engineers, which is the same thing. Not. Not at all, it is something different. But I insist on being a Sales Engineer.

I have been an SE (as in Sales Engineer) on and off (mostly on) for a good 20+ years now, more or less always in the database industry (a short term with security software is the exception). I learnt the SE trade at another RDBMS company in the 1980's. I had joined them as Support Engineer, having previously been a developer. I learnt SQL the hard way, by using it. I learnt to use indexes by having not used indexes, realising things were slow, looked at the syntax and found the word INDEX which sounded like something that would speed things up, and it did.

Anyway, I now was working for a US corporation as a Support Engineer in Sweden. There was 8 of us. No, not 8 support engineers, 8 employees in Sweden. Only one Sales Engineer, and that was NOT me. I was not too keen on having to meet a customer, as I had been forced to be a trainer early on, and I was scared stiff by that experience. And the Sales Guys then. I hadn't dealt much with them previously, but now.. Oh my. So this Sales Guy comes into my room one day, the SE was off to something else, and a techie was needed at a customer meeting the next day. I was the only one available, so "Take your best suit, a white shirt and a tie and you are all set, it's easy", the Sales Guys said, trying to calm me down. It didn't. I didn't have a nice suit, nor a white shirt and a tie, well I did not ever want to wear one, and I didn't have one either.

But what could I do, I was the youngest employee by far, the lowest on the ladder and self-confidence wasn't something that was I too familiar with. So I got myself a suit, a shirt and a tie (in those days, if you were in Sales at all, you just HAD to wear a tie) and went along with the fast-talking salesguy. And I found out I liked it. So after some time, I went along, took some training and became an SE and was assigned the important task of carrying the Sales Guys cellphone (this is true. And cellphones were BIG in those days. And only management and salesguys had them).

Now, 20+ years later, what am I doing you ask? A consultant that isn't necessarily paid? A tech support service for sales guys and customers that Sales Guys think deserves it? Someone with the most contradicory title on the planet? Frankly, the job of an SE includes all those things. But the deal with an SE doing, say consulting or tech support isn't what we do, but why. We do engineering work in the name of Sales, and that isn't necessarily a bad thing. That's the what and the why.

As for any customers reading this, and a few of you have probably meet me out there, what does this mean? Do we fake engineering abilities so we can get paid? Nope, a Sales Engineer is always an Engineer at heart. If my customers are happy, then I am happy. We are honest guys, it's just what we do is for a different reason than, say, tech support. But that doesn't mean we are evil, at least I try to avoid it (Santa was here fo me this year, so I think I am doing OK).

Doing what is right, from a technical perspective, is what we try to achieve. For example, a prospective customer wants to start a project based on this or that idea, and he is using MySQL for it. It is not the Sales Persons job to say yes or no to that, or even to understand if this is a good thing or not. No, the sales persons job is to make sure that the customer gets a fair deal on MySQL for this project. But what if the design of the thing is completely flawed? Anyone can see that this will not work, right? Well, the customer might not have worked with MySQL ´before, and maybe not even with an RDBMS. And it's not the salespersons job to understand or question this (which doesn't necessarily mean that he doesn't understand, it's just that it isn't his job). So whose job is it then to get back to the customer with Sales person and say that, "Hey, this is not really such a good idea" or something to that effect? The SE. That is what we do. And also the SE should provide some alternative design here.

And if the customer says, "I want MySQL, but I'm not sure about the design", well an SE can answer that. So we can spend a few weeks at a site for free fixing the customers design, before any contract of any kind has been signed. Nope, wrong. There is a balance here, SEs don't do things "for free", as it seems sometimes. But we do help out, point out a direction and give advice, in the extent we can do just that.

What kind of training does an SE have, what competences to we hold? Typically more a combination of many things than in most other professions. An SE typically knows little about many things, and combine that with knowing A LOT of a few things. I'm a good C hack for example (hey, I am *old* now) and I know a few other things well too.

Is what I am saying here trade secrets of MySQL / Sun Microsystems then? Nope, if you have worked as a Sales Engineer, you know it's about the same thing everywhere. Sometimes we are called Pre-Sale consultants, sometimes SE, sometimes Field Engineers or something, but the job and what we do is similar.

We are all in all the engineering side of the Sales Process. When a sales person doesn't know if technical project has any chance of being successful, he asks us. When a prospective customer has some technical questions, we answer them or get them answered.

And then there is one other thing. When the customer has paid up, are we gone into thin air, never to be seen again. Nope, as I said, we are after all Engineers, we like to see technology work, we want to help, and as MySQL is selling subscriptions, we will see you again next year.

SE's with MySQL / Sun Microsystems has a lot of freedom, more than with any company I have worked for an an SE. If a technology or a customer interests me, I am allowed to dig into that and get my hands dirty, should I wish so. Aand are we cometing with the MySQL / Sun Microsystem Preofessions Services consultants? Nope, we try hard not to.

All in all, we are techies, the work we do is stuff that a Support Engineer or PS Consultant can do, at times we do those jobs at least (we are probably not as fast as those guys at it though). We just do these things for a slightly different reason. And we ARE a bit salesy, I admit, we have to know what is going on in the market, we have to understand how to explain something in a few words to someone who might not be a techie, but someone who is signing the checks, for a customer.

After so many years as an SE, more than most I think, I can say that being an SE at MySQL / Sun is great. And I look forward to more of it. And (blushing a bit) I a´was here easrly enough to start it (we were 2 SE's back then). I can tell you more about that, but this posting is long enough already. Next posting will be back at technology, I promise!

Yours truly
/Karlsson

Wednesday, December 24, 2008

Using triggers for performance

Can you use TRIGGERs to increase performance? Really? Isn't so that a TRIGGER on a table will reduce performance, just as any FOREIGN KEYs will? Right?

Nope, none of those statements is necessarily very true. In the case of reads, for example, neither TRIGGERs, nor FOREIGN KEYs has any negative impact at all. Then, assuming we have a read-intensive system, such as some website, then performance isn't much of an argument for not using triggers or foreign keys.

You already know I'm quite a fan of foreign keys. Let me reiterate why I like them: They keep my data integrity intact, that's why. The argument that if I have transactions, then the only thing that might cause data integrity to fail is if there is a bug in the application. There are two things I have to say against that:

Many, if not most, users use more than one application. For example, if nothing else, the mysql and mysqldump commandline tools are used. To me, if I use foreign keys properly, I can safely mock around manually in the database with SQL, without causing data integrity issues.
Foreign keys are declarative. This means that the rather strict things they enforce are easy to look up and maintain, without looking into numerous pieces of code spread all over the place.

Now, I have made my point with Foreign keys, so what about the original subject of tjis article, i.e. Triggers to increase performance, all I have said so far is that they do not at least decrease performance for reads, but can they increase performance? Yepp. Let me give you an example.

Even with semi-complex data structures, you sometimes get into a situation where even a moderately normalized RDBMS structure cause some queries to be unnecessarily complex. On the other hand, denormalizing it usually has other issues. For example a de-normalized design is often best (performance wise) for SELECTs, whereas a normalized design keeps my data integrity better maintained, in particular with INSERT, UPDATE and DELETE statements. And this anomality is where triggers help!

This is how I think: Start with a pretty normalized structure. Identify the areas where a SELECT would be unnecessarily complex, then add the "de-normalizing" columns to the tables, without removing the normalized column structure or foreugn keys. This means I will have duplicate data, but it also means that SELECTs will be performaing much better. INSERTs, DELETEs and UPDATEs will be more complex and less performant, but this is no big issue as:

INSERT, UPDATE and DELETE are less frequent! I said this was a mostly read-only system, right!
And here is the magic sauce: I can use triggers to maintain the "denormalized" columns. My INSERTs, DELETEs and UPDATEs will not change at all compared to the normalized structure, as this is still in place as it always has been.

I had this issue the other day. I run a website on my spare time where I, among other things, keep track of bluesbands and gigs with them. I figured out that the bands usually were a combination of artists, i.e. I had "Fat Foo" on one gig, and on another I had "Fat Foo and the Bar". So I ended with having the latter being a lineup and the former an artist. In an ideal world, this would have been a great candidate for a hierarcical CONNECT BY, scenario, but not so now, without that particular feature in MySQL.

All the same, I realized, for a number of reasons, that this would be a bit sluggish. Every time I looked up a gig (in the gig table), got get the name of the artist I needed at least 3 lookups: The lineup, the lineup->artist table and the artist table. A gig always references just a lineup, so every artist is part of at least one lineup, the one with just one member:
Artist "Fat Foo" is the only artist in the Lineup "Fat Foo"
but
Artist "Fat Foo" is also one the members in the Lineup "Fat Foo and the Bar", the other artist in that lineup being "The Bar".

A lot of table lookups, just to get the name of a band. So I created not a trigger here (I'll soon tell you why), but a stored function that would update the corresponding lineups everytime an artist changed. The function would simply sync the lineup, having been given an artist ID.

Now it was to to create the trigger that would call the function above. So why didn't I embed the code in the function right into the trigger? Because I also wanted to make sure that I could always redo all the lineup names, if I decided I wanted to change the scheme I use for this. If I did that, I would use the function easily for that, with a simple SELECT.

So, now you ask, how does it work in practice? And what about the case when "Fat Foo" and "The Bar" keeps their names, but the combo is renamed to "The Fat Bar". I just have two columns in the lineup table, one which contains a "generated name" by the function above, and one which is manually entered. I use the manually entered name if it exists, else I use the generated one, a simple IFNULL will do for this, no extra lookups or anything!

Look at the result on PapaBlues.

And by the way, folks, Merry Christmas to you all
/Karlsson

Thursday, December 18, 2008

The forgotten FOREIGN KEY constraint

If you are using MySQL with the InnoDB Storage Engine, you know and probably use FOREIGN KEYs. There are some who dislikes the use of these, saying the are bad for performance, but I tend to disagree, if the price of keeping my data consistent is a slight performance degradation, then so be it.

Anyway, I was going to write about a couple of lesser known FOREIGN KEY constraint attributes here. We all know what a FOREIGN KEY constraint is, right? A reference from a column or set of columns in one table to the PRIMARY KEY in another table? Yes, this is correct, but it's not the whole truth, actually, a FOREIGN KEY is a reference to a PRIMARY KEY or UNIQUE key in another table. Being able to reference a UNIQUE constraint is sometimes rather useful.

But before we go into that, lets look at a few subtle differences between a PRIMARY KEY and a UNIQUE constraint. In RDBMS design lingo, the UNIQUE constraing usually represents a candidate key. This means that the UNIQUE constraint columns could be used as a PRIMARY KEY, but they are not.

Often a UNIQUE constraint is used to ensure that, say, a NAME of an entity is unique, even though the entity itself has an identifier on it. The identifier used as the PRIMARY KEY is then often an AUTO_INCREMENT column. A few reasons why we don't want to have the name column as the PRIMARY KEY is that having a long string as an identifier is cumbersome, and that an INTEGER AUTO_INCREMENT column is so much easier to handle as an identifier (AUTO_INCREMENT being one such reason).

The PRIMARY KEY is often then, at least if I can decide, a surrogate key or an opaque identifier, i.e. a value that carries no other information than being a key.

That the PRIMARY KEY column is an opaque value has the advantage of removing the issues when using, say, a name as a key, that is referenced by columns in other tables, and then the name is changed.

A UNIQUE constraint is similar to the PRIMARY KEY constraint in that is provides uniqueness of a column or set of columns in a table. The PRIMARY KEY is just the primary identifier. One difference though is that a UNIQUE constraint column or set of columns may be NULL.

A FOREIGN KEY column may also NOT reference an entity in a PRIMARY KEY or UNIQUE constraint. This is when the FOREIGN KEY column is NULL. Note that a NULL in a FOREIGN KEY is not matching a NULL UNIQUE tuple!

Lets show an example of how to use a FOREIGN KEY that references a UNIQUE constraint, an example that is useful and demonstrates most the things I described above.

Let's say I want to keep track of rockbands in a table called bands. Among the things I am tracking is the location of the band. I want to at least know the country the band comes from, but possibly also the city. So I have two more tables, country and city. Each city is of course also assigned to a country.

All in all, I have a table of countries, a table of cites in coutries and a table of bands know to be in a specific country or possibly a city.

So:
CREATE TABLE country(id INTEGER NOT NULL, name CHAR(30) NOT NULL, PRIMARY KEY(id));
CREATE TABLE city(id INTEGER NOT NULL, country_id INTEGER NOT NULL, name CHAR(30) NOT NULL, PRIMARY KEY(id), FOREIGN KEY(country_id) REFERENCES country(id));

That look right, OK? Now, I want to create the bands table. The first attempt is this:
CREATE TABLE bands(id INTEGER NOT NULL, name CHAR(30) NOT NULL, country_id INTEGER NOT NULL, city_id INTEGER, PRIMARY KEY(id), FOREIGN KEY(country_id) REFERENCES country(id), FOREIGN KEY(city_id) REFERENCES city(id));

Does this look right to you? Does it ensure data consistency? In a sense yes, but not in the case of country and city in the bands table: It is possible for me to designate a band to be in a city that is not in the country that that city is in! So I could say that The Rolling Stones are in Berlin in England, which, considering the shape of Keith Richards and the medication he is using, might not be so surprising. All the same, this is not correct.

What we have is not really just a bunch of countries and cities. The city table not only defines the cities, it also defines the valid combinations of city and country! And these combinations are unique! (yes, I know there are many cities with the same name in many countries). So what my cities table really should look like is this:
CREATE TABLE city(id INTEGER NOT NULL, country_id INTEGER NOT NULL, name CHAR(30) NOT NULL, PRIMARY KEY(id), FOREIGN KEY(country_id) REFERENCES country(id), UNIQUE(country_id, id));

And now I can ensure that Mick Jagger and his friends doesn't end of in the wrong country? I create my bands table like this:
CREATE TABLE bands(id INTEGER NOT NULL, name CHAR(30) NOT NULL, country_id INTEGER NOT NULL, city_id INTEGER, PRIMARY KEY(id), FOREIGN KEY(country_id, city_id) REFERENCES city(country_id, id));

And note that it is OK to create a band where city_id is null! But we are still not quite home yet. The issue is with NULL values and the combination of NULL and non-NULL values. The above setup will actually allow us to enter ANY value for country_id for a band, as long as city_id is NULL. This is because, as we know, NULL means undefined. And the combination of a know defined value and a non-defined NULL values is... NULL and non-defined. So we need one more foregn key here:
CREATE TABLE bands(id INTEGER NOT NULL, name CHAR(30) NOT NULL, country_id INTEGER NOT NULL, city_id INTEGER, PRIMARY KEY(id), FOREIGN KEY(country_id, city_id) REFERENCES city(country_id, id), FOREIGN KEY(country_id) REFERENCES country(id));

What about the NULL discussion above? Wouldn't it be reasonable to think that if country_id is specified, and city_id is NULL, that one could ensure that a country exists by a lookup in the city table? And that this would be valid RDBMS design? Nope, it's actually the other way around, this is actually HELPING us create a workable database design! Think about what would happen if we had a country with no defined cities yet? We would not be able to place a band in that country, even by setting city_id to NULL, as there would be no such country_id in the city table!

Wednesday, October 15, 2008

Using SQL to generate SQL

Using SQL to generate SQL is a trick that is used by many DBA's and is a really useful way to achieve certain things. It's not that much used in applications, but for DBA's it is great, as it allows you to do certain things that would otherwise require a script or something like that. The idea is simple enough, you write a SQL statement that in turn will generate another SQL statement or set of statements, the generated SQL is written to a file, and then you run (for example run the mysql "source" command) on the file with the generated SQL.

Simple? Yes, but effective. As an example, using MySQL, lets say you wanted to have a script that would run "OPTIMIZE TABLE" on all tables with 1000 rows or more in the current database? You cannot do this with one SQL statement, right? To begin with, you can only run OPTIMIZE TABLE on one table at the time?
So we need an SQL statement that will generate a bunch of relevant OPTIMIZE TABLE statements, and then we need to run the latter. Like this:

SELECT CONCAT('OPTIMIZE TABLE ', table_name, ';')
FROM information_schema.tables
WHERE table_schema = DATABASE()
AND table_rows >= 1000;

So all you have to do now is to write the output of the above SQL statement into a file, and then run that file, right? If you have been using Oracle or some database tool other than the mysql commandline, you know that these tools support writing SQL output to a file, without getting any extra output that what you get from the SQL statement, which is what we want here. No column names, no summaries, no timings, just the SQL result. You can to an extent achieve this with the mysql commandline, but you cannot create a script and run it from inside the same SQL script, as you cannot embed the "tee" command in a script that you run in batch. But from the commandline it works, to an extent. So to achieve the above with a script, you can use a script like this:

tee tmp.sql
SELECT CONCAT('OPTIMIZE TABLE ', table_name, ';')
FROM information_schema.tables
WHERE table_schema = DATABASE()
AND table_rows >= 1000;
notee
source tmp.sql

Now, for the above to work, there are some assumptions. To begin with, as I said, you have to run this from the commandline, you cannot use tee whith mysql in batch mode, such as when piping a command into it. Also, you have to run mysql with the options --skip-column-names and --silent. And finally, the mysql "tee" command has an irritating aspect: It will append, not overwrite, the outfile. Yikes!

A better way then, if you really want a script for this, is to pipe the output from one mysql session into another. If I, into a file named runopt.sql put this:

SELECT CONCAT('OPTIMIZE TABLE ', table_name, ';')
FROM information_schema.tables
WHERE table_schema = DATABASE()
AND table_rows >= 1000;

I can then create a script like this to optimize tables in the test database with 1000 rows or more:

mysql --skip-column-names test < tmp.sql | mysql test

Now, this is all well, but wouldn't it be nice to be able to write to a file and then source that file from one SQL script (why the tee command is not allowed in batch mode beats me. I'll have to check this up)? Well you can use my MyQuery Open Source tool for that (available for download here: https://sourceforge.net/projects/myquery/ DIsclaimer: This is a Windows GUI tool, no Unix / Linux support as yet. Sorry). This has a special output command for these kind of no-frills output: sparse_output. To keep MyQuery scripts somewhat compatible with mysql, commands unique to MyQuery, such as this one, are commented.

So with myquery, the above can be achived with this simple script:

/*MYQUERY sparse_output_file tmp.sql */
SELECT CONCAT('OPTIMIZE TABLE ', table_name, ';')
FROM information_schema.tables
WHERE table_schema = DATABASE()
AND table_rows >= 1000;
source tmp.sql