Karlsson on databases and stuff: February 2009

Thursday, February 19, 2009

Even more on Stored Procedure performance

Well now, there has been some commenting and even a follow up blogpost to me previous blog on MySQL Stored Procedure Performance.

One of the issues, and I knew about that one, was that when I compared 100000 client INSERTs with 100000 INSERTs done from a Stored Procedure, the overhead of the former would largely be the Client / Server communication latency. This was noted by Anthony T Curtis in his blog Yet more on Stored Procedure performance.

Me not being one who will put a stop to a debate, or whatever it is, we are mostly fact-fining I think, I have done YET some more tests. What Anthony was doing was to limit the overhead of the network latency by using multi-statement SQL for the INSERT, hence limiting the network roundtrips. So I decided to better that and remove the network altogether, but recompiling, my testprogram spperf.c as a libmysqld program. I will not show the code here, as it is basically the same as what you have already seen. The results were somewhat interesting, I think, at least of you are an acknowledged database geek.

So, here we go, using the same procedures and data and tables as before, but the program is now using libmysqld (the embedded MySQL Server):
First, my straightforward INSERTs that took some 7 s real time before:
-bash-3.00$ time ./spperf_emb 100000 "INSERT INTO foo VALUES(57, 'Some data')"

real 0m3.498s
user 0m2.288s
sys 0m1.165s

As we can see, this was about TWICE as fast! Yo. But libmysqld really IS fast here. What about using my procedure:
-bash-3.00$ time ./spperf_emb 1 "CALL perf(100000)"

real 0m3.361s
user 0m1.654s
sys 0m1.706s
Still fast, not as much faster than the straght INSERTs when I used the networked version of the program, but still faster! And faster than any results performed by Anthony (which is what I hoped to achieve anyway :-). Testing different combinations of things, running the loop inside the procedure and out of it, went slightly faster and faster until I had about 1000 call to the procedure, with the procedure doing 100 loops. After that point, performance was about what I achived above.

/Karlsson

Tuesday, February 17, 2009

More on Stored Procedure performance

There has been some discussion on the performance of MySQL Stored Procedures here, last up was Anthony T Curtis in his blog where he writes about Perl stored procedures. Brooks Johnson writes on the SP performance in his blog and concludes that procedures really are slow., in terms of compute intensive operations at least.

As a follow-up to this, I wanted to test what the difference really is when we use database intensive operations. Here. procedures really should be faster, as they run inside MySQL itself. Books has already compared the performance of MySQL Compute intensive operations to MSSQL, and regretable, MySQL came out behind the Redmond thingy.

But now to my really simple database operations performance test. I'm using a procedure to INSERT data into a table, and to reduce the impact of the INSERT itself, I'm using the Blackhole Storage Engine. Then I write a simple procedure to INSERT into the table. All in all, this looks like this:
CREATE TABLE IF NOT EXISTS foo(bar INT NOT NULL PRIMARY KEY, col2 CHAR(100))
ENGINE=Blackhole;

DROP PROCEDURE IF EXISTS perf;
delimiter //
CREATE PROCEDURE perf(nOper INTEGER)
BEGIN
  DECLARE i INTEGER DEFAULT 0;

  WHILE i < nOper DO
    INSERT INTO foo VALUES(57, 'Some data');
    SET i = i + 1;
  END WHILE;
END
//

Now, I need to run this puppy somehow and I want to run the same inserts from a client program, and for this I write a simple C program, like this:
#include
#include
#define MY_SOCKET "/tmp/mysql5131.sock"

int main(int argc, char *argv[])
  {
  MYSQL *pMySQL;
  int i, nLoop;
  char *pStmt;

  if(argc < 3)
    {
    fprintf(stderr, "Usage: %s \n", argv[0]);
    return 0;
    }
  nLoop = atoi(argv[1]);
  pStmt = argv[2];

  pMySQL = mysql_init(NULL);
  if(mysql_real_connect(pMySQL, NULL, "perf", "perf", "test",
0, MY_SOCKET, CLIENT_COMPRESS) == NULL)
    {
    fprintf(stderr, "Error %s connecting to MySQL.\n",
    mysql_error(pMySQL));
    mysql_close(pMySQL);
    return 1;
    }

  for(i = 0; i < nLoop; i++)
    {
    if(mysql_query(pMySQL, pStmt) != 0)
      {
      fprintf(stderr, "Error %s in MySQL query.\n", mysql_error(pMySQL));
      mysql_close(pMySQL);
      return 1;
      }
    }
  mysql_close(pMySQL);
  return 0;
  }

The C prgram is called spperf, and it take 2 arguments, a counter of how many times the statement should be run and the text of the SQL statement to run. This way I can do something remotely interesting, which is to run X number of loops in SP and X number outside, i.e. I can loop in both the SP and in the C program. I'll show what I mean real soon.

To begin with, I run 100000 INSERTs to the table foo created above. I use linux time to time the execution. Yes, I know this is cride, but it works for this simple test. So:
[root@moe spperf]# time ./spperf 100000 "INSERT INTO foo VALUES(57, 'Some data')"

real 0m6.967s
user 0m0.580s
sys 0m0.702s

As we can see, it took some 7 seconds for this execution. Now, to run the same INSERTs using the procedure, I do this:
[root@moe spperf]# time ./spperf 1 "CALL PERF(100000)"

real 0m3.439s
user 0m0.003s
sys 0m0.001s

And as we can see, this is a bit faster. Now we can try some other combinations, like running 4 statements at the time in the procedure, and calling the procedure 25000 times, which will cause the same 100000 INSERTs as in the examples above
[root@moe spperf]# time ./spperf 25000 "CALL PERF(4)"

real 0m5.609s
user 0m0.184s
sys 0m0.268s

And the procedure, even when called 25000 times, still outperforms 100000 straghtforward Client inserts.

There are more tests that can be done here, I'm going to do two more now, one where I only do 1 INSERT per Stored Procedure call, and one where I do 2.
[root@moe spperf]# time ./spperf 100000 "CALL PERF(1)"

real 0m12.703s
user 0m0.770s
sys 0m0.853s
[root@moe spperf]# time ./spperf 50000 "CALL PERF(2)"

real 0m7.775s
user 0m0.337s
sys 0m0.461s

As we can see, for a simple procedure, it's not terribly fast, but as soon as there is something slightly more to do, a procedure really isn't such a bad idea, at least if that something is not comptationally intensive that is. I have just one more thing to test now.

As Brooks has already determined, computational operations in a stored procedure really aren't fast at all. So then, my perf procedure above, when running with just 1 INSERT and I run it 100000 times, really can be simplified. So if I do that, let's see what happens. I create a new simpler procedure like this:
DROP PROCEDURE IF EXISTS perf2;
delimiter //
CREATE PROCEDURE perf2()
BEGIN
INSERT INTO foo VALUES(57, 'Some data');
END
//
And then I run it and compare it with running the first procedure with the argument 1 and see what happens:
[root@moe spperf]# time ./spperf 100000 "CALL PERF(1)"

real 0m12.744s
user 0m0.698s
sys 0m0.779s
[root@moe spperf]# time ./spperf 100000 "CALL PERF2()"

real 0m7.630s
user 0m0.491s
sys 0m0.779s

Ouch! That was some difference from looping just 1 time in the first procedure. Lesson: Do as little compuational stuff as possible in any procedure.

/Karlsson

Monday, February 16, 2009

Back to the future...

Wondering really what happened at MySQL? And how it happened? And when Sun Microsystems acquired us? Well, that story in it's entirety still remains to be told, although there is at least one book on the subject.

But what I wanted to Blog on here is a similar situation, way back, that actually has been documented, and not by some anomymous writer, but by someone, and a technical person, who was actually there. The subject is the growth of the very early years of commercial computing, more specifically Univac (later Sperry, later Unisys), CDC (Control data) and Cray. The story is told by David E Lundstrom in the book "A Few Good Men from Univac" (ISBN 0-262-12120-4). This is a highly entertaining and interesting book, and the story has many similarities to what goes on here at MySQL / Sun these days. As some of you has probably noticed, there is a certain amount of political game going on here at MySQL / Sun, but that was expected, and this book tells a rather similar story on how Univac was aquired by a much larger company (Remington-Rand), how some key people left and started a competing company a few years later etc. Many similarities here.

Mr Lundstrom himself is a technical, but not deeply technical person. Largely, he held technical sales and marketing jobs during these years (which were much more technical jobs in those days).

Speaking on old computers, if you visit the MySQL User Conference this year, you might want to combine this with a visit to the Computer History Museum. This museum is in Montain View, and has a large collection of old cool hardware. If you were at the now defunct computer museums in Boston and San Jose, I can tell you this is different. This is no stupid "experience computer" thingy for kids, like the thing in Boston used to be (Nothing wrong with that, but as far as a technical computer museum went, the Boston thing left a few things to be desired).

Among the objects at this museum is a working PDP-1, a Zuse Z-3 (I think it's was a Z-3 at least) and a Kitchen Computer, a WEIRD contraption that, I tell you! By the way this is stuff that I saw when I was there last, a few years ago, so things might have chnged. I'll probably be going to this museum on the Saturday before the UC, and maybe we can arrange something, if someone wants to join for an exploration of really old hardware. Drop me an email, if you are interested. If you want to go there on your own, take the streetcar outside the UC hotel, and travel to Mountain View. The Museum is not in downtown MV thogh, so there is a 20 minute walk or so a ahead, not a nice wolk, but possible (you have to cross the 101 on an overpass).

And a note to UC organizers: Maybe you want to talk to these friendly guys at the Museum, they might want to promote themselves at the UC and entertain us with some horrible computer artifacts from the good old days.

/Karlsson
Brushing of the dust from digging into the history of computers

Sunday, February 15, 2009

Studid database-geek jokes

Are you allowed to have jokes here? Or will I be forever banned if I try to be funny? I don't know, but that's not stopping me at least! This came from a friend:

A SQL Query walks into a bar. In one corner of the bar are two tables. The Query walks up to the tables and asks:
- Mind if I join you?

/Karlsson

The mess with Dates - And what really IS a valid date?

To add to the confusion with valid DATE values in MySQL, there is the issue that it's not really 100% clear what a valid date is. Most RDBMS systems aren't as forgiving as MySQL, so trying to insert a date in a DATE column like this '2009-02-30' will cause an error with, say Oracle (which has it's own problems, as the Oracle date datatype is really a DATETIME, but the default is to only display the DATE part. This is onle of the #1 mistakes newcomers to Oracle do:
SELECT datecolumn FROM table_with_dates;
datecol
=========
2009-01-01
SELECT * FROM table_with_dates WHERE datecol = '2009-01-01'
No rows returned;
)
The above was a long parentesis, maybe a recordbreaking such?

Anyway, getting back to my original point, is 2009-02-30 a valid date or not? Look in your calendar, and you see that it's not. On the other hand, ask someone who work with fincial instruments, and the answer will be maybe!

So what's the deal here? Well, way back, to be able to distribute, say, an interest rate, that is always set by year, into months, if we were to use, say, the right number of days per month, we would pay a different interest rate depending on the days of each month. If you have a loan, on say a house or a car, you know that this is not so, you pay the same amount of interest every month, namely 1/12 of the yearly interest! And think about it, not doing so would be complicated.

Another issue is the leap day. Now we have the complexity of certain years being "longer" than others, which has the issue of coluation, say, a value of a instrument is different in different years. So for some instruments, you use a 365 day calenday, i.e. there are no leap years!

In summary, in the finacial worlds, one uses three calendars (at least), the standard (often called a 366 day calendar, although only one year in 4 has 366 days), a 365 day calendar and a 360 day calendar. In the latter case, which is true for interest calculations for example, there are "invalid" dates.

Now, if we assume that you wanted to map the value of one instrument on another, say you take a loan (360 day calendar for interest) to pay for an instrument that uses a 366 day calenday, then calculation of, say, the current value of this investment, gets real complex here. I should now as I have worked with these systems, you often get into the situation that depending on the order of calculations, the result is different, so there are certain ways to follow here, and these ways are not always that well docmented.

I once talked to David (Axmark) around this, and according to him, one reason for the current relaxed DATE-handling in MySQL is due to just this: There was a need for storing dates for financial instruments, more specificaly for interest rates, and for computations.

Let's say you buy stock and take a loan to pay for those stocks. A month later, you want to compute the value of your investment. Now, a month later may mean a different number of days when it comes to the interest that the stock. Hey, go figure! (I have actullay forgotten which calendar stocks use, I know interests use a 360 though).

I can't say I think this is any good. Buy way back, when financial transactions were easier, and the number of instruments limited and cross instrument deals were uncommon, and so was financial risk analysis (which is what I used to work with), this was reasonable.

So all in all, even though I'm no fan of the handling of DATE values in MySQL, it actually has it's good points.

/Karlsson
Who has some knowledge of things that are best left alone.

Thursday, February 12, 2009

Of MySQL 5.1 and Joomla part 2 - A dirty trick and an idea

I don't know if you have had the chance to look at my PapaBlues.com website yet, possibly not. And you have no interest in the Blues, then it is perfectly understandable. Anyway, the site use Joomla and runs on MySQL 5.1.30. As I have written before, the support in Joomla for MySQL 5.x features are rather limited, but as a whole, it works OK. Some things not as well as others. Joomla's handling of NULL values is weird for example, but this is inherited I guess from the fact that HTML doesn't know much about NULL values either. I have fixed that partly by using a special class for my tablöes, derived from JTable. My main issue was that as I insist on using InnoDB and proper foreign keys, I can not have Joomla insert a 0 in a column when I really want a NULL, as that can make FK relations fail (NULL means no parent is specified, 0 means it IS specified and has the value 0).

Anyway, if you look at the site and have a look at the gig calendar, you will notice that there are two types of gigs listed:

Gigs that are just that, gigs, i.e. band so-and-so playing at venue so-and-so.
Festival gigs, i.e. gigs that happen at festivals.

As it looks right now, these are not treated differently, but really, there is a problem here. What I want to show is that there is a festival, even when no gigs had been determined yet. As it stands now, I do a join between the table that contains the gig and the table that contains the festivals and that is it. Which means that for a festival that has no gigs yet, I will not get any return values and these will not show up.

So will an outer join work? Well, no, for a specific reason: I order gigs according to the gig-date, which is reasonable. A festival on the other hand has a start-date and an end-date. If I outer join gigs with festivals and do an outer join, what date will I use for the join? And if I have no match (i.e. the festival has no gigs registered yet), what will I get the festival date from?

What I would want would be a table with festivals where each and every festival date is present, i.e. if the festival is from august 3 to august 5, I want 3 rows back, one for august 3, one for 4 and one for 5.

If I had this, I could do a UNION between my current gig select (which is more complex than it looks here) and this table with one row for each festival day. But the way I am handling festivals now, with a start and end-date is really practical and how I want it, really. So how do I fix this?

I had a few ideas. One was to get the festivals separately, and then join them with the gigs in my php code. Possible, but messy. Another idea I had, and which I actually implemented before I scrapped it, was to have a stored procedure do this job for me.

Eventually, I figured it out. What I needed was a table with X number of rows, containing the numbers 0 and up to X, X here being the maximum number of days in any festival. Then I could join the festival table and this "dummy" table and get the as many rows as there are days in the festival. Like this:
SELECT f.*,
DATE_ADD(f.`startdate`, INTERVAL il.`id` DAY) AS `eventdate`
FROM `jos_festival` AS f
JOIN `jos_intlist` AS il
ON il.`id` <= DATEDIFF(IFNULL(f.`enddate`, f.`startdate`), f.`startdate`);

As you can see, this is not a very complex SELECT, it will get me back as many rows from the festival table as there are festivals times the number of days in each festival, or to put it differently, as many rows as there are festival days. The eventdate column that is returned is the actual day of the event. So in my example above, I would get back three rows, with eventdate set to August3, 4 and 5, just the way I wanted it.

Now, the above works, but it is rather dirty. The "dummy" table has to have as many rows as the max number of festival days, so this is not really a generic solution, although it works fine in my case. But for now, it works, and I realized I could put another MySQL 5.x feature to good udse here, views. I could create a simple view that would give me back all festival days following a specific date (which is how I show my calendar, I only display upcoming gigs). The view looks like this:
CREATE OR REPLACE VIEW `jos_festivaldate` AS
SELECT f.*
DATE_ADD(f.`startdate`, INTERVAL il.`id` DAY) AS `eventdate`
FROM `jos_festival` AS f
JOIN `jos_intlist` AS il ON il.`id` <= DATEDIFF(IFNULL(f.`enddate`, f.`startdate`), f.`startdate`);

This view is now really easy to use, I just do a SELECT from this view, with a WHERE clause based on the eventdate column, and I will get the correct information back. I then join this with the existing gig SELECT, and that's it!

So what about the kludge with the fixed # of rows in the intlist table above? Well, I know how to fix that. What is needed is a storage engine, that only supports one integer column, and that will always return as many rows as the WHERE-clause determines. This should not be too difficult, and I will get back to you on this eventually. For now though, the above describes how I will remodel the current list of gigs, to be as comprehensive as I want it to be.

/Karlsson