Wednesday, February 25, 2009

Is SQL Slow?

Last time I demonstrated a case where stored procedures are slow when they have to do computationally expensive work.  The more interesting question, to me at least, is how slow is SQL?  The answer to that is far more complex. 

For example, this simple SQL takes 2.2 seconds to run on MySQL.  That is dramatically faster than the 696 seconds it took a stored procedure to produce similar results. 

select customerName, count(*)
  from Salet s
  join customer c on s.customerId = c.customerId
group by customerName

As demonstrated in the previous article, the equivalent C# code took 1.4 seconds to produce the same results.  Some may find it surprising that it takes less time to ship 1 million rows out of the database and then summarize them in code than it does for MySQL to summarize the same data in the database. 
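
For reference, the client-side version amounts to streaming the joined rows and counting them in a hash table.  The sketch below isn't the code from the previous article, just a minimal illustration of the idea using MySQL Connector/NET, with a made-up connection string and class name:

using System;
using System.Collections.Generic;
using MySql.Data.MySqlClient;

class CustomerSaleCounts
{
    static void Main()
    {
        // Hypothetical connection string; adjust for your environment.
        var connStr = "server=localhost;database=test;uid=user;pwd=secret";
        var counts = new Dictionary<string, long>();

        using (var conn = new MySqlConnection(connStr))
        {
            conn.Open();
            // Ship the raw rows out of the database...
            var sql = "select c.customerName from Salet s " +
                      "join customer c on s.customerId = c.customerId";
            using (var cmd = new MySqlCommand(sql, conn))
            using (var reader = cmd.ExecuteReader())
            {
                // ...and summarize them in code, one hash lookup per row.
                while (reader.Read())
                {
                    var name = reader.GetString(0);
                    long n;
                    counts.TryGetValue(name, out n);
                    counts[name] = n + 1;
                }
            }
        }

        foreach (var pair in counts)
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}

Whether this beats the GROUP BY depends almost entirely on how cheaply the million rows can be streamed out of the database; the per-row work in code is just a hash lookup. 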

In this simple case the performance difference isn’t much and is not worth the extra code complexity.  For more complex SQL, with more joins, nested case statements, temp tables, and similar standard SQL techniques, it is often much faster to do the transformation logic in non-SQL code.   I’ve found cases where I was able to improve performance by over ten times by using C# or Java code instead of SQL.  Still, my inclination is always to see first whether I can get acceptable performance from SQL, as it generally requires less code, and usually far less.  SQL is an exceptionally expressive language for certain types of problems. 

Plus, the performance advantage of C# or Java won’t hold in all cases.  Stating the obvious, shipping out all the sales data won’t make sense for more selective queries that touch only a few sales.  In that case it makes far more sense to stay in SQL. 

Deciding where to execute code, in the database or elsewhere, is a complex problem that would require a series of articles to answer reasonably (a hint: think about using sequential IO as much as possible).  For now I just want to point out that running everything in SQL isn’t always the best-performing method. 

13 comments:

rpbouman said...

Hi!

Although I would never categorically dismiss the possibility that processing on the client can be faster than what you can achieve natively in the database, I have some reservations in this case.

I read in your earlier article that this deals with a million rows. Is that one million Sales rows, or one million customer rows? Can you also please post the EXPLAIN of your query?

TIA,

Roland

Antony said...

This is more of an implementation issue about how MySQL calculates aggregate functions.

For your query
SELECT customerName, count(*)
FROM Salet s
JOIN customer c ON s.customerId = c.customerId
GROUP BY customerName;

MySQL would be performing something like the following:

DECLARE done INT DEFAULT FALSE;
DECLARE name VARCHAR(255);
DECLARE names CURSOR FOR
  SELECT DISTINCT customerName FROM customer;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

OPEN names;
REPEAT
  FETCH names INTO name;
  IF NOT done THEN
    -- one aggregate query per customer name
    SELECT c.customerName, COUNT(*)
      FROM Salet s
      JOIN customer c ON s.customerId = c.customerId
     WHERE c.customerName = name;
  END IF;
UNTIL done END REPEAT;
CLOSE names;

(this assumes the c.customerName column does not have a unique constraint on it)

Now if you were to code the equivalent in C#, you would see a similar slowdown there too.
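
Something like this, just a rough sketch (the connection details are made up), would be the C# analogue: N + 1 round trips and N scans instead of a single pass over the data.

using System;
using System.Collections.Generic;
using MySql.Data.MySqlClient;

class PerCustomerCounts
{
    static void Main()
    {
        var connStr = "server=localhost;database=test;uid=user;pwd=secret";
        using (var conn = new MySqlConnection(connStr))
        {
            conn.Open();

            // One query for the distinct customer names...
            var names = new List<string>();
            using (var cmd = new MySqlCommand(
                "select distinct customerName from customer", conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    names.Add(reader.GetString(0));
            }

            // ...then one aggregate query per name.
            foreach (var name in names)
            {
                using (var cmd = new MySqlCommand(
                    "select count(*) from Salet s " +
                    "join customer c on s.customerId = c.customerId " +
                    "where c.customerName = @name", conn))
                {
                    cmd.Parameters.AddWithValue("@name", name);
                    var count = Convert.ToInt64(cmd.ExecuteScalar());
                    Console.WriteLine(name + ": " + count);
                }
            }
        }
    }
}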

dbscience said...

Antony, I can appreciate the issue with that code. That code isn't written in a set-oriented way, but in a procedural, one-customer-at-a-time way.

Roland, here is the default plan:


*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 10
Extra: Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: s
type: ref
possible_keys: pk_sale_customer
key: pk_sale_customer
key_len: 4
ref: test.c.customerId
rows: 55566
Extra: Using index

If I hint the SQL like this:

select STRAIGHT_JOIN customerName, count(*)
from Salet s
join customer c on s.customerId = c.customerId
group by customerName;

I get this plan.

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: s
type: index
possible_keys: pk_sale_customer
key: pk_sale_customer
key_len: 4
ref: NULL
rows: 1000197
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: c
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: test.s.customerId
rows: 1
Extra:

Both versions run in about 2.2 seconds. There are 1 million rows in the salet table, and 10 rows in the customer table.

I suspect Antony has explained the performance issue.

Anonymous said...

Hi,
might I also mention that when specifying GROUP BY, an implicit ORDER BY is executed as well.
To override it, add ORDER BY NULL.
I've seen performance gains from adding this.

Regards

dbscience said...

I tried ORDER BY NULL and in this case it didn't seem to make much of a difference, although I didn't do extensive testing.

I suspect this is because the work to sort 10 rows is far less than the work to summarize 1 million rows.

I can see how this would make a difference with more data that needs to be ordered.

Robert Hodges said...

One of the problems with moving query processing to the client is that optimization from that point on is purely manual. Depending on your data set size, what is fast now may be very slow later on when the data volumes or distributions change. Also, you will find that there are very subtle dependencies between clients and the underlying database schema that will make maintenance a lot harder. All in all it seems as if you are better off using alternative strategies like caching in the client or fixing queries so they generate better plans.

dbscience said...

Robert, I agree that if you don't have high-volume, complex problems this type of solution would be overkill. I also agree that it results in more complexity, as I stated in the article, and should only be done when performance is an issue.

But there are times when the data volumes are high and the problem to be solved is complex, times when non-SQL solutions are required for performance reasons. Even the best SQL execution plans are not fast enough for some problems.

How does one create that cache in the first place if it takes hours to create it in well-tuned SQL and one only has 30 minutes to create it?

Robert Hodges said...

Map/reduce is a great example of moving processing out of the database that fits your description perfectly. For conventional applications I try to stick with the DBMS as much as possible, because it minimizes work over time.

dbscience said...

I want to keep the data in the database because of the complex relationships that a relational database handles so well. The source of the data is still a relational database.

The problem is there are times when generic SQL isn't fast enough for the problem being solved.

This is one of the reasons ETL tools exist.

A said...
This comment has been removed by the author.
A said...

Try NitrosBase (nitrosbase.com). According to their whitepaper they are 150 times faster than TimesTen (though 50 times cheaper and free for non-commercial use) and there is no need for non-SQL code.

dbscience said...

NitrosBase looks like an in-memory database. The data volumes I work with don't fit into memory, which rules out NitrosBase.

You might want to look at MySQL Cluster if you are looking at NitrosBase.

A said...

Yes, you are right.
They promise a disk version within two months, but for now they have only an in-memory one. Nevertheless, there are lots of databases that fit in memory. I'll check their site in two months. Their tests are very encouraging.