Database Science

Tuesday, September 1, 2009

Enterprise Software as a Service (SaaS) and Partitioning

Partitioning, new with MySQL 5.1, has complicated interactions with queries and indexes. If one isn’t careful it is easy to degrade performance. For example, select queries that go with that grain (queries where partition elimination occurs) can be much quicker, but select queries that go against that grain can be much slower. Queries that go against the grain must query each partition, so for a table with 12 partitions, one query against that table can result in 12 queries, one against each of the partitions. An example of this would be a query against a month partitioned table that is looking to see how much activity a product had in the past 12 months.

The ideal partitioning scheme would be a system where all queries only needs to access data from one partition. This describes enterprise software deployed as a service where multiple enterprise tenants all exist within one database . As one enterprise tenant (think of a company like a bank, manufacturing firm, or retailer, not a consumer using facebook or twitter) only queries their own data, the enterprise tenantId provides an ideal grain on which divide up the data. This means each table that has tenant specific data must have a tenantId. A Sale table for a multi-tenant database would look like this:

CREATE TABLE Sale (
   tenantId int NOT NULL,
   orderId int NOT NULL,
   customerId int NOT NULL,
   productId int NOT NULL,
   unit int NOT NULL,
   purchaseAmount decimal(16,2) NOT NULL,
   purchaseCost decimal(16,2) NOT NULL,
   purchaseDate datetime NOT NULL,
   PRIMARY KEY (tenantId, orderId),
   KEY idx_sale_product (productId),
   KEY idx_sale_customer (customerId),
   KEY idx_sale_purchaseDate (purchaseDate)
) PARTITION BY LIST(tenantId) (
   PARTITION t1 VALUES IN (1) ENGINE=InnoDB,
   PARTITION t2 VALUES IN (2) ENGINE=InnoDB,
   PARTITION t3 VALUES IN (3) ENGINE=InnoDB,
   PARTITION t4 VALUES IN (4) ENGINE=InnoDB,
   PARTITION t5 VALUES IN (5) ENGINE=InnoDB)

If you are using InnoDB, an alternative to partitioning by tenant is to create clustered indexes by tenantId. Before MySQL had partitioning, this was a good way to implement a multi-tenant database. If you are curious about this type of solution you can find more here:

http://dbscience.blogspot.com/2008_07_01_archive.html

Both partitioning by tenant and using InnoDB clustered indexes as in the above article are roughly going to perform the same for large data volumes.

The advantage that partitioning provides is on administrative tasks like server splits. When there are too many tenants on one server and a split needs to occur the stressed out database data can be replicated to another server. After the replication there will be two servers, each with roughly half of the tenants inactive. Instead of slow and hardware consuming mass delete of now inactive tenants on a server the inactive partitions can be dropped in a second. While the server split is still painful, this makes the reallocation of tenants across servers easier and the system is fully available far earlier.

There there are the other administrate benefits with partitioning, such as dropping the data for an inactive tenant quickly.

A downside is that you have to keep the partitioning list or range current as new tenants are added. You will probably want to pre-allocate tenant partitions to avoid having to add partitions at the last moment.

However, be aware of the partitioning limitations, such as only 1024 partitions per table. This means only 1024 tenants per database, so if you store more than 1024 tenants in one database you will want to combine multiple tenants into one partition.

If you expect to overwhelm a single database server, and if you are developing enterprise software as a service that is very possible as even simple enterprise applications seem to generate terabytes of data these days, you should strongly considering partitioning tables by the tenant.

Wednesday, August 5, 2009

MySQL and MS SQL Server

Recently, MySQL had an article comparing MySQL and SQL Server at http://dev.mysql.com/tech-resources/articles/move_from_microsoft_SQL_Server.html

There is one clarification I would like to make to this article. The article states that MS SQL Server has row locking, and while this is true, MS SQL Server doesn’t always use row locking and often will resort to page locking. A page is 8k of data, so, in effect, many rows are locked at the same time even if only one row in that page is being updated. At high data throughputs, this can lead to serious lock contention and to dead locks, even though the two processes with the contention or dead lock are updating different rows. Stated differently, no matter how well you order your multi-row update processes in MS SQL Server you should expect deadlocks.

There is a way to turn off page locks on an index, but a index can’t be reorganized if page locks are off as index reorganization uses page locks. Alternatively, one can add a row lock hints to each SQL statement, but this becomes tiresome fast. The general approach to this problem is to keep each transaction small, smaller than you perhaps want, which reduces but doesn’t eliminate the page lock contention and dead locks.

Easy to use with sensible defaults, MS SQL Server is a great database; but page locking is a relic from the past when row locks consumed too much of that rare memory. In the 64 bit processor world, memory is too plentiful to worry about how much memory row locks take and page locks hurt scalability. As more recent database design, the MySQL model of only using row locks makes far more sense.

Thursday, May 14, 2009

Partitioning Presentation from MySQL Conference

It appears that my presentation isn’t available at the conference site. I’ve added it below. Sorry for the delay.

Without some context that I talked about at the conference the presentation may not make sense.

Partitioning creates a grain in the table. Select queries that go with that grain can be quicker, at times much faster, but select queries that go against that grain can be slower, at times much slower.

An example is a table that is partitioned by date and has an orderId primary key. Queries that select data by date will either be almost as fast (partitioning can add a slight overhead) as the non-partitioned table to much faster. But queries that don’t query by date will have to query all the partitions. A table partitioned 12 times, one partition for each month, results in 12 index partitions. As the indexes are partitioned by date, a query by the primary key of orderId means all twelve index partitions must be queried to look for the order in question. Querying twelve index partitions is going to be slower than just one unpartitioned table.

Having said that, there are cases where partitioning dramatically improves performance, such as optimizing a table partition by partition.

But now I’m starting to cover too much of what is in the presentation. So here it is.

http://www.slideshare.net/brooksaix/divide-and-be-conquered

Wednesday, April 1, 2009

Two New Databases Announced

The open source coderati announced a new database today called “Monsoon”. Norm “Al” Modelle said much effort was spent by the team so it can handle internet sized floods of data. He expected it to have the licensing of PostgreSQL, the multi-engine support of MySQL, the functionality of Oracle, the ease of use of MS SQL Server, and the massively parallel scalability of DB2.

Not to be outdone, Computer Dilettantes (CD) also announced a new database, the oddly named “Mud Puddle”. Eye Samme, the Executive VP in charge of databases we intend to milk for support revenue, said he knows the customer is being squeezed by licensing costs in these hard times. So CD spent considerable effort designing a real time application that constantly analyzes the customer’s use of the database. It then notifies the sales force how they can maximize revenue extraction from the customer. He said the database will have the licensing of Oracle, the ease of use of DB2, the single engine support of PostgreSQL, the scalability of MySQL, and the functionality of MS SQL Server.

Oh, and please attend MySQL partitioning session if you are thinking about implementing partitioning. I promise it will have a bit more technical content.

Wednesday, February 25, 2009

Is SQL Slow?

Last time I demonstrated a case where stored procedures are slow when they have to do computationally expensive work. The more interesting, to me at least, question is how slow is SQL? The answer to that is far more complex.

For example, this simple SQL takes 2.2 seconds to run on MySQL. This is a painfully faster than the 696 seconds it took a stored procedure to produce similar results.

select customerName, count(*)
from Salet s
join customer c on s.customerId = c.customerId
group by customerName

As demonstrated in the previous article, the equivalent C# code took 1.4 seconds to produce the same results. Some may find it surprising that it take less time to ship 1 million rows out of the database and then summarize it in code than it does for MySQL to summarize the same data in the database.

In this simple case the performance difference isn’t much and is not worth the extra code complexity. For more complex SQL, with more joins, perhaps nested case statements and temp tables, and similar standard SQL techniques, it is often much faster to do the transformation logic in non-SQL code. I’ve found cases where I was able to improve performance by over ten times by using C# or Java code over SQL. Still, my inclination is to always see if I can’t get acceptable performance from SQL first as less code is generally required, and usually far less code. SQL is an exceptionally expressive language for certain types of problems.

Plus, the performance advantage of C# or Java won’t be true in all cases. Stating the obvious, shipping out all the sales data won’t make sense for more selective queries that query only a few sales. In this case it makes far more sense to write in SQL.

Deciding where to execute code, in the database or elsewhere, is a complex problem that would require a series of articles to answer reasonably (a hint, think about using sequential IO as much as possible). For now I just want to point out that running everything in SQL isn’t always the best performing method.

Monday, February 23, 2009

Stored Procedures Are Slow Part 2

Last time I demonstrated a case where stored procedures are slow. There were a few comments that I should include selects in the stored procedure to make the tests more realistic. From experience I already knew the answer so I didn’t go into that level of detail, but since stored procedures are all about database access this is a reasonable comment.

This is a simple stored procedure that selects data and then summarizes it by customer. No one would actually write this as it is far too easy to use a SQL statement instead. Assume the logic is more complex and can’t be done easily by the standard SQL techniques of case statements, temp tables, etc, and then this makes more sense.

The end result is this stored procedure takes 696 seconds to run.

create procedure slowCounter()
begin

    declare done int default 0;
    declare curCustomerId int default 0;
    declare customerCount int default 0;
    declare slowCur cursor for select customerId from saleT;
    declare continue handler for not found
        set done = TRUE;

    create temporary table tempCount
    (
        tempCustomerId int not null,
        tempCustomerCount int not null,
        primary key (tempCustomerId)
    );

    open slowCur;
    fetch slowCur into curCustomerId ;
    while not done do
        insert into tempCount(tempCustomerId, tempCustomerCount)
            values (curCustomerId , 1)
        on duplicate key update tempCustomerCount = tempCustomerCount + 1;
        fetch slowCur into curCustomerId ;
    end while;

close slowCur;

select * from tempCount;

end

This more complex C# code (and still not fully exception proofed) only takes 1.4 seconds to run. That is nearly 1/500 the time it took the stored procedure to execute.

using System;
using System.Collections.Generic;
using MySql.Data.MySqlClient;

namespace Summarization
{
    /// <summary>
    /// Tests the performance of C# to simulate a SQL group by
    /// </summary>
    class SumTestGroup : BaseSum
    {
        protected MySqlConnection sqlConn;

        private Dictionary<Int32, String> customer =
            new Dictionary<Int32, String>();
        private Dictionary<String, Int32> customerRowCount =
            new Dictionary<String, Int32>();

        public void execute()
        {
            try
            {
                sqlConn = getConnectionMySql();
                sqlConn.Open();
                int count = 0;
                Int32 customerId;
                String customerName;

                getCustomers();

                MySqlCommand cmd = new MySqlCommand("select s.customerId from Salet s", sqlConn);
                MySqlDataReader rdr = cmd.ExecuteReader();

                while (rdr.Read())
                {
                    customerId = (Int32)rdr["customerId"];
                    customerName = customer[customerId];
                    count++;

                    if (customerRowCount.ContainsKey(customerName))
                    {
                        customerRowCount[customerName]++;
                    }
                    else
                    {
                        customerRowCount[customerName] = 1;
                    }
                    //Console.WriteLine("Count: " + count);
                }

                foreach (KeyValuePair<String, Int32> customerRow in customerRowCount)
                {
                    Console.WriteLine("Customer: " + customerRow.Key + " Customer Count: " + customerRow.Value);
                }
                rdr.Close();

            }
            finally
            {
                sqlConn.Close();
            }
        }

        private void getCustomers()
        {
            MySqlCommand cmd = new MySqlCommand("select customerId, customerName from customer", sqlConn);
            MySqlDataReader rdr = cmd.ExecuteReader();

            Int32 customerId;
            String customerName;

            while (rdr.Read())
            {
                if (!rdr.IsDBNull(0))
                {
                    customerId = (Int32)rdr["customerId"];
                    customerName = (String)rdr["customerName"];
                    customer.Add(customerId, customerName);
                }
            }
            rdr.Close();
        }
    }
}

Both produce results like this:

customer Name 0, 100287
customer Name 1, 100009
customer Name 2, 100130
customer Name 3, 99655
customer Name 4, 100134
customer Name 5, 100102
customer Name 6, 99854
customer Name 7, 100172
customer Name 8, 99846
customer Name 9, 99812

In this case, the effort required to retrieve one million rows out of the database and then process them takes orders of magnitude less than a stored procedure takes to process the data inside the database.

No one would actually write this code when a simple SQL alternative is possible. But given complex logic that isn’t possible or easy to express in SQL that needs to be applied against a large volume of data, using a stored procedure cursor (or loop in general) isn’t a fast way to go. This is odd as the whole point of stored procedures is to write code that deals with data contained in a database.

Friday, February 13, 2009

Are MySQL stored procedures slow?

Yes, if compared to code in Java or C#. For example, this overly simple code took 284 seconds.

CREATE PROCEDURE CountTest()
begin
    declare counter int default 0;
    select now();
    repeat
        set counter = counter + 1;
    until counter > 120000000
    end repeat;
    select counter;
    select now();
end

Ignoring my off by one error, here is equivalent code in C# (the language I’m currently learning). It took 419 milliseconds, or MySQL took 677 times longer to execute. From my experience, Java isn’t going to be any slower.

int counter = 0;
while (counter < 120000000)
{
counter++;
}

Slow stored procedure performance is one of the reasons why it usually isn’t wise to implement computationally expensive business logic in the database. With networks not being the bottleneck they once were it is often better to extract the data to the application layer and process it in the programming language of your choice, even if that takes a few extra round trips to the database. There are exceptions where those extra round trips are too numerous and prohibitively expensive. With something as complex as a database there are always exceptions.

This example is a bit too simple, but, based on experience, it is representative of the general performance of stored procedure logic.

How can MySQL get away with such poor performance? Well, the equivalent code in MS SQL Server took 80 seconds on the same hardware, which is also two orders of magnitude slower than C# or Java code.

create procedure testCount
as
begin
    declare @countNo int;
    set @countNo = 0;
    while (@countNo < 120000000)
    begin
        set @countNo = @countNo + 1
    end
    select @countNo
end

MS SQL Server code runs 3.5 times faster than MySQL, but given how much older MS SQL Server is, MySQL is doing well here. So are MySQL Stored procedures slow? Not really, if compared to other databases I’ve used.

Just don’t use them to do computationally expensive business logic.