Friday, February 22, 2008

Clustered indexing and query performance

Last time I showed where partitioning could negatively impact performance, with one partitioned query being four times slower than a non-partitioned one when the data was partitioned by the same column it was clustered by. This time I'm going to show a way to get better performance by selecting a good clustered index. With InnoDB, the create table primary key syntax encourages one to make the clustered index the same as the primary key. For transactional systems this makes sense in many cases, but there are times, particularly for reporting systems, when it isn't advisable.

To demonstrate this, two similar tables will be created where the only difference is the indexing. The SQL below shows the first of these tables, a 20 gig, 120 million row table representing one year of data (about 10 million rows per month). This table is clustered by the primary key.

create table SaleOrderCluster (

  orderId int not null,

  customerId int not null,

  productId int not null,

  productBigId int not null,

  unit int not null,

  purchaseAmount decimal(16,2) not null,

  purchaseCost decimal(16,2) not null,

  purchaseDate datetime not null,

  primary key (orderId),

  key idx_SaleOrderCluster_purchaseDate (purchaseDate),

  key idx_SaleOrderCluster_product (productId),

  key idx_SaleOrderCluster_customer (customerId),

  constraint pf_SaleOrderCluster_customer foreign key (customerId) references Customer (customerId),

  constraint pf_SaleOrderCluster_product foreign key (productId) references Product (productId)

) ENGINE=InnoDB

 

But in other cases it makes sense to cluster the data on another key. Here a primary key of (purchaseDate, orderId) is used to cluster the data by date, as many queries scan a range of dates. There is still the need for a uniqueness constraint on the orderId column, so a unique index is created on it. This SQL creates the 28 gig, 120 million row table. Note that the larger clustered index results in noticeably larger disk utilization than the previous table (28 vs 20 gig). This is because all the other indexes point to the clustered index, which means a large clustered index increases the size of all the other indexes. In both cases the table is much larger than the 4 gig of memory assigned to MySQL.

 

CREATE TABLE  Sale (

  orderId int not null,

  customerId int(11) not null,

  productId int(11) not null,

  productBigId int(11) not null,

  unit int(11) not null,

  purchaseAmount decimal(16,2) not null,

  purchaseCost decimal(16,2) not null,

  purchaseDate datetime not null,

  primary key  (purchaseDate,orderId),

  unique key idx_sale_order (orderId),

  key pf_sale_product (productId),

  key pf_sale_customer (customerId),

  constraint pf_sale_customer foreign key (customerId)  references Customer (customerId),

  constraint pf_sale_product foreign key (productId)  references  Product (productId)

) ENGINE=InnoDB
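For reference, the on-disk sizes quoted above (20 vs 28 gig) can be compared from information_schema. This is just a sketch of the sort of query that could be used, assuming both tables live in a schema named test:

select table_name,
       round(data_length / 1024 / 1024 / 1024, 1) as data_gb,   -- for InnoDB, the clustered index
       round(index_length / 1024 / 1024 / 1024, 1) as index_gb  -- the secondary indexes
from information_schema.tables
where table_schema = 'test'
  and table_name in ('SaleOrderCluster', 'Sale')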

 

The only difference between these two tables is which columns make up the clustered index. To test the performance of the two designs, this stored procedure is used to see how fast a single-row lookup executes.

 

CREATE PROCEDURE testClusterSpeed()

begin

    declare vorderId INT default 0;

    declare loopcounter INT default 0;

    declare vcustomerId INT default 0;

    repeat

        set loopcounter = loopcounter + 1;

        set vorderId  = floor(1 + (rand()*120000000));

        set vcustomerId = (select customerId from SaleOrderCluster where orderId = vorderId);

    until loopcounter > 1000000

    end repeat;

end
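The post doesn't show how the procedure is created and timed; a minimal way to do it from the mysql client, which requires switching the statement delimiter and reports the elapsed time of the call, would be something like this (my own assumption about the setup, not shown in the original):

delimiter //

create procedure testClusterSpeed()
begin
    -- body exactly as shown above
end //

delimiter ;

call testClusterSpeed();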

The execution plan shows it uses the clustered index to find the data. The procedure took 983 seconds to run.

+----+-------------+------------------+-------+---------+---------+-------+------+
| id | select_type | TABLE            | type  | KEY     | key_len | ref   | rows |
+----+-------------+------------------+-------+---------+---------+-------+------+
|  1 | SIMPLE      | SaleOrderCluster | const | PRIMARY | 4       | const | 1    |  
+----+-------------+------------------+-------+---------+---------+-------+------+

 

This is the same basic stored procedure, except it runs against the table with the purchaseDate clustered index.

 

CREATE PROCEDURE  testNoPartitionSpeed()

begin

    declare vorderId INT default 0;

    declare loopcounter INT default 0;

    declare vcustomerId INT default 0;

    repeat

        set loopcounter = loopcounter + 1;

        set vorderId  = floor(1 + (rand()*120000000));

        set vcustomerId = (select customerId from Sale where orderId = vorderId);

    until loopcounter > 1000000

    end repeat;

end

 

The plan for this is the expected lookup by the unique idx_sale_order index.

 

+----+-------------+-------+-------+----------------+---------+-------+------+
| id | select_type | TABLE | type  | KEY            | key_len | ref   | rows |
+----+-------------+-------+-------+----------------+---------+-------+------+
|  1 | SIMPLE      | Sale  | const | idx_sale_order | 4       | const | 1    |  
+----+-------------+-------+-------+----------------+---------+-------+------+

 

As the unique index entries point to the clustered index, this stored proc takes about three times longer to run, 2780 seconds. Stated differently, MySQL first uses idx_sale_order to find the orderId, but as all non-clustered indexes point to the clustered index, the clustered index must also be read to ultimately find the row. This last step isn't required when the table is clustered by orderId, as the data is already organized by orderId. So if most of the queries are against orderId, then using orderId as the clustered index makes sense. The larger size of the Sale table with its purchaseDate clustered index doesn't help either, as less of it fits into memory, resulting in fewer cache hits.
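One way to see the extra step is that an InnoDB secondary index entry also stores the primary key columns, so a query that only needs primary key columns can be answered from the secondary index alone, while anything else forces the additional read of the clustered index. A quick sketch against the Sale table (the orderId value is arbitrary):

explain select purchaseDate from Sale where orderId = 1234
-- should show "Using index" in the Extra column, as purchaseDate is part of
-- the primary key stored in every idx_sale_order entry

explain select customerId from Sale where orderId = 1234
-- customerId isn't in the index, so the row must also be fetched from the
-- clustered index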

 

However, if most of the queries are against purchaseDate, the performance of the two designs flips. The query below runs against the table with the clustered index on purchaseDate and finishes in about 335 seconds. I ran it after restarting MySQL to ensure the data wasn't in memory; when the data was in memory it ran in seconds.

 

select sum(unit) from Sale

where purchaseDate < '2002-01-01'

   and purchaseDate > '2001-12-01'

 

+----+-------------+-------+-------+----------------+---------+-------+----------+
| id | select_type | TABLE | type  | KEY            | key_len | ref   | rows     |
+----+-------------+-------+-------+----------------+---------+-------+----------+
|  1 | SIMPLE      | Sale  | range | PRIMARY        | 8       |       | 28836984 |
+----+-------------+-------+-------+----------------+---------+-------+----------+
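As a side note on the cold cache, one way to confirm the range scan was reading from disk rather than from the buffer pool is to compare the InnoDB counters before and after the query; this is my own check, not something from the original test:

show global status like 'Innodb_buffer_pool_read%'
-- Innodb_buffer_pool_reads counts pages read from disk, while
-- Innodb_buffer_pool_read_requests counts logical reads; on a cold run
-- Innodb_buffer_pool_reads climbs along with the query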

 

When I ran the corresponding query against the table clustered by orderId, I had to kill it after 9000 seconds, over 27 times longer. The query plan wasn't very good: MySQL was using the date index to look up each row individually rather than doing a much faster table scan.

 

select sum(unit) from SaleOrderCluster

where purchaseDate < '2002-01-01'

   and purchaseDate > '2001-12-01'

+----+-------------+------------------+-------+-----------------------------------+---------+-------+----------+
| id | select_type | TABLE            | type  | KEY                               | key_len | ref   | rows     |
+----+-------------+------------------+-------+-----------------------------------+---------+-------+----------+
|  1 | SIMPLE      | SaleOrderCluster | range | idx_SaleOrderCluster_purchaseDate | 8       |       | 16055948 |  
+----+-------------+------------------+-------+-----------------------------------+---------+-------+----------+

 

This is an example where the MySQL query optimizer is a bit immature. So I hinted the query to do a table scan and it took 3000 seconds, still about nine times slower than the purchaseDate clustered index, but much faster than an indexed lookup of 10 million rows. The reason it is still so slow is that the data for a month is distributed throughout the table, as the table is organized by orderId, so finding one month of data means the entire table must be scanned.

 

select sum(unit) from SaleOrderCluster use index (primary)

where purchaseDate < '2002-01-01'

   and purchaseDate > '2001-12-01'

+----+-------------+------------------+--------+-----+---------+-------+----------+
| id | select_type | TABLE            | type   | KEY | key_len | ref   | rows     |
+----+-------------+------------------+--------+-----+---------+-------+----------+
|  1 | SIMPLE      | SaleOrderCluster | all    |     | 8       |       | 29216712 |  
+----+-------------+------------------+--------+-----+---------+-------+----------+

 

Proving this, querying for one month of dates takes as long as querying the full table, as the SQL below also takes 3000 seconds to run.

 

select sum(unit) from SaleOrderCluster

+----+-------------+------------------+--------+-----+---------+-------+-----------+
| id | select_type | TABLE            | type   | KEY | key_len | ref   | rows      |
+----+-------------+------------------+--------+-----+---------+-------+-----------+
|  1 | SIMPLE      | SaleOrderCluster | all    |     | 8       |       | 129216712 |  
+----+-------------+------------------+--------+-----+---------+-------+-----------+

 

As more months are added, assuming the older months aren't deleted, the performance of this type of date range query against the SaleOrderCluster table will degrade even further. Against the purchaseDate clustered table the same date range queries will run in close to constant time as the table grows, assuming the amount of data for a month remains constant. This is because, while the purchaseDate clustered index will keep growing, the number of rows for the month being range scanned will remain constant. Partitioning by date in this case won't provide performance benefits, as the data is already organized by date.
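To make the constant time claim concrete, the same shape of query for any later month should touch roughly the same number of rows in the purchaseDate clustered table no matter how large it has grown; the month below is hypothetical, purely to show the form of the query:

select sum(unit) from Sale
where purchaseDate < '2002-07-01'
   and purchaseDate > '2002-06-01'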

So it seems there are two choices: either optimize for random reads by the real primary key of orderId, or optimize for date range queries. There is a third alternative of adding a covering index to the SaleOrderCluster table on all the columns being queried, not just the purchase date, with the first column of the index being purchaseDate. In that case all the above queries will be fast, as each can be answered entirely from an index, but the downside is that inserts against a larger covering index will be slower. Perhaps I shall investigate that next time.
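A sketch of that third alternative, with an index name of my own choosing; for the sum(unit) date range query above, an index on (purchaseDate, unit) would be enough for the query to be answered entirely from the index:

alter table SaleOrderCluster
  add key idx_SaleOrderCluster_cover (purchaseDate, unit)
-- the date range sum should then show this index in the plan with
-- "Using index" in the Extra column, and no reads of the clustered index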
