Friday, August 22, 2008

InnoDB's Suitability for Reporting

I started using Oracle, a MVCC database, to develop reporting (data warehousing, BI, take your pick) systems years ago.  I’ve come to appreciate the scalability improvements that MVCC provides, particularly for pseudo real-time reporting applications, the ones where loads are occurring at the same time as report generation.  So when people say InnoDB, partly due to MVCC, isn’t as good as MyISAM for reporting I had to look into this in more detail.

What I found is InnoDB is a good engine for reporting.  In some ways, such as performance, it is at times better than MyISAM, and one of the downsides, such as a larger disk requirement, can be mitigated.  The trick is to for the primary key to be the one predominant access path.  In this example, the InnoDB clustered index, is purchaseDate and another column, such as orderId is added to make it unique.  This has a number of advantages.  In my experience, most reporting systems queries are for a date range, so date provides a great all round clustered index.   Also, one of the problems with indexes is they can fragment, which both slows down queries and takes up more disk.  But all the new date oriented data is added to the end of the clustered index, meaning most of the index doesn’t change and the old data, if defragmented, remains so.   Finally, as most queries are for the recent data, clustering by date tends to keep the most frequently used recent data in memory, a benefit that is hard to quantify but can be substantial. 

Here is the InnoDB version of the table.  It holds 120 million rows, about 10 million for each month in 2001.  As the server has 28 gig of memory, in all cases I ensured that all these queries ran in memory and there was no physical io.  This isn’t completely fair as MyISAM does compact data and takes up less disk, but I’ll show later how to minimize the amount of disk that InnoDB consumes.  Plus, if there is enough (any? - comments would be great) contention on this point, I’ll run some tests with physical io for both engines.

 

CREATE TABLE  `test`.`Sale` (

  `orderId` int(11) NOT NULL,

  `customerId` int(11) NOT NULL,

  `productId` int(11) NOT NULL,

  `productBigId` int(11) NOT NULL,

  `unit` int(11) NOT NULL,

  `purchaseAmount` decimal(16,2) NOT NULL,

  `purchaseCost` decimal(16,2) NOT NULL,

  `purchaseDate` datetime NOT NULL,

  PRIMARY KEY  (`purchaseDate`,`orderId`),

  UNIQUE KEY `idx_sale_order` (`orderId`),

  KEY `pf_sale_product` (`productId`),

  KEY `pf_sale_customer` (`customerId`),

  CONSTRAINT `fk_sale_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),

  CONSTRAINT `fk_sale_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)

) ENGINE=InnoDB

 

The InnoDB version of this query takes seconds 4 seconds and sensibly uses the primary key to limit the amount of data it needs to query.   This means only about 1/12 of the table needs to be queried, and when it is read the access if sequential, which is fast.  

 

explain select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01'  \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: Sale

         type: range

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 8

          ref: NULL

         rows: 19447290

        Extra: Using where

 

This is the equivalent MyISAM table with a more standard primary key of orderId. 

 

CREATE TABLE  `test`.`SaleI` (

  `orderId` int(11) NOT NULL,

  `customerId` int(11) NOT NULL,

  `productId` int(11) NOT NULL,

  `productBigId` int(11) NOT NULL,

  `unit` int(11) NOT NULL,

  `purchaseAmount` decimal(16,2) NOT NULL,

  `purchaseCost` decimal(16,2) NOT NULL,

  `purchaseDate` datetime NOT NULL,

  PRIMARY KEY  (`orderId`),

  KEY `pk_salei_product` (`productId`),

  KEY `pk_salei_customer` (`customerId`),

  KEY `idx_saleI_purchaseDate` (`purchaseDate`)

) ENGINE=MyISAM

 

As 12 seconds, the same query for MyISAM is much slower as it must visit both the index and the page which contains the data.  

 

mysql> explain select sum(unit) from SaleI where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' \G;

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: SaleI

         type: range

possible_keys: idx_saleI_purchaseDate

          key: idx_saleI_purchaseDate

      key_len: 8

          ref: NULL

         rows: 9079951

        Extra: Using where

 

However, this isn’t completely fair to MyISAM as a better covering index for MyISAM can be written that replicates some of performance of a clustered index.   So I change the table definition to add the idx_saleI_covering index. 

 

CREATE TABLE  `test`.`SaleI` (

  `orderId` int(11) NOT NULL,

  `customerId` int(11) NOT NULL,

  `productId` int(11) NOT NULL,

  `productBigId` int(11) NOT NULL,

  `unit` int(11) NOT NULL,

  `purchaseAmount` decimal(16,2) NOT NULL,

  `purchaseCost` decimal(16,2) NOT NULL,

  `purchaseDate` datetime NOT NULL,

  PRIMARY KEY  (`orderId`),

  KEY `pk_salei_product` (`productId`),

  KEY `pk_salei_customer` (`customerId`),

  KEY `idx_saleI_covering` (`purchaseDate`,`productId`,`customerId`,`unit`,`purchaseAmount`,`purchaseCost`)

) ENGINE=MyISAM

 

With the better idx_saleI_covering index the query now takes 8 seconds, still about twice a long as InnoDB.  

 

select sum(unit) from SaleI where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: SaleI

         type: range

possible_keys: idx_saleI_covering

          key: idx_saleI_covering

      key_len: 8

          ref: NULL

         rows: 25370368

        Extra: Using where; Using index

 

How does a more selective query do, one which picks a specific date range and a another criteria, in this case customer?  For InnoDB, almost zero seconds, 0.0079 seconds to be exact. 

 

select sum(unit) from Sale where purchaseDate >= '2001-11-01'  and purchaseDate < '2001-12-01' and customerId = 50 \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: Sale

         type: index_merge

possible_keys: PRIMARY,pf_sale_customer

          key: pf_sale_customer,PRIMARY

      key_len: 4,8

          ref: NULL

         rows: 12

        Extra: Using intersect(pf_sale_customer,PRIMARY); Using where

1 row in set (0.00 sec)

 

The query plan for MyISAM is different, but the results are the same, with the query taking 0.0079 seconds to run. 

 

mysql> explain select sum(unit)   from SaleI  where purchaseDate >= '2001- -01'    and purchaseDate < '2001-12-01'    and customerId = 50 \G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: SaleI

         type: ref

possible_keys: pk_salei_customer,idx_saleI_covering

          key: pk_salei_customer

      key_len: 4

          ref: const

         rows: 102

        Extra: Using where

 

So, not an extensive series of tests, but two of the more common types of reporting queries and InnoDB is either as fast or faster than MyISAM.  At this point, all other things being equal, I would much prefer the higher reliability of InnoDB over MyISAM. 

But what about the size of the of data on disk?  MyISAM, for the version of SaleI with the covering index, at 13.6 gig, is 63.3% of InnoDB’s 21.5 gigs.   A significant advantage for MyISAM, yes, but not enough of itself to choose MyISAM over InnoDB.  Part of the reason is all the indexes are on numbers and dates, which MyISAM doesn’t compress by default.  Generally, for performance reasons, indexes should be non-character in a reporting system, so non-character indexes are a reasonable assumption.   But still, this comparison isn’t entirely fair as the InnoDB table has been optimized and hasn’t had a chance to get fragmented, right?.   What happens if another month of data into the InnoDB Sale table without it being optimized?  I use this SQL to populate a new month of data.

 

insert into Sale (orderId, customerId, productId, productBigId, unit,

                  purchaseAmount, purchaseCost, purchaseDate)

select orderId + 120000000, customerId, productId, productBigId, unit,

                  purchaseAmount, purchaseCost, date_add(purchaseDate, interval 1 year)

  from Sale

where purchaseDate  >= '2001-01-01'

   and purchaseDate <  '2001-02-01'

And the end result of adding one more month of data to twelve existing months is InnoDB now consumes 23.4 gigs.  In other words, adding 8.3% more data results in 8.1% more disk consumed.  The lesson from this is, if you can, load your data in primary key order. 

However, the data will still get fragmented over time, in particular the other non-clustered indexes.  Plus, it isn’t always possible to load data in primary key order.   A table optimize always optimizes the entire table, so this takes a while. 

At least is does for 5.0.  There is a new way with 5.1 to reduce the table maintenance of InnoDB - one can partition by date.  The advantage this provides is instead of optimizing an entire InnoDB table, one can do an ALTER TABLE REBUILD PARTITION on only the most recent data.  This means that the table defrag should take much less time.  I’ve never done this yet, so I’m not sure if it really works and I can’t recommend it, but if/when it does work, it is another mitigating factor against the larger database sizes of InnoDB.  However, is could also be used to make date selective MyISAM queries much faster.  But see my http://dbscience.blogspot.com/2008/01/divide-and-be-conquered.html article on a potential performance downside of partitioning.

This issue is more complex than this one article can cover, but for those reporting applications where is one predominant access path, such as date, the clustered indexing advantages of the InnoDB engine makes it worth consideration for reporting applications.  

1 comment:

Anonymous said...

Thanks, nice article. One other thing to bear in mind about partitioning in MySQL 5.1 is that you can't use FKEYs and any primary key index also has to contain the partition columns.