Friday, August 29, 2008

Can a timestamp be slower than a datetime?

One should use timestamps rather than datetimes because timestamps are four bytes and datetimes are eight, right? This makes sense, as smaller indexes and tables are faster than larger ones. However, when I tried to quantify the improvement (I’m strange that way), I found the exact opposite: when used as a criterion, a timestamp is slower than a datetime. Often much slower.

For example, for this simple SQL, the query took about 9.4 seconds when purchaseDate was a timestamp and about 4.3 seconds when it was a datetime. In this case the data was in memory.

    select sum(unit) from Sale where purchaseDate >= '2001-07-01' and purchaseDate < '2001-08-01'

I ran other tests, such as when the data wasn’t cached and there was physical I/O, and found cases where datetimes were much faster than timestamps, with differences so extreme that I’m hesitant to report them until I have a better explanation of what is going on. I can’t quite believe the performance differences I’m measuring, and I suspect something else is occurring, as these numbers seem too extreme; what that might be, I can’t say.

Is there a bug here I don’t know about? It is hard to find information on timestamps, as it is such a common word, but I did scan the MySQL bug database for timestamp issues and didn’t find anything. Is it possible the extra functionality of timestamps, such as time zone conversion, makes them slower than datetimes?
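For what it’s worth, TIMESTAMP values are stored as UTC and converted to the session time zone on every read and write, while DATETIME values are stored as entered with no conversion, so per-row conversion is at least a plausible source of overhead. A minimal sketch of the behavior (tzDemo is a hypothetical scratch table, assuming a MySQL 5.x server):

```sql
-- TIMESTAMP values are stored as UTC and converted to the session
-- time zone on every read; DATETIME values are stored as entered.
CREATE TABLE tzDemo (ts timestamp, dt datetime);

SET time_zone = '+00:00';
INSERT INTO tzDemo VALUES ('2001-07-01 12:00:00', '2001-07-01 12:00:00');

SET time_zone = '+08:00';
-- The timestamp column now reads 2001-07-01 20:00:00;
-- the datetime column still reads 2001-07-01 12:00:00.
SELECT ts, dt FROM tzDemo;
```

Whether that conversion accounts for a 2x difference on an in-memory range scan is another question.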

In a few other cases I found them about the same speed. Only in one case were timestamps faster than datetimes, and that was 94 seconds for timestamps versus 97 seconds for datetimes, an insignificant difference.

I tested this on 5.1.26 and 5.0.22, and found the performance difference between timestamps and datetimes on both versions.

I'm not sure what to conclude from this. If you rarely use the date columns as criteria in queries, and the date is something like a seldom-consulted record of when an event occurred, the smaller timestamp is the better choice. If, however, many queries filter on the date in question, and assuming the results I’m getting are accurate, then a datetime column should at least be under consideration, as the results are worse for timestamps when they are used as a selection criterion. In other words, benchmark datetimes and timestamps in your environment with representative queries.
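Such a benchmark can be as simple as running the same range query against two copies of the data, one with a timestamp purchaseDate and one with a datetime (as with the SaleT and Sale tables here); SQL_NO_CACHE keeps the query cache from muddying the numbers:

```sql
-- Same data, differing only in the purchaseDate type (SaleT uses
-- timestamp, Sale uses datetime); run each several times, warm,
-- and compare the timings.
SELECT SQL_NO_CACHE sum(unit) FROM SaleT
 WHERE purchaseDate >= '2001-07-01' AND purchaseDate < '2001-08-01';

SELECT SQL_NO_CACHE sum(unit) FROM Sale
 WHERE purchaseDate >= '2001-07-01' AND purchaseDate < '2001-08-01';
```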

If you get the crazy performance differences I’m getting, or if you don’t, let me know.

For the curious, the plans for the timestamp queries looked like this:

mysql> explain select sum(unit) from SaleT where purchaseDate >= '2001-07-01'  and purchaseDate < '2001-08-01' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: SaleT
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 20267520
        Extra: Using where
1 row in set (0.04 sec)

And the datetime queries looked like this:

mysql> explain select sum(unit) from Sale where purchaseDate >= '2001-07-01'  and purchaseDate < '2001-08-01' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Sale
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 20548080
        Extra: Using where
1 row in set (0.06 sec)

And the table looks like this (the other version has a timestamp purchaseDate column instead of a datetime).

CREATE TABLE  `test`.`Sale` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY  (`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_sale_order` (`orderId`),
  KEY `pf_sale_product` (`productId`),
  KEY `pf_sale_customer` (`customerId`),
  CONSTRAINT `pf_sale_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `pf_sale_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)
) ENGINE=InnoDB

On a somewhat related note, I also got this repeatable bug when trying to insert data into a timestamp column on 5.1.26:

    Incorrect datetime value: '2001-03-11 02:00:39' for column 'purchaseDate' at row 226
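One guess on my part, which I haven’t verified against this data: a local time that falls in a daylight-saving “spring forward” gap has no UTC equivalent, so it cannot be stored in a TIMESTAMP column, and in strict SQL mode MySQL rejects it with exactly this error; a DATETIME column, which does no time zone conversion, accepts it. A sketch, using a known gap date rather than the one above, and assuming the server’s time zone tables are loaded:

```sql
-- Hypothetical: under a session time zone with a DST gap at 02:00,
-- the same literal succeeds as a datetime but fails as a timestamp
-- (in strict SQL mode).
SET time_zone = 'US/Eastern';  -- assumes the mysql time zone tables are loaded
CREATE TABLE gapDemo (ts timestamp, dt datetime);
INSERT INTO gapDemo (dt) VALUES ('2007-03-11 02:30:00');  -- accepted
INSERT INTO gapDemo (ts) VALUES ('2007-03-11 02:30:00');
-- expected: Incorrect datetime value: '2007-03-11 02:30:00' for column 'ts'
```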

Friday, August 22, 2008

InnoDB's Suitability for Reporting

I started using Oracle, an MVCC database, to develop reporting (data warehousing, BI, take your pick) systems years ago. I’ve come to appreciate the scalability improvements that MVCC provides, particularly for pseudo real-time reporting applications, the ones where loads occur at the same time as report generation. So when people say InnoDB, partly due to MVCC, isn’t as good as MyISAM for reporting, I had to look into this in more detail.

What I found is that InnoDB is a good engine for reporting. In some ways, such as performance, it is at times better than MyISAM, and one of its downsides, a larger disk requirement, can be mitigated. The trick is for the primary key to be the one predominant access path. In this example, the InnoDB clustered index is purchaseDate, with another column, orderId, added to make it unique. This has a number of advantages. In my experience, most reporting queries are for a date range, so date provides a great all-round clustered index. Also, one of the problems with indexes is that they can fragment, which both slows down queries and takes up more disk. But all new date-oriented data is added to the end of the clustered index, meaning most of the index doesn’t change, and the old data, if defragmented, remains so. Finally, as most queries are for recent data, clustering by date tends to keep the most frequently used recent data in memory, a benefit that is hard to quantify but can be substantial.

Here is the InnoDB version of the table. It holds 120 million rows, about 10 million for each month of 2001. As the server has 28 gig of memory, in all cases I ensured that these queries ran in memory and there was no physical I/O. This isn’t completely fair, as MyISAM does compact data and takes up less disk, but I’ll show later how to minimize the amount of disk InnoDB consumes. Plus, if there is enough (any? comments would be great) contention on this point, I’ll run some tests with physical I/O for both engines.

CREATE TABLE  `test`.`Sale` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY  (`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_sale_order` (`orderId`),
  KEY `pf_sale_product` (`productId`),
  KEY `pf_sale_customer` (`customerId`),
  CONSTRAINT `fk_sale_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `fk_sale_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)
) ENGINE=InnoDB

The InnoDB version of this query takes about 4 seconds and sensibly uses the primary key to limit the amount of data it needs to read. Only about 1/12 of the table needs to be scanned, and when it is read the access is sequential, which is fast.

mysql> explain select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Sale
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 19447290
        Extra: Using where

This is the equivalent MyISAM table, with a more standard primary key of orderId.

CREATE TABLE  `test`.`SaleI` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY  (`orderId`),
  KEY `pk_salei_product` (`productId`),
  KEY `pk_salei_customer` (`customerId`),
  KEY `idx_saleI_purchaseDate` (`purchaseDate`)
) ENGINE=MyISAM

At 12 seconds, the same query is much slower on MyISAM, as it must visit both the index and the data pages.

mysql> explain select sum(unit) from SaleI where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: SaleI
         type: range
possible_keys: idx_saleI_purchaseDate
          key: idx_saleI_purchaseDate
      key_len: 8
          ref: NULL
         rows: 9079951
        Extra: Using where

However, this isn’t completely fair to MyISAM, as a covering index can be written for MyISAM that replicates some of the performance of a clustered index. So I changed the table definition to add the idx_saleI_covering index.

CREATE TABLE  `test`.`SaleI` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY  (`orderId`),
  KEY `pk_salei_product` (`productId`),
  KEY `pk_salei_customer` (`customerId`),
  KEY `idx_saleI_covering` (`purchaseDate`,`productId`,`customerId`,`unit`,`purchaseAmount`,`purchaseCost`)
) ENGINE=MyISAM

With the idx_saleI_covering index the query now takes 8 seconds, still about twice as long as InnoDB.

mysql> explain select sum(unit) from SaleI where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: SaleI
         type: range
possible_keys: idx_saleI_covering
          key: idx_saleI_covering
      key_len: 8
          ref: NULL
         rows: 25370368
        Extra: Using where; Using index

How does a more selective query do, one which picks a specific date range and another criterion, in this case a customer? For InnoDB, almost zero seconds; 0.0079 seconds to be exact.

mysql> explain select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' and customerId = 50 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Sale
         type: index_merge
possible_keys: PRIMARY,pf_sale_customer
          key: pf_sale_customer,PRIMARY
      key_len: 4,8
          ref: NULL
         rows: 12
        Extra: Using intersect(pf_sale_customer,PRIMARY); Using where
1 row in set (0.00 sec)

The query plan for MyISAM is different, but the result is the same, with the query also taking 0.0079 seconds to run.

mysql> explain select sum(unit) from SaleI where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' and customerId = 50 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: SaleI
         type: ref
possible_keys: pk_salei_customer,idx_saleI_covering
          key: pk_salei_customer
      key_len: 4
          ref: const
         rows: 102
        Extra: Using where

So, this is not an extensive series of tests, but on two of the more common types of reporting queries InnoDB is either as fast as or faster than MyISAM. At this point, all other things being equal, I would much prefer the higher reliability of InnoDB over MyISAM.

But what about the size of the data on disk? MyISAM, for the version of SaleI with the covering index, comes in at 13.6 gig, 63.3% of InnoDB’s 21.5 gig. A significant advantage for MyISAM, yes, but not enough by itself to choose MyISAM over InnoDB. Part of the reason is that all the indexes are on numbers and dates, which MyISAM doesn’t compress by default. Generally, for performance reasons, indexes in a reporting system should be non-character, so non-character indexes are a reasonable assumption. But still, this comparison isn’t entirely fair, as the InnoDB table has been optimized and hasn’t had a chance to get fragmented, right? What happens if another month of data is added to the InnoDB Sale table without it being optimized? I used this SQL to populate a new month of data.

insert into Sale (orderId, customerId, productId, productBigId, unit,
                  purchaseAmount, purchaseCost, purchaseDate)
select orderId + 120000000, customerId, productId, productBigId, unit,
       purchaseAmount, purchaseCost, date_add(purchaseDate, interval 1 year)
  from Sale
 where purchaseDate >= '2001-01-01'
   and purchaseDate < '2001-02-01'

And the end result of adding one more month of data to the twelve existing months is that InnoDB now consumes 23.4 gigs. In other words, adding 8.3% more data results in 8.1% more disk consumed. The lesson is: if you can, load your data in primary key order.
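For those who want to track sizes like these themselves, data and index sizes can be read from information_schema (available in MySQL 5.0 and later); a sketch:

```sql
-- Approximate on-disk data and index sizes, in gigs, for the two tables.
SELECT table_name, engine,
       round(data_length  / 1024 / 1024 / 1024, 1) AS data_gig,
       round(index_length / 1024 / 1024 / 1024, 1) AS index_gig
  FROM information_schema.TABLES
 WHERE table_schema = 'test' AND table_name IN ('Sale', 'SaleI');
```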

However, the data will still get fragmented over time, in particular the other, non-clustered indexes. Plus, it isn’t always possible to load data in primary key order. And an OPTIMIZE TABLE always rebuilds the entire table, so it takes a while.

At least it does for 5.0. In 5.1 there is a new way to reduce InnoDB table maintenance: partitioning by date. The advantage is that instead of optimizing an entire InnoDB table, one can run ALTER TABLE ... REBUILD PARTITION on only the most recent data, so the defrag should take much less time. I haven’t done this yet, so I’m not sure it really works and can’t recommend it, but if and when it does, it is another mitigating factor against InnoDB’s larger database sizes. It could also be used to make date-selective MyISAM queries much faster. But see my http://dbscience.blogspot.com/2008/01/divide-and-be-conquered.html article on a potential performance downside of partitioning.
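A sketch of what that might look like in 5.1 (the partition names and ranges are my own invention, and the column list is abbreviated; I haven’t benchmarked this):

```sql
-- Hypothetical 5.1 layout: range partition by month so only the
-- newest partition needs rebuilding after a load.
CREATE TABLE SaleP (
  orderId int(11) NOT NULL,
  unit int(11) NOT NULL,
  purchaseDate datetime NOT NULL,
  PRIMARY KEY (purchaseDate, orderId)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(purchaseDate)) (
  PARTITION p200111 VALUES LESS THAN (TO_DAYS('2001-12-01')),
  PARTITION p200112 VALUES LESS THAN (TO_DAYS('2002-01-01')),
  PARTITION pMax    VALUES LESS THAN MAXVALUE
);

-- Defragment only the most recent month instead of the whole table.
ALTER TABLE SaleP REBUILD PARTITION p200112;
```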

This issue is more complex than one article can cover, but for those reporting applications where there is one predominant access path, such as date, the clustered indexing advantages of the InnoDB engine make it worth considering.