Wednesday, October 1, 2008

Infobright Review – Part 2

First, a retraction: it turns out that the performance problem with datetimes in the previous article wasn't due to high cardinality (I speculated too much there), but to a type conversion issue.  According to a helpful comment from Victoria Eastwood of Infobright (a good sign for a startup), the Infobright engine considered '2001-01-01' to be a date, not a datetime, and it couldn't convert it to a datetime.  Instead it pushed the date filtering logic from the Infobright engine down to MySQL, so the slow queries were effectively table scans.  The solution is to add 00:00:00 to the dates to make them datetimes.
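In other words, the difference is just the literal format (a minimal illustration; query #1 below is the corrected form):

-- Slow: the bare date literals aren't converted inside the Brighthouse engine,
-- so the filtering falls back to MySQL and effectively becomes a table scan.
select sum(unit) from Sale
 where purchaseDate >= '2001-04-01' and purchaseDate < '2001-05-01';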

With that in mind, here are some much better numbers.  For Infobright, this query took 0.05 seconds.

1) Select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-05-01 00:00:00'

This compares very favorably to 4 seconds for InnoDB if the data is cached and 8.8 seconds if the data needs to be read from disk.

In doing further tests I found some interesting patterns one might not expect coming from a row database background.  The more selective query #2 took 0.27 seconds for Infobright, longer than the less selective query #1.  Still, 0.27 seconds is fast enough.  The odd part is that a row based database is generally faster, or at least no slower, the more selective the query.  For Infobright the exact opposite occurs, at least in the cases I explored.  For this more selective query, InnoDB took 0.047 seconds with a cold cache (all following queries use a cold cache unless otherwise noted) compared to that 0.27 seconds for Infobright.  It seems that Infobright is faster than InnoDB for less selective queries, and InnoDB is faster for more selective ones.

2) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-05-01 00:00:00' and productId = 199824

This more selective query #3 took longer still, at 0.99 seconds for Infobright, but only took 0.047 seconds for InnoDB, the same time as query #2.  

3) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-05-01 00:00:00' and productId = 199824 and customerId = 418845

The same pattern becomes more evident when the time range of the queries is doubled, to two months of data compared to just one.  Query #4 took 0.07 seconds for Infobright whereas InnoDB took 19.9 seconds, far longer.

4) select sum(unit)  from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-06-01 00:00:00'

At 1.25 seconds, adding another criterion noticeably slowed down performance compared to query #4 for Infobright.  At 0.061 seconds, InnoDB was much faster than on the less selective query #4.

5) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-06-01 00:00:00' and productId = 199824

At 1.99 seconds, this more selective query #6 was again slower than #5 for Infobright.  Following the trend, InnoDB was even faster at 0.048 seconds.  

6) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-06-01 00:00:00' and productId = 199824 and customerId = 418845

But there is another difference between InnoDB and Infobright.  The below query #7 takes 0.06 seconds for InnoDB, and takes 11.5 seconds for Infobright with a cold cache (I restarted the database in both cases).  Once the cache was warmed up Infobright took 2.2 seconds to run and retrieve one row.  As there are 120 million different orderIds in the Sale table, perhaps my speculation that Infobright does worse the higher the cardinality of a column wasn't that off.  

7) select * from Sale where orderId = 1

At least for the cases I examined, InnoDB is better for more selective queries, at times far better, and Infobright is better for less selective ones, at times far better.  This assumes defining good indexes for InnoDB, work which Infobright does not require.  Which one to use for reporting?  It really does depend on the nature of the workload.   

Having said that, Infobright behaves quite differently from InnoDB and I know there is much I'm missing as I only put in a few hours of work.  Try your own workload and come to your own conclusions.

Tuesday, September 23, 2008

An Infobright Review

With open source software I can install reasonably complete software and try it with my data. This way I get to see how it works in a realistic setting without having to rely on benchmarks and hoping they are a good match for my environment. And I get to do this without having to deal with commercial software sales people.

So I was glad to hear that Infobright had gone open source, as I have been wanting to test a column based database for a while. I was even happier that it was a MySQL based engine, as I would already know many of the commands. I decided to run some of the same tests I had run when comparing InnoDB and MyISAM for reporting (http://dbscience.blogspot.com/2008/08/innodb-suitability-for-reporting.html).  InnoDB performed better than MyISAM in my reporting tests, so I'm going to compare Infobright to InnoDB.

The first task is to load the data with this command.

load data infile '/dbdata/info/data/sale.dat' into table Sale fields terminated by '\t' ;

Infobright loaded 120 million rows in 8 min 2.62 seconds. Loading 14 million rows a minute is fast enough for me, considering I’m running on a commodity machine with an external disk subsystem that cost about 12k USD in total.

Next I run a simple query that takes 4 seconds to run on InnoDB when the data is cached, and 8.8 seconds when disk is involved. This type of query is common in reporting systems.

select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01'

This same query takes 151 seconds to run on Infobright. I get the message “Brighthouse: Value out of range. Query executed by MySQL engine.” I have 631,530 different purchaseDate values for 120,000,000 rows in the Sale table, so perhaps it is the high cardinality of the column that is causing the problem. I’ve heard that column based databases can have this issue. I don’t see any noticeable disk io.

But, if I run the exact query again it now takes 0.0069 seconds; obviously data is being cached at some level.
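One way to rule the MySQL query cache in or out on a repeat run is SQL_NO_CACHE (just a guess at where the caching is happening, not something verified here):

select sql_no_cache sum(unit) from Sale
 where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01';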

I'm curious if the various caches need to warm up before Infobright has fast performance. To test this I run the same query on a subset of the above date range. This sql still takes 153 seconds, so it appears only the result set is being cached.

select sum(unit) from Sale where purchaseDate >= '2001-11-02' and purchaseDate < '2001-12-01'

If I limit the result set to one customer the performance doesn’t improve, with this sql taking 205 seconds. I’m still getting the message “Brighthouse: Value out of range. Query executed by MySQL engine.”

select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' and customerId = 50

If I leave dates out of the query and just go with simple int conditions, this sql takes 1.25 seconds. I'm no longer getting the message "Brighthouse: Value out of range. Query executed by MySQL engine", which explains the better performance.

select sum(unit) from Sale where customerId = 50 and productId = 22641

The same query in InnoDB takes 0.035 seconds if the io is required and 0.0079 if the data is cached (I'm not using result set caching, I'm just letting the db cache the underlying data).

Perhaps the problem is that all my data is too high in cardinality. I have created 10,000,000 products and 1,000,000 customers, with the data being randomly generated. Is this too much for Infobright to handle? Perhaps the data is not realistic enough; perhaps it has too high a cardinality. I do have a tendency to push to extremes to find out when things get interesting (in other words, break). Yet InnoDB handled this cardinality without a problem. Plus, in my case, for the SaaS application I'm working on, a customer count of one million is low if we are a success, and 10 million customers is plausible. Storing dates to second precision isn't an unreasonable requirement, although 10 million products is undeniably unrealistic for most organizations.

In spite of that, next I think about truncating the seconds off the purchaseDate column to reduce the cardinality. Then I remember the community version doesn't allow updates (or deletes, or inserts). This is a bit too crippled for my taste, as I often want to update customer demographic information.  It concerns me that the data would have to be stored elsewhere to be updated and then the updated data, in fact the entire table, dropped and recreated in Infobright.  Having to store data twice reduces the compression advantages of Infobright and adds to the complexity of the solution, when one of the stated advantages of Infobright is simplicity compared to row based databases.
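If I did go that route, the truncation would have to happen outside Infobright, along these lines (a sketch; SaleStaging and the output file name are made up, and the Infobright table would then be dropped, recreated, and reloaded with load data infile):

select orderId, customerId, productId, productBigId, unit,
       purchaseAmount, purchaseCost,
       cast(date(purchaseDate) as datetime) as purchaseDate
  into outfile '/dbdata/info/data/sale_truncated.dat'
  fields terminated by '\t'
  from SaleStaging;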

And at that I will end this preliminary assessment. I may reduce the cardinality of the data and try the tests again.

Friday, September 19, 2008

Datetimes and Ints

In my last article I found, and others confirmed (thanks), that timestamps were much slower than datetimes. I next wondered if datetimes were slower than storing dates as ints in the UNIX timestamp format (seconds since January 1, 1970). The conclusion is that ints are a bit faster than datetimes, but not enough for me to give up all the functionality that datetimes provide.

I ran this sql 100 times against tables of about 20 gig, with the innodb_buffer_pool_size=2G and innodb_flush_method=O_DIRECT, to get physical io and make the tests reasonably realistic.

Running this datetime sql 100 times took 21.2 minutes.  The month randomly varied for each of the 100 executions to avoid caching issues. 

select sum(unit) from Sale where purchaseDate >= '2001-02-01' and purchaseDate < '2001-03-01'
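As a sketch, the month can be randomized per run with session variables along these lines (illustrative only, not the actual test harness):

set @m := floor(1 + rand() * 12);                                  -- month 1..12
set @startDate := date_add('2001-01-01', interval (@m - 1) month);
set @endDate := date_add(@startDate, interval 1 month);
select sum(unit) from Sale
 where purchaseDate >= @startDate and purchaseDate < @endDate;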

Running the equivalent int sql 100 times took 20.7 minutes, a difference of about 2.4%.

select sum(unit) from SaleTI where purchaseDate >= 981014400 and purchaseDate < 983433600
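For reference, those int boundaries are just the UNIX timestamps of the month boundaries; the exact values depend on the server time zone (these happen to match a server running at UTC-8):

select unix_timestamp('2001-02-01 00:00:00');   -- 981014400
select unix_timestamp('2001-03-01 00:00:00');   -- 983433600
select from_unixtime(981014400);                -- back to a readable datetime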

So even though a datetime takes 8 bytes and an int 4 bytes, the performance is roughly comparable.  Since I find datetimes much easier to work with than ints I'll be using datetimes. 

For the curious, here is the datetime version of Sale table:

 

CREATE TABLE `test`.`Sale` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY (`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_sale_order` (`orderId`),
  KEY `pf_sale_product` (`productId`),
  KEY `pf_sale_customer` (`customerId`),
  CONSTRAINT `pf_sale_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `pf_sale_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)
) ENGINE=InnoDB

 

And here is the int version of SaleTI table.

 

CREATE TABLE `test`.`SaleTI` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` int(11) NOT NULL,
  PRIMARY KEY (`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_saleTI_order` (`orderId`),
  KEY `idx_saleTI_product` (`productId`),
  KEY `idx_saleTI_customer` (`customerId`),
  CONSTRAINT `fk_saleTI_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `fk_saleTI_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)
) ENGINE=InnoDB

Update:  While it doesn't represent the environments I work in, I just did a run with the same sql with innodb_buffer_pool_size=28G.  The tables were now able to fit into memory.  In this case datetimes were about 1% faster than ints, which I can't explain.  I suspect the difference is just random fluctuation.  The conclusion remains the same: datetime performance is about the same as Unix ints, and datetimes have far more functionality.

Friday, August 29, 2008

Can a timestamp be slower than a datetime?

One should use timestamps rather than datetimes as timestamps are four bytes and datetimes are eight bytes, right?  This makes sense as smaller indexes and tables are faster than larger ones. However, when I was trying to quantify the improvement (I'm strange that way), I found the exact opposite - when used as a criterion, a timestamp is slower than a datetime. Often much slower.

For example, for this simple sql, when purchaseDate was a timestamp, the query took about 9.4 seconds and the datetime version took about 4.3 seconds. In this case the data was in memory.

    select sum(unit) from Sale where purchaseDate >= '2001-07-01' and purchaseDate < '2001-08-01'

I did other tests, such as when data wasn’t cached and there was io, and I found cases where datetimes were much faster than timestamps, differences so extreme I’m hesitant to report them until I have a better explanation of what is going on. I really can’t believe the performance differences I'm measuring and I have this suspicion that something else is occurring as these numbers seem too extreme, but what I can’t say.

Was there a bug for this I don’t know about? It is hard to find information on timestamps as it is such a common word, but I did scan the MySQL bugs for timestamps and didn’t find anything. Is it possible the extra functionality of timestamps, like timezone conversions, makes timestamps slower than datetimes?
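One conversion difference is easy to demonstrate, though whether it explains the gap is only a guess on my part: a timestamp is stored as UTC and converted to the session time zone on the way in and out, while a datetime is stored and returned as-is.

set time_zone = '+00:00';
select purchaseDate from SaleT limit 1;   -- rendered in UTC
set time_zone = '-08:00';
select purchaseDate from SaleT limit 1;   -- same stored value, shifted by 8 hours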

In a few other cases, I found them about the same speed; only in one case were timestamps faster than datetimes, and that was 94 seconds for timestamps and 97 seconds for datetimes, which is an insignificant difference.

I tested this on 5.1.26 and 5.0.22, and found the performance difference between timestamps and datetimes on both versions.

I'm not sure what to conclude from this. If you aren't frequently using the date oriented columns as criteria in queries, and dates are something like a rarely used record of when an event occurred, the smaller timestamp is the better choice. If, however, many queries are run against the date in question, and assuming these results I'm getting are accurate, then a datetime column should at least be under consideration. In other words, benchmark datetimes and timestamps in your environment with representative queries. The results are worse for timestamps when they are used as a selection criterion.

If you get the crazy performance differences I’m getting, or if you don’t, let me know.

 

For the curious, the plans for the timestamp queries looked like this:

mysql> explain select sum(unit) from SaleT where purchaseDate >= '2001-07-01'  and purchaseDate < '2001-08-01' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: SaleT
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 20267520
        Extra: Using where
1 row in set (0.04 sec)

And the datetime queries looked like this:

mysql> explain select sum(unit) from Sale where purchaseDate >= '2001-07-01'  and purchaseDate < '2001-08-01' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Sale
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 20548080
        Extra: Using where
1 row in set (0.06 sec)

And the table looks like this (with the other version having a purchaseDate timestamp column instead of a datetime).

 

CREATE TABLE  `test`.`Sale` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY  (`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_sale_order` (`orderId`),
  KEY `pf_sale_product` (`productId`),
  KEY `pf_sale_customer` (`customerId`),
  CONSTRAINT `pf_sale_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `pf_sale_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)
) ENGINE=InnoDB

 

On a somewhat related note, I also got this repeatable bug when trying to insert data into a timestamp column on 5.1.26:

    Incorrect datetime value: '2001-03-11 02:00:39' for column 'purchaseDate' at row 226

Friday, August 22, 2008

InnoDB's Suitability for Reporting

I started using Oracle, an MVCC database, to develop reporting (data warehousing, BI, take your pick) systems years ago.  I've come to appreciate the scalability improvements that MVCC provides, particularly for pseudo real-time reporting applications, the ones where loads are occurring at the same time as report generation.  So when people say InnoDB, partly due to MVCC, isn't as good as MyISAM for reporting, I had to look into this in more detail.

What I found is that InnoDB is a good engine for reporting.  In some ways, such as performance, it is at times better than MyISAM, and one of the downsides, the larger disk requirement, can be mitigated.  The trick is for the primary key to be the one predominant access path.  In this example, the InnoDB clustered index is purchaseDate, with another column, orderId, added to make it unique.  This has a number of advantages.  In my experience, most reporting system queries are for a date range, so date provides a great all round clustered index.  Also, one of the problems with indexes is they can fragment, which both slows down queries and takes up more disk.  But all the new date oriented data is added to the end of the clustered index, meaning most of the index doesn't change and the old data, if defragmented, remains so.  Finally, as most queries are for recent data, clustering by date tends to keep the most frequently used recent data in memory, a benefit that is hard to quantify but can be substantial.

Here is the InnoDB version of the table.  It holds 120 million rows, about 10 million for each month in 2001.  As the server has 28 gig of memory, in all cases I ensured that all these queries ran in memory and there was no physical io.  This isn’t completely fair as MyISAM does compact data and takes up less disk, but I’ll show later how to minimize the amount of disk that InnoDB consumes.  Plus, if there is enough (any? - comments would be great) contention on this point, I’ll run some tests with physical io for both engines.

 

CREATE TABLE  `test`.`Sale` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY  (`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_sale_order` (`orderId`),
  KEY `pf_sale_product` (`productId`),
  KEY `pf_sale_customer` (`customerId`),
  CONSTRAINT `fk_sale_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `fk_sale_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)
) ENGINE=InnoDB

 

The InnoDB version of this query takes 4 seconds and sensibly uses the primary key to limit the amount of data it needs to query.  This means only about 1/12 of the table needs to be read, and when it is read the access is sequential, which is fast.

 

explain select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Sale
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 19447290
        Extra: Using where

 

This is the equivalent MyISAM table with a more standard primary key of orderId. 

 

CREATE TABLE  `test`.`SaleI` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY  (`orderId`),
  KEY `pk_salei_product` (`productId`),
  KEY `pk_salei_customer` (`customerId`),
  KEY `idx_saleI_purchaseDate` (`purchaseDate`)
) ENGINE=MyISAM

 

At 12 seconds, the same query for MyISAM is much slower as it must visit both the index and the page which contains the data.

 

mysql> explain select sum(unit) from SaleI where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: SaleI
         type: range
possible_keys: idx_saleI_purchaseDate
          key: idx_saleI_purchaseDate
      key_len: 8
          ref: NULL
         rows: 9079951
        Extra: Using where

 

However, this isn't completely fair to MyISAM, as a better covering index can be written for it that replicates some of the performance of a clustered index.  So I change the table definition to add the idx_saleI_covering index.

 

CREATE TABLE  `test`.`SaleI` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY  (`orderId`),
  KEY `pk_salei_product` (`productId`),
  KEY `pk_salei_customer` (`customerId`),
  KEY `idx_saleI_covering` (`purchaseDate`,`productId`,`customerId`,`unit`,`purchaseAmount`,`purchaseCost`)
) ENGINE=MyISAM

 

With the better idx_saleI_covering index the query now takes 8 seconds, still about twice as long as InnoDB.

 

explain select sum(unit) from SaleI where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: SaleI
         type: range
possible_keys: idx_saleI_covering
          key: idx_saleI_covering
      key_len: 8
          ref: NULL
         rows: 25370368
        Extra: Using where; Using index

 

How does a more selective query do, one which picks a specific date range and another criterion, in this case customer?  For InnoDB, almost zero seconds, 0.0079 seconds to be exact.

 

explain select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' and customerId = 50 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Sale
         type: index_merge
possible_keys: PRIMARY,pf_sale_customer
          key: pf_sale_customer,PRIMARY
      key_len: 4,8
          ref: NULL
         rows: 12
        Extra: Using intersect(pf_sale_customer,PRIMARY); Using where
1 row in set (0.00 sec)

 

The query plan for MyISAM is different, but the results are the same, with the query taking 0.0079 seconds to run. 

 

mysql> explain select sum(unit) from SaleI where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' and customerId = 50 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: SaleI
         type: ref
possible_keys: pk_salei_customer,idx_saleI_covering
          key: pk_salei_customer
      key_len: 4
          ref: const
         rows: 102
        Extra: Using where

 

So, this is not an extensive series of tests, but for two of the more common types of reporting queries InnoDB is either as fast as or faster than MyISAM.  At this point, all other things being equal, I would much prefer the higher reliability of InnoDB over MyISAM.

But what about the size of the data on disk?  MyISAM, for the version of SaleI with the covering index, at 13.6 gig, is 63.3% of InnoDB's 21.5 gigs.  A significant advantage for MyISAM, yes, but not enough of itself to choose MyISAM over InnoDB.  Part of the reason is all the indexes are on numbers and dates, which MyISAM doesn't compress by default.  Generally, for performance reasons, indexes should be non-character in a reporting system, so non-character indexes are a reasonable assumption.  But still, this comparison isn't entirely fair, as the InnoDB table has been optimized and hasn't had a chance to get fragmented, right?  What happens if another month of data is added to the InnoDB Sale table without it being optimized?  I use this SQL to populate a new month of data.

 

insert into Sale (orderId, customerId, productId, productBigId, unit,
                  purchaseAmount, purchaseCost, purchaseDate)
select orderId + 120000000, customerId, productId, productBigId, unit,
       purchaseAmount, purchaseCost, date_add(purchaseDate, interval 1 year)
  from Sale
 where purchaseDate >= '2001-01-01'
   and purchaseDate <  '2001-02-01'

And the end result of adding one more month of data to twelve existing months is that InnoDB now consumes 23.4 gigs.  In other words, adding 8.3% more data results in 8.1% more disk consumed.  The lesson from this is, if you can, load your data in primary key order.
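For anyone wanting to reproduce the size numbers, one way to read them is from information_schema (a sketch; this isn't necessarily how the sizes above were measured, and the InnoDB figures it reports are estimates):

select table_name,
       round((data_length + index_length) / 1024 / 1024 / 1024, 1) as total_gig
  from information_schema.tables
 where table_schema = 'test'
   and table_name in ('Sale', 'SaleI');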

However, the data will still get fragmented over time, in particular the other non-clustered indexes.  Plus, it isn’t always possible to load data in primary key order.   A table optimize always optimizes the entire table, so this takes a while. 

At least it does for 5.0.  There is a new way with 5.1 to reduce the table maintenance of InnoDB - one can partition by date.  The advantage this provides is that instead of optimizing an entire InnoDB table, one can do an ALTER TABLE REBUILD PARTITION on only the most recent data.  This means that the table defrag should take much less time.  I've never done this yet, so I'm not sure if it really works and I can't recommend it, but if/when it does work, it is another mitigating factor against the larger database sizes of InnoDB.  It could also be used to make date selective MyISAM queries much faster.  But see my http://dbscience.blogspot.com/2008/01/divide-and-be-conquered.html article on a potential performance downside of partitioning.
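As a sketch only (untested here, as noted above), assuming the table were already partitioned by month into partitions named something like p200101 through p200112, the monthly maintenance would reduce to rebuilding just the newest partition:

alter table Sale rebuild partition p200112;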

This issue is more complex than one article can cover, but for those reporting applications where there is one predominant access path, such as date, the clustered indexing advantages of the InnoDB engine make it worth consideration.

Wednesday, July 23, 2008

SAAS Multi-tenant Databases

There are some good high level discussions on the various database architectures for implementing software as a service, so I'm not going to duplicate them.  Instead, this article is going to highlight some of the issues and tradeoffs when multiple tenants (customers of the software service provider) are put into one database.  The core of the problem is that having multiple tenants in one set of database tables can make queries that select data for only one tenant much slower.

First, I'm testing on a new system, one with 32 gig of memory.  Currently, innodb_buffer_pool_size is set to 28 gig.  I'm using the same size tables as previously; both Sale and SaleTenant have 120 million rows and are about 27 and 28 gig respectively.  This means that the tables fit into memory, so the following tests do not have an IO component.  Not perfect, but it shouldn't make a difference in the conclusions for these tests.

For the other two tables, there are 100 tenants, and 1 million customers. 

The ddl looks like this (yeah, key and constraint names don’t make sense).   First, the non-multi-tenant table. 

 

CREATE TABLE  `test`.`Sale` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY  (`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_sale_order` (`orderId`),
  KEY `pf_sale_product` (`productId`),
  KEY `pf_sale_customer` (`customerId`),
  CONSTRAINT `pf_sale_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `pf_sale_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)
) ENGINE=InnoDB

 

Then the table with a tenantId in the table. 

 

CREATE TABLE  `test`.`SaleTenant` (
  `orderId` int(11) NOT NULL,
  `tenantId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY  (`tenantId`,`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_saletenant_order` (`orderId`),
  KEY `pf_saletenant_product` (`productId`),
  KEY `pf_saletenant_customer` (`customerId`),
  CONSTRAINT `pf_saletenant_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `pf_saletenant_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`),
  CONSTRAINT `pf_saletenant_tentant` FOREIGN KEY (`tenantId`) REFERENCES `Tenant` (`tenantId`)
) ENGINE=InnoDB

 

Now, the tenant table. 

 

CREATE TABLE  `test`.`Tenant` (
  `tenantId` int(11) NOT NULL,
  `tenantName` varchar(32) NOT NULL,
  PRIMARY KEY  (`tenantId`)
) ENGINE=InnoDB

 

And a customer table with a tenantId column. 

 

CREATE TABLE  `test`.`Customer` (
  `customerId` int(11) NOT NULL default '0',
  `customerName` varchar(32) NOT NULL,
  `tenantId` int(11) NOT NULL default '0',
  PRIMARY KEY  (`customerId`),
  KEY `pf_customer_tenant` (`tenantId`),
  CONSTRAINT `pf_customer_tenant` FOREIGN KEY (`tenantId`) REFERENCES `Tenant` (`tenantId`)
) ENGINE=InnoDB

 

#1) Against a table that is clustered by tenantId, this query takes 0.5 seconds to run.  As it selects about 1/100 of the SaleTenant table, the performance isn't bad.

 

select sum(unit) from SaleTenant  where TenantId = 77;

 

And the query plan shows that the clustered index was used to scan the relevant rows.

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: SaleTenant
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 2708904
        Extra:

#2) The query against the Sale table without the tenantId is about 5 times slower, at 2.5 seconds.

 

select sum(unit)
  from Customer c
  join Sale s
    on c.customerId = s.customerId
 where c.TenantId = 77

 

The plan shows that the database drives off the customer table and joins to the sales table on the customerId column. 

 

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: c
         type: ref
possible_keys: PRIMARY,pf_customer_tenant
          key: pf_customer_tenant
      key_len: 4
          ref: const
         rows: 5040
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
         type: ref
possible_keys: pf_sale_customer
          key: pf_sale_customer
      key_len: 4
          ref: test.c.customerId
         rows: 57
        Extra:

 

3) Further qualifying by date shows where adding the tenantId to the SaleTenant clustered index further improves performance.  This sql runs in about 0.08 seconds, and is much faster than query #1, which doesn't have the more selective date criteria.

 

select sum(unit) from SaleTenant
where TenantId = 77
   and purchaseDate >= '2001-06-01'
   and purchaseDate < '2001-07-01'

The query plan shows that the clustered index of (tenantId, purchaseDate, orderId) allowed the database to quickly find the relevant rows and scan them sequentially.

 

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: SaleTenant
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 12
          ref: NULL
         rows: 150491
        Extra: Using where

 

4) As the tenant criteria drives off the Customer table, and the date criteria drives off the Sale table, and the optimizer needs to start with one of these two tables, this sql can't be as efficient as query #3.  In this case, the optimizer again (like query #2) decides to drive off the Customer table.  Unlike query #3, the extra date criteria doesn't improve performance, as the database must join to each Sale row to determine if the date is valid, and in this case only about 1/12 of the rows are relevant.  This means 11/12 of the joins are, in effect, tossed away.  Thus, it has to do the same amount of work as query #2 and runs in about the same time, 2.5 seconds, or about 31 times slower than query #3.

 

select sum(unit)
  from Customer c
  join Sale s
    on c.customerId = s.customerId
 where c.TenantId = 77
   and purchaseDate >= '2001-06-01'
   and purchaseDate < '2001-07-01'

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: c
         type: ref
possible_keys: PRIMARY,pf_customer_tenant
          key: pf_customer_tenant
      key_len: 4
          ref: const
         rows: 5040
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
         type: ref
possible_keys: PRIMARY,pf_sale_customer
          key: pf_sale_customer
      key_len: 4
          ref: test.c.customerId
         rows: 57
        Extra: Using where

 

So, obviously there are cases where adding a tenantId will make the queries run much faster.  The issue with adding a tenantId to all the tables and clustering on that column is that queries that are not limited to a tenant (or set of tenants) will be slower in some cases.  The next two sql statements show such a case.

5) In this sql a simple sum is run against the Sale table for a month, summing about 1/12 of the table.  As the table is clustered by date the query is fast and runs in about 4.4 seconds.

 

select sum(unit)
  from Sale s
where purchaseDate >= '2001-06-01'
   and purchaseDate < '2001-07-01'

 

The query plan shows the clustered index being used to scan the table. 

 

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Sale
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 26733636
        Extra: Using where

 

The equivalent query against the SaleTenant table takes 39 seconds to run.  In this case, the entire table must be scanned as there is no date oriented index.  

 

select sum(unit) from SaleTenant
where purchaseDate >= '2001-06-01'
  and purchaseDate < '2001-07-01'

The plan shows the full table scan.

 

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: SaleTenant
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 123303794
        Extra: Using where

Assuming this cross-tenant access pattern is frequent enough to be a performance problem, a potential solution is to add a covering index.  However, this might not be an issue, as the only organization likely to execute such a query would be the hosting company.
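Such a covering index would only need the date and the summed column.  A sketch (the index name is made up and this isn't benchmarked here):

alter table SaleTenant add key idx_saletenant_date_unit (purchaseDate, unit);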

Given enough tenants in a database, adding a tenantId to a multi-tenant database is recommended, as some queries are going to be at least an order of magnitude slower if you don't.  As denormalizations go, this is a reasonable one, as how often would the tenantId (owner) of a row change?  Just be aware that queries that don't include a tenantId in the criteria can be much slower.

I didn’t cover this, but partitioning by tenantId also makes sense and can make moving customers from one sharded database to another much easier.  Perhaps more about that later.

Friday, February 22, 2008

Clustered indexing and query performance

Last time I showed where partitioning could negatively impact performance, with one partitioned query being four times slower than a non-partitioned one when the data was partitioned by the same column as it was clustered by.  This time I'm going to show a way to get better performance by selecting a good clustered index. With InnoDB, the create table primary key syntax encourages one to make the clustered index the same as the primary key. For transaction systems, in many cases, this makes sense.  But there are times, particularly for reporting systems, when this isn't advisable.

To demonstrate this, two similar tables will be created where the only difference is the indexing.  The SQL below shows one of these tables, a 20 gig, 120 million row table, representing one year (about 10 million rows per month) of data.  This table is clustered by the primary key.

create table SaleOrderCluster (
  orderId int not null,
  customerId int not null,
  productId int not null,
  productBigId int not null,
  unit int not null,
  purchaseAmount decimal(16,2) not null,
  purchaseCost decimal(16,2) not null,
  purchaseDate datetime not null,
  primary key (orderId),
  key idx_SaleOrderCluster_purchaseDate (purchaseDate),
  key idx_SaleOrderCluster_product (productId),
  key idx_SaleOrderCluster_customer (customerId),
  constraint pf_SaleOrderCluster_customer foreign key (customerId) references Customer (customerId),
  constraint pf_SaleOrderCluster_product foreign key (productId) references Product (productId)
) ENGINE=InnoDB

 

But in other cases, it makes sense to cluster the data on another key. In this case, the primary key of (purchaseDate, orderId) is used to cluster the data by date, as many queries scan a range of dates. There is still the need for a constraint on the orderId column, so a unique index is created on that column. This SQL creates the 28 gig, 120 million row table.  Note that the larger clustered index results in a noticeably larger disk utilization than the previous table (28 vs 20 gig). This is because all the other indexes point to the clustered index, which means the large clustered index increases the size of all the other indexes.  In both cases the table is much larger than the 4 gig of memory assigned to MySQL.

 

CREATE TABLE  Sale (
  orderId int not null,
  customerId int(11) not null,
  productId int(11) not null,
  productBigId int(11) not null,
  unit int(11) not null,
  purchaseAmount decimal(16,2) not null,
  purchaseCost decimal(16,2) not null,
  purchaseDate datetime not null,
  primary key  (purchaseDate,orderId),
  unique key idx_sale_order (orderId),
  key pf_sale_product (productId),
  key pf_sale_customer (customerId),
  constraint pf_sale_customer foreign key (customerId) references Customer (customerId),
  constraint pf_sale_product foreign key (productId) references Product (productId)
) ENGINE=InnoDB

 

The only difference between these two tables is which columns form the clustered index.  To test the performance of these two indexes, this stored procedure is used to see how fast a single row lookup executes.

 

CREATE PROCEDURE testClusterSpeed()
begin
    declare vorderId INT default 0;
    declare loopcounter INT default 0;
    declare vcustomerId INT default 0;
    repeat
        set loopcounter = loopcounter + 1;
        set vorderId  = floor(1 + (rand()*120000000));
        set vcustomerId = (select customerId from SaleOrderCluster where orderId = vorderId);
    until loopcounter > 1000000
    end repeat;
end

The execution plan shows it uses the clustered index to find the data.  The procedure took 983 seconds to query the data.

+----+-------------+------------------+-------+---------+---------+-------+------+
| id | select_type | TABLE            | type  | KEY     | key_len | ref   | rows |
+----+-------------+------------------+-------+---------+---------+-------+------+
|  1 | SIMPLE      | SaleOrderCluster | const | PRIMARY | 4       | const | 1    |  
+----+-------------+------------------+-------+---------+---------+-------+------+

 

This is the same basic stored procedure except it is against the table with the purchasedate clustered index. 

 

CREATE PROCEDURE testNoPartitionSpeed()
begin
    declare vorderId INT default 0;
    declare loopcounter INT default 0;
    declare vcustomerId INT default 0;
    repeat
        set loopcounter = loopcounter + 1;
        set vorderId  = floor(1 + (rand()*120000000));
        set vcustomerId = (select customerId from Sale where orderId = vorderId);
    until loopcounter > 1000000
    end repeat;
end

 

The plan for this is the expected lookup by the unique idx_sale_order index.

 

+----+-------------+-------+-------+----------------+---------+-------+------+
| id | select_type | TABLE | type  | KEY            | key_len | ref   | rows |
+----+-------------+-------+-------+----------------+---------+-------+------+
|  1 | SIMPLE      | Sale  | const | idx_sale_order | 4       | const | 1    |  
+----+-------------+-------+-------+----------------+---------+-------+------+

 

As the unique index needs to point to the clustered index, this stored proc takes about three times longer to run – 2780 seconds.  Stated differently, MySQL first uses idx_sale_order to find the orderId, but as all non clustered indexes point to the clustered index, the clustered index must also be read to ultimately find the row.  This last step isn't required if the clustered index is on orderId, as the data is already organized by orderId.  If most of the queries are against orderId then using orderId as the clustered index makes sense.  Plus, the larger Sale table with the purchaseDate clustered index, which results in fewer cache hits as less of it can fit into memory, can't help performance either.

 

However, if most of the queries are against purchaseDate the performance of the two indexes flips. The below query runs against the table with the clustered index on the purchaseDate and runs in about 335 seconds. I ran this after restarting MySQL to ensure the data wasn’t in memory. If it was in memory it ran in seconds.

 

select sum(unit) from Sale
where purchaseDate < '2002-01-01'
   and purchaseDate > '2001-12-01'

 

+----+-------------+-------+-------+----------------+---------+-------+----------+
| id | select_type | TABLE | type  | KEY            | key_len | ref   | rows     |
+----+-------------+-------+-------+----------------+---------+-------+----------+
|  1 | SIMPLE      | Sale  | const | PRIMARY        | 8       |       | 28836984 |  
+----+-------------+-------+-------+----------------+---------+-------+----------+

 

When I ran the corresponding query on the table with the clustered index on OrderId, I had to kill the query after 9000 seconds, which is over 27 times longer. This is because the query plan wasn't very good, as MySQL was using the date index to find each row rather than doing a much faster table scan.

 

select sum(unit) from SaleOrderCluster
where purchaseDate < '2002-01-01'
   and purchaseDate > '2001-12-01'

+----+-------------+------------------+-------+-----------------------------------+---------+-------+----------+
| id | select_type | TABLE            | type  | KEY                               | key_len | ref   | rows     |
+----+-------------+------------------+-------+-----------------------------------+---------+-------+----------+
|  1 | SIMPLE      | SaleOrderCluster | range | idx_SaleOrderCluster_purchaseDate | 8       |       | 16055948 |  
+----+-------------+------------------+-------+-----------------------------------+---------+-------+----------+

 

This is an example where the MySQL query optimizer is a bit immature.  So, I hinted the query to do a table scan and it took 3000 seconds, still about nine times slower than the other index, but much faster than an indexed lookup on 10 million rows.  The reason this is so slow is that the data for a month is distributed throughout the table, as the table is organized by orderId, so finding one month of data means the entire table must be scanned.

 

select sum(unit) from SaleOrderCluster use index (primary)
where purchaseDate < '2002-01-01'
   and purchaseDate > '2001-12-01'

+----+-------------+------------------+--------+-----+---------+-------+----------+
| id | select_type | TABLE            | type   | KEY | key_len | ref   | rows     |
+----+-------------+------------------+--------+-----+---------+-------+----------+
|  1 | SIMPLE      | SaleOrderCluster | all    |     | 8       |       | 29216712 |  
+----+-------------+------------------+--------+-----+---------+-------+----------+

 

Proving this, a query for one month of dates takes as long as a query against the full table, as the below sql takes 3000 seconds to run as well.

 

select sum(unit) from SaleOrderCluster

+----+-------------+------------------+--------+-----+---------+-------+-----------+
| id | select_type | TABLE            | type   | KEY | key_len | ref   | rows      |
+----+-------------+------------------+--------+-----+---------+-------+-----------+
|  1 | SIMPLE      | SaleOrderCluster | all    |     | 8       |       | 129216712 |  
+----+-------------+------------------+--------+-----+---------+-------+-----------+

 

As more months are added, assuming the older months aren't deleted, the performance of this type of date range query against the SaleOrderCluster table will degrade even further.  Against the purchaseDate clustered table, the same date range queries will run in close to constant time as the table grows, assuming the amount of data for a month remains constant.  This is because, while the purchase date clustered index will be growing, the number of rows for the month being range scanned in that index will remain constant.  Partitioning by date in this case won't provide performance benefits, as the data is already organized by date.

So it seems there are two choices: either optimize for random reads by the real primary key of orderId, or optimize for date range queries.  There is a third alternative of adding a covering index on all the columns being queried, not just the purchase date, to the SaleOrderCluster table, with the first column of the index being purchaseDate.  In this case all the above queries will be fast, as both will be answered fully by an index, but the downside is that inserts against a larger covering index will be slower.  Perhaps I shall investigate that next time.
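For concreteness, a sketch of that third alternative (the index name is made up and the insert cost isn't measured here):

alter table SaleOrderCluster
  add key idx_SaleOrderCluster_covering
    (purchaseDate, productId, customerId, unit, purchaseAmount, purchaseCost);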

Wednesday, January 16, 2008

Divide and be conquered?

Over the past four articles I've demonstrated cases where a denormalized data model is faster than a normalized one, but often not by that much, and a case where a normalized data model was a bit faster than a denormalized one.  My general conclusion was that with today's optimizers one should target a normalized data model and then denormalize where it makes sense, even for reporting.  I'm not as much of a fan of the star schema, a heavily denormalized data model popularized by Ralph Kimball, as I used to be.  Star schemas are costly to build and maintain, and the time spent creating them can often be better spent on more productive optimizations, such as the creation of summary tables and better indexing.  I'm not saying denormalization doesn't make sense in some cases, just that it doesn't make sense in all cases.

Time to move on to the topic of this article, partitioning.

There are many good articles out there on the new MySQL 5.1 partitioning feature so I'm not going to repeat them.  One of my favorites is http://dev.mysql.com/tech-resources/articles/mysql_5.1_partitions.html.  However, I want to point out how much partitions can degrade performance as I haven't seen that highlighted enough.  The difference between my test and other tests I've seen is the below tests deal with larger data volumes and result in more physical disk reads.  As tables that are most likely to be partitioned are the larger tables, the tables that won't fit into memory, I believe these tests better represent reality.  Or at least my reality.  As I currently use InnoDB these tests will be run on it. 

Right below is the non-partitioned, 10.5 gig, 60 million row table, which has a clustered index (primary key) of purchaseDate and orderId.  MySQL has 1 gig of memory assigned to it, so most of the table won't fit into memory.

CREATE TABLE  sale ( 
    orderId int(11) NOT NULL, 
    customerId int(11) NOT NULL, 
    productId int(11) NOT NULL, 
    productBigId int(11) NOT NULL, 
    unit int(11) NOT NULL, 
    purchaseAmount decimal(16,2) NOT NULL, 
    purchaseCost decimal(16,2) NOT NULL, 
    purchaseDate datetime NOT NULL, 
    primary key (purchaseDate,orderId), 
    unique key idx_sale_order (orderId), 
    key pf_sale_product (productId), 
    key pf_sale_customer (customerId), 
    constraint pf_sale_product foreign key (productId) references product (productId)
) engine=InnoDB

This is the partitioned version of the above table.  The numbers in the partition by range clause represent monthly partitions of '2001-01-01', '2001-02-01', '2001-03-01', etc.  Also, the unique key on orderId is no longer possible due to a partitioning limitation, so it was changed to a non-unique index.

create table salepartition (
    orderId int(11) not null,
    customerId int(11) not null,
    productId int(11) not null,
    productBigId int(11) not null,
    unit int(11) not null,
    purchaseAmount decimal(16,2) not null,
    purchaseCost decimal(16,2) not null,
    purchaseDate datetime not null,
    primary key (purchaseDate,orderId), 
    key idx_salePartition_order (orderId),
    key pf_salePartition_product (productId),
    key pf_salePartition_customer (customerId)
) engine=InnoDB
  partition by range (to_days(purchaseDate)) 
   (partition p0 values less than (730882) engine = InnoDB, 
    partition p1 values less than (730910) engine = InnoDB, 
    partition p2 values less than (730941) engine = InnoDB, 
    partition p3 values less than (730971) engine = InnoDB, 
    partition p4 values less than (731002) engine = InnoDB, 
    partition p5 values less than (731032) engine = InnoDB, 
    partition p6 values less than (731063) engine = InnoDB, 
    partition p7 values less than (731094) engine = InnoDB, 
    partition p8 values less than (731124) engine = InnoDB, 
    partition p9 values less than (731155) engine = InnoDB, 
    partition p10 values less than (731185) engine = InnoDB, 
    partition p11 values less than (731216) engine = InnoDB, 
    partition p12 values less than maxvalue engine = InnoDB)
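For reference, the boundary values above are just TO_DAYS() of the first day of each month:

select to_days('2001-02-01');   -- 730882, the upper bound of p0 (January's data)
select to_days('2001-03-01');   -- 730910, the upper bound of p1 (February's data)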
 

What I did to test was to write two short programs that would query the same customerId column of both tables.  The below program runs against the non-partitioned table and runs in about 18 seconds.


create procedure testNoPartitionSpeed()
begin
    declare vorderId INT default 0;
    declare loopcounter INT default 0;
    declare vcustomerId INT default 0;
    repeat
        set loopcounter = loopcounter + 1;
        set vorderId  = floor(1 + (rand()*60000000));
        set vcustomerId = (select customerId from sale where orderId = vorderId);
    until loopcounter > 1000
    end repeat;
end

And the plan is simple.

+----+-------------+-------+----------+----------------------+---------+------------------+----------+
| id | select_type | TABLE | type     | KEY                  | key_len | ref              | rows     |
+----+-------------+-------+----------+----------------------+---------+------------------+----------+
|  1 | SIMPLE      | sale  | const    | 'idx_sale_order'     | 4       | const            | 1        |  
+----+-------------+-------+----------+----------------------+---------+------------------+----------+

Next I ran a similar program against the partitioned table and it ran in about 77 seconds, about four times slower.

create procedure testPartitionSpeed()
begin
    declare vorderId INT default 0;
    declare loopcounter INT default 0;
    declare vcustomerId INT default 0;
    repeat
        set loopcounter = loopcounter + 1;
        set vorderId  = floor(1 + (rand()*60000000));
        set vcustomerId = (select customerId from salepartition where orderId = vorderId);
    until loopcounter > 1000
    end repeat;
end

+----+------------+---------------+----------------------+------+--------------------+--------+-------+------+
| id | selecttype | TABLE         | partitions           | type | KEY                | keylen | ref   | rows |
+----+------------+---------------+----------------------+------+--------------------+--------+-------+------+
|  1 | SIMPLE     | salepartition | p0,p1,p2,p3,p4,p5,p6,| ref  | idx_salepart_order | 4      | const | 13   |
|    |            |               | p7,p8,p9,p10,p11,p12 |      |                    |        |       |      |   
+----+------------+---------------+----------------------+------+--------------------+--------+-------+------+

The reason for this slow performance is that the salePartition table is composed of 13 partitions, 12 with actual data, each of which has a separate idx_salePartition_customer index.  This means when a query needs to find a customerId it has to probe into all 13 partitions, which shows up as the 13 in the explain rows column.  This performance hit is why some databases have global indexes that span all partitions.  MySQL doesn't have this functionality yet, but since SQL Server only added partitioning functionality in late 2005, MySQL isn't far behind.

Querying by a non-partitioned index was a severe performance hit, but the partition by date will improve queries that limit the data by date, right?  Well, unfortunately not.  The below query runs in four seconds.

select sum(unit) from sale
where purchaseDate >= '2001-09-01'
   and purchaseDate < '2001-10-01'

+----+-------------+-------+----------+----------------------+---------+------------------+----------+
| id | select_type | TABLE | type     | KEY                  | key_len | ref              | rows     |
+----+-------------+-------+----------+----------------------+---------+------------------+----------+
|  1 | SIMPLE      | sale  | range    | PRIMARY              | 4       | null             | 9907110  |  
+----+-------------+-------+----------+----------------------+---------+------------------+----------+

And this query also runs in about four seconds.

select sum(unit) from salepartition
where purchaseDate >= '2001-09-01'
   and purchaseDate < '2001-10-01'

+----+-------------+----------------+------------+---------+---------+---------+------------------+----------+
| id | select_type | TABLE          | partitions | type    | KEY     | key_len | ref              | rows     |
+----+-------------+----------------+------------+---------+---------+---------+------------------+----------+
|  1 | SIMPLE      | salepartition  | 'p8,p9'    | range   | primary | 8       | null             | 2464709  |  
+----+-------------+----------------+------------+---------+---------+---------+------------------+----------+

Both queries run in the same amount of time as the primary key already orders and therefore limits the data by date, meaning the partition doesn't have any performance benefit in this case.

For InnoDB (other databases will behave differently, I didn't test them), where the partitioning is by the leading column of the primary key, partitioning isn't going to provide performance benefits and will instead decrease performance in some cases.  This assumes that if multiple disks are used to store a table, all of the table, partitioned or not, is striped across all the disks (as opposed to putting a partition on each disk, which is suboptimal as the load is unlikely to be evenly distributed over the disks).

There are still solid operational reasons to implement partitioning, such as making it easier to back up separate partitions.  But as others have said, if you aren't careful and don't test, you can find partitioning will degrade performance.