Wednesday, October 1, 2008

Infobright Review – Part 2

First, a retraction, it turns out that the performance problem with datatimes in the previous article wasn’t due to high cardinality (I speculated too much here), but due to a type conversion issue.  From a helpful comment from Victoria Eastwood of Infobright (a good sign for a startup), the Infobright engine considered ‘2001-01-01’ to be a date, not a datetime, and it couldn’t do a conversion to a datetime.  Instead it pushed the date filtering logic from the Infobright engine to MySQL.  Effectively, the slow queries were a table scan.   The solution is to add the 00:00:00 to the dates to make them datetimes.  

With that in mind, here are some much better numbers for Infobright.   For Infobright this query took 0.05 seconds. 

1) Select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-05-01 00:00:00'

This compares very favorably to 4 seconds for InnoDB if the data is cached and 8.8 if the data needs to be read from disk. 

In doing further tests I found some interesting patterns one might not expect coming from a row database background.  This more selective #2 query took 0.27 seconds for Infobright, longer than the less selective query #1.  Still 0.27 is fast enough.  The odd part is a row based database is generally faster, or at least the same performance, the more selective the query.  For Infobright the exact opposite occurs, at least in these cases I explored.   For this more selective query, InnoDB took 0.047 seconds with a cold cache (all following queries will be using a cold cache unless otherwise noted) compared to that 0.27 for Infobright.  It seems that the that Infobright is faster than InnoDB for less selective queries, and InnoDB is faster for more selective queries. 

2) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-05-01 00:00:00' and productId = 199824

This more selective query #3 took longer still, at 0.99 seconds for Infobright, but only took 0.047 seconds for InnoDB, the same time as query #2.  

3) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-05-01 00:00:00' and productId = 199824 and customerId = 418845

The same pattern becomes more evident with double the time range for the queries, with two months of data compared to just one.   Query #4 took 0.07 seconds for Infobright whereas InnoDB took 19.9 seconds, far longer.  

4) select sum(unit)  from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-06-01 00:00:00'

At 1.25 seconds, adding another criteria noticeably slowed down the performance from query #4 for Infobright.  At 0.061 seconds, InnoDB was much faster than the less selective query #4,

5) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-06-01 00:00:00' and productId = 199824

At 1.99 seconds, this more selective query #6 was again slower than #5 for Infobright.  Following the trend, InnoDB was even faster at 0.048 seconds.  

6) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-06-01 00:00:00' and productId = 199824 and customerId = 418845

But there is another difference between InnoDB and Infobright.  The below query #7 takes 0.06 seconds for InnoDB, and takes 11.5 seconds for Infobright with a cold cache (I restarted the database in both cases).  Once the cache was warmed up Infobright took 2.2 seconds to run and retrieve one row.  As there are 120 million different orderIds in the Sale table, perhaps my speculation that Infobright does worse the higher the cardinality of a column wasn't that off.  

7) select * from Sale where orderId = 1

At least for the cases I examined, InnoDB is better for more selective queries, at times far better, and Infobright is better for less selective ones, at times far better.  This assumes defining good indexes for InnoDB, work which Infobright does not require.  Which one to use for reporting?  It really does depend on the nature of the workload.   

Having said that, Infobright behaves quite differently from InnoDB and I know there is much I'm missing as I only put in a few hours of work.   I would try your own workload and come to your own conclusions.