Wednesday, October 1, 2008

Infobright Review – Part 2

First, a retraction, it turns out that the performance problem with datatimes in the previous article wasn’t due to high cardinality (I speculated too much here), but due to a type conversion issue.  From a helpful comment from Victoria Eastwood of Infobright (a good sign for a startup), the Infobright engine considered ‘2001-01-01’ to be a date, not a datetime, and it couldn’t do a conversion to a datetime.  Instead it pushed the date filtering logic from the Infobright engine to MySQL.  Effectively, the slow queries were a table scan.   The solution is to add the 00:00:00 to the dates to make them datetimes.  

With that in mind, here are some much better numbers for Infobright.   For Infobright this query took 0.05 seconds. 

1) Select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-05-01 00:00:00'

This compares very favorably to 4 seconds for InnoDB if the data is cached and 8.8 if the data needs to be read from disk. 

In doing further tests I found some interesting patterns one might not expect coming from a row database background.  This more selective #2 query took 0.27 seconds for Infobright, longer than the less selective query #1.  Still 0.27 is fast enough.  The odd part is a row based database is generally faster, or at least the same performance, the more selective the query.  For Infobright the exact opposite occurs, at least in these cases I explored.   For this more selective query, InnoDB took 0.047 seconds with a cold cache (all following queries will be using a cold cache unless otherwise noted) compared to that 0.27 for Infobright.  It seems that the that Infobright is faster than InnoDB for less selective queries, and InnoDB is faster for more selective queries. 

2) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-05-01 00:00:00' and productId = 199824

This more selective query #3 took longer still, at 0.99 seconds for Infobright, but only took 0.047 seconds for InnoDB, the same time as query #2.  

3) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-05-01 00:00:00' and productId = 199824 and customerId = 418845

The same pattern becomes more evident with double the time range for the queries, with two months of data compared to just one.   Query #4 took 0.07 seconds for Infobright whereas InnoDB took 19.9 seconds, far longer.  

4) select sum(unit)  from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-06-01 00:00:00'

At 1.25 seconds, adding another criteria noticeably slowed down the performance from query #4 for Infobright.  At 0.061 seconds, InnoDB was much faster than the less selective query #4,

5) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-06-01 00:00:00' and productId = 199824

At 1.99 seconds, this more selective query #6 was again slower than #5 for Infobright.  Following the trend, InnoDB was even faster at 0.048 seconds.  

6) select sum(unit) from Sale where purchaseDate >= '2001-04-01 00:00:00' and purchaseDate < '2001-06-01 00:00:00' and productId = 199824 and customerId = 418845

But there is another difference between InnoDB and Infobright.  The below query #7 takes 0.06 seconds for InnoDB, and takes 11.5 seconds for Infobright with a cold cache (I restarted the database in both cases).  Once the cache was warmed up Infobright took 2.2 seconds to run and retrieve one row.  As there are 120 million different orderIds in the Sale table, perhaps my speculation that Infobright does worse the higher the cardinality of a column wasn't that off.  

7) select * from Sale where orderId = 1

At least for the cases I examined, InnoDB is better for more selective queries, at times far better, and Infobright is better for less selective ones, at times far better.  This assumes defining good indexes for InnoDB, work which Infobright does not require.  Which one to use for reporting?  It really does depend on the nature of the workload.   

Having said that, Infobright behaves quite differently from InnoDB and I know there is much I'm missing as I only put in a few hours of work.   I would try your own workload and come to your own conclusions. 

8 comments:

Mason said...

I would theorize that Infobright seems slower on more selective queries because as a column-oriented database, it needs to in effect perform an an internal join in evaluating additional criteria on other columns in the table.

hingo said...

The "more selective queries are faster on row based" is actually a great summary on the differences between column based and row based databases.

Another thing to consider, except for your first examples which are like Infobright jackpots, the advantages of Infobright (which includes load time btw) over row-based engines don't start to show before you have at least around 1TB of data. At that point the benefits of compression and column based data access start to wiegh in for real.

But even then, any query where you'd do SELECT * ... id=1; you are clearly accessing data in a row based way.

dbscience said...

Mason, I agree with you that, in effect, multiple columns from the same table need to be joined together for column based databases. It makes sense. But given the title of this site I'm hesitant to speculate too much (but I guess I just did).

Hingo, I would like to test with a TB of data, but finding that much disk for both Infobright and InnoDB on a good disk array that isn't currently in use isn't possible for me. Perhaps I could just do Infobright and show how it scales as that would only take about 100 gig or so.

Victoria said...

First of all, I am delighted to see that you have tried some of my suggestions in the previous comments to your blog post. As you can see, a bit of a difference in performance.

Also, you ran some interesting queries and with some interesting results. However my explanation would be a bit different (but then I understand the innards a bit better ;-)

You have admittedly a small volume of data (We can help you with that if you want some larger data sets to play with). With row-oriented databases and assuming you have indexes on the various columns you are using in queries, it is much faster. But as the volume grows, both the size of the database (with indexes) grows rapidly and performance degrades. Also, if your users are doing ad-hoc queries then you don’t have the opportunity to create the correct indexes, or you make them wait while you build additional indexes. Not to mention, you take a big hit on data load times with indexes. (ICE has linear load speed regardless of table size.)

ICE keeps information at a much higher level about the data in each column. It uses that information to eliminate data from the query (both columns and “chunks” of columns; what we call ‘datapacks’). Sometimes we have enough information to answer the query from this higher level information. Your Select sum(x) from y with date range is a perfect example. If you really want to see this in action, try a query that has a null result set; i.e. The conditions are not met, like product id doesn’t exist.

In other cases, like when you have added more criteria to the query, we use this information to isolate the data involved. However, since this is not a row level like an index and since the index is quite efficient at this volume of data, InnoDB shows better performance. However on much large volumes (we like playing with an 8TB instance), the story is much different.

And just on the note about column oriented vs row oriented; the big advantage to column is better compression because you can have specific algorithms for data types and you only need to access columns related to the query (both reduce disk I/O...The slowest thing going). In our case, rows are ordered the same way in each column making reconstruction easier (with little overhead).

Finally, if you want to do select *; well, you pay the price of decompression so... If you have TB’s of data, I don’t think you really want the answer to this? But I’m just teasing.

Again, thanks for trying out ICE and your feedback. Its extremely valuable to us. And if you want to try ICE on some volume, let me know. We have some sandboxes for this purpose.

hingo said...

Victoria,

You imply that selecting a full table is a bad idea, is this right? I mean of course that is *always* a bad idea, but...

Since loads are faster in Infobright due to compression, shouldn't reads be then faster as well? If I have a use case where I have to move 5TB of data in and out of a database, wouldn't Infobright be a good idea?

Victoria said...

Hingo,

I was joking about the select * bit. My point was that you can't get much useful information from 5TB of data, row by row. Normally with this amount of data you would do statistical type analysis to understand patterns in the data, etc. And from that point of view select * is not so useful.

However, select *'s still work and do come up. Our loader has been well optimized, but we have further work to optimize a full table scan type of query (which we are working on). ICE technology is all about minimizing decompression or avoid it all together. That makes queries super fast. And when we do need to decompress, with the reduced amount of I/O required, we can afford some decompression without query degradation.

As for moving 5TB in and out of a database, well that will be time consuming. And you can easily do this with ICE. I'm not sure what your use case is, but with a bit more insight we might be able to help you design an approach to move the data in compressed form -- which would change a 5TB data transfer problem into an ~500GB problem. So you would transfer the database itself rather than import/export.

Thanks again for your comments. Be glad to answer any other questions you might have. You might also visit our forums as there is all sorts of great information about ICE there.

dbscience said...

To try to summarize this all:

Infobright's main advantage is at high data volumes for queries that summarize large volumes of data.

InnoDB is better is better at low data volumes for queries that summarize small data volumes, or just select one row, using highly selective indexes.

This means, if one has a small database (100's of millions of rows) that needs some reporting the easier solution could be just to replicate the data and report on the slave.

However, for a company that has a TBs of data, Infobright deserves consideration.

Since my current company hasn't implemented the SaaS version of the product yet I'm more in the first category. But in 6 months or so I'm might take you up on the sandbox offer.

Victoria said...

I like your summary. It's spot on.

Thanks again for taking the time to review ICE. Hope to see you on the forums. Cheers.