Tuesday, September 23, 2008

An Infobright Review

With open source software I can install reasonably complete software and try it with my data. This way I get to see how it works in a realistic setting without having to rely on benchmarks and hoping they are a good match for my environment. And I get to do this without having to deal with commercial software sales people.

So I glad to hear the Infobright had gone open source as I have been wanting test a column based database for a while. I was even happier that it was a MySQL based engine as I would already know many of the commands. I decided to run some of the same tests I had run when comparing InnoDB and MyISAM for reporting (http://dbscience.blogspot.com/2008/08/innodb-suitability-for-reporting.html ).  InnoDB performed better than MyISAM in my reporting tests so I’m going to compare Infobright to InnoDB.

The first task is to load the data with this command.

load data infile '/dbdata/info/data/sale.dat' into table Sale fields terminated by '\t' ;

Infobright loaded 120 million rows in 8 min 2.62 seconds. Loading 14 million rows a minute is fast enough for me, considering I’m running on a commodity machine with an external disk subsystem that cost about 12k USD in total.

Next I run a simple query that takes 4 seconds to run on InnoDB when the data is cached, and 8.8 when disk is involved. This type of query is common in reporting systems.

select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01'

This same query takes 151 seconds to run on Infobright. I get the message “Brighthouse: Value out of range. Query executed by MySQL engine.” I have 631,530 different purchaseDate values for 120,000,000 rows in the Sale table, so perhaps it is the high cardinality of the column that is causing the problem. I’ve heard that column based databases can have this issue. I don’t see any noticeable disk io.

But, if I run the exact query again it now takes 0.0069 seconds; obviously data is being cached at some level.

I’m curious if the various caches need to warm up before the Infobright has fast performance. To test this I run the same query on a subset of the above date. This sql still takes 153 seconds. So it appears only the result set is being cached.

select sum(unit) from Sale where purchaseDate >= '2001-11-02' and purchaseDate < '2001-12-01'

If I limit the result set to one customer the performance doesn’t improve, with this sql taking 205 seconds. I’m still getting the message “Brighthouse: Value out of range. Query executed by MySQL engine.”

select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' and customerId = 50

If I ignore including dates in the query and just go with simple ints conditions this sql takes 1.25 seconds. I’m no longer getting the message “Brighthouse: Value out of range. Query executed by MySQL engine”, which explains the better performance.

select sum(unit) from Sale where customerId = 50 and productId = 22641

The same query in InnoDB takes 0.035 seconds if the io is required and 0.0079 if the data is cached (I'm not using result set caching, I'm just letting the db cache the underlying data).

Perhaps the problem is all my data is too high in cardinality. I have created 10,000,000 products and 1,000,000 customers, with the data being randomly generated. Is this too much for Infobright to handle? Perhaps the data is not realistic enough; perhaps it has too high a cardinality. I do have a tendency to push to extremes to find out when things get interesting (in other words, break). Yet InnoDB handled this cardinality without a problem. Plus, in my case, for the SaaS application I’m working on, the customer count is low if we are a success and 10 million customers is plausible. Storing dates to the second precision isn't an unreasonable requirement, although 10 million products is undeniably unrealistic for most organizations. 

In spite of that, next I think about truncating the seconds on the purchaseDate column to reduce the cardinality. Then I remember the community version doesn’t allow updates (or deletes, or inserts). This is a bit too crippled for my taste for I often want to update customer demographic information.  This concerns me, as the data would have to be stored elsewhere to be updated and then the updated data, in fact the entire table, dropped and recreated in Infobright.  Having to store data twice reduces the compression advantages of Infobright and adds to the complexity of the solution, when one of the stated advantages of Infobright is simplicity compared to row based databases. 

And at that I will end this preliminary assessment. I may reduce the cardinality of the data and try the tests again.

5 comments:

Victoria Eastwood said...

I'm glad to see that you have taken the time to try out ICE. We are very excited to be part of the open source community and especially the MySQL community.

The first issue you ran into is that ICE expects the constant to match the field type. In this case, your field is likely a DATETIME and the constant is DATE so ICE doesn't know how to interpret it; hence it passes it to the MySQL optimizer for execution. If you added the time part to the constant (i.e. 2001-12-01 00:00:00) then the query should run much better. We plan to correct this behavior so its an item on the product roadmap.

When you see the message that "Brighthouse: Value out of range. Query executed by MySQL engine." -- this means that ICE can't handle the query for some reason and hence is using the MySQL optimizer which is row-oriented and bypasses our optimization. The MySQL optimizer generally does a full table scan because there are no indexes.

On your concerns regarding UPDATE, most of our ICE users use their ETL process to manage changing dimensions. Given our load speed, this is quite workable and more flexible - and most ICE users would never change a fact table other than to roll back a load, which ICE allows you to do.

There is plenty of information in our forums, including areas to post feature requests for consideration. I invite you to join our community.

Phantal said...

I have a few comments on this:

:- On the date stuff, InfoBright, Vertica, and other column stores are *usually* designed with the premise that dimension tables (date, time, etc) exist, date/time keys in fact tables are just integers and the only relation to the actual day or time they represent is their ordering (date/time dimensions are the only dimensions the Kimball group recommends there be *any* logic behind the keys). My experience in the past has been: mysql sucks at constraining queries based on date, datetime, or time fields, regardless of the engine.
:- Victoria: This doesn't explain why queries executed using simple integers ran slower than the innodb query. I suspect the answer lies in the amount of data used. InfoBright was built for the purpose of handling 10s or 100s of terabytes. For low-end queries like this I still would have expected -- based on claims made -- that ICE would beat InnoDB, but I would imagine on a huge dataset InnoDB wouldn't be able to keep up -- again, based on claims made.

-Brian

Anonymous said...

I am disappointed in these results. I was initially pretty excited about Infobright, but it sounds like it may not be as powerful as claimed. Open source or not, if it cannot beat InnoDB I don't know how useful it is.

I would be interested in seeing results making the changes Victoria suggested.

Also it would be great if Infobright checked the configuration, etc., to see if anything was misconfigured.

Great post, and thanks for taking the time to do the tests.

dbscience said...

Great to hear feedback from Infobright. This is a good sign.

First, for the configuration comment, there are only three configuration options and all of those are memory based. I used the middle setting, the 16 gig version. Given compression, all of my data should be in memory.

Second, given the choice in functionality between an intelligent date in int form or Mysql datetime form I pick the datetime form as it has so much more functionality. For the reporting systems I've worked on I've never found a need for a date dimension, but that may be a reflection of the simplicity of the start up environments I've worked in. This is one of the places where I happily disagree with Kimball.

Third, I'll try the suggestion on better defining the datetime with and additional ’00:00:00’. The performance difference makes sense if full table scan is being used as InnoDB was only doing range scan of 1/12 of the table.

And finally, I still disagree with the lack of updating functionality. I agree that fact tables usually don't have to be updated, but the example I gave was of customer dimensional data, which often has to be updated. This means that another version of the customer data would have to be kept which could be updated, and the customer data would have to be reloaded each time.

Victoria said...

ICE uses a technology called the knowledge grid instead of indexes. It is designed for large volume databases. So if you are not above 500 GB’s of data or 400-500 million rows in a table, then use InnoDB! You can tune it easily enough for the small volumes and its very efficient. But when you start approaching these sizes, performance will drop off considerable and the value of ICE really shines. Judging by your comments about memory, it would appear that you are working with small volumes. Brian, to your point, data volume has a lot to do with whether you will see value with ICE.

I’m not going to comment on the need for date dimensions, but one technique for dates that can be very effective is to the split the field into day, month, year; it is fair easier for a business user to query on the individual fields and they don’t have to worry about specifying the start and end DATETIME exactly. Having said this, we are working on fixing the code so a DATETIME comparison can be done without specifying the time component. I should also point out that ICE is limited in its support for functions. It is something we are currently working on. If you go to http://www.infobright.org/forums/viewthread/107/ ; you will see what syntax we support.

With the knowledge grid we quickly eliminate data so we only only deal with the data that is required for a query. You can think of the knowledge grid as a collection of statistics about the data in subsets of columns and rows. In a sense we try to extract the interesting information about the data at a much higher level than the individual row. This information is very good for eliminating data when there are conditions or constraints on a query. The more the better. So, don’t judge all storage engines the same way. MySQL is not particularly good at subquery processing and we are not perfect but much better than standard MySQL.

Remember most storage engines use the MySQL optimizer, but we use our own (for syntax we have implemented). Hence we don’t necessarily share the same issues.

As for UPDATE, we are all entitled to our opinions ;-) We value community feedback. Please feel free to post to our features request forum.

Lastly, I’m delighted that you are looking at ICE. We think it has considerable value for the use case (data warehousing) domain it addresses. Of course, there is lots more to do. Help us figure out what is essential!

Victoria