With open source software I can install a reasonably complete product and try it with my own data. That way I get to see how it performs in a realistic setting, without having to rely on benchmarks and hope they are a good match for my environment. And I get to do this without having to deal with commercial software salespeople.
So I was glad to hear that Infobright had gone open source, as I have been wanting to test a column based database for a while. I was even happier that it was a MySQL based engine, as I would already know many of the commands. I decided to run some of the same tests I had run when comparing InnoDB and MyISAM for reporting (http://dbscience.blogspot.com/2008/08/innodb-suitability-for-reporting.html). InnoDB performed better than MyISAM in my reporting tests, so I’m going to compare Infobright to InnoDB.
The first task is to load the data with this command:
load data infile '/dbdata/info/data/sale.dat' into table Sale fields terminated by '\t' ;
Infobright loaded 120 million rows in 8 minutes 2.62 seconds. Loading nearly 15 million rows a minute is fast enough for me, considering I’m running on a commodity machine with an external disk subsystem that cost about 12k USD in total.
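For reference, here is a minimal sketch of the Sale table with just the columns these tests use; the exact column types are a guess on my part, though purchaseDate is stored to the second, and the real table may have more columns.

create table Sale (
customerId int,
productId int,
unit int,
purchaseDate datetime
) engine=brighthouse;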
Next I run a simple query that takes 4 seconds on InnoDB when the data is cached, and 8.8 seconds when disk IO is involved. This type of query is common in reporting systems.
select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01'
This same query takes 151 seconds to run on Infobright, and I get the message “Brighthouse: Value out of range. Query executed by MySQL engine.” In other words, the query is falling back to the generic MySQL engine rather than being handled by the column store. I have 631,530 distinct purchaseDate values across the 120,000,000 rows in the Sale table, so perhaps the high cardinality of the column is causing the problem; I’ve heard that column based databases can have this issue. I don’t see any noticeable disk IO.
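For reference, that distinct count comes from a simple query along these lines:

select count(distinct purchaseDate) from Sale;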
But if I run the exact same query again it now takes 0.0069 seconds; obviously something is being cached at some level.
I’m curious whether the various caches need to warm up before Infobright performs well. To test this I run the same query on a subset of the above date range. This SQL still takes 153 seconds, so it appears only the result set is being cached.
select sum(unit) from Sale where purchaseDate >= '2001-11-02' and purchaseDate < '2001-12-01'
If I limit the result set to one customer the performance doesn’t improve; this SQL takes 205 seconds. I’m still getting the message “Brighthouse: Value out of range. Query executed by MySQL engine.”
select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' and customerId = 50
If I leave dates out of the query and just use simple integer conditions, this SQL takes 1.25 seconds. I’m no longer getting the message “Brighthouse: Value out of range. Query executed by MySQL engine”, which explains the better performance.
select sum(unit) from Sale where customerId = 50 and productId = 22641
The same query in InnoDB takes 0.035 seconds if IO is required and 0.0079 seconds if the data is cached (I'm not using result set caching; I'm just letting the database cache the underlying data).
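For InnoDB to answer this over 120 million rows in milliseconds it needs an index on these columns; I'm assuming something like the following, mirroring the where clause:

alter table Sale add index customer_product (customerId, productId);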
Perhaps the problem is that all my data is too high in cardinality. I created 10,000,000 products and 1,000,000 customers, with the data randomly generated. Is this too much for Infobright to handle? Perhaps the data is not realistic enough; perhaps it has too high a cardinality. I do have a tendency to push to extremes to find out where things get interesting (in other words, where they break). Yet InnoDB handled this cardinality without a problem. Plus, for the SaaS application I’m working on, a count of one million customers is low if we are a success, and 10 million customers is plausible. Storing dates to second precision isn't an unreasonable requirement, although 10 million products is undeniably unrealistic for most organizations.
In spite of that, I next think about truncating the seconds off the purchaseDate column to reduce its cardinality. Then I remember the community version doesn’t allow updates (or deletes, or inserts). This is a bit too crippled for my taste, as I often want to update customer demographic information. It also concerns me because the data would have to be stored elsewhere to be updated, and then the updated data, in fact the entire table, dropped and recreated in Infobright. Having to store the data twice reduces the compression advantage of Infobright and adds to the complexity of the solution, when one of the stated advantages of Infobright over row based databases is simplicity.
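In practice that means keeping an updatable master copy of the table in another engine and periodically rebuilding the Infobright table from it. A sketch of the workflow, where the SaleMaster table name is illustrative:

-- 1. apply updates to a master copy kept in an updatable engine such as InnoDB
-- 2. export it, e.g. select * from SaleMaster into outfile '/dbdata/info/data/sale.dat' fields terminated by '\t';
-- 3. drop and rebuild the Infobright table from scratch
drop table if exists Sale;
create table Sale (
customerId int,
productId int,
unit int,
purchaseDate datetime
) engine=brighthouse;
load data infile '/dbdata/info/data/sale.dat' into table Sale fields terminated by '\t';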
And with that I will end this preliminary assessment. I may reduce the cardinality of the data and try these tests again.