Tuesday, September 23, 2008

An Infobright Review

With open source software I can install a reasonably complete product and try it with my own data. This way I get to see how it works in a realistic setting without having to rely on benchmarks and hope they are a good match for my environment. And I get to do this without having to deal with commercial software sales people.

So I was glad to hear that Infobright had gone open source, as I have been wanting to test a column based database for a while. I was even happier that it was a MySQL based engine, as I would already know many of the commands. I decided to run some of the same tests I had run when comparing InnoDB and MyISAM for reporting (http://dbscience.blogspot.com/2008/08/innodb-suitability-for-reporting.html). InnoDB performed better than MyISAM in my reporting tests, so I’m going to compare Infobright to InnoDB.

The first task is to load the data with this command.

load data infile '/dbdata/info/data/sale.dat' into table Sale fields terminated by '\t' ;

Infobright loaded 120 million rows in 8 min 2.62 seconds. Loading nearly 15 million rows a minute is fast enough for me, considering I’m running on a commodity machine with an external disk subsystem that cost about 12k USD in total.
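For reference, here is a sketch of the Infobright version of the Sale table. I'm assuming the BRIGHTHOUSE engine name and the same columns as the InnoDB schema shown in the Datetimes and Ints post below; the keys and foreign keys are omitted because Infobright relies on its own internal metadata rather than conventional indexes. The exact DDL I used isn't reproduced here.

CREATE TABLE `test`.`Sale` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL
) ENGINE=BRIGHTHOUSE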

Next I run a simple query that takes 4 seconds on InnoDB when the data is cached, and 8.8 seconds when disk I/O is involved. This type of query is common in reporting systems.

select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01'

This same query takes 151 seconds to run on Infobright, and I get the message “Brighthouse: Value out of range. Query executed by MySQL engine.” I have 631,530 distinct purchaseDate values for 120,000,000 rows in the Sale table, so perhaps it is the high cardinality of the column that is causing the problem. I’ve heard that column based databases can have this issue. I don’t see any noticeable disk I/O.
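A count along these lines shows those cardinality figures; this is just a sketch of how to get the numbers, not necessarily how I originally pulled them.

select count(distinct purchaseDate) as distinctDates, count(*) as totalRows from Sale;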

But if I run the exact same query again it now takes 0.0069 seconds; obviously data is being cached at some level.

I’m curious whether the various caches need to warm up before Infobright performs well. To test this I run the query below, which covers a subset of the above date range. This sql still takes 153 seconds, so it appears only the result set is being cached.

select sum(unit) from Sale where purchaseDate >= '2001-11-02' and purchaseDate < '2001-12-01'

If I limit the result set to one customer, the performance doesn’t improve, with this sql taking 205 seconds. I’m still getting the message “Brighthouse: Value out of range. Query executed by MySQL engine.”

select sum(unit) from Sale where purchaseDate >= '2001-11-01' and purchaseDate < '2001-12-01' and customerId = 50

If I drop the date conditions and filter only on simple int columns, this sql takes 1.25 seconds. I’m no longer getting the message “Brighthouse: Value out of range. Query executed by MySQL engine”, which explains the better performance.

select sum(unit) from Sale where customerId = 50 and productId = 22641

The same query on InnoDB takes 0.035 seconds if disk I/O is required and 0.0079 seconds if the data is cached (I'm not using result set caching, I'm just letting the database cache the underlying data).
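To be sure the MySQL query cache isn't supplying those InnoDB numbers, the cache settings can be checked, or the cache can be bypassed per query with the sql_no_cache hint; a quick sketch:

show variables like 'query_cache%';

select sql_no_cache sum(unit) from Sale where customerId = 50 and productId = 22641;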

Perhaps the problem is that all my data is too high in cardinality. I created 10,000,000 products and 1,000,000 customers, with the data randomly generated. Is this too much for Infobright to handle? Perhaps the data is not realistic enough. I do have a tendency to push things to extremes to find out where they get interesting (in other words, break), yet InnoDB handled this cardinality without a problem. And for the SaaS application I’m working on, a count of 1,000,000 customers is low if we are a success; 10 million customers is plausible. Storing dates to second precision isn't an unreasonable requirement either, although 10 million products is undeniably unrealistic for most organizations.

In spite of that, I next think about truncating the seconds from the purchaseDate column to reduce the cardinality. Then I remember the community version doesn’t allow updates (or deletes, or inserts). This is a bit too crippled for my taste, as I often want to update customer demographic information. It also concerns me that the data would have to be stored elsewhere to be updated, and then the updated data, in fact the entire table, dropped and recreated in Infobright. Having to store data twice reduces the compression advantage of Infobright and adds to the complexity of the solution, when one of the stated advantages of Infobright is simplicity compared to row based databases.
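If I did want to reduce the purchaseDate cardinality, the lack of updates means the change would have to happen outside the table: export the data with the seconds zeroed, drop and recreate the Infobright table, and reload it. A rough sketch of that flow follows; the output file name is only an example, and this isn't something I actually ran.

select orderId, customerId, productId, productBigId, unit, purchaseAmount, purchaseCost,
  date_format(purchaseDate, '%Y-%m-%d %H:%i:00') as purchaseDate
from Sale
into outfile '/dbdata/info/data/sale_minute.dat'
fields terminated by '\t';

-- after dropping and recreating the Sale table:
load data infile '/dbdata/info/data/sale_minute.dat' into table Sale fields terminated by '\t' ;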

And at that I will end this preliminary assessment. I may reduce the cardinality of the data and try the tests again.

Friday, September 19, 2008

Datetimes and Ints

In my last article I found, and others confirmed (thanks), that timestamps were much slower than datetimes. I next wondered if datetimes were slower than storing dates as ints in the UNIX timestamp format (seconds since January 1, 1970). The conclusion is that ints are a bit faster than datetimes, but not enough for me to give up all the functionality that datetimes provide.

I ran each query 100 times against tables of about 20 gig, with innodb_buffer_pool_size=2G and innodb_flush_method=O_DIRECT, to force physical I/O and make the tests reasonably realistic.

Running the datetime sql below 100 times took 21.2 minutes. The month was randomly varied for each of the 100 executions to avoid caching issues.

select sum(unit) from Sale where purchaseDate >= '2001-02-01' and purchaseDate < '2001-03-01'
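For reference, the randomized executions can be driven with something as simple as a stored procedure like the sketch below; this isn't the exact harness I used, and the elapsed time is measured around the overall call.

delimiter //
create procedure benchDatetime()
begin
  declare i int default 0;
  while i < 100 do
    -- pick a random month in 2001 so each execution hits different data
    set @startDate = date_add('2001-01-01', interval floor(rand() * 12) month);
    set @endDate = date_add(@startDate, interval 1 month);
    select sum(unit) from Sale where purchaseDate >= @startDate and purchaseDate < @endDate;
    set i = i + 1;
  end while;
end //
delimiter ;

call benchDatetime();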

Running the equivalent int sql 100 times took 20.7 minutes, a difference of about 2.4%.

select sum(unit) from SaleTI where purchaseDate >= 981014400 and purchaseDate < 983433600
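Those int constants are just the Unix timestamp equivalents of the same month boundaries, and can be derived with unix_timestamp(); the exact values depend on the server's time zone.

select unix_timestamp('2001-02-01 00:00:00') as startSecond, unix_timestamp('2001-03-01 00:00:00') as endSecond;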

So even though a datetime takes 8 bytes and an int 4 bytes, the performance is roughly comparable. Since I find datetimes much easier to work with than ints, I'll be using datetimes.
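As an example of that functionality difference, a monthly rollup against the int column needs an explicit conversion back to a temporal type, while the datetime column can be used directly. A sketch:

-- int column: convert before applying date functions
select date_format(from_unixtime(purchaseDate), '%Y-%m') as saleMonth, sum(unit)
from SaleTI
group by saleMonth;

-- datetime column: no conversion needed
select date_format(purchaseDate, '%Y-%m') as saleMonth, sum(unit)
from Sale
group by saleMonth;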

For the curious, here is the datetime version of the Sale table:

 

CREATE TABLE `test`.`Sale` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY (`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_sale_order` (`orderId`),
  KEY `pf_sale_product` (`productId`),
  KEY `pf_sale_customer` (`customerId`),
  CONSTRAINT `pf_sale_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `pf_sale_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)
) ENGINE=InnoDB

 

And here is the int version, the SaleTI table.

 

CREATE TABLE `test`.`SaleTI` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` int(11) NOT NULL,
  PRIMARY KEY (`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_saleTI_order` (`orderId`),
  KEY `idx_saleTI_product` (`productId`),
  KEY `idx_saleTI_customer` (`customerId`),
  CONSTRAINT `fk_saleTI_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `fk_saleTI_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)
) ENGINE=InnoDB

Update:  While it doesn't represent the environments I work in, I just did a run with the same sql and innodb_buffer_pool_size=28G, so the tables were now able to fit into memory.  In this case datetimes were about 1% faster than ints, which I can't explain; I suspect the difference is just random fluctuation.  The conclusion remains the same: datetime performance is about the same as Unix ints, and datetimes have far more functionality.