Friday, September 19, 2008

Datetimes and Ints

In my last article I found, and others confirmed (thanks), that timestamps were much slower than datetimes. I next wondered if datetimes were slower than storing dates as ints in the UNIX timestamp format (seconds since January 1, 1970). The conclusion is that ints are a bit faster than datetimes, but not by enough to make me give up all the functionality that datetimes provide.

I ran the SQL below 100 times against tables of about 20GB, with innodb_buffer_pool_size=2G and innodb_flush_method=O_DIRECT, to force physical IO and make the tests reasonably realistic.

Running this datetime SQL 100 times took 21.2 minutes. The month was varied randomly for each of the 100 executions to avoid caching effects.

select sum(unit) from Sale where purchaseDate >= '2001-02-01' and purchaseDate < '2001-03-01'
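The post doesn't show the test harness, so as an illustrative sketch only, the randomized month boundaries could have been picked with something like the following and then substituted into the query as literals:

-- Illustrative only: pick a random month start in 2001 and the start of the following month.
set @monthStart := date_add('2001-01-01', interval floor(rand() * 12) month);
set @monthEnd := date_add(@monthStart, interval 1 month);
-- These values would then be plugged into the query above as the range boundaries.
select @monthStart, @monthEnd;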

Running the equivalent int SQL 100 times took 20.7 minutes, a difference of about 2.4%.

select sum(unit) from SaleTI where purchaseDate >= 981014400 and purchaseDate < 983433600
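The int literals are presumably just the same month boundaries run through UNIX_TIMESTAMP(), which evaluates in the server's time zone; FROM_UNIXTIME() goes the other way and is handy for sanity-checking what a stored int actually means:

-- Derive the int boundaries from the datetime ones (the result depends on the server time zone):
select unix_timestamp('2001-02-01'), unix_timestamp('2001-03-01');
-- Check a stored int as a readable date:
select from_unixtime(981014400);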

So even though a datetime takes 8 bytes and an int 4 bytes, the performance is roughly comparable. Since I find datetimes much easier to work with than ints, I'll be using datetimes.
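As a small illustration of that functionality difference (not part of the benchmark), a monthly rollup is direct against the datetime column but needs a conversion on every row against the int column:

-- Direct against the datetime column:
select month(purchaseDate), sum(unit) from Sale group by month(purchaseDate);
-- The int column has to be converted first:
select month(from_unixtime(purchaseDate)), sum(unit) from SaleTI group by month(from_unixtime(purchaseDate));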

For the curious, here is the datetime version of the Sale table:

 

CREATE TABLE `test`.`Sale` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` datetime NOT NULL,
  PRIMARY KEY (`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_sale_order` (`orderId`),
  KEY `pf_sale_product` (`productId`),
  KEY `pf_sale_customer` (`customerId`),
  CONSTRAINT `pf_sale_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `pf_sale_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)
) ENGINE=InnoDB

 

And here is the int version, the SaleTI table:

 

CREATE TABLE `test`.`SaleTI` (
  `orderId` int(11) NOT NULL,
  `customerId` int(11) NOT NULL,
  `productId` int(11) NOT NULL,
  `productBigId` int(11) NOT NULL,
  `unit` int(11) NOT NULL,
  `purchaseAmount` decimal(16,2) NOT NULL,
  `purchaseCost` decimal(16,2) NOT NULL,
  `purchaseDate` int(11) NOT NULL,
  PRIMARY KEY (`purchaseDate`,`orderId`),
  UNIQUE KEY `idx_saleTI_order` (`orderId`),
  KEY `idx_saleTI_product` (`productId`),
  KEY `idx_saleTI_customer` (`customerId`),
  CONSTRAINT `fk_saleTI_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),
  CONSTRAINT `fk_saleTI_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)
) ENGINE=InnoDB
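The post doesn't show how SaleTI was populated; a straightforward way would have been to copy it from Sale, converting the dates with UNIX_TIMESTAMP() (a sketch, not necessarily how it was actually loaded):

-- Hypothetical load of the int copy from the datetime table;
-- unix_timestamp() converts using the server time zone.
insert into SaleTI (orderId, customerId, productId, productBigId,
                    unit, purchaseAmount, purchaseCost, purchaseDate)
select orderId, customerId, productId, productBigId,
       unit, purchaseAmount, purchaseCost, unix_timestamp(purchaseDate)
from Sale;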

Update: While it doesn't represent the environments I work in, I just did a run of the same SQL with innodb_buffer_pool_size=28G, so the tables were now able to fit in memory. In this case datetimes were about 1% faster than ints, which I can't explain; I suspect the difference is just random fluctuation. The conclusion remains the same: datetime performance is about the same as Unix ints, and datetimes have far more functionality.

2 comments:

burtonator said...

You're really only testing your disk subsystem this way...

No one runs DBs off disk anymore anyway :)

All our data fits in memory.

dbscience said...

When I was testing timestamps vs. datetimes I found timestamps took about 33 seconds compared to 21 seconds for datetimes when the data didn't fit into memory. I found it almost unbelievable that timestamps were slow enough to impact what should have been an IO-bound process. I should have made this clearer.

But if your data does fit into memory, then I agree, this test isn't realistic. Are there a lot of people out there for whom this is true? How big are your databases?

As for me, I typically deal with 100s of gigs or terabytes of data. This means the bottleneck in the systems I deal with is usually IO. Not always; with MySQL, concurrency is often a problem as well, but when I profile the worst performing queries the bottleneck is rarely CPU.

So, to tune these systems, if the performance difference between two approaches is insignificant (such as 2%) compared to the IO component of that query, I generally don't worry about it. This is basic bottleneck performance tuning, where one goes after the largest bottlenecks. Stated differently, I worry about how to speed up the other 98% of the query.

All of this means that, for me, a good test includes the impact of IO.