Wednesday, December 12, 2007

Denormalized data, not so fast?

The last article demonstrated that a denormalized data model was a bit faster than a normalized one when all the data fit into memory, but only by 20-30 percent or so, at least for the experiments I ran.  This article deals with larger data volumes: it shows a case where normalized data is faster than denormalized data, and then details another case where a denormalized data model is faster.

First, this is what the tables in question look like.  All the product tables have 10 million rows.  The denormalized ProductAll table is 5.2 gig, and as innodb_buffer_pool_size is set to 4 gig, this table won't fit into memory.  The filler column represents other product attributes without having to specify them in detail.

 

create table ProductAll (
    productAllId int(11) not null,
    productAllName varchar(32) not null,
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    filler varchar(4000) not null,
    productBrandId int(11) not null,
    productBrandName varchar(32) not null,
    primary key (productAllId)
) engine=InnoDB
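
The size figures quoted here are the on-disk footprints of the tables.  If you want to verify them yourself, something like the below works; the schema name test is an assumption, so substitute whatever database the tables actually live in.

-- innodb_buffer_pool_size is set to 4 gig on this box
show variables like 'innodb_buffer_pool_size';

-- approximate on-disk size (data plus indexes) of ProductAll, in gig;
-- the schema name 'test' is an assumption
select table_name,
       round((data_length + index_length) / 1024 / 1024 / 1024, 1) as size_gig
  from information_schema.tables
 where table_schema = 'test'
   and table_name = 'ProductAll';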

 

The normalized ProductLess table is 4.9 gig, a bit smaller than the ProductAll table, but it still has the filler column.


create table ProductLess (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productBrandId int(11) not null,
    filler varchar(4000) not null,
    primary key (productId)
) engine=InnoDB

 

The 4.6 gig ProductBig table will be used to explore how joining two large tables, both larger than the memory allocated to MySQL, performs.


create table ProductBig (
    productBigId int(11) not null,
    filler varchar(4000) default null,
    productBrandId int(11) not null,
    primary key (productBigId),
    constraint on_ProductBig_productBigId foreign key (productBigId) references Product (productId)
) engine=InnoDB

The 9.2 gig Product2Filler table is a combination of the ProductBig and ProductLess tables.  Where the filler columns of both the ProductBig and the ProductLess tables are 384 characters, the Product2Filler filler column is twice that, at 768 characters.


create table Product2Filler (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productBrandId int(11) not null,
    filler varchar(4000) not null,
    primary key  (productId)
) engine=InnoDB;
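
The load script for Product2Filler isn't shown here, but it could have been built from the two source tables with something along these lines (a sketch only; concatenating the two 384 character filler columns is what gives the 768 character combined filler):

-- sketch: populate Product2Filler by joining its two source tables;
-- productName, productGroupId and productBrandId come from ProductLess,
-- and the two 384 character filler columns are concatenated into one
insert into Product2Filler
    (productId, productName, productGroupId, productBrandId, filler)
select p.productId,
       p.productName,
       p.productGroupId,
       p.productBrandId,
       concat(p.filler, pb.filler)
  from ProductLess p
  join ProductBig pb
    on pb.productBigId = p.productId;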
 

 

The Sale table has 120 million rows and is about 30 gig.  Even the single month I query against (out of 12 months, so 1/12 of the data), combined with the product tables, doesn't fit into memory.

 

create table Sale (
  orderId int(11) not null,
  customerId int(11) not null,
  productId int(11) not null,
  productBigId int(11) not null,
  unit int(11) not null,
  purchaseAmount decimal(16,2) not null,
  purchaseCost decimal(16,2) not null,
  purchaseDate datetime not null,
  primary key  (purchaseDate,orderId),
  unique key idx_sale_order (orderId),
  key pf_sale_product (productId),
  key pf_sale_customer (customerId),
  constraint `pf_sale_customer` foreign key (customerId) references Customer (customerId),
  constraint `pf_sale_product` foreign key (productId) references Product (productId)
) engine=InnoDB

The tiny ProductGroup table has only 101 rows.


create table ProductGroup (
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key  (productGroupId)
) engine=InnoDB
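
For reference, the denormalized ProductAll table is essentially ProductLess joined to its lookup tables.  A sketch of how it could be populated is below; note that the ProductBrand lookup table is an assumption, as its definition isn't shown in this post.

-- sketch: build the denormalized ProductAll rows by joining ProductLess
-- to its lookup tables; ProductBrand (productBrandId, productBrandName)
-- is assumed, as its create statement isn't shown here
insert into ProductAll
    (productAllId, productAllName, productGroupId, productGroupName,
     filler, productBrandId, productBrandName)
select p.productId,
       p.productName,
       p.productGroupId,
       pg.productGroupName,
       p.filler,
       p.productBrandId,
       pb.productBrandName
  from ProductLess p
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
  join ProductBrand pb
    on p.productBrandId = pb.productBrandId;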

 

On to the actual sql.  The below denormalized query takes about 4380 seconds.  As all the data doesn't fit into memory, MySQL needs to do some serious reading, which shows up in the iostat numbers; all the queries below are read bound.  The buffer pool hit rate of 950 / 1000 (or so - I averaged a few samples) confirms the heavy io.  This is a huge performance difference from the December 5 article, where the equivalent sql, run against product tables that fit into memory and don't have the filler column, ran in 25 seconds.

 

select productGroupName, sum(unit)
  from Sale s
  join ProductAll p
    on s.productId = p.productAllId
where purchaseDate >= '2001-12-01'
group by productGroupName;

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | null              |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
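
For those wondering where the buffer pool hit rate figures come from, InnoDB reports them itself; presumably the numbers quoted above and below were sampled from a line like "Buffer pool hit rate 950 / 1000" in the BUFFER POOL AND MEMORY section of the status output.

-- the hit rate line appears in the BUFFER POOL AND MEMORY section
show engine innodb status\G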

 

At 3790 seconds, the normalized sql runs a bit faster than the above denormalized version.  Like the previous query it is disk bound, but it has a better buffer pool hit rate of 972 / 1000 (or so).  As the data is smaller, more of it fits into memory, which would explain the higher buffer pool hit rate and the better performance.  I'm not convinced this is the whole story, though, as this version has to do extra reads to get the small ProductGroup data, which is likely to be in memory all the time even with MySQL's least recently used caching - does anyone have insight into this?  One way to check is sketched after the query plan below.

 

select productGroupName, sum(unit)
  from Sale s
  join ProductLess p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-01'
group by productGroupName

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | null              |
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        | null              |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
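
One way to dig into the question above, whether the extra ProductGroup lookups actually translate into physical reads, would be to snapshot InnoDB's read counters before and after each query and compare the deltas.  A sketch:

-- sample these before and after a query and diff the values;
-- Innodb_buffer_pool_read_requests counts logical reads, while
-- Innodb_buffer_pool_reads counts reads that missed the buffer pool
-- and had to go to disk
show global status like 'Innodb_buffer_pool_read_requests';
show global status like 'Innodb_buffer_pool_reads';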

 

From this one might expect that a normalized data model is faster in all cases when dealing with large data volumes.  But what about joining a large table not only to a small in-memory table, but also to another large table?  The below join of the ProductLess table to the ProductBig table takes 20,429 seconds.

select productGroupName, sum(unit)
  from Sale s
  join ProductLess p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
  join ProductBig pb
    on p.productId = pb.productBigId
where purchaseDate >= '2001-12-01'
group by productGroupName

 

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
|  1 | SIMPLE      | pb    | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | Using index       |
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | pb.productBigId  | 1        | Using where       |
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        |                   |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

When the ProductLess and ProductBig tables are combined into the Product2Filler table, similar SQL takes 9219 seconds, about half as long.  So in at least some cases, when joining two large tables that can't fully fit into memory, a combined data model is faster.  I know this isn't really denormalization, as both tables are keyed by productId, but I believe it has the same performance characteristics as a denormalized data model.  At the least it shows what happens when two large tables are combined into one even larger one.

 

select productGroupName, sum(unit)
  from Sale s
  join Product2Filler p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-01'
group by productGroupName

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        |                   |
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        |                   |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

As a side note, the below query, run against data that is not cached (it uses different dates than the other queries), takes 340 seconds.  It looks like the cost of a sequential table (range) scan is insignificant compared to the random index reads into another table that can't fit into memory.  Stated differently, joins are at times costly.

 

select count(*) from Sale
where purchaseDate >= '2001-03-01'
   and purchaseDate < '2001-04-01'

 

What has been shown is that normalized data is sometimes faster than denormalized data, and that sometimes combined data is faster.  So far the case for the denormalized data model isn't that persuasive, both when all the data fits into memory and when it doesn't and many reads are required.  Why, then, do so many people use star schemas for reporting systems?  In the next article there will be a case where a denormalized data model is a bit faster than what has been shown so far.
