Wednesday, December 5, 2007

Denormalized data is a bit faster than normalized data (but by enough?)

The last article showed the performance impact of joins. This one demonstrates cases where a denormalized data model is a bit faster than a normalized one, as will the third article, which uses larger data volumes. The fourth article, the most interesting one, will show a case where a denormalized data model can be 50 times faster than a normalized data model.

Here are the tables involved in the SQL. The normalized ProductSmall table has 100 million rows and is about 0.67 gig.

 

create table ProductSmall (
    productSmallId int(11) not null,
    productSmallName varchar(32) not null,
    productGroupId int(11) not null,
    primary key (productSmallId),
    key idx_ProductSmall_productGroup (productGroupId),
    constraint con_ProductSmall_productGroup
        foreign key (productGroupId) references ProductGroup (productGroupId)
) ENGINE=InnoDB;

 

This 101-row table holds the product group names and is joined to by ProductSmall. (Because of the foreign key constraint above, ProductGroup has to be created before ProductSmall.)

 

create table ProductGroup (
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key (productGroupId)
) ENGINE=InnoDB;

 

This is the 100-million-row denormalized combination of the previous two tables. At 1.2 gig, it is about twice the size of the normalized version.

  

create table Product (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key (productId)
) ENGINE=InnoDB;
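The article doesn't show how Product was loaded, but a minimal sketch of populating it from the normalized pair might look like the following. The insert statement is an assumption, not the actual load script; show table status is one way to check the size figures quoted above, via its Data_length column.

-- Hypothetical load of the denormalized table from the normalized pair.
insert into Product (productId, productName, productGroupId, productGroupName)
select ps.productSmallId, ps.productSmallName, pg.productGroupId, pg.productGroupName
  from ProductSmall ps
  join ProductGroup pg on ps.productGroupId = pg.productGroupId;

-- Data_length approximates the on-disk size of the table.
show table status like 'Product';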

 

First, how long does it take to do a sequential scan of 652,637 rows of the 120-million-row Sale table?  At 0.3 seconds it is fast, as the data is cached in memory.  This is the same Sale table from the previous article.  Purchase date is the leading edge of the primary key, which explains the range scan on the primary key in the plan below.  (The 1,882,540 rows in the plan is only the optimizer's estimate; the query actually reads 652,637 rows.)
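The Sale table's definition isn't repeated in this article. Based on the description (purchaseDate leading the primary key with key_len 8, which suggests a datetime, plus the productId and unit columns the queries reference), a plausible sketch follows; saleId and the exact column types are assumptions, not the author's actual definition.

-- Plausible reconstruction of Sale from the previous article;
-- saleId and the column types are assumptions.
create table Sale (
    purchaseDate datetime not null,
    saleId int(11) not null,
    productId int(11) not null,
    unit int(11) not null,
    primary key (purchaseDate, saleId)
) ENGINE=InnoDB;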

 

select sum(unit) from Sale
where purchaseDate >= '2001-12-30';
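As an aside, the plan below (and each plan that follows) would be produced by prefixing the query with EXPLAIN:

EXPLAIN
select sum(unit) from Sale
where purchaseDate >= '2001-12-30';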

 

+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
|  1 | SIMPLE      | Sale  | range | PRIMARY       | PRIMARY | 8       | null | 1882540 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
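One caveat for anyone reproducing these timings: if the MySQL query cache is enabled, a repeated identical query is answered from the cache almost instantly. Adding SQL_NO_CACHE forces real execution on every run:

-- Bypass the query cache so each run measures actual execution time.
select SQL_NO_CACHE sum(unit) from Sale
where purchaseDate >= '2001-12-30';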

 

Adding a join to a small table (one that also fits into memory) is a bit slower than the previous simple range scan, but the SQL below runs in about 1.1 seconds.  That is still reasonably fast, less than four times slower than a sequential scan of the data.  This is great performance, as btree lookups can be much more expensive than sequential table scans, and it is a sign that the InnoDB adaptive hash index is working well for reads.  Running "show engine innodb status" shows "18617.95 hash searches/s, 5.00 non-hash searches/s".  It also shows "Buffer pool hit rate 1000 / 1000", indicating that all the data does fit into memory.  I won't show it, but the rest of the queries in this article have similar numbers.
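For reference, the figures quoted above come straight out of the InnoDB status output; the hash search rates appear under the INSERT BUFFER AND ADAPTIVE HASH INDEX section and the hit rate under BUFFER POOL AND MEMORY:

show engine innodb status\G
-- Look for lines like:
--   18617.95 hash searches/s, 5.00 non-hash searches/s
--   Buffer pool hit rate 1000 / 1000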

 

select sum(unit) from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
where purchaseDate >= '2001-12-30';

 

+----+-------------+-------+--------+---------+---------+-------------+---------+-------------+
| id | select_type | table | type   | key     | key_len | ref         | rows    | Extra       |
+----+-------------+-------+--------+---------+---------+-------------+---------+-------------+
|  1 | SIMPLE      | s     | range  | PRIMARY | 8       | null        | 1882540 | Using where |
|  1 | SIMPLE      | p     | eq_ref | PRIMARY | 4       | s.productId | 1       | Using index |
+----+-------------+-------+--------+---------+---------+-------------+---------+-------------+

 

Using the normalized data model, where the product group names are in a separate table, the extra join slows down performance, but only by a bit.  This SQL runs in about 1.4 seconds.

 

select sum(unit) from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-30';

 

The plan looks similar to the previous one, with the addition of one more join. 

 

+----+-------------+-------+--------+---------+---------+------------------+---------+-------------+
| id | select_type | table | type   | key     | key_len | ref              | rows    | Extra       |
+----+-------------+-------+--------+---------+---------+------------------+---------+-------------+
|  1 | SIMPLE      | s     | range  | PRIMARY | 8       | null             | 1882540 | Using where |
|  1 | SIMPLE      | p     | eq_ref | PRIMARY | 4       | s.productId      | 1       | Using index |
|  1 | SIMPLE      | pg    | eq_ref | PRIMARY | 4       | p.productGroupId | 1       | Using index |
+----+-------------+-------+--------+---------+---------+------------------+---------+-------------+

 

Ok, the denormalized Product table is larger than the normalized ProductSmall table, but it carries the productGroupName from the ProductGroup table, removing one of the joins and making the query below a bit faster at 1.1 seconds.  Nothing is substantially different here, but when everything fits into memory, as this data does, the denormalized data model performs better.

 

select sum(unit) from Sale s
  join Product p
    on s.productId = p.productId
where purchaseDate >= '2001-12-30';

 

+----+-------------+-------+--------+---------+---------+-------------+---------+-------------+
| id | select_type | table | type   | key     | key_len | ref         | rows    | Extra       |
+----+-------------+-------+--------+---------+---------+-------------+---------+-------------+
|  1 | SIMPLE      | s     | range  | PRIMARY | 8       | null        | 1882540 | Using where |
|  1 | SIMPLE      | p     | eq_ref | PRIMARY | 4       | s.productId | 1       | Using index |
+----+-------------+-------+--------+---------+---------+-------------+---------+-------------+

 

Now to run the same series of tests, but summarizing by productGroupName, just to verify the pattern above holds.  In summary, it does.

Now that we know the simple cost of a join: grouping by a value on the joined-to table does slow down the SQL, but at 2.2 seconds this is still reasonably fast.

 

select productGroupName, sum(unit)
  from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-30'
group by productGroupName;

 

+----+-------------+-------+--------+---------+---------+------------------+---------+--------------------+
| id | select_type | table | type   | key     | key_len | ref              | rows    | Extra              |
+----+-------------+-------+--------+---------+---------+------------------+---------+--------------------+
|  1 | SIMPLE      | s     | range  | PRIMARY | 8       | null             | 1882540 | Using filesort ... |
|  1 | SIMPLE      | p     | eq_ref | PRIMARY | 4       | s.productId      | 1       | Using index        |
|  1 | SIMPLE      | pg    | eq_ref | PRIMARY | 4       | p.productGroupId | 1       | Using index        |
+----+-------------+-------+--------+---------+---------+------------------+---------+--------------------+

 

The denormalized data model is faster again, but not by much, at 1.8 seconds.

 

select productGroupName, sum(unit)
  from Sale s
  join Product p
    on s.productId = p.productId
where purchaseDate >= '2001-12-30'
group by productGroupName;

 

+----+-------------+-------+--------+---------+---------+-------------+---------+--------------------+
| id | select_type | table | type   | key     | key_len | ref         | rows    | Extra              |
+----+-------------+-------+--------+---------+---------+-------------+---------+--------------------+
|  1 | SIMPLE      | s     | range  | PRIMARY | 8       | null        | 1882540 | Using filesort ... |
|  1 | SIMPLE      | p     | eq_ref | PRIMARY | 4       | s.productId | 1       | Using index        |
+----+-------------+-------+--------+---------+---------+-------------+---------+--------------------+

 

Ok, what happens when the same SQL is run against 10,182,566 rows?  If cached, the simple range scan query takes about 4 seconds.

 

select sum(unit)
  from Sale s
where purchaseDate >= '2001-12-01';

 

+----+-------------+-------+-------+---------+---------+------+----------+-------------+
| id | select_type | table | type  | key     | key_len | ref  | rows     | Extra       |
+----+-------------+-------+-------+---------+---------+------+----------+-------------+
|  1 | SIMPLE      | s     | range | PRIMARY | 8       | null | 30931362 | Using where |
+----+-------------+-------+-------+---------+---------+------+----------+-------------+

 

Now, what about a join to the denormalized Product table?  This takes 25 seconds, which shows the cost of a table join and a group by over a simple sequential scan.  But as all the data fits into memory, the execution time is still reasonably fast.

 

select productGroupName, sum(unit)
  from Sale s
  join Product p
    on s.productId = p.productId
where purchaseDate >= '2001-12-01'
group by productGroupName;

 

+----+-------------+-------+--------+---------+---------+-------------+----------+--------------------+
| id | select_type | table | type   | key     | key_len | ref         | rows     | Extra              |
+----+-------------+-------+--------+---------+---------+-------------+----------+--------------------+
|  1 | SIMPLE      | s     | range  | PRIMARY | 8       | null        | 30931362 | Using filesort ... |
|  1 | SIMPLE      | p     | eq_ref | PRIMARY | 4       | s.productId | 1        |                    |
+----+-------------+-------+--------+---------+---------+-------------+----------+--------------------+

 

Once again, the normalized version below is slower than the denormalized version above, running in 33 seconds.

 

select productGroupName, sum(unit)
  from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-01'
group by productGroupName;

 

+----+-------------+-------+--------+---------+---------+------------------+----------+--------------------+
| id | select_type | table | type   | key     | key_len | ref              | rows     | Extra              |
+----+-------------+-------+--------+---------+---------+------------------+----------+--------------------+
|  1 | SIMPLE      | s     | range  | PRIMARY | 8       | null             | 30931362 | Using filesort ... |
|  1 | SIMPLE      | p     | eq_ref | PRIMARY | 4       | s.productId      | 1        |                    |
|  1 | SIMPLE      | pg    | eq_ref | PRIMARY | 4       | p.productGroupId | 1        |                    |
+----+-------------+-------+--------+---------+---------+------------------+----------+--------------------+

 

So far, the case for a denormalized data model, such as a star schema, isn't that strong.  Yes, the performance is better, but only by about 20-30 percent (1.8 versus 2.2 seconds on the small date range, 25 versus 33 seconds on the large one).

But this series of articles isn't finished yet.  In the next blog I'll show what happens when both the normalized and the denormalized data models struggle to fit into memory.
