Friday, December 14, 2007

A setting star

In my second blog entry I stated I would demonstrate a case where a denormalized Star schema is 50 times faster than a normalized data model.  Well, while writing this I looked at the tables again and realized that I didn't have optimal indexes and my statistics weren't optimal.  Yeah, I feel a bit foolish.  Once I fixed those issues there wasn't a performance difference.  I'll go into this more in the post-test discussion.
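For what it's worth, "getting the statistics right" mostly came down to refreshing the InnoDB index statistics after the tables were loaded.  A minimal sketch of what I mean, using the tables below, is simply:

analyze table ProductNormal, ProductBrand, ProductGroup, Sale;

This is only a sketch; depending on how the data was loaded it may need to be rerun, and the optimizer can still sample the indexes badly.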

First, these are the tables that will be used to test the performance of the normalized and denormalized data models.  The ProductNormal table represents the normalized model; it has 10 million rows and is about 0.8 gig. 

 

create table ProductNormal (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productBrandId int(11) not null,
    primary key (productId),
    key idx_ProductNormal_GroupIdBrandId (productGroupId, productBrandId),
    key idx_ProductNormal_BrandGroup (productBrandId, productGroupId),
    constraint fk_ProductNormal_productBrandId
        foreign key (productBrandId) references ProductBrand (productBrandId),
    constraint fk_ProductNormal_productGroupId
        foreign key (productGroupId) references ProductGroup (productGroupId)
) engine=InnoDB

 

The small ProductBrand table has 11 rows.  

 

create table ProductBrand (
    productBrandId int(11) not null,
    productBrandName varchar(32) not null,
    primary key (productBrandId),
    unique key idx_ProductBrand_Name (productBrandName)
) engine=InnoDB

 

The ProductGroup table has 101 rows.  

 

create table ProductGroup (
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key (productGroupId),
    unique key idx_ProductGroup_Name (productGroupName)
) engine=InnoDB

 

The 10 million row ProductDenorm combines the previous three tables into a 1.27 gig table. 


create table ProductDenorm (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    productBrandId int(11) not null,
    productBrandName varchar(32) not null,
    primary key (productId),
    key idx_ProductDenorm_GroupNameBrandName (productGroupName, productBrandName),
    key idx_ProductDenorm_BrandNameGroupName (productBrandName, productGroupName)
) engine=InnoDB

 

The 120 million row Sale table is about 30 gig. 


create table Sale (
  orderId int(11) not null,
  customerId int(11) not null,
  productId int(11) not null,
  productBigId int(11) not null,
  unit int(11) not null,
  purchaseAmount decimal(16,2) not null,
  purchaseCost decimal(16,2) not null,
  purchaseDate datetime not null,
  primary key  (purchaseDate,orderId),
  unique key idx_sale_order (orderId),
  key idx_sale_product (productId),
  key idx_sale_customer (customerId),
  constraint pf_sale_customer foreign key (customerId) references Customer (customerId),
  constraint pf_sale_product foreign key  (productId) references Product (productId)
) engine=InnoDB

 

First, the below denormalized version of the query takes about 0.39 seconds, averaged over 100 executions.  The rest of the queries are also averaged over 100 runs.  What is interesting is that the driving table of the plan is not the Sale table but the ProductDenorm table.  There are two basic query plans the optimizer can choose from.  The first is to drive off of the Sale fact table, which was covered in previous blogs.  The second is to drive off of the dimension (descriptive) tables, like the ProductGroup table.  After this article, both cases will have been tested (a sketch of forcing the fact-driven plan follows the first plan below). 

 

select sum(unit) from Sale s
  join ProductDenorm p
    on s.productId = p.productId
where p.productBrandName = 'ProductBrand 1'
   and p.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-30'

 

+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+
| id | select_type | table | type     | key                 | key_len | ref         | rows  | extra
+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+
|  1 | SIMPLE      | p     | ref      | idx_ProductDenorm_  | 70      | const,const | 13608 | Using where
|    |             |       |          | BrandNameGroupName  |         |             |       | 
+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+
|  1 | SIMPLE      | s     | ref      | pf_sale_product     | 4       | p.productId | 10    | Using where
+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+
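The optimizer chose to drive off the ProductDenorm dimension here.  As a sketch of how the other strategy could be forced for comparison (this wasn't one of the timed runs, and I'm not showing its plan), straight_join makes MySQL join the tables in the order they are written, so the following variant would drive off the Sale fact table:

select straight_join sum(unit) from Sale s
  join ProductDenorm p
    on s.productId = p.productId
where p.productBrandName = 'ProductBrand 1'
   and p.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-30'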

 

The normalized version also takes about 0.39 seconds.  The plan is similar to the previous one, but the driving table is the normalized ProductGroup and not the denormalized ProductDenorm.   In both cases the last table in the plan is the Sale table. 

 

select sum(unit)
  from Sale s
  join ProductNormal p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
  join ProductBrand pb
    on p.productBrandId = pb.productBrandId 
where pb.productBrandName = 'ProductBrand 1'
   and pg.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-30';

 

+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
| id | select_type | table | type | key                          | key_len | ref                | rows | extra
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pg    | ref  | idx_ProductGroup_Name        | 34      | const              | 1    | where,index 
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pb    | ref  | idx_ProductBrand_Name        | 34      | const              | 1    | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | p     | ref  | idx_ProductNormal_BrandGroup | 8       | pb.productBrandId, | 3258 | where,index 
|    |             |       |      |                              |         | pg.productGroupId  |      |
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | s     | ref  | pf_sale_product              | 4       | p.productId        | 10   | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+

 

How about the same basic query, but for a month of data?  It also takes about 0.39 seconds. 

 

select sum(unit) from Sale s
  join ProductDenorm p
    on s.productId = p.productId
where p.productBrandName = 'ProductBrand 1'
   and p.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-01'

 

+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+
| id | select_type | table | type     | key                | key_len | ref         | rows     | extra
+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+
|  1 | SIMPLE      | p     | ref      | idx_ProductDenorm_ | 70      | const,const | 13608    | where
|    |             |       |          | BrandNameGroupName |         |             |          |
+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+
|  1 | SIMPLE      | s     | ref      | pf_sale_product    | 4       | p.productId | 10       | where
+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+

 

The normalized version also takes about 0.39 seconds against a month of data.  So, at least for these simple queries, the normalized data model is just as fast as the denormalized one. 

 

select sum(unit) from Sale s
  join ProductNormal p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
  join ProductBrand pb
    on p.productBrandId = pb.productBrandId
where pb.productBrandName = 'ProductBrand 1'
   and pg.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-01'

 

+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
| id | select_type | table | type | key                          | key_len | ref                | rows | extra
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pg    | ref  | idx_ProductGroup_Name        | 34      | const              | 1    | where,index 
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pb    | ref  | idx_ProductBrand_Name        | 34      | const              | 1    | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | p     | ref  | idx_ProductNormal_BrandGroup | 8       | pb.productBrandId, | 3258 | where,index
|    |             |       |      |                              |         | pg.productGroupId  |      |
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | s     | ref  | pf_sale_product              | 4       | p.productId        | 10   | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+

 

Over the course of four blog entries the tests have shown that joins are expensive, but in spite of that, denormalized data isn't that much faster than a normalized data model.  The best case was when a denormalized table was a bit more than twice as fast as a few normalized tables.  Then there was the case where a normalized data model was faster than the denormalized one because the normalized one used memory more efficiently.  And finally, when the optimizer drives off of the dimensional (descriptive) data and not the fact table there isn't a noticeable difference in performance.  I know there are exceptions where normalized data model performance is worse than what I've shown, but they do appear to be exactly that: exceptions. 

I could run more complex tests, and I probably will over time, but there are enough of them to start coming to tentative conclusions.  In the transactional world the general rule is to normalize and only denormalize when a specific query or set of queries warrants it.  Given that denormalization can degrade performance at times, this is wise.  The time spent denormalizing the data model would be better spent adding better indexes and getting the statistics right. 
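As a hedged example of what I mean by better indexes: a composite index over the dimension foreign keys, like the idx_ProductNormal_BrandGroup key already shown in the DDL above, is the sort of thing that matters far more than denormalizing.  Shown here as an alter only for illustration:

alter table ProductNormal
    add key idx_ProductNormal_BrandGroup (productBrandId, productGroupId);

With that index the optimizer can resolve both dimension filters before it ever touches the fact table.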

For reporting systems I would, hesitantly, recommend the same practice - normalize first and then denormalize when required.  I believe the Star schema made sense when it was first proposed a decade ago, when database optimizers were weak.  A denormalized data model has fewer join permutations to consider, and therefore an optimizer was better able to find a fast query plan when the data model was simple.  

Over the past decade database optimizers have improved and are better able to handle the complexity of a normalized data model.  When I was working on this article my normalized data model was initially much slower than my denormalized version.  One of the main reasons it was slow was that my statistics weren't that good.  With better stats, MySQL, which isn't known to have the best optimizer, was at times able to find a plan for the normalized model that was just as fast as the denormalized version.  If this is true for the youthful MySQL it should be true for more established databases like SQL Server and Oracle, which are better at creating stats. 

It takes considerable effort to develop and populate a Star schema.  I now believe that, in many cases at least, a normalized data model is best for reporting, with denormalization applied on an exceptional basis where it makes sense.  Develop the physical data model for reporting systems just like you would a transactional physical data model.  If denormalization makes an important query much faster, then denormalize. 

This may be reflective of the start-up environments where I work, where we have reasonably simple data models of a hundred or so tables, but some of those tables have up to a billion rows.  Based on this experience and these tests, though, the time spent denormalizing the data model would be better spent creating better indexes and finding and populating summary tables.  This is where the huge, orders of magnitude improvements are to be found.  Yes, a star schema would improve the performance of that important query, but would a summary table be even better? 
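To make the summary table point concrete, here is a rough sketch of the kind of thing I have in mind; the table name and the exact grain (day by product group) are made up for illustration:

create table SaleDailyByGroup (
    purchaseDay date not null,
    productGroupId int(11) not null,
    totalUnit bigint not null,
    totalPurchaseAmount decimal(16,2) not null,
    primary key (purchaseDay, productGroupId)
) engine=InnoDB;

-- populate the summary from the detail tables
insert into SaleDailyByGroup
select date(s.purchaseDate), p.productGroupId,
       sum(s.unit), sum(s.purchaseAmount)
  from Sale s
  join ProductNormal p
    on s.productId = p.productId
group by date(s.purchaseDate), p.productGroupId;

A report that only needs units by group and day then reads a few thousand summary rows instead of tens of millions of fact rows, which is where the orders of magnitude come from.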

I'm sure there are exceptions to this; with software as complex as a database, and some of the huge and complex databases out there, there are always exceptions.  MySQL is still young and, for example, its statistics functionality isn't reliable enough for this to be true in all cases.  I believe a simple reporting application that has a limited number of predefined reports can get by with a normalized data model that is selectively denormalized when required, but a complex reporting application with frequent ad hoc reporting needs would still do better with a Star schema.  Where the dividing line falls is a judgement call, but if in doubt I would start with a normalized data model and iterate toward a more denormalized one. 

Yet as the capabilities of the optimizers continue to improve I suspect the development of reporting systems should shift to a more normalized data modeling approach. 

Wednesday, December 12, 2007

Denormalized data, not so fast?

The last article demonstrated that a denormalized data model was a bit faster than a normalized data model when all the data fit into memory, but only by 20-30 percent or so, at least for the experiments I ran.  This article deals with larger data volumes; it shows a case where normalized data is faster than denormalized data and then details another case where a denormalized data model is faster. 

First, this is what the tables in question look like.  All the product tables have 10 million rows.  The denormalized ProductAll table is 5.2 gig, and as innodb_buffer_pool_size is set to 4 gig, this table won't fit into memory.  The filler column represents other product attributes without having to specify them in detail.

 

create table ProductAll (
    productAllId int(11) not null,
    productAllName varchar(32) not null,
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    filler varchar(4000) not null,
    productBrandId int(11) not null,
    productBrandName varchar(32) not null,
    primary key (productAllId)
) engine=InnoDB

 

The normalized ProductLess table is 4.9 gig, a bit less than the ProductAll table, but it still has the filler column.   


create table ProductLess (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productBrandId int(11) not null,
    filler varchar(4000) not null,
    primary key (`productId`)
) engine=InnoDB

 

The 4.6 gig ProductBig table will be used to explore how joining two large tables, both larger than the memory allocated to MySQL, performs. 


create table ProductBig (
    productBigId int(11) not null,
    filler varchar(4000) default NULL,
    productBrandId int(11) not null,
    PRIMARY KEY (productBigId),
    CONSTRAINT on_ProductBig_productBigId FOREIGN KEY (productBigId) REFERENCES Product (productId)
) ENGINE=InnoDB

The 9.2 gig Product2Filler table is a combination of the ProductBig and the ProductLess tables.  Where the filler columns of both the ProductBig and the ProductLess tables are 384 characters, the Product2Filler filler column is twice that at 768 characters.


create table Product2Filler (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productBrandId int(11) not null,
    filler varchar(4000) not null,
    primary key  (`productId`)
) engine=InnoDB;
 

 

The Sale table has 120 million rows and is about 30 gig.  Combined with the product tables, even the single month (out of 12 months, or 1/12 of the data) I query against doesn't fit into memory.

 

create table Sale (
  orderId int(11) not null,
  customerId int(11) not null,
  productId int(11) not null,
  productBigId int(11) not null,
  unit int(11) not null,
  purchaseAmount decimal(16,2) not null,
  purchaseCost decimal(16,2) not null,
  purchaseDate datetime not null,
  primary key  (purchaseDate,orderId),
  unique key idx_sale_order (orderId),
  key pf_sale_product (productId),
  key pf_sale_customer (customerId),
  constraint `pf_sale_customer` foreign key (customerId) references Customer (customerId),
  constraint `pf_sale_product` foreign key (productId) references Product (productId)
) engine=InnoDB

The tiny ProductGroup table has just 101 rows. 


create table ProductGroup (
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key  (productGroupId)
) engine=InnoDB

 

On to the actual SQL.  The below denormalized query takes about 4380 seconds.  As all the data doesn't fit into memory, MySQL needs to do some serious reading, which shows up in the iostat numbers; all the queries below are read bound.  The buffer pool hit rate of 950/1000 (or so - I averaged a few samples) confirms the heavy io.  This is a huge performance difference from the blog of December 5, where the equivalent sql, run against a product table that fits into memory because it doesn't have the filler column, ran in 25 seconds. 
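For anyone wanting to reproduce the buffer pool numbers, they come from the BUFFER POOL AND MEMORY section of show engine innodb status; the raw counters behind the hit rate can also be pulled directly.  This is just a sketch - the exact status variable names assume a 5.0-era server:

show engine innodb status\G
show global status like 'Innodb_buffer_pool_read%';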

 

select productGroupName, sum(unit)
  from Sale s
  join ProductAll p
    on s.productId = p.productAllId
where purchaseDate >= '2001-12-01'
group by productGroupName;

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | null              |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

At 3790 seconds, the normalized sql runs a bit faster than the above denormalized version.  Like the previous query, it is also disk bound, but it has a better buffer hit rate of 972/1000 (or so).  As the data is smaller it is better able to fit into memory, which would explain the higher buffer hit rate and the better performance.  I'm not convinced this is the whole story, though, as this query needs to do extra reads to get the small ProductGroup data, which is likely to be in memory all the time even with the MySQL least recently used caching - does anyone have insight into this?

 

select productGroupName, sum(unit)
  from Sale s
  join ProductLess p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-01'
group by productGroupName

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | null              |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        | null              |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

From this one might expect that a normalized data model is faster in all cases when dealing with large data volumes.  But how about joining a large table not only to a small in-memory table, but also to another large table?  The below join of the ProductLess table to the ProductBig table takes 20,429 seconds.   

select productGroupName, sum(unit)
  from Sale s
  join ProductLess p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
  join ProductBig pb
    on p.productId = pb.productBigId
where purchaseDate >= '2001-12-01'
group by productGroupName

 

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | pb    | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | Using index       |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | pb.productBigId  | 1        | Using where       |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        |                   |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

When the ProductLess table and the ProductBig table are combined into the Product2Filler table, similar SQL takes 9219 seconds, about half as long.  In at least some cases, when joining two large tables that can't fully fit into memory, a combined data model is faster.  I know this isn't really a denormalization as both are keyed by productId, but I believe it has the same performance characteristics as a denormalized data model.  At least it shows an example of what happens when two large tables are combined into one even larger one. 

 

select productGroupName, sum(unit)
  from Sale s
  join Product2Filler p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-01'
group by productGroupName

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        |                   |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        |                   |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

As a side note, the below query, against data that is not cached (different dates than the other queries), runs in 340 seconds.  It looks like the cost of a sequential table (range) scan is insignificant compared to the random index reads of another table that can't fit into memory.  Stated differently, joins are at times costly.

 

select count(*) from Sale
where purchaseDate >= '2001-03-01'
   and purchaseDate < '2001-04-01'

 

What has been shown is that normalized data is sometimes faster than denormalized data, and sometimes combined data is faster.  So far the case for the denormalized data model isn't that persuasive.  This is true both when all the data can fit into memory and when it can't and many reads are required.  Why, then, do so many people use star schemas for reporting systems?  In the next article there will be a case where a denormalized data model is a bit faster than what has been shown so far. 

Wednesday, December 5, 2007

Denormalized data is a bit faster than normalized data (but by enough?)

In the last article the performance impact of joins was shown.  This one will demonstrate cases where a denormalized data model is a bit faster, as will the third article, which uses larger data volumes.  The fourth article, the most interesting one, will show where a denormalized data model can be 50 times faster than a normalized data model. 

Here are the tables that will be involved in the sql.  The normalized ProductSmall table has 10 million rows and is about 0.67 gig. 

 

create table ProductSmall (
    productSmallId int(11) not null,
    productSmallName varchar(32) not null,
    productGroupId int(11) not null,
    primary key (productSmallId),
    key idx_ProductSmall_productGroup (productGroupId),
    constraint con_ProductSmall_productGroup
        foreign key (productGroupId) references ProductGroup (productGroupId)
) ENGINE=InnoDB;

 

This 101 row table has the product group text and joins to the ProductSmall table.   

 

create table ProductGroup (
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key (productGroupId)
) ENGINE=InnoDB;

 

This is the 10 million row denormalized combination of the previous two tables.  It is 1.2 gig and is about twice the size of the normalized version. 

  

create table Product (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key (productId)
) ENGINE=InnoDB;

 

First, how long does it take to do a sequential scan of 652,637 rows of the 120 million row Sale table?  At 0.3 seconds it is fast, as the data is cached in memory.  This is the same Sale table from the previous article.  Purchase date is the leading edge of the primary key, which explains the range scan on the primary key. 

 

select sum(unit) from Sale
where purchaseDate >= '2001-12-30'

+----+-------------+-------+-------+---------------+----------+---------+------+----------+-------------+
| id | select_type | TABLE | type  | possible_keys | KEY      | key_len | ref  | rows     | Extra       |
+----+-------------+-------+-------+---------------+----------+---------+------+----------+-------------+
|  1 | SIMPLE      | Sale  | range | PRIMARY       | PRIMARY  | 8       | null | 1882540  | USING WHERE |
+----+-------------+-------+-------+---------------+----------+---------+------+----------+-------------+

 

Adding a join to a small table (that also fits into memory) is a bit slower than the previous simple table scan, but the below sql runs in about 1.1 seconds.  Still reasonably fast, and less than four times slower than a sequential scan of the data.  This is great performance, as btree lookups can be much more expensive than sequential table scans.  This is a sign that the InnoDB adaptive hash indexing is working well for reads.  Running "show engine innodb status" shows "18617.95 hash searches/s, 5.00 non-hash searches/s".  Show engine also shows "Buffer pool hit rate 1000 / 1000", indicating that all the data does fit into memory.  I won't show it, but the rest of the queries in this article have similar numbers.

 

select sum(unit) from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
where purchaseDate >= '2001-12-30';

+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref         | rows     | Extra       |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null        | 1882540  | Using where |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId | 1        | Using index |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+

 

Using this normalized data model, where product group names are in a different table, the extra join slows down performance, but only by a bit.  This sql runs in about 1.4 seconds.  

 

select sum(unit) from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-30';

 

The plan looks similar to the previous one, with the addition of one more join. 

 

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra       |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 1882540  | Using where |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | Using index |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        | Using index |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------+

 

Ok, the denormalized Product table is larger than the normalized ProductSmall table, but it has the productGroupName from the ProductGroup table, removing one of the joins and making the below query a bit faster at 1.1 seconds.  Nothing is substantially different here, but if everything can fit into memory, which this data does, the denormalized data model performs better. 

 

select sum(unit) from Sale s
  join Product p
    on s.productId = p.productId
where purchaseDate >= '2001-12-30';

+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref         | rows     | Extra       |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null        | 1882540  | Using where |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId | 1        | Using index |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+

 

Now to run the same series of tests, but summarizing by productGroupName, just to verify the above pattern holds.  In summary, it does. 

Ok, now that we know the simple cost of a join: grouping by a value on the joined-to table does slow down the sql, but again, this is reasonably fast at 2.2 seconds. 

 

select productGroupName, sum(unit)
  from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-30'
group by productGroupName

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 1882540  | Using filesort ...
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | Using index       |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        | Using index       |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

The denormalized data model is faster again, but not by much, at 1.8 seconds. 

 

select productGroupName, sum(unit)
  from Sale s
  join Product p
    on s.productId = p.productId
where purchaseDate >= '2001-12-30'
group by productGroupName

+----+-------------+-------+----------+----------+---------+-------------+----------+------------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref         | rows     | Extra                  |
+----+-------------+-------+----------+----------+---------+-------------+----------+------------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null        | 1882540  | Using filesort ... 
+----+-------------+-------+----------+----------+---------+-------------+----------+------------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId | 1        | Using index            |
+----+-------------+-------+----------+----------+---------+-------------+----------+------------------------+

 

Ok, what occurs when the same sql is run against 10,182,566 rows?  If cached, the simple range scan query takes about 4 seconds.

 

select sum(unit)
  from Sale s
where purchaseDate >= '2001-12-01'

+----+-------------+-------+-------+----------+---------+------+----------+-------------+
| id | select_type | TABLE | type  | KEY      | key_len | ref  | rows     | Extra       |
+----+-------------+-------+-------+----------+---------+------+----------+-------------+
|  1 | SIMPLE      | s     | range | PRIMARY  | 8       | null | 30931362 | USING WHERE |
+----+-------------+-------+-------+----------+---------+------+----------+-------------+

 

Now, what about a join to the denormalized product table?  This takes 25 seconds, which shows the cost of a table join and a group by over a sequential table scan.  But as all the data can fit into memory the execution times are still reasonably fast. 

 

select productGroupName, sum(unit)
  from Sale s
  join Product p
    on s.productId = p.productId
where purchaseDate >= '2001-12-01'
group by productGroupName

+----+-------------+-------+----------+----------+---------+-------------+----------+------------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref         | rows     | Extra                  |
+----+-------------+-------+----------+----------+---------+-------------+----------+------------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null        | 30931362 | Using filesort ... 
+----+-------------+-------+----------+----------+---------+-------------+----------+------------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId | 1        |                        |
+----+-------------+-------+----------+----------+---------+-------------+----------+------------------------+

 

Once again, the below normalized version is slower than the above denormalized version and runs in 33 seconds.

 

select productGroupName, sum(unit)
  from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-01'
group by productGroupName

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort ...
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        |                   |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        |                   |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

So far, the case for a denormalized data model, such as a star schema, isn't that robust.  Yes, the performance is better, but only by about 20-30 percent.

But this series of articles isn't finished yet.   In the next blog I’ll show what happens when both the normalized and the denormalized data model struggle to fit into memory.  

Wednesday, November 28, 2007

Joins are slow, memory is fast

I've been working with various databases for a number of years. In that time I've found there is plenty of documentation on various features but not much on how database modeling impacts performance. Some people say one should use a fully normalized data model with a few denormalizations that are required for performance reasons, but those denormalizations are never discussed. And Ralph Kimball has many articles and a book concentrating on physical data modeling, but they are more of the "here is how you solve the problem" variety and don't detail why his method works better.

I've always found this odd, as the physical data model has a major impact on database performance. I want this blog to be more about the whys behind various physical data modeling options, with some examples showing the magnitude of the performance differences.

The first goal of this blog (over a few articles) will be to show cases where a denormalized dimensional data model substantially improves performance and instances where a normalized model optimized for updates is appropriate. The server I have available to do this isn't the best, so the largest table is only going to be 120 million rows and 30 gig on disk, but I believe the principles shown will scale to larger data volumes on more capable servers.

At first, at least, I'm going to demonstrate this with MySQL using the InnoDB storage engine, as I'm currently using it at work and need to learn more about it. I'm more comfortable with Oracle and MS SQL Server, so if I get something wrong with MySQL, or it isn't true in all cases (and what is with a database?), or I'm just being an idiot, point it out.

The machine I ran these tests on has an innodb_buffer_pool_size of 2048M and innodb_flush_method=O_DIRECT to avoid double caching by the OS. This Linux machine doesn't have a great disk subsystem, with only 4 disks in a striped, mirrored configuration.
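For completeness, a quick way to confirm those two settings from a client session is below; this is just a sketch, not output from the test machine.

show variables like 'innodb_buffer_pool_size';
show variables like 'innodb_flush_method';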

Enough of the preliminaries, here are the tables. This is the small customer dimension of 1 million rows and 60 meg of data that easily fits into memory.

CREATE TABLE Customer
(customerId int NOT NULL default '0',
customerName varchar(32) NOT NULL,
postalCode varchar(7) NOT NULL,
PRIMARY KEY (`customerId`)
) ENGINE=InnoDB;

The product dimension is larger, with 10 million rows and 1043 meg of data. It will fit into memory, but it takes half of the available memory.

CREATE TABLE Product
(productId int NOT NULL,
productName varchar(32) default NULL,
productGroupId int default NULL,
productGroupName varchar(32) default NULL,
PRIMARY KEY (`productId`)
) ENGINE=InnoDB;

Finally, the large sale fact table, with 120 million rows and about 28 gig of data. It will never fit into memory. Note the trick of creating the clustered index (primary key) based on the date so the common date-based queries can quickly do a range scan on the clustered index, but there is a severe downside to this. Since all the other indexes point to the clustered index they will all be larger and will therefore take longer to insert and read. When to use this trick is a topic in and of itself, so enough of this for now. The range of dates is from 2001-01-01 to 2001-12-31, with the data being randomly distributed over the year.

CREATE TABLE Sale
(orderId int NOT NULL,
customerId int(11) NOT NULL,
productId int(11) NOT NULL,
productBigId int(11) NOT NULL,
unit int(11) NOT NULL,
purchaseAmount decimal(16,2) NOT NULL,
purchaseCost decimal(16,2) NOT NULL,
purchaseDate datetime NOT NULL,
PRIMARY KEY idx_sale_dateOrder (purchaseDate,orderId),
UNIQUE KEY pk_sale_order (orderId),
KEY idx_sale_product (productId),
KEY idx_sale_customer (customerId),
CONSTRAINT pf_sale_product FOREIGN KEY (productId)
REFERENCES Product (productId),
CONSTRAINT pf_sale_customer FOREIGN KEY (customerId)
REFERENCES Customer (customerId)
) ENGINE=InnoDB;

So, my testing technique is to run a query a number of times and throw out the first execution. This isn't entirely realistic, but the alternative is to flush all the data and have no caching at all. As reporting performance is heavily dependent on memory caching, I prefer a testing methodology that takes caching into account, even if it is overstated.

Ok, the month of December comprises a bit more than 10 million rows. Stripping out all the keys that aren’t used, one month of this table fits into memory, which shows up in the 3 to 4 second execution times of this sql after the first execution.

select sum(unit) from Sale
where purchaseDate >= '2001-12-01'

And the execution plan shows that the primary key is being used to do a range scan.

1, 'SIMPLE', 'Sale', 'range', 'PRIMARY', 'PRIMARY', '8', '', 30931362, 'Using where; Using index'

Adding a simple join to the small customer table results in an execution time of 11 to 12 seconds. So while everything fits into memory, an in-memory join still makes a query run about 3 times slower in this case. As expected, the execution plan shows the sale table primary key being used for a range scan, with the join to the customer table using the primary key for an index lookup. As a note, MySQL needs to do a better job on table analysis, as I didn't get a good plan on this simple query without a hint.

select straight_join sum(unit) from Sale s
join Customer c
on s.customerId = c.customerId
where s.purchaseDate >= '2001-12-01'

And here is the plan from the above hint, which drives from the sales table and looks up the related customer information.

1, 'SIMPLE', 's', 'range', 'PRIMARY', 'PRIMARY', '8', '', 30931362, 'Using where'
1, 'SIMPLE', 'c', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'perfbig.s.customerId', 1, 'Using index'

In the next case the sale table is joined to the much larger product table. One month of the sale table and the product table together can't fit into memory; instead of fast memory access we are dealing with slow disk reads and query times of around 417 to 525 seconds.

I was surprised by how much slower this is than the equivalent customer query. I'm also wondering about the strength of the simple InnoDB least recently used buffer pool algorithm at this point. As expected with the slow execution time, iostat shows heavy disk utilization.

select count(*) from Sale s
join Product p
on s.productId = p.productId
where purchaseDate >= '2001-12-01';

The plan is reasonable as it drives from the sale table and joins to the product table.

1, 'SIMPLE', 's', 'range', 'PRIMARY,pf_sale_product', 'PRIMARY', '8', '', 30931362, 'Using where'
1, 'SIMPLE', 'p', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'perfbig.s.productId', 1, 'Using index'

Instead of using a full month of data, how does querying just two days of the sale table behave? Well, it executes in 0.3 seconds. Obviously, everything fits into memory at this point.

select sum(unit) from Sale s
where purchaseDate >= '2001-12-30';

The plan is good, with a range scan of the sale table.

1, 'SIMPLE', 's', 'range', 'PRIMARY', 'PRIMARY', '8', '', 1882540, 'Using where'

How about adding a simple join to the customer table? That results in an execution time of one second, so even small in memory joins impact execution times.

select sum(unit) from Sale s
join Customer c
on s.customerId = c.customerId
where purchaseDate >= '2001-12-30';

The plan is good.

1, 'SIMPLE', 's', 'range', 'PRIMARY,pf_sale_customer', 'PRIMARY', '8', '', 1882540, 'Using where'
1, 'SIMPLE', 'c', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'perfbig.s.customerId', 1, 'Using index'

What is interesting is that the join to the much larger product table takes about the same amount of time as the join to the Customer table: one second. I'm going to guess this is because everything can still fit into memory. This is verified as iostat does not show any disk io.

select sum(unit) from Sale s
join Product p
on s.productId = p.productId
where purchaseDate >= '2001-12-30';

And the product plan looks the same as the join to the customer table.

1, 'SIMPLE', 's', 'range', 'PRIMARY,pf_sale_product', 'PRIMARY', '8', '', 1882540, 'Using where'
1, 'SIMPLE', 'p', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'perfbig.s.productId', 1, 'Using index'

So, one of the insights of the dimensional model is that joins are expensive. This has been demonstrated by these tests. The size of the dimension doesn't impact performance if the dimension can fit into memory, but a large dimension that doesn't fit into memory has a severe impact on performance.

Others, looking at the dimensional model, will say normalization is best as it minimizes the amount of memory a table will use, and memory optimization means more of the data fits into memory, resulting in fewer of those expensive physical reads. So which is more important, minimizing the number of joins or minimizing the size of tables? That will be covered in another (hopefully shorter) entry.