Friday, December 14, 2007

A setting star

In my second blog post I stated that I would demonstrate a case where a denormalized Star schema is 50 times slower than a normalized data model.  Well, while writing this post I looked at the tables again and realized that my indexes and statistics weren't optimal.  Yeah, I feel a bit foolish.  Once I fixed those issues there was no performance difference.  I'll go into this more in the post-test discussion. 
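
For the record, "fixing the statistics" just means refreshing the index statistics so the optimizer has reasonable row estimates; in MySQL that is done with ANALYZE TABLE against the tables defined below.  This is a minimal sketch of the idea, not the exact commands I ran:

-- Refresh index statistics so the optimizer has reasonable row estimates.
analyze table ProductNormal;
analyze table ProductBrand;
analyze table ProductGroup;
analyze table ProductDenorm;
analyze table Sale;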

First, these are the tables that will be used to test the performance of the normalized and denormalized data models.  The ProductNormal table represents the normalized model; it has 10 million rows and is about 0.8 gig. 

 

create table ProductNormal (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productBrandId int(11) not null,
    primary key (productId),
    key idx_ProductNormal_GroupIdBrandId (productGroupId, productBrandId),
    key idx_ProductNormal_BrandGroup (productBrandId, productGroupId),
    constraint fk_ProductNormal_productBrandId
        foreign key (productBrandId) references ProductBrand (productBrandId),
    constraint fk_ProductNormal_productGroupId
        foreign key (productGroupId) references ProductGroup (productGroupId)
) engine=InnoDB

 

The small ProductBrand table has 11 rows.  

 

create table ProductBrand (
    productBrandId int(11) not null,
    productBrandName varchar(32) not null,
    primary key (productBrandId),
    unique key idx_ProductBrand_Name (productBrandName)
) engine=InnoDB

 

The ProductGroup table has 101 rows.  

 

create table ProductGroup (
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key (productGroupId),
    unique key idx_ProductGroup_Name (productGroupName)
) engine=InnoDB

 

The 10 million row ProductDenorm combines the previous three tables into a 1.27 gig table. 


create table ProductDenorm (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    productBrandId int(11) not null,
    productBrandName varchar(32) not null,
    primary key (productId),
    key idx_ProductDenorm_GroupNameBrandName (productGroupName, productBrandName),
    key idx_ProductDenorm_BrandNameGroupName (productBrandName, productGroupName)
) engine=InnoDB
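
For completeness, one straightforward way to populate a table like ProductDenorm from the three normalized tables is an insert ... select over the joins.  This is a sketch of the idea, not necessarily how I loaded it:

-- Flatten the three normalized product tables into ProductDenorm.
insert into ProductDenorm
    (productId, productName, productGroupId, productGroupName,
     productBrandId, productBrandName)
select p.productId, p.productName,
       pg.productGroupId, pg.productGroupName,
       pb.productBrandId, pb.productBrandName
  from ProductNormal p
  join ProductGroup pg on p.productGroupId = pg.productGroupId
  join ProductBrand pb on p.productBrandId = pb.productBrandId;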

 

The 120 million row Sale table is about 30 gig. 


create table Sale (
  orderId int(11) not null,
  customerId int(11) not null,
  productId int(11) not null,
  productBigId int(11) not null,
  unit int(11) not null,
  purchaseAmount decimal(16,2) not null,
  purchaseCost decimal(16,2) not null,
  purchaseDate datetime not null,
  primary key  (purchaseDate,orderId),
  unique key idx_sale_order (orderId),
  key idx_sale_product (productId),
  key idx_sale_customer (customerId),
  constraint pf_sale_customer foreign key (customerId) references Customer (customerId),
  constraint pf_sale_product foreign key  (productId) references Product (productId)
) engine=InnoDB
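
If you want to check sizes like these yourself, information_schema gives a rough number; it counts both data and index pages, so treat the result as approximate:

-- Approximate data plus index size per table, in gigabytes.
select table_name,
       round((data_length + index_length) / 1024 / 1024 / 1024, 2) as total_gb
  from information_schema.tables
 where table_schema = database()
   and table_name in ('ProductNormal', 'ProductDenorm', 'Sale');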

 

First, the denormalized version of the query below takes about 0.39 seconds, averaged over 100 executions.  The rest of the queries are also averaged over 100 runs.  What is interesting is that the driving table of the plan is not the Sale table but the ProductDenorm table.  There are two basic query plans the optimizer can decide on.  The first is to drive off of the Sale fact table, which was covered in previous blog entries.  The second is to drive off of the dimension (descriptive) tables, like the ProductGroup table.  After this article, both cases will have been tested. 

 

select sum(unit) from Sale s
  join ProductDenorm p
    on s.productId = p.productId
where p.productBrandName = 'ProductBrand 1'
   and p.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-30'

 

+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+
| id | select_type | table | type     | key                 | key_len | ref         | rows  | extra
+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+
|  1 | SIMPLE      | p     | ref      | idx_ProductDenorm_  | 70      | const,const | 13608 | Using where
|    |             |       |          | BrandNameGroupName  |         |             |       | 
+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+
|  1 | SIMPLE      | s     | ref      | pf_sale_product     | 4       | p.productId | 10    | Using where
+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+

 

The normalized version also takes about 0.39 seconds.  The plan is similar to the previous one, but the driving table is the normalized ProductGroup and not the denormalized ProductDenorm.   In both cases the last table in the plan is the Sale table. 

 

select sum(unit)
  from Sale s
  join ProductNormal p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
  join ProductBrand pb
    on p.productBrandId = pb.productBrandId 
where pb.productBrandName = 'ProductBrand 1'
   and pg.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-30';

 

+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
| id | select_type | table | type | key                          | key_len | ref                | rows | extra
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pg    | ref  | idx_ProductGroup_Name        | 34      | const              | 1    | where,index 
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pb    | ref  | idx_ProductBrand_Name        | 34      | const              | 1    | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | p     | ref  | idx_ProductNormal_BrandGroup | 8       | pb.productBrandId, | 3258 | where,index 
|    |             |       |      |                              |         | pg.productGroupId  |      |
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | s     | ref  | pf_sale_product              | 4       | p.productId        | 10   | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
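
As an aside, MySQL's STRAIGHT_JOIN modifier forces the tables to be joined in the order they are listed, so it can be used to look at the other kind of plan, the one that drives off of the Sale fact table.  I didn't use it for any of the timings here; it is just a way to compare the plans yourself:

-- Force the join order so Sale, listed first, is the driving table.
explain
select straight_join sum(unit) from Sale s
  join ProductDenorm p
    on s.productId = p.productId
where p.productBrandName = 'ProductBrand 1'
   and p.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-30';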

 

How about the same basic query, but for a month of data?  It also takes about 0.39 seconds. 

 

select sum(unit) from Sale s
  join ProductDenorm p
    on s.productId = p.productId
where p.productBrandName = 'ProductBrand 1'
   and p.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-01'

 

+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+
| id | select_type | table | type     | key                | key_len | ref         | rows     | extra
+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+
|  1 | SIMPLE      | p     | ref      | idx_ProductDenorm_ | 70      | const,const | 13608    | where
|    |             |       |          | BrandNameGroupName |         |             |          |
+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+
|  1 | SIMPLE      | s     | ref      | pf_sale_product    | 4       | p.productId | 10       | where
+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+

 

The normalized version also takes about 0.39 seconds against a month of data.  So, at least for these simple queries, the normalized data model is just as fast as the denormalized one. 

 

select sum(unit) from Sale s
  join ProductNormal p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
  join ProductBrand pb
    on p.productBrandId = pb.productBrandId
where pb.productBrandName = 'ProductBrand 1'
   and pg.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-01'

 

+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
| id | select_type | table | type | key                          | key_len | ref                | rows | extra
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pg    | ref  | idx_ProductGroup_Name        | 34      | const              | 1    | where,index 
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pb    | ref  | idx_ProductBrand_Name        | 34      | const              | 1    | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | p     | ref  | idx_ProductNormal_BrandGroup | 8       | pb.productBrandId, | 3258 | where,index
|    |             |       |      |                              |         | pg.productGroupId  |      |
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | s     | ref  | pf_sale_product              | 4       | p.productId        | 10   | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+

 

Over the course of four blog entries, tests have shown that joins are expensive, but in spite of that, denormalized data isn't that much faster than a normalized data model.  The best case was when a denormalized table was a bit more than twice as fast as a few normalized tables.  Then there was the case where a normalized data model was faster than a denormalized one because the normalized model used memory more efficiently.  And finally, when the optimizer drives off of the dimensional (descriptive) data and not the fact table, there isn't a noticeable difference in performance.  I know there are exceptions where a normalized data model performs worse than what I've shown, but they do appear to be exactly that: exceptions. 

I could run more complex tests, and I probably will over time, but there are enough of them to start drawing tentative conclusions.  In the transactional world the general rule is to normalize and only denormalize when a specific query or set of queries warrants it.  Given that denormalization can degrade performance at times, this is wise.  The time spent denormalizing the data model would be better spent adding better indexes and getting the statistics right. 
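
As an example of what I mean by better indexes, a covering index on Sale that includes the filter columns and the summed column would let the test query above be answered from the index alone.  This index is hypothetical; it is not one of the indexes used in the tests:

-- Hypothetical covering index: productId, purchaseDate and unit can all be read
-- from the index without touching the 30 gig of row data.
alter table Sale
  add index idx_sale_product_date_unit (productId, purchaseDate, unit);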

For reporting systems I would, hesitantly, recommend the same practice - normalize first and then denormalize when required.  I believe the Star schema made sense when it was first proposed a decade ago, when database optimizers were weak.  A denormalized data model has fewer join permutations to consider, so an optimizer was better able to find a fast query plan when the data model was simple.  

Over the past decade, database optimizers have improved and are better able to handle the complexity of a normalized data model.  When I was working on this article my normalized data model was initially much slower than my denormalized version.  One of the main reasons it was slow was that my statistics weren't very good.  With better stats, MySQL, which isn't known for having the best optimizer, was at times able to find a query plan for the normalized model that was just as fast as the denormalized version.  If this is true for the youthful MySQL, it should be true for more established databases like SQL Server and Oracle, which are better at creating stats. 

It takes considerable effort to develop and populate a Star schema.  I now believe that, in many cases at least, a normalized data model is best for reporting, with denormalization done on an exceptional basis where it makes sense.  Develop the physical data model for reporting systems just like you would a transactional physical data model.  If denormalization makes an important query much faster, then denormalize. 

This may be reflective of the start-up environments where I work, where we have reasonably simple data models of a hundred or so tables, but some of those tables have up to a billion rows.  Based on this experience and these tests, though, the time spent denormalizing the data model would be better spent on creating better indexes and finding and populating summary tables.  This is where the huge, orders-of-magnitude improvements are to be found.  Yes, a star schema would improve the performance of that important query, but would a summary table be even better? 
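
To make the summary table point concrete, here is the kind of thing I have in mind.  The table is hypothetical, but a query like the ones above could then read a few thousand pre-aggregated rows instead of ranging over the Sale table:

-- Hypothetical daily summary of Sale by product.
create table SaleDailySummary (
    purchaseDay date not null,
    productId int(11) not null,
    totalUnit bigint not null,
    totalAmount decimal(16,2) not null,
    primary key (purchaseDay, productId)
) engine=InnoDB;

insert into SaleDailySummary
select date(purchaseDate), productId, sum(unit), sum(purchaseAmount)
  from Sale
 group by date(purchaseDate), productId;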

I'm sure there are exceptions to this; with software as complex as a database, and with some of the huge and complex databases out there, there are always exceptions.  MySQL is still young and, for example, its statistics functionality isn't reliable enough for this to be true in all cases.  I believe a simple reporting application with a limited number of predefined reports can get by with a normalized data model that is selectively denormalized when required, but a complex reporting application with frequent ad hoc reporting needs would still do better with a Star schema.  Where the dividing line falls is a judgement call, but if in doubt I would start with a normalized data model and iterate toward a more denormalized one. 

Yet as the capabilities of the optimizers continue to improve I suspect the development of reporting systems should shift to a more normalized data modeling approach. 
