Friday, December 14, 2007

A setting star

In my second blog post I stated I would demonstrate a case where a denormalized Star schema is 50 times faster than a normalized data model.  Well, while writing this post I looked at the tables again and realized that I didn't have optimal indexes and that my statistics weren't optimal.  Yeah, I feel a bit foolish.  Once I fixed those issues there wasn't a performance difference.  I'll go into this more in the post-test discussion. 
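
For reference, InnoDB index statistics can be re-estimated with a simple ANALYZE TABLE.  This is only a minimal sketch against the tables defined below; the exact statements I ran aren't the interesting part.

-- illustrative; re-estimate index statistics so the optimizer has better row estimates
analyze table ProductNormal;
analyze table ProductBrand;
analyze table ProductGroup;
analyze table Sale;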

First, these are the tables that will be used to test the performance of the normalized and denormalized data models.  The ProductNormal table is the normalized product table; it has 10 million rows and is about 0.8 gig. 

 

create table ProductNormal (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productBrandId int(11) not null,
    primary key (productId),
    key idx_ProductNormal_GroupIdBrandId (productGroupId, productBrandId),
    key idx_ProductNormal_BrandGroup (productBrandId, productGroupId),
    constraint fk_ProductNormal_productBrandId
        foreign key (productBrandId) references ProductBrand (productBrandId),
    constraint fk_ProductNormal_productGroupId
        foreign key (productGroupId) references ProductGroup (productGroupId)
) engine=InnoDB

 

The small ProductBrand table has 11 rows.  

 

create table ProductBrand (
    productBrandId int(11) not null,
    productBrandName varchar(32) not null,
    primary key (productBrandId),
    unique key idx_ProductBrand_Name (productBrandName)
) engine=InnoDB

 

The ProductGroup table has 101 rows.  

 

create table ProductGroup (
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key (productGroupId),
    unique key idx_ProductGroup_Name (productGroupName)
) engine=InnoDB

 

The 10 million row ProductDenorm combines the previous three tables into a 1.27 gig table. 


create table ProductDenorm (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    productBrandId int(11) not null,
    productBrandName varchar(32) not null,
    primary key (productId),
    key idx_ProductDenorm_GroupNameBrandName (productGroupName, productBrandName),
    key idx_ProductDenorm_BrandNameGroupName (productBrandName, productGroupName)
) engine=InnoDB
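
A denormalized table like this can be loaded from the three normalized tables with a single insert-select.  The original load script isn't shown here, but a sketch of one plausible way to do it:

-- illustrative load; the actual script used to populate ProductDenorm isn't shown
insert into ProductDenorm
select p.productId, p.productName,
       pg.productGroupId, pg.productGroupName,
       pb.productBrandId, pb.productBrandName
  from ProductNormal p
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
  join ProductBrand pb
    on p.productBrandId = pb.productBrandId;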

 

The 120 million row Sale table is about 30 gig. 


create table Sale (
  orderId int(11) not null,
  customerId int(11) not null,
  productId int(11) not null,
  productBigId int(11) not null,
  unit int(11) not null,
  purchaseAmount decimal(16,2) not null,
  purchaseCost decimal(16,2) not null,
  purchaseDate datetime not null,
  primary key  (purchaseDate,orderId),
  unique key idx_sale_order (orderId),
  key idx_sale_product (productId),
  key idx_sale_customer (customerId),
  constraint pf_sale_customer foreign key (customerId) references Customer (customerId),
  constraint pf_sale_product foreign key  (productId) references Product (productId)
) engine=InnoDB

 

First, the denormalized version of the query below takes about 0.39 seconds, averaged over 100 executions (the rest of the queries are also averaged over 100 runs).  What is interesting is that the driving table of the plan is not the Sale table but the product table.  There are two basic query plans the optimizer can choose between.  The first is to drive off of the Sale fact table, which was covered in previous posts.  The second is to drive off of the dimension (descriptive) tables, like the ProductGroup table.  After this article, both cases will have been tested. 

 

select sum(unit) from Sale s
  join ProductDenorm p
    on s.productId = p.productId
where p.productBrandName = 'ProductBrand 1'
   and p.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-30'

 

+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+
| id | select_type | table | type     | key                 | key_len | ref         | rows  | extra
+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+
|  1 | SIMPLE      | p     | ref      | idx_ProductDenorm_  | 70      | const,const | 13608 | Using where
|    |             |       |          | BrandNameGroupName  |         |             |       | 
+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+
|  1 | SIMPLE      | s     | ref      | pf_sale_product     | 4       | p.productId | 10    | Using where
+----+-------------+-------+----------+---------------------+---------+-------------+-------+------------+
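
To see the other plan, the one that drives off the Sale fact table, the optimizer can be nudged with MySQL's STRAIGHT_JOIN, which joins tables in the order they are listed.  This is just a sketch for comparison, not one of the timed queries:

-- not one of the timed queries; straight_join forces Sale to be read before ProductDenorm
select sum(unit) from Sale s
  straight_join ProductDenorm p
    on s.productId = p.productId
where p.productBrandName = 'ProductBrand 1'
   and p.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-30';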

 

The normalized version also takes about 0.39 seconds.  The plan is similar to the previous one, but the driving table is the normalized ProductGroup and not the denormalized ProductDenorm.   In both cases the last table in the plan is the Sale table. 

 

select sum(unit)
  from Sale s
  join ProductNormal p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
  join ProductBrand pb
    on p.productBrandId = pb.productBrandId 
where pb.productBrandName = 'ProductBrand 1'
   and pg.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-30';

 

+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
| id | select_type | table | type | key                          | key_len | ref                | rows | extra
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pg    | ref  | idx_ProductGroup_Name        | 34      | const              | 1    | where,index 
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pb    | ref  | idx_ProductBrand_Name        | 34      | const              | 1    | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | p     | ref  | idx_ProductNormal_BrandGroup | 8       | pb.productBrandId, | 3258 | where,index 
|    |             |       |      |                              |         | pg.productGroupId  |      |
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | s     | ref  | pf_sale_product              | 4       | p.productId        | 10   | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+

 

How about the same basic query, but for a month of data?  It also takes about 0.39 seconds. 

 

select sum(unit) from Sale s
  join ProductDenorm p
    on s.productId = p.productId
where p.productBrandName = 'ProductBrand 1'
   and p.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-01'

 

+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+
| id | select_type | table | type     | key                | key_len | ref         | rows     | extra
+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+
|  1 | SIMPLE      | p     | ref      | idx_ProductDenorm_ | 70      | const,const | 13608    | where
|    |             |       |          | BrandNameGroupName |         |             |          |
+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+
|  1 | SIMPLE      | s     | ref      | pf_sale_product    | 4       | p.productId | 10       | where
+----+-------------+-------+----------+--------------------+---------+-------------+----------+-----------+

 

The normalized version also takes about 0.39 seconds against a month of data.   So, at least for these simple queries, the normalized data model is just as fast as the denormalized one. 

 

select sum(unit) from Sale s
  join ProductNormal p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
  join ProductBrand pb
    on p.productBrandId = pb.productBrandId
where pb.productBrandName = 'ProductBrand 1'
   and pg.productGroupName = 'Group 1'
   and purchaseDate > '2001-12-01'

 

+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
| id | select_type | table | type | key                          | key_len | ref                | rows | extra
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pg    | ref  | idx_ProductGroup_Name        | 34      | const              | 1    | where,index 
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | pb    | ref  | idx_ProductBrand_Name        | 34      | const              | 1    | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | p     | ref  | idx_ProductNormal_BrandGroup | 8       | pb.productBrandId, | 3258 | where,index
|    |             |       |      |                              |         | pg.productGroupId  |      |
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+
|  1 | SIMPLE      | s     | ref  | pf_sale_product              | 4       | p.productId        | 10   | where,index
+----+-------------+-------+------+------------------------------+---------+--------------------+------+------------+

 

Over the course of four blog entries the tests have shown that joins are expensive, but in spite of that, denormalized data isn't that much faster than a normalized data model.  The best case was when a denormalized table was a bit more than twice as fast as a few normalized tables.  Then there was the case where a normalized data model was faster than the denormalized one because the normalized data used memory more efficiently.  And finally, when the optimizer drives off of the dimensional (descriptive) data and not the fact table, there isn't a noticeable difference in performance.  I know there are exceptions where a normalized data model performs worse than what I've shown, but they do appear to be exactly that: exceptions. 

I could run more complex tests, and I probably will over time, but there are enough of them to start drawing tentative conclusions.  In the transactional world the general rule is to normalize, and to denormalize only when a specific query or set of queries warrants it.  Given that denormalization can degrade performance at times, this is wise.  The time spent denormalizing the data model would be better spent adding better indexes and getting the statistics right. 

For reporting systems I would, hesitantly, recommend the same practice - normalize first and then denormalize when required.  I believe the Star schema made sense when it was first proposed a decade ago, when database optimizers were weak.  A denormalized data model has fewer join permutations to consider, so an optimizer was better able to find a fast query plan when the data model was simple.  

Over the past decade database optimizers have improved and are better able to handle the complexity of a normalized data model.  When I was working on this article my normalized data model was initially much slower than my denormalized version.  One of the main reasons it was slow was that my statistics weren't very good.  With better stats, MySQL, which isn't known for having the best optimizer, was at times able to find a query plan for the normalized model that was just as fast as the denormalized version.  If this is true for the youthful MySQL, it should be true for more established databases like SQL Server and Oracle, which are better at creating stats. 

It takes considerable effort to develop and populate a Star schema.  I now believe that, in many cases at least, a normalized data model is best for reporting, with denormalization applied on an exceptional basis where it makes sense.  Develop the physical data model for a reporting system just like you would a transactional physical data model, and if denormalization makes an important query much faster, then denormalize. 

This may be reflective of the start-up environments where I work, where we have reasonably simple data models of a hundred or so tables, but some of those tables have up to a billion rows.  Based on this experience and these tests, though, the time spent denormalizing the data model would be better spent on creating better indexes and on finding and populating summary tables.  This is where the huge, orders-of-magnitude improvements are to be found.  Yes, a star schema would improve the performance of that important query, but would a summary table be even better? 
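
To make the summary table idea concrete, here is the sort of thing I mean.  The table and column names below are hypothetical, not part of the tests above, but a monthly group-level report could then read a few thousand summary rows instead of tens of millions of Sale rows.

-- hypothetical daily summary table; names are illustrative only
create table SaleDailyByGroup (
    purchaseDay date not null,
    productGroupId int(11) not null,
    totalUnit bigint not null,
    totalAmount decimal(20,2) not null,
    primary key (purchaseDay, productGroupId)
) engine=InnoDB;

-- one possible load, aggregating the Sale fact table once
insert into SaleDailyByGroup
select date(s.purchaseDate), p.productGroupId, sum(s.unit), sum(s.purchaseAmount)
  from Sale s
  join ProductNormal p
    on s.productId = p.productId
group by date(s.purchaseDate), p.productGroupId;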

I'm sure there are exceptions to this; with software as complex as a database, and with some of the huge and complex databases out there, there are always exceptions.  MySQL is still young and, for example, its statistics functionality isn't reliable enough for this to be true in all cases.  I believe a simple reporting application with a limited number of predefined reports can get by with a normalized data model that is selectively denormalized when required, but a complex reporting application with frequent ad hoc reporting needs would still do better with a Star schema.  Where the dividing line falls is a judgement call, but if in doubt I would start with a normalized data model and iterate toward a more denormalized one. 

Yet as the capabilities of the optimizers continue to improve I suspect the development of reporting systems should shift to a more normalized data modeling approach. 

Wednesday, December 12, 2007

Denormalized data, not so fast?

The last article demonstrated that a denormalized data model was a bit faster than a normalized data model when all the data fit into memory, but only by 20-30 percent or so, at least for the experiments I ran.  This article deals with larger data volumes; it shows a case where normalized data is faster than denormalized data, and then details another case where a denormalized data model is faster. 

First, this is what the tables in question look like.  All the product tables have 10 million rows.  The denormalized ProductAll table is 5.2 gig, and as innodb_buffer_pool_size is set to 4 gig, this table won't fit into memory.  The filler column stands in for other product attributes without having to specify them in detail.
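
As a quick sanity check, the buffer pool size the server is actually running with can be confirmed from the client (the value itself is set via innodb_buffer_pool_size in my.cnf and needs a restart to change):

-- confirm the buffer pool size in effect on the running server
show variables like 'innodb_buffer_pool_size';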

 

create table ProductAll (
    productAllId int(11) not null,
    productAllName varchar(32) not null,
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    filler varchar(4000) not null,
    productBrandId int(11) not null,
    productBrandName varchar(32) not null,
    primary key (productAllId)
) engine=InnoDB

 

The normalized ProductLess table is 4.9 gig, a bit smaller than the ProductAll table, but it still has the filler column.   


create table ProductLess (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productBrandId int(11) not null,
    filler varchar(4000) not null,
    primary key (`productId`)
) engine=InnoDB

 

The 4.6 gig ProductBig table will be used to explore how joining two large tables, tables that are both larger than the memory allocated to MySQL, performs. 


create table ProductBig (
    productBigId int(11) not null,
    filler varchar(4000) default NULL,
    productBrandId int(11) not null,
    PRIMARY KEY (productBigId),
    CONSTRAINT on_ProductBig_productBigId FOREIGN KEY (productBigId) REFERENCES Product (productId)
) ENGINE=InnoDB

The 9.2 gig Product2Filler table is a combination of the ProductBig and ProductLess tables.  Where the filler columns of the ProductBig and ProductLess tables are each 384 characters, the Product2Filler filler column is twice that, at 768 characters.


create table Product2Filler (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productBrandId int(11) not null,
    filler varchar(4000) not null,
    primary key  (`productId`)
) engine=InnoDB;
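
The load for Product2Filler isn't shown; one plausible way to build it, assuming the doubled filler is simply the two original filler columns concatenated, would be:

-- assumption: the 768-character filler is the two 384-character fillers concatenated
insert into Product2Filler
select pl.productId, pl.productName, pl.productGroupId, pl.productBrandId,
       concat(pl.filler, pb.filler)
  from ProductLess pl
  join ProductBig pb
    on pl.productId = pb.productBigId;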
 

 

The Sale table has 120 million rows and is about 30 gig.  Combined with the product tables, even the single month of data I query against (out of 12 months, or 1/12 of the data) doesn't fit into memory.

 

create table Sale (
  orderId int(11) not null,
  customerId int(11) not null,
  productId int(11) not null,
  productBigId int(11) not null,
  unit int(11) not null,
  purchaseAmount decimal(16,2) not null,
  purchaseCost decimal(16,2) not null,
  purchaseDate datetime not null,
  primary key  (purchaseDate,orderId),
  unique key idx_sale_order (orderId),
  key pf_sale_product (productId),
  key pf_sale_customer (customerId),
  constraint `pf_sale_customer` foreign key (customerId) references Customer (customerId),
  constraint `pf_sale_product` foreign key (productId) references Product (productId)
) engine=InnoDB

The ProductGroup table is tiny, at just 101 rows. 


create table ProductGroup (
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key  (productGroupId)
) engine=InnoDB

 

On to the actual SQL.  The denormalized query below takes about 4380 seconds.  As all the data doesn't fit into memory, MySQL needs to do some serious reading, which shows up in the iostat numbers; all of the queries below are read bound.  The buffer pool hit rate of 950 / 1000 (or so - I averaged a few samples) confirms the heavy IO.  This is a huge performance difference from the December 5 post, where the equivalent SQL, run against product tables that fit into memory because they don't have the filler column, ran in 25 seconds. 
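
The read-bound behaviour can also be watched from inside MySQL rather than just from iostat; the ratio of physical reads to logical read requests climbs sharply once the working set stops fitting in the buffer pool:

-- physical reads (Innodb_buffer_pool_reads) versus logical requests (Innodb_buffer_pool_read_requests)
show global status like 'Innodb_buffer_pool_read%';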

 

select productGroupName, sum(unit)
  from Sale s
  join ProductAll p
    on s.productId = p.productAllId
where purchaseDate >= '2001-12-01'
group by productGroupName;

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | null              |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

At 3790 seconds, the normalized SQL runs a bit faster than the above denormalized version.  Like the previous query it is disk bound, but it has a better buffer hit rate of 972 / 1000 (or so).  As the data is smaller it is better able to fit into memory, which explains the higher buffer hit rate and the better performance?  I'm not convinced this is the whole story, as the normalized query needs to do more reads to get the small ProductGroup data, which is likely to be in memory all the time even with MySQL's least-recently-used caching - does anyone have insight into this?

 

select productGroupName, sum(unit)
  from Sale s
  join ProductLess p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-01'
group by productGroupName

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | null              |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        | null              |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

From this one might expect a normalized data model to be faster in all cases when dealing with large data volumes.  But what about joining a large table not just to a small in-memory table, but also to another large table?  The below join of the ProductLess table to the ProductBig table takes 20,429 seconds.   

select productGroupName, sum(unit)
  from Sale s
  join ProductLess p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
  join ProductBig pb
    on p.productId = pb.productBigId
where purchaseDate >= '2001-12-01'
group by productGroupName

 

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | pb    | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | Using index       |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | pb.productBigId  | 1        | Using where       |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        |                   |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

When the ProductLess and ProductBig tables are combined into the Product2Filler table, similar SQL takes 9219 seconds, about half as long.  So in at least some cases, when joining two large tables that can't fully fit into memory, a combined data model is faster.  I know this isn't really denormalization, as both tables are keyed by productId, but I believe it has the same performance characteristics as a denormalized data model.  At the least it shows what happens when two large tables are combined into one even larger one. 

 

select productGroupName, sum(unit)
  from Sale s
  join Product2Filler p
    on s.productId = p.productId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-01'
group by productGroupName

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra             |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort... |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        |                   |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        |                   |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------------+

 

As a side note, the below query, run against data that is not cached (different dates than the other queries), takes 340 seconds.  It looks like the cost of a sequential table (range) scan is insignificant compared to the random index reads of another table that can't fit into memory.  Stated differently, joins are at times costly.

 

select count(*) from Sale
where purchaseDate >= '2001-03-01'
   and purchaseDate < '2001-04-01'

 

What has been shown is that normalized data is sometimes faster than denormalized data, and that sometimes combined data is faster.  So far the case for the denormalized data model isn't that persuasive.  This is true both when all the data can fit into memory and when it can't and many reads are required.  Why, then, do so many people use star schemas for reporting systems?  In the next article there will be a case where a denormalized data model does a bit better than what has been shown so far. 

Wednesday, December 5, 2007

Denormalized data is a bit faster than normalized data (but by enough?)

In the last article the performance impact of joins was shown.  This one will demonstrate cases where a denormalized data model is a bit faster, as will the third article, which uses larger data volumes.  The fourth article, the most interesting one, will show where a denormalized data model can be 50 times faster than a normalized data model. 

Here are the tables that will be involved in the SQL.  The normalized ProductSmall table has 10 million rows and is about 0.67 gig. 

 

create table ProductSmall (
    productSmallId int(11) not null,
    productSmallName varchar(32) not null,
    productGroupId int(11) not null,
    primary key (productSmallId),
    key idx_ProductSmall_productGroup (productGroupId),
    constraint con_ProductSmall_productGroup
        foreign key (productGroupId) references ProductGroup (productGroupId)
) ENGINE=InnoDB;

 

This 101 row table has the product group text and joins to the ProductSmall table.   

 

create table ProductGroup (
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key (productGroupId)
) ENGINE=InnoDB;

 

This is the 10 million row denormalized combination of the previous two tables.  It is 1.2 gig, about twice the size of the normalized version. 

  

create table Product (
    productId int(11) not null,
    productName varchar(32) not null,
    productGroupId int(11) not null,
    productGroupName varchar(32) not null,
    primary key (productId)
) ENGINE=InnoDB;

 

First, how long does it take to do a sequential scan of 652,637 rows of the 120 million row Sale table?  At 0.3 seconds it is fast, as the data is cached in memory.  This is the same Sale table from the previous article.  The purchase date is the leading edge of the primary key, which explains the range scan on the primary key. 

 

select sum(unit) from Sale
where purchaseDate >= '2001-12-30'

 

+----+-------------+-------+-------+---------------+----------+---------+------+----------+-------------+
| id | select_type | TABLE | type  | possible_keys | KEY      | key_len | ref  | rows     | Extra       |
+----+-------------+-------+-------+---------------+----------+---------+------+----------+-------------+
|  1 | SIMPLE      | Sale  | range | PRIMARY       | PRIMARY  | 8       | null | 1882540  | USING WHERE |
+----+-------------+-------+-------+---------------+----------+---------+------+----------+-------------+

 

Adding a join to a small table (that also fits into memory) is a bit slower than the previous simple table scan, but the below SQL runs in about 1.1 seconds.  That is still reasonably fast, less than four times slower than a sequential scan of the data.  This is great performance, as B-tree lookups can be much more expensive than sequential table scans, and it is a sign that the InnoDB adaptive hash index is working well for reads.  Running "show engine innodb status" shows "18617.95 hash searches/s, 5.00 non-hash searches/s".  It also shows "Buffer pool hit rate 1000 / 1000", indicating that all the data does fit into memory.  I won't show it, but the rest of the queries in this article have similar numbers.
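
For reference, both of those figures come from the same command; the relevant sections of the output look roughly like this (the numbers will obviously differ from run to run):

show engine innodb status\G
-- under INSERT BUFFER AND ADAPTIVE HASH INDEX:
--     18617.95 hash searches/s, 5.00 non-hash searches/s
-- under BUFFER POOL AND MEMORY:
--     Buffer pool hit rate 1000 / 1000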

 

select sum(unit) from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
where purchaseDate >= '2001-12-30';

 

+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref         | rows     | Extra       |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null        | 1882540  | Using where |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId | 1        | Using index |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+

 

Using the normalized data model, where the product group names are in a separate table, the extra join slows down performance, but only by a bit.  This SQL runs in about 1.4 seconds.  

 

select sum(unit) from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-30';

 

The plan looks similar to the previous one, with the addition of one more join. 

 

+----+-------------+-------+----------+----------+---------+------------------+----------+-------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra       |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 1882540  | Using where |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | Using index |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        | Using index |
+----+-------------+-------+----------+----------+---------+------------------+----------+-------------+

 

Ok, the denormalized Product table is larger than the normalized ProductSmall table, but it carries the productGroupName from the ProductGroup table, removing one of the joins and making the below query a bit faster at 1.1 seconds.  Nothing is substantially different here, but when everything can fit into memory, as this data does, the denormalized data model performs better. 

 

select sum(unit) from Sale s
  join Product p
    on s.productId = p.productId
where purchaseDate >= '2001-12-30';

 

+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref         | rows     | Extra       |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null        | 1882540  | Using where |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId | 1        | Using index |
+----+-------------+-------+----------+----------+---------+-------------+----------+-------------+

 

Now to run the same series of tests, but grouping on productGroupName, just to verify the above pattern holds.  In summary, it does. 

Ok, now that we know the simple cost of a join: grouping by a value on the joined-to table does slow down the SQL, but again, this is reasonably fast at 2.2 seconds. 

 

select productGroupName, sum(unit)
  from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-30'
group by productGroupName

 

+----+-------------+-------+----------+----------+---------+------------------+----------+--------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra              |
+----+-------------+-------+----------+----------+---------+------------------+----------+--------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 1882540  | Using filesort ... |
+----+-------------+-------+----------+----------+---------+------------------+----------+--------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        | Using index        |
+----+-------------+-------+----------+----------+---------+------------------+----------+--------------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        | Using index        |
+----+-------------+-------+----------+----------+---------+------------------+----------+--------------------+

 

The denormalized data model is faster again, but not by much, at 1.8 seconds. 

 

select productGroupName, sum(unit)
  from Sale s
  join Product p
    on s.productId = p.productId
where purchaseDate >= '2001-12-30'
group by productGroupName

 

+----+-------------+-------+----------+----------+---------+-------------+----------+--------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref         | rows     | Extra              |
+----+-------------+-------+----------+----------+---------+-------------+----------+--------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null        | 1882540  | Using filesort ... |
+----+-------------+-------+----------+----------+---------+-------------+----------+--------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId | 1        | Using index        |
+----+-------------+-------+----------+----------+---------+-------------+----------+--------------------+

 

Ok, what happens when the same SQL is run against 10,182,566 rows?  If cached, the simple range scan query takes about 4 seconds.

 

select sum(unit)
  from Sale s
where purchaseDate >= '2001-12-01'

 

+----+-------------+-------+-------+----------+---------+------+----------+-------------+
| id | select_type | TABLE | type  | KEY      | key_len | ref  | rows     | Extra       |
+----+-------------+-------+-------+----------+---------+------+----------+-------------+
|  1 | SIMPLE      | s     | range | PRIMARY  | 8       | null | 30931362 | USING WHERE |
+----+-------------+-------+-------+----------+---------+------+----------+-------------+

 

Now, what about a join to the denormalized product table?  This takes 25 seconds, which shows the cost of a table join and a group by over a sequential table scan.  But as all the data can fit into memory the execution times are still reasonably fast. 

 

select productGroupName, sum(unit)
  from Sale s
  join Product p
    on s.productId = p.productId
where purchaseDate >= '2001-12-01'
group by productGroupName

 

+----+-------------+-------+----------+----------+---------+-------------+----------+--------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref         | rows     | Extra              |
+----+-------------+-------+----------+----------+---------+-------------+----------+--------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null        | 30931362 | Using filesort ... |
+----+-------------+-------+----------+----------+---------+-------------+----------+--------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId | 1        |                    |
+----+-------------+-------+----------+----------+---------+-------------+----------+--------------------+

 

Once again, the normalized version below is slower than the denormalized version above, running in 33 seconds.

 

select productGroupName, sum(unit)
  from Sale s
  join ProductSmall p
    on s.productId = p.productSmallId
  join ProductGroup pg
    on p.productGroupId = pg.productGroupId
where purchaseDate >= '2001-12-01'
group by productGroupName

 

+----+-------------+-------+----------+----------+---------+------------------+----------+--------------------+
| id | select_type | TABLE | type     | KEY      | key_len | ref              | rows     | Extra              |
+----+-------------+-------+----------+----------+---------+------------------+----------+--------------------+
|  1 | SIMPLE      | s     | range    | PRIMARY  | 8       | null             | 30931362 | Using filesort ... |
+----+-------------+-------+----------+----------+---------+------------------+----------+--------------------+
|  1 | SIMPLE      | p     | eq_ref   | PRIMARY  | 4       | s.productId      | 1        |                    |
+----+-------------+-------+----------+----------+---------+------------------+----------+--------------------+
|  1 | SIMPLE      | pg    | eq_ref   | PRIMARY  | 4       | p.productGroupId | 1        |                    |
+----+-------------+-------+----------+----------+---------+------------------+----------+--------------------+

 

So far, the case for a denormalized data model, such as a star schema, isn't that strong.  Yes, the performance is better, but only by about 20-30 percent.

But this series of articles isn't finished yet.   In the next blog I’ll show what happens when both the normalized and the denormalized data model struggle to fit into memory.