Wednesday, November 28, 2007

Joins are slow, memory is fast

I've been working with various databases for a number of years. In that time I've found there is plenty of documentation on various features but not much on how database modeling impacts performance. Some people say one should use a fully normalized data model with a few denormalizations that are required for performance reasons, but those denormalizations are never discussed. Ralph Kimball has many articles and a book concentrating on physical data modeling, but they are more of the "here is how you solve the problem" variety and don't detail why his method works better.

I've always found this odd, as the physical data model has a major impact on database performance. I want this blog to focus more on the whys behind various physical data modeling options, with some examples showing the magnitude of the performance differences.

The first goal of this blog (over a few articles) will be to show cases where a denormalized dimensional data model substantially improves performance, and cases where a normalized model optimized for updates is appropriate. The server I have available to do this isn't the best, so the largest table is only going to be 120 million rows and 30 gig on disk, but I believe the principles shown will scale to larger data volumes on more capable servers.

At first, at least, I'm going to demonstrate this with MySQL using the InnoDB storage engine, as I'm currently using it at work and I need to learn more about it. I'm more comfortable with Oracle and MS SQL Server, so if I get something wrong with MySQL, or it isn't true in all cases (and what is with a database?), or I'm just being an idiot, point it out.

The Linux machine I ran these tests on has an innodb_buffer_pool_size of 2048M and innodb_flush_method=O_DIRECT to avoid double caching by the OS. It doesn't have a great disk subsystem, with only 4 disks in a striped, mirrored configuration.
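For reference, the relevant part of the server configuration looks something like this (the exact my.cnf layout and the rest of the settings will vary; these are just the two values mentioned above):

```
[mysqld]
# 2G buffer pool so the dimension tables can be cached in memory
innodb_buffer_pool_size = 2048M
# bypass the OS page cache so data is cached only once, in the buffer pool
innodb_flush_method = O_DIRECT
```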

Enough of the preliminaries, here are the tables. This is the small customer dimension of 1 million rows and 60 meg of data that easily fits into memory.

CREATE TABLE Customer (
  customerId int NOT NULL default '0',
  customerName varchar(32) NOT NULL,
  postalCode varchar(7) NOT NULL,
  PRIMARY KEY (customerId)
) ENGINE=InnoDB;

The product dimension is larger, with 10 million rows and 1043 meg of data. It will fit into memory, but it takes half of the available memory.

CREATE TABLE Product (
  productId int NOT NULL,
  productName varchar(32) default NULL,
  productGroupId int default NULL,
  productGroupName varchar(32) default NULL,
  PRIMARY KEY (productId)
) ENGINE=InnoDB;

Finally, the large sale fact table, with 120 million rows and about 28 gig of data. It will never fit into memory. Note the trick of creating the clustered index (primary key) on the date so the common date-based queries can quickly do a range scan on the clustered index. There is a severe downside to this, though: since all the secondary indexes point to the clustered index key, they will all be larger and will therefore be slower to insert into and read. When to use this trick is a topic in and of itself, so enough of this for now. The dates range from 2001-01-01 to 2001-12-31, with the data being randomly distributed over the year.

CREATE TABLE Sale (
  orderId int NOT NULL,
  customerId int(11) NOT NULL,
  productId int(11) NOT NULL,
  productBigId int(11) NOT NULL,
  unit int(11) NOT NULL,
  purchaseAmount decimal(16,2) NOT NULL,
  purchaseCost decimal(16,2) NOT NULL,
  purchaseDate datetime NOT NULL,
  PRIMARY KEY idx_sale_dateOrder (purchaseDate,orderId),
  UNIQUE KEY pk_sale_order (orderId),
  KEY idx_sale_product (productId),
  KEY idx_sale_customer (customerId),
  CONSTRAINT pf_sale_product FOREIGN KEY (productId)
    REFERENCES Product (productId),
  CONSTRAINT pf_sale_customer FOREIGN KEY (customerId)
    REFERENCES Customer (customerId)
) ENGINE=InnoDB;

So, my testing technique is to run a query a number of times and throw out the first execution. This isn't entirely realistic, but the alternative is to flush all the data and have no caching at all. As reporting performance is heavily dependent on memory caching, I prefer a testing methodology that takes caching into account, even if it is overstated.
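One way to capture these timings directly from the mysql client (assuming MySQL 5.0.37 or later, where session query profiling is available) is something like:

```sql
-- enable per-query profiling for this session
SET profiling = 1;
-- run the query several times; the first run warms the cache
SELECT SUM(unit) FROM Sale WHERE purchaseDate >= '2001-12-01';
SELECT SUM(unit) FROM Sale WHERE purchaseDate >= '2001-12-01';
SELECT SUM(unit) FROM Sale WHERE purchaseDate >= '2001-12-01';
-- list the duration of each execution; throw out the first one
SHOW PROFILES;
```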

OK, the month of December comprises a bit more than 10 million rows. Stripping out all the keys that aren't used, one month of this table fits into memory, which shows up in the 3 to 4 second execution times of this SQL after the first execution.

select sum(unit) from Sale
where purchaseDate >= '2001-12-01'

And the execution plan shows that the primary key is being used to do a range scan.

1, 'SIMPLE', 'Sale', 'range', 'PRIMARY', 'PRIMARY', '8', '', 30931362, 'Using where; Using index'

Adding a simple join to the small customer table results in an execution time of 11 to 12 seconds. So even when everything fits into memory, an in-memory join still makes the query run about 3 times slower in this case. As expected, the execution plan shows the sale table primary key being used for a range scan, with the join to the customer table using its primary key for an index lookup. As a note, MySQL needs to do a better job on table analysis, as I didn't get a good plan on this simple query without a hint.

select straight_join sum(unit) from Sale s
join Customer c
on s.customerId = c.customerId
where s.purchaseDate >= '2001-12-01'

And here is the plan from the hinted query above, which drives from the sale table and looks up the related customer information.

1, 'SIMPLE', 's', 'range', 'PRIMARY', 'PRIMARY', '8', '', 30931362, 'Using where'
1, 'SIMPLE', 'c', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'perfbig.s.customerId', 1, 'Using index'

In the next case the sale table is joined to the much larger product table. One month of the sale table and the product table can't both fit into memory; instead of fast memory access we are dealing with slow disk reads and query times of around 417 to 525 seconds.

I was surprised by how much slower this is than the equivalent customer query. I'm also wondering about the strength of the simple InnoDB least recently used buffer pool algorithm at this point. As expected with the slow execution time, iostat shows heavy disk utilization.
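Besides iostat, the buffer pool miss rate can be checked from within MySQL itself: if Innodb_buffer_pool_reads (reads that went to disk) climbs quickly relative to Innodb_buffer_pool_read_requests (logical reads), the working set isn't fitting in memory.

```sql
-- logical reads vs. reads that had to go to disk
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```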

select count(*) from Sale s
join Product p
on s.productId = p.productId
where purchaseDate >= '2001-12-01';

The plan is reasonable as it drives from the sale table and joins to the product table.

1, 'SIMPLE', 's', 'range', 'PRIMARY,pf_sale_product', 'PRIMARY', '8', '', 30931362, 'Using where'
1, 'SIMPLE', 'p', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'perfbig.s.productId', 1, 'Using index'

Instead of using a full month of data, how does querying just two days of the sale table behave? Well, it executes in 0.3 seconds. Obviously, everything fits into memory at this point.

select sum(unit) from Sale s
where purchaseDate >= '2001-12-30';

The plan is good, with a range scan of the sale table.

1, 'SIMPLE', 's', 'range', 'PRIMARY', 'PRIMARY', '8', '', 1882540, 'Using where'

How about adding a simple join to the customer table? That results in an execution time of one second, so even small in-memory joins impact execution times.

select sum(unit) from Sale s
join Customer c
on s.customerId = c.customerId
where purchaseDate >= '2001-12-30';

The plan is good.

1, 'SIMPLE', 's', 'range', 'PRIMARY,pf_sale_customer', 'PRIMARY', '8', '', 1882540, 'Using where'
1, 'SIMPLE', 'c', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'perfbig.s.customerId', 1, 'Using index'

What is interesting is that the join to the much larger product table takes about the same amount of time as the join to the customer table, one second. I'm going to guess this is because everything can still fit into memory. This is verified as iostat does not show any disk I/O.

select sum(unit) from Sale s
join Product p
on s.productId = p.productId
where purchaseDate >= '2001-12-30';

And the product plan looks the same as the join to the customer table.

1, 'SIMPLE', 's', 'range', 'PRIMARY,pf_sale_product', 'PRIMARY', '8', '', 1882540, 'Using where'
1, 'SIMPLE', 'p', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'perfbig.s.productId', 1, 'Using index'

So, one of the insights of the dimensional model is that joins are expensive, and these tests demonstrate it. The size of a dimension doesn't impact performance if the dimension can fit into memory, but a large dimension that doesn't fit into memory has a severe impact on performance.

Others, looking at the dimensional model, will say normalization is best as it minimizes the amount of storage a table consumes, and smaller tables mean more of the data fits into memory, resulting in fewer of those expensive physical reads. So which is more important, minimizing the number of joins or minimizing the size of tables? That will be covered in another (hopefully shorter) entry.
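To make the trade-off concrete, here is a sketch of what a fully denormalized version of the fact table might look like (a hypothetical table, not one used in these tests): the joins disappear, but every row carries the dimension attributes, so the rows are much wider.

```sql
-- hypothetical denormalized fact table: no joins needed at query time,
-- but each row repeats the customer and product attributes
CREATE TABLE SaleDenorm (
  orderId int NOT NULL,
  customerName varchar(32) NOT NULL,
  postalCode varchar(7) NOT NULL,
  productName varchar(32) NOT NULL,
  productGroupName varchar(32) NOT NULL,
  unit int NOT NULL,
  purchaseAmount decimal(16,2) NOT NULL,
  purchaseCost decimal(16,2) NOT NULL,
  purchaseDate datetime NOT NULL,
  PRIMARY KEY (purchaseDate, orderId)
) ENGINE=InnoDB;
```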

1 comment:

Karen Lopez said...

Excellent post. I look forward to similar ones. As you have stated, there are few resources with this level of detail and explanation out there.