Wednesday, July 23, 2008

SAAS Multi-tenant Databases

There are some good high level discussions on the various database architectures for implementing software as a service.  I’m on going to duplicate that.  Instead, this article is going to highlight some of the issues and tradeoffs when multiple tenants (customers of the software service provider) are put into one database.   The core of the problem is having multiple tenants in one set of database tables can make queries that select data for only one customer much slower. 

First, I’m testing on a new system, one with 32 gig of memory.  Currently,  the innodb_buffer_pool_size set to 28 gig.   I’m using the same size tables as previously, both Sale and SaleTenant have 120 million rows and are about 27 and 28 gig respectively.  This means that the tables fit into memory, so the following tests do not have an IO component.  Not perfect, but it shouldn't make a difference in the conclusions for these tests.  

For the other two tables, there are 100 tenants, and 1 million customers. 

The ddl looks like this (yeah, key and constraint names don’t make sense).   First, the non-multi-tenant table. 

 

CREATE TABLE  `test`.`Sale` (

  `orderId` int(11) NOT NULL,

  `customerId` int(11) NOT NULL,

  `productId` int(11) NOT NULL,

  `productBigId` int(11) NOT NULL,

  `unit` int(11) NOT NULL,

  `purchaseAmount` decimal(16,2) NOT NULL,

  `purchaseCost` decimal(16,2) NOT NULL,

  `purchaseDate` datetime NOT NULL,

  PRIMARY KEY  (`purchaseDate`,`orderId`),

  UNIQUE KEY `idx_sale_order` (`orderId`),

  KEY `pf_sale_product` (`productId`),

  KEY `pf_sale_customer` (`customerId`),

  CONSTRAINT `pf_sale_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),

  CONSTRAINT `pf_sale_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`)

) ENGINE=InnoDB

 

Then the table with a tenantId in the table. 

 

CREATE TABLE  `test`.`SaleTenant` (

  `orderId` int(11) NOT NULL,

  `tenantId` int(11) NOT NULL,

  `customerId` int(11) NOT NULL,

  `productId` int(11) NOT NULL,

  `productBigId` int(11) NOT NULL,

  `unit` int(11) NOT NULL,

  `purchaseAmount` decimal(16,2) NOT NULL,

  `purchaseCost` decimal(16,2) NOT NULL,

  `purchaseDate` datetime NOT NULL,

  PRIMARY KEY  (`tenantId`,`purchaseDate`,`orderId`),

  UNIQUE KEY `idx_saletenant_order` (`orderId`),

  KEY `pf_saletenant_product` (`productId`),

  KEY `pf_saletenant_customer` (`customerId`),

  CONSTRAINT `pf_saletenant_customer` FOREIGN KEY (`customerId`) REFERENCES `Customer` (`customerId`),

  CONSTRAINT `pf_saletenant_product` FOREIGN KEY (`productId`) REFERENCES `Product` (`productId`),

  CONSTRAINT `pf_saletenant_tentant` FOREIGN KEY (`tenantId`) REFERENCES `Tenant` (`tenantId`)

) ENGINE=InnoDB

 

Now, the tenant table. 

 

CREATE TABLE  `test`.`Tenant` (

  `tenantId` int(11) NOT NULL,

  `tenantName` varchar(32) NOT NULL,

  PRIMARY KEY  (`tenantId`)

) ENGINE=InnoDB DEFAULT

 

And a customer table with a tenantId column. 

 

CREATE TABLE  `test`.`Customer` (

  `customerId` int(11) NOT NULL default '0',

  `customerName` varchar(32) NOT NULL,

  `tenantId` int(11) NOT NULL default '0',

  PRIMARY KEY  (`customerId`),

  KEY `pf_customer_tenant` (`tenantId`),

  CONSTRAINT `pf_customer_tenant` FOREIGN KEY (`tenantId`) REFERENCES `Tenant` (`tenantId`)

) ENGINE=InnoDB

 

#1) Against a table that is clustered by the TenantId, this query takes 0.5 seconds to run.  As it selects about 1/100 of the Sale Tenant database, the performance isn’t bad.

 

select sum(unit) from SaleTenant  where TenantId = 77;

 

And the query plan shows that the custered index was used to scan on the relevant rows. 

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: SaleTenant

         type: ref

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 4

          ref: const

         rows: 2708904

        Extra:

#2) The query against the Sale table without the tenantId is about 5 times slow, at 2.5 seconds.

 

select sum(unit)

  from Customer c

  join Sale s

    on c.customerId = s.customerId

where c.TenantId = 77

 

The plan shows that the database drives off the customer table and joins to the sales table on the customerId column. 

 

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: c

         type: ref

possible_keys: PRIMARY,pf_customer_tenant

          key: pf_customer_tenant

      key_len: 4

          ref: const

         rows: 5040

        Extra: Using index

*************************** 2. row ***************************

           id: 1

  select_type: SIMPLE

        table: s

         type: ref

possible_keys: pf_sale_customer

          key: pf_sale_customer

      key_len: 4

          ref: test.c.customerId

         rows: 57

        Extra:

 

3) Further qualifying by date shows where adding the tentantId to the SaleTenant id column further improves performance.   This sql runs in about 0.08 seconds, and is much faster than the query #1, which doesn't have the more selective date criteria. 

 

select sum(unit) from SaleTenant

where TenantId = 77

   and purchaseDate >= '2001-06-01'

   and purchaseDate < '2001-07-01'

The query plan shows that the clustered index of (tenantId, pruchaseDate, orderId) allowed the database to quickly find the relevant rows and  sequentially the relevant rows. 

 

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: SaleTenant

         type: range

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 12

          ref: NULL

         rows: 150491

        Extra: Using where

 

4) As the tenant criteria drives off of the customer table, and the date criteria drives off of the sales table, and the optimizer needs to start with one of these two tables, this sql can’t be as efficient as the query #3.  In this case, the optimizer again (like query #2) decides to drive off the customer table.  Unlike query #3, the extra date criteria doesn’t improve performance as it must join to each Sale row to determine if the date is valid, and in this case, only about 1/12 of the rows are relevant.  This means 11/12 of the joins are, in effect, tossed away.   Thus, it has to do the same amount of work as query #2 and runs in about the same time as #2 2.5 seconds, or about 31 times slower than query #3. 

 

select sum(unit)

  from Customer c

  join Sale s

    on c.customerId = s.customerId

where c.TenantId = 77

   and purchaseDate >= '2001-06-01'

   and purchaseDate < '2001-07-01'

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: c

         type: ref

possible_keys: PRIMARY,pf_customer_tenant

          key: pf_customer_tenant

      key_len: 4

          ref: const

         rows: 5040

        Extra: Using index

*************************** 2. row ***************************

           id: 1

  select_type: SIMPLE

        table: s

         type: ref

possible_keys: PRIMARY,pf_sale_customer

          key: pf_sale_customer

      key_len: 4

          ref: test.c.customerId

         rows: 57

        Extra: Using where

 

So, obviously there are cases where adding a tenantId will make the queries run much faster.  The issue with adding a tenantId to all the tables and clustering on that column is queries that are not limited to a tenant (or set of tenants) will be slower in some cases.  The next two sql statements will show such a case. 

5) In this sql a simple sum is run against the Sale table for a month, summing about 1/12 of the table.  As the table is clustered by date the query is fast and runs in about 4.4 seconds.

 

select sum(unit)

  from Sale s

where purchaseDate >= '2001-06-01'

   and purchaseDate < '2001-07-01'

 

The query plan shows the clustered index being used to scan the table. 

 

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: Sale

         type: range

possible_keys: PRIMARY

          key: PRIMARY

      key_len: 8

          ref: NULL

         rows: 26733636

        Extra: Using where

 

The equivalent query against the SaleTenant table takes 39 seconds to run.  In this case, the entire table must be scanned as there is no date oriented index.  

 

select sum(unit) from SaleTenant

where purchaseDate >= '2001-06-01'

  and purchaseDate < '2001-07-01'

The plan show the full table scan. 

 

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: SaleTenant

         type: ALL

possible_keys: NULL

          key: NULL

      key_len: NULL

          ref: NULL

         rows: 123303794

        Extra: Using where

Assuming this cross-tenant access pattern is frequent enough to be a performance problem, a potential solution to this problem is to add a covering index.  However, this might not be an issue as the only organization likely to execute such a query would be the hosting company. 

Given enough tenants in a database, adding a tenantId to a mutli-tenant database is recommended as some queries are going to be at least an order of magnitude slower if you don’t.   As denormalizations go, this is a reasonable one as how often would the tenantId (owner) of a row change?  Just be aware that queries that don’t include a tenantId in the criteria can be much slower.  

I didn’t cover this, but partitioning by tenantId also makes sense and can make moving customers from one sharded database to another much easier.  Perhaps more about that later.