Multitenancy: juggling customer data in Django

Filipe Ximenes
September 13, 2017
<p>Suppose you want to build a new SaaS (Software as a Service) application. Suppose your application will store sensitive data from your customers. What is the best way to guarantee the isolation of the data and make sure information from one client does not leak to the other?<br>The answer to that is: it depends. It depends on the number of customers you are planning to have. It depends on the size of your company. It depends on the technical skills of the developers working on the platform. And it depends on how sensitive the data is. In this article, I'll go over some of the architectures you can adopt in your application to tackle this problem and how to apply them in Django. My goal here is to give you an overview so you can decide for yourself what is best for your project. Expect to see some code samples, they are meant to give you a better understanding of how things work.</p><p><strong>If you already know what multitenancy is and why do we need it, feel free to jump to the "Our Application" section</strong></p><p>First of all, let's agree that we are looking to build an application that is capable of attending multiple clients from a single version of the codebase. The reason for that is that we don't want to be setting up a new server for each client that signs up. With that in mind, we can finally start talking about <code>multitenancy</code> (or multi tenancy if you prefer that).<br>First things first. Here is the Wikipedia definition for it:</p><blockquote>"The term "software multitenancy" refers to a software architecture in which a single instance of software runs on a server and serves multiple tenants."</blockquote><p>In the context of this article, every time we say "tenant" we mean "customer". Cool? Cool! Alright, so multitenancy is a very broad concept and it can be observed in many levels of the infrastructure modern applications rely on. Here are some examples of multitenant systems we deal with on the daily basis:</p><ul><li>Amazon AWS. AWS is an IaaS (Infrastructure as a Service). They provide one infrastructure that allows many companies (tenants) to host their own isolated infrastructure.</li><li>Heroku. Heroku operates much like AWS but from the PaaS (Platform as a Service) level. Again many companies (tenants) can run their own applications isolated from one another in Heroku's platform.</li><li>Basecamp. Basecamp (which is a SaaS) is an application that allows many companies to manage their day to day, each in their own isolated context.</li></ul><p>As you've guessed by now, we are going to focus on SaaS scope. But why do we want to build multitenant applications in the first place? The main reason is to benefit from sharing resources amongst our clients. Not everyone is using the software at the same time. So many clients fit into the same infrastructure where only one would exist otherwise. But there are some other very important reasons. Imagine the situation you are going to setup a new infrastructure with a server and a database for each client. Even if you try to keep the architecture and code base the same for everyone, it's going be a maintenance hell. A simple change in the software means multiple deploys. It might work for 10 or 20 clients, but it quickly becomes pretty much undoable.</p><p>There are 3 main approaches to multitenancy in this context. There's no established convention for naming them, but for this article, we are going to use the names: Single Shared Schema, Multiple Databases, Multiple Schemas. </p><p>Buckle up, here we go...</p><h2 id="our-application">Our Application</h2><p>I'm not sure how familiar you are with the enterprise software market, but you would be surprised to know there's an increasing demand for Fidget Spinner software. Because of this, we will be using an Enterprise Fidget Spinner Tracker application in our code samples. Here is the database schema of the application:</p><figure class="kg-card kg-image-card"><img src="https://vinta-cms.s3.amazonaws.com/media/filer_public/3a/71/3a71cd24-06fa-411e-afd1-8dadd3e71929/schema.jpg" class="kg-image" alt="Application Schema"></figure><h2 id="single-shared-schema">Single Shared Schema</h2><p>This is the approach used by most of the big companies. The reason for that is because it's the most scalable approach. Each of your clients will be sharing every piece of your infrastructure. There's a single database and a single server application.</p><figure class="kg-card kg-image-card"><img src="https://vinta-cms.s3.amazonaws.com/media/filer_public/75/be/75be1699-69b1-4a7a-9335-3c1317da5443/one-server-one-database.jpg" class="kg-image" alt="One Server, One Database"></figure><p>That scalability does not come without some compromises. The biggest problem here is assuring data isolation. Because tables contain data from all clients, you rely on manually filtering the correct information on queries. You can learn how Salesforce does it in <a href="https://www.youtube.com/watch?v=jrKA3cJmoms">this talk</a>. The short version is that they have internal tools that assist developers doing the job. The other thing is that they automatically double check every single query that goes to the database. Please keep in mind that Salesforce has plenty of engineers to take care of building those assisting tools. So if isolation is important to you and you are not Salesforce, this might not be the best approach.</p><p>Enough talking, let's see how to do this in Django. The idea is to add a <code>tenant</code> attribute to the <code>request</code> parameter of our views. The way to do this in Django is through a middleware. Historically middlewares in Django are implemented via a class, but since <code>1.10</code> we have function based middlewares. That's the style we are going to use. Here is the code, read through comments to follow each step:</p><pre><code>def tenant_middleware(get_response): def middleware(request): # we are going to use subdomains to identify customers. # the first step is to extract the identifier from the url host = request.get_host() # here is the full url, something like this: 'https://ibm.spinnertracking.com' host = host.split(':')[1] # we remove the protocol part: 'ibm.spinnertracking.com' subdomain = host.split('.')[0] # and now we get the subdomain:'ibm' # now is just a matter of using the subdomain to find the # corresponding Customer in our database try: customer = Customer.objects.get(name=subdomain) except Customer.DoesNotExist: customer = None # and it to the request request.tenant = customer # all done, the view will receive a request with a tenant attribute return get_response(request) return middleware</code></pre><p>All done, add the middleware in your settings:</p><pre><code>MIDDLEWARE_CLASSES = [ 'my_project.middlewares.tenant_middleware', ... ]</code></pre><p>And finally, you will be able to access it in your views to display things accordingly. In this example we are getting the average duration scoped by customer:</p><pre><code>def my_view(request): avg_duration = ( Spin.objects .filter(user_spinner__user__customer=request.customer) .aggregate(avg=Avg('duration')))['avg'] return render(request, 'show_average.html', {'avg_duration': avg_duration})</code></pre><p>Neat, huh? But there's a way to improve this. A common technique to avoid all that nesting in the query is to decorate every model with the customer it belongs to.</p><figure class="kg-card kg-image-card"><img src="https://vinta-cms.s3.amazonaws.com/media/filer_public/90/ee/90ee0683-c524-48ba-93be-6ce31da1ee3b/schema2.jpg" class="kg-image" alt="Customer annotation in every table"></figure><p>This will denormalize your database but it's, in general, a good thing. First of all, this is a field that will never change so you shouldn't get in too much trouble. It will also help you simplifying queries and performing sanity checks on the data you are working with. Here is the same query we did before using the new field:</p><pre><code>avg_duration = ( Spin.objects .filter(customer=request.customer) .aggregate(avg=Avg('duration')))['avg']</code></pre><p>Cool! Just a little note for you to keep in mind when doing this: you might have some trouble if you try to use a third party lib that introduces models to your application but does not give you access to customize them. Also, in case you are wondering, this is a similar to the Django sites framework approach. You should definitely take a <a href="https://docs.djangoproject.com/en/1.11/ref/contrib/sites/">look at it</a>.</p><p>There's a lot more you can do to improve your work when dealing with a shared database, but this should be enough to get you understanding the basics of how to do it. You should also check the <a href="https://github.com/hugobessa/django-shared-schema-tenants">django-shared-schema-tenants</a> lib. It's a WIP from <a href="https://www.vinta.com.br/blog/author/hugobessa/">Hugo</a>. Not yet ready for production but you can keep an eye on it and help him with some ideas.</p><p>Now let's try juggling with multiple databases:</p><h2 id="multiple-databases">Multiple Databases</h2><p>We’ll now move to the extreme opposite of single shared schema. Each tenant will now have its own database instance.</p><figure class="kg-card kg-image-card"><img src="https://vinta-cms.s3.amazonaws.com/media/filer_public/64/9e/649e72cc-cff3-4447-97d2-5a3257a935c0/one-server-multiple-databases.jpg" class="kg-image" alt="Mutliple Databases"></figure><p>Yes, I know it looks bad. But there are some very good features on this approach that you should consider. The first thing is isolation. Not only logical but you can actually do physical data isolation. Each database can be on a different hardware and it will be quite hard for someone to mistakenly show the wrong data to a client. Also, you can tune each instance for the requirements of each customer. It also means you are probably going to spend more money on infrastructure since there's less resource sharing. Alright, let's make this work in Django.</p><p>You've probably noticed that the Django <code>DATABASES</code> is a dictionary. Although we normally only set a <code>default</code> key in it, it's possible to have multiple database entries. Every time we instantiate a new client we are going to add a new entry there:</p><pre><code>DATABASES = { 'default': { 'ENGINE': ..., 'NAME': ..., }, 'ibm': { 'ENGINE': ..., 'NAME': ..., } }</code></pre><p>We can now use the same middleware we defined in the last approach. This will give us access to the customer object in our views. And we can use the <code>using</code> method from the queryset to select the desired database.</p><pre><code>spinners = ( Spinner.objects .using(request.customer.name) .annotate( avg_duration=Avg('owned_spinners__spins__duration')) .order_by('-avg_duration'))</code></pre><p>There's also a <code>db_manager</code> for creating objects, check the <a href="https://docs.djangoproject.com/en/1.11/topics/db/multi-db/#using-managers-with-multiple-databases">documentation</a> for more on that.</p><p>This is all nice and should work alright. One thing you might not enjoy as much is the requirement of writing <code>using</code> in every query. There's a way to get rid of this, but to do it you will need to use threadlocals. If you don't know what that is I advise that there's no consensus on the use of it. You will find people strongly advising it as a bad practice and saying that you should not use in your code. For educational purposes, I'm going to show how it's done but if someone asks, you didn't hear it from me.</p><h3 id="the-threadlocal-approach">The threadlocal approach</h3><p>I've prepared an <a href="https://github.com/filipeximenes/multitenancy">example app</a> that uses the threadlocal approach. If you want to understand more about how threadlocal works you can read the <a href="https://docs.python.org/3/library/threading.html#thread-local-data">python docs</a>. In the following examples, you will see the use of a <code>@thread_local</code> decorator <a href="https://github.com/filipeximenes/multitenancy/blob/master/multitenancy/threadlocal.py">click here</a> to see the source code for it.</p><p>The first thing we will need is a middleware that is very similar to the ones we did so far:</p><pre><code>def multidb_middleware(get_response): def middleware(request): subdomain = get_subdomain(request) customer = get_customer(subdomain) request.customer = customer @thread_local(using_db=customer.name) def execute_request(request): return get_response(request) response = execute_request(request) return response return middleware</code></pre><p>The only real difference is that this time we are setting <code>using_db</code> as a variable to the current thread using the decorator I mentioned before. Django allows us to define a custom database router. This is very useful for things like separate read and write databases. But this time we are going to use it to select the customer database. We will use the <code>using_db</code> variable we set in the middleware:</p><pre><code>class TenantRouter(object): def db_for_read(self, model, **hints): return get_thread_local('using_db', 'default') def db_for_write(self, model, **hints): return get_thread_local('using_db', 'default') # …</code></pre><p>Last thing, change the settings to use the custom router:</p><pre><code>DATABASE_ROUTERS = ['multitenancy.routers.TenantRouter']</code></pre><p>Database routing is abstracted from querying:</p><pre><code>spinners = ( Spinner.objects .annotate( avg_duration=Avg('owned_spinners__spins__duration')) .order_by('-avg_duration'))</code></pre><p><em>Voilà</em>, we have a per-client database application with all the complexity hidden from the MVC logic. Now, if this is a good thing or not, I’ll leave it for you to decide.</p><h3 id="is-this-multitenancy">Is this multitenancy?</h3><p>A thing some people might be questioning is whether having multiple databases is actually multitenancy or not. It depends on how you see it. From the database point of view, it's not. Each database instance is serving a single client. But from the application point of view, it is. We have a single instance of our application code serving multiple clients.</p><h2 id="multiple-schemas">Multiple Schemas</h2><figure class="kg-card kg-image-card"><img src="https://vinta-cms.s3.amazonaws.com/media/filer_public/1e/62/1e629f22-8cf4-4a9f-8843-1e3287e3c087/multiple-schemas.jpg" class="kg-image" alt="Multiple Schemas"></figure><p>But, what are schemas in the first place? The first thing to notice is that in the context of this blog post, every time we talk about schemas we are referring to <a href="https://www.postgresql.org/docs/9.1/static/ddl-schemas.html">PostgreSQL schemas</a>. That said, I can now safely state that if you’ve ever used PostgreSQL you have already used schemas. Schemas are simply scoped tables in your database that can be accessed through a namespace. If you don't specify a namespace, PostgreSQL assumes the <code>public</code> namespace. Simple enough. Now, how do we create a new schema? Easy:</p><pre><code>CREATE SCHEMA ibm; </code></pre><p>Just like the <code>public</code> schema, created schemas can have any number of tables. Each schema may or may not have tables with the same name. The important thing to understand is that regardless, tables in different schemas are disjoint. If you want to add a new field to tables in different schemas, you will need to run two different commands. So how do we query a table that is not in the <code>public</code> schema?</p><pre><code>SELECT id, name FROM ibm.user WHERE ibm.user.name LIKE 'F%'</code></pre><p>Makes sense, right? But there's a better way. PostgreSQL provides a <code>search_path</code> that you can use to scope your queries without the need to keep repeating yourself.</p><pre><code>SET search_path TO ibm; SELECT id, name FROM user WHERE user.name LIKE 'F%';</code></pre><p>Good, that's most of what you need to know to understand how schemas work. From there you could use the same middleware techniques we showed before to automate the use of schemas. But fear not, someone already did that that and it's a stable open source project!</p><p>Enters <code>django-tenant-schemas</code>.</p><p><a href="https://github.com/bernardopires/django-tenant-schemas">Django-tenant-schemas</a> will do most of the trick of managing schemas for you. And it does that using many of the tricks we talked before. Here is a block of code from its middleware:</p><pre><code> ... connection.set_schema_to_public() hostname = self.hostname_from_request(request) TenantModel = get_tenant_model() try: tenant = self.get_tenant(TenantModel, hostname, request) assert isinstance(tenant, TenantModel) except TenantModel.DoesNotExist: # ... request.tenant = tenant connection.set_tenant(request.tenant) ...</code></pre><p>Similar, isn't it? There's an extra interesting bit. Django-tenant-schemas also overwrites the default PostgreSQL database backend. There, it sets the search path just before executing queries. Here is the code:</p><pre><code> ... try: cursor_for_search_path.execute( 'SET search_path = {0}'.format(','.join(search_paths))) except (django.db.utils.DatabaseError, psycopg2.InternalError): self.search_path_set = False else: self.search_path_set = True if name: cursor_for_search_path.close() ...</code></pre><p>Apart from managing schemas for CRUD operations, Django-tenant-schemas also provides some very useful commands such as a custom <code>python manage.py createsuperuser</code> that will ask you for a schema name, a <code>python manage.py migrate_schemas</code> command to automatically run migrations in each of your customer schemas and a <code>python manage.py tenant_command &lt;manage command here&gt;</code> that will give multitenant support to any naive command.</p><h3 id="querying-across-schemas">Querying across schemas</h3><p>Since schema tables are disjoint, you might be wondering how can you aggregate data from multiple clients. This is something that is pretty easy to do in the single shared schema approach. The answer to that is doing schema <code>UNION</code>s.</p><pre><code>SELECT id, duration FROM ibm.spinner_spin WHERE duration &gt; 120 UNION SELECT id, duration FROM vinta.spinner_spin WHERE duration &gt; 120;</code></pre><p>But how about ID's? When you make unions you will end up with rows from different schemas that have the same ID. A solution to that is to use <code>UUID</code>s.</p><pre><code>SELECT uuid, duration FROM ibm.spinner_spin WHERE duration &gt; 120 UNION SELECT uuid, duration FROM vinta.spinner_spin WHERE duration &gt; 120;</code></pre><p>Now each row has an attribute that can be used to make complex queries and uniquely identify them in the database.</p><p>Nice, that should do it for now. I hope you’ve got a good overview of how multitenancy works and that you are now able to choose the approach that best suits your next project! Please let me know in the comments If I missed something. Feedbacks are also greatly appreciated!</p><p><strong>Looking for more?</strong><br><a href="https://www.vinta.com.br/blog/2017/advanced-django-querying-sorting-events-date/"><strong>Advanced Django querying: sorting events by date</strong></a><br><a href="https://www.vinta.com.br/blog/2017/case-against-over-engineering/"><strong>The case against over-engineering</strong></a></p>