In 2013, about a year after we joined Facebook, 200 million people were using Instagram every month and we were storing 20 billion photos. With no slowdown in sight, we began "Instagration," our move from AWS servers to Facebook's infrastructure. Two years later, Instagram has grown into a community of 400 million monthly active users with 40 billion photos and videos, serving over a million requests per second. To keep supporting this growth, and to make sure the community has a reliable experience on Instagram, we decided to scale our infrastructure geographically. In this post, we'll talk about why we expanded our infrastructure from one data center to three, and some of the technical challenges we met along the way.
Mike Krieger, Instagram's co-founder and CTO, recently wrote a post that included a story about the time in 2012 when a huge storm in Virginia brought down nearly half of our instances. The small team spent the next 36 hours rebuilding almost all of our infrastructure, an experience they never wanted to repeat. Natural disasters like this have the potential to temporarily or permanently damage a data center, and we need to make sure we can sustain such a loss with minimal impact on the user experience.
Scaling out geographically has other motivations as well, including more flexible capacity planning and acquisition.
So how did we start spreading things out? First let’s take a look at Instagram’s overall infrastructure stack.
The key to expanding to multiple data centers is to distinguish global data from local data. Global data needs to be replicated across data centers, while local data can be different for each region (for example, the async jobs created by a web server are only consumed in that region).
The next consideration is hardware resources. These can be roughly divided into three types: storage, computing and caching.
Instagram mainly uses two backend database systems: PostgreSQL and Cassandra. Both have mature replication frameworks that work well as a global data store.
Global data maps neatly to the data stored in these servers. The goal is eventual consistency of this data across data centers, tolerating some replication delay. Because there are vastly more read than write operations, having a read replica in each region avoids cross-data-center reads from web servers.
Writing to PostgreSQL, however, still goes across data centers, because writes always go to the primary.
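A minimal sketch of this routing rule, with hypothetical host and region names (not Instagram's actual configuration): reads go to the local region's replica, while writes always cross to the single primary.

```python
# Illustrative read/write routing: all names below are made up.
REGION = "us-east"  # hypothetical region of this web server

DB_HOSTS = {
    "primary": "pg-primary.us-east.example.com",      # all writes go here
    "replicas": {
        "us-east": "pg-replica.us-east.example.com",  # local reads
        "us-west": "pg-replica.us-west.example.com",
    },
}

def pick_host(is_write: bool, region: str = REGION) -> str:
    """Writes always go to the primary; reads stay in-region."""
    if is_write:
        return DB_HOSTS["primary"]
    return DB_HOSTS["replicas"][region]
```

A web server in us-west would then read locally via `pick_host(False, "us-west")`, but its writes would still incur the cross-country hop to the primary.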
Web servers and async servers are both easily distributed, stateless computing resources that only need to access data locally. Web servers create async jobs that are queued by async message brokers and then consumed by async servers, all in the same region.
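As a toy sketch of that pipeline, with an in-process queue standing in for a regional message broker (names and the job itself are illustrative, not Instagram's actual stack):

```python
import queue
import threading

# A per-region broker: jobs created by web servers in this region are
# consumed by async workers in the same region, never across regions.
regional_broker = queue.Queue()
results = []

def handle_web_request(photo_id):
    # The web server does the fast, user-facing work, then defers the rest.
    regional_broker.put(("resize_photo", photo_id))

def async_worker():
    # The worker drains the local broker and performs the deferred work.
    job, photo_id = regional_broker.get()
    results.append((job, photo_id))  # stand-in for actually doing the job
    regional_broker.task_done()

handle_web_request(42)
worker = threading.Thread(target=async_worker)
worker.start()
worker.join()
print(results)  # [('resize_photo', 42)]
```

Because neither the broker nor the worker is shared across regions, this tier needs no cross-data-center coordination at all.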
The cache layer is the web servers' most frequently accessed tier, and it needs to be collocated within a data center to avoid user request latency. This means that updates to the cache in one data center are not reflected in another data center, which creates a challenge for moving to multiple data centers.
Imagine a user commented on your newly posted photo. In the one data center case, the web server that served the request can just update the cache with the new comment. A follower will see the new comment from the same cache.
In the multi data center scenario, however, if the commenter and the follower are served in different regions, the follower’s regional cache will not be updated and the user will not see the comment.
Our solution is to use PgQ, a queuing mechanism built on top of PostgreSQL, and extend it to insert cache invalidation events into the databases that are being modified.
On the primary side: the web server inserts a cache invalidation event into a queue in the same database it is modifying, so the event is committed and replicated along with the data change itself.
On the replica side: a cache invalidation worker in each region consumes the replicated events and invalidates the corresponding keys in that region's cache.
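The flow can be sketched with plain dicts standing in for the database, the replicated event queue, and each region's cache (everything here is illustrative, not Instagram's actual code):

```python
# Toy model of PgQ-style cache invalidation across two regions.
primary_db = {}
event_queue = []  # replicated alongside the data, like PgQ events
cache = {"us-east": {}, "us-west": {}}

def web_write(key, value, region):
    # Primary side: the data change and the invalidation event are
    # written together, so the event replicates with the data.
    primary_db[key] = value
    event_queue.append(("invalidate", key))
    cache[region][key] = value  # the local cache can be updated directly

def cache_invalidator(region):
    # Replica side: a worker drains replicated events and invalidates the
    # regional cache, forcing the next read to go to the local replica DB.
    for _event, key in event_queue:
        cache[region].pop(key, None)

# A follower in us-west has a stale copy of the comment list.
cache["us-west"]["post:1:comments"] = ["old"]
web_write("post:1:comments", ["old", "nice photo!"], region="us-east")
cache_invalidator("us-west")
print("post:1:comments" in cache["us-west"])  # False: stale entry purged
```

The key property is that the invalidation event rides the same replication stream as the data, so a region never invalidates before the corresponding data has arrived at its replica.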
This solves the cache consistency issue. On the other hand, compared with the one-region case, where Django servers directly update the cache without re-reading from the database, this creates increased read load on the databases. To mitigate this problem, we took two approaches: 1) reduce the computational resources needed for each read by denormalizing counters; 2) reduce the number of reads by using cache leases.
The most commonly cached keys are counters. For example, we would use a counter to determine the number of people who liked a specific post from Justin Bieber. When there was just one region, we would update the memcache counters by incrementing from web servers, therefore avoiding a “select count(*)” call to the database, which would take hundreds of milliseconds.
But with two regions and PgQ invalidation, each new like creates a cache invalidation event for the counter. This creates a lot of "select count(*)" queries, especially on hot objects.
To reduce the resources needed for each of these operations, we denormalized the counter for likes on the post. Whenever a new like comes in, the count is incremented in the database, so each read of the count is just a simple "select", which is a lot more efficient.
There is an added benefit to denormalizing the counter in the same database where the likes themselves are stored: both updates can be included in one transaction, making them atomic and consistent at all times. Before the change, the counter in cache could become inconsistent with what was stored in the database because of timeouts, retries, and so on.
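As a sketch of the pattern (using SQLite and made-up table names, not Instagram's actual schema), the like row and the counter bump commit in one transaction, so the count can never drift from the likes table:

```python
import sqlite3

# Illustrative denormalized-counter schema; names are hypothetical.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE likes (post_id INTEGER, user_id INTEGER);
    CREATE TABLE post_counters (post_id INTEGER PRIMARY KEY,
                                like_count INTEGER);
    INSERT INTO post_counters VALUES (1, 0);
""")

def add_like(post_id, user_id):
    with db:  # both statements commit (or roll back) atomically
        db.execute("INSERT INTO likes VALUES (?, ?)", (post_id, user_id))
        db.execute(
            "UPDATE post_counters SET like_count = like_count + 1"
            " WHERE post_id = ?",
            (post_id,),
        )

add_like(1, 7)
add_like(1, 8)
# Reading the count is now a cheap point lookup, not a COUNT(*) scan.
count = db.execute(
    "SELECT like_count FROM post_counters WHERE post_id = 1"
).fetchone()[0]
print(count)  # 2
```

If either statement fails, the transaction rolls back as a whole, which is exactly the atomicity the cache-incremented counter could not guarantee.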
In the above example of a new post from Justin Bieber, both views of the post and likes on it spike during its first few minutes. With each new like, the counter is deleted from cache, so it is common for multiple web servers to try to retrieve the same counter and all get a cache miss. If they all go to the database server to retrieve it, it creates a thundering herd problem.
We used the memcache lease mechanism to solve this problem. It works like this: on a cache miss, the web server issues a "lease-get" instead of a plain get. The first caller is granted a lease token and goes to the database to fetch the value; subsequent callers are told that a fill is already in flight, and they briefly wait or use a slightly stale value instead of hitting the database. Once the leaseholder writes the fresh value back with a "lease-set", the waiting servers are served from cache.
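A minimal single-process sketch of the lease idea, using a condition variable in place of memcached's server-side lease tokens (all names and numbers are illustrative):

```python
import threading
import time

# On a miss, only the first caller gets the "fill" lease and queries the
# database; everyone else waits for the fill instead of stampeding.
cache, lease = {}, {}
lock = threading.Condition()
db_reads = []

def slow_db_count(key):
    time.sleep(0.05)  # stand-in for an expensive SELECT
    db_reads.append(key)
    return 1_000_000

def lease_get(key):
    with lock:
        while True:
            if key in cache:
                return cache[key]
            if key not in lease:   # first miss: take the lease
                lease[key] = True
                break
            lock.wait()            # someone else is filling; wait
    value = slow_db_count(key)     # only the leaseholder reaches here
    with lock:                     # "lease-set": fill and wake waiters
        cache[key] = value
        del lease[key]
        lock.notify_all()
    return value

threads = [threading.Thread(target=lease_get, args=("likes:jb:1",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(db_reads))  # 1: only the leaseholder hit the database
```

Eight concurrent misses collapse into a single database read; real memcached leases achieve the same collapse across many web servers rather than threads.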
In summary, with both of the above implementations, we can mitigate the increased database load by reducing the number of accesses to the database, as well as the resources required for each access.
It also improved the reliability of our backend in cases where hot counters fall out of cache, which wasn't an infrequent occurrence in the early days of Instagram. Each of these occurrences caused some hurried work by an engineer to manually fix the cache. With these changes, those incidents are just memories for the old-timer engineers.
So far, we have focused mostly on cache consistency when caches become regional. Network latency between data centers across the continent was another challenge that impacted multiple designs. Between data centers, a 60ms network latency can cause problems in database replication as well as web servers' updates to the database. We needed to solve the following problems in order to support a seamless expansion:
PostgreSQL Read Replicas Can't Catch Up
As a Postgres primary takes in writes, it generates delta logs. The faster the writes come in, the more frequently these logs are generated. The primary itself keeps the most recent log files for occasional needs from the replicas, but it archives all the logs to storage so that they remain accessible to any replica that needs data older than what the primary has retained. This way, the primary does not run out of disk space.
When we build a new read replica, it starts by copying a snapshot of the database from the primary. Once that's done, it needs to apply the logs generated since the snapshot. When all the logs are applied, it is up-to-date and can stream from the primary and serve reads to web servers.
However, when a large database's write rate is quite high, and there is a lot of network latency between the replica and the log archive, it is possible that the logs are read more slowly than they are created. The replica falls further and further behind and never catches up!
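The catch-up race comes down to simple arithmetic: the backlog only drains when the replica applies logs faster than the primary generates them. A toy model, with made-up rates:

```python
# Back-of-the-envelope model of replica catch-up. All rates are invented
# for illustration; units are GB and GB/s of log volume.
def catchup_seconds(backlog_gb, gen_rate_gbps, apply_rate_gbps):
    """Seconds until the replica is caught up, or None if it never is."""
    drain = apply_rate_gbps - gen_rate_gbps
    if drain <= 0:
        return None  # backlog grows forever: the replica can't catch up
    return backlog_gb / drain

# Reading logs from remote storage over a slow, high-latency link:
print(catchup_seconds(100, gen_rate_gbps=0.5, apply_rate_gbps=0.4))  # None
# Applying logs faster than they are generated (e.g. from local disk):
print(catchup_seconds(100, gen_rate_gbps=0.5, apply_rate_gbps=2.0))  # ~66.7
```

The fix therefore has to raise the effective apply rate above the generation rate; no amount of waiting helps otherwise.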
Our fix was to start a second streamer on the new read replica as soon as it begins transferring the base snapshot from the primary. The streamer receives logs and stores them on local disk; once the snapshot transfer finishes, the read replica can apply the logs locally, making recovery much faster.
This not only solved our database replication issues across the US, but also cut down the time it took to build a new replica by half. Now, even if the primary and replica are in the same region, operational efficiency is drastically increased.
Instagram is now running in multiple data centers across the US, giving us more flexible capacity planning and acquisition, higher reliability, and better preparedness for the kind of natural disaster that happened in 2012. In fact, we recently survived a staged "disaster." Facebook regularly tests its data centers by shutting them down during peak hours. About a month ago, right as we had finished migrating our data to a new data center, Facebook ran a test and took it down. It was a high-stakes simulation, but fortunately we survived the loss of capacity without users noticing. Instagration part 2 was a success!