Flexible Feature Control at Instagram

Instagram Engineering
8 min read · Sep 30, 2015


At Instagram, we tune and release features every day. Depending on the feature itself, we often need gradual rollouts and conditional feature control. Say we want to target a certain age cohort in a particular group of countries, and we want to slowly roll out to 30% of that group over the course of a week to test things out. Engineers can hardcode these “conditions” and wait for deployment, but this has several drawbacks:

  • Not real-time. Deployment takes time. Although at Instagram we have systems to execute smooth and fast rollouts, it still takes a while, not to mention that you have to queue behind other developers’ code changes.
  • Not safe. Writing check conditions might seem easy, but it’s also easy to make mistakes, and if you do, there can be consequences (like leaking your secret features or failing user requests). And, per the first point, there’s no instant fix — you have to wait for another rollout to revert it.

Previously, we used a thin abstraction over these requirements that we called “IG Gatekeeper” (not to be confused with Facebook’s Gatekeeper). It basically allowed us to define a whitelist and a release percentage for the feature. It’s super simple and handy, but as we’ve grown, we’ve identified more capabilities we need, like custom geo-targeting. Unfortunately, Gatekeeper wasn’t designed for this and wasn’t easily extensible. We’ve since embarked on several efforts to build an extensible gating system, and in this post we’re going to talk about Gate Logic — our latest gating system, which currently manages hundreds of features at Instagram.

Gate Logic

When thinking about a gating system, there are several things we need to consider:

  • Reactivity. We don’t want to wait minutes or hours for a change to take effect. After we decide to save a change, it should be in production within seconds. This also means we should be able to revert changes quickly.
  • Safety. It’s going to be a core part of the whole tech stack, so it has to be very safe. It can neither leak features nor break things.
  • Efficiency. It’s going to be used everywhere across the codebase, so we need to make it very fast so that it runs unnoticeably.
  • Flexibility. It should be able to handle future requirements without having to change fundamental parts of the system.

Before Gate Logic, engineers would write custom if conditions in their code. Gate Logic moves this testing logic out into an external system, which gives us easy tuning, efficiency, and safety. Imagine that instead of writing things like:

if get_country_for_user(user) in {'CA', 'US'} and calc_percentage_for_user(user) < 50:
    # Do something

With Gate Logic, the Python side simplifies down to:

if gate_check('my_awesome_feature'):
    # Do something

This piece of checking logic is a gate.

You might be thinking, “How are we supposed to ensure safety if we allow people to write anything in the definition of the logic and send it to production in seconds?” Well, we’re not doing that. To keep maximum control over the system, we require people to write their logic in a DSL (Domain Specific Language). The language looks very similar to the host language (Python), but we added some extra sauce:

  1. Flexibility: A custom language can be as expressive as we want it to be.
  2. Safety: Python isn’t type safe, so we do type checking on the logic before you can save it.
  3. Efficiency: To make it at least as fast as writing the logic in Python, we compile the logic into Python bytecode.

To sum it up, we’re embedding a compiled and typed language in Python.

Logical Language

We want a simple typed language that lets engineers express their rollout conditions. One simple and natural starting point is the boolean logical expression, things like (A AND B) OR (C AND D). It’s easy to implement (code-interview-question level), but how do we add type support and enough flexibility? If we view the expression as a tree and attach type information to every node, the root, like the result of every logical operator, is Boolean. Considering the example above, if we obtain the country of the user and compare it with a fixed set of strings, we’re doing X in Y. It’s exactly the same pattern, except the operands and the operator are different: rather than boolean variables and logical operators, we have real values and other kinds of operators.

This gives us a hint on our road to an expression-based language. We attach type information to each node and operand, and we enforce typing rules on each of the operators. For example, the operator in is a binary operator which takes operands of type T and Set[T]. We have a rule-based type checking system, which is much simpler than the type system in a real strongly typed programming language, but it’s suitable for our case. Here’s a real example to show what the language can give us:

(NOT (user in $blacklist)) 
AND (((app.os = $droid) AND (app.version >= $droid_version))
OR ((app.os = $ios) AND (app.version >= $ios_version)))
AND (request.country in $dogfooding_countries)

It’s worth noting here that while we do allow writing constants at the language level, it’s forbidden on the UX level. All constants must be moved to a section called Parameters so that the code is cleaner and more readable. Non-engineers like PMs and sales folks, who will sometimes tune the parameters or add someone to a whitelist, also use the system, so our UI must also be friendly to them.
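To make the typing rules above concrete, here is a minimal sketch of what a rule-based check for a binary operator could look like. The names and type tags below are made up for illustration; the real checker walks the whole typed syntax tree, but the per-operator idea is the same.

# Hypothetical sketch of rule-based type checking for gate operators.
BOOL, STR, INT = 'Bool', 'Str', 'Int'

def check_in(left, right):
    # `x in y` requires y to be Set[T] where T is the type of x.
    if right == ('Set', left):
        return BOOL
    raise TypeError('`in` expects T and Set[T], got %s and %s' % (left, right))

def check_and(left, right):
    # Logical operators only combine booleans and yield a boolean.
    if left == BOOL and right == BOOL:
        return BOOL
    raise TypeError('`AND` expects Bool operands')

TYPE_RULES = {'in': check_in, 'AND': check_and}

def check_binop(op, left_type, right_type):
    return TYPE_RULES[op](left_type, right_type)

# check_binop('in', STR, ('Set', STR)) -> 'Bool'
# check_binop('in', STR, ('Set', INT)) raises before the gate can be saved.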

@ Between Gates

We have a special gate which checks whether the current user is eligible for our internal dogfooding. This logic is used during almost all new feature rollouts, so to save the trouble of writing it out every time (it’s pretty complicated), we added a reference functionality to the language. By saying @internal_dogfooding, you check the gate called internal_dogfooding and drop its result in where you wrote the reference. Sound familiar? Yes, function calls! If each gate is a function, you’re calling the other gate/function and embedding its result in your own result. This is where “composability” kicks in. When designing a system, we should always try to find the minimal set of elements that can be reused and composed together to form a much bigger system.
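As a rough sketch (with made-up names, not the real Gate Logic internals), an @reference can be thought of as a lookup into a registry of already-compiled gate functions, with the referenced gate’s boolean result dropped into the referring gate’s expression:

# Illustrative sketch of gate references as function composition.
compiled_gates = {}  # gate name -> compiled check function

def register_gate(name, func):
    compiled_gates[name] = func

def reference(name, user):
    # What an @gate_name reference lowers to in the generated code.
    return compiled_gates[name](user)

register_gate('internal_dogfooding', lambda user: user.is_employee)
register_gate('my_awesome_feature',
              lambda user: reference('internal_dogfooding', user) and user.id % 100 < 30)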

Implementation

Where does the data go?

Distributed systems are hard. We have thousands of web servers running in production, and our goal is to keep them in sync within seconds of when the data is updated. We can’t afford to save to the database and make all the servers query it every time, and the database doesn’t really have things like listeners to watch for data changes. Fortunately, we’re only distributing a small amount of data, so we can use ZooKeeper. But remember, we also need to keep the histories for fast rollbacks. So the data is actually stored in two places: all the histories go into databases and are retained there forever, while only the latest versions of the gates are distributed to production servers using ZooKeeper. We also have a system that projects the database table into a ZooKeeper object automatically, so we don’t need to update ZooKeeper by hand. When an engineer updates one of the gates, a new revision is created in the database and projected to ZooKeeper at the same time, and ZooKeeper distributes the data to all the production servers in seconds.
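As a minimal sketch of the server side of this pipeline (the ZooKeeper path, hosts, payload format, and the kazoo client here are illustrative assumptions, not our exact setup), each web server can register a data watch so that its in-memory gate definitions refresh within seconds of a change:

import json
from kazoo.client import KazooClient

gates = {}  # gate name -> latest gate definition, shared by request handlers

zk = KazooClient(hosts='zk1:2181,zk2:2181')
zk.start()

@zk.DataWatch('/gatelogic/gates')
def on_gates_changed(data, stat):
    # Fires once at registration and again whenever the znode is updated,
    # i.e. whenever a new gate revision is projected from the database.
    if data is not None:
        gates.update(json.loads(data.decode('utf-8')))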

Under the Hood

We have the language, and now it’s time to compile it. We implemented a four-stage, textbook-style compiler — Tokenizer, Parser, Type Checker, Code Generation — where every part is a little bit simpler than in a usual programming language. In the parser, instead of using some fancy parsing algorithm like a Look-Ahead, Left-to-right, Rightmost-derivation (LALR) parser (whose name is also fancy), since our grammar is unambiguous, we implement a simple recursive function which looks for parenthesized groups and recursively parses sub-expressions. It outputs a syntax tree with only five kinds of syntax nodes. In the code-gen phase, we translate the typed syntax tree checked by the type checker into a Python AST, and then call the magic function compile to convert it into native Python bytecode.
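For a flavor of the code-gen step, here is a hand-rolled sketch in modern Python 3 (the ast node names have changed since this post was written, and this is an illustration rather than our generator) that builds the Python AST for user.id in {1, 2, 3} and hands it to the built-in compile:

import ast

tree = ast.Expression(
    body=ast.Compare(
        left=ast.Attribute(value=ast.Name(id='user', ctx=ast.Load()),
                           attr='id', ctx=ast.Load()),
        ops=[ast.In()],
        comparators=[ast.Set(elts=[ast.Constant(1), ast.Constant(2), ast.Constant(3)])],
    ),
)
ast.fix_missing_locations(tree)  # fill in the line/column info compile() requires
code = compile(tree, '<gate:my_awesome_feature>', 'eval')
# `code` is an ordinary Python code object; eval(code, {'user': user}) runs it natively.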

Optimization

Everything looks like it’s going well now. We just convert a custom expression into a Python expression, evaluate it, and put the result in the if statement at runtime. What could go wrong? The answer is a lot!

Eval is Evil, Let’s Exec

Even though we have compiled bytecode, an easy mistake is to call eval at runtime to run it. Here’s an example with the logic (user in $whitelist) AND (user.percentage < 10) and a whitelist of three users. You might generate the equivalent of the following Python code:

(user.id in {1, 2, 3}) and gatelogic.lang.Percentage.from_user(user) < 10

We’ve all been taught that eval is evil, and in this case, it’s evil because it’s very slow. An easy way to see the problem: what’s the difference between calling eval and calling a real function at runtime? Well, a function object holds a code object in memory, and the Python interpreter just runs the code object directly, instruction by instruction. When evaling a piece of code, we first need to translate the code string/bytecode into a code object. Since our checking logic is usually simple, this translation time is usually much longer than the running time of the actual logic.

I’ll give you a second to think about how to get over this :-)

In fact, we can remove this overhead by noticing the difference between run-once and run-always. Ideally, the translation from bytecode string to Python code object is only needed once, but we’re doing it on every check. So instead of generating bytecode to eval, we generate a function definition. Now the output code looks like:

def my_awesome_feature__gate_func(user):
    return (user.id in {1, 2, 3}) and gatelogic.lang.Percentage.from_user(user) < 10

At runtime, we evaluate/execute this function definition once and call the generated function every time after that. eval in Python only works on expressions, so we have to use exec here. This optimization allows us to run the code 3x faster on average.
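Here is a condensed sketch of that run-once/run-always split, using only the whitelist half of the example so it stays self-contained (the percentage helper is left out): exec the generated definition once at load time, hold on to the function object, and just call it on the request path.

source = (
    "def my_awesome_feature__gate_func(user):\n"
    "    return user.id in {1, 2, 3}\n"
)
namespace = {}
exec(compile(source, '<gate:my_awesome_feature>', 'exec'), namespace)  # translate once
gate_func = namespace['my_awesome_feature__gate_func']

# Per request: an ordinary function call, no string-to-code translation.
# if gate_func(user): ...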

Inlining

Inlining is one of the most useful optimization techniques in compilers. In our case, a lot of our operators have corresponding native Python operators. We used to call functions in the operator module for simplicity, but function calls in Python are also pretty slow. The ideal solution is to generate native Python operators when possible, which gives us real native performance plus benefits like short-circuiting when evaluating boolean expressions.
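A tiny illustration of the difference (a sketch of the idea, not our actual generated code): the first form pays a Python function call for every operator, while the second compiles down to the interpreter’s native comparison and jump instructions and can short-circuit the right-hand side.

import operator

country = 'US'
allowed = {'CA', 'US'}

# Un-inlined: every operator is a call into the operator module.
slow = operator.and_(operator.contains(allowed, country), operator.lt(3, 10))

# Inlined: native bytecode, and the right side isn't evaluated when the left is False.
fast = (country in allowed) and (3 < 10)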

When it Comes to a Closure

We’ve noticed that when we have whitelists with more than a couple hundred people and we generate code like the example above, we’re essentially rebuilding that big set on every check. The set is always the same, so we can build it once and reuse the object from then on. Another place where the redundant work is less obvious, but is a real performance killer, is gatelogic.lang.Percentage.from_user. This involves multiple LOAD_ATTR instructions in the bytecode, and it’s really unnecessary. We can factor both out by pulling them up into a higher-order function. Then all this set-building and attribute-loading work becomes a single LOAD_DEREF instruction. This gives us a 2x boost in general, and more than a 10x boost for the big whitelists. We’re trading a bit of memory to save running time:

def my_awesome_feature__gate_func():
    set1 = {1, 2, 3}
    func1 = gatelogic.lang.Percentage.from_user
    def inner(user):
        return (user.id in set1) and func1(user) < 10
    return inner
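Wiring this up follows the same flow as before; as a sketch, assuming factory_source holds the generated definition above, we exec it once, call the outer function once to build set1 and bind func1, and keep only the inner function for the request path:

namespace = {}
exec(compile(factory_source, '<gate:my_awesome_feature>', 'exec'), namespace)  # factory_source: the generated text above
check_my_awesome_feature = namespace['my_awesome_feature__gate_func']()  # builds set1 and binds func1 once

# Hot path: a single LOAD_DEREF for set1/func1 plus the actual checks.
# if check_my_awesome_feature(user): ...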

Wrapping it up

In this post, we covered the basic design of Gate Logic, our feature gating system, and how we implement and optimize it. With Gate Logic, we’re able to control feature rollouts in a safe and flexible way, with no performance loss compared to hard-coded Python, and it has handled the rollout of many new Instagram products: Search and Explore, the Web Redesign, and more. Moving forward, we’ll be working on extending the system to handle all kinds of requests from our engineers, integrating with various Facebook experimentation tools, and building APIs for the clients. I hope you found this interesting, and see you next time!

Chenyang is a software engineer at Instagram.
