Assembly transformations against side channels

Have you ever wondered why hardware security isn’t a solved problem? It’s been six years since Spectre and Meltdown sent tech companies scrambling to fix the vulnerabilities that hardware manufacturers created with their own optimizations. Everyone involved (chip manufacturers, big tech, the security community) seems to agree that hardware security bugs are Bad and To Be Avoided. So if everyone believes that these vulnerabilities shouldn’t happen, why do they keep happening?

Here’s a hint: It’s not just because eliminating security flaws in hardware is difficult, though it is. It’s also because chip-makers have a competing incentive. Moore’s law is slowing down, and performance gains are getting harder to find. That means companies like Intel and ARM are exploring ever stranger and more complicated optimizations, even when the optimizations in question might create security problems in the bargain.

In this post, I’ll discuss some of the vulnerabilities that could be introduced if certain optimizations are implemented in hardware, and how we can use assembly-level transforms to mitigate those vulnerabilities in software. Since this is a programming languages blog, I’ll start with a brief introduction to side channels for any readers who may be unfamiliar.

What is a side channel?

A side channel is essentially a way of learning information about data or a program via means that fall outside the expected bounds of input and output. Usually, the information learned in this way is a secret—for example, a password or a cryptographic key—and can be used to enable further malicious attacks, like listening in on encrypted communications.

To see a side channel in action, consider this program that checks a password against user input with a naive loop:

bool is_correct_password(char* input, int inputlen,
                         char* password, int passwordlen) { 
    if (inputlen != passwordlen)
        return false;

    int i = 0;
    while (i < passwordlen) {
        if (input[i] != password[i])
            return false;
        i++;
    }

    return true;
}

Looks reasonable, right?

There’s just one problem: The amount of time it takes to check the user input scales directly with how close that input is to the real password. To see why that matters, let’s look at how an attacker might try to break the password.

A clever attacker might notice that when they try to check a password, the program doesn’t always take the same amount of time. It always fails, but sometimes it fails a little slower—in fact, this always occurs when the attacker’s input has the same length as the real password. The attacker can use this information to deduce the length of the true password, and use a similar pattern to learn each individual character of the password. In very little time, the attacker will have cracked the password, at which point they can log in and wreak all sorts of mayhem.

This example describes what is known as a timing channel, where the attacker relies on differences in execution times to learn secret information. There are other types of side channels as well, such as those that measure power usage. But for the purposes of this post, we’ll be sticking to the world of timing channels.
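Before moving on, it’s worth seeing how this particular leak is typically closed in software: compare every byte unconditionally and accumulate the differences, so the loop takes the same time no matter where the first mismatch occurs. Here’s a sketch using the standard XOR-accumulate idiom (the function name is mine, not from any particular library):

```c
#include <stdbool.h>
#include <stddef.h>

// Constant-time variant of the password check: instead of returning at
// the first mismatching byte, XOR every pair of bytes and OR the results
// together. The loop always runs passwordlen iterations, so its duration
// does not depend on how close the input is to the password.
bool is_correct_password_ct(const char *input, size_t inputlen,
                            const char *password, size_t passwordlen) {
    if (inputlen != passwordlen)
        return false;

    unsigned char diff = 0;
    for (size_t i = 0; i < passwordlen; i++)
        diff |= (unsigned char)(input[i] ^ password[i]);

    // diff is zero if and only if every byte matched.
    return diff == 0;
}
```

Note that the length check still returns early, so the input’s length remains observable; real libraries usually sidestep that by comparing fixed-length values (such as password hashes) rather than raw strings.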

Novel hardware optimizations mean novel side channels

Researchers have shown that many proposed microarchitectural optimizations have the potential to create new side channels, which could in turn leak secret data during cryptographic operations. Some side channels caused by novel optimizations have already appeared, and we will very likely see more as new optimizations are implemented in years to come.

Some of these optimizations fall into the category of instruction-centric optimizations, which optimize instructions at runtime based on special-case interactions between instructions and other architectural features. Some instruction-centric optimizations rely on hardware state, such as the contents of memory, while others trigger when certain instructions receive operands with special values. When the operands that trigger these optimizations contain secret data, the optimization creates a timing channel based on that secret, which attackers can potentially exploit.

To see this in action, let’s talk about computation simplification (CS for short). CS is a type of instruction-centric optimization that optimizes arithmetic operations—think addition or multiplication, as well as bitwise operations like AND and OR—when those operations have a trivial result. But what does that actually mean?

Imagine you need to multiply two numbers, $A$ and $B$. If $A$ and $B$ are large enough, and you don’t happen to have memorized that exact product, you’ll probably use a calculator (or pen and paper) to do the arithmetic. But if $B$ is actually 1, then you don’t need to do anything at all, because the product is just $A$. Performing the multiplication has become trivial.

Computation simplification lets a microarchitecture take that same type of shortcut, the way a human might. Instead of needlessly multiplying the registers eax and ebx when ebx is 1, CS allows the microarchitecture to skip straight to the result of eax.

There are tons of trivial cases like this across common arithmetic and bitwise instructions, and optimizing those operations with CS could be great for performance gains. But those same optimizations let attackers detect when operands have special values, such as ebx being 1. If ebx happens to contain part of a cryptographic key or other secret data, the resulting timing channel would create a serious security vulnerability.
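To make “trivial cases” concrete, here’s a sketch of the kinds of operand patterns a CS unit might check for. These predicates are purely illustrative—they aren’t any vendor’s actual trigger conditions—but they show why the leak exists: whether the shortcut fires depends entirely on the operand’s value.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical trigger conditions for computation simplification on a
// 32-bit second operand b. In each trivial case, the result of the
// operation is knowable without performing it.
bool mul_is_trivial(uint32_t b) { return b == 0 || b == 1; }  // a*0 = 0, a*1 = a
bool and_is_trivial(uint32_t b) { return b == 0 || b == 0xFFFFFFFFu; }  // a&0 = 0, a&~0 = a
bool or_is_trivial(uint32_t b)  { return b == 0 || b == 0xFFFFFFFFu; }  // a|0 = a, a|~0 = ~0
bool sub_is_trivial(uint32_t b) { return b == 0; }  // a-0 = a
```

If b holds secret data, an attacker who can time the instruction learns, in effect, the output of one of these predicates—which is exactly the kind of partial information that chips away at a cryptographic key.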

So what do we do about this? Surely I’m not going to write half a blog post about this problem and not offer any solutions, right?

Well, there are some options. The best, most targeted and efficient one is for chip designers to include what are known as Data-Independent Timing (DIT) bits. These bits exist on Intel and ARM processors and can be used to turn potentially dangerous optimizations on or off, depending on whether the user wants to prioritize security or performance.

However, there are caveats. The exact set of optimizations covered by DIT bits is up to the company that built the hardware. That set doesn’t always cover everything that cryptographic code needs to avoid, either because the hardware vendor didn’t see fit to include some optimizations, or because they haven’t yet built or deployed microarchitectural patches that would cover those optimizations. And even if one or more hardware bits do disable all the necessary optimizations, it’s not always simple for cryptographic libraries to turn those bits on! Some of them may need to be enabled by the operating system, which requires buy-in from OS developers.

All of this leaves a need for some form of stopgap: a solution that can be quickly and easily applied at the software level when more efficient hardware solutions are delayed, unavailable, or inaccessible.

Mitigating side channels with assembly transformations

Enter instruction-centric transforms. These are assembly-level transformations that take a single leaky instruction—that is, an instruction that could be optimized with computation simplification or another instruction-centric optimization—and transform it by replacing it with a sequence of different instructions.

This new replacement sequence—the “transform”—is carefully crafted to do the exact same thing as the old instruction, without being leaky. So if the original instruction added two 32-bit numbers from registers, the new sequence will also add two 32-bit numbers from those same registers. We refer to this as semantic equivalence. The only difference is that where the old instruction could have leaked sensitive data, the transformed instruction sequence cannot, even on a microarchitecture with the relevant optimization(s) enabled.

In practice, instruction-centric transforms look like this:

   ...                   ...
1  sub ecx, eax   -----> mov r11, rax
2                    |   mov eax, eax
3                    |   sub rax, 0x80000000
4                    |   sub rax, 0x80000000
5                    |   sub rcx, rax
6                    |   mov ecx, ecx
7                    --> mov rax, r11
   ...                   ...

Let’s take it step by step.

First, notice that the instruction we want to transform is a 32-bit subtraction: sub ecx, eax. We’re using Intel syntax and x86 assembly semantics, so what that instruction means is that we want to take the 32 bits in the eax register; subtract them from the 32 bits in the ecx register; and store the result in ecx.

In the context of computation simplification, subtraction operations like this one can be optimized when the second operand—here, that’s eax—is zero. For this example, we’ll assume that the eax register contains a cryptographic secret. That makes this instruction leaky, because an attacker could detect whether the secret in eax is zero by the presence or absence of a CS optimization. Our goal, then, is to make such an optimization impossible, so that there is no timing difference for the attacker.

We can do that by changing the operands to the subtraction such that the second operand can never be zero, even if the secret in eax is zero. However, we still have to reach the correct result at the end of our transform, with eax subtracted from ecx. So how is that possible?

The trick is to realize that our current operands, ecx and eax, are only 32 bits wide. That means that there are a whole extra 32 bits available to us in the full 64-bit registers corresponding to those operands: rcx and rax. If we expand the subtraction instruction to operate over all 64 bits, then we can use those extra 32 bits to make sure the second operand, now rax, is nonzero, without ever changing the value of the original secret. And that’s all we need!

You can see this in action above. Line 1 saves rax into a scratch register, r11, so that we don’t lose track of rax’s original value. Line 2 uses a trick of x86 semantics—writing to a 32-bit register sets the upper half of the corresponding 64-bit register to all zeros—to remove any garbage data initially in the upper bits of rax. Lines 3 and 4 set the upper 32 bits of rax to all ones, without modifying the lower bits that contain the original secret. Line 5 then executes the actual subtraction, using the full 64-bit registers as operands. Finally, line 6 uses the same trick as line 2 to clear the data our transformation added to the upper bits of rcx, and line 7 restores the original value of rax. With these last two steps, we end with the same register state as we would have after the original 32-bit subtraction, all without ever permitting a CS optimization on our secret.
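To convince ourselves the transform really is semantically equivalent, we can model the register file in C and compare both paths. This is a hand-built sketch, not actual assembly: 64-bit registers become uint64_t values, each 32-bit register write is modeled as a zero-extending cast, and the two immediates are modeled as subtracting the 64-bit value 0x80000000 twice.

```c
#include <stdint.h>

// Model of the transformed sequence from the figure. Takes the initial
// values of rcx and rax; returns the final value of rcx.
uint64_t transformed_sub(uint64_t rcx, uint64_t rax) {
    uint64_t r11 = rax;              // line 1: save rax in a scratch register
    rax = (uint32_t)rax;             // line 2: mov eax, eax zeroes the upper 32 bits
    rax -= 0x80000000ull;            // line 3: subtract 2^31 ...
    rax -= 0x80000000ull;            // line 4: ... twice; under this model the upper
                                     //         32 bits wrap to all ones, so rax is
                                     //         now nonzero even when the secret is 0
    rcx -= rax;                      // line 5: the full 64-bit subtraction
    rcx = (uint32_t)rcx;             // line 6: mov ecx, ecx clears the upper bits of rcx
    rax = r11;                       // line 7: restore rax's original value
    (void)rax;                       // (rax's final value isn't returned here)
    return rcx;
}

// Reference semantics of the original instruction: sub ecx, eax, where a
// 32-bit destination write zero-extends into the full 64-bit register.
uint64_t plain_sub(uint64_t rcx, uint64_t rax) {
    return (uint32_t)((uint32_t)rcx - (uint32_t)rax);
}
```

Checking a handful of values—including the dangerous eax = 0 case—shows the two paths computing the same final rcx, which is the semantic equivalence property the transform needs.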

I spent most of the 2022-23 academic year coming up with transforms like this one, including for much more interesting and challenging cases, like 64-bit arithmetic and especially multiplication. As it turns out, there’s a lot you can do with cool little assembly tricks to mitigate otherwise dangerous instruction-centric side channels! We also used Rosette to verify our transforms, making sure they’re both semantically equivalent to the original instruction and safe from the optimizations we were trying to address.

Concluding thoughts

Targeted assembly-level transformations can be a surprisingly powerful tool to combat certain types of timing channels. If we can predict in advance what microarchitectural optimizations are coming that could create future instruction-centric side channels—patents and instruction set manuals can provide some hints—we can create proactive tools to transform the potentially vulnerable instructions. Then when those side channels do occur, we can have software tools at the ready to protect cryptographic code in the weeks or months before a hardware solution becomes available.

That said, there are limitations. Most obviously, you have to know what new side channels are likely to appear before you can defend against them. And even if you can correctly predict future microarchitectural optimizations, designing the defenses themselves is easier said than done, requiring lots of time, effort, and insight to do effectively.

However, that doesn’t mean it’s not worth it. Hardware developers aren’t going to stop implementing new optimizations any time soon, and those optimizations aren’t going to stop creating new side channels. As long as that’s true, instruction-centric software mitigations will have value as another tool in the toolbox and a stopgap when more efficient solutions are unavailable. And in the meantime, we continue to work on ways to solve these problems more quickly, flexibly, and generically.

Further reading

Alexandra Michael is a 2nd year PhD student at the University of Washington, working in PL and security.