Binary pointer alias analysis — beating CodeQL’s taint analysis without even having source code

Prologue

I did not attend Hexacon in 2022, and I was definitely not in a hotel in Paris those days.

It must have been 2020, during covid, while re-watching the original Blade Runner for the 100th time in my life, that I first conceived what would later become PRIS (short for PRoven Inference Security, or maybe a tribute to Daryl Hannah, who knows), a binary zero-day identification feature for IoT firmware, the first of its kind.

The focus was put on path-sensitivity and on reducing some of the false positives that other tools, based on simple monotone frameworks and pattern matching, suffered from; but this came at increased computational cost, since the first version of the implementation relied on concolic analysis. Incidentally, that's the version that we still ship today.

I soon realized that the final solution must incorporate more static techniques in order to be efficient, and I also realized that the real gain of using a full-fledged concolic/symbolic engine was its power in solving the pointer-aliasing problem on the traces it has coverage of. Therefore, I dived deeper into understanding this problem, and to my amazement it turned out that people have been investigating more or less the very same problem that I was faced with for about as long as computers have existed. The earliest useful papers I found were from the 70s, and considerable theoretical and practical progress took place in the 80s and 90s. In the 2000s, the focus shifted to large-scale software systems, and then finally, in the 2010s, we arrive at some security applications.

I'll try my best to distill my knowledge about this topic into a somewhat lengthy, but more or less self-contained article, so more people can enjoy the beauty of this subject.

Introduction

A common root cause of critical software bugs is the class of so-called taint-style vulnerabilities. These are characterized by execution flows where an attacker-controlled input variable or piece of data reaches certain security-sensitive program statements. A notable example would be an OS command injection (CWE-78) like this:
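Something along these lines, where an attacker-controlled program argument is pasted into a shell command without any sanitization (a minimal hypothetical stand-in, names and all are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    char cmd[256];
    if (argc < 2)
        return 1;
    /* attacker-controlled argv[1] flows straight into system() */
    snprintf(cmd, sizeof(cmd), "ping -c 1 %s", argv[1]);
    system(cmd);
    return 0;
}
```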

Or a buffer overflow resulting from some kind of unsafe memcpy where the length of the input string is not checked (CWE-120):
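Again, a minimal hypothetical stand-in:

```c
#include <string.h>

void copy_input(const char *input) {
    char buf[64];
    /* the attacker-controlled input is copied without ever being
       checked against sizeof(buf) */
    strcpy(buf, input);
}
```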

These trivial examples are easy to spot, but discovering similar issues in production code is considerably more difficult. It is not a simple task to identify parts of the code where user-controlled data enters the system (so-called ‘sources’ in taint analysis nomenclature), and it is especially hard to find all feasible paths from these to the security-sensitive ‘sinks’. The execution flow can span various functions, variables get assigned, reassigned and copied around, indirect calls can happen via function pointers or virtual dispatch, etc.

SAST (static application security testing) tools are good at identifying vulnerable code patterns, even if they can’t truly reason about the runtime semantics of the code due to known theoretical results, such as Rice’s theorem. Still, production-grade static analyzers such as CodeQL make a good attempt at providing the building blocks for security researchers and developers to help them identify these issues automatically. Results of SAST tools are known to be noisy, that is, full of false positives: they generally overestimate the set of feasible execution paths in a program, and they will raise alerts in many cases where the flagged vulnerability could never actually happen in practice. (Mitigating that would mean introducing path-sensitivity, which can be done, for example, with symbolic execution; however, that technique is much more expensive in computational resources and can suffer from combinatorial explosion of the state space.)

Sometimes, though, SAST tools also introduce false negatives, and that should be much scarier from a vulnerability management standpoint: you can’t address stuff you can’t even see.

A fairly simple example

Consider the following code.
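Picture something like this hypothetical sketch (the names are mine, not the original's): tainted input is read in main and passed through a helper function into an unchecked strcpy.

```c
#include <string.h>
#include <unistd.h>

void process(const char *input) {
    char buf[32];
    strcpy(buf, input);                     /* sink */
}

int main(void) {
    char indata[256];
    /* source: attacker-controlled data enters here */
    if (read(0, indata, sizeof(indata) - 1) > 0) {
        indata[sizeof(indata) - 1] = '\0';
        process(indata);
    }
    return 0;
}
```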

We can try to craft some CodeQL to detect this vulnerability. Here, I’m using TaintTracking::Global with my own configuration to establish the sources and sinks.

This does the trick indeed. CodeQL is able to correctly track the taint relationship, even through a function call (so-called interprocedural analysis):

[screenshot: CodeQL query results showing the taint path from the source to the strcpy sink]

Now let’s have an example that is only slightly more complex, resembling a real-world context. Here, there are two structure definitions where some additional data is stored along with the pointer. Moreover, the two structures are nested in each other: our target pointer is stored in a struct data_inner, and those are wrapped in struct data_outer instances. Our vulnerable function now receives the outer structure and copies the data via data->ptr_inner->data.

Easy, right? Call it bof.c:
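A hypothetical reconstruction matching that description (the exact line numbers quoted later in the article refer to the original file):

```c
#include <stdio.h>
#include <string.h>

struct data_inner {
    int id;
    char *data;                     /* our target pointer */
};

struct data_outer {
    int len;
    struct data_inner *ptr_inner;
};

void vulnerable(struct data_outer *data) {
    char buf[32];
    /* copies via data->ptr_inner->data with no length check */
    strcpy(buf, data->ptr_inner->data);
}

int main(void) {
    char taintedbuf[256];
    struct data_inner d1;
    struct data_outer d2;

    /* source: attacker-controlled data */
    if (!fgets(taintedbuf, sizeof(taintedbuf), stdin))
        return 1;

    d1.id = 0;
    d1.data = taintedbuf;
    d2.len = (int)strlen(taintedbuf);
    d2.ptr_inner = &d1;             /* nest the structures */
    vulnerable(&d2);
    return 0;
}
```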

 

Well, let’s try to run the same CodeQL example to see what happens.

[screenshot: the same CodeQL query returning no results for bof.c]

Oops. Nothing this time. Why is that?

Important

Well, as it turns out, the answer has to do with CodeQL choosing not to implement detailed pointer alias analysis as a design decision.

In many ways, static analyzers have to balance computational complexity, precision, scalability and also maintain some kind of a nice user experience with an intuitive API/rule syntax.

Still, aiming low on the pointer alias analysis problem means that CodeQL and most other SAST tools won’t find a lot of juicy and practical vulnerabilities that are often super simple for human analysts to find, and for attackers to exploit (90% of high-impact IoT device Remote Code Execution CVEs are in fact taint-style vulnerabilities according to our own study over at BugProve; the rest are pretty much just hard-coded credentials).

Pointer alias analysis

So what is this problem and why is it difficult? Basically, you are trying to identify all pointer definitions p and q, where q may alias p. Two pointers alias each other when they point to the same object. Consider this example.
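A hypothetical reconstruction consistent with the discussion below (the line numbers quoted in the text, such as 7 and 33, refer to the original listing; comments mark the corresponding spots):

```c
#include <stdio.h>

int object1, object2, object3;

void func(int **p, int **q, int cond) {
    if (cond)
        *q = *p;        /* q may become an alias of p ("line 7") */
}

int main(int argc, char **argv) {
    int *p, *q, *r, *s, *t;

    p = &object1;       /* "line 21" */
    q = &object2;       /* "line 22" */
    s = &object3;       /* "line 23" */

    r = q;              /* r always receives the original q */
    t = p;              /* t always receives the original p */

    func(&p, &q, argc > 1);

    /* the basic block starting at "line 33" */
    printf("%p %p %p %p %p\n", (void *)p, (void *)q,
           (void *)r, (void *)s, (void *)t);
    return 0;
}
```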

There are 10 pairs of pointers we should investigate at line 33. At the end of the main function, the following hold:

  1. p and q may alias

  2. p and r do not alias (why?)

  3. p and s do not alias

  4. p and t must alias

  5. q and r may alias

  6. q and s do not alias

  7. q and t may alias

  8. r and s do not alias

  9. r and t do not alias (why?)

  10. s and t do not alias

Now, all of these relationships hold for the values of p, q, r, s, t at the basic block starting from line 33 above. Crucially, we are also interested in the aliasing relationships between the values of p, q, s at lines 21, 22 and 23 and the values that they hold at the end of the main function.

Since the operations in func() do not change p or s, and we also see that r always receives the value of the original q, whereas t always receives the original p, we can deduce that:

  1. p at line 33 must alias &object1

  2. q at line 34 may alias &object1 or &object2

  3. r at line 35 must alias &object2

  4. s at line 36 must alias &object3

  5. t at line 37 must alias &object1

     

Sometimes authors differentiate between alias analysis and points-to analysis. Alias analysis computes a set S holding pairs of variables (p,q) where p and q may (or must) point to the same location. On the other hand, points-to analysis computes a relation points-to(p,x), where p may (or must) point to the location of the variable x.

It is not that hard to see that accurately establishing these relationships is not only NP-hard, but in general uncomputable for arbitrary programs. For formal proofs, see Landi and Ryder [1]. Practical algorithms therefore have to resort to approximations, and it is important to mention that may-analysis is generally easier than must-analysis, as one should expect.

Modern compilers like clang/LLVM and gcc run all sorts of fancy optimization and program analysis algorithms. However, for large codebases, not only are superpolynomial algorithms a big no-no, but even the more expensive polynomial algorithms, say, cubic ones, are going to have scalability issues. That is, for compilation, more accurate alias analysis solutions won't be acceptable. However, when it comes to security, the trade-off is slightly different: spending a few minutes or even hours of compute to prevent a critical CVE might actually be a bloody good idea under most reasonable threat models and organizational economic constraints. Hence the full SAST, SCA and application security industry generating some $7B annually.

For our model example in bof.c, if we are able to establish with pointer alias analysis that data->ptr_inner->data on line 39 aliases d2.ptr_inner->data on line 30 and therefore aliases &taintedbuf, we can raise an alert about an issue given the sizes of the buffers used during strcpy. Cool, huh?

Alias analysis in compilers

Before investigating what we can come up with in terms of pointer alias techniques capable of discovering interesting taint-style vulnerabilities, let's see what compilers do when it comes to dealing with pointers. The two main algorithms are Andersen's algorithm and Steensgaard's algorithm, invented by fine Scandinavian gentlemen in the late 20th century.

Andersen's algorithm is pretty much what you would come up with naturally after obtaining a foundational body of knowledge in program analysis techniques such as dataflow analysis, constraint-based analysis and Abstract Interpretation (not to be confused with the other AI, the bubbly one that 2024 will be remembered for).

So, Andersen's algorithm kind of goes like this. Let's define set constraints based on program statements using the following rules.

\[
\begin{array}{llcl}
[\textit{addr}] & p = \&x & \Longrightarrow & \{x\} \subseteq pts(p) \\
[\textit{copy}] & p = q & \Longrightarrow & pts(q) \subseteq pts(p) \\
[\textit{store}] & {*p} = q & \Longrightarrow & pts(q) \subseteq pts({*p}) \\
[\textit{load}] & p = {*q} & \Longrightarrow & pts({*q}) \subseteq pts(p)
\end{array}
\]

That is,

  1. The set of locations pointed to by p contains the label x.

  2. Those locations pointed to by p must contain those pointed to by q.

  3. Those locations pointed to by *p must contain those pointed to by q.

  4. Those locations pointed to by p must contain those pointed to by *q.

We can also generate the semantics via the following rules:

\[
\frac{x \in pts(q)}{x \in pts(p)}\; p = q\ [\textit{copy}]
\qquad
\frac{v \in pts(p) \quad x \in pts(q)}{x \in pts(v)}\; {*p} = q\ [\textit{store}]
\qquad
\frac{v \in pts(q) \quad x \in pts(v)}{x \in pts(p)}\; p = {*q}\ [\textit{load}]
\]

The attentive reader must have noticed that these rules make no reference whatsoever to the order of the statements, and this is right: the analysis does not consider the flow of the program, which is another way of saying that Andersen's algorithm is a flow-insensitive alias analysis. Let's run it manually on our toy example above and see whether it results in any inaccuracies.

Collecting address-of statements:
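For our reconstructed example, these yield the base facts:

\[
p = \&object1 \Rightarrow object1 \in pts(p), \qquad
q = \&object2 \Rightarrow object2 \in pts(q), \qquad
s = \&object3 \Rightarrow object3 \in pts(s)
\]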

Collecting assignments (and folding the function call assignments into main, therefore dropping all * operators):
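For our reconstructed example, these reduce to:

\[
q = p, \qquad r = q, \qquad t = p
\]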

Warning

The above can't be done in general. The function func could be called with a different calling context from main or elsewhere, and that opens up another dimension of the analysis: context-sensitivity. Here, we are dealing with a simple example and a context-insensitive analysis for demonstration purposes.

So, for our reconstructed example, we first have:
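\[
pts(p) = \{object1\}, \quad pts(q) = \{object2\}, \quad pts(s) = \{object3\}, \quad pts(r) = pts(t) = \emptyset
\]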

Then, by [copy], we add a couple more labels:
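\[
pts(q) = \{object1, object2\}, \qquad pts(r) = \{object2\}, \qquad pts(t) = \{object1\}
\]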

Then, another run of [copy] reaches a fixpoint:
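\[
pts(p) = \{object1\}, \quad pts(q) = \{object1, object2\}, \quad pts(r) = \{object1, object2\}, \quad pts(s) = \{object3\}, \quad pts(t) = \{object1\}
\]

(The second run only propagates object1 from pts(q) into pts(r); after that, nothing changes.)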

Comparing this to the accurate solution above, we see that Andersen loses precision: it asserts that r may point to object1. This is due to flow-insensitivity: even though q can become an alias of p on line 7, the assignment r = q always happens before that statement in any actual execution flow.

Well, this is not obvious from the example, but the lack of context-, field- and flow-sensitivity means that Andersen's solution will be pretty inaccurate for most real-world programs. Steensgaard's is even worse in terms of accuracy, but at least it is roughly linear so it scales well with program size (Andersen has cubic worst-case time complexity).

Instead of operating over the space of O(n^2) pointer pairs, Steensgaard assigns abstract locations to variables/objects and constrains points-to relations so that each pointer targets at most one abstract location. In real programs, you can have multiple aliases for a variable; in that case, Steensgaard's solution unifies the corresponding abstract locations. And that's pretty much the whole algorithm: if something points to two or more different things, we merge those things together, check whether the merged thing in turn points to two or more different things and merge those as well, eventually creating a bunch of equivalence classes among the abstract locations.

Formally, we deal with the same type of statements that Andersen deals with:

\[
p = \&x \qquad p = q \qquad p = {*q} \qquad {*p} = q
\]

But crucially, we join abstract locations (denoted *p) like this:
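In essence, this is standard unification; roughly (with join(a, b) merging the equivalence classes of a and b and recursively unifying their pointees):

\[
\begin{array}{lcl}
p = \&x & \Longrightarrow & \mathrm{join}({*p},\ x) \\
p = q & \Longrightarrow & \mathrm{join}({*p},\ {*q}) \\
p = {*q} & \Longrightarrow & \mathrm{join}({*p},\ {**q}) \\
{*p} = q & \Longrightarrow & \mathrm{join}({**p},\ {*q})
\end{array}
\]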

So let's see a Steensgaard run for the same program as above. The initial state is the same.

Then, we create the equivalence classes according to assignments:
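For our reconstructed example, the merges are:

\[
q = p \Rightarrow \mathrm{join}(object2,\ object1), \qquad
r = q \Rightarrow \mathrm{join}({*r},\ \{object1, object2\}), \qquad
t = p \Rightarrow \mathrm{join}({*t},\ \{object1, object2\})
\]

while s keeps pointing to object3, which stays separate.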

After all merges, we have:

  1. {p, q, r, t, object1, object2} (single large equivalence class).

  2. {s, object3} (independent).

So, Steensgaard manages to extract some useful information from the program, but considerably less than what you'd like for security analysis. In exchange, each statement is processed only once and the union-find operations behind the merges are nearly constant-time, resulting in a near-linear overall time complexity of O(n·α(n)), α being the inverse Ackermann function.

Now, although we looked at a simple example, things can get really tricky when you examine a few instances of this problem from the PTABEN benchmark suite found here: https://github.com/SVF-tools/Test-Suite/tree/master/src.

Here's one checking for context-sensitivity:
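I won't paste an actual file; here is a hypothetical sketch of the pattern these tests probe (the suite asserts the expected relations with macros like MAYALIAS/NOALIAS, and the real tests stack many more layers of calls and indirection on top):

```c
/* in the spirit of the PTABEN context-sensitivity tests; a
   context-insensitive analysis conflates the calls to id() and wrap()
   and wrongly reports x and y as may-aliases */
int g1, g2;

int *id(int *p) { return p; }
int *wrap(int *p) { return id(p); }   /* one more level of calling context */

int main(void) {
    int *x = wrap(&g1);
    int *y = wrap(&g2);
    int *z = id(x);
    /* NOALIAS(x, y); MUSTALIAS(x, z); */
    (void)x; (void)y; (void)z;
    return 0;
}
```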

This looks much more like an NP-hard problem in program size, huh? Kind of like a Sudoku puzzle. Incidentally, it is possible to add context-sensitivity to Andersen's algorithm, but it will naturally increase computational complexity.

The state of the art for the general problem using source code (well, at least as of 2020) seems to be SUPA [2] improving on SFS [3], although neither of them targets security use cases per se. For our bof.c, field and flow sensitivity would be nice. For programs such as emacs, whole-program analysis times with flow-sensitivity are still in the range of thousands of seconds, with superpolynomial scaling.

Taint-style vulnerabilities

In order to discover these types of vulnerabilities in real-world code, we need field-sensitivity and flow-sensitivity, a bit of context-sensitivity would help as well, and in addition, we might also encounter indirect calls, such as calls through function pointers (which could theoretically alias each other, and then it all becomes a crazy mess, but let's forget that scenario for now).

Moreover, we would love to do this using only the binary code, since most real-life attackers won't have access to the source code either, and they will still manage to hack our applications and IoT systems. Therefore, there has to be a way to extract the 0-day-vulnerable semantics from the object code itself via program analysis techniques, well, up to the limitations imposed by Rice's theorem. Since we are doing engineering here at the end of the day, we don't care if the problem we're trying to solve is impossible in theory; we are just trying to generate actionable and high-quality vulnerability alerts and the secure software development value associated with them.

Important

Intuitively, many developers roll their eyes at the notion of binary program analysis. I think this is because they are usually not happy with the false positives that many commercial tools output even when run on the full-fledged source code repository. This, plus the fact that a lot of information gets lost during compilation, reinforces the wrong assumption that binary analysis necessarily translates to even lower quality findings. As we'll see, that's not the case! In fact, as far as vulnerabilities go, working with binaries yields more accurate results in many cases.

Through the meticulous work of our intern, Balint Toth, we collected as many CVEs affecting IoT devices as we could. Here's a sample:

[table: a sample of the collected IoT CVEs; the affected network-facing binary is listed in the 6th column]

Guess what, all of these are taint-style vulnerabilities hiding in the code of the network-facing binaries listed in the 6th column!

If only we had access to a pointer analysis algorithm that operated on multi-architecture binary code with field, context and flow sensitivity... Well, it turns out that there is a brilliant paper by Kai Cheng et al. that attempts to solve exactly this [4].

Access paths and on-demand interprocedural analysis

Their main inspiration comes from a much older paper by Ben-Chung Cheng and Wen-Mei W. Hwu [5].

The year is 2000. Santana's Smooth is playing on the radio, the ILOVEYOU VBScript virus is spreading through the Internet, people are buying their very first Nokia 3310s, and yours truly is about to start his 2nd year of elementary school at age 8. Computers are objectively pretty slow and crappy, even though they are routinely beating Kasparov at chess.

Most importantly for this discussion, Intel Itanium is about to become a thing.

Note

Intel Itanium was a high-performance 64-bit processor family designed with an innovative EPIC (Explicitly Parallel Instruction Computing) architecture, aimed at revolutionizing enterprise computing by delivering superior parallelism and scalability. It failed due to poor backward compatibility with x86 software, competition from x86-64 processors like AMD's Opteron, and limited adoption by software developers, ultimately being overshadowed by more versatile and cost-effective alternatives.

Itanium's experimental compilers were designed to perform advanced static scheduling and optimization at compile time, leveraging the chip's EPIC architecture to extract maximum parallelism and efficiency from the code. However, these compilers struggled with the complexity of predicting runtime behavior accurately, often leading to suboptimal performance and failing to justify the high costs and effort required for software migration and optimization. Simply put: the idea was to go clever with the compilers and crazy with parallelization, but it just didn't work out.

Still, while searching for the holy grail of Itanium compilation, the IMPACT Research Group over at the University of Illinois led by Wen-mei W. Hwu produced some spectacular program analysis algorithms and implementations that included then-state-of-the-art pointer alias analysis.

Pointer analysis algorithms can be categorized based on whether they consider the underlying object/variable or work with access paths:

  1. Object/Variable-based analysis focuses on identifying the target object or variable a pointer refers to, abstracting away specific access paths, which simplifies the analysis but may lose precision for complex memory structures.

  2. Access path-based analysis considers the specific paths (e.g., x->y->z) used to reach a memory location, enabling finer-grained insights into pointer targets but increasing complexity and computational cost.

Access paths have been in use naturally in earlier algorithms, but it is the work [5] that formally defined them and explained how to do context- and field-sensitive interprocedural analysis more or less efficiently:

[figure: the formal definition of access paths from [5]]

I won't go super deep into transfer functions and monotone frameworks, but the real meat of this paper is showing how to propagate points-to information and summary transfer functions across procedure calls using a conservative analysis until a fixpoint is reached, and how access paths are crucial in enabling the analysis of complex, potentially recursive data structures. (I obtained the source code of the original IMPACT compiler; for those of you who are interested, drop a DM.)

How efficient is the resulting algorithm? Well, it is exponential in several parameters; still, with a conservative k-limiting factor of 1, it is able to analyze a program as complex as gcc in 500 seconds on a 450 MHz Pentium II, so it would be pretty damn fast and practical today.

This also shows how great research in computer science eventually finds its application. The original objective failed; still, the same algorithm can be reused for security research in the 2020s! How cool is that?

The above algorithm, however, still operates on source code, and we need something better.

Multi-architecture binary program analysis

So, we're back to the research of [4]. The first insight of their research is that symbolic/concolic execution approaches, apart from being slow and computationally intensive, rely heavily on concretization strategies that might resolve two aliasing symbolic addresses to different concrete addresses, preventing taint from propagating properly. There is also an algorithm called VSA (value-set analysis), a form of Abstract Interpretation based on more traditional program analysis techniques, that does fairly well on binaries when it comes to numeric and pointer analysis benchmarks. However, like symbolic execution, it also represents memory locations on an object/variable basis (see above). So, for example, when a read happens in the program, both analysis techniques will conservatively model the result (either as a top element in the underlying complete lattice used in the interval analysis domain of VSA's Galois connection, or as an unconstrained symbolic value in the case of concolic analysis frameworks) and lose useful information in case the value is later used as an address.

See where this is going? This is exactly the problem that access paths solve. Let's see how to make them work for binary code.

The idea is to operate on an Intermediate Representation. Instead of the grammar above for APs, we define SSEs, structured symbolic expressions (not to be confused with static symbolic execution), recursively like this:

[figure: the recursive definition of structured symbolic expressions (SSEs) from [4]]

The implementation uses VEX IR, originally designed for Valgrind, then ported to Python by the angr team, and this definition closely follows the main objects and operations defined there. VEX is also powerful because it supports lifting x86, arm, mips, riscv, mips64, arm64 and ppc64 binary code into IR, and it is easily extensible.

To motivate the analysis, I'll reproduce the example given in the paper in my own words. Consider the following assembly listing:

We'd like to find all pointer aliases of R1. By looking at line 1, we see that it is defined there, so we add load(R3+0x8) to the aliases. Now if we look forward from here, we find a usage of R3: it gets stored to R6+0x4. Therefore, we replace R3 with store(R6+0x4), and add load(store(R6+0x4)+0x8) to the alias set on line 3.

Similarly, if we were to find the aliases of r1 for this code:

We would end up with these (non-trivial) expressions:

Nice, huh? The original paper gives the following set of update rules for the intra-procedural analysis:

[figure: the intraprocedural SSE update rules from [4]]

Believe it or not, this is still not enough to discover the vulnerability in bof.c.

My humble personal contribution to the set of update rules is the following. The analysis often encounters statements of the form (2), where Binop is addition:

[equation: the rule-(2) pattern with Binop = Add, i.e., expressions of the form Add(r_n, o_1)]

I relax this update rule for cases when the expression is also of the form:

[equation: Add(r_n, o_2), the same base register with a different constant offset]

and replace these expressions even if the offset does not completely match. Why? Structures with multiple fields can be passed through the stack between functions, and this transformation allows the analysis to recover aliasing relationships nonetheless.
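To picture why, consider a hypothetical illustration in C: when a struct is passed by value, its fields are copied onto the callee's stack frame, so the callee's memory accesses show up as the same stack base plus different constant field offsets, and insisting on an exact offset match would break the chain of aliases.

```c
/* hypothetical illustration: both fields of ctx land on the stack at
   base + offsetof(struct ctx, buf) and base + offsetof(struct ctx, len),
   so the analysis sees Add(sp, o_1) and Add(sp, o_2) for the same object */
struct ctx { char *buf; int len; };

int consume(struct ctx c) {          /* c lives on the callee's stack */
    return c.buf[c.len - 1];         /* accesses at two different offsets */
}
```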

Prototype implementation

The rest of this article will be dedicated to a very bare-bones Python implementation that does just enough of the analysis to discover the taint vulnerability in binaries compiled from bof.c.

I wrote most of this code during two long train rides, so excuse me for the spaghetti code. Don't do this at home, kids!

So first we need a way to represent SSEs. Turns out that angr has a module called claripy that already implements a nice AST interface to deal with registers, values and operations on them defined in claripy/ast/bv.py.

We just need to extend that with loads and stores to represent tau_mem expressions and we are done:

This was easy.

We'll also add a wrapper class around it to keep track of additional metadata:

We already know more or less enough to implement the intraprocedural part of the algorithm. We'll have to fetch a block of code, select a target register, and scan up and down and up and down all the statements within that block until a fixpoint is reached, repeatedly invoking our update rules defined above.

Let's simulate that we are doing TDD and start with this test case instead of high-level requirements. We know what alias expressions we should get for this piece of arm assembly:

As you can see, we are taking that snippet and turning it into an ELF file with arm-linux-gnueabihf-as. Then, using angr.Project, the file is loaded into memory. We perform a quick control flow graph recovery to get the entry block. Then, we fetch that block, an object that is now a proper VEX IRSB (IR super block). If you would like to familiarize yourself with VEX IR, the best starting point is probably: https://github.com/angr/vex/blob/master/pub/libvex_ir.h#L42.

The code we have to implement, of course, is the definition of traceblock() as used on line 23. It should take an IRSB, a statement ID corresponding to our entry point to the analysis, and the name of the register that we want to get aliases for, in this case, r1.

Conceptually we are doing the following things:

One of my other humble additions to the original algorithm is to parallelize SSE transformations as much as I can (here via trace_task and Pool). Omitting some particularly ugly stuff, traceblock looks something like this:

The forward and backward passes are similar, and obviously, that's where the magic happens™, the kind that you're waiting for.

The heart of the code is just going through statements and replacing SSEs according to their update rules:

In more granularity, this also involves tracking the direction and liveness of aliases (for more information on those, refer to the paper):

The update rules themselves are implemented along the lines of (note that starting from line 183, we implement the Add(r_n, o_1) transformation discussed above):

 

This more or less completes the intraprocedural part of the algorithm. For the inter-procedural part, spanning multiple functions, the following has to be implemented:

[figure: the on-demand interprocedural analysis algorithm from [4]; lines 12 to 14 apply the transfer functions at call sites]

To make this barely work for our bof.c, we'll define the minimal set of sources and sinks, and somewhat disregard lines 12 to 14 of the above algorithm where the transfer function magic happens.

Spelled out in code:

And finally, we are able to discover the issue using our simple prototype, and beat CodeQL to it!

Important

Again, it makes sense to reiterate that this solution does not require the source code at all. Whether you compile it to x86, arm, mips or riscv, it will work just as well.

Let's see this in action:

https://asciinema.org/a/myKlLjNzP1A2YpLu8P0Ss5x90

As you would expect, the potential vulnerability is being 'detected' (see the complete trace here: https://termbin.com/e14w).

In terms of benchmarks, the original implementation by the authors already achieves 200x speedups over other state-of-the-art analyzers, even though their solution is single-threaded and most of the computation is spent within Python AST manipulation. It is effective at discovering real-world 0-day vulnerabilities in IoT device firmware, as well as rediscovering N-days. In terms of accuracy, it achieves 90-100% on the PTABEN basic, flow and context pointer alias datasets, beating VSA-based systems.

Our experiments with multi-threaded C++ based implementations indicate that speeds can be increased to a level where the solution is scalable in that it can be operated on relatively large device firmware and monolithic binaries such as RTOS images.

Closing remarks

The field of binary pointer alias analysis has come a long way, from early foundational research to modern implementations capable of uncovering real-world vulnerabilities. Through meticulous refinement of access path-based approaches and on-demand interprocedural analysis, we have demonstrated how such methods can outperform traditional source code-based tools like CodeQL. This is particularly significant in environments where source code is unavailable, such as IoT device firmware analysis.

Our prototype, though simplistic, illustrates the power of structured symbolic expressions (SSEs) and how carefully crafted update rules can bring us closer to solving complex security problems. By embracing advancements in multi-architecture intermediate representations like VEX and leveraging parallelism for computational efficiency, the scope for future research and practical applications in this domain is immense.

Binary program analysis, once considered an impractical endeavor, has proven its worth in security research, particularly for IoT firmware and embedded systems. The ability to perform precise, scalable pointer alias analysis without access to source code opens new avenues for vulnerability discovery.

While challenges remain—such as handling scalability for extremely large binaries or incorporating more advanced interprocedural techniques—the results achieved thus far are encouraging. By integrating SSE-based alias analysis into broader security workflows, we can significantly improve our ability to detect and mitigate critical vulnerabilities. Ultimately, as attackers grow more sophisticated, it is innovations like these that will keep defenders one step ahead.

PS: If you are interested in the full code for this example, please drop a comment or a DM, and I'll see what I can do :)

References

[1] William Landi and Barbara G. Ryder. 1991. Pointer-induced aliasing: a problem classification. In Proceedings of the 18th ACM SIGPLAN-SIGACT symposium on Principles of programming languages (POPL '91). Association for Computing Machinery, New York, NY, USA, 93–103. https://doi.org/10.1145/99583.99599

[2] Y. Sui and J. Xue, "Value-Flow-Based Demand-Driven Pointer Analysis for C and C++," IEEE Transactions on Software Engineering, vol. 46, no. 8, pp. 812-835, Aug. 2020, doi: 10.1109/TSE.2018.2869336.

[3] Ben Hardekopf and Calvin Lin. 2011. Flow-sensitive pointer analysis for millions of lines of code. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '11). IEEE Computer Society, USA, 289–298.

[4] Kai Cheng, Yaowen Zheng, Tao Liu, Le Guan, Peng Liu, Hong Li, Hongsong Zhu, Kejiang Ye, and Limin Sun. 2023. Detecting Vulnerabilities in Linux-Based Embedded Firmware with SSE-Based On-Demand Alias Analysis. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023). Association for Computing Machinery, New York, NY, USA, 360–372. https://doi.org/10.1145/3597926.3598062

[5] Ben-Chung Cheng and Wen-Mei W. Hwu. 2000. Modular Interprocedural Pointer Analysis Using Access Paths: Design, Implementation, and Evaluation. SIGPLAN Not. 35, 5 (May 2000), 57–69. https://doi.org/10.1145/358438.349311