Tracing Linux: Fast, Compatible, Complete

Christopher Arges
Engineering Manager

Protecting mission-critical Linux machines is essential for any business. Sophisticated cyberattacks can start from a low-value target machine and pivot into a high-value database server filled with sensitive information. So where do you start? You can harden your systems, audit millions of lines of code downloaded from the internet and hope there are no vulnerabilities, or even wrap all your programs in tin foil. But this isn't enough. You still need an audit trail to detect breaches and anything else you've missed. Collecting that trail, however, creates a deluge of data that can make monitoring unmanageable.

In order to respond to threats in real-time and separate the signal from noise, we need a program that can extract the right data from your Linux servers. Such a program needs to be:

FAST - Tracing should leave a minimal footprint and not affect other software running on the machine;

COMPATIBLE - Tracing must support users on a wide variety of platforms, from older LTS releases to the latest; and

COMPLETE - Tracing should pick up the relevant system calls and audit events that provide security context.

Once we build this tracing software, how do we use all this data? This rich data makes detection hard to evade when a malicious hacker plants a script on the machine or someone uses a well-known privilege escalation technique.

This post describes the various approaches to tracing Linux and the pros and cons of each. It also highlights how we at Confluera make monitoring our Linux systems fast, compatible, and complete within our Autonomous Detection & Response platform.

Tracing for Security Context

Tracing can be used for many purposes, such as debugging, performance testing, auditing, and threat detection and response. To achieve the right level of tracing, we need to ingest events from a range of sources, from syscalls to audit events.

For any user program to modify a file, execute a program, or become a privileged user, the program must talk to the operating system. In Linux, this happens through a system call (a.k.a. syscall). Thus, the syscall is a natural place to monitor changes in system state and uncover security context. The downsides of tracing syscalls are the vast volume of events and the argument parsing required: arguments are often pointers to the context rather than the context itself. Dereferencing these pointers often requires additional lookups inside the kernel.

So, how do we track higher-level events such as users logging into a system, or users being added and removed? The Linux audit framework tracks security events that may not be easily connected to syscalls. However, these events are delivered as strings and don't always carry the complete context needed, so some work to parse and contextualize them is unavoidable. To be useful to many users, a solution also needs to be compatible with multiple versions of Linux. Tracing technologies have advanced over the years, but many users are still on Long Term Support (LTS) releases and may not have the latest and greatest code. Therefore any solution also needs to consider compatibility.

What tools already exist on a Linux system to get this information? Let's look at some userspace approaches and the issues with using them at scale.

Userspace Approaches

Userspace approaches to Linux tracing exist in many varieties. Each of them utilizes different kernel technologies to produce tracing events. Examples of such programs are: strace, which utilizes ptrace to hook into processes; perf, which can use tracepoints and kprobes; auditd, which uses the Linux audit subsystem; and even seccomp actions, which allow logging callbacks.

In order to completely trace a Linux system for security events, we need to be extremely flexible. System calls, critical kernel functions, and audit events all need to be considered. Existing userspace tools are not comprehensive and focus only on a narrow set of events. In addition, performance is paramount, and because of the deluge of syscalls, being able to contextually filter events becomes a necessity. Tools that rely on per-process instrumentation, like strace and seccomp, are very unwieldy because every process must be modified. System call parameters may be just memory addresses, and additional context about files, such as inode numbers, may require auxiliary lookups. This is why a better approach is to evaluate kernel-space approaches to tracing and use them to compose a more complete and performant solution.

Kernel Space Approaches

Using a kernel-space technology to produce tracing allows us to have more complete information and even filter the events contextually.

Tracepoints

Linux kernel tracepoints are hooks embedded in many critical kernel functions. Loadable kernel modules can execute custom code when a tracepoint is hit, which makes them very effective. An advantage of tracepoints is that they are defined in the code itself rather than attached to a function by name. This is crucial because function names and parameters change between kernel versions. Tracepoints can also be used for events like processes being rescheduled or kernel modules being loaded. Hooks for syscall enter and exit were introduced in v2.6.32, giving tracepoints a long and solid history. In addition, various user and kernel technologies, such as perf, ftrace and eBPF, rely on tracepoints to get tracing information.

A disadvantage of tracepoints is that they are static and cannot be added dynamically. In practice this is rarely a problem, since many tracepoints rich in security context have already been added. An advantage of tracepoints is that they are generally faster than technologies like kprobes.
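To get a feel for which tracepoints a given kernel exposes, one can read the tracefs file available_events, which lists one subsystem:event pair per line. The Python sketch below groups such a listing by subsystem. The helper name group_tracepoints is hypothetical, the sample text is illustrative rather than captured from a real machine, and the path in the comment assumes tracefs is reachable under /sys/kernel/debug/tracing (reading it typically requires root).

```python
from collections import defaultdict


def group_tracepoints(listing):
    """Group 'subsystem:event' lines by subsystem (hypothetical helper)."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        subsystem, event = line.split(":", 1)
        groups[subsystem].append(event)
    return {name: sorted(events) for name, events in groups.items()}


# Illustrative sample; on a real system the text would come from something like
#   open("/sys/kernel/debug/tracing/available_events").read()
SAMPLE = """\
raw_syscalls:sys_enter
raw_syscalls:sys_exit
sched:sched_switch
sched:sched_process_exit
module:module_load
"""

if __name__ == "__main__":
    for subsystem, events in group_tracepoints(SAMPLE).items():
        print(subsystem, events)
```

The sample includes the generic syscall enter and exit hooks alongside scheduler and module-loading events, which are the kinds of tracepoints discussed above.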

eBPF

The Extended Berkeley Packet Filter (eBPF) has been usable for tracing since v4.1, and many improvements are still ongoing. To use eBPF, you write a C program that is compiled into eBPF bytecode and then loaded into the kernel via the bpf() syscall. The program has to use specific functions and cannot have loops. The kernel verifies this before allowing the code to be executed. The advantage here is that writing a custom kernel module is not required, and one can attach to tracepoints and perform advanced filtering and instrumentation.

A disadvantage is that eBPF requires newer kernels to use tracepoints. BPF "attach to tracepoints" was introduced in Linux v4.7, and BPF "attach to raw tracepoints" was introduced in v4.17. While earlier versions could support tracing syscalls, the performance improvement with raw tracepoints is significant. As highlighted in the patchset for eBPF [https://lwn.net/Articles/748352/], using kprobes+bpf is 20% slower than tracepoints. In addition, there is a slight performance hit when using eBPF because programs are compiled to BPF bytecode and then interpreted in the kernel. However, this gap is even smaller now that eBPF supports Just-In-Time (JIT) compilation.

Finally, eBPF is limited by design in what it can access. If the data you need is not exposed through an existing helper function, or getting it requires calling a kernel function, you may be out of luck. Regardless, eBPF seems like a very promising technology to use.
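The version thresholds above can be turned into a rough capability check. The following Python sketch maps a kernel release string to the eBPF attach options mentioned in this post; the function names are hypothetical, and treating v4.1 as the point where attaching to kprobes became possible is an assumption on our part. Distribution kernels often backport features, so a version comparison like this is only a heuristic.

```python
import re

# Minimum upstream versions, taken from the text above. Mapping v4.1 to
# "kprobes" is an assumption; the text only says tracing became usable then.
BPF_FEATURES = [
    ((4, 1), "kprobes"),
    ((4, 7), "tracepoints"),
    ((4, 17), "raw tracepoints"),
]


def parse_release(release):
    """Extract (major, minor) from a string such as '5.4.0-42-generic'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def bpf_attach_options(release):
    """Return the attach options whose minimum version the release meets."""
    version = parse_release(release)
    return [name for minimum, name in BPF_FEATURES if version >= minimum]


if __name__ == "__main__":
    import platform

    # platform.release() returns the running kernel's release string on Linux.
    print(bpf_attach_options(platform.release()))
```

A more reliable check would probe the running kernel directly instead of trusting its version number.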

Kprobes

Kprobes is a powerful Linux technology that has been around since v2.6. It stands for kernel probes and allows one to write handlers that can be executed before or after a particular kernel function. This means one can trace system calls by attaching probes directly to the appropriate sys_* function or even the generic sys_exit and sys_enter functions.

For system call tracing, finding the function name is usually easy since those function names rarely change. However, for tracing other parts of the Linux kernel, care has to be taken to ensure the function has not been refactored in newer versions.

Another concern is performance. There is a slight overhead using kprobes versus using solutions like tracepoints. While it may be small, this can add up if you are tracing a large volume of syscalls.

Finally, how you use kprobes can be flexible. You can use them via debugfs and ftrace, or you can even add your own probes via loadable kernel modules. Using the generic kprobes via ftrace results in events that may not have pointer contexts, and therefore may not be entirely usable. Tracing via a loadable kernel module allows for dereferencing pointer structures and getting the full context.
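As an illustration of the ftrace route, a probe is defined by writing a one-line definition to the kprobe_events file in tracefs. The Python sketch below only builds that definition string; the helper name kprobe_definition is hypothetical, and the symbol and register names in the example follow the x86 example in the kernel's kprobe tracing documentation, so they are assumptions that will not hold on every architecture or kernel version.

```python
import re

_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def kprobe_definition(event, symbol, fetchargs=(), group="kprobes", on_return=False):
    """Build a kprobe_events line: 'p:GROUP/EVENT SYMBOL [FETCHARGS]'.

    Hypothetical helper. 'p' places the probe at function entry, 'r' at return.
    """
    for name in (event, group):
        if not _NAME.match(name):
            raise ValueError(f"invalid name: {name!r}")
    kind = "r" if on_return else "p"
    return " ".join([f"{kind}:{group}/{event}", symbol, *fetchargs])


if __name__ == "__main__":
    # Entry probe that records two registers as named fields.
    print(kprobe_definition("myprobe", "do_sys_open", ["dfd=%ax", "filename=%dx"]))
    # Return probe that records the function's return value.
    print(kprobe_definition("myretprobe", "do_sys_open", ["$retval"], on_return=True))
```

Writing the resulting line to /sys/kernel/debug/tracing/kprobe_events (as root) registers the event. Note that the example records raw register values, which reflects the limitation described above: a captured pointer is of little use on its own unless it is dereferenced.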

Linux Auditing Subsystem

The Linux Auditing Subsystem is a CAPP (Controlled Access Protection Profile) compliant auditing system. Its goal is to ensure compliance for auditing. Typically it is driven via auditd and logs to a file. Any userspace program that would like to get data from the audit subsystem can use Netlink to communicate with it and receive the relevant data. The data comes in the form of text, and some processing is required to split and parse the string.

Userspace processes and libraries such as PAM, SELinux, and AppArmor can also log to the audit system. This makes it very useful for security context. One issue many have with auditd is that it is difficult to get great performance from this tool, and projects like go-audit have been written to produce events with less overhead by using the netlink socket to the kernel audit subsystem directly.
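To give a sense of the parsing involved, here is a minimal Python sketch that splits one audit record into its type, timestamp, serial number and key=value fields. The function name parse_audit_record is hypothetical and the sample record is illustrative; real records have more fields, may hex-encode some values, and one event can span several records, none of which this sketch handles.

```python
import re
import shlex

# Records start with: type=NAME msg=audit(SECONDS.MILLIS:SERIAL): key=value ...
_HEADER = re.compile(
    r"^type=(?P<type>\S+) msg=audit\((?P<ts>\d+\.\d+):(?P<serial>\d+)\):\s*(?P<body>.*)$"
)


def parse_audit_record(line):
    """Parse a single audit record line into a dict (hypothetical helper)."""
    m = _HEADER.match(line.strip())
    if m is None:
        raise ValueError("not an audit record")
    fields = {}
    # shlex keeps quoted values together and strips the quotes.
    for token in shlex.split(m.group("body")):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return {
        "type": m.group("type"),
        "timestamp": float(m.group("ts")),
        "serial": int(m.group("serial")),
        "fields": fields,
    }


# Illustrative record, shortened for readability.
SAMPLE = (
    'type=SYSCALL msg=audit(1364481363.243:24287): arch=c000003e syscall=2 '
    'success=no exit=-13 comm="cat" exe="/bin/cat"'
)

if __name__ == "__main__":
    record = parse_audit_record(SAMPLE)
    print(record["type"], record["serial"], record["fields"]["exe"])
```

All values come back as strings, so interpreting them (for example, mapping the syscall number to a name) is a further step.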

For higher-level audit events, this subsystem has a very stable, mature, centralized system that provides rich events. For lower-level system call events, using audit creates a ton of overhead in formatting and parsing strings.

The Confluera Approach

Figure 1 - Summary of various kernel tracing technologies

Our platform requires speed, completeness and compatibility. The figure above shows a summary of the various tracing technologies in the kernel, where green shows the best advantage, orange shows moderate results, and red is a disadvantage. Technologies like tracepoints seem to be a great solution for compatibility, completeness and speed. As stated earlier in the post, we also wanted a solution that would give us full visibility into various data structures. This is why we chose to implement a kernel module that relies on tracepoints for gathering data. We also use tracepoints for other security-related events where applicable, such as kernel module loading and tasks exiting. Specific high-level audit events are also collected via the kernel, filling in gaps that a syscall-only approach may miss. By taking this approach we are able to provide the complete context of each event and dereference pointer structures where needed. Another benefit of using a kernel module is that it is widely compatible with many LTS (Long Term Support) kernels and thus doesn't leave Linux users with older distributions behind.

Choosing to write a kernel module should not be taken lightly, and we have spent much time ensuring our module is safe, correct and performant. We use the 'trinity' syscall fuzzer to ensure that our module does not crash under extreme conditions. In addition we ensure that our overhead is extremely low for tracing and does not affect existing workloads.

At the same time, we are very excited about the possibilities of eBPF to allow for kernel module-less tracing in the future. As users start migrating workloads to newer and newer kernels, we think this path forward makes sense.

In conclusion, we believe our solution gives the most complete data, and the best performance, while being compatible with most Linux versions.
