Linux and dynamic tracing rants
posted on 2016-01-27 21:46:34

This is a work in progress.

tracing is just not overly accessible

If you want to get started with dynamic kernel tracing, the usual problems are much the same no matter which technology you pick:

  • "I don't know where to start."
  • "I don't know how XYZ is done in tracing tool ABC."
  • "I don't know what probes exist."
  • "I don't know which syscalls exist."
  • "I installed the packages, but this doesn't work?"
  • "I need to copy-paste scripts to make this work?"
  • "Heck, I don't even know what syscalls are."
  • "What can I do with all this stuff?"

Usually the syntax isn't even that bad; it's the points above that keep these tools from spreading further. There is a pattern to be found here, so this post aims to do the following:

  • Show what tracing is and what shape the tooling landscape is currently in.
  • Provide small examples that serve as a proper starting point.
  • Provide one-liners that give an overview of all probes and tracepoints the currently available tools offer.
  • Provide one-liners that show how to catch the syscalls that took place.
  • Provide detailed install instructions where necessary, but prefer non-invasive tools. Some tools are completely integrated into the kernel and thus directly accessible, so the focus is on these.
  • Show statements that can be run directly from the command line, rather than requiring script files.

The last two problems from the list at the top can be answered rather briefly:

  • Syscalls are the C functions that make up the API through which applications access the kernel's functionality. They are documented in section 2 of the man pages (for example: man 2 read), in case you didn't know yet.

Here's a list, even though they may be called a little differently at times:


read      read bytes from a file descriptor (file, socket)
write     write bytes to a file descriptor (file, socket)
open      open a file (returns a file descriptor)
close     close a file descriptor

fork      create a new process (current process is forked)
exec      execute a new program

connect   connect to a network host
accept    accept a network connection

stat      read file statistics
ioctl     set I/O properties, or other miscellaneous functions

mmap      map a file to the process memory address space

brk       extend the heap pointer

If a complete code audit is too heavy (after all, every branch has to be checked, and later you still find out you overlooked something), dynamic tracing is for you. You can find out how many syscalls were run or what values variables were set to, and you can collect data and create graphs from it. Actually you can do more than you will ever need, so covering the most common use cases should do well enough.
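
To make the "how many syscalls were run" part concrete, here is a minimal sketch that counts syscalls per name from strace output. The function name count_syscalls and the file name trace.log are made up for this example, and it assumes the plain "name(args) = result" line format that strace writes:

```shell
# count_syscalls: read strace output on stdin, print "<count> <syscall>",
# most frequent first. Hypothetical helper, not part of any tool.
count_syscalls() {
    awk '
        # drop the leading PID column that "strace -f -o file" adds
        { sub(/^[0-9]+[ \t]+/, "") }
        # a syscall line starts with the syscall name followed by "("
        match($0, /^[a-z_0-9]+\(/) {
            count[substr($0, 1, RLENGTH - 1)]++
        }
        END { for (name in count) printf "%6d %s\n", count[name], name }
    ' | sort -k1,1nr -k2,2
}

# usage, assuming strace is installed:
#   strace -f -o trace.log ls
#   count_syscalls < trace.log
```

Lines that are not syscalls (signals, exit messages) are simply skipped, so the counts only cover what the regular expression recognises.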

intro to dynamic tracing

What exactly is this dynamic tracing thing? Let's start with some terms which I shamelessly rephrase from a lesser-known but very able Russian guy named Sergey Klyaus and his GitHub stuff here:

  • Looking solely at code is static code analysis. Sadly this is error-prone and a damn lot of work. There's a reason not many people do kernel development.
  • Watching a system's behaviour at runtime is dynamic analysis, but there are different types of introspection.

There are several methodologies:


  • instrumentation
  • sampling
  • profiling
  • tracing

Sergey is truly awesome and knows his stuff. His ebook, though 'it may never be finished' as he said somewhere IIRC, is an outstanding piece of work and already has over 200 pages. The best part is that it is still freely available, and apart from a few typos (English is not his mother tongue) it is a damn good read.

What follows is a short overview of the technologies that are available. The examples are purposefully short for copy-pasting, so getting started with this stuff is easier.


After reading a lot of stuff lately from the man, the myth, the legend, @brendangregg, it looks like DTrace is plain awesome. But adoption on Linux may take forever, if it happens at all: the open DTrace4Linux port by Paul Fox seems to be pretty much a one-man show, and I read somewhere that Oracle's DTrace is just a wrapper around SystemTap (sadly I have no link for this, so treat it as unverified). Going with the alternatives therefore seems the way to go on Linux.

On FreeBSD it seems: 'Just use DTrace.'

On Linux the answer is not quite as simple, so this post might grow quite a bit over the following paragraphs.


For the sake of completeness, here is a bunch of DTrace one-liners:

# new processes with their arguments
dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'

# files opened, by process
dtrace -n 'syscall::open*:entry { printf("%s %s",execname,copyinstr(arg0)); }'

# syscall count by program
dtrace -n 'syscall:::entry { @num[execname] = count(); }'

# syscall count by syscall
dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'

# syscall count by process (PID and name)
dtrace -n 'syscall:::entry { @num[pid,execname] = count(); }'

# disk I/O size by process
dtrace -n 'io:::start { printf("%d %s %d",pid,execname,args[0]->b_bcount); }'

# pages paged in, summed by process
dtrace -n 'vminfo:::pgpgin { @pg[execname] = sum(arg0); }'


eBPF is under active development within the Linux kernel. You can read about the latest changes in version 4.4 here, though kernel developers call these things scary stuff.

Somewhere in a presentation Brendan compared DTrace to eBPF like the Kitty Hawk to a jet engine. That, besides eBPF being 'in-kernel', is the reason why it will most likely be the most important tracer on Linux some day.

A little presentation on BPF can be found here.


Until Linux's extended Berkeley Packet Filter (eBPF) is real prime-time material, SystemTap (stap) should do well, Brendan thought, as can be seen here.

SystemTap has got two modes:

  • Awk/C like language, gets the job done
  • Embedded C mode aka "guru mode" in case you need it


Most distributions ship prepackaged versions of what you need. Well, at least Debian did, and maybe CentOS, too, IIRC. After installing, run stap-prep, which should tell you what else you have to install. (Usually you need the debug symbols and headers for your kernel to make SystemTap work.)
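
Before running stap-prep, it can help to check which of the tracers mentioned in this post are installed at all. A minimal sketch; the function name check_tracers is made up, and the tool list is just the commands used in this post:

```shell
# check_tracers: report for each tracing tool whether it is on the PATH.
# Hypothetical helper; adjust the tool list to taste.
check_tracers() {
    for tool in dtrace stap perf strace; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%-8s found at %s\n' "$tool" "$(command -v "$tool")"
        else
            printf '%-8s missing\n' "$tool"
        fi
    done
}

check_tracers
```

A tool that is reported as found may still fail at runtime, for example when the kernel debug symbols are missing, so this only answers the "is it installed" half of the question.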


TODO place some useful oneliners here

# show processes opening files in realtime
# Brendan wrote in his 'Systems Performance' book: "I've never actually seen this work."
# I feel proud, it did for me. ;)
# on newer systems most opens go through openat, so syscall.openat may be needed instead
stap -ve 'probe syscall.open { printf ("%30s %-100s\n", execname(), user_string($filename)); }'



# overview of all probe types, grouped by their first component
stap --dump-probe-types | awk -F. 'BEGIN {current=""; print "\n\033[31;1mstap -ve \"global s; probe ... {...}\"\033[0m\n"} {if (current != $1) { current=$1; printf "\n\033[33;1m%s\033[0m\n",current } print $0}' | less -R
# the same for all probe aliases
stap --dump-probe-aliases | awk -F. 'BEGIN {current=""; print "\n\033[31;1mstap -ve \"global s; probe ... {...}\"\033[0m\n"} {if (current != $1) { current=$1; printf "\n\033[33;1m%s\033[0m\n",current } print $0}' | less -R
# all functions that can be used in scripts
echo $'\n\e[31;1mstap --dump-functions\e[0m\n'; stap --dump-functions

# some other examples, for the sake of completeness
stap -l 'kernel.function("acpi_*")' | sort
stap -l 'module("ohci1394").function("*")' | sort
stap -L 'module("thinkpad_acpi").function("brightness*")' | sort

further stuff

A pretty new example on Heatmaps using stap can be found here and here.

You can also print histograms directly to the console, which is a damn awesome feature.
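
To give an idea of what such a console histogram looks like without installing anything, here is a small awk sketch that buckets numbers by powers of two. It is not SystemTap's implementation, log2_hist is a made-up name, and the input is assumed to be one positive number per line (latencies, sizes, whatever you collected):

```shell
# log2_hist: read one positive number per line and print a power-of-two
# histogram as "<bucket start> | <bar> <count>". Illustration only.
log2_hist() {
    awk '
        $1 > 0 {
            b = 0; v = $1
            while (v >= 2) { v = v / 2; b++ }   # b = floor(log2(value))
            hist[b]++
            if (b > max) max = b
        }
        END {
            for (b = 0; b <= max; b++) {
                bar = ""
                for (i = 0; i < hist[b]; i++) bar = bar "@"
                printf "%8d | %s %d\n", 2 ^ b, bar, hist[b]
            }
        }
    '
}

# usage with some sample values
printf '%s\n' 1 2 3 4 5 100 | log2_hist
```

Each row covers the values from its bucket start up to, but not including, the next power of two, so 2 and 3 land in the same row.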


According to Brendan, they use perf quite heavily over at Netflix. Interestingly, Netflix no longer runs its own infrastructure but relies completely on Amazon's cloud services instead, I learned somewhere last week. You really have to know how to measure your available performance when pulling such stunts, so perf sure sounds like a good idea.

Here, in a nutshell, is the stuff that helped me most with perf:


Which functions are using the most CPU right now?

perf top

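Let's do some profiling. In short, create a baseline data set of your system first, then start your application and collect a second data set from your 'system under test' (SUT). Afterwards just compare the two:

# first data set (baseline); the file names are arbitrary
perf record -p <PID> -o baseline.data -- sleep 30
# second data set, with the application under test running
perf record -p <PID> -o sut.data -- sleep 30
perf diff baseline.data sut.data

perf report -n --stdio -i sut.data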

If regular strace is too heavy on your system, give perf trace a try.

This is all you need if you don't want to go down the rabbit hole. If you do, just proceed:


# run perf's built-in sanity tests to check what works on this system
perf test

# helps with exploring what is actually possible
## alphabetically, from Brendan
perf list | awk -F':' '/Tracepoint event/ { lib[$1]++ } END { for (i in lib) { printf " %-16s %d\n",i,lib[i] } }' | sort | column
## by count
perf list | awk -F':' '/Tracepoint event/ { lib[$1]++ } END { for (i in lib) { printf " %-16s %d\n",i,lib[i] } }' | sort -nk2 | tac | column
## grouped by subsystem, group names in colour
perf list | awk -F'[: \t]+' 'BEGIN {current=""} /Tracepoint event/ {if (current != $2) { current=$2; print $2, "\n\t", $3 } else {print "\t", $3}}' | sed -r "s/^[[:graph:]]+/$(printf '\033[33;1m&\033[0m')/" | less -R

## syscall tracepoints only, sorted, with the sys_enter_ / sys_exit_ prefixes stripped
perf list | awk -F'[: \t]+' 'BEGIN {current=""} /Tracepoint event/ {if (current != $2) { current=$2; print $2, "\n\t", $3 } else {print "\t", $3}}' | grep -e syscalls -e sys_enter -e sys_exit | sed -r -e 's/^syscalls/& ( with prefixes: sys_enter_ \/ sys_exit_ )/' -e "s/^[[:graph:]]+/$(printf '\033[33;1m&\033[0m')/" -e 's/sys_enter_([[:graph:]])/\1/' -e 's/sys_exit_([[:graph:]])/\1/' | uniq | awk 'BEGIN { flag = 1; id = 0 } /with prefixes:/ { print $0; flag = 0; next } { if (flag) {print $0} else {array[id]=$0; id++}} END { for (i in array){print array[i] | "sort" }}' | less -R
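
The counting one-liners above are easier to follow when spelled out as a function. This is a minimal sketch of the same idea; count_tracepoints is a made-up name, and it assumes perf list prints tracepoints as "subsystem:event" followed by "[Tracepoint event]":

```shell
# count_tracepoints: read `perf list` output on stdin and print
# "<subsystem> <number of tracepoints>", largest group first.
# Hypothetical helper, a readable variant of the one-liners above.
count_tracepoints() {
    awk '
        /Tracepoint event/ {
            split($1, parts, ":")    # "sched:sched_switch" -> "sched"
            groups[parts[1]]++
        }
        END { for (g in groups) printf "%-16s %d\n", g, groups[g] }
    ' | sort -k2,2nr -k1,1
}

# usage, assuming perf is installed:
#   perf list | count_tracepoints | column
```

Because the function only reads stdin, it can also be fed a saved copy of the perf list output from another machine.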

Unless otherwise credited all material Creative Commons License by sjas