
BPF: A Tour of Program Types

Notes on BPF (1) – A Tour of Program Types

Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, offering an in-depth look at the kernel’s “Berkeley Packet Filter” — a useful and extensible kernel facility for much more than packet filtering.

If you follow Linux kernel development discussions and blog posts, you’ve probably heard BPF mentioned a lot lately. It’s being used for high-performance load-balancing, DDoS mitigation and firewalling, safe instrumentation of kernel and user-space code, and much more! BPF does this by supporting a safe, flexible programming environment in many different contexts: networking datapaths, kernel probes, perf events and more. Safety is key – in most environments, adding a kernel module introduces significant risk, but BPF programs are verified at load time to ensure that, for example, no out-of-bounds accesses can occur. In addition, BPF supports just-in-time compilation of its bytecode to the native instruction set, so BPF programs are also fast. If you’re interested in topics like fast packet processing and observability, learning BPF should definitely be on your to-do list.

Here we try to give a guide to BPF, covering a range of topics which will hopefully help developers trying to get to grips with writing BPF programs. This guide is based on Linux 4.14 (which is the kernel for Oracle Linux UEK5), so do bear that in mind, as there have been a bunch of changes in BPF since then, and some package names etc. may differ for other distributions.

Because BPF in Linux is such a fast-moving target, I’m going to try and point you at relevant places in the kernel codebase that may help you get a sense for what the technology can do. The samples/bpf directory is a great place to look to see what others have done, but here we’ll also dig into the implementation as a reference, as it may give you some ideas about how to create new BPF programs. The aim here isn’t to give a deep dive into BPF internals, but rather to give a few pointers to areas in the code which reveal BPF functionality. The source tree I’m using for reference is our UEK5 release, based on Linux 4.14.35. See https://github.com/oracle/linux-uek/tree/uek5/master . Most of the functionality described can be found in any recent kernel. The bpf-next tree (where BPF kernel development happens) can be found at

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

An important caveat: again, what follows describes the state as of the 4.14 kernel. A lot has changed since, but hopefully with the pointers into the code, you’ll be better equipped to figure out what some of these changes are!

The aim here is to be able to get to the point of working on interesting problems with BPF. However before we get there, let’s look at the various pieces and how they fit together.

The first question to ask is what can we do with BPF? What kinds of programs can we write?

To get a sense for this, let’s examine the enumerated type definition from include/uapi/linux/bpf.h https://github.com/oracle/linux-uek/blob/uek5/master/include/uapi/linux/bpf.h#L117

enum bpf_prog_type {
    BPF_PROG_TYPE_UNSPEC,
    BPF_PROG_TYPE_SOCKET_FILTER,
    BPF_PROG_TYPE_KPROBE,
    BPF_PROG_TYPE_SCHED_CLS,
    BPF_PROG_TYPE_SCHED_ACT,
    BPF_PROG_TYPE_TRACEPOINT,
    BPF_PROG_TYPE_XDP,
    BPF_PROG_TYPE_PERF_EVENT,
    BPF_PROG_TYPE_CGROUP_SKB,
    BPF_PROG_TYPE_CGROUP_SOCK,
    BPF_PROG_TYPE_LWT_IN,
    BPF_PROG_TYPE_LWT_OUT,
    BPF_PROG_TYPE_LWT_XMIT,
    BPF_PROG_TYPE_SOCK_OPS,
    BPF_PROG_TYPE_SK_SKB,
};

What are all of these program types? To understand this, we will ask the same set of questions for each program type:

  • what do I do with this program type?
  • how do I attach my BPF program for this program type?
  • what context is provided to my program? By this we mean what argument(s) and data are provided for us to work with.
  • when does the attached program get run? It’s important to understand this, as it gives a sense of where for example in the network stack a filter is applied.

We won’t worry about how you create the programs for now; that side of things is relatively uniform across the various program types.

1. socket-related program types – SOCKET_FILTER, SK_SKB, SOCK_OPS

First, let’s consider the socket-related program types which allow us to filter, redirect socket data and monitor socket events. The filtering use case relates to the origins of BPF. When observing the network we want to see only a portion of network traffic, for example all traffic from a troublesome system. Filters are used to describe the traffic we want to see, and ideally we want the filtering to be fast; we also want to give users an open-ended set of filtering options. But we have a problem: we want to throw away unneeded data as early as possible, and to do that we need to filter in kernel context. Consider the alternative to an in-kernel solution – incurring the cost of copying packets to user-space and filtering there. That would be very expensive, especially if we only want to see a portion of the network traffic and throw away the rest.

To achieve this, a safe mini-language was invented to translate high-level filters into a bytecode program that the kernel can use (termed classic BPF, cBPF). The aim of the language was to support a flexible set of filtering options while being fast and safe. Filters written in this assembly-like language could be pushed by userspace programs such as tcpdump to accomplish filtering in-kernel. See

https://www.tcpdump.org/papers/bpf-usenix93.pdf

…for the classic paper describing this work. Modern eBPF took these concepts, expanded the register and instruction set, added data structures called maps, hugely expanded the kinds of events we can attach to, and much more!

For socket filtering, the common case is to attach to a raw socket (SOCK_RAW), and in fact you’ll notice most programs that do socket filtering have a line like this:

    s = socket(AF_PACKET,SOCK_RAW,htons(ETH_P_ALL));

Creating such a socket, we specify the domain (AF_PACKET), socket type (SOCK_RAW) and protocol (all packet types). In the Linux kernel, receive of raw packets is implemented by the raw_local_deliver() function. It is called by ip_local_deliver_finish(), just prior to calling the relevant IP protocol’s handler, which is where the packet is passed to TCP, UDP, ICMP etc. So at this point the traffic has not been associated with a specific socket; that happens later, when the IP stack figures out the mapping from packet to layer 4 protocol, and then to the relevant socket (if any). You can see the cBPF bytecodes generated by tcpdump by using the -d option. Here I run tcpdump on the wlp4s0 interface, filtering TCP traffic only:

# tcpdump -i wlp4s0 -d 'tcp'
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 7
(002) ldb      [20]
(003) jeq      #0x6             jt 10    jf 4
(004) jeq      #0x2c            jt 5    jf 11
(005) ldb      [54]
(006) jeq      #0x6             jt 10    jf 11
(007) jeq      #0x800           jt 8    jf 11
(008) ldb      [23]
(009) jeq      #0x6             jt 10    jf 11
(010) ret      #65535
(011) ret      #0

Without much deep knowledge we can get a feel for what’s happening here. On line 000 we load the half-word at offset 12 in the packet – the Ethernet header’s protocol type field. On line 001, we jump to 002 if it matches ETH_P_IPV6 (0x86dd) (jt 2); otherwise we jump to 007 (jf 7) to handle the IPv4 case.

Let’s look at the IPv6 case first. On lines 002 and 003 we load the byte at offset 20 – the IPv6 next-header field – and jump to 010 (success) if it is 6 (IPPROTO_TCP); line 010 returns 65535, which is the maximum length, so we accept the packet. Otherwise we jump to 004, where we compare against 0x2c, which indicates an IPv6 fragment header is present. If so, we check whether the fragment header (offset 54) specifies IPPROTO_TCP as the next protocol, jumping to 010 (success) or 011 (failure) accordingly. Returning 0 means dropping the packet for filtering purposes.

Handling IPv4 is simpler; on 007 (arrived at via “jf” on 001), we check for ETH_P_IP (0x800) and, if found, we verify that the IP protocol is TCP. And we’re done! Remember though, this is cBPF; eBPF has an extended instruction/op set similar to x86_64, and additional registers.

One other thing to note – socket filtering is distinct from netfilter-based filtering. Netfilter defines its own set of hooks with NF_HOOK() definitions, which netfilter-based technologies such as iptables use to filter traffic. You might think – couldn’t we use eBPF there too? And you’d be right! bpfilter is intended to replace iptables-based filtering in more recent Linux kernels.

So with all that in mind, let’s return to examining the socket-related program types.

1.1 BPF_PROG_TYPE_SOCKET_FILTER

  • What do I do with it? The filtering actions include dropping packets (if the program returns 0) or trimming packets (if the program returns a length less than the original). See sk_filter_trim_cap() and its call to bpf_prog_run_save_cb(). Note that we’re not trimming or dropping the original packet, which still reaches the intended socket intact; we’re working with a copy of the packet metadata which raw sockets can access for observability. In addition to filtering packet flow to our socket, we can also do things that have side-effects, for example collecting statistics in BPF maps.

  • How do I attach my program? BPF programs can be attached to sockets via the SO_ATTACH_BPF setsockopt(), which passes in a file descriptor to the program.

  • What context is provided? A pointer to the struct __sk_buff containing packet metadata/data. This structure is defined in include/uapi/linux/bpf.h, and includes key fields from the real sk_buff. The BPF verifier converts access to valid __sk_buff fields into offsets into the “real” sk_buff; see https://lwn.net/Articles/636647/ for more details.

  • When does it run? Socket filters run for receive in sock_queue_rcv_skb(), which is called by various protocols (TCP, UDP, ICMP, raw sockets etc.) and can be used to filter inbound traffic.

To give a sense for what programs look like, here we will create a filter that trims packet data on the basis of protocol type: for IPv4 TCP, we’ll grab the Ethernet + IPv4 + TCP headers only, while for UDP we’ll take the Ethernet + IPv4 + UDP headers only. To keep the example simple we won’t deal with IPv4 options, and in all other cases we return 0 (drop the packet).

#include <linux/bpf.h>
#include <linux/in.h>
#include <linux/types.h>
#include <linux/string.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include "bpf_helpers.h"

#ifndef offsetof
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
#endif

/*
 * We are only interested in TCP/UDP headers, so drop every other protocol
 * and trim packets after the TCP/UDP header by returning length of
 * ether header + IPv4 header + TCP/UDP header.
 */

SEC("socket")
int bpf_prog1(struct __sk_buff *skb)
{
        int proto = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        int size = ETH_HLEN + sizeof(struct iphdr);

        switch (proto) {
        case IPPROTO_TCP:
             size += sizeof(struct tcphdr);
             break;
        case IPPROTO_UDP:
             size += sizeof(struct udphdr);
             break;
        default:
             size = 0;
             break;
        }
        return size;
}

char _license[] SEC("license") = "GPL";

This program can be compiled into BPF bytecode using LLVM/clang by specifying a target arch of “bpf”, and once that is done the resulting object will contain an ELF section called “socket”. That is our program. The next step is to use the BPF system call to assign a file descriptor to the program, then attach it to the socket. In samples/bpf, you can see that bpf_load.c scans the ELF sections, and sections with names prefixed by “socket” are recognized as BPF_PROG_TYPE_SOCKET_FILTER programs. If you’re adding a sample I’d recommend including bpf_load.h so you can just call load_bpf_file() on your BPF program. For example, in samples/bpf/sockex1_user.c we take the filename of our program (sockex1) and load sockex1_kern.o, the associated BPF program object. Then we open a raw socket to loopback (lo) and attach the program there:

        snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
        if (load_bpf_file(filename)) {
                printf("%s", bpf_log_buf);
                return 1;
        }
        sock = open_raw_sock("lo");
        assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
                          sizeof(prog_fd[0])) == 0);
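
For reference, the compile step that produces such an object typically looks something like this; a sketch, since the in-tree samples are actually built via the kernel build system and the exact include flags vary by setup:

# clang -O2 -Wall -target bpf -c sockex1_kern.c -o sockex1_kern.o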

1.2 BPF_PROG_TYPE_SOCK_OPS

  • What do I do with it? Attach a BPF program to catch socket operations such as connection establishment, retransmit timeout etc. Once an operation is caught, options can also be set via bpf_setsockopt(), so for example on passive establishment of a connection from a system not on the same subnet, we could lower the MTU so we won’t have to worry about intermediate routers fragmenting packets. Programs can return success (0) or failure (a negative value), and a reply value can be set to indicate the desired value for a socket option (e.g. TCP rwnd). See https://lwn.net/Articles/727189/ for full details, and look for tcp_call_bpf()’s inline definition in include/net/tcp.h to see how TCP handles execution of such programs. Another use case is for sockmap updates in combination with BPF_PROG_TYPE_SK_SKB programs; the bpf_sock_ops struct pointer passed into the BPF_PROG_TYPE_SOCK_OPS program is used to update the sockmap, associating a value for that socket. Later sk_skb programs can reference those values to specify which socket to redirect to via bpf_sk_redirect_map() calls. If this sounds confusing, I’d recommend taking a look at the code in samples/sockmap; a minimal sock_ops sketch also follows this list.

  • How do I attach my program? It is attached to a cgroup file descriptor using BPF_CGROUP_SOCK_OPS attach type.

  • What context is provided? The argument provided is the context, a struct bpf_sock_ops *. The op field specifies the operation: BPF_SOCK_OPS_RWND_INIT, BPF_SOCK_OPS_TCP_CONNECT_CB etc. The reply field can be used to indicate to the caller a new value for a parameter set.

/* User bpf_sock_ops struct to access socket values and specify request ops
 * and their replies.
 * Some of this fields are in network (bigendian) byte order and may need
 * to be converted before use (bpf_ntohl() defined in samples/bpf/bpf_endian.h).
 * New fields can only be added at the end of this structure
 */
struct bpf_sock_ops {
    __u32 op;
    union {
        __u32 reply;
        __u32 replylong[4];
    };
    __u32 family;
    __u32 remote_ip4;    /* Stored in network byte order */
    __u32 local_ip4;    /* Stored in network byte order */
    __u32 remote_ip6[4];    /* Stored in network byte order */
    __u32 local_ip6[4];    /* Stored in network byte order */
    __u32 remote_port;    /* Stored in network byte order */
    __u32 local_port;    /* stored in host byte order */
};
  • When does it run? As per the above article, unlike other BPF program types that expect to be called at a particular place in the codebase, SOCK_OPS programs can be called at different places, using an “op” field to indicate the context. See include/uapi/linux/bpf.h for the enumerated BPF_SOCK_OPS_* definitions, but they include events like retransmit timeout, connection establishment etc.
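
To make this concrete, here is a minimal sketch of a sock_ops program, modeled on samples/bpf/tcp_rwnd_kern.c; it replies with an initial receive window for BPF_SOCK_OPS_RWND_INIT and leaves other operations alone (the value 40 is arbitrary, purely for illustration):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("sockops")
int bpf_set_rwnd(struct bpf_sock_ops *skops)
{
        int rv = -1;    /* -1 means "not handled"; kernel defaults apply */

        if (skops->op == BPF_SOCK_OPS_RWND_INIT)
                rv = 40;        /* arbitrary initial window, for illustration */

        skops->reply = rv;
        return 1;       /* 1 indicates success to the sock_ops machinery */
}

char _license[] SEC("license") = "GPL";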

1.3 BPF_PROG_TYPE_SK_SKB

  • What do I do with it? Allows users to access skb and socket details such as port and IP address with a view to supporting redirect of skbs between sockets. See https://lwn.net/Articles/731133/ . This functionality is used in conjunction with a sockmap – a special-purpose BPF map that contains references to socket structures and associated values. sockmaps are used to support redirection. The program is attached and the bpf_sk_redirect_map() helper can be used to carry out the redirection. The general approach is that we catch socket creation events with sock_ops BPF programs, associate values with the sockmap for those sockets, and then use data at the sk_skb instrumentation points to inform socket redirection – this is termed the verdict, and the program for this is attached to the sockmap via BPF_SK_SKB_STREAM_VERDICT (a minimal verdict sketch follows this list). The verdict can be __SK_DROP, __SK_PASS, or __SK_REDIRECT. Another use case for this program type is in the strparser framework (https://www.kernel.org/doc/Documentation/networking/strparser.txt). BPF programs can be used to parse message streams via callbacks for read operations, verdict and read completion. TLS and KCM use stream parsing.

  • How do I attach my program? A redirection program is attached to a sockmap as BPF_SK_SKB_STREAM_VERDICT; it should return the result of bpf_sk_redirect_map(). A strparser program is attached via BPF_SK_SKB_STREAM_PARSER and should return the length of data parsed.

  • What context is provided? A pointer to the struct __sk_buff containing packet metadata/data. However, more fields are accessible to the sk_skb program type. The extra set of fields available is documented in include/uapi/linux/bpf.h like so:

/* Accessed by BPF_PROG_TYPE_sk_skb types from here to ... */
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
__u32 local_ip4;    /* Stored in network byte order */
__u32 remote_ip6[4];    /* Stored in network byte order */
__u32 local_ip6[4]; /* Stored in network byte order */
__u32 remote_port;  /* Stored in network byte order */
__u32 local_port;   /* stored in host byte order */
/* ... here. */

So from the above alone we can see we can gather information about the socket, since the above represents the key information that identifies the socket uniquely (protocol is already available in the globally-accessible portion of the struct __sk_buff).

  • When does it run? A stream parser can be attached to a socket via BPF_SK_SKB_STREAM_PARSER attachment to a sockmap, and the parser runs on socket receive via smap_parse_func_strparser() in kernel/bpf/sockmap.c . BPF_SK_SKB_STREAM_VERDICT also attaches to the sockmap, and is run via smap_verdict_func().
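
As a sketch of what a verdict program might look like, here we redirect everything to whichever socket a sock_ops program stored at key 0 in the sockmap – a deliberately simplistic policy. Note that the signature of bpf_sk_redirect_map() has changed across kernel versions; this follows the 4.14-era three-argument form used by the samples:

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") sock_map = {
        .type = BPF_MAP_TYPE_SOCKMAP,
        .key_size = sizeof(int),
        .value_size = sizeof(int),
        .max_entries = 20,
};

SEC("sk_skb")
int bpf_prog_verdict(struct __sk_buff *skb)
{
        /* __SK_REDIRECT to the socket stored at key 0; a real program
         * would derive the key from skb/socket fields.
         */
        return bpf_sk_redirect_map(&sock_map, 0, 0);
}

char _license[] SEC("license") = "GPL";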

2. tc (traffic control) subsystem programs

Next let’s examine the program type related to the TC kernel packet scheduling subsystem. See the tc(8) manpage for a general introduction, and tc-bpf(8) for BPF specifics.

2.1 tc_cls_act : qdisc classifier

  • What do I do with it? tc_cls_act allows us to use BPF programs as classifiers and actions in tc, the Linux QoS subsystem. What’s even better is that the tc(8) command has eBPF support too, so we can directly load BPF programs as classifiers and actions for inbound (ingress) and outbound (egress) traffic. See http://man7.org/linux/man-pages/man8/tc-bpf.8.html for a description of how to use tc’s BPF functionality. tc programs can classify, modify, redirect or drop packets.

  • How do I attach my program? tc(8) can be used; see tc-bpf(8) for details. The basics are that we create a “clsact” qdisc for a network device, and add ingress and egress classifiers/actions by specifying the BPF object and relevant ELF section. For example, to add an ingress classifier to eth0 in ELF section my_elf_sec from myprog_kernel.o (a bpf-bytecode-compiled object file):

# tc qdisc add dev eth0 clsact
# tc filter add dev eth0 ingress bpf da obj myprog_kernel.o sec my_elf_sec
  • What context is provided? A pointer to struct __sk_buff packet metadata/data.

  • When does it get run? As mentioned above, a classifier qdisc must be added, and once it is we can attach BPF programs to classify inbound and outbound traffic; a minimal classifier sketch follows this list. Implementation-wise, act_bpf.c and cls_bpf.c implement action/classifier modules. On ingress/egress, sch_handle_ingress()/sch_handle_egress() call tcf_classify(). In the case of ingress, we do classification via the core network interface receive function, so we are getting the packet after the driver has processed it but before IP etc. see it. On egress, the filtering is done prior to submitting to the device queue for transmit.
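
To give a feel for a classifier in direct-action (“da”) mode, here is a minimal sketch that drops IPv6 traffic and passes everything else; the section name matches the my_elf_sec used in the attach command above, and the drop-IPv6 policy is purely for illustration:

#include <uapi/linux/bpf.h>
#include <uapi/linux/if_ether.h>
#include <uapi/linux/in.h>
#include <uapi/linux/pkt_cls.h>
#include "bpf_helpers.h"

SEC("my_elf_sec")
int cls_main(struct __sk_buff *skb)
{
        /* skb->protocol is the ethertype in network byte order */
        if (skb->protocol == htons(ETH_P_IPV6))
                return TC_ACT_SHOT;     /* drop */

        return TC_ACT_OK;               /* pass to the stack as usual */
}

char _license[] SEC("license") = "GPL";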

3. XDP: the eXpress Data Path

The key design goal for XDP is to introduce programmability into the network datapath. The aim is to provide the XDP hooks as close to the device as possible (before the OS has created sk_buff metadata) to maximize performance while supporting a common infrastructure across devices. Supporting XDP like this requires driver changes. For an example see drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c. A BPF net device op (ndo_bpf) is added. For bnxt it supports XDP_SETUP_PROG and XDP_QUERY_PROG actions; the former configures the device for XDP, reserving rings and setting the program as active, while the latter returns the BPF program id. BPF-specific transmit and receive functions are provided and called by the real send/receive functions if needed.

3.1 BPF_PROG_TYPE_XDP

  • What do I do with it? XDP allows access to packet data as early as possible, before packet metadata (struct sk_buff) has been assigned. Thus it is a useful place to do DDoS mitigation or load balancing, since such activities can often avoid the expensive overhead of sk_buff allocation. XDP is all about supporting run-time programming of the kernel via BPF hooks, but working in concert with the kernel itself; i.e. it is not a kernel-bypass mechanism. Actions supported include XDP_PASS (pass into network processing as usual), XDP_DROP (drop), XDP_TX (transmit) and XDP_REDIRECT. See include/uapi/linux/bpf.h for the “enum xdp_action”.

  • How do I attach my program? Via a netlink socket message. A netlink socket – socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE) – is created and bound, and then we send a netlink message of type NLA_F_NESTED | 43; this specifies an XDP message. The message contains the BPF program fd and the interface index (ifindex). See samples/bpf/bpf_load.c for an example.

  • What context is provided? An xdp metadata pointer; struct xdp_md * . XDP metadata is deliberately lightweight; from include/uapi/linux/bpf.h:

/* user accessible metadata for XDP packet hook
 * new fields must be added to the end of this structure
 */
struct xdp_md {
        __u32 data;
        __u32 data_end;
};
  • When does it get run? “Real” XDP is implemented at the driver level, and transmit/receive ring resources are set aside for XDP usage. For cases where drivers do not support XDP, there is the option of using “generic” XDP, which is implemented in net/core/dev.c. The downside is that we do not bypass skb allocation; generic XDP simply allows us to use XDP for such devices too. A minimal XDP program sketch follows.
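
Here is a minimal XDP sketch in the style of samples/bpf/xdp1_kern.c. Note the bounds check against data_end: the verifier rejects any program that dereferences packet data without first proving it lies within the packet. Dropping IPv6 is, again, purely for illustration:

#define KBUILD_MODNAME "xdp_example"
#include <uapi/linux/bpf.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include "bpf_helpers.h"

SEC("xdp")
int xdp_prog(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        struct ethhdr *eth = data;

        /* prove to the verifier that the ethernet header is in-bounds */
        if (data + sizeof(*eth) > data_end)
                return XDP_DROP;

        if (eth->h_proto == htons(ETH_P_IPV6))
                return XDP_DROP;        /* illustration only */

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";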

4. kprobes, tracepoints and perf events

kprobes, tracepoints and perf events all provide kernel instrumentation. kprobes – https://www.kernel.org/doc/Documentation/kprobes.txt – allow instrumentation of specific functions – entry of a function can be monitored via a kprobe, along with most instructions within a function, or entry/return can be instrumented via a kretprobe. When one of these probes is enabled, the code at the probe point is saved and replaced with a breakpoint instruction. When this breakpoint is hit, a trap is generated, registers are saved and we branch to the relevant instrumentation handler. For example, kprobes are handled by kprobe_dispatcher(), which gets the address of the kprobe and register context as arguments. kretprobes are implemented via kprobes; a kprobe fires on entry and modifies the return address, saving the original and replacing it with the location of the instrumentation handler.

Tracepoints – https://www.kernel.org/doc/Documentation/trace/tracepoints.rst – are similar, but rather than being enabled at particular instructions, they are explicitly marked at sites in code, and if enabled can be used to collect debugging information at those sites of interest. The same tracepoint can be declared in multiple places; for example trace_drv_return_int() is called in multiple places in net/mac80211/driver-ops.c .

Perf events – https://perf.wiki.kernel.org/index.php/Main_Page – are the basis for eBPF support for these program types. BPF essentially piggy-backs on the existing infrastructure for event sampling, allowing us to attach programs to perf events of interest, which include kprobes, uprobes and tracepoints, as well as other software events; indeed hardware events can be monitored too.

These instrumentation points are what gives BPF the capability to be a general-purpose tracing tool as well as a means for supporting the original networking-centric use cases like socket filtering.

4.1 BPF_PROG_TYPE_KPROBE

  • What do I do with it? Instrument code in any kernel function (bar a few exceptions) via a kprobe, or instrument entry/return via a kretprobe. k[ret]probe_perf_func() executes a BPF program attached to the probe point (a small program sketch follows this list). Note that this program type can also be used to attach to u[ret]probes – see https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt for details.

  • How do I attach my program? When the kprobe is created via sysfs, it has an id associated with it, stored in /sys/kernel/debug/tracing/events/kprobes/<probename>/id (or .../uprobes/<probename>/id for uprobes). https://www.kernel.org/doc/Documentation/trace/kprobetrace.txt contains details on how to create a kprobe using sysfs. For example, to create a probe called “myprobe” on entry to tcp_retransmit_skb() and retrieve its id:

# echo 'p:myprobe tcp_retransmit_skb' > /sys/kernel/debug/tracing/kprobe_events
# cat /sys/kernel/debug/tracing/events/kprobes/myprobe/id
2266

We can use that probe id to open a perf event, enable it, and set the BPF program for that perf event to be our program. See samples/bpf/bpf_load.c in the load_and_attach() function for how this can be done for k[ret]probes. The code might look something like this:

        struct perf_event_attr attr = {};
        int eventfd, programfd;
        int probeid;

        /* Load BPF program and assign programfd to it; and get probeid of probe from sysfs */
        attr.type = PERF_TYPE_TRACEPOINT;
        attr.sample_type = PERF_SAMPLE_RAW;
        attr.sample_period = 1;
        attr.wakeup_events = 1;
        attr.config = probeid;
        eventfd = sys_perf_event_open(&attr, -1, 0, programfd, 0);
        if (eventfd < 0)
                return -errno;
        if (ioctl(eventfd, PERF_EVENT_IOC_ENABLE, 0)) {
                close(eventfd);
                return -errno;
        }
        if (ioctl(eventfd, PERF_EVENT_IOC_SET_BPF, programfd)) {
                close(eventfd);
                return -errno;
        }
  • What context is provided? A struct pt_regs *ctx, from which the registers can be accessed. Much of this is platform-specific, but some general-purpose functions exist, such as regs_return_value(regs), which returns the value of the register that holds the function return value (regs->ax on x86).

  • When does it get run? When the probe is enabled and the breakpoint is hit, k[ret]probe_perf_func() executes a BPF program attached to the probe point via trace_call_bpf(). The story is similar for u[ret]probe_perf_func().
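
The kprobe program itself can be very small. Here is a sketch in the style of the samples/bpf trace examples, logging a message whenever tcp_retransmit_skb() fires (kprobe programs also carry a version section, since they are kernel-version specific):

#include <uapi/linux/bpf.h>
#include <linux/ptrace.h>
#include <linux/version.h>
#include "bpf_helpers.h"

SEC("kprobe/tcp_retransmit_skb")
int trace_retransmit(struct pt_regs *ctx)
{
        char fmt[] = "retransmit!\n";

        /* output is visible in /sys/kernel/debug/tracing/trace_pipe */
        bpf_trace_printk(fmt, sizeof(fmt));
        return 0;
}

char _license[] SEC("license") = "GPL";
__u32 _version SEC("version") = LINUX_VERSION_CODE;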

4.2 BPF_PROG_TYPE_TRACEPOINT

  • What do I do with it? Instrument tracepoints in kernel code. Tracepoints can be enabled via sysfs, as is the case with kprobes. The list of trace events can be seen under /sys/kernel/debug/tracing/events.

  • How do I attach my program? As we saw above with kprobes, each trace event has an id associated with it; for tracepoints, the id is found under /sys/kernel/debug/tracing/events/<category>/<event>/id . We can use that id to open a perf event, enable it, and set the BPF program for that perf event to be our program. See samples/bpf/bpf_load.c in the load_and_attach() function for how this can be done for tracepoints; the above code snippet for kprobes works for tracepoints also. As an example, here we retrieve the id of the net/net_dev_xmit tracepoint:

# cat /sys/kernel/debug/tracing/events/net/net_dev_xmit/id
2270
  • What context is provided? The context provided by the specific tracepoint; arguments and data types are associated with the tracepoint definition.

  • When does it get run? When the tracepoint is enabled and hit, perf_trace_*() (see the definitions generated via include/trace/perf.h) calls perf_trace_run_bpf_submit(), which will invoke the BPF program via trace_call_bpf(). A tracepoint program sketch follows.
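
A tracepoint program looks much like a kprobe program, except that the SEC() name encodes the tracepoint and the context struct must be laid out to match the fields listed in the tracepoint’s format file. The struct below is a hypothetical subset; the authoritative layout is in /sys/kernel/debug/tracing/events/net/net_dev_xmit/format:

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

/* hypothetical subset of the net/net_dev_xmit context */
struct net_dev_xmit_ctx {
        unsigned long long pad;         /* common tracepoint fields */
        void *skbaddr;
        unsigned int len;
        int rc;
};

SEC("tracepoint/net/net_dev_xmit")
int trace_xmit(struct net_dev_xmit_ctx *ctx)
{
        char fmt[] = "xmit len %u rc %d\n";

        bpf_trace_printk(fmt, sizeof(fmt), ctx->len, ctx->rc);
        return 0;
}

char _license[] SEC("license") = "GPL";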

4.3 BPF_PROG_TYPE_PERF_EVENT

  • What do I do with it? Instrument software and hardware perf events. These include events like syscalls, timer expiry, sampling of hardware events, etc. Hardware events include PMU (Performance Monitoring Unit) events, which tell us things like how many instructions completed etc. Perf event monitoring can be targeted at a specific process or group, or at a processor, and a sample period can be specified for profiling.

  • How do I attach my program? A similar model as per the above; we call perf_event_open() with an attribute set, enable the perf event via the PERF_EVENT_IOC_ENABLE ioctl(), and set the BPF program via the PERF_EVENT_IOC_SET_BPF ioctl(). For an example, see these snippets from samples/bpf/sampleip_user.c, which opens a software CPU-clock event for sampling:

        ...
        struct perf_event_attr pe_sample_attr = {
                .type = PERF_TYPE_SOFTWARE,
                .freq = 1,
                .sample_period = freq,
                .config = PERF_COUNT_SW_CPU_CLOCK,
                .inherit = 1,
        };
                ...
        pmu_fd[i] = sys_perf_event_open(&pe_sample_attr, -1 /* pid */, i,
                                        -1 /* group_fd */, 0 /* flags */);
        if (pmu_fd[i] < 0) {
                fprintf(stderr, "ERROR: Initializing perf sampling\n");
                return 1;
        }

        assert(ioctl(pmu_fd[i], PERF_EVENT_IOC_SET_BPF,
                     prog_fd[0]) == 0);
        assert(ioctl(pmu_fd[i], PERF_EVENT_IOC_ENABLE, 0) == 0);
        ...
  • What context is provided? A struct bpf_perf_event_data *, which looks like this:
struct bpf_perf_event_data {
    struct pt_regs regs;
    __u64 sample_period;
};
  • When does it get run? Depends on the perf event firing and the sample rate chosen, specified by the freq and sample_period fields in the perf event attribute structure. A kernel-side program sketch follows.
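
The kernel side of the sampleip example shows how the context is used; here is a trimmed sketch of samples/bpf/sampleip_kern.c that just logs the sampled instruction pointer (the real sample aggregates counts in a BPF map instead):

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include <uapi/linux/bpf_perf_event.h>
#include "bpf_helpers.h"

SEC("perf_event")
int do_sample(struct bpf_perf_event_data *ctx)
{
        __u64 ip = PT_REGS_IP(&ctx->regs);
        char fmt[] = "ip 0x%llx\n";

        bpf_trace_printk(fmt, sizeof(fmt), ip);
        return 0;
}

char _license[] SEC("license") = "GPL";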

5. cgroups-related program types

Cgroups are used to handle resource allocation, allowing or denying access to system resources such as CPU, network bandwidth etc. for groups of processes. One key use case for cgroups is containers; a container’s resource access is limited via cgroups while its activities are isolated by the various classes of namespace (network namespace, process ID namespace etc.). In the BPF context, we can create eBPF programs that allow or deny access. In include/linux/bpf-cgroup.h we can see definitions for execution of socket/skb programs, where __cgroup_bpf_run_filter_skb() is called, wrapped in a check that cgroup BPF is enabled:

#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)                  \
({                                                                 \
    int __ret = 0;                                                 \
    if (cgroup_bpf_enabled)                                        \
        __ret = __cgroup_bpf_run_filter_skb(sk, skb,               \
                            BPF_CGROUP_INET_INGRESS);              \
                                                                   \
    __ret;                                                         \
})

#define BPF_CGROUP_RUN_SK_PROG(sk, type)                           \
({                                                                 \
    int __ret = 0;                                                 \
    if (cgroup_bpf_enabled) {                                      \
        __ret = __cgroup_bpf_run_filter_sk(sk, type);              \
    }                                                              \
    __ret;                                                         \
})

If cgroups are enabled, we attach our program to the cgroup and it will be executed at the relevant hook points. To get an idea of the full list of hooks, consult include/uapi/linux/bpf.h and examine the enumerated type “bpf_attach_type” for BPF_CGROUP_* definitions.
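
For illustration, the userspace side of attaching a loaded program to a cgroup might look roughly like this, using the bpf_prog_attach() wrapper from tools/lib/bpf (the header location and the cgroup path below are assumptions that vary by setup):

#include <fcntl.h>
#include <unistd.h>
#include "bpf/bpf.h"    /* tools/lib/bpf wrapper; path varies */

int attach_ingress_prog(int prog_fd)
{
        int cg_fd, err;

        /* hypothetical cgroup v2 path */
        cg_fd = open("/sys/fs/cgroup/unified/mygroup",
                     O_RDONLY | O_DIRECTORY);
        if (cg_fd < 0)
                return -1;

        err = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_INGRESS, 0);
        close(cg_fd);
        return err;
}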

5.1 BPF_PROG_TYPE_CGROUP_SKB

  • What do I do with it? Allow or deny network access on IP egress/ingress (BPF_CGROUP_INET_INGRESS/BPF_CGROUP_INET_EGRESS). BPF programs should return 1 to allow access. Any other value results in the function __cgroup_bpf_run_filter_skb() returning -EPERM, which will be propagated to the caller such that the packet is dropped.

  • How do I attach my program? The program is attached to a specific cgroup’s file descriptor.

  • What context is provided? The relevant skb.

  • When does it get run? For inet ingress, sk_filter_trim_cap() (see above) contains a call to BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb); if a non-zero value is returned, the error is propagated to the caller (e.g. __sk_receive_skb()) and the packet is discarded and freed. A similar approach is taken on egress, but in ip[6]_finish_output().

5.2 BPF_PROG_TYPE_CGROUP_SOCK

  • What do I do with it? Allow or deny network access at various socket-related events (BPF_CGROUP_INET_SOCK_CREATE, BPF_CGROUP_SOCK_OPS). As above, BPF programs should return 1 to allow access. Any other value results in the function __cgroup_bpf_run_filter_sk() returning -EPERM, which will be propagated to the caller such that the operation (e.g. socket creation) fails.

  • How do I attach my program? The program is attached to a specific cgroup’s file descriptor.

  • What context is provided? The relevant socket (sk).

  • When does it get run? At socket creation time, in inet_create() we call BPF_CGROUP_RUN_PROG_INET_SOCK() with the socket as argument, and if that function fails, the socket is released.

6. Lightweight tunnel program types.

Lightweight tunnels – https://lwn.net/Articles/650778/ – are a simple way to do tunneling by attaching encapsulation instructions to routes. The iproute examples (without BPF) from the patch description make things clearer:

  VXLAN:
  ip route add 40.1.1.1/32 encap vxlan id 10 dst 50.1.1.2 dev vxlan0

  MPLS:
  ip route add 10.1.1.0/30 encap mpls 200 via inet 10.1.1.1 dev swp1

So we’re telling Linux, for example, that for traffic to 40.1.1.1/32 addresses, we want to encapsulate with a VXLAN ID of 10 and destination IPv4 address of 50.1.1.2. BPF programs can do the encapsulation on outbound/transmit (inbound packets are read-only). See https://lwn.net/Articles/705609/ for more details. Similarly to tc, iproute’s eBPF support allows us to attach the eBPF program ELF section directly:

ip route add 192.168.253.2/32 \
     encap bpf out obj lwt_len_hist_kern.o section len_hist \
     dev veth0

6.1 BPF_PROG_TYPE_LWT_IN

  • What do I do with it? Examine inbound packets for lightweight tunnel de-encapsulation.

  • How do I attach my program? Via “ip route add”.

# ip route add <route+prefix> encap bpf in obj <bpf object file.o> section <ELF section> dev <device>
  • What context is provided? Pointer to the sk_buff.

  • When does it get run? Via lwtunnel_input() ; that function supports a number of encapsulation types including BPF. The BPF case runs bpf_input in net/core/lwt_bpf.c with redirection disallowed.

6.2 BPF_PROG_TYPE_LWT_OUT

  • What do I do with it? Examine outbound packets for lightweight tunnels on specific destination routes. Encapsulation is forbidden at this hook; that must happen at the transmit stage (see BPF_PROG_TYPE_LWT_XMIT below).

  • How do I attach my program? Via “ip route add”:

# ip route add <route+prefix> encap bpf out obj <bpf object file.o> section <ELF section> dev <device>
  • What context is provided? Pointer to the struct __sk_buff

  • When does it get run? Via lwtunnel_output().

6.3 BPF_PROG_TYPE_LWT_XMIT

  • What do I do with it? Implement encapsulation/redirection for lightweight tunnels on transmit.

  • How do I attach my program? Via “ip route add”

# ip route add <route+prefix> encap bpf xmit obj <bpf object file.o> section <ELF section> dev <device>
  • What context is provided? Pointer to the struct __sk_buff

  • When does it get run? Via lwtunnel_xmit().

Summary

So hopefully this roundup of program types was useful. We can see that BPF’s safe in-kernel programmable environment can be used in all sorts of interesting ways! The next thing we will talk about is what BPF helper functions are available within the various program types.

Learning more about BPF

Thanks for reading this installment of our series on BPF. We hope you found it educational and useful. Questions or comments? Use the comments field below!

Stay tuned for the next installment in this series, BPF Helper Functions.

This article originally appeared at Oracle Developers Blog

How Service Meshes Are a Missing Link for Microservices

It is now widely assumed that making the shift to microservices and cloud native environments will only lead to great things — and yet. As the rush to the cloud builds in momentum, many organizations are rudely waking up to more than a few difficulties along the way, such as how Kubernetes and serverless platforms on offer often remain a work in progress.

“We are coming to all those communities and basically pitching them to move, right? We tell them, ‘look, monolithic is very complicated — let’s move to microservices,’ so, they are working very, very hard to move but then they discover that the tooling is not that mature,” Idit Levine, founder and CEO of Solo.io, said. “And actually, there are many gaps in the tooling that they had or used it before and now they’re losing this functionality. For instance, like logging or like debugging microservices is a big problem and so on.” …

One of the first things organizations notice when migrating away from monolithic to microservices environments is how “suddenly you’re losing your observability for all of the applications,” Levine said. “That’s one thing that [service meshes] is really big in solving.”

Read more at The New Stack

Community Demos at ONS to Highlight LFN Project Harmonization and More

A little more than one year since LF Networking (LFN) came together, the project continues to demonstrate strategic growth, with Telstra coming on in November, and the umbrella project representing ~70% of the world’s mobile subscribers. Working side by side with service providers in the technical projects has been critical to ensure that work coming out of LFN is relevant and useful for meeting their requirements. A small sample of these integrations and innovations will be on display once again in the LF Networking Booth at the Open Networking Summit Event, April 3-5 in San Jose, CA. We’re excited to be back in the Bay Area this year and are offering Day Pass and Hall Pass Options. Register today!

Due to demand from the community, we’ve expanded the number of demo stations from 8 to 10, covering many areas in the open networking stack — with projects from within the LF Networking umbrella (FD.io, OpenDaylight, ONAP, OPNFV, and Tungsten Fabric), as well as adjacent projects such as Consul, Docker, Envoy, Istio, Ligato, Kubernetes, OpenStack, and SkyDive. We welcome you to come spend some time talking to and learning from these experts in the technical community.

The LFN promise of project harmonization will be on display from members iConectiv, Huawei, and Vodafone, who will highlight ONAP’s onboarding process for testing Virtual Network Functions (VNFs), integrating with OPNFV’s Dovetail project, supporting the expanding OPNFV Verification Program (OVP), and paving the way for future compliance and certification testing. Another example of project harmonization is the use of common infrastructure for testing across LFN projects, and the folks from UNH-IOL will be on hand to walk users through the Lab-as-a-Service offering that is facilitating development and testing by hosting hardware and providing access to the community.

Other focus areas include network slicing, service mesh, SDN performance, testing, integration, and analysis.

Listed below is the full networking demo lineup, and you can read detailed descriptions of each demo here.

  • Demonstrating Service Provider Specific Test & Verification Requirements (OPNFV, ONAP) – Proposed By Vodafone and Demonstrated by Vodafone/Huawei/iConectiv
  • Integration of ODL and Network Service Mesh (OpenDaylight, Network Service Mesh) – Presented by Lumina Networks
  • SDN Performance Comparison Between Multi Data-paths (Tungsten Fabric, FD.io) – Presented by ATS, and Sofioni Networks
  • ONAP Broadband Service Use Case (ONAP, OPNFV) – Presented by Swisscom, Huawei, and Nokia
  • OpenStack Rolling Upgrade with Zero Downtime to Application (OPNFV, OpenStack) – Presented By Nokia and Deutsche Telekom
  • Skydive & VPP Integration: A Topology Analyzer (FD.io, Ligato, Skydive, Docker) – Presented by PANTHEON.tech
  • VSPERF: Beyond Performance Metrics, Towards Causation Analysis (OPNFV) – Presented by Spirent
  • Network Slice with ONAP based Life Cycle Management (ONAP) – Presented by AT&T, and Ericsson
  • Service Mesh and SDN: At-Scale Cluster Load Balancing (Tungsten Fabric, Linux OS, Istio, Consul, Envoy, Kubernetes, HAProxy) – Presented by CloudOps, Juniper and the Tungsten Fabric Community
  • Lab as a Service 2.0 Demo (OPNFV) – Presented by UNH-IOL

We hope to see you at the show! Register today!

This article originally appeared at Linux Foundation

Kubernetes Setup Using Ansible and Vagrant

This blog post describes the steps required to set up a multi node Kubernetes cluster for development purposes. This setup provides a production-like cluster that can be set up on your local machine.

Why do we require multi node cluster setup?

Multi node Kubernetes clusters offer a production-like environment which has various advantages. Even though Minikube provides an excellent platform for getting started, it doesn’t provide the opportunity to work with multi node clusters, which can help solve problems or bugs that are related to application design and architecture. For instance, Ops can reproduce an issue in a multi node cluster environment, and testers can deploy multiple versions of an application for executing test cases and verifying changes. These benefits enable teams to resolve issues faster, making them more agile.

Why use Vagrant and Ansible?

Vagrant is a tool that will allow us to create a virtual environment easily and it eliminates pitfalls that cause the works-on-my-machine phenomenon. It can be used with multiple providers such as Oracle VirtualBox, VMware, Docker, and so on. It allows us to create a disposable environment by making use of configuration files.

Ansible is an infrastructure automation engine that automates software configuration management. It is agentless and allows us to use SSH keys for connecting to remote machines. Ansible playbooks are written in YAML and offer inventory management in simple text files.

Prerequisites

  • Vagrant should be installed on your machine. Installation binaries can be found here.
  • Oracle VirtualBox can be used as a Vagrant provider or make use of similar providers as described in Vagrant’s official documentation.
  • Ansible should be installed on your machine. Refer to the Ansible installation guide for platform-specific installation.

Read more at Kubernetes.io

A Linux Sysadmin’s Guide to Network Management, Troubleshooting and Debugging

A system administrator’s routine tasks include configuring, maintaining, troubleshooting, and managing servers and networks within data centers. There are numerous tools and utilities in Linux designed for administrative purposes.

In this article, we will review some of the most used command-line tools and utilities for network management in Linux, under different categories. We will explain some common usage examples, which will make network management much easier in Linux.

Network Configuration, Troubleshooting and Debugging Tools

1. ifconfig Command

ifconfig is a command-line tool for network interface configuration, also used to initialize an interface at system boot time. Once a server is up and running, it can be used to assign an IP address to an interface and enable or disable the interface on demand.

It is also used to view the status, IP address, hardware/MAC address, and MTU (Maximum Transmission Unit) size of the currently active interfaces. ifconfig is thus useful for debugging or performing system tuning.

Here is an example to display the status of all active network interfaces.
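
Running ifconfig with no arguments does exactly that:

# ifconfig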

Read more at Tecmint

Why You Need DevOps and the Cloud

DevOps and the cloud are not only two of today’s biggest tech trends, but are inextricably linked. Research from DORA’s 2018 Accelerate State of DevOps Report and Redgate’s 2019 State of Database DevOps Report outline a clear correlation between cloud and DevOps adoption, with the two working together to contribute to greater business success.

For example, in the DORA report, those companies that met all essential cloud characteristics were 23 times more likely to be in the elite group when it came to DevOps performance. Similarly, Redgate found that 43 percent of organizations that have adopted DevOps have server estates that are all or mostly cloud-based. This compares to just 12 percent of organizations that have not yet adopted DevOps or have no plans to.

Looking into the link in more detail, research shows that there are four common factors that underpin DevOps and the cloud:

Cultural Openness to Transformation

Adopting DevOps is not just a technical or process change, but one that requires a certain type of culture to be in place. 

Read more at DevOps.com

Linux Desktop News: Solus 4 Released With New Budgie Goodness

After teasing fans for several months with the 3.9999 ISO refresh, the team at Solus has delivered “Fortitude,” a new release of the independent Linux desktop OS. And like elementary OS did with Juno, it seems to earn that major version number.

Perhaps the most notable upgrade is the appearance of Budgie 10.5, even before it lands on the slick desktop environment’s official Ubuntu flavor next month. I first experienced Budgie during my review of the InfinityCube from Tuxedo Computers, and I found a lot to love about it. … Read more at Forbes

Finding Files with mlocate

Learn how to locate files in this tutorial from our archives.

It’s not uncommon for a sysadmin to have to find needles buried deep inside haystacks. On a busy machine, there can be hundreds of thousands of files present on your filesystems. What do you do when you need to make sure one particular configuration file is up to date, but you can’t remember where it is located?

If you’ve used Unix-type machines for a while, then you’ve almost certainly come across the find command before. It is unquestionably exceptionally sophisticated and highly functional. Here’s an example that just searches for links inside a directory, ignoring files:

# find . -lname "*"

You can do seemingly endless things with the find command; there’s no denying that. The find command is nice and succinct when it wants to be, but it can also get complex very quickly. This is not necessarily because of the find command itself, but because, coupled with xargs, you can pass it all sorts of options to tune your output, and indeed delete those files which you have found.

Location, location, frustration

There often comes a time when simplicity is the preferred route, however — especially when a testy boss is leaning over your shoulder, chatting away about how time is of the essence. And, imagine trying to vaguely guess the path of the file you’ve never seen but that your boss is certain lives somewhere on the busy /var partition.

Step forward mlocate. You may be aware of one of its close relatives: slocate, which securely (note the prepended letter s for secure) took note of the pertinent file permissions to prevent unprivileged users from seeing privileged files. Additionally, there is also the older, original locate command whence they came.

The difference between mlocate and other members of its family (according to mlocate at least) is that, when scanning your filesystems, mlocate doesn’t need to continually rescan all your filesystem(s). Instead, it merges its findings (note the prepended m for merge) with any existing file lists, making it much more performant and less heavy on system caches.

In this series of articles, we’ll look more closely at the mlocate tool (and simply refer to it as “locate” due to its popularity) and examine how to quickly and easily tune it to your heart’s content.

Compact and Bijou

If you’re anything like me, unless you reuse complex commands frequently, you ultimately forget them and need to look them up. The beauty of the locate command is that you can query entire filesystems very quickly, without worrying about top-level, root, paths, with a simple command using locate.
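
For example, to find every indexed path whose name contains “fstab”, it’s simply:

# locate fstab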

In the past, you might well have discovered that the find command can be very stubborn and cause you lots of unwelcome head-scratching. You know, a missing semicolon here or a special character not being escaped properly there. Let’s leave the complicated find command alone now, relax, and have a look into the clever little command that is locate.

You will most likely want to check that it’s on your system first by running these commands:

For Red Hat derivatives:

# yum install mlocate

For Debian derivatives:

# apt-get install mlocate

There shouldn’t be any differences between distributions, but there are almost certainly subtle differences between versions; beware.

Next, we’ll introduce a key component to the locate command, namely updatedb. As you can probably guess, this is the command which updates the locate command’s db. Hardly counterintuitive.

The db is the locate command’s file list, which I mentioned earlier. That list is held in a relatively simple and highly efficient database for performance. The updatedb command runs periodically, usually at quiet times of the day, scheduled via a cron job. In Listing 1, we can see the innards of the file /etc/cron.daily/mlocate.cron (both the file’s path and its contents might possibly be distro and version dependent).

#!/bin/sh

nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }')

renice +19 -p $$ >/dev/null 2>&1

ionice -c2 -n7 -p $$ >/dev/null 2>&1

/usr/bin/updatedb -f "$nodevs"

Listing 1: How the “updatedb” command is triggered every day.

As you can see, the mlocate.cron script makes careful use of renice and ionice in order to have as little impact as possible on system performance. I haven’t explicitly stated that this command runs at a set time every day (although if my addled memory serves, the original locate command was associated with a slow-down-your-computer run scheduled at midnight). This is thanks to the fact that on some “cron” versions delays are now introduced into overnight start times.

This is probably because of the so-called Thundering Herd of Hippos problem. Imagine lots of computers (or hungry animals) waking up at the same time to demand food (or resources) from a single or limited source. This can happen when all your hippos set their wristwatches using NTP (okay, this allegory is getting stretched too far, but bear with me). Imagine that exactly every five minutes (just as a “cron job” might) they all demand access to food or something otherwise being served.

If you don’t believe me, then have a quick look at the config for a version of cron called anacron, in Listing 2, which shows the guts of the file /etc/anacrontab.

# /etc/anacrontab: configuration file for anacron

# See anacron(8) and anacrontab(5) for details.


SHELL=/bin/sh

PATH=/sbin:/bin:/usr/sbin:/usr/bin

MAILTO=root

# the maximal random delay added to the base delay of the jobs

RANDOM_DELAY=45

# the jobs will be started during the following hours only

START_HOURS_RANGE=3-22

#period in days   delay in minutes   job-identifier   command

1       5       cron.daily              nice run-parts /etc/cron.daily

7       25      cron.weekly             nice run-parts /etc/cron.weekly

@monthly 45     cron.monthly            nice run-parts /etc/cron.monthly 

Listing 2: How delays are introduced into when “cron” jobs are run.

From Listing 2, you have hopefully spotted both “RANDOM_DELAY” and the “delay in minutes” column. If this aspect of cron is new to you, then you can find out more here:

# man anacrontab

Failing that, you can introduce a delay yourself if you’d like. An excellent web page (now more than a decade old) discusses this issue in a perfectly sensible way, suggesting the use of sleep to introduce a level of randomness, as seen in Listing 3.

#!/bin/sh

# Grab a random value between 0-240.
value=$RANDOM
while [ $value -gt 240 ] ; do
 value=$RANDOM
done

# Sleep for that time.
sleep $value

# Synchronize.
/usr/bin/rsync -aqzC --delete --delete-after masterhost::master /some/dir/

Listing 3: A shell script to introduce random delays before triggering an event, to avoid a Thundering Herd of Hippos.

The aim in mentioning these (potentially surprising) delays was to point you at the file /etc/crontab, or the root user’s own crontab file. If you want to change the time at which the locate command runs, specifically because of disk access slowdowns, then it’s not too tricky. There may be a more graceful way of achieving this result, but you can also just move the file /etc/cron.daily/mlocate.cron somewhere else (I’ll use the /usr/local/etc directory), and as the root user add an entry into the root user’s crontab with this command, then paste in the content below:

# crontab -e

33 3 * * * /usr/local/etc/mlocate.cron

Rather than traipse through /var/log/cron and its older, rotated, versions, you can quickly tell the last time your cron.daily jobs were fired, in the case of anacron at least, with:

# ls -hal /var/spool/anacron

Next time, we’ll look at more ways to use locate, updatedb, and other tools for finding files.

Learn more about essential sysadmin skills: Download the Future Proof Your SysAdmin Career ebook now.

Chris Binnie’s latest book, Linux Server Security: Hack and Defend, shows how hackers launch sophisticated attacks to compromise servers, steal data, and crack complex passwords, so you can learn how to defend against these attacks. In the book, he also talks you through making your servers invisible, performing penetration testing, and mitigating unwelcome attacks. You can find out more about DevSecOps and Linux security via his website (http://www.devsecops.cc).

Searchable List of Certified Open Hardware Projects

In this article, and hopefully more to come, I will share interesting examples of hardware that has been certified by the Open Source Hardware Association (OSHWA).

As an introduction to this series, I’ll start with a little background.

What is open source hardware?

Open source hardware is hardware that is, well, open source. The Open Source Hardware Association maintains a formal definition of open source hardware, but fundamentally, open source hardware is about two types of freedom. The first is freedom of information: Does a user have the information required to understand, replicate, and build upon the hardware? The second is freedom from legal barriers: Will legal barriers (such as intellectual property rights) prevent a user who is trying to understand, replicate, and build upon the hardware from doing so? True open source hardware is open to everyone to do with as they see fit.

Read more at OpenSource.com

The What and the Why of the Cluster API

Throughout the evolution of software tools there exists a tension between generalization and partial specialization. A tool’s broader adoption is a form of natural selection, where its evolution is predicated on filling a given need, or role, better than its competition. This premise is imbued in the central tenets of Unix philosophy:

  • Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.
  • Expect the output of every program to become the input to another, as yet unknown, program.

The domain of configuration management tooling is rife with examples of not heeding this lesson (e.g. Terraform, Puppet, Chef, Ansible, Juju, Saltstack, etc.), where expanse in generality has given way to partial specialization of different tools, causing fragmentation of an ecosystem. This pattern has not gone unnoticed by those in the Kubernetes cluster lifecycle special interest group, or SIG, whose objective is to simplify the creation, configuration, upgrade, downgrade, and teardown of Kubernetes clusters and their components. Therefore, one of the primary design principles for any subproject that the SIG endorses is: Where possible, tools should be composable to solve a higher order set of problems.

In this post, we will outline the history and motivations behind the creation of the Cluster API as a specialized toolset to bring declarative, Kubernetes-style APIs to cluster creation, configuration, and management in the Kubernetes ecosystem. The primary function of Cluster API is not meant to supplant existing tools, but to serve as a partial specialization that can be used in a composable fashion with those tools.

Read more at VMware