summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-02-01devlink: add version reporting to devlink info APIJakub Kicinski
ethtool -i has a few fixed-size fields which can be used to report firmware version and expansion ROM version. Unfortunately, modern hardware has more firmware components. There is usually some datapath microcode, management controller, PXE drivers, and a CPLD load. Running ethtool -i on modern controllers reveals the fact that vendors cram multiple values into firmware version field. Here are some examples from systems I could lay my hands on quickly: tg3: "FFV20.2.17 bc 5720-v1.39" i40e: "6.01 0x800034a4 1.1747.0" nfp: "0.0.3.5 0.25 sriov-2.1.16 nic" Add a new devlink API to allow retrieving multiple versions, and provide user-readable name for those versions. While at it break down the versions into three categories: - fixed - this is the board/fixed component version, usually vendors report information like the board version in the PCI VPD, but it will benefit from naming and common API as well; - running - this is the running firmware version; - stored - this is firmware in the flash, after firmware update this value will reflect the flashed version, while the running version may only be updated after reboot. v3: - add per-type helpers instead of using the special argument (Jiri). RFCv2: - remove the nesting in attr DEVLINK_ATTR_INFO_VERSIONS (now versions are mixed with other info attrs)l - have the driver report versions from the same callback as other info. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01devlink: add device information APIJakub Kicinski
ethtool -i has served us well for a long time, but its showing its limitations more and more. The device information should also be reported per device not per-netdev. Lay foundation for a simple devlink-based way of reading device info. Add driver name and device serial number as initial pieces of information exposed via this new API. v3: - rename helpers (Jiri); - rename driver name attr (Jiri); - remove double spacing in commit message (Jiri). RFC v2: - wrap the skb into an opaque structure (Jiri); - allow the serial number of be any length (Jiri & Andrew); - add driver name (Jonathan). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01Merge branch 'selftests-Various-fixes'David S. Miller
Petr Machata says: ==================== selftests: Various fixes This patch set contains various fixes whose common denominator is improving quality of forwarding and mlxsw selftests. Most of the fixes are improvements in determinism (such that timing and latency don't impact the test performance). These were prompted by regular runs of the test suite on a hardware emulator, the performance of which is necessarily lower than that of the real device. Patches #1 (from Ido), #2 and #3 make changes to ping limits. Patches #4 and #5 add more sleep in places where things need more time to finish. Patches #6 and #7 fix two tests in the suite of mirror-to-gretap tests where underlay involves a VLAN device over an 802.1q bridge. Patches #8, #9 and #10 fix bugs in mirror-to-gretap test where underlay involves a LAG device. Patch #11 fixes a missed RET initialization in mirror-to-gretap flower test. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_flower: Fix test result handlingPetr Machata
The global variable RET needs to be initialized before each call to log_test. This test case sets it once before running the tests, but then calls log_tests for every individual test. Thus a failure in one of the tests causes spurious failures in follow-up tests as well. Fix by moving the initialization of RET from test_all() to full_test_span_gre_dir_acl(), a function that implements the test. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_bridge_1q_lag: Ignore ARPPetr Machata
This test sets up mirroring such that it mirrors all overlay traffic. That includes ARP, which causes occasional miscounts and spurious failures. Ignore ARP explicitly to avoid these problems. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_bridge_1q_lag: Enable forwardingPetr Machata
This test relies on routing in the primary traffic path, but neglects to enable forwarding. Do so. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_bridge_1q_lag: Flush neighborsPetr Machata
After one LAG slave is downed and another upped, it takes a while for the neighbor on a bridge to time out and get renegotiated. The test does prompt update of FDB entries by arpinging. But because the neighbor still references another address, offloading is not possible, and some packets may end up not being mirrored. To force the neighbor renegotiation, simply flush the neighbor table at the bridge. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_vlan_bridge_1q: Fix roaming testPetr Machata
ARP or ND traffic can cause spurious migration of FDB back to $swp3. Mirroring is then updated in accordance with the change, and mirrored packets are seen at h3, causing a failure. Detect the case of this spurious roaming, and retry the test. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_vlan_bridge_1q: Fix untagged testPetr Machata
The untagged egress test sets up mirroring to {,ip6}gretap such that the underlay goes through a bridge. Then VLAN flags are manipulated to test that the traffic leaves the bridge 802.1q-tagged or not, as appropriate. However, when a neighbor expires at the time that the bridge VLAN is configured as PVID and egress untagged, the following discovery process can't finish, because the IP address on H3 is still at the VLAN-tagged netdevice. This manifests by occasional failures where only several of the 10 required packets get through. Therefore, when reconfiguring the VLAN flags, move the IP address to the appropriate device in the H3 VRF. In addition to that, take this opportunity to embed an ASCII art diagram to make the topology move obvious. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_lib: Wait for tardy mirrored packetsPetr Machata
When running in an environment with poor performance (such as a simulator), processing mirrored packets can take a while. Evaluating the condition too soon leads to spurious "seen 9, expected 10" failures as the last packet doesn't have enough time to get mirrored and the mirror to arrive and bump the observed counters. Wait for one ping interval before evaluating the test. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_gre_changes: Fix TTL testPetr Machata
When running in a simulator, the TTL change takes a while to settle and during this time the performance of the packet processing is lowered. The resulting instability leads to ping sending more packets as it assumes some have been dropped. This then leads to regular spurious failures as more packets than expected are observed. Sleep a bit to give the system time to stabilize. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: mlxsw: Update ping limitsPetr Machata
The current ping intervals are too short for running mirroring tests in simulator. This leads to ping sending a follow-up ping before the reply arrives, thus sending more than the requested 10 ICMP requests. This traffic is seen at the counters, and causes spurious failures. Bump interval and timeout numbers 5x in mirroring tests to address the spurious failures. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: mirror_lib: Update ping limitsPetr Machata
The current ping intervals are too short for running mirroring tests in simulator. This leads to ping sending a follow-up ping before the reply arrives, thus sending more than the requested 10 ICMP requests. Those are mirrored, and over a certain threshold the test case run is considered a failure, because too much traffic is observed. Bump interval and timeout numbers 5x in mirroring tests to address the spurious failures. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01selftests: forwarding: Make ping timeout configurableIdo Schimmel
The current timeout (2 seconds) proved to be too low for some (emulated) systems where we run the tests. Make the timeout configurable and default to 5 seconds. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01ipconfig: add carrier_timeout kernel parameterMartin Kepplinger
commit 3fb72f1e6e61 ("ipconfig wait for carrier") added a "wait for carrier" policy, with a fixed worst case maximum wait of two minutes. Now make the wait for carrier timeout configurable on the kernel commandline and use the 120s as the default. The timeout messages introduced with commit 5e404cd65860 ("ipconfig: add informative timeout messages while waiting for carrier") are done in a fixed interval of 20 seconds, just like they were before (240/12). Signed-off-by: Martin Kepplinger <martin.kepplinger@ginzinger.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01ipv4: fib: use struct_size() in kzalloc()Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01nfp: use struct_size() in kzalloc()Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01tulip: eeprom: use struct_size() in kmalloc()Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kmalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01cxgb4: smt: use struct_size() in kvzalloc()Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kvzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01cxgb4: sched: use struct_size() in kvzalloc()Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kvzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01net: tls: Set async_capable for tls zerocopy only if we see EINPROGRESSDave Watson
Currently we don't zerocopy if the crypto framework async bit is set. However some crypto algorithms (such as x86 AESNI) support async, but in the context of sendmsg, will never run asynchronously. Instead, check for actual EINPROGRESS return code before assuming algorithm is async. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01Merge branch 'tls-1.3-support'David S. Miller
Dave Watson says: ==================== net: tls: TLS 1.3 support This patchset adds 256bit keys and TLS1.3 support to the kernel TLS socket. TLS 1.3 is requested by passing TLS_1_3_VERSION in the setsockopt call, which changes the framing as required for TLS1.3. 256bit keys are requested by passing TLS_CIPHER_AES_GCM_256 in the sockopt. This is a fairly straightforward passthrough to the crypto framework. 256bit keys work with both TLS 1.2 and TLS 1.3 TLS 1.3 requires a different AAD layout, necessitating some minor refactoring. It also moves the message type byte to the encrypted portion of the message, instead of the cleartext header as it was in TLS1.2. This requires moving the control message handling to after decryption, but is otherwise similar. V1 -> V2 The first two patches were dropped, and sent separately, one as a bugfix to the net tree. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01net: tls: Add tests for TLS 1.3Dave Watson
Change most tests to TLS 1.3, while adding tests for previous TLS 1.2 behavior. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01net: tls: Add tls 1.3 supportDave Watson
TLS 1.3 has minor changes from TLS 1.2 at the record layer. * Header now hardcodes the same version and application content type in the header. * The real content type is appended after the data, before encryption (or after decryption). * The IV is xored with the sequence number, instead of concatinating four bytes of IV with the explicit IV. * Zero-padding: No exlicit length is given, we search backwards from the end of the decrypted data for the first non-zero byte, which is the content type. Currently recv supports reading zero-padding, but there is no way for send to add zero padding. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01net: tls: Refactor control message handling on recvDave Watson
For TLS 1.3, the control message is encrypted. Handle control message checks after decryption. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01net: tls: Refactor tls aad space size calculationDave Watson
TLS 1.3 has a different AAD size, use a variable in the code to make TLS 1.3 support easy. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01net: tls: Support 256 bit keysDave Watson
Wire up support for 256 bit keys from the setsockopt to the crypto framework Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01Merge branch 'bpf-xdp-sample-libbpf'Daniel Borkmann
Maciej Fijalkowski says: ==================== This patchset tries to address the situation where: * user loads a particular xdp sample application that does stats polling * user loads another sample application on the same interface * then, user sends SIGINT/SIGTERM to the app that was attached as a first one * second application ends up with an unloaded xdp program 1st patch contains a helper libbpf function for getting the map fd by a given map name. In patch 2 Jesper removes the read_trace_pipe usage from xdp_redirect_cpu which was a blocker for converting this sample to libbpf usage. 3rd patch updates a bunch of xdp samples to make the use of libbpf. Patch 4 adjusts RLIMIT_MEMLOCK for two samples touched in this patchset. In patch 5 extack messages are added for cases where dev_change_xdp_fd returns with an error so user has an idea what was the reason for not attaching the xdp program onto interface. Patch 6 makes the samples behavior similar to what iproute2 does when loading xdp prog - the "force" flag is introduced. Patch 7 introduces the libbpf function that will query the driver from userspace about the currently attached xdp prog id. Use it in samples that do polling by checking the prog id in signal handler and comparing it with previously stored one which is the scope of patch 8. Thanks! v1->v2: * add a libbpf helper for getting a prog via relative index * include xdp_redirect_cpu into conversion v2->v3: mostly addressing Daniel's/Jesper's comments * get rid of the helper from v1->v2 * feed the xdp_redirect_cpu with program name instead of number v3->v4: * fix help message in xdp_sample_pkts v4->v5: * in get_link_xdp_fd, assign prog_id only when libbpf_nl_get_link returned with 0 * add extack messages in dev_change_xdp_fd * check the return value of bpf_get_link_xdp_id when exiting from sample progs v5->v6: * rebase ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01samples/bpf: Check the prog id before exitingMaciej Fijalkowski
Check the program id within the signal handler on polling xdp samples that were previously converted to libbpf usage. Avoid the situation of unloading the program that was not attached by sample that is exiting. Handle also the case where bpf_get_link_xdp_id didn't exit with an error but the xdp program was not found on an interface. Reported-by: Michal Papaj <michal.papaj@intel.com> Reported-by: Jakub Spizewski <jakub.spizewski@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01libbpf: Add a support for getting xdp prog id on ifindexMaciej Fijalkowski
Since we have a dedicated netlink attributes for xdp setup on a particular interface, it is now possible to retrieve the program id that is currently attached to the interface. The use case is targeted for sample xdp programs, which will store the program id just after loading bpf program onto iface. On shutdown, the sample will make sure that it can unload the program by querying again the iface and verifying that both program id's matches. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01samples/bpf: Add a "force" flag to XDP samplesMaciej Fijalkowski
Make xdp samples consistent with iproute2 behavior and set the XDP_FLAGS_UPDATE_IF_NOEXIST by default when setting the xdp program on interface. Provide an option for user to force the program loading, which as a result will not include the mentioned flag in bpf_set_link_xdp_fd call. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01xdp: Provide extack messages when prog attachment failedMaciej Fijalkowski
In order to provide more meaningful messages to user when the process of loading xdp program onto network interface failed, let's add extack messages within dev_change_xdp_fd. Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01samples/bpf: Extend RLIMIT_MEMLOCK for xdp_{sample_pkts, router_ipv4}Maciej Fijalkowski
There is a common problem with xdp samples that happens when user wants to run a particular sample and some bpf program is already loaded. The default 64kb RLIMIT_MEMLOCK resource limit will cause a following error (assuming that xdp sample that is failing was converted to libbpf usage): libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program. libbpf: failed to load object './xdp_sample_pkts_kern.o' Fix it in xdp_sample_pkts and xdp_router_ipv4 by setting RLIMIT_MEMLOCK to RLIM_INFINITY. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01samples/bpf: Convert XDP samples to libbpf usageMaciej Fijalkowski
Some of XDP samples that are attaching the bpf program to the interface via libbpf's bpf_set_link_xdp_fd are still using the bpf_load.c for loading and manipulating the ebpf program and maps. Convert them to do this through libbpf usage and remove bpf_load from the picture. While at it remove what looks like debug leftover in xdp_redirect_map_user.c In xdp_redirect_cpu, change the way that the program to be loaded onto interface is chosen - user now needs to pass the program's section name instead of the relative number. In case of typo print out the section names to choose from. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01samples/bpf: xdp_redirect_cpu have not need for read_trace_pipeJesper Dangaard Brouer
The sample xdp_redirect_cpu is not using helper bpf_trace_printk. Thus it makes no sense that the --debug option us reading from /sys/kernel/debug/tracing/trace_pipe via read_trace_pipe. Simply remove it. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01libbpf: Add a helper for retrieving a map fd for a given nameMaciej Fijalkowski
XDP samples are mostly cooperating with eBPF maps through their file descriptors. In case of a eBPF program that contains multiple maps it might be tiresome to iterate through them and call bpf_map__fd for each one. Add a helper mostly based on bpf_object__find_map_by_name, but instead of returning the struct bpf_map pointer, return map fd. Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01bpf: powerpc64: add JIT support for bpf line infoSandipan Das
This adds support for generating bpf line info for JITed programs. Signed-off-by: Sandipan Das <sandipan@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01Merge branch 'bpf-spinlocks'Daniel Borkmann
Alexei Starovoitov says: ==================== Many algorithms need to read and modify several variables atomically. Until now it was hard to impossible to implement such algorithms in BPF. Hence introduce support for bpf_spin_lock. The api consists of 'struct bpf_spin_lock' that should be placed inside hash/array/cgroup_local_storage element and bpf_spin_lock/unlock() helper function. Example: struct hash_elem { int cnt; struct bpf_spin_lock lock; }; struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key); if (val) { bpf_spin_lock(&val->lock); val->cnt++; bpf_spin_unlock(&val->lock); } and BPF_F_LOCK flag for lookup/update bpf syscall commands that allows user space to read/write map elements under lock. Together these primitives allow race free access to map elements from bpf programs and from user space. Key restriction: root only. Key requirement: maps must be annotated with BTF. This concept was discussed at Linux Plumbers Conference 2018. Thank you everyone who participated and helped to iron out details of api and implementation. Patch 1: bpf_spin_lock support in the verifier, BTF, hash, array. Patch 2: bpf_spin_lock in cgroup local storage. Patches 3,4,5: tests Patch 6: BPF_F_LOCK flag to lookup/update Patches 7,8,9: tests v6->v7: - fixed this_cpu->__this_cpu per Peter's suggestion and added Ack. - simplified bpf_spin_lock and load/store overlap check in the verifier as suggested by Andrii - rebase v5->v6: - adopted arch_spinlock approach suggested by Peter - switched to spin_lock_irqsave equivalent as the simplest way to avoid deadlocks in rare case of nested networking progs (cgroup-bpf prog in preempt_disable vs clsbpf in softirq sharing the same map with bpf_spin_lock) bpf_spin_lock is only allowed in networking progs that don't have arbitrary entry points unlike tracing progs. - rebase and split test_verifier tests v4->v5: - disallow bpf_spin_lock for tracing progs due to insufficient preemption checks - socket filter progs cannot use bpf_spin_lock due to missing preempt_disable - fix atomic_set_release. Spotted by Peter. - fixed hash_of_maps v3->v4: - fix BPF_EXIST | BPF_NOEXIST check patch 6. Spotted by Jakub. Thanks! - rebase v2->v3: - fixed build on ia64 and archs where qspinlock is not supported - fixed missing lock init during lookup w/o BPF_F_LOCK. Spotted by Martin v1->v2: - addressed several issues spotted by Daniel and Martin in patch 1 - added test11 to patch 4 as suggested by Daniel ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01selftests/bpf: test for BPF_F_LOCKAlexei Starovoitov
Add C based test that runs 4 bpf programs in parallel that update the same hash and array maps. And another 2 threads that read from these two maps via lookup(key, value, BPF_F_LOCK) api to make sure the user space sees consistent value in both hash and array elements while user space races with kernel bpf progs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01libbpf: introduce bpf_map_lookup_elem_flags()Alexei Starovoitov
Introduce int bpf_map_lookup_elem_flags(int fd, const void *key, void *value, __u64 flags) helper to lookup array/hash/cgroup_local_storage elements with BPF_F_LOCK flag. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01tools/bpf: sync uapi/bpf.hAlexei Starovoitov
add BPF_F_LOCK definition to tools/include/uapi/linux/bpf.h Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01bpf: introduce BPF_F_LOCK flagAlexei Starovoitov
Introduce BPF_F_LOCK flag for map_lookup and map_update syscall commands and for map_update() helper function. In all these cases take a lock of existing element (which was provided in BTF description) before copying (in or out) the rest of map value. Implementation details that are part of uapi: Array: The array map takes the element lock for lookup/update. Hash: hash map also takes the lock for lookup/update and tries to avoid the bucket lock. If old element exists it takes the element lock and updates the element in place. If element doesn't exist it allocates new one and inserts into hash table while holding the bucket lock. In rare case the hashmap has to take both the bucket lock and the element lock to update old value in place. Cgroup local storage: It is similar to array. update in place and lookup are done with lock taken. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01selftests/bpf: add bpf_spin_lock C testAlexei Starovoitov
add bpf_spin_lock C based test that requires latest llvm with BTF support Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01selftests/bpf: add bpf_spin_lock verifier testsAlexei Starovoitov
add bpf_spin_lock tests to test_verifier.c that don't require latest llvm with BTF support Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01tools/bpf: sync include/uapi/linux/bpf.hAlexei Starovoitov
sync bpf.h Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01bpf: add support for bpf_spin_lock to cgroup local storageAlexei Starovoitov
Allow 'struct bpf_spin_lock' to reside inside cgroup local storage. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01bpf: introduce bpf_spin_lockAlexei Starovoitov
Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let bpf program serialize access to other variables. Example: struct hash_elem { int cnt; struct bpf_spin_lock lock; }; struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key); if (val) { bpf_spin_lock(&val->lock); val->cnt++; bpf_spin_unlock(&val->lock); } Restrictions and safety checks: - bpf_spin_lock is only allowed inside HASH and ARRAY maps. - BTF description of the map is mandatory for safety analysis. - bpf program can take one bpf_spin_lock at a time, since two or more can cause dead locks. - only one 'struct bpf_spin_lock' is allowed per map element. It drastically simplifies implementation yet allows bpf program to use any number of bpf_spin_locks. - when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed. - bpf program must bpf_spin_unlock() before return. - bpf program can access 'struct bpf_spin_lock' only via bpf_spin_lock()/bpf_spin_unlock() helpers. - load/store into 'struct bpf_spin_lock lock;' field is not allowed. - to use bpf_spin_lock() helper the BTF description of map value must be a struct and have 'struct bpf_spin_lock anyname;' field at the top level. Nested lock inside another struct is not allowed. - syscall map_lookup doesn't copy bpf_spin_lock field to user space. - syscall map_update and program map_update do not update bpf_spin_lock field. - bpf_spin_lock cannot be on the stack or inside networking packet. bpf_spin_lock can only be inside HASH or ARRAY map value. - bpf_spin_lock is available to root only and to all program types. - bpf_spin_lock is not allowed in inner maps of map-in-map. - ld_abs is not allowed inside spin_lock-ed region. - tracing progs and socket filter progs cannot use bpf_spin_lock due to insufficient preemption checks Implementation details: - cgroup-bpf class of programs can nest with xdp/tc programs. Hence bpf_spin_lock is equivalent to spin_lock_irqsave. Other solutions to avoid nested bpf_spin_lock are possible. Like making sure that all networking progs run with softirq disabled. spin_lock_irqsave is the simplest and doesn't add overhead to the programs that don't use it. - arch_spinlock_t is used when its implemented as queued_spin_lock - archs can force their own arch_spinlock_t - on architectures where queued_spin_lock is not available and sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used. - presence of bpf_spin_lock inside map value could have been indicated via extra flag during map_create, but specifying it via BTF is cleaner. It provides introspection for map key/value and reduces user mistakes. Next steps: - allow bpf_spin_lock in other map types (like cgroup local storage) - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper to request kernel to grab bpf_spin_lock before rewriting the value. That will serialize access to map elements. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01Merge tag 'batadv-next-for-davem-20190201' of ↵David S. Miller
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This feature/cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - Add DHCPACKs for DAT snooping, by Linus Luessing - Update copyright years for 2019, by Sven Eckelmann ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01Merge tag 'mac80211-next-for-davem-2019-02-01' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== New features for the wifi stack: * airtime fairness scheduling in mac80211, so we can share * more authentication offloads to userspace - this is for SAE which is part of WPA3 and is hard to do in firmware * documentation fixes * various mesh improvements * various other small improvements/cleanups This also contains the NLA_POLICY_NESTED{,_ARRAY} change we discussed, which affects everyone but there's no other user yet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01net: hns3: Check for allocation failureDan Carpenter
We should return -ENOMEM if the kcalloc() fails. Fixes: d174ea75c96a ("net: hns3: add statistics for PFC frames and MAC control frame") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>