summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-04-27cpsw: Put back cpsw_ndo_poll_controller()David S. Miller
To fix the build. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27Merge branch 'net-ethernet-ti-clean-up-and-optimizations'David S. Miller
Grygorii Strashko says: ==================== net: ethernet: ti: clean up and optimizations This is a preparation series for introducing new switchbase TI CPSW driver which was originally introduced [1][2] by Ilias Apalodimas <ilias.apalodimas@linaro.org> and also discussed in private mails and at Netdev x13 confernce. Following discussions and suggestions (mostly by Andrew and Ivan) we going to introduce the new driver which is operating in dual-emac mode by default, thus working as 2 individual network interfaces. When both interfaces joined the bridge - CPSW driver will enter a switch mode and discard dual_mac configuration. The CPSW will be switched back to dual_mac mode if any port leaves the bridge. All configuration is going to be implemented via switchdev API. Hence overall change is already very big I'm sending prerequisite patches which are mostly minor fixes/clean ups and code refactoring to separate common parts to be reused by both drivers. Probably the most serious change from functional point of view is Patch 11. These patches were NFS boot tetested on TI AM335x/AM437x/AM5xx boards. These patches can be found at: git@git.ti.com:~gragst/ti-linux-kernel/gragsts-ti-linux-kernel.git branch: lkml-5.1-cpsw-clean-up-v2 changes in v2: - added new patch 16 to get rid of force type conversation - other chages metioned in patches ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: move ethtool func in separate fileGrygorii Strashko
As a preparatory patch to add support for a switchdev based cpsw driver, move common ethtool functions to separate cpsw-ethtool.c file so that they can be used across both drivers. It will simplify CPSW driver code maintenance also. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: switch to use mac sl apiGrygorii Strashko
Switch CPSW driver to use the new MAC SL API. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: introduce mac sl module apiGrygorii Strashko
The MAC SL submodule has a lot of common functions between many of TI SoCs AM335x/AM437x/DRA7(AM57xx), Keystone 2 66AK2HK/E/L/G and K3 AM654, but there are also differences especially in registers offsets and sets of supported functions. This patch introduces the MAC SL submodule API which is intended to provide a common way to access the MAC SL submodule and hide HW integrations details. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Sekhar Nori <nsekhar@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: move common hw init code in separate funcGrygorii Strashko
move common hw init code in separate function as preparation for adding new switchdev driver. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: davinci_cpdma: use dma_addr_t for desc_mem_phys and ↵Grygorii Strashko
desc_hw_addr Use dma_addr_t for desc_mem_phys and desc_hw_addr to avoid types conversions. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: move cpsw definitions in priv headerGrygorii Strashko
As a preparatory patch to add a switchdev based cpsw driver move the common header definitions to cpsw_priv.h. The plan is to develop a new driver on switchdev driver model and obsolete the current cpsw driver after all required functions are added to the new driver. This patch allows the same header file to be re-used on both drivers during the transition period. Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: refactor probe to group common hw initializationGrygorii Strashko
Rework probe to group common hw initialization: - group resources request at the beginning of the probe - move net device initialization and registration at the end of the probe - drop cpsw_slave_init as preparation of refactoring of common hw initialization code to separate function. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: davinci_mdio: use devm_ioremap()Grygorii Strashko
The Davinci MDIO in most of the case implemented as module inside of TI CPSW subsystem and fully depends on CPSW to be enabled, but historically it's implemented as separate Platform device/driver and defined in DT files in two ways: - as standalone node - as child node of CPSW subsystem. In later case it's required to split CPSW subsystem "reg" property to exclude MDIO I/O range which is not useful. Hence, replace devm_ioremap_resource() with devm_ioremap() to allow define full I/O range in parent CPSW subsystem without spliting. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: ale: do not auto delete mcast super entriesGrygorii Strashko
Do not delete multicast supervisory packet's (SUPER) entries while flushing multicast addresses from ALE table cpsw_ale_flush_multicast(). Those entries have to be added/removed only explicitly. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: fix allmulti cfg in dual_mac modeGrygorii Strashko
Now CPSW ALE will set/clean Host port bit in Unregistered Multicast Flood Mask (UNREG_MCAST_FLOOD_MASK) for every VLAN without checking if this port belongs to VLAN or not when ALLMULTI mode flag is set for nedev. This is working in non dual_mac mode, but in dual_mac - it causes enabling/disabling ALLMULTI flag for both ports. Hence fix it by adding additional parameter to cpsw_ale_set_allmulti() to specify ALE port number for which ALLMULTI has to be enabled and check if port belongs to VLAN before modifying UNREG_MCAST_FLOOD_MASK. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: ale: use define for host port in cpsw_ale_set_allmulti()Grygorii Strashko
Use ALE_PORT_HOST define for host port in cpsw_ale_set_allmulti() instead of constants. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: ale: fix mcast super settingGrygorii Strashko
Use correct define ALE_SUPER for ALE Multicast Address Table Entry Supervisory Packet (SUPER) bit setting instead of ALE_BLOCKED. No issues were observed till now as it have never been set, but it's going to be used by new CPSW switch driver. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: drop cpsw_tx_packet_submit()Grygorii Strashko
Drop unnecessary wrapper function cpsw_tx_packet_submit() which is used only in one place. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: use devm_alloc_etherdev_mqs()Grygorii Strashko
Use devm_alloc_etherdev_mqs() and simplify code. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: drop pinctrl_pm_select_default_state callGrygorii Strashko
Drop pinctrl_pm_select_default_state call from probe as default pinctrl state is set by DD core. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: use local var dev in probeGrygorii Strashko
Use local variable struct device *dev in probe to simplify code. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: update cpsw_split_res() to accept cpsw_commonGrygorii Strashko
Update cpsw_split_res() to accept struct cpsw_common instead of struct net_device to simplify code. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: drop CONFIG_TI_CPSW_ALE config optionGrygorii Strashko
All TI drivers CPSW/NETCP can't work without ALE, hence simplify build of those drivers by always linking cpsw_ale and drop CONFIG_TI_CPSW_ALE config option. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: cpsw: drop TI_DAVINCI_CPDMA config optionGrygorii Strashko
Both drivers CPSW and EMAC can't work without CPDMA, hence simplify build of those drivers by always linking davinci_cpdma and drop TI_DAVINCI_CPDMA config option. Note. the davinci_emac driver module was changed to "ti_davinci_emac" to make build work. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: ethernet: ti: convert to SPDX license identifiersGrygorii Strashko
Replace textual license with SPDX-License-Identifier. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27Merge branch 'strict-netlink-validation'David S. Miller
Johannes Berg says: ==================== strict netlink validation Here's a respin, with the following changes: * change message when rejecting unknown attribute types (David Ahern) * drop nl80211 patch - I'll apply it separately * remove NL_VALIDATE_POLICY - we have a lot of calls to nla_parse() that really should be without a policy as it has previously been validated - need to find a good way to handle this later * include the correct generic netlink change (d'oh, sorry) ==================== Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27genetlink: optionally validate strictly/dumpsJohannes Berg
Add options to strictly validate messages and dump messages, sometimes perhaps validating dump messages non-strictly may be required, so add an option for that as well. Since none of this can really be applied to existing commands, set the options everwhere using the following spatch: @@ identifier ops; expression X; @@ struct genl_ops ops[] = { ..., { .cmd = X, + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, ... }, ... }; For new commands one should just not copy the .validate 'opt-out' flags and thus get strict validation. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: add strict parsing for future attributesJohannes Berg
Unfortunately, we cannot add strict parsing for all attributes, as that would break existing userspace. We currently warn about it, but that's about all we can do. For new attributes, however, the story is better: nobody is using them, so we can reject bad sizes. Also, for new attributes, we need not accept them when the policy doesn't declare their usage. David Ahern and I went back and forth on how to best encode this, and the best way we found was to have a "boundary type", from which point on new attributes have all possible validation applied, and NLA_UNSPEC is rejected. As we didn't want to add another argument to all functions that get a netlink policy, the workaround is to encode that boundary in the first entry of the policy array (which is for type 0 and thus probably not really valid anyway). I put it into the validation union for the rare possibility that somebody is actually using attribute 0, which would continue to work fine unless they tried to use the extended validation, which isn't likely. We also didn't find any in-tree users with type 0. The reason for setting the "start strict here" attribute is that we never really need to start strict from 0, which is invalid anyway (or in legacy families where that isn't true, it cannot be set to strict), so we can thus reserve the value 0 for "don't do this check" and don't have to add the tag to all policies right now. Thus, policies can now opt in to this validation, which we should do for all existing policies, at least when adding new attributes. Note that entirely *new* policies won't need to set it, as the use of that should be using nla_parse()/nlmsg_parse() etc. which anyway do fully strict validation now, regardless of this. So in effect, this patch only covers the "existing command with new attribute" case. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: re-add parse/validate functions in strict modeJohannes Berg
This re-adds the parse and validate functions like nla_parse() that are now actually strict after the previous rename and were just split out to make sure everything is converted (and if not compilation of the previous patch would fail.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: make validation more configurable for future strictnessJohannes Berg
We currently have two levels of strict validation: 1) liberal (default) - undefined (type >= max) & NLA_UNSPEC attributes accepted - attribute length >= expected accepted - garbage at end of message accepted 2) strict (opt-in) - NLA_UNSPEC attributes accepted - attribute length >= expected accepted Split out parsing strictness into four different options: * TRAILING - check that there's no trailing data after parsing attributes (in message or nested) * MAXTYPE - reject attrs > max known type * UNSPEC - reject attributes with NLA_UNSPEC policy entries * STRICT_ATTRS - strictly validate attribute size The default for future things should be *everything*. The current *_strict() is a combination of TRAILING and MAXTYPE, and is renamed to _deprecated_strict(). The current regular parsing has none of this, and is renamed to *_parse_deprecated(). Additionally it allows us to selectively set one of the new flags even on old policies. Notably, the UNSPEC flag could be useful in this case, since it can be arranged (by filling in the policy) to not be an incompatible userspace ABI change, but would then going forward prevent forgetting attribute entries. Similar can apply to the POLICY flag. We end up with the following renames: * nla_parse -> nla_parse_deprecated * nla_parse_strict -> nla_parse_deprecated_strict * nlmsg_parse -> nlmsg_parse_deprecated * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict * nla_parse_nested -> nla_parse_nested_deprecated * nla_validate_nested -> nla_validate_nested_deprecated Using spatch, of course: @@ expression TB, MAX, HEAD, LEN, POL, EXT; @@ -nla_parse(TB, MAX, HEAD, LEN, POL, EXT) +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression TB, MAX, NLA, POL, EXT; @@ -nla_parse_nested(TB, MAX, NLA, POL, EXT) +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT) @@ expression START, MAX, POL, EXT; @@ -nla_validate_nested(START, MAX, POL, EXT) +nla_validate_nested_deprecated(START, MAX, POL, EXT) @@ expression NLH, HDRLEN, MAX, POL, EXT; @@ -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT) +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT) For this patch, don't actually add the strict, non-renamed versions yet so that it breaks compile if I get it wrong. Also, while at it, make nla_validate and nla_parse go down to a common __nla_validate_parse() function to avoid code duplication. Ultimately, this allows us to have very strict validation for every new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the next patch, while existing things will continue to work as is. In effect then, this adds fully strict validation for any new command. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: add NLA_MIN_LENJohannes Berg
Rather than using NLA_UNSPEC for this type of thing, use NLA_MIN_LEN so we can make NLA_UNSPEC be NLA_REJECT under certain conditions for future attributes. While at it, also use NLA_EXACT_LEN for the struct example. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27Merge branch 'nla_nest_start'David S. Miller
Michal Kubecek says: ==================== make nla_nest_start() add NLA_F_NESTED flag One of the comments in recent review of the ethtool netlink series pointed out that proposed ethnl_nest_start() helper which adds NLA_F_NESTED to second argument of nla_nest_start() is not really specific to ethtool netlink code. That is hard to argue with as closer inspection revealed that exactly the same helper already exists in ipset code (except it's a macro rather than an inline function). Another observation was that even if NLA_F_NESTED flag was introduced in 2007, only few netlink based interfaces set it in kernel generated messages and even many recently added APIs omit it. That is unfortunate as without the flag, message parsers not familiar with attribute semantics cannot recognize nested attributes and do not see message structure; this affects e.g. wireshark dissector or mnl_nlmsg_fprintf() from libmnl. This is why I'm suggesting to rename existing nla_nest_start() to different name (nla_nest_start_noflag) and reintroduce nla_nest_start() as a wrapper adding NLA_F_NESTED flag. This is implemented in first patch which is mostly generated by spatch. Second patch drops ipset helper macros which lose their purpose. Third patch cleans up minor coding style issues found by checkpatch.pl in first patch. We could leave nla_nest_start() untouched and simply add a wrapper adding NLA_F_NESTED but that would probably preserve the state when even most new code doesn't set the flag. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: fix two coding style issuesMichal Kubecek
This is a simple cleanup addressing two coding style issues found by checkpatch.pl in an earlier patch. It's submitted as a separate patch to keep the original patch as it was generated by spatch. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27ipset: drop ipset_nest_start() and ipset_nest_end()Michal Kubecek
After the previous commit, both ipset_nest_start() and ipset_nest_end() are just aliases for nla_nest_start() and nla_nest_end() so that there is no need to keep them. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: make nla_nest_start() add NLA_F_NESTED flagMichal Kubecek
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most netlink based interfaces (including recently added ones) are still not setting it in kernel generated messages. Without the flag, message parsers not aware of attribute semantics (e.g. wireshark dissector or libmnl's mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display the structure of their contents. Unfortunately we cannot just add the flag everywhere as there may be userspace applications which check nlattr::nla_type directly rather than through a helper masking out the flags. Therefore the patch renames nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start() as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually are rewritten to use nla_nest_start(). Except for changes in include/net/netlink.h, the patch was generated using this semantic patch: @@ expression E1, E2; @@ -nla_nest_start(E1, E2) +nla_nest_start_noflag(E1, E2) @@ expression E1, E2; @@ -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED) +nla_nest_start(E1, E2) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27Merge branch 'net-tls-small-code-cleanup'David S. Miller
Jakub Kicinski says: ==================== net/tls: small code cleanup This small patch set cleans up tls (mostly offload parts). Other than avoiding unnecessary error messages - no functional changes here. v2 (Saeed): - fix up Review tags; - remove the warning on failure completely. ==================== Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net/tls: byte swap device req TCP seq no upon settingJakub Kicinski
To avoid a sparse warning byteswap the be32 sequence number before it's stored in the atomic value. While at it drop unnecessary brackets and use kernel's u64 type. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net/tls: move definition of tls ops into net/tls.hJakub Kicinski
There seems to be no reason for tls_ops to be defined in netdevice.h which is included in a lot of places. Don't wrap the struct/enum declaration in ifdefs, it trickles down unnecessary ifdefs into driver code. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net/tls: remove old exports of sk_destruct functionsJakub Kicinski
tls_device_sk_destruct being set on a socket used to indicate that socket is a kTLS device one. That is no longer true - now we use sk_validate_xmit_skb pointer for that purpose. Remove the export. tls_device_attach() needs to be moved. While at it, remove the dead declaration of tls_sk_destruct(). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net/tls: don't log errors every time offload can't proceedJakub Kicinski
Currently when CONFIG_TLS_DEVICE is set each time kTLS connection is opened and the offload is not successful (either because the underlying device doesn't support it or e.g. it's tables are full) a rate limited error will be printed to the logs. There is nothing wrong with failing TLS offload. SW path will process the packets just fine, drop the noisy messages. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27Merge branch 'sk-local-storage'Alexei Starovoitov
Martin KaFai Lau says: ==================== v4: - Move checks to map_alloc_check in patch 1 (Stanislav Fomichev) - Refactor BTF encoding macros to test_btf.h at a new patch 4 (Stanislav Fomichev) - Refactor getenv and add print PASS message at the end of the test in patch 6 (Yonghong Song) v3: - Replace spinlock_types.h with spinlock.h in patch 1 (kbuild test robot <lkp@intel.com>) v2: - Add the "test_maps.h" file in patch 5 This series introduces the BPF sk local storage. The details is in the patch 1 commit message. ==================== Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27bpf: Add ene-to-end test for bpf_sk_storage_* helpersMartin KaFai Lau
This patch rides on an existing BPF_PROG_TYPE_CGROUP_SKB test (test_sock_fields.c) to do a TCP end-to-end test on the new bpf_sk_storage_* helpers. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27bpf: Add BPF_MAP_TYPE_SK_STORAGE test to test_mapsMartin KaFai Lau
This patch adds BPF_MAP_TYPE_SK_STORAGE test to test_maps. The src file is rather long, so it is put into another dir map_tests/ and compile like the current prog_tests/ does. Other existing tests in test_maps can also be re-factored into map_tests/ in the future. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27bpf: Add verifier tests for the bpf_sk_storageMartin KaFai Lau
This patch adds verifier tests for the bpf_sk_storage: 1. ARG_PTR_TO_MAP_VALUE_OR_NULL 2. Map and helper compatibility (e.g. disallow bpf_map_loookup_elem) It also takes this chance to remove the unused struct btf_raw_data and uses the BTF encoding macros from "test_btf.h". Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27bpf: Refactor BTF encoding macro to test_btf.hMartin KaFai Lau
Refactor common BTF encoding macros for other tests to use. The libbpf may reuse some of them in the future which requires some more thoughts before publishing as a libbpf API. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27bpf: Support BPF_MAP_TYPE_SK_STORAGE in bpf map probingMartin KaFai Lau
This patch supports probing for the new BPF_MAP_TYPE_SK_STORAGE. BPF_MAP_TYPE_SK_STORAGE enforces BTF usage, so the new probe requires to create and load a BTF also. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27bpf: Sync bpf.h to toolsMartin KaFai Lau
This patch sync the bpf.h to tools/. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27bpf: Introduce bpf sk local storageMartin KaFai Lau
After allowing a bpf prog to - directly read the skb->sk ptr - get the fullsock bpf_sock by "bpf_sk_fullsock()" - get the bpf_tcp_sock by "bpf_tcp_sock()" - get the listener sock by "bpf_get_listener_sock()" - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock" into different bpf running context. this patch is another effort to make bpf's network programming more intuitive to do (together with memory and performance benefit). When bpf prog needs to store data for a sk, the current practice is to define a map with the usual 4-tuples (src/dst ip/port) as the key. If multiple bpf progs require to store different sk data, multiple maps have to be defined. Hence, wasting memory to store the duplicated keys (i.e. 4 tuples here) in each of the bpf map. [ The smallest key could be the sk pointer itself which requires some enhancement in the verifier and it is a separate topic. ] Also, the bpf prog needs to clean up the elem when sk is freed. Otherwise, the bpf map will become full and un-usable quickly. The sk-free tracking currently could be done during sk state transition (e.g. BPF_SOCK_OPS_STATE_CB). The size of the map needs to be predefined which then usually ended-up with an over-provisioned map in production. Even the map was re-sizable, while the sk naturally come and go away already, this potential re-size operation is arguably redundant if the data can be directly connected to the sk itself instead of proxy-ing through a bpf map. This patch introduces sk->sk_bpf_storage to provide local storage space at sk for bpf prog to use. The space will be allocated when the first bpf prog has created data for this particular sk. The design optimizes the bpf prog's lookup (and then optionally followed by an inline update). bpf_spin_lock should be used if the inline update needs to be protected. BPF_MAP_TYPE_SK_STORAGE: ----------------------- To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can be created to fit different bpf progs' needs. The map enforces BTF to allow printing the sk-local-storage during a system-wise sk dump (e.g. "ss -ta") in the future. The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete a "sk-local-storage" data from a particular sk. Think of the map as a meta-data (or "type") of a "sk-local-storage". This particular "type" of "sk-local-storage" data can then be stored in any sk. The main purposes of this map are mostly: 1. Define the size of a "sk-local-storage" type. 2. Provide a similar syscall userspace API as the map (e.g. lookup/update, map-id, map-btf...etc.) 3. Keep track of all sk's storages of this "type" and clean them up when the map is freed. sk->sk_bpf_storage: ------------------ The main lookup/update/delete is done on sk->sk_bpf_storage (which is a "struct bpf_sk_storage"). When doing a lookup, the "map" pointer is now used as the "key" to search on the sk_storage->list. The "map" pointer is actually serving as the "type" of the "sk-local-storage" that is being requested. To allow very fast lookup, it should be as fast as looking up an array at a stable-offset. At the same time, it is not ideal to set a hard limit on the number of sk-local-storage "type" that the system can have. Hence, this patch takes a cache approach. The last search result from sk_storage->list is cached in sk_storage->cache[] which is a stable sized array. Each "sk-local-storage" type has a stable offset to the cache[] array. In the future, a map's flag could be introduced to do cache opt-out/enforcement if it became necessary. The cache size is 16 (i.e. 16 types of "sk-local-storage"). Programs can share map. On the program side, having a few bpf_progs running in the networking hotpath is already a lot. The bpf_prog should have already consolidated the existing sock-key-ed map usage to minimize the map lookup penalty. 16 has enough runway to grow. All sk-local-storage data will be removed from sk->sk_bpf_storage during sk destruction. bpf_sk_storage_get() and bpf_sk_storage_delete(): ------------------------------------------------ Instead of using bpf_map_(lookup|update|delete)_elem(), the bpf prog needs to use the new helper bpf_sk_storage_get() and bpf_sk_storage_delete(). The verifier can then enforce the ARG_PTR_TO_SOCKET argument. The bpf_sk_storage_get() also allows to "create" new elem if one does not exist in the sk. It is done by the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE. The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together, it has eliminated the potential use cases for an equivalent bpf_map_update_elem() API (for bpf_prog) in this patch. Misc notes: ---------- 1. map_get_next_key is not supported. From the userspace syscall perspective, the map has the socket fd as the key while the map can be shared by pinned-file or map-id. Since btf is enforced, the existing "ss" could be enhanced to pretty print the local-storage. Supporting a kernel defined btf with 4 tuples as the return key could be explored later also. 2. The sk->sk_lock cannot be acquired. Atomic operations is used instead. e.g. cmpxchg is done on the sk->sk_bpf_storage ptr. Please refer to the source code comments for the details in synchronization cases and considerations. 3. The mem is charged to the sk->sk_omem_alloc as the sk filter does. Benchmark: --------- Here is the benchmark data collected by turning on the "kernel.bpf_stats_enabled" sysctl. Two bpf progs are tested: One bpf prog with the usual bpf hashmap (max_entries = 8192) with the sk ptr as the key. (verifier is modified to support sk ptr as the key That should have shortened the key lookup time.) Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE. Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for each egress skb and then bump the cnt. netperf is used to drive data with 4096 connected UDP sockets. BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run) 27: cgroup_skb name egress_sk_map tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633 loaded_at 2019-04-15T13:46:39-0700 uid 0 xlated 344B jited 258B memlock 4096B map_ids 16 btf_id 5 BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run) 30: cgroup_skb name egress_sk_stora tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739 loaded_at 2019-04-15T13:47:54-0700 uid 0 xlated 168B jited 156B memlock 4096B map_ids 17 btf_id 6 Here is a high-level picture on how are the objects organized: sk ┌──────┐ │ │ │ │ │ │ │*sk_bpf_storage─────▶ bpf_sk_storage └──────┘ ┌───────┐ ┌───────────┤ list │ │ │ │ │ │ │ │ │ │ │ └───────┘ │ │ elem │ ┌────────┐ ├─▶│ snode │ │ ├────────┤ │ │ data │ bpf_map │ ├────────┤ ┌─────────┐ │ │map_node│◀─┬─────┤ list │ │ └────────┘ │ │ │ │ │ │ │ │ elem │ │ │ │ ┌────────┐ │ └─────────┘ └─▶│ snode │ │ ├────────┤ │ bpf_map │ data │ │ ┌─────────┐ ├────────┤ │ │ list ├───────▶│map_node│ │ │ │ └────────┘ │ │ │ │ │ │ elem │ └─────────┘ ┌────────┐ │ ┌─▶│ snode │ │ │ ├────────┤ │ │ │ data │ │ │ ├────────┤ │ │ │map_node│◀─┘ │ └────────┘ │ │ │ ┌───────┐ sk └──────────│ list │ ┌──────┐ │ │ │ │ │ │ │ │ │ │ │ │ └───────┘ │*sk_bpf_storage───────▶bpf_sk_storage └──────┘ Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26Merge branch 'writeable-bpf-tracepoints'Alexei Starovoitov
Matt Mullins says: ==================== This adds an opt-in interface for tracepoints to expose a writable context to BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE programs that are attached, while supporting read-only access from existing BPF_PROG_TYPE_RAW_TRACEPOINT programs, as well as from non-BPF-based tracepoints. The initial motivation is to support tracing that can be observed from the remote end of an NBD socket, e.g. by adding flags to the struct nbd_request header. Earlier attempts included adding an NBD-specific tracepoint fd, but in code review, I was recommended to implement it more generically -- as a result, this patchset is far simpler than my initial try. v4->v5: * rebased onto bpf-next/master and fixed merge conflicts * "tools: sync bpf.h" also syncs comments that have previously changed in bpf-next v3->v4: * fixed a silly copy/paste typo in include/trace/events/bpf_test_run.h (_TRACE_NBD_H -> _TRACE_BPF_TEST_RUN_H) * fixed incorrect/misleading wording in patch 1's commit message, since the pointer cannot be directly dereferenced in a BPF_PROG_TYPE_RAW_TRACEPOINT * cleaned up the error message wording if the prog_tests fail * Addressed feedback from Yonghong * reject non-pointer-sized accesses to the buffer pointer * use sizeof(struct nbd_request) as one-byte-past-the-end in raw_tp_writable_reject_nbd_invalid.c * use BPF_MOV64_IMM instead of BPF_LD_IMM64 v2->v3: * Andrew addressed Josef's comments: * C-style commenting in nbd.c * Collapsed identical events into a single DECLARE_EVENT_CLASS. This saves about 2kB of kernel text v1->v2: * add selftests * sync tools/include/uapi/linux/bpf.h * reject variable offset into the buffer * add string representation of PTR_TO_TP_BUFFER to reg_type_str ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26selftests: bpf: test writable buffers in raw tpsMatt Mullins
This tests that: * a BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE cannot be attached if it uses either: * a variable offset to the tracepoint buffer, or * an offset beyond the size of the tracepoint buffer * a tracer can modify the buffer provided when attached to a writable tracepoint in bpf_prog_test_run Signed-off-by: Matt Mullins <mmullins@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26tools: sync bpf.hMatt Mullins
This adds BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, and fixes up the error: enumeration value ‘BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE’ not handled in switch [-Werror=switch-enum] build errors it would otherwise cause in libbpf. Signed-off-by: Matt Mullins <mmullins@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26nbd: add tracepoints for send/receive timingAndrew Hall
This adds four tracepoints to nbd, enabling separate tracing of payload and header sending/receipt. In the send path for headers that have already been sent, we also explicitly initialize the handle so it can be referenced by the later tracepoint. Signed-off-by: Andrew Hall <hall@fb.com> Signed-off-by: Matt Mullins <mmullins@fb.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26nbd: trace sending nbd requestsMatt Mullins
This adds a tracepoint that can both observe the nbd request being sent to the server, as well as modify that request , e.g., setting a flag in the request that will cause the server to collect detailed tracing data. The struct request * being handled is included to permit correlation with the block tracepoints. Signed-off-by: Matt Mullins <mmullins@fb.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>