diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2023-06-28 16:43:10 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2023-06-28 16:43:10 -0700 |
commit | 3a8a670eeeaa40d87bd38a587438952741980c18 (patch) | |
tree | d5546d311271503eadf75b45d87e12720e72899f /drivers/net/virtio_net.c | |
parent | 6a8cbd9253abc1bd0df4d60c4c24fa555190376d (diff) | |
parent | ae230642190a51b85656d6da2df744d534d59544 (diff) |
Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Jakub Kicinski:
"WiFi 7 and sendpage changes are the biggest pieces of work for this
release. The latter will definitely require fixes but I think that we
got it to a reasonable point.
Core:
- Rework the sendpage & splice implementations
Instead of feeding data into sockets page by page extend sendmsg
handlers to support taking a reference on the data, controlled by a
new flag called MSG_SPLICE_PAGES
Rework the handling of unexpected-end-of-file to invoke an
additional callback instead of trying to predict what the right
combination of MORE/NOTLAST flags is
Remove the MSG_SENDPAGE_NOTLAST flag completely
- Implement SCM_PIDFD, a new type of CMSG type analogous to
SCM_CREDENTIALS, but it contains pidfd instead of plain pid
- Enable socket busy polling with CONFIG_RT
- Improve reliability and efficiency of reporting for ref_tracker
- Auto-generate a user space C library for various Netlink families
Protocols:
- Allow TCP to shrink the advertised window when necessary, prevent
sk_rcvbuf auto-tuning from growing the window all the way up to
tcp_rmem[2]
- Use per-VMA locking for "page-flipping" TCP receive zerocopy
- Prepare TCP for device-to-device data transfers, by making sure
that payloads are always attached to skbs as page frags
- Make the backoff time for the first N TCP SYN retransmissions
linear. Exponential backoff is unnecessarily conservative
- Create a new MPTCP getsockopt to retrieve all info
(MPTCP_FULL_INFO)
- Avoid waking up applications using TLS sockets until we have a full
record
- Allow using kernel memory for protocol ioctl callbacks, paving the
way to issuing ioctls over io_uring
- Add nolocalbypass option to VxLAN, forcing packets to be fully
encapsulated even if they are destined for a local IP address
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
in-kernel ECMP implementation (e.g. Open vSwitch) select the same
link for all packets. Support L4 symmetric hashing in Open vSwitch
- PPPoE: make number of hash bits configurable
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
(ipconfig)
- Add layer 2 miss indication and filtering, allowing higher layers
(e.g. ACL filters) to make forwarding decisions based on whether
packet matched forwarding state in lower devices (bridge)
- Support matching on Connectivity Fault Management (CFM) packets
- Hide the "link becomes ready" IPv6 messages by demoting their
printk level to debug
- HSR: don't enable promiscuous mode if device offloads the proto
- Support active scanning in IEEE 802.15.4
- Continue work on Multi-Link Operation for WiFi 7
BPF:
- Add precision propagation for subprogs and callbacks. This allows
maintaining verification efficiency when subprograms are used, or
in fact passing the verifier at all for complex programs,
especially those using open-coded iterators
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF
assumed the length is always equal to the amount of written data.
But some protos allow passing a NULL buffer to discover what the
output buffer *should* be, without writing anything
- Accept dynptr memory as memory arguments passed to helpers
- Add routing table ID to bpf_fib_lookup BPF helper
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
maps as read-only)
- Show target_{obj,btf}_id in tracing link fdinfo
- Addition of several new kfuncs (most of the names are
self-explanatory):
- Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
and bpf_dynptr_clone().
- bpf_task_under_cgroup()
- bpf_sock_destroy() - force closing sockets
- bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter:
- Relax set/map validation checks in nf_tables. Allow checking
presence of an entry in a map without using the value
- Increase ip_vs_conn_tab_bits range for 64BIT builds
- Allow updating size of a set
- Improve NAT tuple selection when connection is closing
Driver API:
- Integrate netdev with LED subsystem, to allow configuring HW
"offloaded" blinking of LEDs based on link state and activity
(i.e. packets coming in and out)
- Support configuring rate selection pins of SFP modules
- Factor Clause 73 auto-negotiation code out of the drivers, provide
common helper routines
- Add more fool-proof helpers for managing lifetime of MDIO devices
associated with the PCS layer
- Allow drivers to report advanced statistics related to Time Aware
scheduler offload (taprio)
- Allow opting out of VF statistics in link dump, to allow more VFs
to fit into the message
- Split devlink instance and devlink port operations
New hardware / drivers:
- Ethernet:
- Synopsys EMAC4 IP support (stmmac)
- Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
- Marvell 88E6250 7 port switches
- Microchip LAN8650/1 Rev.B0 PHYs
- MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi:
- Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
- Realtek RTL8723DS (SDIO variant)
- Realtek RTL8851BE
- CAN:
- Fintek F81604
Drivers:
- Ethernet NICs:
- Intel (100G, ice):
- support dynamic interrupt allocation
- use meta data match instead of VF MAC addr on slow-path
- nVidia/Mellanox:
- extend link aggregation to handle 4, rather than just 2 ports
- spawn sub-functions without any features by default
- OcteonTX2:
- support HTB (Tx scheduling/QoS) offload
- make RSS hash generation configurable
- support selecting Rx queue using TC filters
- Wangxun (ngbe/txgbe):
- add basic Tx/Rx packet offloads
- add phylink support (SFP/PCS control)
- Freescale/NXP (enetc):
- report TAPRIO packet statistics
- Solarflare/AMD:
- support matching on IP ToS and UDP source port of outer
header
- VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
- add devlink dev info support for EF10
- Virtual NICs:
- Microsoft vNIC:
- size the Rx indirection table based on requested
configuration
- support VLAN tagging
- Amazon vNIC:
- try to reuse Rx buffers if not fully consumed, useful for ARM
servers running with 16kB pages
- Google vNIC:
- support TCP segmentation of >64kB frames
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- enable USXGMII (88E6191X)
- Microchip:
- lan966x: add support for Egress Stage 0 ACL engine
- lan966x: support mapping packet priority to internal switch
priority (based on PCP or DSCP)
- Ethernet PHYs:
- Broadcom PHYs:
- support for Wake-on-LAN for BCM54210E/B50212E
- report LPI counter
- Microsemi PHYs: support RGMII delay configuration (VSC85xx)
- Micrel PHYs: receive timestamp in the frame (LAN8841)
- Realtek PHYs: support optional external PHY clock
- Altera TSE PCS: merge the driver into Lynx PCS which it is a
variant of
- CAN: Kvaser PCIEcan:
- support packet timestamping
- WiFi:
- Intel (iwlwifi):
- major update for new firmware and Multi-Link Operation (MLO)
- configuration rework to drop test devices and split the
different families
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature
- Qualcomm 802.11ax (ath11k):
- Multiple Basic Service Set Identifier (MBSSID) and Enhanced
MBSSID Advertisement (EMA) support in AP mode
- support factory test mode
- RealTek (rtw89):
- add RSSI based antenna diversity
- support U-NII-4 channels on 5 GHz band
- RealTek (rtl8xxxu):
- AP mode support for 8188f
- support USB RX aggregation for the newer chips"
* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
net: scm: introduce and use scm_recv_unix helper
af_unix: Skip SCM_PIDFD if scm->pid is NULL.
net: lan743x: Simplify comparison
netlink: Add __sock_i_ino() for __netlink_diag_dump().
net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
Revert "af_unix: Call scm_recv() only after scm_set_cred()."
phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
libceph: Partially revert changes to support MSG_SPLICE_PAGES
net: phy: mscc: fix packet loss due to RGMII delays
net: mana: use vmalloc_array and vcalloc
net: enetc: use vmalloc_array and vcalloc
ionic: use vmalloc_array and vcalloc
pds_core: use vmalloc_array and vcalloc
gve: use vmalloc_array and vcalloc
octeon_ep: use vmalloc_array and vcalloc
net: usb: qmi_wwan: add u-blox 0x1312 composition
perf trace: fix MSG_SPLICE_PAGES build error
ipvlan: Fix return value of ipvlan_queue_xmit()
netfilter: nf_tables: fix underflow in chain reference counter
netfilter: nf_tables: unbind non-anonymous set if rule construction fails
...
Diffstat (limited to 'drivers/net/virtio_net.c')
-rw-r--r-- | drivers/net/virtio_net.c | 661 |
1 files changed, 385 insertions, 276 deletions
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 486b5849033d..0db14f6b87d3 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -445,6 +445,22 @@ static unsigned int mergeable_ctx_to_truesize(void *mrg_ctx) return (unsigned long)mrg_ctx & ((1 << MRG_CTX_HEADER_SHIFT) - 1); } +static struct sk_buff *virtnet_build_skb(void *buf, unsigned int buflen, + unsigned int headroom, + unsigned int len) +{ + struct sk_buff *skb; + + skb = build_skb(buf, buflen); + if (unlikely(!skb)) + return NULL; + + skb_reserve(skb, headroom); + skb_put(skb, len); + + return skb; +} + /* Called from bottom half context */ static struct sk_buff *page_to_skb(struct virtnet_info *vi, struct receive_queue *rq, @@ -478,13 +494,10 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi, /* copy small packet so we can reuse these pages */ if (!NET_IP_ALIGN && len > GOOD_COPY_LEN && tailroom >= shinfo_size) { - skb = build_skb(buf, truesize); + skb = virtnet_build_skb(buf, truesize, p - buf, len); if (unlikely(!skb)) return NULL; - skb_reserve(skb, p - buf); - skb_put(skb, len); - page = (struct page *)page->private; if (page) give_pages(rq, page); @@ -791,6 +804,75 @@ out: return ret; } +static void put_xdp_frags(struct xdp_buff *xdp) +{ + struct skb_shared_info *shinfo; + struct page *xdp_page; + int i; + + if (xdp_buff_has_frags(xdp)) { + shinfo = xdp_get_shared_info_from_buff(xdp); + for (i = 0; i < shinfo->nr_frags; i++) { + xdp_page = skb_frag_page(&shinfo->frags[i]); + put_page(xdp_page); + } + } +} + +static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp, + struct net_device *dev, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) +{ + struct xdp_frame *xdpf; + int err; + u32 act; + + act = bpf_prog_run_xdp(xdp_prog, xdp); + stats->xdp_packets++; + + switch (act) { + case XDP_PASS: + return act; + + case XDP_TX: + stats->xdp_tx++; + xdpf = xdp_convert_buff_to_frame(xdp); + if (unlikely(!xdpf)) { + netdev_dbg(dev, "convert buff to frame failed for xdp\n"); + return XDP_DROP; + } + + err = virtnet_xdp_xmit(dev, 1, &xdpf, 0); + if (unlikely(!err)) { + xdp_return_frame_rx_napi(xdpf); + } else if (unlikely(err < 0)) { + trace_xdp_exception(dev, xdp_prog, act); + return XDP_DROP; + } + *xdp_xmit |= VIRTIO_XDP_TX; + return act; + + case XDP_REDIRECT: + stats->xdp_redirects++; + err = xdp_do_redirect(dev, xdp, xdp_prog); + if (err) + return XDP_DROP; + + *xdp_xmit |= VIRTIO_XDP_REDIR; + return act; + + default: + bpf_warn_invalid_xdp_action(dev, xdp_prog, act); + fallthrough; + case XDP_ABORTED: + trace_xdp_exception(dev, xdp_prog, act); + fallthrough; + case XDP_DROP: + return XDP_DROP; + } +} + static unsigned int virtnet_get_headroom(struct virtnet_info *vi) { return vi->xdp_enabled ? VIRTIO_XDP_HEADROOM : 0; @@ -864,134 +946,103 @@ err_buf: return NULL; } -static struct sk_buff *receive_small(struct net_device *dev, - struct virtnet_info *vi, - struct receive_queue *rq, - void *buf, void *ctx, - unsigned int len, - unsigned int *xdp_xmit, - struct virtnet_rq_stats *stats) +static struct sk_buff *receive_small_build_skb(struct virtnet_info *vi, + unsigned int xdp_headroom, + void *buf, + unsigned int len) { + unsigned int header_offset; + unsigned int headroom; + unsigned int buflen; struct sk_buff *skb; - struct bpf_prog *xdp_prog; - unsigned int xdp_headroom = (unsigned long)ctx; + + header_offset = VIRTNET_RX_PAD + xdp_headroom; + headroom = vi->hdr_len + header_offset; + buflen = SKB_DATA_ALIGN(GOOD_PACKET_LEN + headroom) + + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + + skb = virtnet_build_skb(buf, buflen, headroom, len); + if (unlikely(!skb)) + return NULL; + + buf += header_offset; + memcpy(skb_vnet_hdr(skb), buf, vi->hdr_len); + + return skb; +} + +static struct sk_buff *receive_small_xdp(struct net_device *dev, + struct virtnet_info *vi, + struct receive_queue *rq, + struct bpf_prog *xdp_prog, + void *buf, + unsigned int xdp_headroom, + unsigned int len, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) +{ unsigned int header_offset = VIRTNET_RX_PAD + xdp_headroom; unsigned int headroom = vi->hdr_len + header_offset; - unsigned int buflen = SKB_DATA_ALIGN(GOOD_PACKET_LEN + headroom) + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset; struct page *page = virt_to_head_page(buf); - unsigned int delta = 0; struct page *xdp_page; - int err; + unsigned int buflen; + struct xdp_buff xdp; + struct sk_buff *skb; unsigned int metasize = 0; + u32 act; - len -= vi->hdr_len; - stats->bytes += len; + if (unlikely(hdr->hdr.gso_type)) + goto err_xdp; - if (unlikely(len > GOOD_PACKET_LEN)) { - pr_debug("%s: rx error: len %u exceeds max size %d\n", - dev->name, len, GOOD_PACKET_LEN); - dev->stats.rx_length_errors++; - goto err; - } + buflen = SKB_DATA_ALIGN(GOOD_PACKET_LEN + headroom) + + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + + if (unlikely(xdp_headroom < virtnet_get_headroom(vi))) { + int offset = buf - page_address(page) + header_offset; + unsigned int tlen = len + vi->hdr_len; + int num_buf = 1; + + xdp_headroom = virtnet_get_headroom(vi); + header_offset = VIRTNET_RX_PAD + xdp_headroom; + headroom = vi->hdr_len + header_offset; + buflen = SKB_DATA_ALIGN(GOOD_PACKET_LEN + headroom) + + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + xdp_page = xdp_linearize_page(rq, &num_buf, page, + offset, header_offset, + &tlen); + if (!xdp_page) + goto err_xdp; - if (likely(!vi->xdp_enabled)) { - xdp_prog = NULL; - goto skip_xdp; + buf = page_address(xdp_page); + put_page(page); + page = xdp_page; } - rcu_read_lock(); - xdp_prog = rcu_dereference(rq->xdp_prog); - if (xdp_prog) { - struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset; - struct xdp_frame *xdpf; - struct xdp_buff xdp; - void *orig_data; - u32 act; + xdp_init_buff(&xdp, buflen, &rq->xdp_rxq); + xdp_prepare_buff(&xdp, buf + VIRTNET_RX_PAD + vi->hdr_len, + xdp_headroom, len, true); - if (unlikely(hdr->hdr.gso_type)) - goto err_xdp; + act = virtnet_xdp_handler(xdp_prog, &xdp, dev, xdp_xmit, stats); - if (unlikely(xdp_headroom < virtnet_get_headroom(vi))) { - int offset = buf - page_address(page) + header_offset; - unsigned int tlen = len + vi->hdr_len; - int num_buf = 1; - - xdp_headroom = virtnet_get_headroom(vi); - header_offset = VIRTNET_RX_PAD + xdp_headroom; - headroom = vi->hdr_len + header_offset; - buflen = SKB_DATA_ALIGN(GOOD_PACKET_LEN + headroom) + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - xdp_page = xdp_linearize_page(rq, &num_buf, page, - offset, header_offset, - &tlen); - if (!xdp_page) - goto err_xdp; - - buf = page_address(xdp_page); - put_page(page); - page = xdp_page; - } + switch (act) { + case XDP_PASS: + /* Recalculate length in case bpf program changed it */ + len = xdp.data_end - xdp.data; + metasize = xdp.data - xdp.data_meta; + break; - xdp_init_buff(&xdp, buflen, &rq->xdp_rxq); - xdp_prepare_buff(&xdp, buf + VIRTNET_RX_PAD + vi->hdr_len, - xdp_headroom, len, true); - orig_data = xdp.data; - act = bpf_prog_run_xdp(xdp_prog, &xdp); - stats->xdp_packets++; - - switch (act) { - case XDP_PASS: - /* Recalculate length in case bpf program changed it */ - delta = orig_data - xdp.data; - len = xdp.data_end - xdp.data; - metasize = xdp.data - xdp.data_meta; - break; - case XDP_TX: - stats->xdp_tx++; - xdpf = xdp_convert_buff_to_frame(&xdp); - if (unlikely(!xdpf)) - goto err_xdp; - err = virtnet_xdp_xmit(dev, 1, &xdpf, 0); - if (unlikely(!err)) { - xdp_return_frame_rx_napi(xdpf); - } else if (unlikely(err < 0)) { - trace_xdp_exception(vi->dev, xdp_prog, act); - goto err_xdp; - } - *xdp_xmit |= VIRTIO_XDP_TX; - rcu_read_unlock(); - goto xdp_xmit; - case XDP_REDIRECT: - stats->xdp_redirects++; - err = xdp_do_redirect(dev, &xdp, xdp_prog); - if (err) - goto err_xdp; - *xdp_xmit |= VIRTIO_XDP_REDIR; - rcu_read_unlock(); - goto xdp_xmit; - default: - bpf_warn_invalid_xdp_action(vi->dev, xdp_prog, act); - fallthrough; - case XDP_ABORTED: - trace_xdp_exception(vi->dev, xdp_prog, act); - goto err_xdp; - case XDP_DROP: - goto err_xdp; - } + case XDP_TX: + case XDP_REDIRECT: + goto xdp_xmit; + + default: + goto err_xdp; } - rcu_read_unlock(); -skip_xdp: - skb = build_skb(buf, buflen); - if (!skb) + skb = virtnet_build_skb(buf, buflen, xdp.data - buf, len); + if (unlikely(!skb)) goto err; - skb_reserve(skb, headroom - delta); - skb_put(skb, len); - if (!xdp_prog) { - buf += header_offset; - memcpy(skb_vnet_hdr(skb), buf, vi->hdr_len); - } /* keep zeroed vnet hdr since XDP is loaded */ if (metasize) skb_metadata_set(skb, metasize); @@ -999,7 +1050,6 @@ skip_xdp: return skb; err_xdp: - rcu_read_unlock(); stats->xdp_drops++; err: stats->drops++; @@ -1008,6 +1058,53 @@ xdp_xmit: return NULL; } +static struct sk_buff *receive_small(struct net_device *dev, + struct virtnet_info *vi, + struct receive_queue *rq, + void *buf, void *ctx, + unsigned int len, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) +{ + unsigned int xdp_headroom = (unsigned long)ctx; + struct page *page = virt_to_head_page(buf); + struct sk_buff *skb; + + len -= vi->hdr_len; + stats->bytes += len; + + if (unlikely(len > GOOD_PACKET_LEN)) { + pr_debug("%s: rx error: len %u exceeds max size %d\n", + dev->name, len, GOOD_PACKET_LEN); + dev->stats.rx_length_errors++; + goto err; + } + + if (unlikely(vi->xdp_enabled)) { + struct bpf_prog *xdp_prog; + + rcu_read_lock(); + xdp_prog = rcu_dereference(rq->xdp_prog); + if (xdp_prog) { + skb = receive_small_xdp(dev, vi, rq, xdp_prog, buf, + xdp_headroom, len, xdp_xmit, + stats); + rcu_read_unlock(); + return skb; + } + rcu_read_unlock(); + } + + skb = receive_small_build_skb(vi, xdp_headroom, buf, len); + if (likely(skb)) + return skb; + +err: + stats->drops++; + put_page(page); + return NULL; +} + static struct sk_buff *receive_big(struct net_device *dev, struct virtnet_info *vi, struct receive_queue *rq, @@ -1031,6 +1128,28 @@ err: return NULL; } +static void mergeable_buf_free(struct receive_queue *rq, int num_buf, + struct net_device *dev, + struct virtnet_rq_stats *stats) +{ + struct page *page; + void *buf; + int len; + + while (num_buf-- > 1) { + buf = virtqueue_get_buf(rq->vq, &len); + if (unlikely(!buf)) { + pr_debug("%s: rx error: %d buffers missing\n", + dev->name, num_buf); + dev->stats.rx_length_errors++; + break; + } + stats->bytes += len; + page = virt_to_head_page(buf); + put_page(page); + } +} + /* Why not use xdp_build_skb_from_frame() ? * XDP core assumes that xdp frags are PAGE_SIZE in length, while in * virtio-net there are 2 points that do not match its requirements: @@ -1132,7 +1251,7 @@ static int virtnet_build_xdp_buff_mrg(struct net_device *dev, dev->name, *num_buf, virtio16_to_cpu(vi->vdev, hdr->num_buffers)); dev->stats.rx_length_errors++; - return -EINVAL; + goto err; } stats->bytes += len; @@ -1151,13 +1270,11 @@ static int virtnet_build_xdp_buff_mrg(struct net_device *dev, pr_debug("%s: rx error: len %u exceeds truesize %lu\n", dev->name, len, (unsigned long)(truesize - room)); dev->stats.rx_length_errors++; - return -EINVAL; + goto err; } frag = &shinfo->frags[shinfo->nr_frags++]; - __skb_frag_set_page(frag, page); - skb_frag_off_set(frag, offset); - skb_frag_size_set(frag, len); + skb_frag_fill_page_desc(frag, page, offset, len); if (page_is_pfmemalloc(page)) xdp_buff_set_frag_pfmemalloc(xdp); @@ -1166,6 +1283,144 @@ static int virtnet_build_xdp_buff_mrg(struct net_device *dev, *xdp_frags_truesize = xdp_frags_truesz; return 0; + +err: + put_xdp_frags(xdp); + return -EINVAL; +} + +static void *mergeable_xdp_get_buf(struct virtnet_info *vi, + struct receive_queue *rq, + struct bpf_prog *xdp_prog, + void *ctx, + unsigned int *frame_sz, + int *num_buf, + struct page **page, + int offset, + unsigned int *len, + struct virtio_net_hdr_mrg_rxbuf *hdr) +{ + unsigned int truesize = mergeable_ctx_to_truesize(ctx); + unsigned int headroom = mergeable_ctx_to_headroom(ctx); + struct page *xdp_page; + unsigned int xdp_room; + + /* Transient failure which in theory could occur if + * in-flight packets from before XDP was enabled reach + * the receive path after XDP is loaded. + */ + if (unlikely(hdr->hdr.gso_type)) + return NULL; + + /* Now XDP core assumes frag size is PAGE_SIZE, but buffers + * with headroom may add hole in truesize, which + * make their length exceed PAGE_SIZE. So we disabled the + * hole mechanism for xdp. See add_recvbuf_mergeable(). + */ + *frame_sz = truesize; + + if (likely(headroom >= virtnet_get_headroom(vi) && + (*num_buf == 1 || xdp_prog->aux->xdp_has_frags))) { + return page_address(*page) + offset; + } + + /* This happens when headroom is not enough because + * of the buffer was prefilled before XDP is set. + * This should only happen for the first several packets. + * In fact, vq reset can be used here to help us clean up + * the prefilled buffers, but many existing devices do not + * support it, and we don't want to bother users who are + * using xdp normally. + */ + if (!xdp_prog->aux->xdp_has_frags) { + /* linearize data for XDP */ + xdp_page = xdp_linearize_page(rq, num_buf, + *page, offset, + VIRTIO_XDP_HEADROOM, + len); + if (!xdp_page) + return NULL; + } else { + xdp_room = SKB_DATA_ALIGN(VIRTIO_XDP_HEADROOM + + sizeof(struct skb_shared_info)); + if (*len + xdp_room > PAGE_SIZE) + return NULL; + + xdp_page = alloc_page(GFP_ATOMIC); + if (!xdp_page) + return NULL; + + memcpy(page_address(xdp_page) + VIRTIO_XDP_HEADROOM, + page_address(*page) + offset, *len); + } + + *frame_sz = PAGE_SIZE; + + put_page(*page); + + *page = xdp_page; + + return page_address(*page) + VIRTIO_XDP_HEADROOM; +} + +static struct sk_buff *receive_mergeable_xdp(struct net_device *dev, + struct virtnet_info *vi, + struct receive_queue *rq, + struct bpf_prog *xdp_prog, + void *buf, + void *ctx, + unsigned int len, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) +{ + struct virtio_net_hdr_mrg_rxbuf *hdr = buf; + int num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers); + struct page *page = virt_to_head_page(buf); + int offset = buf - page_address(page); + unsigned int xdp_frags_truesz = 0; + struct sk_buff *head_skb; + unsigned int frame_sz; + struct xdp_buff xdp; + void *data; + u32 act; + int err; + + data = mergeable_xdp_get_buf(vi, rq, xdp_prog, ctx, &frame_sz, &num_buf, &page, + offset, &len, hdr); + if (unlikely(!data)) + goto err_xdp; + + err = virtnet_build_xdp_buff_mrg(dev, vi, rq, &xdp, data, len, frame_sz, + &num_buf, &xdp_frags_truesz, stats); + if (unlikely(err)) + goto err_xdp; + + act = virtnet_xdp_handler(xdp_prog, &xdp, dev, xdp_xmit, stats); + + switch (act) { + case XDP_PASS: + head_skb = build_skb_from_xdp_buff(dev, vi, &xdp, xdp_frags_truesz); + if (unlikely(!head_skb)) + break; + return head_skb; + + case XDP_TX: + case XDP_REDIRECT: + return NULL; + + default: + break; + } + + put_xdp_frags(&xdp); + +err_xdp: + put_page(page); + mergeable_buf_free(rq, num_buf, dev, stats); + + stats->xdp_drops++; + stats->drops++; + return NULL; } static struct sk_buff *receive_mergeable(struct net_device *dev, @@ -1182,13 +1437,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, struct page *page = virt_to_head_page(buf); int offset = buf - page_address(page); struct sk_buff *head_skb, *curr_skb; - struct bpf_prog *xdp_prog; unsigned int truesize = mergeable_ctx_to_truesize(ctx); unsigned int headroom = mergeable_ctx_to_headroom(ctx); unsigned int tailroom = headroom ? sizeof(struct skb_shared_info) : 0; unsigned int room = SKB_DATA_ALIGN(headroom + tailroom); - unsigned int frame_sz, xdp_room; - int err; head_skb = NULL; stats->bytes += len - vi->hdr_len; @@ -1200,149 +1452,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, goto err_skb; } - if (likely(!vi->xdp_enabled)) { - xdp_prog = NULL; - goto skip_xdp; - } - - rcu_read_lock(); - xdp_prog = rcu_dereference(rq->xdp_prog); - if (xdp_prog) { - unsigned int xdp_frags_truesz = 0; - struct skb_shared_info *shinfo; - struct xdp_frame *xdpf; - struct page *xdp_page; - struct xdp_buff xdp; - void *data; - u32 act; - int i; - - /* Transient failure which in theory could occur if - * in-flight packets from before XDP was enabled reach - * the receive path after XDP is loaded. - */ - if (unlikely(hdr->hdr.gso_type)) - goto err_xdp; - - /* Now XDP core assumes frag size is PAGE_SIZE, but buffers - * with headroom may add hole in truesize, which - * make their length exceed PAGE_SIZE. So we disabled the - * hole mechanism for xdp. See add_recvbuf_mergeable(). - */ - frame_sz = truesize; - - /* This happens when headroom is not enough because - * of the buffer was prefilled before XDP is set. - * This should only happen for the first several packets. - * In fact, vq reset can be used here to help us clean up - * the prefilled buffers, but many existing devices do not - * support it, and we don't want to bother users who are - * using xdp normally. - */ - if (!xdp_prog->aux->xdp_has_frags && - (num_buf > 1 || headroom < virtnet_get_headroom(vi))) { - /* linearize data for XDP */ - xdp_page = xdp_linearize_page(rq, &num_buf, - page, offset, - VIRTIO_XDP_HEADROOM, - &len); - frame_sz = PAGE_SIZE; - - if (!xdp_page) - goto err_xdp; - offset = VIRTIO_XDP_HEADROOM; - } else if (unlikely(headroom < virtnet_get_headroom(vi))) { - xdp_room = SKB_DATA_ALIGN(VIRTIO_XDP_HEADROOM + - sizeof(struct skb_shared_info)); - if (len + xdp_room > PAGE_SIZE) - goto err_xdp; - - xdp_page = alloc_page(GFP_ATOMIC); - if (!xdp_page) - goto err_xdp; - - memcpy(page_address(xdp_page) + VIRTIO_XDP_HEADROOM, - page_address(page) + offset, len); - frame_sz = PAGE_SIZE; - offset = VIRTIO_XDP_HEADROOM; - } else { - xdp_page = page; - } - - data = page_address(xdp_page) + offset; - err = virtnet_build_xdp_buff_mrg(dev, vi, rq, &xdp, data, len, frame_sz, - &num_buf, &xdp_frags_truesz, stats); - if (unlikely(err)) - goto err_xdp_frags; + if (unlikely(vi->xdp_enabled)) { + struct bpf_prog *xdp_prog; - act = bpf_prog_run_xdp(xdp_prog, &xdp); - stats->xdp_packets++; - - switch (act) { - case XDP_PASS: - head_skb = build_skb_from_xdp_buff(dev, vi, &xdp, xdp_frags_truesz); - if (unlikely(!head_skb)) - goto err_xdp_frags; - - if (unlikely(xdp_page != page)) - put_page(page); + rcu_read_lock(); + xdp_prog = rcu_dereference(rq->xdp_prog); + if (xdp_prog) { + head_skb = receive_mergeable_xdp(dev, vi, rq, xdp_prog, buf, ctx, + len, xdp_xmit, stats); rcu_read_unlock(); return head_skb; - case XDP_TX: - stats->xdp_tx++; - xdpf = xdp_convert_buff_to_frame(&xdp); - if (unlikely(!xdpf)) { - netdev_dbg(dev, "convert buff to frame failed for xdp\n"); - goto err_xdp_frags; - } - err = virtnet_xdp_xmit(dev, 1, &xdpf, 0); - if (unlikely(!err)) { - xdp_return_frame_rx_napi(xdpf); - } else if (unlikely(err < 0)) { - trace_xdp_exception(vi->dev, xdp_prog, act); - goto err_xdp_frags; - } - *xdp_xmit |= VIRTIO_XDP_TX; - if (unlikely(xdp_page != page)) - put_page(page); - rcu_read_unlock(); - goto xdp_xmit; - case XDP_REDIRECT: - stats->xdp_redirects++; - err = xdp_do_redirect(dev, &xdp, xdp_prog); - if (err) - goto err_xdp_frags; - *xdp_xmit |= VIRTIO_XDP_REDIR; - if (unlikely(xdp_page != page)) - put_page(page); - rcu_read_unlock(); - goto xdp_xmit; - default: - bpf_warn_invalid_xdp_action(vi->dev, xdp_prog, act); - fallthrough; - case XDP_ABORTED: - trace_xdp_exception(vi->dev, xdp_prog, act); - fallthrough; - case XDP_DROP: - goto err_xdp_frags; - } -err_xdp_frags: - if (unlikely(xdp_page != page)) - __free_pages(xdp_page, 0); - - if (xdp_buff_has_frags(&xdp)) { - shinfo = xdp_get_shared_info_from_buff(&xdp); - for (i = 0; i < shinfo->nr_frags; i++) { - xdp_page = skb_frag_page(&shinfo->frags[i]); - put_page(xdp_page); - } } - - goto err_xdp; + rcu_read_unlock(); } - rcu_read_unlock(); -skip_xdp: head_skb = page_to_skb(vi, rq, page, offset, len, truesize, headroom); curr_skb = head_skb; @@ -1408,27 +1531,13 @@ skip_xdp: ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len); return head_skb; -err_xdp: - rcu_read_unlock(); - stats->xdp_drops++; err_skb: put_page(page); - while (num_buf-- > 1) { - buf = virtqueue_get_buf(rq->vq, &len); - if (unlikely(!buf)) { - pr_debug("%s: rx error: %d buffers missing\n", - dev->name, num_buf); - dev->stats.rx_length_errors++; - break; - } - stats->bytes += len; - page = virt_to_head_page(buf); - put_page(page); - } + mergeable_buf_free(rq, num_buf, dev, stats); + err_buf: stats->drops++; dev_kfree_skb(head_skb); -xdp_xmit: return NULL; } |