Age | Commit message (Collapse) | Author |
|
We have some problem closing zero-window fin-wait-1 tcp sockets in our
environment. This patch come from the investigation.
Previously tcp_abort only sends out reset and calls tcp_done when the
socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
purging the write queue, but not close the socket and left it to the
timer.
While purging the write queue, tp->packets_out and sk->sk_write_queue
is cleared along the way. However tcp_retransmit_timer have early
return based on !tp->packets_out and tcp_probe_timer have early
return based on !sk->sk_write_queue.
This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
and socket not being killed by the timers, converting a zero-windowed
orphan into a forever orphan.
This patch removes the SOCK_DEAD check in tcp_abort, making it send
reset to peer and close the socket accordingly. Preventing the
timer-less orphan from happening.
According to Lorenzo's email in the v1 thread, the check was there to
prevent force-closing the same socket twice. That situation is handled
by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
already closed.
The -ENOENT code comes from the associate patch Lorenzo made for
iproute2-ss; link attached below, which also conform to RFC 9293.
At the end of the patch, tcp_write_queue_purge(sk) is removed because it
was already called in tcp_done_with_error().
p.s. This is the same patch with v2. Resent due to mis-labeled "changes
requested" on patchwork.kernel.org.
Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Xueming Feng <kuro@kuroa.me>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Recent commit fc7ec7f554d7 ("l2tp: delete sessions using work queue")
incorrectly uses drain_workqueue. The use of drain_workqueue in
l2tp_pre_exit_net is flawed because the workqueue is shared by all
nets and it is therefore possible for new work items to be queued
for other nets while drain_workqueue runs.
Instead of using drain_workqueue, use __flush_workqueue twice. The
first one will run all tunnel delete work items and any work already
queued. When tunnel delete work items are run, they may queue
new session delete work items, which the second __flush_workqueue will
run.
In l2tp_exit_net, warn if any of the net's idr lists are not empty.
Fixes: fc7ec7f554d7 ("l2tp: delete sessions using work queue")
Signed-off-by: James Chapman <jchapman@katalix.com>
Link: https://patch.msgid.link/20240823142257.692667-1-jchapman@katalix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
fix an unreleased lock in out_dev_put path by removing the (now)
unnecessary path.
Reported-by: syzbot+c641161e97237326ea74@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c641161e97237326ea74
Fixes: 3688ff3077d3 ("net: ethtool: cable-test: Target the command to the requested PHY")
Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20240826134656.94892-1-djahchankoike@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
fq_dequeue() has a complex logic to find packets in one of the 3 bands.
As Neal found out, it is possible that one band has a deficit smaller
than its weight. fq_dequeue() can return NULL while some packets are
elligible for immediate transmit.
In this case, more than one iteration is needed to refill pband->credit.
With default parameters (weights 589824 196608 65536) bug can trigger
if large BIG TCP packets are sent to the lowest priority band.
Bisected-by: John Sperbeck <jsperbeck@google.com>
Diagnosed-by: Neal Cardwell <ncardwell@google.com>
Fixes: 29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR scheduling")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20240824181901.953776-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
<tldr>
skb network header of the single-tagged vlan packet continues to point the
vlan payload (e.g. IP) after second vlan tag is pushed by tc act_vlan. This
causes problem at the dissector which expects double-tagged packet network
header to point to the inner vlan.
The fix is to adjust network header in tcf_act_vlan.c but requires
refactoring of skb_vlan_push function.
</tldr>
Consider the following shell script snippet configuring TC rules on the
veth interface:
ip link add veth0 type veth peer veth1
ip link set veth0 up
ip link set veth1 up
tc qdisc add dev veth0 clsact
tc filter add dev veth0 ingress pref 10 chain 0 flower \
num_of_vlans 2 cvlan_ethtype 0x800 action goto chain 5
tc filter add dev veth0 ingress pref 20 chain 0 flower \
num_of_vlans 1 action vlan push id 100 \
protocol 0x8100 action goto chain 5
tc filter add dev veth0 ingress pref 30 chain 5 flower \
num_of_vlans 2 cvlan_ethtype 0x800 action simple sdata "success"
Sending double-tagged vlan packet with the IP payload inside:
cat <<ENDS | text2pcap - - | tcpreplay -i veth1 -
0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 64 ..........."...d
0010 81 00 00 14 08 00 45 04 00 26 04 d2 00 00 7f 11 ......E..&......
0020 18 ef 0a 00 00 01 14 00 00 02 00 00 00 00 00 12 ................
0030 e1 c7 00 00 00 00 00 00 00 00 00 00 ............
ENDS
will match rule 10, goto rule 30 in chain 5 and correctly emit "success" to
the dmesg.
OTOH, sending single-tagged vlan packet:
cat <<ENDS | text2pcap - - | tcpreplay -i veth1 -
0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 14 ..........."....
0010 08 00 45 04 00 2a 04 d2 00 00 7f 11 18 eb 0a 00 ..E..*..........
0020 00 01 14 00 00 02 00 00 00 00 00 16 e1 bf 00 00 ................
0030 00 00 00 00 00 00 00 00 00 00 00 00 ............
ENDS
will match rule 20, will push the second vlan tag but will *not* match
rule 30. IOW, the match at rule 30 fails if the second vlan was freshly
pushed by the kernel.
Lets look at __skb_flow_dissect working on the double-tagged vlan packet.
Here is the relevant code from around net/core/flow_dissector.c:1277
copy-pasted here for convenience:
if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX &&
skb && skb_vlan_tag_present(skb)) {
proto = skb->protocol;
} else {
vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
data, hlen, &_vlan);
if (!vlan) {
fdret = FLOW_DISSECT_RET_OUT_BAD;
break;
}
proto = vlan->h_vlan_encapsulated_proto;
nhoff += sizeof(*vlan);
}
The "else" clause above gets the protocol of the encapsulated packet from
the skb data at the network header location. printk debugging has showed
that in the good double-tagged packet case proto is
htons(0x800 == ETH_P_IP) as expected. However in the single-tagged packet
case proto is garbage leading to the failure to match tc filter 30.
proto is being set from the skb header pointed by nhoff parameter which is
defined at the beginning of __skb_flow_dissect
(net/core/flow_dissector.c:1055 in the current version):
nhoff = skb_network_offset(skb);
Therefore the culprit seems to be that the skb network offset is different
between double-tagged packet received from the interface and single-tagged
packet having its vlan tag pushed by TC.
Lets look at the interesting points of the lifetime of the single/double
tagged packets as they traverse our packet flow.
Both of them will start at __netif_receive_skb_core where the first vlan
tag will be stripped:
if (eth_type_vlan(skb->protocol)) {
skb = skb_vlan_untag(skb);
if (unlikely(!skb))
goto out;
}
At this stage in double-tagged case skb->data points to the second vlan tag
while in single-tagged case skb->data points to the network (eg. IP)
header.
Looking at TC vlan push action (net/sched/act_vlan.c) we have the following
code at tcf_vlan_act (interesting points are in square brackets):
if (skb_at_tc_ingress(skb))
[1] skb_push_rcsum(skb, skb->mac_len);
....
case TCA_VLAN_ACT_PUSH:
err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid |
(p->tcfv_push_prio << VLAN_PRIO_SHIFT),
0);
if (err)
goto drop;
break;
....
out:
if (skb_at_tc_ingress(skb))
[3] skb_pull_rcsum(skb, skb->mac_len);
And skb_vlan_push (net/core/skbuff.c:6204) function does:
err = __vlan_insert_tag(skb, skb->vlan_proto,
skb_vlan_tag_get(skb));
if (err)
return err;
skb->protocol = skb->vlan_proto;
[2] skb->mac_len += VLAN_HLEN;
in the case of pushing the second tag. Lets look at what happens with
skb->data of the single-tagged packet at each of the above points:
1. As a result of the skb_push_rcsum, skb->data is moved back to the start
of the packet.
2. First VLAN tag is moved from the skb into packet buffer, skb->mac_len is
incremented, skb->data still points to the start of the packet.
3. As a result of the skb_pull_rcsum, skb->data is moved forward by the
modified skb->mac_len, thus pointing to the network header again.
Then __skb_flow_dissect will get confused by having double-tagged vlan
packet with the skb->data at the network header.
The solution for the bug is to preserve "skb->data at second vlan header"
semantics in the skb_vlan_push function. We do this by manipulating
skb->network_header rather than skb->mac_len. skb_vlan_push callers are
updated to do skb_reset_mac_len.
Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In packet offload mode, append Security Association (SA) information
to each packet, replicating the crypto offload implementation.
The XFRM_XMIT flag is set to enable packet to be returned immediately
from the validate_xmit_xfrm function, thus aligning with the existing
code path for packet offload mode.
This SA info helps HW offload match packets to their correct security
policies. The XFRM interface ID is included, which is crucial in setups
with multiple XFRM interfaces where source/destination addresses alone
can't pinpoint the right policy.
Signed-off-by: wangfe <wangfe@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Let the kmemdup_array() take care about multiplication
and possible overflows.
Using kmemdup_array() is more appropriate and makes the code
easier to audit.
Signed-off-by: Shen Lichuan <shenlichuan@vivo.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20240827072115.42680-1-shenlichuan@vivo.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Let the kememdup_array() take care about multiplication and possible
overflows.
Signed-off-by: Yu Jiaoliang <yujiaoliang@vivo.com>
Reviewed-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://patch.msgid.link/20240822074743.1366561-1-yujiaoliang@vivo.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Drivers need to purge TX SKB when stopping. Using skb_queue_purge() can't
report TX status to mac80211, causing ieee80211_free_ack_frame() warns
"Have pending ack frames!". Export ieee80211_purge_tx_queue() for drivers
to not have to reimplement it.
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Link: https://patch.msgid.link/20240822014255.10211-1-pkshih@realtek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Change type of parameter @reason to enum rfkill_hard_block_reasons
for API rfkill_set_hw_state_reason() according to its comments, and
all kernel callers have invoked the API with enum type actually.
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://patch.msgid.link/20240811-rfkill_fix-v2-1-9050760336f4@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
IS_ERR() already calls unlikely(), so unlikely() is redundant here.
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Link: https://patch.msgid.link/1724488003-45388-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The Lenovo Yoga Tab 3 Pro YT3-X90 has a non functional "BCM4752" ACPI
device, which uses GPIO resources which are actually necessary / used
for the sound (codec, speaker amplifier) on the Lenovo Yoga Tab 3 Pro.
If the rfkill-gpio driver loads before the sound drivers do the sound
drivers fail to load because the GPIOs are already claimed.
Add a DMI based deny list with the Lenovo Yoga Tab 3 Pro on there and
make rfkill_gpio_probe() exit with -ENODEV for devices on the DMI based
deny list.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://patch.msgid.link/20240825131916.6388-1-hdegoede@redhat.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When we had a comeback, we will never use the default timeout values
again because comeback is never cleared.
Clear comeback if we send another association request which will allow
to start a default timer after Tx status.
The problem was seen with iwlwifi where the tx_status on the association
request is handled before the association response frame (which is the
usual case).
1) Tx assoc request 1/3
2) Rx assoc response (comeback, timeout = 1 second)
3) wait 1 second
4) Tx assoc request 2/3
5) Set timer to IEEE80211_ASSOC_TIMEOUT_LONG = 500ms (1 second after
round_up)
6) tx_status on frame sent in 4) is ignored because comeback is still
true
7) AP does not reply with assoc response
8) wait 1s <= This is where the bug is felt
9) Tx assoc request 3/3
With this fix, in step 6 we will reset the timer to
IEEE80211_ASSOC_TIMEOUT_SHORT = 100ms and we will wait only 100ms in
step 8.
Fixes: b133fdf07db8 ("wifi: mac80211: Skip association timeout update after comeback rejection")
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://patch.msgid.link/20240808085916.23519-1-emmanuel.grumbach@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
According to RFC8325 4.3, Multimedia Streaming: AF31(011010, 26),
AF32(011100, 28), AF33(011110, 30) maps to User Priority = 4
and AC_VI (Video).
However, the original code remain the default three Most Significant
Bits (MSBs) of the DSCP, which makes AF3x map to User Priority = 3
and AC_BE (Best Effort).
Fixes: 6fdb8b8781d5 ("wifi: cfg80211: Update the default DSCP-to-UP mapping")
Signed-off-by: hhorace <hhoracehsu@gmail.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20240807082205.1369-1-hhoracehsu@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Drivers may at times want to iterate their stations with a function
which requires some non-atomic operations.
ieee80211_iterate_stations_mtx() introduces an API to iterate stations
while holding that wiphy's mutex. This allows the iterating function to
do non-atomic operations safely.
Signed-off-by: Rory Little <rory@candelatech.com>
Link: https://patch.msgid.link/20240806004024.2014080-2-rory@candelatech.com
[unify internal list iteration functions]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Now that functions in lib80211 handle "const struct lib80211_crypto_ops",
some structure can be constified as well.
Constifying these structures moves some data to a read-only section, so
increase overall security.
Before:
text data bss dec hex filename
7273 604 16 7893 1ed5 net/wireless/lib80211.o
After:
text data bss dec hex filename
7429 444 16 7889 1ed1 net/wireless/lib80211.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/0cc3741c15f2c502cc85bddda9d6582b5977c8f9.1722839425.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
lib80211_register_crypto_ops() and lib80211_unregister_crypto_ops() don't
modify their "struct lib80211_crypto_ops *ops" argument. So, it can be
declared as const.
Doing so, some adjustments are needed to also constify some date in
"struct lib80211_crypt_data", "struct lib80211_crypto_alg" and the
return value of lib80211_get_crypto_ops().
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/c74085e02f33a11327582b19c9f51c3236e85ae2.1722839425.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Like the commit ab9177d83c04 ("wifi: mac80211: don't use rate mask for
scanning"), ignore incorrect settings to avoid no supported rate warning
reported by syzbot.
The syzbot did bisect and found cause is commit 9df66d5b9f45 ("cfg80211:
fix default HE tx bitrate mask in 2G band"), which however corrects
bitmask of HE MCS and recognizes correctly settings of empty legacy rate
plus HE MCS rate instead of returning -EINVAL.
As suggestions [1], follow the change of SCAN TX to consider this case of
offchannel TX as well.
[1] https://lore.kernel.org/linux-wireless/6ab2dc9c3afe753ca6fdcdd1421e7a1f47e87b84.camel@sipsolutions.net/T/#m2ac2a6d2be06a37c9c47a3d8a44b4f647ed4f024
Reported-by: syzbot+8dd98a9e98ee28dc484a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-wireless/000000000000fdef8706191a3f7b@google.com/
Fixes: 9df66d5b9f45 ("cfg80211: fix default HE tx bitrate mask in 2G band")
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Link: https://patch.msgid.link/20240729074816.20323-1-pkshih@realtek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Introduce 'ieee80211_mgmt_ba()' to avoid code duplication between
'ieee80211_send_addba_resp()', 'ieee80211_send_addba_request()',
and 'ieee80211_send_delba()', ensure that all related addresses
are '__aligned(2)', and prefer convenient 'ether_addr_copy()'
over generic 'memcpy()'. No functional changes expected.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://patch.msgid.link/20240725090925.6022-1-dmantipov@yandex.ru
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When resolving name in ceph_dns_resolve_name(), the end address of name
is determined by the minimum value of delim_p and colon_p. So using min()
here is more in line with the context.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
ipv6_setsockopt() can directly call ip_setsockopt()
instead of going through udp_prot.setsockopt()
ipv6_getsockopt() can directly call ip_getsockopt()
instead of going through udp_prot.getsockopt()
These indirections predate git history, not sure why they
were there.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240823140019.3727643-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The macro sk_for_each_bound_bhash accepts a parameter
__sk, but it was not used, rather the sk2 is directly
used, so we replace the sk2 with __sk in macro.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://patch.msgid.link/20240823070453.3327832-1-lihongbo22@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A sysfs reader can race with a device reset or removal, attempting to
read device state when the device is not actually present. eg:
[exception RIP: qed_get_current_link+17]
#8 [ffffb9e4f2907c48] qede_get_link_ksettings at ffffffffc07a994a [qede]
#9 [ffffb9e4f2907cd8] __rh_call_get_link_ksettings at ffffffff992b01a3
#10 [ffffb9e4f2907d38] __ethtool_get_link_ksettings at ffffffff992b04e4
#11 [ffffb9e4f2907d90] duplex_show at ffffffff99260300
#12 [ffffb9e4f2907e38] dev_attr_show at ffffffff9905a01c
#13 [ffffb9e4f2907e50] sysfs_kf_seq_show at ffffffff98e0145b
#14 [ffffb9e4f2907e68] seq_read at ffffffff98d902e3
#15 [ffffb9e4f2907ec8] vfs_read at ffffffff98d657d1
#16 [ffffb9e4f2907f00] ksys_read at ffffffff98d65c3f
#17 [ffffb9e4f2907f38] do_syscall_64 at ffffffff98a052fb
crash> struct net_device.state ffff9a9d21336000
state = 5,
state 5 is __LINK_STATE_START (0b1) and __LINK_STATE_NOCARRIER (0b100).
The device is not present, note lack of __LINK_STATE_PRESENT (0b10).
This is the same sort of panic as observed in commit 4224cfd7fb65
("net-sysfs: add check for netdevice being present to speed_show").
There are many other callers of __ethtool_get_link_ksettings() which
don't have a device presence check.
Move this check into ethtool to protect all callers.
Fixes: d519e17e2d01 ("net: export device speed and duplex via sysfs")
Fixes: 4224cfd7fb65 ("net-sysfs: add check for netdevice being present to speed_show")
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Link: https://patch.msgid.link/8bae218864beaa44ed01628140475b9bf641c5b0.1724393671.git.jamie.bainbridge@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We found that one close-wait socket was reset by the other side
due to a new connection reusing the same port which is beyond our
expectation, so we have to investigate the underlying reason.
The following experiment is conducted in the test environment. We
limit the port range from 40000 to 40010 and delay the time to close()
after receiving a fin from the active close side, which can help us
easily reproduce like what happened in production.
Here are three connections captured by tcpdump:
127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965525191
127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 2769915070
127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1
127.0.0.1.40002 > 127.0.0.1.9999: Flags [F.], seq 1, ack 1
// a few seconds later, within 60 seconds
127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730
127.0.0.1.9999 > 127.0.0.1.40002: Flags [.], ack 2
127.0.0.1.40002 > 127.0.0.1.9999: Flags [R], seq 2965525193
// later, very quickly
127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730
127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 3120990805
127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1
As we can see, the first flow is reset because:
1) client starts a new connection, I mean, the second one
2) client tries to find a suitable port which is a timewait socket
(its state is timewait, substate is fin_wait2)
3) client occupies that timewait port to send a SYN
4) server finds a corresponding close-wait socket in ehash table,
then replies with a challenge ack
5) client sends an RST to terminate this old close-wait socket.
I don't think the port selection algo can choose a FIN_WAIT2 socket
when we turn on tcp_tw_reuse because on the server side there
remain unread data. In some cases, if one side haven't call close() yet,
we should not consider it as expendable and treat it at will.
Even though, sometimes, the server isn't able to call close() as soon
as possible like what we expect, it can not be terminated easily,
especially due to a second unrelated connection happening.
After this patch, we can see the expected failure if we start a
connection when all the ports are occupied in fin_wait2 state:
"Ncat: Cannot assign requested address."
Reported-by: Jade Dong <jadedong@tencent.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240823001152.31004-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Although commit 4a4cd70369f1 ("l2tp: don't set sk_user_data in tunnel socket")
removed sk->sk_user_data usage, setup_udp_tunnel_sock() still touches
sk->sk_user_data, this conflicts with sockmap which also leverages
sk->sk_user_data to save psock.
Restore this sk->sk_user_data check to avoid such conflicts.
Fixes: 4a4cd70369f1 ("l2tp: don't set sk_user_data in tunnel socket")
Reported-by: syzbot+8dbe3133b840c470da0e@syzkaller.appspotmail.com
Cc: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Tested-by: James Chapman <jchapman@katalix.com>
Reviewed-by: James Chapman <jchapman@katalix.com>
Link: https://patch.msgid.link/20240822182544.378169-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When calculating size of own domain based on number of peers, the result
should be less than MAX_MON_DOMAIN, so using min() here is very semantic.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240822133908.1042240-8-lizetao1@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When coping sockaddr in ip6_mc_msfget(), the time of copies
depends on the minimum value between sl_count and gf_numsrc.
Using min() here is very semantic.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240822133908.1042240-7-lizetao1@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When processing the tail append of sk buffer, the final length needs
to be determined based on expectlen and addlen. Using max() here can
increase the readability of the code.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240822133908.1042240-4-lizetao1@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Correct spelling in net/core.
As reported by codespell.
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240822-net-spell-v1-13-3a98971ce2d2@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Enhance the ethtool cable test interface by introducing the ability to
specify the source of the diagnostic information for cable test results.
This is particularly useful for PHYs that offer multiple diagnostic
methods, such as Time Domain Reflectometry (TDR) and Active Link Cable
Diagnostic (ALCD).
Key changes:
- Added `ethnl_cable_test_result_with_src` and
`ethnl_cable_test_fault_length_with_src` functions to allow specifying
the information source when reporting cable test results.
- Updated existing `ethnl_cable_test_result` and
`ethnl_cable_test_fault_length` functions to use TDR as the default
source, ensuring backward compatibility.
- Modified the UAPI to support these new attributes, enabling drivers to
provide more detailed diagnostic information.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20240822120703.1393130-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Modify netpoll_setup() and __netpoll_setup() to ensure that the netpoll
structure (np) is left in a clean state if setup fails for any reason.
This prevents carrying over misconfigured fields in case of partial
setup success.
Key changes:
- np->dev is now set only after successful setup, ensuring it's always
NULL if netpoll is not configured or if netpoll_setup() fails.
- np->local_ip is zeroed if netpoll setup doesn't complete successfully.
- Added DEBUG_NET_WARN_ON_ONCE() checks to catch unexpected states.
- Reordered some operations in __netpoll_setup() for better logical flow.
These changes improve the reliability of netpoll configuration, since it
assures that the structure is fully initialized or totally unset.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20240822111051.179850-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-08-23
We've added 10 non-merge commits during the last 15 day(s) which contain
a total of 10 files changed, 222 insertions(+), 190 deletions(-).
The main changes are:
1) Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_*sockopt() to address the case
when long-lived sockets miss a chance to set additional callbacks
if a sockops program was not attached early in their lifetime,
from Alan Maguire.
2) Add a batch of BPF selftest improvements which fix a few bugs and add
missing features to improve the test coverage of sockmap/sockhash,
from Michal Luczaj.
3) Fix a false-positive Smatch-reported off-by-one in tcp_validate_cookie()
which is part of the test_tcp_custom_syncookie BPF selftest,
from Kuniyuki Iwashima.
4) Fix the flow_dissector BPF selftest which had a bug in IP header's
tot_len calculation doing subtraction after htons() instead of inside
htons(), from Asbjørn Sloth Tønnesen.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftest: bpf: Remove mssind boundary check in test_tcp_custom_syncookie.c.
selftests/bpf: Introduce __attribute__((cleanup)) in create_pair()
selftests/bpf: Exercise SOCK_STREAM unix_inet_redir_to_connected()
selftests/bpf: Honour the sotype of af_unix redir tests
selftests/bpf: Simplify inet_socketpair() and vsock_socketpair_connectible()
selftests/bpf: Socket pair creation, cleanups
selftests/bpf: Support more socket types in create_pair()
selftests/bpf: Avoid subtraction after htons() in ipip tests
selftests/bpf: add sockopt tests for TCP_BPF_SOCK_OPS_CB_FLAGS
bpf/bpf_get,set_sockopt: add option to set TCP-BPF sock ops flags
====================
Link: https://patch.msgid.link/20240823134959.1091-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In 'ieee80211_beacon_get_ap()', free allocated skb in case of error
returned by 'ieee80211_beacon_protect()'. Compile tested only.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://patch.msgid.link/20240805142035.227847-1-dmantipov@yandex.ru
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next:
Patch #1 fix checksum calculation in nfnetlink_queue with SCTP,
segment GSO packet since skb_zerocopy() does not support
GSO_BY_FRAGS, from Antonio Ojea.
Patch #2 extend nfnetlink_queue coverage to handle SCTP packets,
from Antonio Ojea.
Patch #3 uses consume_skb() instead of kfree_skb() in nfnetlink,
from Donald Hunter.
Patch #4 adds a dedicate commit list for sets to speed up
intra-transaction lookups, from Florian Westphal.
Patch #5 skips removal of element from abort path for the pipapo
backend, ditching the shadow copy of this datastructure
is sufficient.
Patch #6 moves nf_ct_netns_get() out of nf_conncount_init() to
let users of conncoiunt decide when to enable conntrack,
this is needed by openvswitch, from Xin Long.
Patch #7 pass context to all nft_parse_register_load() in
preparation for the next patch.
Patches #8 and #9 reject loads from uninitialized registers from
control plane to remove register initialization from
datapath. From Florian Westphal.
* tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: don't initialize registers in nft_do_chain()
netfilter: nf_tables: allow loads only when register is initialized
netfilter: nf_tables: pass context structure to nft_parse_register_load
netfilter: move nf_ct_netns_get out of nf_conncount_init
netfilter: nf_tables: do not remove elements if set backend implements .abort
netfilter: nf_tables: store new sets in dedicated list
netfilter: nfnetlink: convert kfree_skb to consume_skb
selftests: netfilter: nft_queue.sh: sctp coverage
netfilter: nfnetlink_queue: unbreak SCTP traffic
====================
Link: https://patch.msgid.link/20240822221939.157858-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Return false when memcmp with zero_ssid returns 0 to correctly
handle hidden SSIDs case.
Fixes: 9cc88678db5b ("wifi: mac80211: check SSID in beacon")
Reviewed-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Daniel Gabay <daniel.gabay@intel.com>
Link: https://patch.msgid.link/20240823105546.7ab29ae287a6.I7f98e57e1ab6597614703fdd138cc88ad253d986@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Commit 5fbf57a937f4 ("net: netlink: remove the cb_mutex "injection" from
netlink core") has removed the usage of the 'dump_cb_mutex' field from the
struct netlink_sock.
Remove the field itself now. It saves a few bytes in the structure.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When net devices propagate xdp configurations to slave devices,
we will need to perform a memory provider check to ensure we're
not binding xdp to a device using unreadable netmem.
Currently the ->ndo_bpf calls in a few places. Adding checks to all
these places would not be ideal.
Refactor all the ->ndo_bpf calls into one place where we can add this
check in the future.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
No consumers anymore, remove it. After this, insertion of policies
no longer require list walk of all inexact policies but only those
that are reachable via the candidate sets.
This gives almost linear insertion speeds provided the inserted
policies are for non-overlapping networks.
Before:
Inserted 1000 policies in 70 ms
Inserted 10000 policies in 1155 ms
Inserted 100000 policies in 216848 ms
After:
Inserted 1000 policies in 56 ms
Inserted 10000 policies in 478 ms
Inserted 100000 policies in 4580 ms
Insertion of 1m entries takes about ~40s after this change
on my test vm.
Cc: Noel Kuntze <noel@familie-kuntze.de>
Cc: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
XFRM_MIGRATE still uses the old lookup method:
first check the bydst hash table, then search the list of all the other
policies.
Switch MIGRATE to use the same lookup function as the packetpath.
This is done to remove the last remaining users of the pernet
xfrm.policy_inexact lists with the intent of removing this list.
After this patch, policies are still added to the list on insertion
and they are rehashed as-needed but no single API makes use of these
anymore.
This change is compile tested only.
Cc: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Since commit
6be3b0db6db8 ("xfrm: policy: add inexact policy search tree infrastructure")
policy lookup no longer walks a list but has a set of candidate lists.
This set has to be searched for the best match.
In case there are several matches, the priority wins.
If the priority is also the same, then the historic behaviour with
a single list was to return the first match (first-in-list).
With introduction of serval lists, this doesn't work and a new
'pos' member was added that reflects the xfrm_policy structs position
in the list.
This value is not exported to userspace and it does not need to be
the 'position in the list', it just needs to make sure that
a->pos < b->pos means that a was added to the lists more recently
than b.
This re-walk is expensive when many inexact policies are in use.
Speed this up: when appending the policy to the end of the walker list,
then just take the ->pos value of the last entry made and add 1.
Add a slowpath version to prevent overflow, if we'd assign UINT_MAX
then iterate the entire list and fix the ordering.
While this speeds up insertion considerably finding the insertion spot
in the inexact list still requires a partial list walk.
This is addressed in followup patches.
Before:
./xfrm_policy_add_speed.sh
Inserted 1000 policies in 72 ms
Inserted 10000 policies in 1540 ms
Inserted 100000 policies in 334780 ms
After:
Inserted 1000 policies in 68 ms
Inserted 10000 policies in 1137 ms
Inserted 100000 policies in 157307 ms
Reported-by: Noel Kuntze <noel@familie-kuntze.de>
Cc: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Pull NFS client fixes from Anna Schumaker:
- Fix rpcrdma refcounting in xa_alloc
- Fix rpcrdma usage of XA_FLAGS_ALLOC
- Fix requesting FATTR4_WORD2_OPEN_ARGUMENTS
- Fix attribute bitmap decoder to handle a 3rd word
- Add reschedule points when returning delegations to avoid soft lockups
- Fix clearing layout segments in layoutreturn
- Avoid unnecessary rescanning of the per-server delegation list
* tag 'nfs-for-6.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Avoid unnecessary rescanning of the per-server delegation list
NFSv4: Fix clearing of layout segments in layoutreturn
NFSv4: Add missing rescheduling points in nfs_client_return_marked_delegations
nfs: fix bitmap decoder to handle a 3rd word
nfs: fix the fetch of FATTR4_OPEN_ARGUMENTS
rpcrdma: Trace connection registration and unregistration
rpcrdma: Use XA_FLAGS_ALLOC instead of XA_FLAGS_ALLOC1
rpcrdma: Device kref is over-incremented on error from xa_alloc
|
|
This fixes not handling hibernation actions on suspend notifier so they
are treated in the same way as regular suspend actions.
Fixes: 9952d90ea288 ("Bluetooth: Handle PM_SUSPEND_PREPARE and PM_POST_SUSPEND")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|