<feed xmlns='http://www.w3.org/2005/Atom'>
<title>pm24.git/include/net/netns, branch v6.6</title>
<subtitle>Unnamed repository; edit this file 'description' to name the repository.
</subtitle>
<id>https://git.kobert.dev/pm24.git/atom?h=v6.6</id>
<link rel='self' href='https://git.kobert.dev/pm24.git/atom?h=v6.6'/>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/'/>
<updated>2023-09-13T07:20:28Z</updated>
<entry>
<title>xfrm: fix a data-race in xfrm_gen_index()</title>
<updated>2023-09-13T07:20:28Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-09-08T18:13:59Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=3e4bc23926b83c3c67e5f61ae8571602754131a6'/>
<id>urn:sha1:3e4bc23926b83c3c67e5f61ae8571602754131a6</id>
<content type='text'>
xfrm_gen_index() mutual exclusion uses net-&gt;xfrm.xfrm_policy_lock.

This means we must use a per-netns idx_generator variable,
instead of a static one.
Alternative would be to use an atomic variable.

syzbot reported:

BUG: KCSAN: data-race in xfrm_sk_policy_insert / xfrm_sk_policy_insert

write to 0xffffffff87005938 of 4 bytes by task 29466 on cpu 0:
xfrm_gen_index net/xfrm/xfrm_policy.c:1385 [inline]
xfrm_sk_policy_insert+0x262/0x640 net/xfrm/xfrm_policy.c:2347
xfrm_user_policy+0x413/0x540 net/xfrm/xfrm_state.c:2639
do_ipv6_setsockopt+0x1317/0x2ce0 net/ipv6/ipv6_sockglue.c:943
ipv6_setsockopt+0x57/0x130 net/ipv6/ipv6_sockglue.c:1012
rawv6_setsockopt+0x21e/0x410 net/ipv6/raw.c:1054
sock_common_setsockopt+0x61/0x70 net/core/sock.c:3697
__sys_setsockopt+0x1c9/0x230 net/socket.c:2263
__do_sys_setsockopt net/socket.c:2274 [inline]
__se_sys_setsockopt net/socket.c:2271 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2271
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffffffff87005938 of 4 bytes by task 29460 on cpu 1:
xfrm_sk_policy_insert+0x13e/0x640
xfrm_user_policy+0x413/0x540 net/xfrm/xfrm_state.c:2639
do_ipv6_setsockopt+0x1317/0x2ce0 net/ipv6/ipv6_sockglue.c:943
ipv6_setsockopt+0x57/0x130 net/ipv6/ipv6_sockglue.c:1012
rawv6_setsockopt+0x21e/0x410 net/ipv6/raw.c:1054
sock_common_setsockopt+0x61/0x70 net/core/sock.c:3697
__sys_setsockopt+0x1c9/0x230 net/socket.c:2263
__do_sys_setsockopt net/socket.c:2274 [inline]
__se_sys_setsockopt net/socket.c:2271 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2271
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x00006ad8 -&gt; 0x00006b18

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 29460 Comm: syz-executor.1 Not tainted 6.5.0-rc5-syzkaller-00243-g9106536c1aa3 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023

Fixes: 1121994c803f ("netns xfrm: policy insertion in netns")
Reported-by: syzbot &lt;syzkaller@googlegroups.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Steffen Klassert &lt;steffen.klassert@secunet.com&gt;
Cc: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Acked-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Signed-off-by: Steffen Klassert &lt;steffen.klassert@secunet.com&gt;
</content>
</entry>
<entry>
<title>net: Remove leftover include from nftables.h</title>
<updated>2023-08-13T13:55:25Z</updated>
<author>
<name>Jörn-Thorben Hinz</name>
<email>jthinz@mailbox.tu-berlin.de</email>
</author>
<published>2023-08-11T17:33:57Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=f614a29d6ca6962139b0eb36b985e3dda80258a6'/>
<id>urn:sha1:f614a29d6ca6962139b0eb36b985e3dda80258a6</id>
<content type='text'>
Commit db3685b4046f ("net: remove obsolete members from struct net")
removed the uses of struct list_head from this header, without removing
the corresponding included header.

Signed-off-by: Jörn-Thorben Hinz &lt;jthinz@mailbox.tu-berlin.de&gt;
Reviewed-by: Simon Horman &lt;horms@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: get rid of sysctl_tcp_adv_win_scale</title>
<updated>2023-07-19T01:41:18Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-07-17T15:29:17Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=dfa2f0483360d4d6f2324405464c9f281156bd87'/>
<id>urn:sha1:dfa2f0483360d4d6f2324405464c9f281156bd87</id>
<content type='text'>
With modern NIC drivers shifting to full page allocations per
received frame, we face the following issue:

TCP has one per-netns sysctl used to tweak how to translate
a memory use into an expected payload (RWIN), in RX path.

tcp_win_from_space() implementation is limited to few cases.

For hosts dealing with various MSS, we either under estimate
or over estimate the RWIN we send to the remote peers.

For instance with the default sysctl_tcp_adv_win_scale value,
we expect to store 50% of payload per allocated chunk of memory.

For the typical use of MTU=1500 traffic, and order-0 pages allocations
by NIC drivers, we are sending too big RWIN, leading to potential
tcp collapse operations, which are extremely expensive and source
of latency spikes.

This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
uses a per socket scaling factor, so that we can precisely
adjust the RWIN based on effective skb-&gt;len/skb-&gt;truesize ratio.

This patch alone can double TCP receive performance when receivers
are too slow to drain their receive queue, or by allowing
a bigger RWIN when MSS is close to PAGE_SIZE.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: enforce receive buffer memory limits by allowing the tcp window to shrink</title>
<updated>2023-06-17T08:53:53Z</updated>
<author>
<name>mfreemon@cloudflare.com</name>
<email>mfreemon@cloudflare.com</email>
</author>
<published>2023-06-12T03:05:24Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=b650d953cd391595e536153ce30b4aab385643ac'/>
<id>urn:sha1:b650d953cd391595e536153ce30b4aab385643ac</id>
<content type='text'>
Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce:  Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()).  This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender.  This problem has previously been identified in
RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session.  A receive buffer full condition should instead result
in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows.  It
is not applicable to mice flows.  Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows.  Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time.  In these cases, we directly
encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf).  This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon &lt;mfreemon@cloudflare.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2023-06-08T18:35:14Z</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2023-06-08T18:34:28Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=449f6bc17a51e68b06cfd742898e5ff3fe6e04d7'/>
<id>urn:sha1:449f6bc17a51e68b06cfd742898e5ff3fe6e04d7</id>
<content type='text'>
Cross-merge networking fixes after downstream PR.

Conflicts:

net/sched/sch_taprio.c
  d636fc5dd692 ("net: sched: add rcu annotations around qdisc-&gt;qdisc_sleeping")
  dced11ef84fb ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")

net/ipv4/sysctl_net_ipv4.c
  e209fee4118f ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294")
  ccce324dabfe ("tcp: make the first N SYN RTO backoffs linear")
https://lore.kernel.org/all/20230605100816.08d41a7b@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net/ipv6: convert skip_notify_on_dev_down sysctl to u8</title>
<updated>2023-06-03T05:55:43Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-06-01T16:04:45Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=ef62c0ae6db11c095880e473db9f846132d7eba8'/>
<id>urn:sha1:ef62c0ae6db11c095880e473db9f846132d7eba8</id>
<content type='text'>
Save a bit a space, and could help future sysctls to
use the same pattern.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Acked-by: Matthieu Baerts &lt;matthieu.baerts@tessares.net&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net/ipv6: fix bool/int mismatch for skip_notify_on_dev_down</title>
<updated>2023-06-03T05:55:43Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-06-01T16:04:44Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=edf2e1d2019b2730d6076dbe4c040d37d7c10bbe'/>
<id>urn:sha1:edf2e1d2019b2730d6076dbe4c040d37d7c10bbe</id>
<content type='text'>
skip_notify_on_dev_down ctl table expects this field
to be an int (4 bytes), not a bool (1 byte).

Because proc_dou8vec_minmax() was added in 5.13,
this patch converts skip_notify_on_dev_down to an int.

Following patch then converts the field to u8 and use proc_dou8vec_minmax().

Fixes: 7c6bb7d2faaf ("net/ipv6: Add knob to skip DELROUTE message on device down")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Acked-by: Matthieu Baerts &lt;matthieu.baerts@tessares.net&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: make the first N SYN RTO backoffs linear</title>
<updated>2023-05-11T08:31:16Z</updated>
<author>
<name>David Morley</name>
<email>morleyd@google.com</email>
</author>
<published>2023-05-09T18:05:58Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=ccce324dabfe2143519daf50ed8b1ef1d0c542f7'/>
<id>urn:sha1:ccce324dabfe2143519daf50ed8b1ef1d0c542f7</id>
<content type='text'>
Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.

We chose a default value for this sysctl of 4, to follow
the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
MacOS and IOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf

This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts

This changes the SYN RTO scheme to be: init_rto_val for
tcp_syn_linear_timeouts, exp backoff starting at init_rto_val

For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...

Signed-off-by: David Morley &lt;morleyd@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Tested-by: David Morley &lt;morleyd@google.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
</content>
</entry>
<entry>
<title>ipv6: add icmpv6_error_anycast_as_unicast for ICMPv6</title>
<updated>2023-04-21T03:07:50Z</updated>
<author>
<name>Mahesh Bandewar</name>
<email>maheshb@google.com</email>
</author>
<published>2023-04-19T01:32:38Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=7ab75456be144a354fbb3df1516d82fc24d3d67d'/>
<id>urn:sha1:7ab75456be144a354fbb3df1516d82fc24d3d67d</id>
<content type='text'>
ICMPv6 error packets are not sent to the anycast destinations and this
prevents things like traceroute from working. So create a setting similar
to ECHO when dealing with Anycast sources (icmpv6_echo_ignore_anycast).

Signed-off-by: Mahesh Bandewar &lt;maheshb@google.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Reviewed-by: Maciej Żenczykowski &lt;maze@google.com&gt;
Link: https://lore.kernel.org/r/20230419013238.2691167-1-maheshb@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf</title>
<updated>2023-02-23T05:25:23Z</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2023-02-23T05:25:23Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=fd2a55e74a991ae5ff531c9da52963277dc7fbd5'/>
<id>urn:sha1:fd2a55e74a991ae5ff531c9da52963277dc7fbd5</id>
<content type='text'>
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Fix broken listing of set elements when table has an owner.

2) Fix conntrack refcount leak in ctnetlink with related conntrack
   entries, from Hangyu Hua.

3) Fix use-after-free/double-free in ctnetlink conntrack insert path,
   from Florian Westphal.

4) Fix ip6t_rpfilter with VRF, from Phil Sutter.

5) Fix use-after-free in ebtables reported by syzbot, also from Florian.

6) Use skb-&gt;len in xt_length to deal with IPv6 jumbo packets,
   from Xin Long.

7) Fix NETLINK_LISTEN_ALL_NSID with ctnetlink, from Florian Westphal.

8) Fix memleak in {ip_,ip6_,arp_}tables in ENOMEM error case,
   from Pavel Tikhomirov.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: x_tables: fix percpu counter block leak on error path when creating new netns
  netfilter: ctnetlink: make event listener tracking global
  netfilter: xt_length: use skb len to match in length_mt6
  netfilter: ebtables: fix table blob use-after-free
  netfilter: ip6t_rpfilter: Fix regression with VRF interfaces
  netfilter: conntrack: fix rmmod double-free race
  netfilter: ctnetlink: fix possible refcount leak in ctnetlink_create_conntrack()
  netfilter: nf_tables: allow to fetch set elements when table has an owner
====================

Link: https://lore.kernel.org/r/20230222092137.88637-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
</feed>
