path: root/net/core
Age | Commit message | Author
2015-03-20 | inet: get rid of central tcp/dccp listener timer | Eric Dumazet
One of the major issues for TCP is the SYNACK rtx handling, done by inet_csk_reqsk_queue_prune(), fired by the keepalive timer of a TCP_LISTEN socket. This function runs for awfully long times with the socket lock held, meaning that other cpus needing this lock have to spin for hundreds of ms. SYNACKs are sent in huge bursts, likely to cause severe drops anyway. This model was OK 15 years ago when memory was very tight. We can now afford to have a timer per request sock. Timer invocations no longer need to lock the listener, and can be run from all cpus in parallel. With the following patch increasing somaxconn width to 32 bits, I tested a listener with more than 4 million active request sockets, and a steady SYNFLOOD of ~200,000 SYN per second. The host was sending ~830,000 SYNACK per second. This is ~100 times more than what we could achieve before this patch. Later, we will get rid of the listener hash and use ehash instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 | net: Fix high overhead of vlan sub-device teardown. | David S. Miller
When a networking device is taken down that has a non-trivial number of VLAN devices configured under it, we eat a full synchronize_net() for every such VLAN device. This is because of the call chain: NETDEV_DOWN notifier --> vlan_device_event() --> dev_change_flags() --> __dev_change_flags() --> __dev_close() --> __dev_close_many() --> dev_deactivate_many() --> synchronize_net() This is kind of ridiculous because we already have infrastructure for batching an operation X over a list of net devices so that we only incur one sync. So make use of that by exporting dev_close_many() and adjusting its interface so that the caller can fully manage the batch list. Use this in vlan_device_event() and all the overhead goes away. Reported-by: Salam Noureddine <noureddine@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
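The batching idea described above can be illustrated with a small user-space sketch. Everything here is hypothetical: `fake_synchronize()` only counts calls and stands in for an expensive synchronization step, and `struct fake_dev` is not the kernel's `struct net_device`. The point is just the cost difference between syncing per device and syncing once per batch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an expensive sync step: we only count calls. */
static unsigned int sync_calls;

static void fake_synchronize(void)
{
	sync_calls++;
}

/* Hypothetical device; only tracks whether it is up. */
struct fake_dev {
	int up;
};

/* Unbatched teardown: every device pays for its own synchronization. */
static void close_one_by_one(struct fake_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		devs[i].up = 0;
		fake_synchronize();
	}
}

/* Batched teardown: mark every device down, then synchronize once. */
static void close_many(struct fake_dev *devs, size_t n)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		devs[i].up = 0;
	if (n > 0)
		fake_synchronize();
}
```

With N sub-devices, the first variant costs N synchronizations and the second costs one, which is the overhead the commit message says goes away.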
2015-03-18 | net: add support for phys_port_name | David Ahern
Similar to port id allow netdevices to specify port names and export the name via sysfs. Drivers can implement the netdevice operation to assist udev in having sane default names for the devices using the rule: $ cat /etc/udev/rules.d/80-net-setup-link.rules SUBSYSTEM=="net", ACTION=="add", ATTR{phys_port_name}!="", NAME="$attr{phys_port_name}" Use of phys_name versus phys_id was suggested-by Jiri Pirko. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 | net: Add max rate tx queue attribute | John Fastabend
This adds a tx_maxrate attribute to the tx queue sysfs entry allowing for max-rate limiting. Along with DCB-ETS and BQL this provides another knob to tune queue performance. The limit units are Mbps. By default it is disabled. To disable the rate limitation after it has been set for a queue, it should be set to zero. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 | tcp: rename struct tcp_request_sock listener | Eric Dumazet
The listener field in struct tcp_request_sock is a pointer back to the listener. We now have req->rsk_listener, so TCP only needs one boolean and not a full pointer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 | bpf: allow BPF programs access 'protocol' and 'vlan_tci' fields | Alexei Starovoitov
as a follow on to patch 70006af95515 ("bpf: allow eBPF access skb fields") this patch allows 'protocol' and 'vlan_tci' fields to be accessible from extended BPF programs. The usage of 'protocol', 'vlan_present' and 'vlan_tci' fields is the same as corresponding SKF_AD_PROTOCOL, SKF_AD_VLAN_TAG_PRESENT and SKF_AD_VLAN_TAG accesses in classic BPF. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16 | net: kernel socket should be released in init_net namespace | Ying Xue
Creating a kernel socket with sock_create_kern() happens in the "init_net" namespace; however, releasing it with sk_release_kernel() occurs in the current namespace, which may differ from "init_net". Therefore, we should guarantee that the namespace in which a kernel socket is released is the same as the one in which it was created. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16 | inet: add proper refcounting to request sock | Eric Dumazet
reqsk_put() is the generic function that should be used to release a refcount (and automatically call reqsk_free()). reqsk_free() might be called if the refcount is known to be 0 or undefined. refcnt is set to one in inet_csk_reqsk_queue_add(). As request socks are not yet in the global ehash table, I added temporary debugging checks in reqsk_put() and reqsk_free(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
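The put/free split described above follows the usual reference counting pattern, sketched here in user-space C11. This is not the kernel's reqsk_put() or struct request_sock; `struct req`, `req_alloc()`, `req_hold()`, `req_put()` and the `req_frees` counter are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical request object; not the kernel's struct request_sock. */
struct req {
	atomic_int refcnt;
};

static int req_frees;	/* counts frees so the behaviour can be observed */

static struct req *req_alloc(void)
{
	struct req *r = malloc(sizeof(*r));

	if (r)
		atomic_init(&r->refcnt, 1);	/* creator holds one reference */
	return r;
}

static void req_hold(struct req *r)
{
	atomic_fetch_add(&r->refcnt, 1);
}

/* Drop one reference; whoever drops the last one frees the object. */
static void req_put(struct req *r)
{
	int old = atomic_fetch_sub(&r->refcnt, 1);

	assert(old > 0);	/* a put without a matching hold is a bug */
	if (old == 1) {
		free(r);
		req_frees++;
	}
}
```

The assertion plays the role of a debugging check: it fires if an object is released more often than it was held.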
2015-03-16 | inet: factorize sock_edemux()/sock_gen_put() code | Eric Dumazet
sock_edemux() is not used in fast path, and should really call sock_gen_put() to save some code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15 | bpf: allow extended BPF programs access skb fields | Alexei Starovoitov
introduce user accessible mirror of in-kernel 'struct sk_buff': struct __sk_buff { __u32 len; __u32 pkt_type; __u32 mark; __u32 queue_mapping; }; bpf programs can do: int bpf_prog(struct __sk_buff *skb) { __u32 var = skb->pkt_type; which will be compiled to bpf assembler as: dst_reg = *(u32 *)(src_reg + 4) // 4 == offsetof(struct __sk_buff, pkt_type) bpf verifier will check validity of access and will convert it to: dst_reg = *(u8 *)(src_reg + offsetof(struct sk_buff, __pkt_type_offset)) dst_reg &= 7 since skb->pkt_type is a bitfield. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
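The offset arithmetic and the bitfield masking mentioned above can be checked with a small user-space sketch. `struct mirror_skb` copies the four fields listed in the commit message. `struct kernel_skb` and `load_pkt_type()` are hypothetical, and placing the packet type in the low three bits of a byte is an assumption made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-visible mirror with the four fields listed in the commit message. */
struct mirror_skb {
	uint32_t len;
	uint32_t pkt_type;
	uint32_t mark;
	uint32_t queue_mapping;
};

/* Hypothetical internal layout: pkt_type sits in the low 3 bits of a byte. */
struct kernel_skb {
	uint32_t len;
	uint8_t flags_byte;	/* bits 0..2: pkt_type, bits 3..7: other flags */
};

/* What a rewritten load does: read the byte, then mask off the bitfield. */
static uint32_t load_pkt_type(const struct kernel_skb *skb)
{
	assert(skb != NULL);
	return skb->flags_byte & 7;
}
```

A program only ever sees the stable offset 4 in the mirror structure, so the internal layout can change without breaking it.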
2015-03-15 | ebpf: add helper for obtaining current processor id | Daniel Borkmann
This patch adds the possibility to obtain raw_smp_processor_id() in eBPF. Currently, this is only possible in classic BPF where commit da2033c28226 ("filter: add SKF_AD_RXHASH and SKF_AD_CPU") has added facilities for this. Perhaps most importantly, this would also allow us to track per CPU statistics with eBPF maps, or to implement a poor-man's per CPU data structure through eBPF maps. Example function proto-type looks like: u32 (*smp_processor_id)(void) = (void *)BPF_FUNC_get_smp_processor_id; Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15 | ebpf: add prandom helper for packet sampling | Daniel Borkmann
This work is similar to commit 4cd3675ebf74 ("filter: added BPF random opcode") and adds a possibility for packet sampling in eBPF. Currently, this is only possible in classic BPF and useful to combine sampling with f.e. packet sockets, possible also with tc. Example function proto-type looks like: u32 (*prandom_u32)(void) = (void *)BPF_FUNC_get_prandom_u32; Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 | inet: prepare sock_edemux() & sock_gen_put() for new SYN_RECV state | Eric Dumazet
sock_edemux() & sock_gen_put() should be ready to cope with request socks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 | net: add req_prot_cleanup() & req_prot_init() helpers | Eric Dumazet
Make proto_register() & proto_unregister() a bit nicer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 | net: Kill hold_net release_net | Eric W. Biederman
hold_net and release_net were an idea that turned out to be useless. The code has been disabled since 2008. Kill the code; it is long past due. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 | sock: fix possible NULL sk dereference in __skb_tstamp_tx | Willem de Bruijn
Test that sk != NULL before reading sk->sk_tsflags. Fixes: 49ca0d8bfaf3 ("net-timestamp: no-payload option") Reported-by: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 | xps: must clear sender_cpu before forwarding | Eric Dumazet
John reported that my previous commit added a regression on his router. This is because sender_cpu & napi_id share a common location, so get_xps_queue() can see garbage and perform an out of bound access. We need to make sure sender_cpu is cleared before doing the transmit, otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger this bug. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: John <jw@nuclearfallout.net> Bisected-by: John <jw@nuclearfallout.net> Fixes: 2bd82484bb4c ("xps: fix xps for stacked devices") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 | net: add real socket cookies | Eric Dumazet
A long standing problem in netlink socket dumps is the use of kernel socket addresses as cookies. 1) It is a security concern. 2) Sockets can be reused quite quickly, so there is no guarantee a cookie is used once and identifies a flow. 3) request sock, establish sock, and timewait socks for a given flow have different cookies. Part of our effort to bring better TCP statistics requires switching to a different allocator. In this patch, I chose to use a per network namespace 64bit generator, and to use it only in the case a socket needs to be dumped to netlink. (This might be refined later if needed) Note that I tried to carry cookies from request sock, to establish sock, then timewait sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eric Salo <salo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
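A lazily assigned cookie drawn from a per-namespace 64-bit generator could look roughly like the following user-space sketch. The names `struct ns`, `struct sock_like` and `get_cookie()` are hypothetical, and using 0 as the "not assigned yet" marker is an assumption of this example, not something the commit message states.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-namespace state: one monotonically increasing counter. */
struct ns {
	atomic_uint_fast64_t cookie_gen;
};

/* Hypothetical socket: cookie is 0 until somebody asks for it. */
struct sock_like {
	atomic_uint_fast64_t cookie;
};

/* Assign a cookie on first use; concurrent callers agree on one value. */
static uint64_t get_cookie(struct ns *ns, struct sock_like *sk)
{
	uint_fast64_t cur = atomic_load(&sk->cookie);

	while (cur == 0) {
		uint_fast64_t fresh = atomic_fetch_add(&ns->cookie_gen, 1) + 1;

		/* Only the first writer wins; on failure cur holds the winner. */
		if (atomic_compare_exchange_strong(&sk->cookie, &cur, fresh))
			cur = fresh;
	}
	assert(cur != 0);
	return (uint64_t)cur;
}
```

Because the value comes from a counter rather than a memory address, it leaks nothing about kernel addresses and is not repeated when memory is recycled.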
2015-03-11 | net: sysctl_net_core: check SNDBUF and RCVBUF for min length | Alexey Kodanev
sysctl has sysctl.net.core.rmem_*/wmem_* parameters which can be set to incorrect values. Given that 'struct sk_buff' allocates from rcvbuf, incorrectly set buffer length could result to memory allocation failures. For example, set them as follows: # sysctl net.core.rmem_default=64 net.core.wmem_default = 64 # sysctl net.core.wmem_default=64 net.core.wmem_default = 64 # ping localhost -s 1024 -i 0 > /dev/null This could result to the following failure: skbuff: skb_over_panic: text:ffffffff81628db4 len:-32 put:-32 head:ffff88003a1cc200 data:ffff88003a1cc200 tail:0xffffffe0 end:0xc0 dev:<NULL> kernel BUG at net/core/skbuff.c:102! invalid opcode: 0000 [#1] SMP ... task: ffff88003b7f5550 ti: ffff88003ae88000 task.ti: ffff88003ae88000 RIP: 0010:[<ffffffff8155fbd1>] [<ffffffff8155fbd1>] skb_put+0xa1/0xb0 RSP: 0018:ffff88003ae8bc68 EFLAGS: 00010296 RAX: 000000000000008d RBX: 00000000ffffffe0 RCX: 0000000000000000 RDX: ffff88003fdcf598 RSI: ffff88003fdcd9c8 RDI: ffff88003fdcd9c8 RBP: ffff88003ae8bc88 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 00000000000002b2 R12: 0000000000000000 R13: 0000000000000000 R14: ffff88003d3f7300 R15: ffff88000012a900 FS: 00007fa0e2b4a840(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000d0f7e0 CR3: 000000003b8fb000 CR4: 00000000000006f0 Stack: ffff88003a1cc200 00000000ffffffe0 00000000000000c0 ffffffff818cab1d ffff88003ae8bd68 ffffffff81628db4 ffff88003ae8bd48 ffff88003b7f5550 ffff880031a09408 ffff88003b7f5550 ffff88000012aa48 ffff88000012ab00 Call Trace: [<ffffffff81628db4>] unix_stream_sendmsg+0x2c4/0x470 [<ffffffff81556f56>] sock_write_iter+0x146/0x160 [<ffffffff811d9612>] new_sync_write+0x92/0xd0 [<ffffffff811d9cd6>] vfs_write+0xd6/0x180 [<ffffffff811da499>] SyS_write+0x59/0xd0 [<ffffffff81651532>] system_call_fastpath+0x12/0x17 Code: 00 00 48 89 44 24 10 8b 87 c8 00 00 00 48 89 44 24 08 48 8b 87 d8 00 00 00 48 c7 c7 30 db 91 81 48 89 04 24 31 
c0 e8 4f a8 0e 00 <0f> 0b eb fe 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 RIP [<ffffffff8155fbd1>] skb_put+0xa1/0xb0 RSP <ffff88003ae8bc68> Kernel panic - not syncing: Fatal exception Moreover, the possible minimum is 1, so we can get another kernel panic: ... BUG: unable to handle kernel paging request at ffff88013caee5c0 IP: [<ffffffff815604cf>] __alloc_skb+0x12f/0x1f0 ... Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 | ipv4: FIB Local/MAIN table collapse | Alexander Duyck
This patch is meant to collapse local and main into one by converting tb_data from an array to a pointer. Doing this allows us to point the local table into the main while maintaining the same variables in the table. As such the tb_data was converted from an array to a pointer, and a new array called data is added in order to still provide an object for tb_data to point to. In order to track the origin of the fib aliases a tb_id value was added in a hole that existed on 64b systems. Using this we can also reverse the merge in the event that custom FIB rules are enabled. With this patch I am seeing an improvement of 20ns to 30ns for routing lookups as long as custom rules are not enabled, with custom rules enabled we fall back to split tables and the original behavior. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
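The array-to-pointer conversion described above can be pictured with the following simplified user-space sketch. `struct table` and its helpers are hypothetical and far smaller than the real FIB structures, and the split here only repoints the table to its own storage, whereas a real reversal would also have to move entries back.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical table: 'data' is the payload pointer, 'own' local storage. */
struct table {
	int id;
	long *data;	/* was an embedded array; now a pointer */
	long own[1];	/* storage that 'data' points at by default */
};

static struct table *table_new(int id)
{
	struct table *t = calloc(1, sizeof(*t));

	if (t) {
		t->id = id;
		t->data = t->own;
	}
	return t;
}

/* Collapse: make 'local' share the payload of 'main_tb'. */
static void table_merge(struct table *local, struct table *main_tb)
{
	assert(local && main_tb);
	local->data = main_tb->data;
}

/* Reverse the merge, e.g. when custom rules are enabled. */
static void table_split(struct table *local)
{
	assert(local);
	local->data = local->own;
}
```

After the merge, a lookup through either table walks the same payload, so one lookup can replace two.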
2015-03-10 | net: Handle unregister properly when netdev namespace change fails. | David S. Miller
If rtnl_newlink() fails on its call to dev_change_net_namespace(), we have to make use of the ->dellink() method, if present, just like we do when rtnl_configure_link() fails. Fixes: 317f4810e45e ("rtnl: allow to create device with IFLA_LINK_NETNSID set") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-10 | net: add comment for sock_efree() usage | Oliver Hartkopp
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-10 | net: constify sock_diag_check_cookie() | Eric Dumazet
sock_diag_check_cookie() second parameter is constant Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09 | net: core: add of_find_net_device_by_node() | Florian Fainelli
Add a helper function which allows getting the struct net_device pointer associated with a given struct device_node pointer. This is useful for instance for DSA Ethernet devices not backed by a platform_device, but a PCI device. Since we need to access net_class which is not accessible outside of net/core/net-sysfs.c, this helper function is also added here and gated with CONFIG_OF_NET. Network devices initialized with SET_NETDEV_DEV() are also taken into account by checking for dev->parent first and then falling back to checking the device pointer within struct net_device. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-08 | neigh: Use neigh table index for neigh_packet_xmit | Eric W. Biederman
Remove a little bit of unnecessary work when transmitting a packet with neigh_packet_xmit. Use the neighbour table index not the address family as a parameter. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 | net: gro: remove obsolete code from skb_gro_receive() | Eric Dumazet
Some drivers use copybreak to copy tiny frames into smaller skb, and this smaller skb might not have skb->head_frag set for various reasons. skb_gro_receive() currently doesn't allow to aggregate the smaller skb into the previous GRO packet if this GRO packet has at least 2 MSS in it. Following workload easily demonstrates the problem. netperf -t TCP_RR -H target -- -r 3000,3000 (tcpdump shows one GRO packet with 2 MSS, plus one additional packet of 104 bytes that should have been appended.) It turns out that we can remove code from skb_gro_receive(), because commit 8a29111c7ca6 ("net: gro: allow to build full sized skb") and its followups removed the assumption that a GRO packet with a frag_list had to have an empty head. Removing this code allows the aggregation of the last (incomplete) frame in some RPC workloads. Note that tcp_gro_receive() already takes care of forcing a flush if necessary, including this case. If we want to avoid using frag_list in the first place (in forwarding workloads for example, as the outgoing NIC is generally not able to cope with skbs having a frag_list), we need to address this separately. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 | neigh: Add helper function neigh_xmit | Eric W. Biederman
For MPLS I am building the code so that either the neighbour mac address can be specified or we can have a next hop in ipv4 or ipv6. The kind of next hop we have is indicated by the neighbour table pointer. A neighbour table pointer of NULL is a link layer address. A non-NULL neighbour table pointer indicates which neighbour table and thus which address family the next hop address is in that we need to look up. The code either sends a packet directly or looks up the appropriate neighbour table entry and sends the packet. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 | neigh: Factor out ___neigh_lookup_noref | Eric W. Biederman
While looking at the mpls code I found myself writing yet another version of neigh_lookup_noref. We currently have __ipv4_lookup_noref and __ipv6_lookup_noref. So to make my work a little easier and to make it a smidge easier to verify/maintain the mpls code in the future I stopped and wrote ___neigh_lookup_noref. Then I rewrote __ipv4_lookup_noref and __ipv6_lookup_noref in terms of this new function. I tested my new version by verifying that the same code is generated in ip_finish_output2 and ip6_finish_output2 where these functions are inlined. To get to ___neigh_lookup_noref I added a new neighbour cache table function key_eq, so that the static size of the key would be available. I also added __neigh_lookup_noref for people who want to look up a neighbour table entry quickly but don't know which neighbour table they are going to look up. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-03 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller
Conflicts: drivers/net/ethernet/rocker/rocker.c The rocker commit was two overlapping changes, one to rename the ->vport member to ->pport, and another making the bitmask expression use '1ULL' instead of plain '1'. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 | neigh: Don't require a dst in neigh_resolve_output | Eric W. Biederman
Having a dst helps a little bit for teql but is fundamentally unnecessary, and there are code paths where a dst is not available in which it would be nice to use the neighbour cache. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 | neigh: Don't require dst in neigh_hh_init | Eric W. Biederman
Add protocol to neigh_tbl so that dst->ops->protocol is not needed, and acquire the device from neigh->dev. This results in a neigh_hh_init that will cache the same values regardless of the packets flowing through it. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 | neigh: Move neigh_compat_output into ax25_ip.c | Eric W. Biederman
The only caller is now ax25_neigh_construct, so move neigh_compat_output into ax25_ip.c, make it static, and rename it ax25_neigh_output. Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-hams@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 | filter: refactor common filter attach code into __sk_attach_prog | Daniel Borkmann
Both sk_attach_filter() and sk_attach_bpf() are setting up sk_filter, charging skmem and attaching it to the socket after we got the eBPF prog up and ready. Let's refactor that into a common helper. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 | net: Remove iocb argument from sendmsg and recvmsg | Ying Xue
Now that TIPC no longer depends on the iocb argument in its internal implementations of the sendmsg() and recvmsg() hooks defined in the proto structure, no user is using the iocb argument in them at all. We can therefore drop the redundant iocb argument completely from all implementations of both sendmsg() and recvmsg() in the entire networking stack. Cc: Christoph Hellwig <hch@lst.de> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 | net: add common accessor for setting dropcount on packets | Eyal Birger
As part of an effort to move skb->dropcount to skb->cb[], use a common function in order to set dropcount in struct sk_buff. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01 | ebpf: move read-only fields to bpf_prog and shrink bpf_prog_aux | Daniel Borkmann
is_gpl_compatible and prog_type should be moved directly into bpf_prog as they stay immutable during bpf_prog's lifetime, are core attributes and they can be locked as read-only later on via bpf_prog_select_runtime(). With a bit of rearranging, this also allows us to shrink bpf_prog_aux to exactly 1 cacheline. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01 | ebpf: add sched_cls_type and map it to sk_filter's verifier ops | Daniel Borkmann
As discussed recently and at netconf/netdev01, we want to prevent making bpf_verifier_ops registration available for modules, but have them at a controlled place inside the kernel instead. The reason for this is that out-of-tree modules can go crazy and define and register any verifier ops they want, doing all sorts of crap, even bypassing available GPLed eBPF helper functions. We don't want to offer such a shiny playground, of course, but keep strict control to ourselves inside the core kernel. This also encourages us to design eBPF user helpers carefully and generically, so they can be shared among various subsystems using eBPF. For the eBPF traffic classifier (cls_bpf), it's a good start to share the same helper facilities as we currently do in eBPF for socket filters. That way, we have BPF_PROG_TYPE_SCHED_CLS look like its own type, thus one day if there's a good reason to diverge the set of helper functions from the set available to socket filters, we keep ABI compatibility. In future, we could place all bpf_prog_type_list at a central place, perhaps. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01 | ebpf: remove CONFIG_BPF_SYSCALL ifdefs in socket filter code | Daniel Borkmann
This gets rid of CONFIG_BPF_SYSCALL ifdefs in the socket filter code, now that the BPF internal header can deal with it. While going over it, I also changed eBPF related functions to a sk_filter prefix to be more consistent with the rest of the file. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01 | ebpf: constify various function pointer structs | Daniel Borkmann
We can move bpf_map_ops and bpf_verifier_ops and other structs into ro section, bpf_map_type_list and bpf_prog_type_list into read mostly. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01 | net: do not use rcu in rtnl_dump_ifinfo() | Eric Dumazet
We did a failed attempt in the past to only use rcu in rtnl dump operations (commit e67f88dd12f6 "net: dont hold rtnl mutex during netlink dump callbacks") Now that dumps are holding RTNL anyway, there is no need to also use rcu locking, as it forbids any scheduling ability, like GFP_KERNEL allocations that controlling path should use instead of GFP_ATOMIC whenever possible. This should fix following splat Cong Wang reported : [ INFO: suspicious RCU usage. ] 3.19.0+ #805 Tainted: G W include/linux/rcupdate.h:538 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ip/771: #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff8182b8f4>] netlink_dump+0x21/0x26c #1: (rcu_read_lock){......}, at: [<ffffffff817d785b>] rcu_read_lock+0x0/0x6e stack backtrace: CPU: 3 PID: 771 Comm: ip Tainted: G W 3.19.0+ #805 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000001 ffff8800d51e7718 ffffffff81a27457 0000000029e729e6 ffff8800d6108000 ffff8800d51e7748 ffffffff810b539b ffffffff820013dd 00000000000001c8 0000000000000000 ffff8800d7448088 ffff8800d51e7758 Call Trace: [<ffffffff81a27457>] dump_stack+0x4c/0x65 [<ffffffff810b539b>] lockdep_rcu_suspicious+0x107/0x110 [<ffffffff8109796f>] rcu_preempt_sleep_check+0x45/0x47 [<ffffffff8109e457>] ___might_sleep+0x1d/0x1cb [<ffffffff8109e67d>] __might_sleep+0x78/0x80 [<ffffffff814b9b1f>] idr_alloc+0x45/0xd1 [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d [<ffffffff814b9f9d>] ? idr_for_each+0x53/0x101 [<ffffffff817c1383>] alloc_netid+0x61/0x69 [<ffffffff817c14c3>] __peernet2id+0x79/0x8d [<ffffffff817c1ab7>] peernet2id+0x13/0x1f [<ffffffff817d8673>] rtnl_fill_ifinfo+0xa8d/0xc20 [<ffffffff810b17d9>] ? 
__lock_is_held+0x39/0x52 [<ffffffff817d894f>] rtnl_dump_ifinfo+0x149/0x213 [<ffffffff8182b9c2>] netlink_dump+0xef/0x26c [<ffffffff8182bcba>] netlink_recvmsg+0x17b/0x2c5 [<ffffffff817b0adc>] __sock_recvmsg+0x4e/0x59 [<ffffffff817b1b40>] sock_recvmsg+0x3f/0x51 [<ffffffff817b1f9a>] ___sys_recvmsg+0xf6/0x1d9 [<ffffffff8115dc67>] ? handle_pte_fault+0x6e1/0xd3d [<ffffffff8100a3a0>] ? native_sched_clock+0x35/0x37 [<ffffffff8109f45b>] ? sched_clock_local+0x12/0x72 [<ffffffff8109f6ac>] ? sched_clock_cpu+0x9e/0xb7 [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d [<ffffffff811abde8>] ? __fcheck_files+0x4c/0x58 [<ffffffff811ac556>] ? __fget_light+0x2d/0x52 [<ffffffff817b376f>] __sys_recvmsg+0x42/0x60 [<ffffffff817b379f>] SyS_recvmsg+0x12/0x1c Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 0c7aecd4bde4b7302 ("netns: add rtnl cmd to add and get peer netns ids") Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28 | net: Verify permission to link_net in newlink | Eric W. Biederman
When applicable, verify that the caller has permission to the underlying network namespace for a newly created network device. Similar checks exist for the network namespace a network device will be created in. Fixes: 317f4810e45e ("rtnl: allow to create device with IFLA_LINK_NETNSID set") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28 | net: Verify permission to dest_net in newlink | Eric W. Biederman
When applicable, verify that the caller has permission to create a network device in another network namespace. This check is already present when moving a network device between network namespaces in setlink, so all that is needed is to duplicate that check in newlink. This change almost backports cleanly, but there are context conflicts as the code that follows was added in v4.0-rc1. Fixes: b51642f6d77b net: Enable a userns root rtnl calls that are safe for unprivilged users Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-24 | rtnetlink: avoid 0 sized arrays | Sasha Levin
Arrays (when not in a struct) "shall have a value greater than zero". GCC complains when it's not the case here. Fixes: ba7d49b1f0 ("rtnetlink: provide api for getting and setting slave info") Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-22 | net: pktgen: disable xmit_clone on virtual devices | Eric Dumazet
Trying to use burst capability (aka xmit_more) on a virtual device like bonding is not supported. For example, skb might be queued multiple times on a qdisc, with various list corruptions. Fixes: 38b2cf2982dc ("net: pktgen: packet bursting via skb->xmit_more") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-22 | net: Remove state argument from skb_find_text() | Bojan Prtvar
Although it is clear that textsearch state is intentionally passed to skb_find_text() as an uninitialized argument, it was never used by the callers. Therefore, we can simplify skb_find_text() by making it a local variable. Signed-off-by: Bojan Prtvar <prtvar.b@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-21 | ethtool: use "ops" name consistenty in ethtool_set_rxfh() | Dan Carpenter
"dev->ethtool_ops" and "ops" are the same, but we should use "ops" everywhere to be consistent. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-21 | net: reject creation of netdev names with colons | Matthew Thode
Colons are used as a separator in netdev device lookup in dev_ioctl.c. Specific functions are SIOCGIFTXQLEN, SIOCETHTOOL and SIOCSIFNAME. Signed-off-by: Matthew Thode <mthode@mthode.org> Signed-off-by: David S. Miller <davem@davemloft.net>
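A name check of this kind can be sketched in a few lines of user-space C. `name_is_valid()` and `EXAMPLE_NAME_MAX` are hypothetical; only the colon rule comes from the commit message, while the empty-name and length checks are added assumptions to keep the helper well defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define EXAMPLE_NAME_MAX 16	/* hypothetical limit, including the NUL */

/* Reject empty names, over-long names, and names containing a colon. */
static bool name_is_valid(const char *name)
{
	assert(name != NULL);
	if (*name == '\0' || strlen(name) >= EXAMPLE_NAME_MAX)
		return false;
	/* A colon would be ambiguous for lookups that split on ':'. */
	if (strchr(name, ':') != NULL)
		return false;
	return true;
}
```

With this check a name such as `eth0` is accepted, while `eth0:1` is refused at creation time instead of confusing later lookups.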
2015-02-20 | ethtool: Add hw-switch-offload to netdev_features_strings. | Rami Rosen
Commit aafb3e98b279 (netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads) added a new feature without adding it to the netdev_features_strings array; this patch fixes this. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-20 | sock: sock_dequeue_err_skb() needs hard irq safety | Eric Dumazet
Non NAPI drivers can call skb_tstamp_tx() and then sock_queue_err_skb() from hard IRQ context. Therefore, sock_dequeue_err_skb() needs to block hard irq or corruptions or hangs can happen. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 364a9e93243d1 ("sock: deduplicate errqueue dequeue") Fixes: cb820f8e4b7f7 ("net: Provide a generic socket error queue delivery method for Tx time stamps.") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-19 | gen_stats.c: Duplicate xstats buffer for later use | Ignacy Gawędzki
The gnet_stats_copy_app() function gets called, more often than not, with its second argument a pointer to an automatic variable in the caller's stack. Therefore, to avoid copying garbage afterwards when calling gnet_stats_finish_copy(), this data is better copied to dynamically allocated memory that gets freed after use. [xiyou.wangcong@gmail.com: remove a useless kfree()] Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
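The lifetime problem described above, and the copy that avoids it, can be shown with a user-space sketch. `struct dump`, `dump_copy_app()` and `dump_finish()` are hypothetical names and do not mirror the kernel's gnet_stats API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical dump context that must outlive the caller's stack frame. */
struct dump {
	void *xstats;
	size_t xstats_len;
};

/* Keep a private heap copy instead of a pointer to the caller's buffer. */
static int dump_copy_app(struct dump *d, const void *st, size_t len)
{
	assert(d != NULL);
	d->xstats = NULL;
	d->xstats_len = 0;
	if (st == NULL || len == 0)
		return 0;
	d->xstats = malloc(len);
	if (d->xstats == NULL)
		return -1;
	memcpy(d->xstats, st, len);
	d->xstats_len = len;
	return 0;
}

/* Release the copy once the dump has been finished. */
static void dump_finish(struct dump *d)
{
	assert(d != NULL);
	free(d->xstats);
	d->xstats = NULL;
	d->xstats_len = 0;
}
```

Because the context owns its copy, the caller's automatic variable may go out of scope or be overwritten before the dump is finished without affecting the data.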