summaryrefslogtreecommitdiff
path: root/Documentation/core-api
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/core-api')
-rw-r--r--Documentation/core-api/index.rst2
-rw-r--r--Documentation/core-api/ioctl.rst253
-rw-r--r--Documentation/core-api/pin_user_pages.rst232
3 files changed, 487 insertions, 0 deletions
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ab0b9ec85506..a501dc1c90d0 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
generic-radix-tree
memory-allocation
mm-api
+ pin_user_pages
gfp_mask-from-fs-io
timekeeping
boot-time-mm
@@ -40,6 +41,7 @@ Core utilities
gcc-plugins
symbol-namespaces
padata
+ ioctl
Interfaces for kernel debugging
diff --git a/Documentation/core-api/ioctl.rst b/Documentation/core-api/ioctl.rst
new file mode 100644
index 000000000000..c455db0e1627
--- /dev/null
+++ b/Documentation/core-api/ioctl.rst
@@ -0,0 +1,253 @@
+======================
+ioctl based interfaces
+======================
+
+ioctl() is the most common way for applications to interface
+with device drivers. It is flexible and easily extended by adding new
+commands and can be passed through character devices, block devices as
+well as sockets and other special file descriptors.
+
+However, it is also very easy to get ioctl command definitions wrong,
+and hard to fix them later without breaking existing applications,
+so this documentation tries to help developers get it right.
+
+Command number definitions
+==========================
+
+The command number, or request number, is the second argument passed to
+the ioctl system call. While this can be any 32-bit number that uniquely
+identifies an action for a particular driver, there are a number of
+conventions around defining them.
+
+``include/uapi/asm-generic/ioctl.h`` provides four macros for defining
+ioctl commands that follow modern conventions: ``_IO``, ``_IOR``,
+``_IOW``, and ``_IOWR``. These should be used for all new commands,
+with the correct parameters:
+
+_IO/_IOR/_IOW/_IOWR
+ The macro name specifies how the argument will be used.  It may be a
+ pointer to data to be passed into the kernel (_IOW), out of the kernel
+ (_IOR), or both (_IOWR).  _IO can indicate either commands with no
+ argument or those passing an integer value instead of a pointer.
+ It is recommended to only use _IO for commands without arguments,
+ and use pointers for passing data.
+
+type
+ An 8-bit number, often a character literal, specific to a subsystem
+ or driver, and listed in :doc:`../userspace-api/ioctl/ioctl-number`
+
+nr
+ An 8-bit number identifying the specific command, unique for a give
+ value of 'type'
+
+data_type
+ The name of the data type pointed to by the argument, the command number
+ encodes the ``sizeof(data_type)`` value in a 13-bit or 14-bit integer,
+ leading to a limit of 8191 bytes for the maximum size of the argument.
+ Note: do not pass sizeof(data_type) type into _IOR/_IOW/IOWR, as that
+ will lead to encoding sizeof(sizeof(data_type)), i.e. sizeof(size_t).
+ _IO does not have a data_type parameter.
+
+
+Interface versions
+==================
+
+Some subsystems use version numbers in data structures to overload
+commands with different interpretations of the argument.
+
+This is generally a bad idea, since changes to existing commands tend
+to break existing applications.
+
+A better approach is to add a new ioctl command with a new number. The
+old command still needs to be implemented in the kernel for compatibility,
+but this can be a wrapper around the new implementation.
+
+Return code
+===========
+
+ioctl commands can return negative error codes as documented in errno(3);
+these get turned into errno values in user space. On success, the return
+code should be zero. It is also possible but not recommended to return
+a positive 'long' value.
+
+When the ioctl callback is called with an unknown command number, the
+handler returns either -ENOTTY or -ENOIOCTLCMD, which also results in
+-ENOTTY being returned from the system call. Some subsystems return
+-ENOSYS or -EINVAL here for historic reasons, but this is wrong.
+
+Prior to Linux 5.5, compat_ioctl handlers were required to return
+-ENOIOCTLCMD in order to use the fallback conversion into native
+commands. As all subsystems are now responsible for handling compat
+mode themselves, this is no longer needed, but it may be important to
+consider when backporting bug fixes to older kernels.
+
+Timestamps
+==========
+
+Traditionally, timestamps and timeout values are passed as ``struct
+timespec`` or ``struct timeval``, but these are problematic because of
+incompatible definitions of these structures in user space after the
+move to 64-bit time_t.
+
+The ``struct __kernel_timespec`` type can be used instead to be embedded
+in other data structures when separate second/nanosecond values are
+desired, or passed to user space directly. This is still not ideal though,
+as the structure matches neither the kernel's timespec64 nor the user
+space timespec exactly. The get_timespec64() and put_timespec64() helper
+functions can be used to ensure that the layout remains compatible with
+user space and the padding is treated correctly.
+
+As it is cheap to convert seconds to nanoseconds, but the opposite
+requires an expensive 64-bit division, a simple __u64 nanosecond value
+can be simpler and more efficient.
+
+Timeout values and timestamps should ideally use CLOCK_MONOTONIC time,
+as returned by ktime_get_ns() or ktime_get_ts64(). Unlike
+CLOCK_REALTIME, this makes the timestamps immune from jumping backwards
+or forwards due to leap second adjustments and clock_settime() calls.
+
+ktime_get_real_ns() can be used for CLOCK_REALTIME timestamps that
+need to be persistent across a reboot or between multiple machines.
+
+32-bit compat mode
+==================
+
+In order to support 32-bit user space running on a 64-bit machine, each
+subsystem or driver that implements an ioctl callback handler must also
+implement the corresponding compat_ioctl handler.
+
+As long as all the rules for data structures are followed, this is as
+easy as setting the .compat_ioctl pointer to a helper function such as
+compat_ptr_ioctl() or blkdev_compat_ptr_ioctl().
+
+compat_ptr()
+------------
+
+On the s390 architecture, 31-bit user space has ambiguous representations
+for data pointers, with the upper bit being ignored. When running such
+a process in compat mode, the compat_ptr() helper must be used to
+clear the upper bit of a compat_uptr_t and turn it into a valid 64-bit
+pointer. On other architectures, this macro only performs a cast to a
+``void __user *`` pointer.
+
+In an compat_ioctl() callback, the last argument is an unsigned long,
+which can be interpreted as either a pointer or a scalar depending on
+the command. If it is a scalar, then compat_ptr() must not be used, to
+ensure that the 64-bit kernel behaves the same way as a 32-bit kernel
+for arguments with the upper bit set.
+
+The compat_ptr_ioctl() helper can be used in place of a custom
+compat_ioctl file operation for drivers that only take arguments that
+are pointers to compatible data structures.
+
+Structure layout
+----------------
+
+Compatible data structures have the same layout on all architectures,
+avoiding all problematic members:
+
+* ``long`` and ``unsigned long`` are the size of a register, so
+ they can be either 32-bit or 64-bit wide and cannot be used in portable
+ data structures. Fixed-length replacements are ``__s32``, ``__u32``,
+ ``__s64`` and ``__u64``.
+
+* Pointers have the same problem, in addition to requiring the
+ use of compat_ptr(). The best workaround is to use ``__u64``
+ in place of pointers, which requires a cast to ``uintptr_t`` in user
+ space, and the use of u64_to_user_ptr() in the kernel to convert
+ it back into a user pointer.
+
+* On the x86-32 (i386) architecture, the alignment of 64-bit variables
+ is only 32-bit, but they are naturally aligned on most other
+ architectures including x86-64. This means a structure like::
+
+ struct foo {
+ __u32 a;
+ __u64 b;
+ __u32 c;
+ };
+
+ has four bytes of padding between a and b on x86-64, plus another four
+ bytes of padding at the end, but no padding on i386, and it needs a
+ compat_ioctl conversion handler to translate between the two formats.
+
+ To avoid this problem, all structures should have their members
+ naturally aligned, or explicit reserved fields added in place of the
+ implicit padding. The ``pahole`` tool can be used for checking the
+ alignment.
+
+* On ARM OABI user space, structures are padded to multiples of 32-bit,
+ making some structs incompatible with modern EABI kernels if they
+ do not end on a 32-bit boundary.
+
+* On the m68k architecture, struct members are not guaranteed to have an
+ alignment greater than 16-bit, which is a problem when relying on
+ implicit padding.
+
+* Bitfields and enums generally work as one would expect them to,
+ but some properties of them are implementation-defined, so it is better
+ to avoid them completely in ioctl interfaces.
+
+* ``char`` members can be either signed or unsigned, depending on
+ the architecture, so the __u8 and __s8 types should be used for 8-bit
+ integer values, though char arrays are clearer for fixed-length strings.
+
+Information leaks
+=================
+
+Uninitialized data must not be copied back to user space, as this can
+cause an information leak, which can be used to defeat kernel address
+space layout randomization (KASLR), helping in an attack.
+
+For this reason (and for compat support) it is best to avoid any
+implicit padding in data structures.  Where there is implicit padding
+in an existing structure, kernel drivers must be careful to fully
+initialize an instance of the structure before copying it to user
+space.  This is usually done by calling memset() before assigning to
+individual members.
+
+Subsystem abstractions
+======================
+
+While some device drivers implement their own ioctl function, most
+subsystems implement the same command for multiple drivers. Ideally the
+subsystem has an .ioctl() handler that copies the arguments from and
+to user space, passing them into subsystem specific callback functions
+through normal kernel pointers.
+
+This helps in various ways:
+
+* Applications written for one driver are more likely to work for
+ another one in the same subsystem if there are no subtle differences
+ in the user space ABI.
+
+* The complexity of user space access and data structure layout is done
+ in one place, reducing the potential for implementation bugs.
+
+* It is more likely to be reviewed by experienced developers
+ that can spot problems in the interface when the ioctl is shared
+ between multiple drivers than when it is only used in a single driver.
+
+Alternatives to ioctl
+=====================
+
+There are many cases in which ioctl is not the best solution for a
+problem. Alternatives include:
+
+* System calls are a better choice for a system-wide feature that
+ is not tied to a physical device or constrained by the file system
+ permissions of a character device node
+
+* netlink is the preferred way of configuring any network related
+ objects through sockets.
+
+* debugfs is used for ad-hoc interfaces for debugging functionality
+ that does not need to be exposed as a stable interface to applications.
+
+* sysfs is a good way to expose the state of an in-kernel object
+ that is not tied to a file descriptor.
+
+* configfs can be used for more complex configuration than sysfs
+
+* A custom file system can provide extra flexibility with a simple
+ user interface but adds a lot of complexity to the implementation.
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
new file mode 100644
index 000000000000..1d490155ecd7
--- /dev/null
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -0,0 +1,232 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================
+pin_user_pages() and related calls
+====================================================
+
+.. contents:: :local:
+
+Overview
+========
+
+This document describes the following functions::
+
+ pin_user_pages()
+ pin_user_pages_fast()
+ pin_user_pages_remote()
+
+Basic description of FOLL_PIN
+=============================
+
+FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
+("gup") family of functions. FOLL_PIN has significant interactions and
+interdependencies with FOLL_LONGTERM, so both are covered here.
+
+FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
+sites. This allows the associated wrapper functions (pin_user_pages*() and
+others) to set the correct combination of these flags, and to check for problems
+as well.
+
+FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
+This is in order to avoid creating a large number of wrapper functions to cover
+all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
+pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
+that's a natural dividing line, and a good point to make separate wrapper calls.
+In other words, use pin_user_pages*() for DMA-pinned pages, and
+get_user_pages*() for other cases. There are four cases described later on in
+this document, to further clarify that concept.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===================================
+
+For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
+flags the caller provides. The caller is required to pass in a non-null struct
+pages* array, and the function then pin pages by incrementing each by a special
+value. For now, that value is +1, just like get_user_pages*().::
+
+ Function
+ --------
+ pin_user_pages FOLL_PIN is always set internally by this function.
+ pin_user_pages_fast FOLL_PIN is always set internally by this function.
+ pin_user_pages_remote FOLL_PIN is always set internally by this function.
+
+For these get_user_pages*() functions, FOLL_GET might not even be specified.
+Behavior is a little more complex than above. If FOLL_GET was *not* specified,
+but the caller passed in a non-null struct pages* array, then the function
+sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
+of each page by +1.::
+
+ Function
+ --------
+ get_user_pages FOLL_GET is sometimes set internally by this function.
+ get_user_pages_fast FOLL_GET is sometimes set internally by this function.
+ get_user_pages_remote FOLL_GET is sometimes set internally by this function.
+
+Tracking dma-pinned pages
+=========================
+
+Some of the key design constraints, and solutions, for tracking dma-pinned
+pages:
+
+* An actual reference count, per struct page, is required. This is because
+ multiple processes may pin and unpin a page.
+
+* False positives (reporting that a page is dma-pinned, when in fact it is not)
+ are acceptable, but false negatives are not.
+
+* struct page may not be increased in size for this, and all fields are already
+ used.
+
+* Given the above, we can overload the page->_refcount field by using, sort of,
+ the upper bits in that field for a dma-pinned count. "Sort of", means that,
+ rather than dividing page->_refcount into bit fields, we simple add a medium-
+ large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
+ page->_refcount. This provides fuzzy behavior: if a page has get_page() called
+ on it 1024 times, then it will appear to have a single dma-pinned count.
+ And again, that's acceptable.
+
+This also leads to limitations: there are only 31-10==21 bits available for a
+counter that increments 10 bits at a time.
+
+TODO: for 1GB and larger huge pages, this is cutting it close. That's because
+when pin_user_pages() follows such pages, it increments the head page by "1"
+(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
+pin_user_pages()) for each tail page. So if you have a 1GB huge page:
+
+* There are 256K (18 bits) worth of 4 KB tail pages.
+* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
+ 10 bits at a time)
+* There are 21 - 18 == 3 bits available to count. Except that there aren't,
+ because you need to allow for a few normal get_page() calls on the head page,
+ as well. Fortunately, the approach of using addition, rather than "hard"
+ bitfields, within page->_refcount, allows for sharing these bits gracefully.
+ But we're still looking at about 8 references.
+
+This, however, is a missing feature more than anything else, because it's easily
+solved by addressing an obvious inefficiency in the original get_user_pages()
+approach of retrieving pages: stop treating all the pages as if they were
+PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
+this, so some work is required. Once that's in place, this limitation mostly
+disappears from view, because there will be ample refcounting range available.
+
+* Callers must specifically request "dma-pinned tracking of pages". In other
+ words, just calling get_user_pages() will not suffice; a new set of functions,
+ pin_user_page() and related, must be used.
+
+FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
+==========================================================
+
+Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
+these categories:
+
+CASE 1: Direct IO (DIO)
+-----------------------
+There are GUP references to pages that are serving
+as DIO buffers. These buffers are needed for a relatively short time (so they
+are not "long term"). No special synchronization with page_mkclean() or
+munmap() is provided. Therefore, flags to set at the call site are: ::
+
+ FOLL_PIN
+
+...but rather than setting FOLL_PIN directly, call sites should use one of
+the pin_user_pages*() routines that set FOLL_PIN.
+
+CASE 2: RDMA
+------------
+There are GUP references to pages that are serving as DMA
+buffers. These buffers are needed for a long time ("long term"). No special
+synchronization with page_mkclean() or munmap() is provided. Therefore, flags
+to set at the call site are: ::
+
+ FOLL_PIN | FOLL_LONGTERM
+
+NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
+because DAX pages do not have a separate page cache, and so "pinning" implies
+locking down file system blocks, which is not (yet) supported in that way.
+
+CASE 3: Hardware with page faulting support
+-------------------------------------------
+Here, a well-written driver doesn't normally need to pin pages at all. However,
+if the driver does choose to do so, it can register MMU notifiers for the range,
+and will be called back upon invalidation. Either way (avoiding page pinning, or
+using MMU notifiers to unpin upon request), there is proper synchronization with
+both filesystem and mm (page_mkclean(), munmap(), etc).
+
+Therefore, neither flag needs to be set.
+
+In this case, ideally, neither get_user_pages() nor pin_user_pages() should be
+called. Instead, the software should be written so that it does not pin pages.
+This allows mm and filesystems to operate more efficiently and reliably.
+
+CASE 4: Pinning for struct page manipulation only
+-------------------------------------------------
+Here, normal GUP calls are sufficient, so neither flag needs to be set.
+
+page_dma_pinned(): the whole point of pinning
+=============================================
+
+The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
+to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
+(and file system writeback code in general) to make informed decisions about
+what to do when a page cannot be unmapped due to such pins.
+
+What to do in those cases is the subject of a years-long series of discussions
+and debates (see the References at the end of this document). It's a TODO item
+here: fill in the details once that's worked out. Meanwhile, it's safe to say
+that having this available: ::
+
+ static inline bool page_dma_pinned(struct page *page)
+
+...is a prerequisite to solving the long-running gup+DMA problem.
+
+Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
+===================================================================
+
+Another way of thinking about these flags is as a progression of restrictions:
+FOLL_GET is for struct page manipulation, without affecting the data that the
+struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
+short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
+a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
+restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
+will be pinned longterm, and whose data will be accessed.
+
+Unit testing
+============
+This file::
+
+ tools/testing/selftests/vm/gup_benchmark.c
+
+has the following new calls to exercise the new pin*() wrapper functions:
+
+* PIN_FAST_BENCHMARK (./gup_benchmark -a)
+* PIN_BENCHMARK (./gup_benchmark -b)
+
+You can monitor how many total dma-pinned pages have been acquired and released
+since the system was booted, via two new /proc/vmstat entries: ::
+
+ /proc/vmstat/nr_foll_pin_requested
+ /proc/vmstat/nr_foll_pin_requested
+
+Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
+because there is a noticeable performance drop in unpin_user_page(), when they
+are activated.
+
+References
+==========
+
+* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
+* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
+* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
+
+John Hubbard, October, 2019