diff options
Diffstat (limited to 'Documentation')
180 files changed, 6162 insertions, 4185 deletions
diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index df8413cf1468..484fc04bcc25 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -54,7 +54,7 @@ Date: October 2002 Contact: Linux Memory Management list <linux-mm@kvack.org> Description: Provides information about the node's distribution and memory - utilization. Similar to /proc/meminfo, see Documentation/filesystems/proc.txt + utilization. Similar to /proc/meminfo, see Documentation/filesystems/proc.rst What: /sys/devices/system/node/nodeX/numastat Date: October 2002 diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup index 274df44d8b1b..046978193368 100644 --- a/Documentation/ABI/testing/procfs-smaps_rollup +++ b/Documentation/ABI/testing/procfs-smaps_rollup @@ -11,7 +11,7 @@ Description: Additionally, the fields Pss_Anon, Pss_File and Pss_Shmem are not present in /proc/pid/smaps. These fields represent the sum of the Pss field of each type (anon, file, shmem). - For more details, see Documentation/filesystems/proc.txt + For more details, see Documentation/filesystems/proc.rst and the procfs man page. Typical output looks like this: diff --git a/Documentation/Makefile b/Documentation/Makefile index cc786d11a028..db1fc35ded50 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -98,7 +98,11 @@ else # HAVE_PDFLATEX pdfdocs: latexdocs @$(srctree)/scripts/sphinx-pre-install --version-check - $(foreach var,$(SPHINXDIRS), $(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit;) + $(foreach var,$(SPHINXDIRS), \ + $(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit; \ + mkdir -p $(BUILDDIR)/$(var)/pdf; \ + mv $(subst .tex,.pdf,$(wildcard $(BUILDDIR)/$(var)/latex/*.tex)) $(BUILDDIR)/$(var)/pdf/; \ + ) endif # HAVE_PDFLATEX diff --git a/Documentation/PCI/boot-interrupts.rst b/Documentation/PCI/boot-interrupts.rst index d078ef3eb192..2ec70121bfca 100644 --- a/Documentation/PCI/boot-interrupts.rst +++ b/Documentation/PCI/boot-interrupts.rst @@ -32,12 +32,13 @@ interrupt goes unhandled over time, they are tracked by the Linux kernel as Spurious Interrupts. The IRQ will be disabled by the Linux kernel after it reaches a specific count with the error "nobody cared". This disabled IRQ now prevents valid usage by an existing interrupt which may happen to share -the IRQ line. +the IRQ line:: irq 19: nobody cared (try booting with the "irqpoll" option) CPU: 0 PID: 2988 Comm: irq/34-nipalk Tainted: 4.14.87-rt49-02410-g4a640ec-dirty #1 Hardware name: National Instruments NI PXIe-8880/NI PXIe-8880, BIOS 2.1.5f1 01/09/2020 Call Trace: + <IRQ> ? dump_stack+0x46/0x5e ? __report_bad_irq+0x2e/0xb0 @@ -85,15 +86,18 @@ Mitigations The mitigations take the form of PCI quirks. The preference has been to first identify and make use of a means to disable the routing to the PCH. In such a case a quirk to disable boot interrupt generation can be -added.[1] +added. [1]_ - Intel® 6300ESB I/O Controller Hub +Intel® 6300ESB I/O Controller Hub Alternate Base Address Register: BIE: Boot Interrupt Enable - 0 = Boot interrupt is enabled. - 1 = Boot interrupt is disabled. - Intel® Sandy Bridge through Sky Lake based Xeon servers: + == =========================== + 0 Boot interrupt is enabled. + 1 Boot interrupt is disabled. + == =========================== + +Intel® Sandy Bridge through Sky Lake based Xeon servers: Coherent Interface Protocol Interrupt Control dis_intx_route2pch/dis_intx_route2ich/dis_intx_route2dmi2: When this bit is set. Local INTx messages received from the @@ -109,12 +113,12 @@ line by default. Therefore, on chipsets where this INTx routing cannot be disabled, the Linux kernel will reroute the valid interrupt to its legacy interrupt. This redirection of the handler will prevent the occurrence of the spurious interrupt detection which would ordinarily disable the IRQ -line due to excessive unhandled counts.[2] +line due to excessive unhandled counts. [2]_ The config option X86_REROUTE_FOR_BROKEN_BOOT_IRQS exists to enable (or disable) the redirection of the interrupt handler to the PCH interrupt line. The option can be overridden by either pci=ioapicreroute or -pci=noioapicreroute.[3] +pci=noioapicreroute. [3]_ More Documentation @@ -127,19 +131,19 @@ into the evolution of its handling with chipsets. Example of disabling of the boot interrupt ------------------------------------------ -Intel® 6300ESB I/O Controller Hub (Document # 300641-004US) + - Intel® 6300ESB I/O Controller Hub (Document # 300641-004US) 5.7.3 Boot Interrupt https://www.intel.com/content/dam/doc/datasheet/6300esb-io-controller-hub-datasheet.pdf -Intel® Xeon® Processor E5-1600/2400/2600/4600 v3 Product Families -Datasheet - Volume 2: Registers (Document # 330784-003) + - Intel® Xeon® Processor E5-1600/2400/2600/4600 v3 Product Families + Datasheet - Volume 2: Registers (Document # 330784-003) 6.6.41 cipintrc Coherent Interface Protocol Interrupt Control https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf Example of handler rerouting ---------------------------- -Intel® 6700PXH 64-bit PCI Hub (Document # 302628) + - Intel® 6700PXH 64-bit PCI Hub (Document # 302628) 2.15.2 PCI Express Legacy INTx Support and Boot Interrupt https://www.intel.com/content/dam/doc/datasheet/6700pxh-64-bit-pci-hub-datasheet.pdf @@ -150,6 +154,6 @@ Cheers, Sean V Kelley sean.v.kelley@linux.intel.com -[1] https://lore.kernel.org/r/12131949181903-git-send-email-sassmann@suse.de/ -[2] https://lore.kernel.org/r/12131949182094-git-send-email-sassmann@suse.de/ -[3] https://lore.kernel.org/r/487C8EA7.6020205@suse.de/ +.. [1] https://lore.kernel.org/r/12131949181903-git-send-email-sassmann@suse.de/ +.. [2] https://lore.kernel.org/r/12131949182094-git-send-email-sassmann@suse.de/ +.. [3] https://lore.kernel.org/r/487C8EA7.6020205@suse.de/ diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst index fd5e2cbc4935..75b8ca007a11 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.rst +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -1943,56 +1943,27 @@ invoked from a CPU-hotplug notifier. Scheduler and RCU ~~~~~~~~~~~~~~~~~ -RCU depends on the scheduler, and the scheduler uses RCU to protect some -of its data structures. The preemptible-RCU ``rcu_read_unlock()`` -implementation must therefore be written carefully to avoid deadlocks -involving the scheduler's runqueue and priority-inheritance locks. In -particular, ``rcu_read_unlock()`` must tolerate an interrupt where the -interrupt handler invokes both ``rcu_read_lock()`` and -``rcu_read_unlock()``. This possibility requires ``rcu_read_unlock()`` -to use negative nesting levels to avoid destructive recursion via -interrupt handler's use of RCU. - -This scheduler-RCU requirement came as a `complete -surprise <https://lwn.net/Articles/453002/>`__. - -As noted above, RCU makes use of kthreads, and it is necessary to avoid -excessive CPU-time accumulation by these kthreads. This requirement was -no surprise, but RCU's violation of it when running context-switch-heavy -workloads when built with ``CONFIG_NO_HZ_FULL=y`` `did come as a -surprise +RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time +accumulation by these kthreads. This requirement was no surprise, but +RCU's violation of it when running context-switch-heavy workloads when +built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise [PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__. RCU has made good progress towards meeting this requirement, even for context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is room for further improvement. -It is forbidden to hold any of scheduler's runqueue or -priority-inheritance spinlocks across an ``rcu_read_unlock()`` unless -interrupts have been disabled across the entire RCU read-side critical -section, that is, up to and including the matching ``rcu_read_lock()``. -Violating this restriction can result in deadlocks involving these -scheduler spinlocks. There was hope that this restriction might be -lifted when interrupt-disabled calls to ``rcu_read_unlock()`` started -deferring the reporting of the resulting RCU-preempt quiescent state -until the end of the corresponding interrupts-disabled region. -Unfortunately, timely reporting of the corresponding quiescent state to -expedited grace periods requires a call to ``raise_softirq()``, which -can acquire these scheduler spinlocks. In addition, real-time systems -using RCU priority boosting need this restriction to remain in effect -because deferred quiescent-state reporting would also defer deboosting, -which in turn would degrade real-time latencies. - -In theory, if a given RCU read-side critical section could be guaranteed -to be less than one second in duration, holding a scheduler spinlock -across that critical section's ``rcu_read_unlock()`` would require only -that preemption be disabled across the entire RCU read-side critical -section, not interrupts. Unfortunately, given the possibility of vCPU -preemption, long-running interrupts, and so on, it is not possible in -practice to guarantee that a given RCU read-side critical section will -complete in less than one second. Therefore, as noted above, if -scheduler spinlocks are held across a given call to -``rcu_read_unlock()``, interrupts must be disabled across the entire RCU -read-side critical section. +There is no longer any prohibition against holding any of +scheduler's runqueue or priority-inheritance spinlocks across an +``rcu_read_unlock()``, even if interrupts and preemption were enabled +somewhere within the corresponding RCU read-side critical section. +Therefore, it is now perfectly legal to execute ``rcu_read_lock()`` +with preemption enabled, acquire one of the scheduler locks, and hold +that lock across the matching ``rcu_read_unlock()``. + +Similarly, the RCU flavor consolidation has removed the need for negative +nesting. The fact that interrupt-disabled regions of code act as RCU +read-side critical sections implicitly avoids earlier issues that used +to result in destructive recursion via interrupt handler's use of RCU. Tracing and RCU ~~~~~~~~~~~~~~~ diff --git a/Documentation/admin-guide/acpi/ssdt-overlays.rst b/Documentation/admin-guide/acpi/ssdt-overlays.rst index da37455f96c9..5d7e25988085 100644 --- a/Documentation/admin-guide/acpi/ssdt-overlays.rst +++ b/Documentation/admin-guide/acpi/ssdt-overlays.rst @@ -63,7 +63,7 @@ which can then be compiled to AML binary format:: ASL Input: minnomax.asl - 30 lines, 614 bytes, 7 keywords AML Output: minnowmax.aml - 165 bytes, 6 named objects, 1 executable opcodes -[1] http://wiki.minnowboard.org/MinnowBoard_MAX#Low_Speed_Expansion_Connector_.28Top.29 +[1] https://www.elinux.org/Minnowboard:MinnowMax#Low_Speed_Expansion_.28Top.29 The resulting AML code can then be loaded by the kernel using one of the methods below. diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index 44b8a4edd348..f7c80f4649fc 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -49,15 +49,19 @@ the issue, it may also contain the word **Oops**, as on this one:: Despite being an **Oops** or some other sort of stack trace, the offended line is usually required to identify and handle the bug. Along this chapter, -we'll refer to "Oops" for all kinds of stack traces that need to be analized. +we'll refer to "Oops" for all kinds of stack traces that need to be analyzed. -.. note:: +If the kernel is compiled with ``CONFIG_DEBUG_INFO``, you can enhance the +quality of the stack trace by using file:`scripts/decode_stacktrace.sh`. + +Modules linked in +----------------- + +Modules that are tainted or are being loaded or unloaded are marked with +"(...)", where the taint flags are described in +file:`Documentation/admin-guide/tainted-kernels.rst`, "being loaded" is +annotated with "+", and "being unloaded" is annotated with "-". - ``ksymoops`` is useless on 2.6 or upper. Please use the Oops in its original - format (from ``dmesg``, etc). Ignore any references in this or other docs to - "decoding the Oops" or "running it through ksymoops". - If you post an Oops from 2.6+ that has been run through ``ksymoops``, - people will just tell you to repost it. Where is the Oops message is located? ------------------------------------- @@ -71,7 +75,7 @@ by running ``journalctl`` command. Sometimes ``klogd`` dies, in which case you can run ``dmesg > file`` to read the data from the kernel buffers and save it. Or you can ``cat /proc/kmsg > file``, however you have to break in to stop the transfer, -``kmsg`` is a "never ending file". +since ``kmsg`` is a "never ending file". If the machine has crashed so badly that you cannot enter commands or the disk is not available then you have three options: @@ -81,9 +85,9 @@ the disk is not available then you have three options: planned for a crash. Alternatively, you can take a picture of the screen with a digital camera - not nice, but better than nothing. If the messages scroll off the top of the console, you - may find that booting with a higher resolution (eg, ``vga=791``) + may find that booting with a higher resolution (e.g., ``vga=791``) will allow you to read more of the text. (Caveat: This needs ``vesafb``, - so won't help for 'early' oopses) + so won't help for 'early' oopses.) (2) Boot with a serial console (see :ref:`Documentation/admin-guide/serial-console.rst <serial_console>`), @@ -104,7 +108,7 @@ Kernel source file. There are two methods for doing that. Usually, using gdb ^^^ -The GNU debug (``gdb``) is the best way to figure out the exact file and line +The GNU debugger (``gdb``) is the best way to figure out the exact file and line number of the OOPS from the ``vmlinux`` file. The usage of gdb works best on a kernel compiled with ``CONFIG_DEBUG_INFO``. @@ -165,7 +169,7 @@ If you have a call trace, such as:: [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee ... -this shows the problem likely in the :jbd: module. You can load that module +this shows the problem likely is in the :jbd: module. You can load that module in gdb and list the relevant code:: $ gdb fs/jbd/jbd.ko @@ -199,8 +203,9 @@ in the kernel hacking menu of the menu configuration.) For example:: You need to be at the top level of the kernel tree for this to pick up your C files. -If you don't have access to the code you can also debug on some crash dumps -e.g. crash dump output as shown by Dave Miller:: +If you don't have access to the source code you can still debug some crash +dumps using the following method (example crash dump output as shown by +Dave Miller):: EIP is at +0x14/0x4c0 ... @@ -230,6 +235,9 @@ e.g. crash dump output as shown by Dave Miller:: mov 0x8(%ebp), %ebx ! %ebx = skb->sk mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt +file:`scripts/decodecode` can be used to automate most of this, depending +on what CPU architecture is being debugged. + Reporting the bug ----------------- @@ -241,7 +249,7 @@ used for the development of the affected code. This can be done by using the ``get_maintainer.pl`` script. For example, if you find a bug at the gspca's sonixj.c file, you can get -their maintainers with:: +its maintainers with:: $ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) @@ -253,16 +261,17 @@ their maintainers with:: Please notice that it will point to: -- The last developers that touched on the source code. On the above example, - Tejun and Bhaktipriya (in this specific case, none really envolved on the - development of this file); +- The last developers that touched the source code (if this is done inside + a git tree). On the above example, Tejun and Bhaktipriya (in this + specific case, none really envolved on the development of this file); - The driver maintainer (Hans Verkuil); - The subsystem maintainer (Mauro Carvalho Chehab); - The driver and/or subsystem mailing list (linux-media@vger.kernel.org); - the Linux Kernel mailing list (linux-kernel@vger.kernel.org). Usually, the fastest way to have your bug fixed is to report it to mailing -list used for the development of the code (linux-media ML) copying the driver maintainer (Hans). +list used for the development of the code (linux-media ML) copying the +driver maintainer (Hans). If you are totally stumped as to whom to send the report, and ``get_maintainer.pl`` didn't provide you anything useful, send it to @@ -303,9 +312,9 @@ protection fault message can be simply cut out of the message files and forwarded to the kernel developers. Two types of address resolution are performed by ``klogd``. The first is -static translation and the second is dynamic translation. Static -translation uses the System.map file in much the same manner that -ksymoops does. In order to do static translation the ``klogd`` daemon +static translation and the second is dynamic translation. +Static translation uses the System.map file. +In order to do static translation the ``klogd`` daemon must be able to find a system map file at daemon initialization time. See the klogd man page for information on how ``klogd`` searches for map files. diff --git a/Documentation/admin-guide/cpu-load.rst b/Documentation/admin-guide/cpu-load.rst index 2d01ce43d2a2..ebdecf864080 100644 --- a/Documentation/admin-guide/cpu-load.rst +++ b/Documentation/admin-guide/cpu-load.rst @@ -105,7 +105,7 @@ References ---------- - http://lkml.org/lkml/2007/2/12/6 -- Documentation/filesystems/proc.txt (1.8) +- Documentation/filesystems/proc.rst (1.8) Thanks diff --git a/Documentation/admin-guide/hw-vuln/l1tf.rst b/Documentation/admin-guide/hw-vuln/l1tf.rst index f83212fae4d5..3eeeb488d955 100644 --- a/Documentation/admin-guide/hw-vuln/l1tf.rst +++ b/Documentation/admin-guide/hw-vuln/l1tf.rst @@ -268,7 +268,7 @@ Guest mitigation mechanisms /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is available at: - https://www.kernel.org/doc/Documentation/IRQ-affinity.txt + https://www.kernel.org/doc/Documentation/core-api/irq/irq-affinity.rst .. _smt_control: diff --git a/Documentation/admin-guide/init.rst b/Documentation/admin-guide/init.rst index e89d97f31eaf..41f06a09152e 100644 --- a/Documentation/admin-guide/init.rst +++ b/Documentation/admin-guide/init.rst @@ -1,52 +1,48 @@ -Explaining the dreaded "No init found." boot hang message +Explaining the "No working init found." boot hang message ========================================================= +:Authors: Andreas Mohr <andi at lisas period de> + Cristian Souza <cristianmsbr at gmail period com> -OK, so you've got this pretty unintuitive message (currently located -in init/main.c) and are wondering what the H*** went wrong. -Some high-level reasons for failure (listed roughly in order of execution) -to load the init binary are: - -A) Unable to mount root FS -B) init binary doesn't exist on rootfs -C) broken console device -D) binary exists but dependencies not available -E) binary cannot be loaded - -Detailed explanations: - -A) Set "debug" kernel parameter (in bootloader config file or CONFIG_CMDLINE) - to get more detailed kernel messages. -B) make sure you have the correct root FS type - (and ``root=`` kernel parameter points to the correct partition), - required drivers such as storage hardware (such as SCSI or USB!) - and filesystem (ext3, jffs2 etc.) are builtin (alternatively as modules, - to be pre-loaded by an initrd) -C) Possibly a conflict in ``console= setup`` --> initial console unavailable. - E.g. some serial consoles are unreliable due to serial IRQ issues (e.g. - missing interrupt-based configuration). +This document provides some high-level reasons for failure +(listed roughly in order of execution) to load the init binary. + +1) **Unable to mount root FS**: Set "debug" kernel parameter (in bootloader + config file or CONFIG_CMDLINE) to get more detailed kernel messages. + +2) **init binary doesn't exist on rootfs**: Make sure you have the correct + root FS type (and ``root=`` kernel parameter points to the correct + partition), required drivers such as storage hardware (such as SCSI or + USB!) and filesystem (ext3, jffs2, etc.) are builtin (alternatively as + modules, to be pre-loaded by an initrd). + +3) **Broken console device**: Possibly a conflict in ``console= setup`` + --> initial console unavailable. E.g. some serial consoles are unreliable + due to serial IRQ issues (e.g. missing interrupt-based configuration). Try using a different ``console= device`` or e.g. ``netconsole=``. -D) e.g. required library dependencies of the init binary such as - ``/lib/ld-linux.so.2`` missing or broken. Use - ``readelf -d <INIT>|grep NEEDED`` to find out which libraries are required. -E) make sure the binary's architecture matches your hardware. - E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM hardware. - In case you tried loading a non-binary file here (shell script?), - you should make sure that the script specifies an interpreter in its shebang - header line (``#!/...``) that is fully working (including its library - dependencies). And before tackling scripts, better first test a simple - non-script binary such as ``/bin/sh`` and confirm its successful execution. - To find out more, add code ``to init/main.c`` to display kernel_execve()s - return values. + +4) **Binary exists but dependencies not available**: E.g. required library + dependencies of the init binary such as ``/lib/ld-linux.so.2`` missing or + broken. Use ``readelf -d <INIT>|grep NEEDED`` to find out which libraries + are required. + +5) **Binary cannot be loaded**: Make sure the binary's architecture matches + your hardware. E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM + hardware. In case you tried loading a non-binary file here (shell script?), + you should make sure that the script specifies an interpreter in its + shebang header line (``#!/...``) that is fully working (including its + library dependencies). And before tackling scripts, better first test a + simple non-script binary such as ``/bin/sh`` and confirm its successful + execution. To find out more, add code ``to init/main.c`` to display + kernel_execve()s return values. Please extend this explanation whenever you find new failure causes (after all loading the init binary is a CRITICAL and hard transition step -which needs to be made as painless as possible), then submit patch to LKML. +which needs to be made as painless as possible), then submit a patch to LKML. Further TODOs: - Implement the various ``run_init_process()`` invocations via a struct array which can then store the ``kernel_execve()`` result value and on failure log it all by iterating over **all** results (very important usability fix). -- try to make the implementation itself more helpful in general, - e.g. by providing additional error messages at affected places. +- Try to make the implementation itself more helpful in general, e.g. by + providing additional error messages at affected places. -Andreas Mohr <andi at lisas period de> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 007a6b86e0ee..e4ee8b2db604 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -393,6 +393,12 @@ KERNELOFFSET The kernel randomization offset. Used to compute the page offset. If KASLR is disabled, this value is zero. +KERNELPACMASK +------------- + +The mask to extract the Pointer Authentication Code from a kernel virtual +address. + arm === diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 7bc83f3d9bdf..4379c6ac3265 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1748,6 +1748,13 @@ initrd= [BOOT] Specify the location of the initial ramdisk + initrdmem= [KNL] Specify a physical address and size from which to + load the initrd. If an initrd is compiled in or + specified in the bootparams, it takes priority over this + setting. + Format: ss[KMG],nn[KMG] + Default is 0, 0 + init_on_alloc= [MM] Fill newly allocated pages and heap objects with zeroes. Format: 0 | 1 @@ -3329,7 +3336,7 @@ See Documentation/admin-guide/sysctl/vm.rst for details. ohci1394_dma=early [HW] enable debugging via the ohci1394 driver. - See Documentation/debugging-via-ohci1394.txt for more + See Documentation/core-api/debugging-via-ohci1394.rst for more info. olpc_ec_timeout= [OLPC] ms delay when issuing EC commands @@ -4210,12 +4217,24 @@ Duration of CPU stall (s) to test RCU CPU stall warnings, zero to disable. + rcutorture.stall_cpu_block= [KNL] + Sleep while stalling if set. This will result + in warnings from preemptible RCU in addition + to any other stall-related activity. + rcutorture.stall_cpu_holdoff= [KNL] Time to wait (s) after boot before inducing stall. rcutorture.stall_cpu_irqsoff= [KNL] Disable interrupts while stalling if set. + rcutorture.stall_gp_kthread= [KNL] + Duration (s) of forced sleep within RCU + grace-period kthread to test RCU CPU stall + warnings, zero to disable. If both stall_cpu + and stall_gp_kthread are specified, the + kthread is starved first, then the CPU. + rcutorture.stat_interval= [KNL] Time (s) between statistics printk()s. @@ -4286,6 +4305,13 @@ only normal grace-period primitives. No effect on CONFIG_TINY_RCU kernels. + rcupdate.rcu_task_ipi_delay= [KNL] + Set time in jiffies during which RCU tasks will + avoid sending IPIs, starting with the beginning + of a given grace period. Setting a large + number avoids disturbing real-time workloads, + but lengthens grace periods. + rcupdate.rcu_task_stall_timeout= [KNL] Set timeout in jiffies for RCU task stall warning messages. Disable with a value less than or equal diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index 21818aca4708..dc36aeb65d0a 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -10,7 +10,7 @@ them to a "housekeeping" CPU dedicated to such work. References ========== -- Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs. +- Documentation/core-api/irq/irq-affinity.rst: Binding interrupts to sets of CPUs. - Documentation/admin-guide/cgroup-v1: Using cgroups to bind tasks to sets of CPUs. diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index c30176e67900..0bf49d7313ad 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -12,107 +12,107 @@ and more generally they allow userland to take control of various memory page faults, something otherwise only the kernel code could do. For example userfaults allows a proper and more optimal implementation -of the PROT_NONE+SIGSEGV trick. +of the ``PROT_NONE+SIGSEGV`` trick. Design ====== -Userfaults are delivered and resolved through the userfaultfd syscall. +Userfaults are delivered and resolved through the ``userfaultfd`` syscall. -The userfaultfd (aside from registering and unregistering virtual +The ``userfaultfd`` (aside from registering and unregistering virtual memory ranges) provides two primary functionalities: -1) read/POLLIN protocol to notify a userland thread of the faults +1) ``read/POLLIN`` protocol to notify a userland thread of the faults happening -2) various UFFDIO_* ioctls that can manage the virtual memory regions - registered in the userfaultfd that allows userland to efficiently +2) various ``UFFDIO_*`` ioctls that can manage the virtual memory regions + registered in the ``userfaultfd`` that allows userland to efficiently resolve the userfaults it receives via 1) or to manage the virtual memory in the background The real advantage of userfaults if compared to regular virtual memory management of mremap/mprotect is that the userfaults in all their operations never involve heavyweight structures like vmas (in fact the -userfaultfd runtime load never takes the mmap_sem for writing). +``userfaultfd`` runtime load never takes the mmap_sem for writing). Vmas are not suitable for page- (or hugepage) granular fault tracking when dealing with virtual address spaces that could span Terabytes. Too many vmas would be needed for that. -The userfaultfd once opened by invoking the syscall, can also be +The ``userfaultfd`` once opened by invoking the syscall, can also be passed using unix domain sockets to a manager process, so the same manager process could handle the userfaults of a multitude of different processes without them being aware about what is going on -(well of course unless they later try to use the userfaultfd +(well of course unless they later try to use the ``userfaultfd`` themselves on the same region the manager is already tracking, which -is a corner case that would currently return -EBUSY). +is a corner case that would currently return ``-EBUSY``). API === -When first opened the userfaultfd must be enabled invoking the -UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or -a later API version) which will specify the read/POLLIN protocol -userland intends to speak on the UFFD and the uffdio_api.features -userland requires. The UFFDIO_API ioctl if successful (i.e. if the -requested uffdio_api.api is spoken also by the running kernel and the +When first opened the ``userfaultfd`` must be enabled invoking the +``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or +a later API version) which will specify the ``read/POLLIN`` protocol +userland intends to speak on the ``UFFD`` and the ``uffdio_api.features`` +userland requires. The ``UFFDIO_API`` ioctl if successful (i.e. if the +requested ``uffdio_api.api`` is spoken also by the running kernel and the requested features are going to be enabled) will return into -uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of +``uffdio_api.features`` and ``uffdio_api.ioctls`` two 64bit bitmasks of respectively all the available features of the read(2) protocol and the generic ioctl available. -The uffdio_api.features bitmask returned by the UFFDIO_API ioctl -defines what memory types are supported by the userfaultfd and what +The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl +defines what memory types are supported by the ``userfaultfd`` and what events, except page fault notifications, may be generated. -If the kernel supports registering userfaultfd ranges on hugetlbfs -virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in -uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be -set if the kernel supports registering userfaultfd ranges on shared -memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero -MAP_SHARED, memfd_create, etc). +If the kernel supports registering ``userfaultfd`` ranges on hugetlbfs +virtual memory areas, ``UFFD_FEATURE_MISSING_HUGETLBFS`` will be set in +``uffdio_api.features``. Similarly, ``UFFD_FEATURE_MISSING_SHMEM`` will be +set if the kernel supports registering ``userfaultfd`` ranges on shared +memory (covering all shmem APIs, i.e. tmpfs, ``IPCSHM``, ``/dev/zero``, +``MAP_SHARED``, ``memfd_create``, etc). -The userland application that wants to use userfaultfd with hugetlbfs +The userland application that wants to use ``userfaultfd`` with hugetlbfs or shared memory need to set the corresponding flag in -uffdio_api.features to enable those features. +``uffdio_api.features`` to enable those features. If the userland desires to receive notifications for events other than -page faults, it has to verify that uffdio_api.features has appropriate -UFFD_FEATURE_EVENT_* bits set. These events are described in more -detail below in "Non-cooperative userfaultfd" section. - -Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should -be invoked (if present in the returned uffdio_api.ioctls bitmask) to -register a memory range in the userfaultfd by setting the -uffdio_register structure accordingly. The uffdio_register.mode +page faults, it has to verify that ``uffdio_api.features`` has appropriate +``UFFD_FEATURE_EVENT_*`` bits set. These events are described in more +detail below in `Non-cooperative userfaultfd`_ section. + +Once the ``userfaultfd`` has been enabled the ``UFFDIO_REGISTER`` ioctl should +be invoked (if present in the returned ``uffdio_api.ioctls`` bitmask) to +register a memory range in the ``userfaultfd`` by setting the +uffdio_register structure accordingly. The ``uffdio_register.mode`` bitmask will specify to the kernel which kind of faults to track for -the range (UFFDIO_REGISTER_MODE_MISSING would track missing -pages). The UFFDIO_REGISTER ioctl will return the -uffdio_register.ioctls bitmask of ioctls that are suitable to resolve +the range (``UFFDIO_REGISTER_MODE_MISSING`` would track missing +pages). The ``UFFDIO_REGISTER`` ioctl will return the +``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve userfaults on the range registered. Not all ioctls will necessarily be supported for all memory types depending on the underlying virtual memory backend (anonymous memory vs tmpfs vs real filebacked mappings). -Userland can use the uffdio_register.ioctls to manage the virtual +Userland can use the ``uffdio_register.ioctls`` to manage the virtual address space in the background (to add or potentially also remove -memory from the userfaultfd registered range). This means a userfault +memory from the ``userfaultfd`` registered range). This means a userfault could be triggering just before userland maps in the background the user-faulted page. -The primary ioctl to resolve userfaults is UFFDIO_COPY. That +The primary ioctl to resolve userfaults is ``UFFDIO_COPY``. That atomically copies a page into the userfault registered range and wakes -up the blocked userfaults (unless uffdio_copy.mode & -UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to -UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an -half copied page since it'll keep userfaulting until the copy has -finished. +up the blocked userfaults +(unless ``uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE`` is set). +Other ioctl works similarly to ``UFFDIO_COPY``. They're atomic as in +guaranteeing that nothing can see an half copied page since it'll +keep userfaulting until the copy has finished. Notes: -- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then +- If you requested ``UFFDIO_REGISTER_MODE_MISSING`` when registering then you must provide some kind of page in your thread after reading from - the uffd. You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE. + the uffd. You must provide either ``UFFDIO_COPY`` or ``UFFDIO_ZEROPAGE``. The normal behavior of the OS automatically providing a zero page on an annonymous mmaping is not in place. @@ -122,13 +122,13 @@ Notes: - You get the address of the access that triggered the missing page event out of a struct uffd_msg that you read in the thread from the - uffd. You can supply as many pages as you want with UFFDIO_COPY or - UFFDIO_ZEROPAGE. Keep in mind that unless you used DONTWAKE then + uffd. You can supply as many pages as you want with ``UFFDIO_COPY`` or + ``UFFDIO_ZEROPAGE``. Keep in mind that unless you used DONTWAKE then the first of any of those IOCTLs wakes up the faulting thread. -- Be sure to test for all errors including (pollfd[0].revents & - POLLERR). This can happen, e.g. when ranges supplied were - incorrect. +- Be sure to test for all errors including + (``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges + supplied were incorrect. Write Protect Notifications --------------------------- @@ -136,41 +136,42 @@ Write Protect Notifications This is equivalent to (but faster than) using mprotect and a SIGSEGV signal handler. -Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP. -Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT, -struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP +Firstly you need to register a range with ``UFFDIO_REGISTER_MODE_WP``. +Instead of using mprotect(2) you use +``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)`` +while ``mode = UFFDIO_WRITEPROTECT_MODE_WP`` in the struct passed in. The range does not default to and does not have to be identical to the range you registered with. You can write protect as many ranges as you like (inside the registered range). Then, in the thread reading from uffd the struct will have -msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send -ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again -while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set. -This wakes up the thread which will continue to run with writes. This +``msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP`` set. Now you send +``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)`` +again while ``pagefault.mode`` does not have ``UFFDIO_WRITEPROTECT_MODE_WP`` +set. This wakes up the thread which will continue to run with writes. This allows you to do the bookkeeping about the write in the uffd reading thread before the ioctl. -If you registered with both UFFDIO_REGISTER_MODE_MISSING and -UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in +If you registered with both ``UFFDIO_REGISTER_MODE_MISSING`` and +``UFFDIO_REGISTER_MODE_WP`` then you need to think about the sequence in which you supply a page and undo write protect. Note that there is a difference between writes into a WP area and into a !WP area. The -former will have UFFD_PAGEFAULT_FLAG_WP set, the latter -UFFD_PAGEFAULT_FLAG_WRITE. The latter did not fail on protection but -you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was +former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter +``UFFD_PAGEFAULT_FLAG_WRITE``. The latter did not fail on protection but +you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was used. QEMU/KVM ======== -QEMU/KVM is using the userfaultfd syscall to implement postcopy live +QEMU/KVM is using the ``userfaultfd`` syscall to implement postcopy live migration. Postcopy live migration is one form of memory externalization consisting of a virtual machine running with part or all of its memory residing on a different node in the cloud. The -userfaultfd abstraction is generic enough that not a single line of +``userfaultfd`` abstraction is generic enough that not a single line of KVM kernel code had to be modified in order to add postcopy live migration to QEMU. -Guest async page faults, FOLL_NOWAIT and all other GUP features work +Guest async page faults, ``FOLL_NOWAIT`` and all other ``GUP*`` features work just fine in combination with userfaults. Userfaults trigger async page faults in the guest scheduler so those guest processes that aren't waiting for userfaults (i.e. network bound) can keep running in @@ -183,19 +184,19 @@ generating userfaults for readonly guest regions. The implementation of postcopy live migration currently uses one single bidirectional socket but in the future two different sockets will be used (to reduce the latency of the userfaults to the minimum -possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). +possible without having to decrease ``/proc/sys/net/ipv4/tcp_wmem``). The QEMU in the source node writes all pages that it knows are missing in the destination node, into the socket, and the migration thread of -the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE -ioctls on the userfaultfd in order to map the received pages into the -guest (UFFDIO_ZEROCOPY is used if the source page was a zero page). +the QEMU running in the destination node runs ``UFFDIO_COPY|ZEROPAGE`` +ioctls on the ``userfaultfd`` in order to map the received pages into the +guest (``UFFDIO_ZEROCOPY`` is used if the source page was a zero page). A different postcopy thread in the destination node listens with -poll() to the userfaultfd in parallel. When a POLLIN event is +poll() to the ``userfaultfd`` in parallel. When a ``POLLIN`` event is generated after a userfault triggers, the postcopy thread read() from -the userfaultfd and receives the fault address (or -EAGAIN in case the -userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run +the ``userfaultfd`` and receives the fault address (or ``-EAGAIN`` in case the +userfault was already resolved and waken by a ``UFFDIO_COPY|ZEROPAGE`` run by the parallel QEMU migration thread). After the QEMU postcopy thread (running in the destination node) gets @@ -206,7 +207,7 @@ remaining missing pages from that new page offset. Soon after that (just the time to flush the tcp_wmem queue through the network) the migration thread in the QEMU running in the destination node will receive the page that triggered the userfault and it'll map it as -usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it +usual with the ``UFFDIO_COPY|ZEROPAGE`` (without actually knowing if it was spontaneously sent by the source or if it was an urgent page requested through a userfault). @@ -219,74 +220,74 @@ checked to find which missing pages to send in round robin and we seek over it when receiving incoming userfaults. After sending each page of course the bitmap is updated accordingly. It's also useful to avoid sending the same page twice (in case the userfault is read by the -postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration +postcopy thread just before ``UFFDIO_COPY|ZEROPAGE`` runs in the migration thread). Non-cooperative userfaultfd =========================== -When the userfaultfd is monitored by an external manager, the manager +When the ``userfaultfd`` is monitored by an external manager, the manager must be able to track changes in the process virtual memory layout. Userfaultfd can notify the manager about such changes using the same read(2) protocol as for the page fault notifications. The manager has to explicitly enable these events by setting appropriate -bits in uffdio_api.features passed to UFFDIO_API ioctl: +bits in ``uffdio_api.features`` passed to ``UFFDIO_API`` ioctl: -UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When this feature is - enabled, the userfaultfd context of the parent process is +``UFFD_FEATURE_EVENT_FORK`` + enable ``userfaultfd`` hooks for fork(). When this feature is + enabled, the ``userfaultfd`` context of the parent process is duplicated into the newly created process. The manager - receives UFFD_EVENT_FORK with file descriptor of the new - userfaultfd context in the uffd_msg.fork. + receives ``UFFD_EVENT_FORK`` with file descriptor of the new + ``userfaultfd`` context in the ``uffd_msg.fork``. -UFFD_FEATURE_EVENT_REMAP +``UFFD_FEATURE_EVENT_REMAP`` enable notifications about mremap() calls. When the non-cooperative process moves a virtual memory area to a different location, the manager will receive - UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and + ``UFFD_EVENT_REMAP``. The ``uffd_msg.remap`` will contain the old and new addresses of the area and its original length. -UFFD_FEATURE_EVENT_REMOVE +``UFFD_FEATURE_EVENT_REMOVE`` enable notifications about madvise(MADV_REMOVE) and - madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will - be generated upon these calls to madvise. The uffd_msg.remove + madvise(MADV_DONTNEED) calls. The event ``UFFD_EVENT_REMOVE`` will + be generated upon these calls to madvise(). The ``uffd_msg.remove`` will contain start and end addresses of the removed area. -UFFD_FEATURE_EVENT_UNMAP +``UFFD_FEATURE_EVENT_UNMAP`` enable notifications about memory unmapping. The manager will - get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and + get ``UFFD_EVENT_UNMAP`` with ``uffd_msg.remove`` containing start and end addresses of the unmapped area. -Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP +Although the ``UFFD_FEATURE_EVENT_REMOVE`` and ``UFFD_FEATURE_EVENT_UNMAP`` are pretty similar, they quite differ in the action expected from the -userfaultfd manager. In the former case, the virtual memory is +``userfaultfd`` manager. In the former case, the virtual memory is removed, but the area is not, the area remains monitored by the -userfaultfd, and if a page fault occurs in that area it will be +``userfaultfd``, and if a page fault occurs in that area it will be delivered to the manager. The proper resolution for such page fault is to zeromap the faulting address. However, in the latter case, when an area is unmapped, either explicitly (with munmap() system call), or implicitly (e.g. during mremap()), the area is removed and in turn the -userfaultfd context for such area disappears too and the manager will +``userfaultfd`` context for such area disappears too and the manager will not get further userland page faults from the removed area. Still, the notification is required in order to prevent manager from using -UFFDIO_COPY on the unmapped area. +``UFFDIO_COPY`` on the unmapped area. Unlike userland page faults which have to be synchronous and require explicit or implicit wakeup, all the events are delivered asynchronously and the non-cooperative process resumes execution as -soon as manager executes read(). The userfaultfd manager should -carefully synchronize calls to UFFDIO_COPY with the events -processing. To aid the synchronization, the UFFDIO_COPY ioctl will -return -ENOSPC when the monitored process exits at the time of -UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed -its virtual memory layout simultaneously with outstanding UFFDIO_COPY +soon as manager executes read(). The ``userfaultfd`` manager should +carefully synchronize calls to ``UFFDIO_COPY`` with the events +processing. To aid the synchronization, the ``UFFDIO_COPY`` ioctl will +return ``-ENOSPC`` when the monitored process exits at the time of +``UFFDIO_COPY``, and ``-ENOENT``, when the non-cooperative process has changed +its virtual memory layout simultaneously with outstanding ``UFFDIO_COPY`` operation. The current asynchronous model of the event delivery is optimal for -single threaded non-cooperative userfaultfd manager implementations. A +single threaded non-cooperative ``userfaultfd`` manager implementations. A synchronous event delivery model can be added later as a new -userfaultfd feature to facilitate multithreading enhancements of the -non cooperative manager, for example to allow UFFDIO_COPY ioctls to +``userfaultfd`` feature to facilitate multithreading enhancements of the +non cooperative manager, for example to allow ``UFFDIO_COPY`` ioctls to run in parallel to the event reception. Single threaded implementations should continue to use the current async event delivery model instead. diff --git a/Documentation/admin-guide/nfs/nfsroot.rst b/Documentation/admin-guide/nfs/nfsroot.rst index 82a4fda057f9..c6772075c80c 100644 --- a/Documentation/admin-guide/nfs/nfsroot.rst +++ b/Documentation/admin-guide/nfs/nfsroot.rst @@ -18,7 +18,7 @@ Mounting the root filesystem via NFS (nfsroot) In order to use a diskless system, such as an X-terminal or printer server for example, it is necessary for the root filesystem to be present on a non-disk device. This may be an initramfs (see -Documentation/filesystems/ramfs-rootfs-initramfs.txt), a ramdisk (see +Documentation/filesystems/ramfs-rootfs-initramfs.rst), a ramdisk (see Documentation/admin-guide/initrd.rst) or a filesystem mounted via NFS. The following text describes on how to use NFS for the root filesystem. For the rest of this text 'client' means the diskless system, and 'server' means the NFS diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst index aaf1667489f8..08ec2c2bdce3 100644 --- a/Documentation/admin-guide/numastat.rst +++ b/Documentation/admin-guide/numastat.rst @@ -6,6 +6,21 @@ Numa policy hit/miss statistics All units are pages. Hugepages have separate counters. +The numa_hit, numa_miss and numa_foreign counters reflect how well processes +are able to allocate memory from nodes they prefer. If they succeed, numa_hit +is incremented on the preferred node, otherwise numa_foreign is incremented on +the preferred node and numa_miss on the node where allocation succeeded. + +Usually preferred node is the one local to the CPU where the process executes, +but restrictions such as mempolicies can change that, so there are also two +counters based on CPU local node. local_node is similar to numa_hit and is +incremented on allocation from a node by CPU on the same node. other_node is +similar to numa_miss and is incremented on the node where allocation succeeds +from a CPU from a different node. Note there is no counter analogical to +numa_foreign. + +In more detail: + =============== ============================================================ numa_hit A process wanted to allocate memory from this node, and succeeded. @@ -14,11 +29,13 @@ numa_miss A process wanted to allocate memory from another node, but ended up with memory from this node. numa_foreign A process wanted to allocate on this node, - but ended up with memory from another one. + but ended up with memory from another node. -local_node A process ran on this node and got memory from it. +local_node A process ran on this node's CPU, + and got memory from this node. -other_node A process ran on this node and got memory from another node. +other_node A process ran on a different node's CPU + and got memory from this node. interleave_hit Interleaving wanted to allocate from this node and succeeded. @@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package (http://oss.sgi.com/projects/libnuma/). Note that it only works well right now on machines with a small number of CPUs. +Note that on systems with memoryless nodes (where a node has CPUs but no +memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed +heavily. In the current kernel implementation, if a process prefers a +memoryless node (i.e. because it is running on one of its local CPU), the +implementation actually treats one of the nearest nodes with memory as the +preferred node. As a result, such allocation will not increase the numa_foreign +counter on the memoryless node, and will skew the numa_hit, numa_miss and +numa_foreign statistics of the nearest node. diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst index 72effa7c23b9..1307b5274a0f 100644 --- a/Documentation/admin-guide/perf-security.rst +++ b/Documentation/admin-guide/perf-security.rst @@ -1,6 +1,6 @@ .. _perf_security: -Perf Events and tool security +Perf events and tool security ============================= Overview @@ -42,11 +42,11 @@ categories: Data that belong to the fourth category can potentially contain sensitive process data. If PMUs in some monitoring modes capture values of execution context registers or data from process memory then access -to such monitoring capabilities requires to be ordered and secured -properly. So, perf_events/Perf performance monitoring is the subject for -security access control management [5]_ . +to such monitoring modes requires to be ordered and secured properly. +So, perf_events performance monitoring and observability operations are +the subject for security access control management [5]_ . -perf_events/Perf access control +perf_events access control ------------------------------- To perform security checks, the Linux implementation splits processes @@ -66,11 +66,25 @@ into distinct units, known as capabilities [6]_ , which can be independently enabled and disabled on per-thread basis for processes and files of unprivileged users. -Unprivileged processes with enabled CAP_SYS_ADMIN capability are treated +Unprivileged processes with enabled CAP_PERFMON capability are treated as privileged processes with respect to perf_events performance -monitoring and bypass *scope* permissions checks in the kernel. - -Unprivileged processes using perf_events system call API is also subject +monitoring and observability operations, thus, bypass *scope* permissions +checks in the kernel. CAP_PERFMON implements the principle of least +privilege [13]_ (POSIX 1003.1e: 2.2.2.39) for performance monitoring and +observability operations in the kernel and provides a secure approach to +perfomance monitoring and observability in the system. + +For backward compatibility reasons the access to perf_events monitoring and +observability operations is also open for CAP_SYS_ADMIN privileged +processes but CAP_SYS_ADMIN usage for secure monitoring and observability +use cases is discouraged with respect to the CAP_PERFMON capability. +If system audit records [14]_ for a process using perf_events system call +API contain denial records of acquiring both CAP_PERFMON and CAP_SYS_ADMIN +capabilities then providing the process with CAP_PERFMON capability singly +is recommended as the preferred secure approach to resolve double access +denial logging related to usage of performance monitoring and observability. + +Unprivileged processes using perf_events system call are also subject for PTRACE_MODE_READ_REALCREDS ptrace access mode check [7]_ , whose outcome determines whether monitoring is permitted. So unprivileged processes provided with CAP_SYS_PTRACE capability are effectively @@ -82,14 +96,14 @@ performance analysis of monitored processes or a system. For example, CAP_SYSLOG capability permits reading kernel space memory addresses from /proc/kallsyms file. -perf_events/Perf privileged users +Privileged Perf users groups --------------------------------- Mechanisms of capabilities, privileged capability-dumb files [6]_ and -file system ACLs [10]_ can be used to create a dedicated group of -perf_events/Perf privileged users who are permitted to execute -performance monitoring without scope limits. The following steps can be -taken to create such a group of privileged Perf users. +file system ACLs [10]_ can be used to create dedicated groups of +privileged Perf users who are permitted to execute performance monitoring +and observability without scope limits. The following steps can be +taken to create such groups of privileged Perf users. 1. Create perf_users group of privileged Perf users, assign perf_users group to Perf tool executable and limit access to the executable for @@ -108,30 +122,51 @@ taken to create such a group of privileged Perf users. -rwxr-x--- 2 root perf_users 11M Oct 19 15:12 perf 2. Assign the required capabilities to the Perf tool executable file and - enable members of perf_users group with performance monitoring + enable members of perf_users group with monitoring and observability privileges [6]_ : :: - # setcap "cap_sys_admin,cap_sys_ptrace,cap_syslog=ep" perf - # setcap -v "cap_sys_admin,cap_sys_ptrace,cap_syslog=ep" perf + # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf + # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf perf: OK # getcap perf - perf = cap_sys_ptrace,cap_sys_admin,cap_syslog+ep + perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep + +If the libcap installed doesn't yet support "cap_perfmon", use "38" instead, +i.e.: + +:: + + # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf + +Note that you may need to have 'cap_ipc_lock' in the mix for tools such as +'perf top', alternatively use 'perf top -m N', to reduce the memory that +it uses for the perf ring buffer, see the memory allocation section below. + +Using a libcap without support for CAP_PERFMON will make cap_get_flag(caps, 38, +CAP_EFFECTIVE, &val) fail, which will lead the default event to be 'cycles:u', +so as a workaround explicitly ask for the 'cycles' event, i.e.: + +:: + + # perf top -e cycles + +To get kernel and user samples with a perf binary with just CAP_PERFMON. As a result, members of perf_users group are capable of conducting -performance monitoring by using functionality of the configured Perf -tool executable that, when executes, passes perf_events subsystem scope -checks. +performance monitoring and observability by using functionality of the +configured Perf tool executable that, when executes, passes perf_events +subsystem scope checks. This specific access control management is only available to superuser or root running processes with CAP_SETPCAP, CAP_SETFCAP [6]_ capabilities. -perf_events/Perf unprivileged users +Unprivileged users ----------------------------------- -perf_events/Perf *scope* and *access* control for unprivileged processes +perf_events *scope* and *access* control for unprivileged processes is governed by perf_event_paranoid [2]_ setting: -1: @@ -166,7 +201,7 @@ is governed by perf_event_paranoid [2]_ setting: perf_event_mlock_kb locking limit is imposed but ignored for unprivileged processes with CAP_IPC_LOCK capability. -perf_events/Perf resource control +Resource control --------------------------------- Open file descriptors @@ -227,4 +262,5 @@ Bibliography .. [10] `<http://man7.org/linux/man-pages/man5/acl.5.html>`_ .. [11] `<http://man7.org/linux/man-pages/man2/getrlimit.2.html>`_ .. [12] `<http://man7.org/linux/man-pages/man5/limits.conf.5.html>`_ - +.. [13] `<https://sites.google.com/site/fullycapable>`_ +.. [14] `<http://man7.org/linux/man-pages/man8/auditd.8.html>`_ diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/ras.rst index 0310db624964..7b481b2a368e 100644 --- a/Documentation/admin-guide/ras.rst +++ b/Documentation/admin-guide/ras.rst @@ -156,11 +156,11 @@ the labels provided by the BIOS won't match the real ones. ECC memory ---------- -As mentioned on the previous section, ECC memory has extra bits to be -used for error correction. So, on 64 bit systems, a memory module -has 64 bits of *data width*, and 74 bits of *total width*. So, there are -8 bits extra bits to be used for the error detection and correction -mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_. +As mentioned in the previous section, ECC memory has extra bits to be +used for error correction. In the above example, a memory module has +64 bits of *data width*, and 72 bits of *total width*. The extra 8 +bits which are used for the error detection and correction mechanisms +are referred to as the *syndrome*\ [#f1]_\ [#f2]_. So, when the cpu requests the memory controller to write a word with *data width*, the memory controller calculates the *syndrome* in real time, @@ -212,7 +212,7 @@ EDAC - Error Detection And Correction purposes. When the subsystem was pushed upstream for the first time, on - Kernel 2.6.16, for the first time, it was renamed to ``EDAC``. + Kernel 2.6.16, it was renamed to ``EDAC``. Purpose ------- @@ -351,15 +351,17 @@ controllers. The following example will assume 2 channels: +------------+-----------+-----------+ | | ``ch0`` | ``ch1`` | +============+===========+===========+ - | ``csrow0`` | DIMM_A0 | DIMM_B0 | - | | rank0 | rank0 | - +------------+ - | - | + | |**DIMM_A0**|**DIMM_B0**| + +------------+-----------+-----------+ + | ``csrow0`` | rank0 | rank0 | + +------------+-----------+-----------+ | ``csrow1`` | rank1 | rank1 | +------------+-----------+-----------+ - | ``csrow2`` | DIMM_A1 | DIMM_B1 | - | | rank0 | rank0 | - +------------+ - | - | - | ``csrow3`` | rank1 | rank1 | + | |**DIMM_A1**|**DIMM_B1**| + +------------+-----------+-----------+ + | ``csrow2`` | rank0 | rank0 | + +------------+-----------+-----------+ + | ``csrow3`` | rank1 | rank1 | +------------+-----------+-----------+ In the above example, there are 4 physical slots on the motherboard diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 0d427fd10941..1ebf68d01141 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -102,6 +102,30 @@ See the ``type_of_loader`` and ``ext_loader_ver`` fields in :doc:`/x86/boot` for additional information. +bpf_stats_enabled +================= + +Controls whether the kernel should collect statistics on BPF programs +(total time spent running, number of times run...). Enabling +statistics causes a slight reduction in performance on each program +run. The statistics can be seen using ``bpftool``. + += =================================== +0 Don't collect statistics (default). +1 Collect statistics. += =================================== + + +cad_pid +======= + +This is the pid which will be signalled on reboot (notably, by +Ctrl-Alt-Delete). Writing a value to this file which doesn't +correspond to a running process will result in ``-ESRCH``. + +See also `ctrl-alt-del`_. + + cap_last_cap ============ @@ -241,6 +265,40 @@ domain names are in general different. For a detailed discussion see the ``hostname(1)`` man page. +firmware_config +=============== + +See :doc:`/driver-api/firmware/fallback-mechanisms`. + +The entries in this directory allow the firmware loader helper +fallback to be controlled: + +* ``force_sysfs_fallback``, when set to 1, forces the use of the + fallback; +* ``ignore_sysfs_fallback``, when set to 1, ignores any fallback. + + +ftrace_dump_on_oops +=================== + +Determines whether ``ftrace_dump()`` should be called on an oops (or +kernel panic). This will output the contents of the ftrace buffers to +the console. This is very useful for capturing traces that lead to +crashes and outputting them to a serial console. + += =================================================== +0 Disabled (default). +1 Dump buffers of all CPUs. +2 Dump the buffer of the CPU that triggered the oops. += =================================================== + + +ftrace_enabled, stack_tracer_enabled +==================================== + +See :doc:`/trace/ftrace`. + + hardlockup_all_cpu_backtrace ============================ @@ -344,6 +402,25 @@ Controls whether the panic kmsg data should be reported to Hyper-V. = ========================================================= +ignore-unaligned-usertrap +========================= + +On architectures where unaligned accesses cause traps, and where this +feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN``; +currently, ``arc`` and ``ia64``), controls whether all unaligned traps +are logged. + += ============================================================= +0 Log all unaligned accesses. +1 Only warn the first time a process traps. This is the default + setting. += ============================================================= + +See also `unaligned-trap`_ and `unaligned-dump-stack`_. On ``ia64``, +this allows system administrators to override the +``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded. + + kexec_load_disabled =================== @@ -459,6 +536,15 @@ Notes: successful IPC object allocation. If an IPC object allocation syscall fails, it is undefined if the value remains unmodified or is reset to -1. + +ngroups_max +=========== + +Maximum number of supplementary groups, _i.e._ the maximum size which +``setgroups`` will accept. Exports ``NGROUPS_MAX`` from the kernel. + + + nmi_watchdog ============ @@ -721,7 +807,13 @@ perf_event_paranoid =================== Controls use of the performance events system by unprivileged -users (without CAP_SYS_ADMIN). The default value is 2. +users (without CAP_PERFMON). The default value is 2. + +For backward compatibility reasons access to system performance +monitoring and observability remains open for CAP_SYS_ADMIN +privileged processes but CAP_SYS_ADMIN usage for secure system +performance monitoring and observability operations is discouraged +with respect to CAP_PERFMON use cases. === ================================================================== -1 Allow use of (almost) all events by all users. @@ -730,13 +822,13 @@ users (without CAP_SYS_ADMIN). The default value is 2. ``CAP_IPC_LOCK``. >=0 Disallow ftrace function tracepoint by users without - ``CAP_SYS_ADMIN``. + ``CAP_PERFMON``. - Disallow raw tracepoint access by users without ``CAP_SYS_ADMIN``. + Disallow raw tracepoint access by users without ``CAP_PERFMON``. ->=1 Disallow CPU event access by users without ``CAP_SYS_ADMIN``. +>=1 Disallow CPU event access by users without ``CAP_PERFMON``. ->=2 Disallow kernel profiling by users without ``CAP_SYS_ADMIN``. +>=2 Disallow kernel profiling by users without ``CAP_PERFMON``. === ================================================================== @@ -871,7 +963,7 @@ this sysctl interface anymore. pty === -See Documentation/filesystems/devpts.txt. +See Documentation/filesystems/devpts.rst. randomize_va_space @@ -1167,6 +1259,65 @@ If a value outside of this range is written to ``threads-max`` an ``EINVAL`` error occurs. +traceoff_on_warning +=================== + +When set, disables tracing (see :doc:`/trace/ftrace`) when a +``WARN()`` is hit. + + +tracepoint_printk +================= + +When tracepoints are sent to printk() (enabled by the ``tp_printk`` +boot parameter), this entry provides runtime control:: + + echo 0 > /proc/sys/kernel/tracepoint_printk + +will stop tracepoints from being sent to printk(), and:: + + echo 1 > /proc/sys/kernel/tracepoint_printk + +will send them to printk() again. + +This only works if the kernel was booted with ``tp_printk`` enabled. + +See :doc:`/admin-guide/kernel-parameters` and +:doc:`/trace/boottime-trace`. + + +.. _unaligned-dump-stack: + +unaligned-dump-stack (ia64) +=========================== + +When logging unaligned accesses, controls whether the stack is +dumped. + += =================================================== +0 Do not dump the stack. This is the default setting. +1 Dump the stack. += =================================================== + +See also `ignore-unaligned-usertrap`_. + + +unaligned-trap +============== + +On architectures where unaligned accesses cause traps, and where this +feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW``; currently, +``arc`` and ``parisc``), controls whether unaligned traps are caught +and emulated (instead of failing). + += ======================================================== +0 Do not emulate unaligned accesses. +1 Emulate unaligned accesses. This is the default setting. += ======================================================== + +See also `ignore-unaligned-usertrap`_. + + unknown_nmi_panic ================= @@ -1178,6 +1329,16 @@ NMI switch that most IA32 servers have fires unknown NMI up, for example. If a system hangs up, try pressing the NMI switch. +unprivileged_bpf_disabled +========================= + +Writing 1 to this entry will disable unprivileged calls to ``bpf()``; +once disabled, calling ``bpf()`` without ``CAP_SYS_ADMIN`` will return +``-EPERM``. + +Once set, this can't be cleared. + + watchdog ======== diff --git a/Documentation/arm64/amu.rst b/Documentation/arm64/amu.rst index 036783ee327f..452ec8b115c2 100644 --- a/Documentation/arm64/amu.rst +++ b/Documentation/arm64/amu.rst @@ -24,13 +24,13 @@ optional external memory-mapped interface. Version 1 of the Activity Monitors architecture implements a counter group of four fixed and architecturally defined 64-bit event counters. -- CPU cycle counter: increments at the frequency of the CPU. -- Constant counter: increments at the fixed frequency of the system - clock. -- Instructions retired: increments with every architecturally executed - instruction. -- Memory stall cycles: counts instruction dispatch stall cycles caused by - misses in the last level cache within the clock domain. + - CPU cycle counter: increments at the frequency of the CPU. + - Constant counter: increments at the fixed frequency of the system + clock. + - Instructions retired: increments with every architecturally executed + instruction. + - Memory stall cycles: counts instruction dispatch stall cycles caused by + misses in the last level cache within the clock domain. When in WFI or WFE these counters do not increment. @@ -59,11 +59,11 @@ counters, only the presence of the extension. Firmware (code running at higher exception levels, e.g. arm-tf) support is needed to: -- Enable access for lower exception levels (EL2 and EL1) to the AMU - registers. -- Enable the counters. If not enabled these will read as 0. -- Save/restore the counters before/after the CPU is being put/brought up - from the 'off' power state. + - Enable access for lower exception levels (EL2 and EL1) to the AMU + registers. + - Enable the counters. If not enabled these will read as 0. + - Save/restore the counters before/after the CPU is being put/brought up + from the 'off' power state. When using kernels that have this feature enabled but boot with broken firmware the user may experience panics or lockups when accessing the @@ -81,10 +81,10 @@ are not trapped in EL2/EL3. The fixed counters of AMUv1 are accessible though the following system register definitions: -- SYS_AMEVCNTR0_CORE_EL0 -- SYS_AMEVCNTR0_CONST_EL0 -- SYS_AMEVCNTR0_INST_RET_EL0 -- SYS_AMEVCNTR0_MEM_STALL_EL0 + - SYS_AMEVCNTR0_CORE_EL0 + - SYS_AMEVCNTR0_CONST_EL0 + - SYS_AMEVCNTR0_INST_RET_EL0 + - SYS_AMEVCNTR0_MEM_STALL_EL0 Auxiliary platform specific counters can be accessed using SYS_AMEVCNTR1_EL0(n), where n is a value between 0 and 15. @@ -97,9 +97,9 @@ Userspace access Currently, access from userspace to the AMU registers is disabled due to: -- Security reasons: they might expose information about code executed in - secure mode. -- Purpose: AMU counters are intended for system management use. + - Security reasons: they might expose information about code executed in + secure mode. + - Purpose: AMU counters are intended for system management use. Also, the presence of the feature is not visible to userspace. @@ -110,8 +110,8 @@ Virtualization Currently, access from userspace (EL0) and kernelspace (EL1) on the KVM guest side is disabled due to: -- Security reasons: they might expose information about code executed - by other guests or the host. + - Security reasons: they might expose information about code executed + by other guests or the host. Any attempt to access the AMU registers will result in an UNDEFINED exception being injected into the guest. diff --git a/Documentation/arm64/booting.rst b/Documentation/arm64/booting.rst index a3f1a47b6f1c..7552dbc1cc54 100644 --- a/Documentation/arm64/booting.rst +++ b/Documentation/arm64/booting.rst @@ -173,7 +173,10 @@ Before jumping into the kernel, the following conditions must be met: - Caches, MMUs The MMU must be off. - Instruction cache may be on or off. + + The instruction cache may be on or off, and must not hold any stale + entries corresponding to the loaded kernel image. + The address range corresponding to the loaded kernel image must be cleaned to the PoC. In the presence of a system cache or other coherent masters with caches enabled, this will typically require @@ -238,6 +241,7 @@ Before jumping into the kernel, the following conditions must be met: - The DT or ACPI tables must describe a GICv2 interrupt controller. For CPUs with pointer authentication functionality: + - If EL3 is present: - SCR_EL3.APK (bit 16) must be initialised to 0b1 @@ -249,18 +253,22 @@ Before jumping into the kernel, the following conditions must be met: - HCR_EL2.API (bit 41) must be initialised to 0b1 For CPUs with Activity Monitors Unit v1 (AMUv1) extension present: + - If EL3 is present: - CPTR_EL3.TAM (bit 30) must be initialised to 0b0 - CPTR_EL2.TAM (bit 30) must be initialised to 0b0 - AMCNTENSET0_EL0 must be initialised to 0b1111 - AMCNTENSET1_EL0 must be initialised to a platform specific value - having 0b1 set for the corresponding bit for each of the auxiliary - counters present. + + - CPTR_EL3.TAM (bit 30) must be initialised to 0b0 + - CPTR_EL2.TAM (bit 30) must be initialised to 0b0 + - AMCNTENSET0_EL0 must be initialised to 0b1111 + - AMCNTENSET1_EL0 must be initialised to a platform specific value + having 0b1 set for the corresponding bit for each of the auxiliary + counters present. + - If the kernel is entered at EL1: - AMCNTENSET0_EL0 must be initialised to 0b1111 - AMCNTENSET1_EL0 must be initialised to a platform specific value - having 0b1 set for the corresponding bit for each of the auxiliary - counters present. + + - AMCNTENSET0_EL0 must be initialised to 0b1111 + - AMCNTENSET1_EL0 must be initialised to a platform specific value + having 0b1 set for the corresponding bit for each of the auxiliary + counters present. The requirements described above for CPU mode, caches, MMUs, architected timers, coherency and system registers apply to all CPUs. All CPUs must @@ -304,7 +312,8 @@ following manner: Documentation/devicetree/bindings/arm/psci.yaml. - Secondary CPU general-purpose register settings - x0 = 0 (reserved for future use) - x1 = 0 (reserved for future use) - x2 = 0 (reserved for future use) - x3 = 0 (reserved for future use) + + - x0 = 0 (reserved for future use) + - x1 = 0 (reserved for future use) + - x2 = 0 (reserved for future use) + - x3 = 0 (reserved for future use) diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst index 41937a8091aa..314fa5bc2655 100644 --- a/Documentation/arm64/cpu-feature-registers.rst +++ b/Documentation/arm64/cpu-feature-registers.rst @@ -176,6 +176,8 @@ infrastructure: +------------------------------+---------+---------+ | SSBS | [7-4] | y | +------------------------------+---------+---------+ + | BT | [3-0] | y | + +------------------------------+---------+---------+ 4) MIDR_EL1 - Main ID Register diff --git a/Documentation/arm64/elf_hwcaps.rst b/Documentation/arm64/elf_hwcaps.rst index 7dfb97dfe416..84a9fd2d41b4 100644 --- a/Documentation/arm64/elf_hwcaps.rst +++ b/Documentation/arm64/elf_hwcaps.rst @@ -236,6 +236,11 @@ HWCAP2_RNG Functionality implied by ID_AA64ISAR0_EL1.RNDR == 0b0001. +HWCAP2_BTI + + Functionality implied by ID_AA64PFR0_EL1.BT == 0b0001. + + 4. Unused AT_HWCAP bits ----------------------- diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst index 2c08c628febd..936cf2a59ca4 100644 --- a/Documentation/arm64/silicon-errata.rst +++ b/Documentation/arm64/silicon-errata.rst @@ -64,6 +64,10 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A53 | #843419 | ARM64_ERRATUM_843419 | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A55 | #1024718 | ARM64_ERRATUM_1024718 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A55 | #1530923 | ARM64_ERRATUM_1530923 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A57 | #832075 | ARM64_ERRATUM_832075 | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A57 | #852523 | N/A | @@ -78,8 +82,6 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A73 | #858921 | ARM64_ERRATUM_858921 | +----------------+-----------------+-----------------+-----------------------------+ -| ARM | Cortex-A55 | #1024718 | ARM64_ERRATUM_1024718 | -+----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A76 | #1188873,1418040| ARM64_ERRATUM_1418040 | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A76 | #1165522 | ARM64_ERRATUM_1165522 | @@ -88,8 +90,6 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A76 | #1463225 | ARM64_ERRATUM_1463225 | +----------------+-----------------+-----------------+-----------------------------+ -| ARM | Cortex-A55 | #1530923 | ARM64_ERRATUM_1530923 | -+----------------+-----------------+-----------------+-----------------------------+ | ARM | Neoverse-N1 | #1188873,1418040| ARM64_ERRATUM_1418040 | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Neoverse-N1 | #1349291 | N/A | diff --git a/Documentation/conf.py b/Documentation/conf.py index 9ae8e9abf846..f6a1bc07c410 100644 --- a/Documentation/conf.py +++ b/Documentation/conf.py @@ -388,44 +388,6 @@ if major == 1 and minor < 6: # author, documentclass [howto, manual, or own class]). # Sorted in alphabetical order latex_documents = [ - ('admin-guide/index', 'linux-user.tex', 'Linux Kernel User Documentation', - 'The kernel development community', 'manual'), - ('core-api/index', 'core-api.tex', 'The kernel core API manual', - 'The kernel development community', 'manual'), - ('crypto/index', 'crypto-api.tex', 'Linux Kernel Crypto API manual', - 'The kernel development community', 'manual'), - ('dev-tools/index', 'dev-tools.tex', 'Development tools for the Kernel', - 'The kernel development community', 'manual'), - ('doc-guide/index', 'kernel-doc-guide.tex', 'Linux Kernel Documentation Guide', - 'The kernel development community', 'manual'), - ('driver-api/index', 'driver-api.tex', 'The kernel driver API manual', - 'The kernel development community', 'manual'), - ('filesystems/index', 'filesystems.tex', 'Linux Filesystems API', - 'The kernel development community', 'manual'), - ('admin-guide/ext4', 'ext4-admin-guide.tex', 'ext4 Administration Guide', - 'ext4 Community', 'manual'), - ('filesystems/ext4/index', 'ext4-data-structures.tex', - 'ext4 Data Structures and Algorithms', 'ext4 Community', 'manual'), - ('gpu/index', 'gpu.tex', 'Linux GPU Driver Developer\'s Guide', - 'The kernel development community', 'manual'), - ('input/index', 'linux-input.tex', 'The Linux input driver subsystem', - 'The kernel development community', 'manual'), - ('kernel-hacking/index', 'kernel-hacking.tex', 'Unreliable Guide To Hacking The Linux Kernel', - 'The kernel development community', 'manual'), - ('media/index', 'media.tex', 'Linux Media Subsystem Documentation', - 'The kernel development community', 'manual'), - ('networking/index', 'networking.tex', 'Linux Networking Documentation', - 'The kernel development community', 'manual'), - ('process/index', 'development-process.tex', 'Linux Kernel Development Documentation', - 'The kernel development community', 'manual'), - ('security/index', 'security.tex', 'The kernel security subsystem manual', - 'The kernel development community', 'manual'), - ('sh/index', 'sh.tex', 'SuperH architecture implementation manual', - 'The kernel development community', 'manual'), - ('sound/index', 'sound.tex', 'Linux Sound Subsystem Documentation', - 'The kernel development community', 'manual'), - ('userspace-api/index', 'userspace-api.tex', 'The Linux kernel user-space API guide', - 'The kernel development community', 'manual'), ] # Add all other index files from Documentation/ subdirectories diff --git a/Documentation/debugging-via-ohci1394.txt b/Documentation/core-api/debugging-via-ohci1394.rst index 981ad4f89fd3..981ad4f89fd3 100644 --- a/Documentation/debugging-via-ohci1394.txt +++ b/Documentation/core-api/debugging-via-ohci1394.rst diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/core-api/dma-api-howto.rst index 358d495456d1..358d495456d1 100644 --- a/Documentation/DMA-API-HOWTO.txt +++ b/Documentation/core-api/dma-api-howto.rst diff --git a/Documentation/DMA-API.txt b/Documentation/core-api/dma-api.rst index 2d8d2fed7317..2d8d2fed7317 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/core-api/dma-api.rst diff --git a/Documentation/DMA-attributes.txt b/Documentation/core-api/dma-attributes.rst index 29dcbe8826e8..29dcbe8826e8 100644 --- a/Documentation/DMA-attributes.txt +++ b/Documentation/core-api/dma-attributes.rst diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/core-api/dma-isa-lpc.rst index b1ec7b16c21f..b1ec7b16c21f 100644 --- a/Documentation/DMA-ISA-LPC.txt +++ b/Documentation/core-api/dma-isa-lpc.rst diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 0897ad12c119..15ab86112627 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -18,6 +18,7 @@ it. kernel-api workqueue + printk-basics printk-formats symbol-namespaces @@ -30,10 +31,12 @@ Library functionality that is used throughout the kernel. :maxdepth: 1 kobject + kref assoc_array xarray idr circular-buffers + rbtree generic-radix-tree packing timekeeping @@ -50,6 +53,7 @@ How Linux keeps everything from happening at the same time. See atomic_ops refcount-vs-atomic + irq/index local_ops padata ../RCU/index @@ -78,6 +82,10 @@ more memory-management documentation in :doc:`/vm/index`. :maxdepth: 1 memory-allocation + dma-api + dma-api-howto + dma-attributes + dma-isa-lpc mm-api genalloc pin_user_pages @@ -92,6 +100,7 @@ Interfaces for kernel debugging debug-objects tracepoint + debugging-via-ohci1394 Everything else =============== diff --git a/Documentation/IRQ.txt b/Documentation/core-api/irq/concepts.rst index 4273806a606b..4273806a606b 100644 --- a/Documentation/IRQ.txt +++ b/Documentation/core-api/irq/concepts.rst diff --git a/Documentation/core-api/irq/index.rst b/Documentation/core-api/irq/index.rst new file mode 100644 index 000000000000..0d65d11e5420 --- /dev/null +++ b/Documentation/core-api/irq/index.rst @@ -0,0 +1,11 @@ +==== +IRQs +==== + +.. toctree:: + :maxdepth: 1 + + concepts + irq-affinity + irq-domain + irqflags-tracing diff --git a/Documentation/IRQ-affinity.txt b/Documentation/core-api/irq/irq-affinity.rst index 29da5000836a..29da5000836a 100644 --- a/Documentation/IRQ-affinity.txt +++ b/Documentation/core-api/irq/irq-affinity.rst diff --git a/Documentation/IRQ-domain.txt b/Documentation/core-api/irq/irq-domain.rst index 507775cce753..096db12f32d5 100644 --- a/Documentation/IRQ-domain.txt +++ b/Documentation/core-api/irq/irq-domain.rst @@ -263,7 +263,8 @@ needs to: Hierarchy irq_domain is in no way x86 specific, and is heavily used to support other architectures, such as ARM, ARM64 etc. -=== Debugging === +Debugging +========= Most of the internals of the IRQ subsystem are exposed in debugfs by turning CONFIG_GENERIC_IRQ_DEBUGFS on. diff --git a/Documentation/irqflags-tracing.txt b/Documentation/core-api/irq/irqflags-tracing.rst index bdd208259fb3..bdd208259fb3 100644 --- a/Documentation/irqflags-tracing.txt +++ b/Documentation/core-api/irq/irqflags-tracing.rst diff --git a/Documentation/core-api/kobject.rst b/Documentation/core-api/kobject.rst index 1f62d4d7d966..e93dc8cf52dd 100644 --- a/Documentation/core-api/kobject.rst +++ b/Documentation/core-api/kobject.rst @@ -80,11 +80,11 @@ what is the pointer to the containing structure? You must avoid tricks (such as assuming that the kobject is at the beginning of the structure) and, instead, use the container_of() macro, found in ``<linux/kernel.h>``:: - container_of(pointer, type, member) + container_of(ptr, type, member) where: - * ``pointer`` is the pointer to the embedded kobject, + * ``ptr`` is the pointer to the embedded kobject, * ``type`` is the type of the containing structure, and * ``member`` is the name of the structure field to which ``pointer`` points. @@ -140,7 +140,7 @@ the name of the kobject, call kobject_rename():: int kobject_rename(struct kobject *kobj, const char *new_name); -kobject_rename does not perform any locking or have a solid notion of +kobject_rename() does not perform any locking or have a solid notion of what names are valid so the caller must provide their own sanity checking and serialization. @@ -210,7 +210,7 @@ statically and will warn the developer of this improper usage. If all that you want to use a kobject for is to provide a reference counter for your structure, please use the struct kref instead; a kobject would be overkill. For more information on how to use struct kref, please see the -file Documentation/kref.txt in the Linux kernel source tree. +file Documentation/core-api/kref.rst in the Linux kernel source tree. Creating "simple" kobjects @@ -222,17 +222,17 @@ ksets, show and store functions, and other details. This is the one exception where a single kobject should be created. To create such an entry, use the function:: - struct kobject *kobject_create_and_add(char *name, struct kobject *parent); + struct kobject *kobject_create_and_add(const char *name, struct kobject *parent); This function will create a kobject and place it in sysfs in the location underneath the specified parent kobject. To create simple attributes associated with this kobject, use:: - int sysfs_create_file(struct kobject *kobj, struct attribute *attr); + int sysfs_create_file(struct kobject *kobj, const struct attribute *attr); or:: - int sysfs_create_group(struct kobject *kobj, struct attribute_group *grp); + int sysfs_create_group(struct kobject *kobj, const struct attribute_group *grp); Both types of attributes used here, with a kobject that has been created with the kobject_create_and_add(), can be of type kobj_attribute, so no @@ -300,8 +300,10 @@ kobj_type:: void (*release)(struct kobject *kobj); const struct sysfs_ops *sysfs_ops; struct attribute **default_attrs; + const struct attribute_group **default_groups; const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); const void *(*namespace)(struct kobject *kobj); + void (*get_ownership)(struct kobject *kobj, kuid_t *uid, kgid_t *gid); }; This structure is used to describe a particular type of kobject (or, more @@ -352,12 +354,12 @@ created and never declared statically or on the stack. To create a new kset use:: struct kset *kset_create_and_add(const char *name, - struct kset_uevent_ops *u, - struct kobject *parent); + const struct kset_uevent_ops *uevent_ops, + struct kobject *parent_kobj); When you are finished with the kset, call:: - void kset_unregister(struct kset *kset); + void kset_unregister(struct kset *k); to destroy it. This removes the kset from sysfs and decrements its reference count. When the reference count goes to zero, the kset will be released. @@ -371,9 +373,9 @@ If a kset wishes to control the uevent operations of the kobjects associated with it, it can use the struct kset_uevent_ops to handle it:: struct kset_uevent_ops { - int (*filter)(struct kset *kset, struct kobject *kobj); - const char *(*name)(struct kset *kset, struct kobject *kobj); - int (*uevent)(struct kset *kset, struct kobject *kobj, + int (* const filter)(struct kset *kset, struct kobject *kobj); + const char *(* const name)(struct kset *kset, struct kobject *kobj); + int (* const uevent)(struct kset *kset, struct kobject *kobj, struct kobj_uevent_env *env); }; diff --git a/Documentation/kref.txt b/Documentation/core-api/kref.rst index c61eea6f1bf2..c61eea6f1bf2 100644 --- a/Documentation/kref.txt +++ b/Documentation/core-api/kref.rst diff --git a/Documentation/core-api/printk-basics.rst b/Documentation/core-api/printk-basics.rst new file mode 100644 index 000000000000..563a9ce5fe1d --- /dev/null +++ b/Documentation/core-api/printk-basics.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +Message logging with printk +=========================== + +printk() is one of the most widely known functions in the Linux kernel. It's the +standard tool we have for printing messages and usually the most basic way of +tracing and debugging. If you're familiar with printf(3) you can tell printk() +is based on it, although it has some functional differences: + + - printk() messages can specify a log level. + + - the format string, while largely compatible with C99, doesn't follow the + exact same specification. It has some extensions and a few limitations + (no ``%n`` or floating point conversion specifiers). See :ref:`How to get + printk format specifiers right <printk-specifiers>`. + +All printk() messages are printed to the kernel log buffer, which is a ring +buffer exported to userspace through /dev/kmsg. The usual way to read it is +using ``dmesg``. + +printk() is typically used like this:: + + printk(KERN_INFO "Message: %s\n", arg); + +where ``KERN_INFO`` is the log level (note that it's concatenated to the format +string, the log level is not a separate argument). The available log levels are: + ++----------------+--------+-----------------------------------------------+ +| Name | String | Alias function | ++================+========+===============================================+ +| KERN_EMERG | "0" | pr_emerg() | ++----------------+--------+-----------------------------------------------+ +| KERN_ALERT | "1" | pr_alert() | ++----------------+--------+-----------------------------------------------+ +| KERN_CRIT | "2" | pr_crit() | ++----------------+--------+-----------------------------------------------+ +| KERN_ERR | "3" | pr_err() | ++----------------+--------+-----------------------------------------------+ +| KERN_WARNING | "4" | pr_warn() | ++----------------+--------+-----------------------------------------------+ +| KERN_NOTICE | "5" | pr_notice() | ++----------------+--------+-----------------------------------------------+ +| KERN_INFO | "6" | pr_info() | ++----------------+--------+-----------------------------------------------+ +| KERN_DEBUG | "7" | pr_debug() and pr_devel() if DEBUG is defined | ++----------------+--------+-----------------------------------------------+ +| KERN_DEFAULT | "" | | ++----------------+--------+-----------------------------------------------+ +| KERN_CONT | "c" | pr_cont() | ++----------------+--------+-----------------------------------------------+ + + +The log level specifies the importance of a message. The kernel decides whether +to show the message immediately (printing it to the current console) depending +on its log level and the current *console_loglevel* (a kernel variable). If the +message priority is higher (lower log level value) than the *console_loglevel* +the message will be printed to the console. + +If the log level is omitted, the message is printed with ``KERN_DEFAULT`` +level. + +You can check the current *console_loglevel* with:: + + $ cat /proc/sys/kernel/printk + 4 4 1 7 + +The result shows the *current*, *default*, *minimum* and *boot-time-default* log +levels. + +To change the current console_loglevel simply write the the desired level to +``/proc/sys/kernel/printk``. For example, to print all messages to the console:: + + # echo 8 > /proc/sys/kernel/printk + +Another way, using ``dmesg``:: + + # dmesg -n 5 + +sets the console_loglevel to print KERN_WARNING (4) or more severe messages to +console. See ``dmesg(1)`` for more information. + +As an alternative to printk() you can use the ``pr_*()`` aliases for +logging. This family of macros embed the log level in the macro names. For +example:: + + pr_info("Info message no. %d\n", msg_num); + +prints a ``KERN_INFO`` message. + +Besides being more concise than the equivalent printk() calls, they can use a +common definition for the format string through the pr_fmt() macro. For +instance, defining this at the top of a source file (before any ``#include`` +directive):: + + #define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__ + +would prefix every pr_*() message in that file with the module and function name +that originated the message. + +For debugging purposes there are also two conditionally-compiled macros: +pr_debug() and pr_devel(), which are compiled-out unless ``DEBUG`` (or +also ``CONFIG_DYNAMIC_DEBUG`` in the case of pr_debug()) is defined. + + +Function reference +================== + +.. kernel-doc:: kernel/printk/printk.c + :functions: printk + +.. kernel-doc:: include/linux/printk.h + :functions: pr_emerg pr_alert pr_crit pr_err pr_warn pr_notice pr_info + pr_fmt pr_debug pr_devel pr_cont diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index 5d8f1e84dd90..8c9aba262b1e 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -2,6 +2,8 @@ How to get printk format specifiers right ========================================= +.. _printk-specifiers: + :Author: Randy Dunlap <rdunlap@infradead.org> :Author: Andrew Murray <amurray@mpc-data.co.uk> diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst index 49d9833af871..ec575e72d0b2 100644 --- a/Documentation/core-api/protection-keys.rst +++ b/Documentation/core-api/protection-keys.rst @@ -5,8 +5,9 @@ Memory Protection Keys ====================== Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature -which is found on Intel's Skylake "Scalable Processor" Server CPUs. -It will be avalable in future non-server parts. +which is found on Intel's Skylake (and later) "Scalable Processor" +Server CPUs. It will be available in future non-server Intel parts +and future AMD processors. For anyone wishing to test or use this feature, it is available in Amazon's EC2 C5 instances and is known to work there using an Ubuntu diff --git a/Documentation/rbtree.txt b/Documentation/core-api/rbtree.rst index 523d54b60087..523d54b60087 100644 --- a/Documentation/rbtree.txt +++ b/Documentation/core-api/rbtree.rst diff --git a/Documentation/doc-guide/maintainer-profile.rst b/Documentation/doc-guide/maintainer-profile.rst index 5afc0ddba40a..755d39f0d407 100644 --- a/Documentation/doc-guide/maintainer-profile.rst +++ b/Documentation/doc-guide/maintainer-profile.rst @@ -6,7 +6,7 @@ Documentation subsystem maintainer entry profile The documentation "subsystem" is the central coordinating point for the kernel's documentation and associated infrastructure. It covers the hierarchy under Documentation/ (with the exception of -Documentation/device-tree), various utilities under scripts/ and, at least +Documentation/devicetree), various utilities under scripts/ and, at least some of the time, LICENSES/. It's worth noting, though, that the boundaries of this subsystem are rather diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index c78db28519f7..63dec76d1d8d 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -11,7 +11,7 @@ course not limited to GPU use cases. The three main components of this are: (1) dma-buf, representing a sg_table and exposed to userspace as a file descriptor to allow passing between devices, (2) fence, which provides a mechanism to signal when -one device as finished access, and (3) reservation, which manages the +one device has finished access, and (3) reservation, which manages the shared or exclusive fence(s) associated with the buffer. Shared DMA Buffers @@ -31,7 +31,7 @@ The exporter - implements and manages operations in :c:type:`struct dma_buf_ops <dma_buf_ops>` for the buffer, - allows other users to share the buffer by using dma_buf sharing APIs, - - manages the details of buffer allocation, wrapped int a :c:type:`struct + - manages the details of buffer allocation, wrapped in a :c:type:`struct dma_buf <dma_buf>`, - decides about the actual backing storage where this allocation happens, - and takes care of any migration of scatterlist - for all (shared) users of diff --git a/Documentation/driver-api/driver-model/device.rst b/Documentation/driver-api/driver-model/device.rst index 2b868d49d349..b9b022371e85 100644 --- a/Documentation/driver-api/driver-model/device.rst +++ b/Documentation/driver-api/driver-model/device.rst @@ -50,10 +50,10 @@ Attributes Attributes of devices can be exported by a device driver through sysfs. -Please see Documentation/filesystems/sysfs.txt for more information +Please see Documentation/filesystems/sysfs.rst for more information on how sysfs works. -As explained in Documentation/kobject.txt, device attributes must be +As explained in Documentation/core-api/kobject.rst, device attributes must be created before the KOBJ_ADD uevent is generated. The only way to realize that is by defining an attribute group. diff --git a/Documentation/driver-api/driver-model/overview.rst b/Documentation/driver-api/driver-model/overview.rst index d4d1e9b40e0c..e98d0ab4a9b6 100644 --- a/Documentation/driver-api/driver-model/overview.rst +++ b/Documentation/driver-api/driver-model/overview.rst @@ -121,4 +121,4 @@ device-specific data or tunable interfaces. More information about the sysfs directory layout can be found in the other documents in this directory and in the file -Documentation/filesystems/sysfs.txt. +Documentation/filesystems/sysfs.rst. diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index d4e78cb3ef4d..20c431c8e7be 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -39,6 +39,7 @@ available subsections can be seen below. spi i2c ipmb + ipmi i3c/index interconnect devfreq diff --git a/Documentation/IPMI.txt b/Documentation/driver-api/ipmi.rst index 5ef1047e2e66..5ef1047e2e66 100644 --- a/Documentation/IPMI.txt +++ b/Documentation/driver-api/ipmi.rst diff --git a/Documentation/driver-api/nvdimm/nvdimm.rst b/Documentation/driver-api/nvdimm/nvdimm.rst index 08f855cbb4e6..79c0fd39f2af 100644 --- a/Documentation/driver-api/nvdimm/nvdimm.rst +++ b/Documentation/driver-api/nvdimm/nvdimm.rst @@ -278,8 +278,8 @@ by a region device with a dynamically assigned id (REGION0 - REGION5). be contiguous in DPA-space. This bus is provided by the kernel under the device - /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and - the nfit_test.ko module is loaded. This not only test LIBNVDIMM but the + /sys/devices/platform/nfit_test.0 when the nfit_test.ko module from + tools/testing/nvdimm is loaded. This not only test LIBNVDIMM but the acpi_nfit.ko driver as well. diff --git a/Documentation/driver-api/thermal/cpu-idle-cooling.rst b/Documentation/driver-api/thermal/cpu-idle-cooling.rst index a1c3edecae00..b9f34ceb2a38 100644 --- a/Documentation/driver-api/thermal/cpu-idle-cooling.rst +++ b/Documentation/driver-api/thermal/cpu-idle-cooling.rst @@ -1,3 +1,6 @@ +================ +CPU Idle Cooling +================ Situation: ---------- diff --git a/Documentation/driver-api/thermal/index.rst b/Documentation/driver-api/thermal/index.rst index 5ba61d19c6ae..4cb0b9b6bfb8 100644 --- a/Documentation/driver-api/thermal/index.rst +++ b/Documentation/driver-api/thermal/index.rst @@ -8,6 +8,7 @@ Thermal :maxdepth: 1 cpu-cooling-api + cpu-idle-cooling sysfs-api power_allocator diff --git a/Documentation/fb/efifb.rst b/Documentation/fb/efifb.rst index 04840331a00e..6badff64756f 100644 --- a/Documentation/fb/efifb.rst +++ b/Documentation/fb/efifb.rst @@ -2,8 +2,10 @@ What is efifb? ============== -This is a generic EFI platform driver for Intel based Apple computers. -efifb is only for EFI booted Intel Macs. +This is a generic EFI platform driver for systems with UEFI firmware. The +system must be booted via the EFI stub for this to be usable. efifb supports +both firmware with Graphics Output Protocol (GOP) displays as well as older +systems with only Universal Graphics Adapter (UGA) displays. Supported Hardware ================== @@ -12,11 +14,14 @@ Supported Hardware - Macbook - Macbook Pro 15"/17" - MacMini +- ARM/ARM64/X86 systems with UEFI firmware How to use it? ============== -efifb does not have any kind of autodetection of your machine. +For UGA displays, efifb does not have any kind of autodetection of your +machine. + You have to add the following kernel parameters in your elilo.conf:: Macbook : @@ -28,6 +33,9 @@ You have to add the following kernel parameters in your elilo.conf:: Macbook Pro 17", iMac 20" : video=efifb:i20 +For GOP displays, efifb can autodetect the display's resolution and framebuffer +address, so these should work out of the box without any special parameters. + Accepted options: ======= =========================================================== @@ -36,4 +44,28 @@ nowc Don't map the framebuffer write combined. This can be used when large amounts of console data are written. ======= =========================================================== +Options for GOP displays: + +mode=n + The EFI stub will set the mode of the display to mode number n if + possible. + +<xres>x<yres>[-(rgb|bgr|<bpp>)] + The EFI stub will search for a display mode that matches the specified + horizontal and vertical resolution, and optionally bit depth, and set + the mode of the display to it if one is found. The bit depth can either + "rgb" or "bgr" to match specifically those pixel formats, or a number + for a mode with matching bits per pixel. + +auto + The EFI stub will choose the mode with the highest resolution (product + of horizontal and vertical resolution). If there are multiple modes + with the highest resolution, it will choose one with the highest color + depth. + +list + The EFI stub will list out all the display modes that are available. A + specific mode can then be chosen using one of the above options for the + next boot. + Edgar Hucek <gimli@dark-green.com> diff --git a/Documentation/features/core/eBPF-JIT/arch-support.txt b/Documentation/features/core/eBPF-JIT/arch-support.txt index 9ae6e8d0d10d..9ed964f65224 100644 --- a/Documentation/features/core/eBPF-JIT/arch-support.txt +++ b/Documentation/features/core/eBPF-JIT/arch-support.txt @@ -23,7 +23,7 @@ | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | - | riscv: | TODO | + | riscv: | ok | | s390: | ok | | sh: | TODO | | sparc: | ok | diff --git a/Documentation/features/debug/KASAN/arch-support.txt b/Documentation/features/debug/KASAN/arch-support.txt index 304dcd461795..6ff38548923e 100644 --- a/Documentation/features/debug/KASAN/arch-support.txt +++ b/Documentation/features/debug/KASAN/arch-support.txt @@ -22,9 +22,9 @@ | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | - | powerpc: | TODO | - | riscv: | TODO | - | s390: | TODO | + | powerpc: | ok | + | riscv: | ok | + | s390: | ok | | sh: | TODO | | sparc: | TODO | | um: | TODO | diff --git a/Documentation/features/debug/gcov-profile-all/arch-support.txt b/Documentation/features/debug/gcov-profile-all/arch-support.txt index 6fb2b0671994..210256f6a4cf 100644 --- a/Documentation/features/debug/gcov-profile-all/arch-support.txt +++ b/Documentation/features/debug/gcov-profile-all/arch-support.txt @@ -11,7 +11,7 @@ | arm: | ok | | arm64: | ok | | c6x: | TODO | - | csky: | TODO | + | csky: | ok | | h8300: | TODO | | hexagon: | TODO | | ia64: | TODO | diff --git a/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt b/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt index 32b297295fff..97cd7aa74905 100644 --- a/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt +++ b/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt @@ -11,7 +11,7 @@ | arm: | TODO | | arm64: | TODO | | c6x: | TODO | - | csky: | TODO | + | csky: | ok | | h8300: | TODO | | hexagon: | TODO | | ia64: | TODO | diff --git a/Documentation/features/debug/kprobes/arch-support.txt b/Documentation/features/debug/kprobes/arch-support.txt index e68239b5d2f0..8b316c6e03d4 100644 --- a/Documentation/features/debug/kprobes/arch-support.txt +++ b/Documentation/features/debug/kprobes/arch-support.txt @@ -11,7 +11,7 @@ | arm: | ok | | arm64: | ok | | c6x: | TODO | - | csky: | TODO | + | csky: | ok | | h8300: | TODO | | hexagon: | TODO | | ia64: | ok | @@ -23,7 +23,7 @@ | openrisc: | TODO | | parisc: | ok | | powerpc: | ok | - | riscv: | ok | + | riscv: | TODO | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/debug/kretprobes/arch-support.txt b/Documentation/features/debug/kretprobes/arch-support.txt index f17131b328e5..b805aada395e 100644 --- a/Documentation/features/debug/kretprobes/arch-support.txt +++ b/Documentation/features/debug/kretprobes/arch-support.txt @@ -11,7 +11,7 @@ | arm: | ok | | arm64: | ok | | c6x: | TODO | - | csky: | TODO | + | csky: | ok | | h8300: | TODO | | hexagon: | TODO | | ia64: | ok | diff --git a/Documentation/features/debug/stackprotector/arch-support.txt b/Documentation/features/debug/stackprotector/arch-support.txt index 32bbdfc64c32..12410f606edc 100644 --- a/Documentation/features/debug/stackprotector/arch-support.txt +++ b/Documentation/features/debug/stackprotector/arch-support.txt @@ -11,7 +11,7 @@ | arm: | ok | | arm64: | ok | | c6x: | TODO | - | csky: | TODO | + | csky: | ok | | h8300: | TODO | | hexagon: | TODO | | ia64: | TODO | diff --git a/Documentation/features/debug/uprobes/arch-support.txt b/Documentation/features/debug/uprobes/arch-support.txt index 1c577d0cfc7f..be8acbb95b54 100644 --- a/Documentation/features/debug/uprobes/arch-support.txt +++ b/Documentation/features/debug/uprobes/arch-support.txt @@ -11,7 +11,7 @@ | arm: | ok | | arm64: | ok | | c6x: | TODO | - | csky: | TODO | + | csky: | ok | | h8300: | TODO | | hexagon: | TODO | | ia64: | TODO | diff --git a/Documentation/features/io/dma-contiguous/arch-support.txt b/Documentation/features/io/dma-contiguous/arch-support.txt index eb28b5c97ca6..895c3b0f6492 100644 --- a/Documentation/features/io/dma-contiguous/arch-support.txt +++ b/Documentation/features/io/dma-contiguous/arch-support.txt @@ -16,7 +16,7 @@ | hexagon: | TODO | | ia64: | TODO | | m68k: | TODO | - | microblaze: | TODO | + | microblaze: | ok | | mips: | ok | | nds32: | TODO | | nios2: | TODO | diff --git a/Documentation/features/locking/lockdep/arch-support.txt b/Documentation/features/locking/lockdep/arch-support.txt index 941fd5b1094d..98cb9d85c55d 100644 --- a/Documentation/features/locking/lockdep/arch-support.txt +++ b/Documentation/features/locking/lockdep/arch-support.txt @@ -11,7 +11,7 @@ | arm: | ok | | arm64: | ok | | c6x: | TODO | - | csky: | TODO | + | csky: | ok | | h8300: | TODO | | hexagon: | ok | | ia64: | TODO | diff --git a/Documentation/features/perf/kprobes-event/arch-support.txt b/Documentation/features/perf/kprobes-event/arch-support.txt index d8278bf62b85..518f352fc727 100644 --- a/Documentation/features/perf/kprobes-event/arch-support.txt +++ b/Documentation/features/perf/kprobes-event/arch-support.txt @@ -11,7 +11,7 @@ | arm: | ok | | arm64: | ok | | c6x: | TODO | - | csky: | TODO | + | csky: | ok | | h8300: | TODO | | hexagon: | ok | | ia64: | TODO | @@ -21,7 +21,7 @@ | nds32: | ok | | nios2: | TODO | | openrisc: | TODO | - | parisc: | TODO | + | parisc: | ok | | powerpc: | ok | | riscv: | TODO | | s390: | ok | diff --git a/Documentation/features/perf/perf-regs/arch-support.txt b/Documentation/features/perf/perf-regs/arch-support.txt index 687d049d9cee..c22cd6f8aa5e 100644 --- a/Documentation/features/perf/perf-regs/arch-support.txt +++ b/Documentation/features/perf/perf-regs/arch-support.txt @@ -11,7 +11,7 @@ | arm: | ok | | arm64: | ok | | c6x: | TODO | - | csky: | TODO | + | csky: | ok | | h8300: | TODO | | hexagon: | TODO | | ia64: | TODO | @@ -23,7 +23,7 @@ | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | - | riscv: | TODO | + | riscv: | ok | | s390: | ok | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/perf/perf-stackdump/arch-support.txt b/Documentation/features/perf/perf-stackdump/arch-support.txt index 90996e3d18a8..527fe4d0b074 100644 --- a/Documentation/features/perf/perf-stackdump/arch-support.txt +++ b/Documentation/features/perf/perf-stackdump/arch-support.txt @@ -11,7 +11,7 @@ | arm: | ok | | arm64: | ok | | c6x: | TODO | - | csky: | TODO | + | csky: | ok | | h8300: | TODO | | hexagon: | TODO | | ia64: | TODO | @@ -23,7 +23,7 @@ | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | - | riscv: | TODO | + | riscv: | ok | | s390: | ok | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/seccomp/seccomp-filter/arch-support.txt b/Documentation/features/seccomp/seccomp-filter/arch-support.txt index 4fe6c3c3be5c..c7b837f735b1 100644 --- a/Documentation/features/seccomp/seccomp-filter/arch-support.txt +++ b/Documentation/features/seccomp/seccomp-filter/arch-support.txt @@ -23,7 +23,7 @@ | openrisc: | TODO | | parisc: | ok | | powerpc: | ok | - | riscv: | TODO | + | riscv: | ok | | s390: | ok | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/vm/huge-vmap/arch-support.txt b/Documentation/features/vm/huge-vmap/arch-support.txt index 019131c5acce..8525f1981f19 100644 --- a/Documentation/features/vm/huge-vmap/arch-support.txt +++ b/Documentation/features/vm/huge-vmap/arch-support.txt @@ -22,7 +22,7 @@ | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | - | powerpc: | TODO | + | powerpc: | ok | | riscv: | TODO | | s390: | TODO | | sh: | TODO | diff --git a/Documentation/features/vm/pte_special/arch-support.txt b/Documentation/features/vm/pte_special/arch-support.txt index 3d492a34c8ee..2e017387e228 100644 --- a/Documentation/features/vm/pte_special/arch-support.txt +++ b/Documentation/features/vm/pte_special/arch-support.txt @@ -17,7 +17,7 @@ | ia64: | TODO | | m68k: | TODO | | microblaze: | TODO | - | mips: | TODO | + | mips: | ok | | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst index 671fef39a802..2995279ddc24 100644 --- a/Documentation/filesystems/9p.rst +++ b/Documentation/filesystems/9p.rst @@ -192,4 +192,4 @@ For more information on the Plan 9 Operating System check out http://plan9.bell-labs.com/plan9 For information on Plan 9 from User Space (Plan 9 applications and libraries -ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 +ported to Linux/BSD/OSX/etc) check out https://9fans.github.io/plan9port/ diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.rst index 7d9f82607562..430f0b40796b 100644 --- a/Documentation/filesystems/automount-support.txt +++ b/Documentation/filesystems/automount-support.rst @@ -1,3 +1,10 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Automount Support +================= + + Support is available for filesystems that wish to do automounting support (such as kAFS which can be found in fs/afs/ and NFS in fs/nfs/). This facility includes allowing in-kernel mounts to be @@ -5,13 +12,12 @@ performed and mountpoint degradation to be requested. The latter can also be requested by userspace. -====================== -IN-KERNEL AUTOMOUNTING +In-Kernel Automounting ====================== See section "Mount Traps" of Documentation/filesystems/autofs.rst -Then from userspace, you can just do something like: +Then from userspace, you can just do something like:: [root@andromeda root]# mount -t afs \#root.afs. /afs [root@andromeda root]# ls /afs @@ -21,7 +27,7 @@ Then from userspace, you can just do something like: [root@andromeda root]# ls /afs/cambridge/afsdoc/ ChangeLog html LICENSE pdf RELNOTES-1.2.2 -And then if you look in the mountpoint catalogue, you'll see something like: +And then if you look in the mountpoint catalogue, you'll see something like:: [root@andromeda root]# cat /proc/mounts ... @@ -30,8 +36,7 @@ And then if you look in the mountpoint catalogue, you'll see something like: #afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0 -=========================== -AUTOMATIC MOUNTPOINT EXPIRY +Automatic Mountpoint Expiry =========================== Automatic expiration of mountpoints is easy, provided you've mounted the @@ -43,7 +48,8 @@ To do expiration, you need to follow these steps: hung. (2) When a new mountpoint is created in the ->d_automount method, add - the mnt to the list using mnt_set_expiry() + the mnt to the list using mnt_set_expiry():: + mnt_set_expiry(newmnt, &afs_vfsmounts); (3) When you want mountpoints to be expired, call mark_mounts_for_expiry() @@ -70,8 +76,7 @@ and the copies of those that are on an expiration list will be added to the same expiration list. -======================= -USERSPACE DRIVEN EXPIRY +Userspace Driven Expiry ======================= As an alternative, it is possible for userspace to request expiry of any diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.rst index c418280c915f..19fbf6b9aa36 100644 --- a/Documentation/filesystems/caching/backend-api.txt +++ b/Documentation/filesystems/caching/backend-api.rst @@ -1,6 +1,8 @@ - ========================== - FS-CACHE CACHE BACKEND API - ========================== +.. SPDX-License-Identifier: GPL-2.0 + +========================== +FS-Cache Cache backend API +========================== The FS-Cache system provides an API by which actual caches can be supplied to FS-Cache for it to then serve out to network filesystems and other interested @@ -9,15 +11,14 @@ parties. This API is declared in <linux/fscache-cache.h>. -==================================== -INITIALISING AND REGISTERING A CACHE +Initialising and Registering a Cache ==================================== To start off, a cache definition must be initialised and registered for each cache the backend wants to make available. For instance, CacheFS does this in the fill_super() operation on mounting. -The cache definition (struct fscache_cache) should be initialised by calling: +The cache definition (struct fscache_cache) should be initialised by calling:: void fscache_init_cache(struct fscache_cache *cache, struct fscache_cache_ops *ops, @@ -26,17 +27,17 @@ The cache definition (struct fscache_cache) should be initialised by calling: Where: - (*) "cache" is a pointer to the cache definition; + * "cache" is a pointer to the cache definition; - (*) "ops" is a pointer to the table of operations that the backend supports on + * "ops" is a pointer to the table of operations that the backend supports on this cache; and - (*) "idfmt" is a format and printf-style arguments for constructing a label + * "idfmt" is a format and printf-style arguments for constructing a label for the cache. The cache should then be registered with FS-Cache by passing a pointer to the -previously initialised cache definition to: +previously initialised cache definition to:: int fscache_add_cache(struct fscache_cache *cache, struct fscache_object *fsdef, @@ -44,12 +45,12 @@ previously initialised cache definition to: Two extra arguments should also be supplied: - (*) "fsdef" which should point to the object representation for the FS-Cache + * "fsdef" which should point to the object representation for the FS-Cache master index in this cache. Netfs primary index entries will be created here. FS-Cache keeps the caller's reference to the index object if successful and will release it upon withdrawal of the cache. - (*) "tagname" which, if given, should be a text string naming this cache. If + * "tagname" which, if given, should be a text string naming this cache. If this is NULL, the identifier will be used instead. For CacheFS, the identifier is set to name the underlying block device and the tag can be supplied by mount. @@ -58,20 +59,18 @@ This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag is already in use. 0 will be returned on success. -===================== -UNREGISTERING A CACHE +Unregistering a Cache ===================== A cache can be withdrawn from the system by calling this function with a -pointer to the cache definition: +pointer to the cache definition:: void fscache_withdraw_cache(struct fscache_cache *cache); In CacheFS's case, this is called by put_super(). -======== -SECURITY +Security ======== The cache methods are executed one of two contexts: @@ -89,8 +88,7 @@ be masqueraded for the duration of the cache driver's access to the cache. This is left to the cache to handle; FS-Cache makes no effort in this regard. -=================================== -CONTROL AND STATISTICS PRESENTATION +Control and Statistics Presentation =================================== The cache may present data to the outside world through FS-Cache's interfaces @@ -101,11 +99,10 @@ is enabled. This is accessible through the kobject struct fscache_cache::kobj and is for use by the cache as it sees fit. -======================== -RELEVANT DATA STRUCTURES +Relevant Data Structures ======================== - (*) Index/Data file FS-Cache representation cookie: + * Index/Data file FS-Cache representation cookie:: struct fscache_cookie { struct fscache_object_def *def; @@ -121,7 +118,7 @@ RELEVANT DATA STRUCTURES cache operations. - (*) In-cache object representation: + * In-cache object representation:: struct fscache_object { int debug_id; @@ -150,7 +147,7 @@ RELEVANT DATA STRUCTURES initialised by calling fscache_object_init(object). - (*) FS-Cache operation record: + * FS-Cache operation record:: struct fscache_operation { atomic_t usage; @@ -173,7 +170,7 @@ RELEVANT DATA STRUCTURES an operation needs more processing time, it should be enqueued again. - (*) FS-Cache retrieval operation record: + * FS-Cache retrieval operation record:: struct fscache_retrieval { struct fscache_operation op; @@ -198,7 +195,7 @@ RELEVANT DATA STRUCTURES it sees fit. - (*) FS-Cache storage operation record: + * FS-Cache storage operation record:: struct fscache_storage { struct fscache_operation op; @@ -212,16 +209,17 @@ RELEVANT DATA STRUCTURES storage. -================ -CACHE OPERATIONS +Cache Operations ================ The cache backend provides FS-Cache with a table of operations that can be performed on the denizens of the cache. These are held in a structure of type: - struct fscache_cache_ops + :: + + struct fscache_cache_ops - (*) Name of cache provider [mandatory]: + * Name of cache provider [mandatory]:: const char *name @@ -229,7 +227,7 @@ performed on the denizens of the cache. These are held in a structure of type: the backend. - (*) Allocate a new object [mandatory]: + * Allocate a new object [mandatory]:: struct fscache_object *(*alloc_object)(struct fscache_cache *cache, struct fscache_cookie *cookie) @@ -244,7 +242,7 @@ performed on the denizens of the cache. These are held in a structure of type: form once lookup is complete or aborted. - (*) Look up and create object [mandatory]: + * Look up and create object [mandatory]:: void (*lookup_object)(struct fscache_object *object) @@ -263,7 +261,7 @@ performed on the denizens of the cache. These are held in a structure of type: to abort the lookup of that object. - (*) Release lookup data [mandatory]: + * Release lookup data [mandatory]:: void (*lookup_complete)(struct fscache_object *object) @@ -271,7 +269,7 @@ performed on the denizens of the cache. These are held in a structure of type: using to perform a lookup. - (*) Increment object refcount [mandatory]: + * Increment object refcount [mandatory]:: struct fscache_object *(*grab_object)(struct fscache_object *object) @@ -280,7 +278,7 @@ performed on the denizens of the cache. These are held in a structure of type: It should return the object pointer if successful. - (*) Lock/Unlock object [mandatory]: + * Lock/Unlock object [mandatory]:: void (*lock_object)(struct fscache_object *object) void (*unlock_object)(struct fscache_object *object) @@ -289,7 +287,7 @@ performed on the denizens of the cache. These are held in a structure of type: to schedule with the lock held, so a spinlock isn't sufficient. - (*) Pin/Unpin object [optional]: + * Pin/Unpin object [optional]:: int (*pin_object)(struct fscache_object *object) void (*unpin_object)(struct fscache_object *object) @@ -299,7 +297,7 @@ performed on the denizens of the cache. These are held in a structure of type: enough space in the cache to permit this. - (*) Check coherency state of an object [mandatory]: + * Check coherency state of an object [mandatory]:: int (*check_consistency)(struct fscache_object *object) @@ -308,7 +306,7 @@ performed on the denizens of the cache. These are held in a structure of type: if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS may also be returned. - (*) Update object [mandatory]: + * Update object [mandatory]:: int (*update_object)(struct fscache_object *object) @@ -317,7 +315,7 @@ performed on the denizens of the cache. These are held in a structure of type: obtained by calling object->cookie->def->get_aux()/get_attr(). - (*) Invalidate data object [mandatory]: + * Invalidate data object [mandatory]:: int (*invalidate_object)(struct fscache_operation *op) @@ -329,7 +327,7 @@ performed on the denizens of the cache. These are held in a structure of type: fscache_op_complete() must be called on op before returning. - (*) Discard object [mandatory]: + * Discard object [mandatory]:: void (*drop_object)(struct fscache_object *object) @@ -341,7 +339,7 @@ performed on the denizens of the cache. These are held in a structure of type: caller. The caller will invoke the put_object() method as appropriate. - (*) Release object reference [mandatory]: + * Release object reference [mandatory]:: void (*put_object)(struct fscache_object *object) @@ -349,7 +347,7 @@ performed on the denizens of the cache. These are held in a structure of type: be freed when all the references to it are released. - (*) Synchronise a cache [mandatory]: + * Synchronise a cache [mandatory]:: void (*sync)(struct fscache_cache *cache) @@ -357,7 +355,7 @@ performed on the denizens of the cache. These are held in a structure of type: device. - (*) Dissociate a cache [mandatory]: + * Dissociate a cache [mandatory]:: void (*dissociate_pages)(struct fscache_cache *cache) @@ -365,7 +363,7 @@ performed on the denizens of the cache. These are held in a structure of type: cache withdrawal. - (*) Notification that the attributes on a netfs file changed [mandatory]: + * Notification that the attributes on a netfs file changed [mandatory]:: int (*attr_changed)(struct fscache_object *object); @@ -386,7 +384,7 @@ performed on the denizens of the cache. These are held in a structure of type: execution of this operation. - (*) Reserve cache space for an object's data [optional]: + * Reserve cache space for an object's data [optional]:: int (*reserve_space)(struct fscache_object *object, loff_t size); @@ -404,7 +402,7 @@ performed on the denizens of the cache. These are held in a structure of type: size if larger than that already. - (*) Request page be read from cache [mandatory]: + * Request page be read from cache [mandatory]:: int (*read_or_alloc_page)(struct fscache_retrieval *op, struct page *page, @@ -446,7 +444,7 @@ performed on the denizens of the cache. These are held in a structure of type: with. This will complete the operation when all pages are dealt with. - (*) Request pages be read from cache [mandatory]: + * Request pages be read from cache [mandatory]:: int (*read_or_alloc_pages)(struct fscache_retrieval *op, struct list_head *pages, @@ -457,7 +455,7 @@ performed on the denizens of the cache. These are held in a structure of type: of pages instead of one page. Any pages on which a read operation is started must be added to the page cache for the specified mapping and also to the LRU. Such pages must also be removed from the pages list and - *nr_pages decremented per page. + ``*nr_pages`` decremented per page. If there was an error such as -ENOMEM, then that should be returned; else if one or more pages couldn't be read or allocated, then -ENOBUFS should @@ -466,7 +464,7 @@ performed on the denizens of the cache. These are held in a structure of type: returned. - (*) Request page be allocated in the cache [mandatory]: + * Request page be allocated in the cache [mandatory]:: int (*allocate_page)(struct fscache_retrieval *op, struct page *page, @@ -482,7 +480,7 @@ performed on the denizens of the cache. These are held in a structure of type: allocated, then the netfs page should be marked and 0 returned. - (*) Request pages be allocated in the cache [mandatory]: + * Request pages be allocated in the cache [mandatory]:: int (*allocate_pages)(struct fscache_retrieval *op, struct list_head *pages, @@ -493,7 +491,7 @@ performed on the denizens of the cache. These are held in a structure of type: nr_pages should be treated as for the read_or_alloc_pages() method. - (*) Request page be written to cache [mandatory]: + * Request page be written to cache [mandatory]:: int (*write_page)(struct fscache_storage *op, struct page *page); @@ -514,7 +512,7 @@ performed on the denizens of the cache. These are held in a structure of type: appropriately. - (*) Discard retained per-page metadata [mandatory]: + * Discard retained per-page metadata [mandatory]:: void (*uncache_page)(struct fscache_object *object, struct page *page) @@ -523,13 +521,12 @@ performed on the denizens of the cache. These are held in a structure of type: maintains for this page. -================== -FS-CACHE UTILITIES +FS-Cache Utilities ================== FS-Cache provides some utilities that a cache backend may make use of: - (*) Note occurrence of an I/O error in a cache: + * Note occurrence of an I/O error in a cache:: void fscache_io_error(struct fscache_cache *cache) @@ -541,7 +538,7 @@ FS-Cache provides some utilities that a cache backend may make use of: This does not actually withdraw the cache. That must be done separately. - (*) Invoke the retrieval I/O completion function: + * Invoke the retrieval I/O completion function:: void fscache_end_io(struct fscache_retrieval *op, struct page *page, int error); @@ -550,8 +547,8 @@ FS-Cache provides some utilities that a cache backend may make use of: error value should be 0 if successful and an error otherwise. - (*) Record that one or more pages being retrieved or allocated have been dealt - with: + * Record that one or more pages being retrieved or allocated have been dealt + with:: void fscache_retrieval_complete(struct fscache_retrieval *op, int n_pages); @@ -562,7 +559,7 @@ FS-Cache provides some utilities that a cache backend may make use of: completed. - (*) Record operation completion: + * Record operation completion:: void fscache_op_complete(struct fscache_operation *op); @@ -571,7 +568,7 @@ FS-Cache provides some utilities that a cache backend may make use of: one or more pending operations to start running. - (*) Set highest store limit: + * Set highest store limit:: void fscache_set_store_limit(struct fscache_object *object, loff_t i_size); @@ -581,7 +578,7 @@ FS-Cache provides some utilities that a cache backend may make use of: rejected by fscache_read_alloc_page() and co with -ENOBUFS. - (*) Mark pages as being cached: + * Mark pages as being cached:: void fscache_mark_pages_cached(struct fscache_retrieval *op, struct pagevec *pagevec); @@ -590,7 +587,7 @@ FS-Cache provides some utilities that a cache backend may make use of: the netfs must call fscache_uncache_page() to unmark the pages. - (*) Perform coherency check on an object: + * Perform coherency check on an object:: enum fscache_checkaux fscache_check_aux(struct fscache_object *object, const void *data, @@ -603,29 +600,26 @@ FS-Cache provides some utilities that a cache backend may make use of: One of three values will be returned: - (*) FSCACHE_CHECKAUX_OKAY - + FSCACHE_CHECKAUX_OKAY The coherency data indicates the object is valid as is. - (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - + FSCACHE_CHECKAUX_NEEDS_UPDATE The coherency data needs updating, but otherwise the object is valid. - (*) FSCACHE_CHECKAUX_OBSOLETE - + FSCACHE_CHECKAUX_OBSOLETE The coherency data indicates that the object is obsolete and should be discarded. - (*) Initialise a freshly allocated object: + * Initialise a freshly allocated object:: void fscache_object_init(struct fscache_object *object); This initialises all the fields in an object representation. - (*) Indicate the destruction of an object: + * Indicate the destruction of an object:: void fscache_object_destroyed(struct fscache_cache *cache); @@ -635,7 +629,7 @@ FS-Cache provides some utilities that a cache backend may make use of: all the objects. - (*) Indicate negative lookup on an object: + * Indicate negative lookup on an object:: void fscache_object_lookup_negative(struct fscache_object *object); @@ -650,7 +644,7 @@ FS-Cache provides some utilities that a cache backend may make use of: significant - all subsequent calls are ignored. - (*) Indicate an object has been obtained: + * Indicate an object has been obtained:: void fscache_obtained_object(struct fscache_object *object); @@ -667,7 +661,7 @@ FS-Cache provides some utilities that a cache backend may make use of: (2) that writes may now proceed against this object. - (*) Indicate that object lookup failed: + * Indicate that object lookup failed:: void fscache_object_lookup_error(struct fscache_object *object); @@ -676,7 +670,7 @@ FS-Cache provides some utilities that a cache backend may make use of: as possible. - (*) Indicate that a stale object was found and discarded: + * Indicate that a stale object was found and discarded:: void fscache_object_retrying_stale(struct fscache_object *object); @@ -685,7 +679,7 @@ FS-Cache provides some utilities that a cache backend may make use of: discarded from the cache and the lookup will be performed again. - (*) Indicate that the caching backend killed an object: + * Indicate that the caching backend killed an object:: void fscache_object_mark_killed(struct fscache_object *object, enum fscache_why_object_killed why); @@ -693,13 +687,20 @@ FS-Cache provides some utilities that a cache backend may make use of: This is called to indicate that the cache backend preemptively killed an object. The why parameter should be set to indicate the reason: - FSCACHE_OBJECT_IS_STALE - the object was stale and needs discarding. - FSCACHE_OBJECT_NO_SPACE - there was insufficient cache space - FSCACHE_OBJECT_WAS_RETIRED - the object was retired when relinquished. - FSCACHE_OBJECT_WAS_CULLED - the object was culled to make space. + FSCACHE_OBJECT_IS_STALE + - the object was stale and needs discarding. + + FSCACHE_OBJECT_NO_SPACE + - there was insufficient cache space + + FSCACHE_OBJECT_WAS_RETIRED + - the object was retired when relinquished. + + FSCACHE_OBJECT_WAS_CULLED + - the object was culled to make space. - (*) Get and release references on a retrieval record: + * Get and release references on a retrieval record:: void fscache_get_retrieval(struct fscache_retrieval *op); void fscache_put_retrieval(struct fscache_retrieval *op); @@ -708,7 +709,7 @@ FS-Cache provides some utilities that a cache backend may make use of: asynchronous data retrieval and block allocation. - (*) Enqueue a retrieval record for processing. + * Enqueue a retrieval record for processing:: void fscache_enqueue_retrieval(struct fscache_retrieval *op); @@ -718,7 +719,7 @@ FS-Cache provides some utilities that a cache backend may make use of: within the callback function. - (*) List of object state names: + * List of object state names:: const char *fscache_object_states[]; diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.rst index 28aefcbb1442..65d3db476765 100644 --- a/Documentation/filesystems/caching/cachefiles.txt +++ b/Documentation/filesystems/caching/cachefiles.rst @@ -1,8 +1,10 @@ - =============================================== - CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM - =============================================== +.. SPDX-License-Identifier: GPL-2.0 -Contents: +=============================================== +CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM +=============================================== + +.. Contents: (*) Overview. @@ -27,8 +29,8 @@ Contents: (*) Debugging. -======== -OVERVIEW + +Overview ======== CacheFiles is a caching backend that's meant to use as a cache a directory on @@ -58,8 +60,8 @@ spare space and automatically contract when the set of data requires more space. -============ -REQUIREMENTS + +Requirements ============ The use of CacheFiles and its daemon requires the following features to be @@ -79,84 +81,70 @@ It is strongly recommended that the "dir_index" option is enabled on Ext3 filesystems being used as a cache. -============= -CONFIGURATION +Configuration ============= The cache is configured by a script in /etc/cachefilesd.conf. These commands set up cache ready for use. The following script commands are available: - (*) brun <N>% - (*) bcull <N>% - (*) bstop <N>% - (*) frun <N>% - (*) fcull <N>% - (*) fstop <N>% - + brun <N>%, bcull <N>%, bstop <N>%, frun <N>%, fcull <N>%, fstop <N>% Configure the culling limits. Optional. See the section on culling The defaults are 7% (run), 5% (cull) and 1% (stop) respectively. The commands beginning with a 'b' are file space (block) limits, those beginning with an 'f' are file count limits. - (*) dir <path> - + dir <path> Specify the directory containing the root of the cache. Mandatory. - (*) tag <name> - + tag <name> Specify a tag to FS-Cache to use in distinguishing multiple caches. Optional. The default is "CacheFiles". - (*) debug <mask> - + debug <mask> Specify a numeric bitmask to control debugging in the kernel module. Optional. The default is zero (all off). The following values can be OR'd into the mask to collect various information: + == ================================================= 1 Turn on trace of function entry (_enter() macros) 2 Turn on trace of function exit (_leave() macros) 4 Turn on trace of internal debug points (_debug()) + == ================================================= - This mask can also be set through sysfs, eg: + This mask can also be set through sysfs, eg:: echo 5 >/sys/modules/cachefiles/parameters/debug -================== -STARTING THE CACHE +Starting the Cache ================== The cache is started by running the daemon. The daemon opens the cache device, configures the cache and tells it to begin caching. At that point the cache binds to fscache and the cache becomes live. -The daemon is run as follows: +The daemon is run as follows:: /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>] The flags are: - (*) -d - + ``-d`` Increase the debugging level. This can be specified multiple times and is cumulative with itself. - (*) -s - + ``-s`` Send messages to stderr instead of syslog. - (*) -n - + ``-n`` Don't daemonise and go into background. - (*) -f <configfile> - + ``-f <configfile>`` Use an alternative configuration file rather than the default one. -=============== -THINGS TO AVOID +Things to Avoid =============== Do not mount other things within the cache as this will cause problems. The @@ -179,8 +167,7 @@ Do not chmod files in the cache. The module creates things with minimal permissions to prevent random users being able to access them directly. -============= -CACHE CULLING +Cache Culling ============= The cache may need culling occasionally to make space. This involves @@ -192,27 +179,21 @@ Cache culling is done on the basis of the percentage of blocks and the percentage of files available in the underlying filesystem. There are six "limits": - (*) brun - (*) frun - + brun, frun If the amount of free space and the number of available files in the cache rises above both these limits, then culling is turned off. - (*) bcull - (*) fcull - + bcull, fcull If the amount of available space or the number of available files in the cache falls below either of these limits, then culling is started. - (*) bstop - (*) fstop - + bstop, fstop If the amount of available space or the number of available files in the cache falls below either of these limits, then no further allocation of disk space or files is permitted until culling has raised things above these limits again. -These must be configured thusly: +These must be configured thusly:: 0 <= bstop < bcull < brun < 100 0 <= fstop < fcull < frun < 100 @@ -226,16 +207,14 @@ started as soon as space is made in the table. Objects will be skipped if their atimes have changed or if the kernel module says it is still using them. -=============== -CACHE STRUCTURE +Cache Structure =============== The CacheFiles module will create two directories in the directory it was given: - (*) cache/ - - (*) graveyard/ + * cache/ + * graveyard/ The active cache objects all reside in the first directory. The CacheFiles kernel module moves any retired or culled objects that it can't simply unlink @@ -261,10 +240,10 @@ If an object has children, then it will be represented as a directory. Immediately in the representative directory are a collection of directories named for hash values of the child object keys with an '@' prepended. Into this directory, if possible, will be placed the representations of the child -objects: +objects:: - INDEX INDEX INDEX DATA FILES - ========= ========== ================================= ================ + /INDEX /INDEX /INDEX /DATA FILES + /=========/==========/=================================/================ cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400 cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry @@ -275,7 +254,7 @@ If the key is so long that it exceeds NAME_MAX with the decorations added on to it, then it will be cut into pieces, the first few of which will be used to make a nest of directories, and the last one of which will be the objects inside the last directory. The names of the intermediate directories will have -'+' prepended: +'+' prepended:: J1223/@23/+xy...z/+kl...m/Epqr @@ -288,11 +267,13 @@ To handle this, CacheFiles will use a suitably printable filename directly and "base-64" encode ones that aren't directly suitable. The two versions of object filenames indicate the encoding: + =============== =============== =============== OBJECT TYPE PRINTABLE ENCODED =============== =============== =============== Index "I..." "J..." Data "D..." "E..." Special "S..." "T..." + =============== =============== =============== Intermediate directories are always "@" or "+" as appropriate. @@ -307,8 +288,7 @@ Note that CacheFiles will erase from the cache any file it doesn't recognise or any file of an incorrect type (such as a FIFO file or a device file). -========================== -SECURITY MODEL AND SELINUX +Security Model and SELinux ========================== CacheFiles is implemented to deal properly with the LSM security features of @@ -331,26 +311,26 @@ When the CacheFiles module is asked to bind to its cache, it: (1) Finds the security label attached to the root cache directory and uses that as the security label with which it will create files. By default, - this is: + this is:: cachefiles_var_t (2) Finds the security label of the process which issued the bind request - (presumed to be the cachefilesd daemon), which by default will be: + (presumed to be the cachefilesd daemon), which by default will be:: cachefilesd_t and asks LSM to supply a security ID as which it should act given the - daemon's label. By default, this will be: + daemon's label. By default, this will be:: cachefiles_kernel_t SELinux transitions the daemon's security ID to the module's security ID - based on a rule of this form in the policy. + based on a rule of this form in the policy:: type_transition <daemon's-ID> kernel_t : process <module's-ID>; - For instance: + For instance:: type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t; @@ -370,7 +350,7 @@ There are policy source files available in: http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2 -and later versions. In that tarball, see the files: +and later versions. In that tarball, see the files:: cachefilesd.te cachefilesd.fc @@ -379,7 +359,7 @@ and later versions. In that tarball, see the files: They are built and installed directly by the RPM. If a non-RPM based system is being used, then copy the above files to their own -directory and run: +directory and run:: make -f /usr/share/selinux/devel/Makefile semodule -i cachefilesd.pp @@ -394,7 +374,7 @@ an auxiliary policy must be installed to label the alternate location of the cache. For instructions on how to add an auxiliary policy to enable the cache to be -located elsewhere when SELinux is in enforcing mode, please see: +located elsewhere when SELinux is in enforcing mode, please see:: /usr/share/doc/cachefilesd-*/move-cache.txt @@ -402,8 +382,7 @@ When the cachefilesd rpm is installed; alternatively, the document can be found in the sources. -================== -A NOTE ON SECURITY +A Note on Security ================== CacheFiles makes use of the split security in the task_struct. It allocates @@ -445,17 +424,18 @@ for CacheFiles to run in a context of a specific security label, or to create files and directories with another security label. -======================= -STATISTICAL INFORMATION +Statistical Information ======================= -If FS-Cache is compiled with the following option enabled: +If FS-Cache is compiled with the following option enabled:: CONFIG_CACHEFILES_HISTOGRAM=y then it will gather certain statistics and display them through a proc file. - (*) /proc/fs/cachefiles/histogram + /proc/fs/cachefiles/histogram + + :: cat /proc/fs/cachefiles/histogram JIFS SECS LOOKUPS MKDIRS CREATES @@ -465,36 +445,39 @@ then it will gather certain statistics and display them through a proc file. between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The columns are as follows: + ======= ======================================================= COLUMN TIME MEASUREMENT ======= ======================================================= LOOKUPS Length of time to perform a lookup on the backing fs MKDIRS Length of time to perform a mkdir on the backing fs CREATES Length of time to perform a create on the backing fs + ======= ======================================================= Each row shows the number of events that took a particular range of times. Each step is 1 jiffy in size. The JIFS column indicates the particular jiffy range covered, and the SECS field the equivalent number of seconds. -========= -DEBUGGING +Debugging ========= If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime -debugging enabled by adjusting the value in: +debugging enabled by adjusting the value in:: /sys/module/cachefiles/parameters/debug This is a bitmask of debugging streams to enable: + ======= ======= =============================== ======================= BIT VALUE STREAM POINT ======= ======= =============================== ======================= 0 1 General Function entry trace 1 2 Function exit trace 2 4 General + ======= ======= =============================== ======================= The appropriate set of values should be OR'd together and the result written to -the control file. For example: +the control file. For example:: echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug diff --git a/Documentation/filesystems/caching/fscache.rst b/Documentation/filesystems/caching/fscache.rst new file mode 100644 index 000000000000..70de86922b6a --- /dev/null +++ b/Documentation/filesystems/caching/fscache.rst @@ -0,0 +1,565 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +General Filesystem Caching +========================== + +Overview +======== + +This facility is a general purpose cache for network filesystems, though it +could be used for caching other things such as ISO9660 filesystems too. + +FS-Cache mediates between cache backends (such as CacheFS) and network +filesystems:: + + +---------+ + | | +--------------+ + | NFS |--+ | | + | | | +-->| CacheFS | + +---------+ | +----------+ | | /dev/hda5 | + | | | | +--------------+ + +---------+ +-->| | | + | | | |--+ + | AFS |----->| FS-Cache | + | | | |--+ + +---------+ +-->| | | + | | | | +--------------+ + +---------+ | +----------+ | | | + | | | +-->| CacheFiles | + | ISOFS |--+ | /var/cache | + | | +--------------+ + +---------+ + +Or to look at it another way, FS-Cache is a module that provides a caching +facility to a network filesystem such that the cache is transparent to the +user:: + + +---------+ + | | + | Server | + | | + +---------+ + | NETWORK + ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + | + | +----------+ + V | | + +---------+ | | + | | | | + | NFS |----->| FS-Cache | + | | | |--+ + +---------+ | | | +--------------+ +--------------+ + | | | | | | | | + V +----------+ +-->| CacheFiles |-->| Ext3 | + +---------+ | /var/cache | | /dev/sda6 | + | | +--------------+ +--------------+ + | VFS | ^ ^ + | | | | + +---------+ +--------------+ | + | KERNEL SPACE | | + ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~ + | USER SPACE | | + V | | + +---------+ +--------------+ + | | | | + | Process | | cachefilesd | + | | | | + +---------+ +--------------+ + + +FS-Cache does not follow the idea of completely loading every netfs file +opened in its entirety into a cache before permitting it to be accessed and +then serving the pages out of that cache rather than the netfs inode because: + + (1) It must be practical to operate without a cache. + + (2) The size of any accessible file must not be limited to the size of the + cache. + + (3) The combined size of all opened files (this includes mapped libraries) + must not be limited to the size of the cache. + + (4) The user should not be forced to download an entire file just to do a + one-off access of a small portion of it (such as might be done with the + "file" program). + +It instead serves the cache out in PAGE_SIZE chunks as and when requested by +the netfs('s) using it. + + +FS-Cache provides the following facilities: + + (1) More than one cache can be used at once. Caches can be selected + explicitly by use of tags. + + (2) Caches can be added / removed at any time. + + (3) The netfs is provided with an interface that allows either party to + withdraw caching facilities from a file (required for (2)). + + (4) The interface to the netfs returns as few errors as possible, preferring + rather to let the netfs remain oblivious. + + (5) Cookies are used to represent indices, files and other objects to the + netfs. The simplest cookie is just a NULL pointer - indicating nothing + cached there. + + (6) The netfs is allowed to propose - dynamically - any index hierarchy it + desires, though it must be aware that the index search function is + recursive, stack space is limited, and indices can only be children of + indices. + + (7) Data I/O is done direct to and from the netfs's pages. The netfs + indicates that page A is at index B of the data-file represented by cookie + C, and that it should be read or written. The cache backend may or may + not start I/O on that page, but if it does, a netfs callback will be + invoked to indicate completion. The I/O may be either synchronous or + asynchronous. + + (8) Cookies can be "retired" upon release. At this point FS-Cache will mark + them as obsolete and the index hierarchy rooted at that point will get + recycled. + + (9) The netfs provides a "match" function for index searches. In addition to + saying whether a match was made or not, this can also specify that an + entry should be updated or deleted. + +(10) As much as possible is done asynchronously. + + +FS-Cache maintains a virtual indexing tree in which all indices, files, objects +and pages are kept. Bits of this tree may actually reside in one or more +caches:: + + FSDEF + | + +------------------------------------+ + | | + NFS AFS + | | + +--------------------------+ +-----------+ + | | | | + homedir mirror afs.org redhat.com + | | | + +------------+ +---------------+ +----------+ + | | | | | | + 00001 00002 00007 00125 vol00001 vol00002 + | | | | | + +---+---+ +-----+ +---+ +------+------+ +-----+----+ + | | | | | | | | | | | | | + PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak + | | + PG0 +-------+ + | | + 00001 00003 + | + +---+---+ + | | | + PG0 PG1 PG2 + +In the example above, you can see two netfs's being backed: NFS and AFS. These +have different index hierarchies: + + * The NFS primary index contains per-server indices. Each server index is + indexed by NFS file handles to get data file objects. Each data file + objects can have an array of pages, but may also have further child + objects, such as extended attributes and directory entries. Extended + attribute objects themselves have page-array contents. + + * The AFS primary index contains per-cell indices. Each cell index contains + per-logical-volume indices. Each of volume index contains up to three + indices for the read-write, read-only and backup mirrors of those volumes. + Each of these contains vnode data file objects, each of which contains an + array of pages. + +The very top index is the FS-Cache master index in which individual netfs's +have entries. + +Any index object may reside in more than one cache, provided it only has index +children. Any index with non-index object children will be assumed to only +reside in one cache. + + +The netfs API to FS-Cache can be found in: + + Documentation/filesystems/caching/netfs-api.rst + +The cache backend API to FS-Cache can be found in: + + Documentation/filesystems/caching/backend-api.rst + +A description of the internal representations and object state machine can be +found in: + + Documentation/filesystems/caching/object.rst + + +Statistical Information +======================= + +If FS-Cache is compiled with the following options enabled:: + + CONFIG_FSCACHE_STATS=y + CONFIG_FSCACHE_HISTOGRAM=y + +then it will gather certain statistics and display them through a number of +proc files. + +/proc/fs/fscache/stats +---------------------- + + This shows counts of a number of events that can happen in FS-Cache: + ++--------------+-------+-------------------------------------------------------+ +|CLASS |EVENT |MEANING | ++==============+=======+=======================================================+ +|Cookies |idx=N |Number of index cookies allocated | ++ +-------+-------------------------------------------------------+ +| |dat=N |Number of data storage cookies allocated | ++ +-------+-------------------------------------------------------+ +| |spc=N |Number of special cookies allocated | ++--------------+-------+-------------------------------------------------------+ +|Objects |alc=N |Number of objects allocated | ++ +-------+-------------------------------------------------------+ +| |nal=N |Number of object allocation failures | ++ +-------+-------------------------------------------------------+ +| |avl=N |Number of objects that reached the available state | ++ +-------+-------------------------------------------------------+ +| |ded=N |Number of objects that reached the dead state | ++--------------+-------+-------------------------------------------------------+ +|ChkAux |non=N |Number of objects that didn't have a coherency check | ++ +-------+-------------------------------------------------------+ +| |ok=N |Number of objects that passed a coherency check | ++ +-------+-------------------------------------------------------+ +| |upd=N |Number of objects that needed a coherency data update | ++ +-------+-------------------------------------------------------+ +| |obs=N |Number of objects that were declared obsolete | ++--------------+-------+-------------------------------------------------------+ +|Pages |mrk=N |Number of pages marked as being cached | +| |unc=N |Number of uncache page requests seen | ++--------------+-------+-------------------------------------------------------+ +|Acquire |n=N |Number of acquire cookie requests seen | ++ +-------+-------------------------------------------------------+ +| |nul=N |Number of acq reqs given a NULL parent | ++ +-------+-------------------------------------------------------+ +| |noc=N |Number of acq reqs rejected due to no cache available | ++ +-------+-------------------------------------------------------+ +| |ok=N |Number of acq reqs succeeded | ++ +-------+-------------------------------------------------------+ +| |nbf=N |Number of acq reqs rejected due to error | ++ +-------+-------------------------------------------------------+ +| |oom=N |Number of acq reqs failed on ENOMEM | ++--------------+-------+-------------------------------------------------------+ +|Lookups |n=N |Number of lookup calls made on cache backends | ++ +-------+-------------------------------------------------------+ +| |neg=N |Number of negative lookups made | ++ +-------+-------------------------------------------------------+ +| |pos=N |Number of positive lookups made | ++ +-------+-------------------------------------------------------+ +| |crt=N |Number of objects created by lookup | ++ +-------+-------------------------------------------------------+ +| |tmo=N |Number of lookups timed out and requeued | ++--------------+-------+-------------------------------------------------------+ +|Updates |n=N |Number of update cookie requests seen | ++ +-------+-------------------------------------------------------+ +| |nul=N |Number of upd reqs given a NULL parent | ++ +-------+-------------------------------------------------------+ +| |run=N |Number of upd reqs granted CPU time | ++--------------+-------+-------------------------------------------------------+ +|Relinqs |n=N |Number of relinquish cookie requests seen | ++ +-------+-------------------------------------------------------+ +| |nul=N |Number of rlq reqs given a NULL parent | ++ +-------+-------------------------------------------------------+ +| |wcr=N |Number of rlq reqs waited on completion of creation | ++--------------+-------+-------------------------------------------------------+ +|AttrChg |n=N |Number of attribute changed requests seen | ++ +-------+-------------------------------------------------------+ +| |ok=N |Number of attr changed requests queued | ++ +-------+-------------------------------------------------------+ +| |nbf=N |Number of attr changed rejected -ENOBUFS | ++ +-------+-------------------------------------------------------+ +| |oom=N |Number of attr changed failed -ENOMEM | ++ +-------+-------------------------------------------------------+ +| |run=N |Number of attr changed ops given CPU time | ++--------------+-------+-------------------------------------------------------+ +|Allocs |n=N |Number of allocation requests seen | ++ +-------+-------------------------------------------------------+ +| |ok=N |Number of successful alloc reqs | ++ +-------+-------------------------------------------------------+ +| |wt=N |Number of alloc reqs that waited on lookup completion | ++ +-------+-------------------------------------------------------+ +| |nbf=N |Number of alloc reqs rejected -ENOBUFS | ++ +-------+-------------------------------------------------------+ +| |int=N |Number of alloc reqs aborted -ERESTARTSYS | ++ +-------+-------------------------------------------------------+ +| |ops=N |Number of alloc reqs submitted | ++ +-------+-------------------------------------------------------+ +| |owt=N |Number of alloc reqs waited for CPU time | ++ +-------+-------------------------------------------------------+ +| |abt=N |Number of alloc reqs aborted due to object death | ++--------------+-------+-------------------------------------------------------+ +|Retrvls |n=N |Number of retrieval (read) requests seen | ++ +-------+-------------------------------------------------------+ +| |ok=N |Number of successful retr reqs | ++ +-------+-------------------------------------------------------+ +| |wt=N |Number of retr reqs that waited on lookup completion | ++ +-------+-------------------------------------------------------+ +| |nod=N |Number of retr reqs returned -ENODATA | ++ +-------+-------------------------------------------------------+ +| |nbf=N |Number of retr reqs rejected -ENOBUFS | ++ +-------+-------------------------------------------------------+ +| |int=N |Number of retr reqs aborted -ERESTARTSYS | ++ +-------+-------------------------------------------------------+ +| |oom=N |Number of retr reqs failed -ENOMEM | ++ +-------+-------------------------------------------------------+ +| |ops=N |Number of retr reqs submitted | ++ +-------+-------------------------------------------------------+ +| |owt=N |Number of retr reqs waited for CPU time | ++ +-------+-------------------------------------------------------+ +| |abt=N |Number of retr reqs aborted due to object death | ++--------------+-------+-------------------------------------------------------+ +|Stores |n=N |Number of storage (write) requests seen | ++ +-------+-------------------------------------------------------+ +| |ok=N |Number of successful store reqs | ++ +-------+-------------------------------------------------------+ +| |agn=N |Number of store reqs on a page already pending storage | ++ +-------+-------------------------------------------------------+ +| |nbf=N |Number of store reqs rejected -ENOBUFS | ++ +-------+-------------------------------------------------------+ +| |oom=N |Number of store reqs failed -ENOMEM | ++ +-------+-------------------------------------------------------+ +| |ops=N |Number of store reqs submitted | ++ +-------+-------------------------------------------------------+ +| |run=N |Number of store reqs granted CPU time | ++ +-------+-------------------------------------------------------+ +| |pgs=N |Number of pages given store req processing time | ++ +-------+-------------------------------------------------------+ +| |rxd=N |Number of store reqs deleted from tracking tree | ++ +-------+-------------------------------------------------------+ +| |olm=N |Number of store reqs over store limit | ++--------------+-------+-------------------------------------------------------+ +|VmScan |nos=N |Number of release reqs against pages with no | +| | |pending store | ++ +-------+-------------------------------------------------------+ +| |gon=N |Number of release reqs against pages stored by | +| | |time lock granted | ++ +-------+-------------------------------------------------------+ +| |bsy=N |Number of release reqs ignored due to in-progress store| ++ +-------+-------------------------------------------------------+ +| |can=N |Number of page stores cancelled due to release req | ++--------------+-------+-------------------------------------------------------+ +|Ops |pend=N |Number of times async ops added to pending queues | ++ +-------+-------------------------------------------------------+ +| |run=N |Number of times async ops given CPU time | ++ +-------+-------------------------------------------------------+ +| |enq=N |Number of times async ops queued for processing | ++ +-------+-------------------------------------------------------+ +| |can=N |Number of async ops cancelled | ++ +-------+-------------------------------------------------------+ +| |rej=N |Number of async ops rejected due to object | +| | |lookup/create failure | ++ +-------+-------------------------------------------------------+ +| |ini=N |Number of async ops initialised | ++ +-------+-------------------------------------------------------+ +| |dfr=N |Number of async ops queued for deferred release | ++ +-------+-------------------------------------------------------+ +| |rel=N |Number of async ops released | +| | |(should equal ini=N when idle) | ++ +-------+-------------------------------------------------------+ +| |gc=N |Number of deferred-release async ops garbage collected | ++--------------+-------+-------------------------------------------------------+ +|CacheOp |alo=N |Number of in-progress alloc_object() cache ops | ++ +-------+-------------------------------------------------------+ +| |luo=N |Number of in-progress lookup_object() cache ops | ++ +-------+-------------------------------------------------------+ +| |luc=N |Number of in-progress lookup_complete() cache ops | ++ +-------+-------------------------------------------------------+ +| |gro=N |Number of in-progress grab_object() cache ops | ++ +-------+-------------------------------------------------------+ +| |upo=N |Number of in-progress update_object() cache ops | ++ +-------+-------------------------------------------------------+ +| |dro=N |Number of in-progress drop_object() cache ops | ++ +-------+-------------------------------------------------------+ +| |pto=N |Number of in-progress put_object() cache ops | ++ +-------+-------------------------------------------------------+ +| |syn=N |Number of in-progress sync_cache() cache ops | ++ +-------+-------------------------------------------------------+ +| |atc=N |Number of in-progress attr_changed() cache ops | ++ +-------+-------------------------------------------------------+ +| |rap=N |Number of in-progress read_or_alloc_page() cache ops | ++ +-------+-------------------------------------------------------+ +| |ras=N |Number of in-progress read_or_alloc_pages() cache ops | ++ +-------+-------------------------------------------------------+ +| |alp=N |Number of in-progress allocate_page() cache ops | ++ +-------+-------------------------------------------------------+ +| |als=N |Number of in-progress allocate_pages() cache ops | ++ +-------+-------------------------------------------------------+ +| |wrp=N |Number of in-progress write_page() cache ops | ++ +-------+-------------------------------------------------------+ +| |ucp=N |Number of in-progress uncache_page() cache ops | ++ +-------+-------------------------------------------------------+ +| |dsp=N |Number of in-progress dissociate_pages() cache ops | ++--------------+-------+-------------------------------------------------------+ +|CacheEv |nsp=N |Number of object lookups/creations rejected due to | +| | |lack of space | ++ +-------+-------------------------------------------------------+ +| |stl=N |Number of stale objects deleted | ++ +-------+-------------------------------------------------------+ +| |rtr=N |Number of objects retired when relinquished | ++ +-------+-------------------------------------------------------+ +| |cul=N |Number of objects culled | ++--------------+-------+-------------------------------------------------------+ + + + +/proc/fs/fscache/histogram +-------------------------- + + :: + + cat /proc/fs/fscache/histogram + JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS + ===== ===== ========= ========= ========= ========= ========= + + This shows the breakdown of the number of times each amount of time + between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The + columns are as follows: + + ========= ======================================================= + COLUMN TIME MEASUREMENT + ========= ======================================================= + OBJ INST Length of time to instantiate an object + OP RUNS Length of time a call to process an operation took + OBJ RUNS Length of time a call to process an object event took + RETRV DLY Time between an requesting a read and lookup completing + RETRIEVLS Time between beginning and end of a retrieval + ========= ======================================================= + + Each row shows the number of events that took a particular range of times. + Each step is 1 jiffy in size. The JIFS column indicates the particular + jiffy range covered, and the SECS field the equivalent number of seconds. + + + +Object List +=========== + +If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a +list of all the objects currently allocated and allow them to be viewed +through:: + + /proc/fs/fscache/objects + +This will look something like:: + + [root@andromeda ~]# head /proc/fs/fscache/objects + OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA + ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================ + 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a + 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a + +where the first set of columns before the '|' describe the object: + + ======= =============================================================== + COLUMN DESCRIPTION + ======= =============================================================== + OBJECT Object debugging ID (appears as OBJ%x in some debug messages) + PARENT Debugging ID of parent object + STAT Object state + CHLDN Number of child objects of this object + OPS Number of outstanding operations on this object + OOP Number of outstanding child object management operations + IPR + EX Number of outstanding exclusive operations + READS Number of outstanding read operations + EM Object's event mask + EV Events raised on this object + F Object flags + S Object work item busy state mask (1:pending 2:running) + ======= =============================================================== + +and the second set of columns describe the object's cookie, if present: + + ================ ====================================================== + COLUMN DESCRIPTION + ================ ====================================================== + NETFS_COOKIE_DEF Name of netfs cookie definition + TY Cookie type (IX - index, DT - data, hex - special) + FL Cookie flags + NETFS_DATA Netfs private data stored in the cookie + OBJECT_KEY Object key } 1 column, with separating comma + AUX_DATA Object aux data } presence may be configured + ================ ====================================================== + +The data shown may be filtered by attaching the a key to an appropriate keyring +before viewing the file. Something like:: + + keyctl add user fscache:objlist <restrictions> @s + +where <restrictions> are a selection of the following letters: + + == ========================================================= + K Show hexdump of object key (don't show if not given) + A Show hexdump of object aux data (don't show if not given) + == ========================================================= + +and the following paired letters: + + == ========================================================= + C Show objects that have a cookie + c Show objects that don't have a cookie + B Show objects that are busy + b Show objects that aren't busy + W Show objects that have pending writes + w Show objects that don't have pending writes + R Show objects that have outstanding reads + r Show objects that don't have outstanding reads + S Show objects that have work queued + s Show objects that don't have work queued + == ========================================================= + +If neither side of a letter pair is given, then both are implied. For example: + + keyctl add user fscache:objlist KB @s + +shows objects that are busy, and lists their object keys, but does not dump +their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is +not implied. + +By default all objects and all fields will be shown. + + +Debugging +========= + +If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime +debugging enabled by adjusting the value in:: + + /sys/module/fscache/parameters/debug + +This is a bitmask of debugging streams to enable: + + ======= ======= =============================== ======================= + BIT VALUE STREAM POINT + ======= ======= =============================== ======================= + 0 1 Cache management Function entry trace + 1 2 Function exit trace + 2 4 General + 3 8 Cookie management Function entry trace + 4 16 Function exit trace + 5 32 General + 6 64 Page handling Function entry trace + 7 128 Function exit trace + 8 256 General + 9 512 Operation management Function entry trace + 10 1024 Function exit trace + 11 2048 General + ======= ======= =============================== ======================= + +The appropriate set of values should be OR'd together and the result written to +the control file. For example:: + + echo $((1|8|64)) >/sys/module/fscache/parameters/debug + +will turn on all function entry debugging. diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt deleted file mode 100644 index 50f0a5757f48..000000000000 --- a/Documentation/filesystems/caching/fscache.txt +++ /dev/null @@ -1,448 +0,0 @@ - ========================== - General Filesystem Caching - ========================== - -======== -OVERVIEW -======== - -This facility is a general purpose cache for network filesystems, though it -could be used for caching other things such as ISO9660 filesystems too. - -FS-Cache mediates between cache backends (such as CacheFS) and network -filesystems: - - +---------+ - | | +--------------+ - | NFS |--+ | | - | | | +-->| CacheFS | - +---------+ | +----------+ | | /dev/hda5 | - | | | | +--------------+ - +---------+ +-->| | | - | | | |--+ - | AFS |----->| FS-Cache | - | | | |--+ - +---------+ +-->| | | - | | | | +--------------+ - +---------+ | +----------+ | | | - | | | +-->| CacheFiles | - | ISOFS |--+ | /var/cache | - | | +--------------+ - +---------+ - -Or to look at it another way, FS-Cache is a module that provides a caching -facility to a network filesystem such that the cache is transparent to the -user: - - +---------+ - | | - | Server | - | | - +---------+ - | NETWORK - ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - | - | +----------+ - V | | - +---------+ | | - | | | | - | NFS |----->| FS-Cache | - | | | |--+ - +---------+ | | | +--------------+ +--------------+ - | | | | | | | | - V +----------+ +-->| CacheFiles |-->| Ext3 | - +---------+ | /var/cache | | /dev/sda6 | - | | +--------------+ +--------------+ - | VFS | ^ ^ - | | | | - +---------+ +--------------+ | - | KERNEL SPACE | | - ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~ - | USER SPACE | | - V | | - +---------+ +--------------+ - | | | | - | Process | | cachefilesd | - | | | | - +---------+ +--------------+ - - -FS-Cache does not follow the idea of completely loading every netfs file -opened in its entirety into a cache before permitting it to be accessed and -then serving the pages out of that cache rather than the netfs inode because: - - (1) It must be practical to operate without a cache. - - (2) The size of any accessible file must not be limited to the size of the - cache. - - (3) The combined size of all opened files (this includes mapped libraries) - must not be limited to the size of the cache. - - (4) The user should not be forced to download an entire file just to do a - one-off access of a small portion of it (such as might be done with the - "file" program). - -It instead serves the cache out in PAGE_SIZE chunks as and when requested by -the netfs('s) using it. - - -FS-Cache provides the following facilities: - - (1) More than one cache can be used at once. Caches can be selected - explicitly by use of tags. - - (2) Caches can be added / removed at any time. - - (3) The netfs is provided with an interface that allows either party to - withdraw caching facilities from a file (required for (2)). - - (4) The interface to the netfs returns as few errors as possible, preferring - rather to let the netfs remain oblivious. - - (5) Cookies are used to represent indices, files and other objects to the - netfs. The simplest cookie is just a NULL pointer - indicating nothing - cached there. - - (6) The netfs is allowed to propose - dynamically - any index hierarchy it - desires, though it must be aware that the index search function is - recursive, stack space is limited, and indices can only be children of - indices. - - (7) Data I/O is done direct to and from the netfs's pages. The netfs - indicates that page A is at index B of the data-file represented by cookie - C, and that it should be read or written. The cache backend may or may - not start I/O on that page, but if it does, a netfs callback will be - invoked to indicate completion. The I/O may be either synchronous or - asynchronous. - - (8) Cookies can be "retired" upon release. At this point FS-Cache will mark - them as obsolete and the index hierarchy rooted at that point will get - recycled. - - (9) The netfs provides a "match" function for index searches. In addition to - saying whether a match was made or not, this can also specify that an - entry should be updated or deleted. - -(10) As much as possible is done asynchronously. - - -FS-Cache maintains a virtual indexing tree in which all indices, files, objects -and pages are kept. Bits of this tree may actually reside in one or more -caches. - - FSDEF - | - +------------------------------------+ - | | - NFS AFS - | | - +--------------------------+ +-----------+ - | | | | - homedir mirror afs.org redhat.com - | | | - +------------+ +---------------+ +----------+ - | | | | | | - 00001 00002 00007 00125 vol00001 vol00002 - | | | | | - +---+---+ +-----+ +---+ +------+------+ +-----+----+ - | | | | | | | | | | | | | -PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak - | | - PG0 +-------+ - | | - 00001 00003 - | - +---+---+ - | | | - PG0 PG1 PG2 - -In the example above, you can see two netfs's being backed: NFS and AFS. These -have different index hierarchies: - - (*) The NFS primary index contains per-server indices. Each server index is - indexed by NFS file handles to get data file objects. Each data file - objects can have an array of pages, but may also have further child - objects, such as extended attributes and directory entries. Extended - attribute objects themselves have page-array contents. - - (*) The AFS primary index contains per-cell indices. Each cell index contains - per-logical-volume indices. Each of volume index contains up to three - indices for the read-write, read-only and backup mirrors of those volumes. - Each of these contains vnode data file objects, each of which contains an - array of pages. - -The very top index is the FS-Cache master index in which individual netfs's -have entries. - -Any index object may reside in more than one cache, provided it only has index -children. Any index with non-index object children will be assumed to only -reside in one cache. - - -The netfs API to FS-Cache can be found in: - - Documentation/filesystems/caching/netfs-api.txt - -The cache backend API to FS-Cache can be found in: - - Documentation/filesystems/caching/backend-api.txt - -A description of the internal representations and object state machine can be -found in: - - Documentation/filesystems/caching/object.txt - - -======================= -STATISTICAL INFORMATION -======================= - -If FS-Cache is compiled with the following options enabled: - - CONFIG_FSCACHE_STATS=y - CONFIG_FSCACHE_HISTOGRAM=y - -then it will gather certain statistics and display them through a number of -proc files. - - (*) /proc/fs/fscache/stats - - This shows counts of a number of events that can happen in FS-Cache: - - CLASS EVENT MEANING - ======= ======= ======================================================= - Cookies idx=N Number of index cookies allocated - dat=N Number of data storage cookies allocated - spc=N Number of special cookies allocated - Objects alc=N Number of objects allocated - nal=N Number of object allocation failures - avl=N Number of objects that reached the available state - ded=N Number of objects that reached the dead state - ChkAux non=N Number of objects that didn't have a coherency check - ok=N Number of objects that passed a coherency check - upd=N Number of objects that needed a coherency data update - obs=N Number of objects that were declared obsolete - Pages mrk=N Number of pages marked as being cached - unc=N Number of uncache page requests seen - Acquire n=N Number of acquire cookie requests seen - nul=N Number of acq reqs given a NULL parent - noc=N Number of acq reqs rejected due to no cache available - ok=N Number of acq reqs succeeded - nbf=N Number of acq reqs rejected due to error - oom=N Number of acq reqs failed on ENOMEM - Lookups n=N Number of lookup calls made on cache backends - neg=N Number of negative lookups made - pos=N Number of positive lookups made - crt=N Number of objects created by lookup - tmo=N Number of lookups timed out and requeued - Updates n=N Number of update cookie requests seen - nul=N Number of upd reqs given a NULL parent - run=N Number of upd reqs granted CPU time - Relinqs n=N Number of relinquish cookie requests seen - nul=N Number of rlq reqs given a NULL parent - wcr=N Number of rlq reqs waited on completion of creation - AttrChg n=N Number of attribute changed requests seen - ok=N Number of attr changed requests queued - nbf=N Number of attr changed rejected -ENOBUFS - oom=N Number of attr changed failed -ENOMEM - run=N Number of attr changed ops given CPU time - Allocs n=N Number of allocation requests seen - ok=N Number of successful alloc reqs - wt=N Number of alloc reqs that waited on lookup completion - nbf=N Number of alloc reqs rejected -ENOBUFS - int=N Number of alloc reqs aborted -ERESTARTSYS - ops=N Number of alloc reqs submitted - owt=N Number of alloc reqs waited for CPU time - abt=N Number of alloc reqs aborted due to object death - Retrvls n=N Number of retrieval (read) requests seen - ok=N Number of successful retr reqs - wt=N Number of retr reqs that waited on lookup completion - nod=N Number of retr reqs returned -ENODATA - nbf=N Number of retr reqs rejected -ENOBUFS - int=N Number of retr reqs aborted -ERESTARTSYS - oom=N Number of retr reqs failed -ENOMEM - ops=N Number of retr reqs submitted - owt=N Number of retr reqs waited for CPU time - abt=N Number of retr reqs aborted due to object death - Stores n=N Number of storage (write) requests seen - ok=N Number of successful store reqs - agn=N Number of store reqs on a page already pending storage - nbf=N Number of store reqs rejected -ENOBUFS - oom=N Number of store reqs failed -ENOMEM - ops=N Number of store reqs submitted - run=N Number of store reqs granted CPU time - pgs=N Number of pages given store req processing time - rxd=N Number of store reqs deleted from tracking tree - olm=N Number of store reqs over store limit - VmScan nos=N Number of release reqs against pages with no pending store - gon=N Number of release reqs against pages stored by time lock granted - bsy=N Number of release reqs ignored due to in-progress store - can=N Number of page stores cancelled due to release req - Ops pend=N Number of times async ops added to pending queues - run=N Number of times async ops given CPU time - enq=N Number of times async ops queued for processing - can=N Number of async ops cancelled - rej=N Number of async ops rejected due to object lookup/create failure - ini=N Number of async ops initialised - dfr=N Number of async ops queued for deferred release - rel=N Number of async ops released (should equal ini=N when idle) - gc=N Number of deferred-release async ops garbage collected - CacheOp alo=N Number of in-progress alloc_object() cache ops - luo=N Number of in-progress lookup_object() cache ops - luc=N Number of in-progress lookup_complete() cache ops - gro=N Number of in-progress grab_object() cache ops - upo=N Number of in-progress update_object() cache ops - dro=N Number of in-progress drop_object() cache ops - pto=N Number of in-progress put_object() cache ops - syn=N Number of in-progress sync_cache() cache ops - atc=N Number of in-progress attr_changed() cache ops - rap=N Number of in-progress read_or_alloc_page() cache ops - ras=N Number of in-progress read_or_alloc_pages() cache ops - alp=N Number of in-progress allocate_page() cache ops - als=N Number of in-progress allocate_pages() cache ops - wrp=N Number of in-progress write_page() cache ops - ucp=N Number of in-progress uncache_page() cache ops - dsp=N Number of in-progress dissociate_pages() cache ops - CacheEv nsp=N Number of object lookups/creations rejected due to lack of space - stl=N Number of stale objects deleted - rtr=N Number of objects retired when relinquished - cul=N Number of objects culled - - - (*) /proc/fs/fscache/histogram - - cat /proc/fs/fscache/histogram - JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS - ===== ===== ========= ========= ========= ========= ========= - - This shows the breakdown of the number of times each amount of time - between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The - columns are as follows: - - COLUMN TIME MEASUREMENT - ======= ======================================================= - OBJ INST Length of time to instantiate an object - OP RUNS Length of time a call to process an operation took - OBJ RUNS Length of time a call to process an object event took - RETRV DLY Time between an requesting a read and lookup completing - RETRIEVLS Time between beginning and end of a retrieval - - Each row shows the number of events that took a particular range of times. - Each step is 1 jiffy in size. The JIFS column indicates the particular - jiffy range covered, and the SECS field the equivalent number of seconds. - - -=========== -OBJECT LIST -=========== - -If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a -list of all the objects currently allocated and allow them to be viewed -through: - - /proc/fs/fscache/objects - -This will look something like: - - [root@andromeda ~]# head /proc/fs/fscache/objects - OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA - ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================ - 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a - 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a - -where the first set of columns before the '|' describe the object: - - COLUMN DESCRIPTION - ======= =============================================================== - OBJECT Object debugging ID (appears as OBJ%x in some debug messages) - PARENT Debugging ID of parent object - STAT Object state - CHLDN Number of child objects of this object - OPS Number of outstanding operations on this object - OOP Number of outstanding child object management operations - IPR - EX Number of outstanding exclusive operations - READS Number of outstanding read operations - EM Object's event mask - EV Events raised on this object - F Object flags - S Object work item busy state mask (1:pending 2:running) - -and the second set of columns describe the object's cookie, if present: - - COLUMN DESCRIPTION - =============== ======================================================= - NETFS_COOKIE_DEF Name of netfs cookie definition - TY Cookie type (IX - index, DT - data, hex - special) - FL Cookie flags - NETFS_DATA Netfs private data stored in the cookie - OBJECT_KEY Object key } 1 column, with separating comma - AUX_DATA Object aux data } presence may be configured - -The data shown may be filtered by attaching the a key to an appropriate keyring -before viewing the file. Something like: - - keyctl add user fscache:objlist <restrictions> @s - -where <restrictions> are a selection of the following letters: - - K Show hexdump of object key (don't show if not given) - A Show hexdump of object aux data (don't show if not given) - -and the following paired letters: - - C Show objects that have a cookie - c Show objects that don't have a cookie - B Show objects that are busy - b Show objects that aren't busy - W Show objects that have pending writes - w Show objects that don't have pending writes - R Show objects that have outstanding reads - r Show objects that don't have outstanding reads - S Show objects that have work queued - s Show objects that don't have work queued - -If neither side of a letter pair is given, then both are implied. For example: - - keyctl add user fscache:objlist KB @s - -shows objects that are busy, and lists their object keys, but does not dump -their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is -not implied. - -By default all objects and all fields will be shown. - - -========= -DEBUGGING -========= - -If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime -debugging enabled by adjusting the value in: - - /sys/module/fscache/parameters/debug - -This is a bitmask of debugging streams to enable: - - BIT VALUE STREAM POINT - ======= ======= =============================== ======================= - 0 1 Cache management Function entry trace - 1 2 Function exit trace - 2 4 General - 3 8 Cookie management Function entry trace - 4 16 Function exit trace - 5 32 General - 6 64 Page handling Function entry trace - 7 128 Function exit trace - 8 256 General - 9 512 Operation management Function entry trace - 10 1024 Function exit trace - 11 2048 General - -The appropriate set of values should be OR'd together and the result written to -the control file. For example: - - echo $((1|8|64)) >/sys/module/fscache/parameters/debug - -will turn on all function entry debugging. diff --git a/Documentation/filesystems/caching/index.rst b/Documentation/filesystems/caching/index.rst new file mode 100644 index 000000000000..033da7ac7c6e --- /dev/null +++ b/Documentation/filesystems/caching/index.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Filesystem Caching +================== + +.. toctree:: + :maxdepth: 2 + + fscache + object + backend-api + cachefiles + netfs-api + operations diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.rst index ba968e8f5704..d9f14b8610ba 100644 --- a/Documentation/filesystems/caching/netfs-api.txt +++ b/Documentation/filesystems/caching/netfs-api.rst @@ -1,6 +1,8 @@ - =============================== - FS-CACHE NETWORK FILESYSTEM API - =============================== +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +FS-Cache Network Filesystem API +=============================== There's an API by which a network filesystem can make use of the FS-Cache facilities. This is based around a number of principles: @@ -19,7 +21,7 @@ facilities. This is based around a number of principles: This API is declared in <linux/fscache.h>. -This document contains the following sections: +.. This document contains the following sections: (1) Network filesystem definition (2) Index definition @@ -41,12 +43,11 @@ This document contains the following sections: (18) FS-Cache specific page flags. -============================= -NETWORK FILESYSTEM DEFINITION +Network Filesystem Definition ============================= FS-Cache needs a description of the network filesystem. This is specified -using a record of the following structure: +using a record of the following structure:: struct fscache_netfs { uint32_t version; @@ -71,7 +72,7 @@ The fields are: another parameter passed into the registration function. For example, kAFS (linux/fs/afs/) uses the following definitions to describe -itself: +itself:: struct fscache_netfs afs_cache_netfs = { .version = 0, @@ -79,8 +80,7 @@ itself: }; -================ -INDEX DEFINITION +Index Definition ================ Indices are used for two purposes: @@ -114,11 +114,10 @@ There are some limits on indices: function is recursive. Too many layers will run the kernel out of stack. -================= -OBJECT DEFINITION +Object Definition ================= -To define an object, a structure of the following type should be filled out: +To define an object, a structure of the following type should be filled out:: struct fscache_cookie_def { @@ -149,16 +148,13 @@ This has the following fields: This is one of the following values: - (*) FSCACHE_COOKIE_TYPE_INDEX - + FSCACHE_COOKIE_TYPE_INDEX This defines an index, which is a special FS-Cache type. - (*) FSCACHE_COOKIE_TYPE_DATAFILE - + FSCACHE_COOKIE_TYPE_DATAFILE This defines an ordinary data file. - (*) Any other value between 2 and 255 - + Any other value between 2 and 255 This defines an extraordinary object such as an XATTR. (2) The name of the object type (NUL terminated unless all 16 chars are used) @@ -192,9 +188,14 @@ This has the following fields: If present, the function should return one of the following values: - (*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is - (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update - (*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted + FSCACHE_CHECKAUX_OKAY + - the entry is okay as is + + FSCACHE_CHECKAUX_NEEDS_UPDATE + - the entry requires update + + FSCACHE_CHECKAUX_OBSOLETE + - the entry should be deleted This function can also be used to extract data from the auxiliary data in the cache and copy it into the netfs's structures. @@ -236,32 +237,30 @@ This has the following fields: This function is not required for indices as they're not permitted data. -=================================== -NETWORK FILESYSTEM (UN)REGISTRATION +Network Filesystem (Un)registration =================================== The first step is to declare the network filesystem to the cache. This also involves specifying the layout of the primary index (for AFS, this would be the "cell" level). -The registration function is: +The registration function is:: int fscache_register_netfs(struct fscache_netfs *netfs); It just takes a pointer to the netfs definition. It returns 0 or an error as appropriate. -For kAFS, registration is done as follows: +For kAFS, registration is done as follows:: ret = fscache_register_netfs(&afs_cache_netfs); -The last step is, of course, unregistration: +The last step is, of course, unregistration:: void fscache_unregister_netfs(struct fscache_netfs *netfs); -================ -CACHE TAG LOOKUP +Cache Tag Lookup ================ FS-Cache permits the use of more than one cache. To permit particular index @@ -270,7 +269,7 @@ representation tags. This step is optional; it can be left entirely up to FS-Cache as to which cache should be used. The problem with doing that is that FS-Cache will always pick the first cache that was registered. -To get the representation for a named tag: +To get the representation for a named tag:: struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name); @@ -278,7 +277,7 @@ This takes a text string as the name and returns a representation of a tag. It will never return an error. It may return a dummy tag, however, if it runs out of memory; this will inhibit caching with this tag. -Any representation so obtained must be released by passing it to this function: +Any representation so obtained must be released by passing it to this function:: void fscache_release_cache_tag(struct fscache_cache_tag *tag); @@ -286,13 +285,12 @@ The tag will be retrieved by FS-Cache when it calls the object definition operation select_cache(). -================== -INDEX REGISTRATION +Index Registration ================== The third step is to inform FS-Cache about part of an index hierarchy that can be used to locate files. This is done by requesting a cookie for each index in -the path to the file: +the path to the file:: struct fscache_cookie * fscache_acquire_cookie(struct fscache_cookie *parent, @@ -339,7 +337,7 @@ must be enabled to do anything with it. A disabled cookie can be enabled by calling fscache_enable_cookie() (see below). For example, with AFS, a cell would be added to the primary index. This index -entry would have a dependent inode containing volume mappings within this cell: +entry would have a dependent inode containing volume mappings within this cell:: cell->cache = fscache_acquire_cookie(afs_cache_netfs.primary_index, @@ -349,7 +347,7 @@ entry would have a dependent inode containing volume mappings within this cell: cell, 0, true); And then a particular volume could be added to that index by ID, creating -another index for vnodes (AFS inode equivalents): +another index for vnodes (AFS inode equivalents):: volume->cache = fscache_acquire_cookie(volume->cell->cache, @@ -359,13 +357,12 @@ another index for vnodes (AFS inode equivalents): volume, 0, true); -====================== -DATA FILE REGISTRATION +Data File Registration ====================== The fourth step is to request a data file be created in the cache. This is identical to index cookie acquisition. The only difference is that the type in -the object definition should be something other than index type. +the object definition should be something other than index type:: vnode->cache = fscache_acquire_cookie(volume->cache, @@ -375,15 +372,14 @@ the object definition should be something other than index type. vnode, vnode->status.size, true); -================================= -MISCELLANEOUS OBJECT REGISTRATION +Miscellaneous Object Registration ================================= An optional step is to request an object of miscellaneous type be created in the cache. This is almost identical to index cookie acquisition. The only difference is that the type in the object definition should be something other than index type. While the parent object could be an index, it's more likely -it would be some other type of object such as a data file. +it would be some other type of object such as a data file:: xattr->cache = fscache_acquire_cookie(vnode->cache, @@ -396,13 +392,12 @@ Miscellaneous objects might be used to store extended attributes or directory entries for example. -========================== -SETTING THE DATA FILE SIZE +Setting the Data File Size ========================== The fifth step is to set the physical attributes of the file, such as its size. This doesn't automatically reserve any space in the cache, but permits the -cache to adjust its metadata for data tracking appropriately: +cache to adjust its metadata for data tracking appropriately:: int fscache_attr_changed(struct fscache_cookie *cookie); @@ -417,8 +412,7 @@ some point in the future, and as such, it may happen after the function returns to the caller. The attribute adjustment excludes read and write operations. -===================== -PAGE ALLOC/READ/WRITE +Page alloc/read/write ===================== And the sixth step is to store and retrieve pages in the cache. There are @@ -441,7 +435,7 @@ PAGE READ Firstly, the netfs should ask FS-Cache to examine the caches and read the contents cached for a particular page of a particular file if present, or else -allocate space to store the contents if not: +allocate space to store the contents if not:: typedef void (*fscache_rw_complete_t)(struct page *page, @@ -474,14 +468,14 @@ Else if there's a copy of the page resident in the cache: (4) When the read is complete, end_io_func() will be invoked with: - (*) The netfs data supplied when the cookie was created. + * The netfs data supplied when the cookie was created. - (*) The page descriptor. + * The page descriptor. - (*) The context argument passed to the above function. This will be + * The context argument passed to the above function. This will be maintained with the get_context/put_context functions mentioned above. - (*) An argument that's 0 on success or negative for an error code. + * An argument that's 0 on success or negative for an error code. If an error occurs, it should be assumed that the page contains no usable data. fscache_readpages_cancel() may need to be called. @@ -504,11 +498,11 @@ This function may also return -ENOMEM or -EINTR, in which case it won't have read any data from the cache. -PAGE ALLOCATE +Page Allocate ------------- Alternatively, if there's not expected to be any data in the cache for a page -because the file has been extended, a block can simply be allocated instead: +because the file has been extended, a block can simply be allocated instead:: int fscache_alloc_page(struct fscache_cookie *cookie, struct page *page, @@ -523,12 +517,12 @@ The mark_pages_cached() cookie operation will be called on the page if successful. -PAGE WRITE +Page Write ---------- Secondly, if the netfs changes the contents of the page (either due to an initial download or if a user performs a write), then the page should be -written back to the cache: +written back to the cache:: int fscache_write_page(struct fscache_cookie *cookie, struct page *page, @@ -566,11 +560,11 @@ place if unforeseen circumstances arose (such as a disk error). Writing takes place asynchronously. -MULTIPLE PAGE READ +Multiple Page Read ------------------ A facility is provided to read several pages at once, as requested by the -readpages() address space operation: +readpages() address space operation:: int fscache_read_or_alloc_pages(struct fscache_cookie *cookie, struct address_space *mapping, @@ -598,7 +592,7 @@ This works in a similar way to fscache_read_or_alloc_page(), except: be returned. Otherwise, if all pages had reads dispatched, then 0 will be returned, the - list will be empty and *nr_pages will be 0. + list will be empty and ``*nr_pages`` will be 0. (4) end_io_func will be called once for each page being read as the reads complete. It will be called in process context if error != 0, but it may @@ -609,13 +603,13 @@ some of the pages being read and some being allocated. Those pages will have been marked appropriately and will need uncaching. -CANCELLATION OF UNREAD PAGES +Cancellation of Unread Pages ---------------------------- If one or more pages are passed to fscache_read_or_alloc_pages() but not then read from the cache and also not read from the underlying filesystem then those pages will need to have any marks and reservations removed. This can be -done by calling: +done by calling:: void fscache_readpages_cancel(struct fscache_cookie *cookie, struct list_head *pages); @@ -625,11 +619,10 @@ fscache_read_or_alloc_pages(). Every page in the pages list will be examined and any that have PG_fscache set will be uncached. -============== -PAGE UNCACHING +Page Uncaching ============== -To uncache a page, this function should be called: +To uncache a page, this function should be called:: void fscache_uncache_page(struct fscache_cookie *cookie, struct page *page); @@ -644,12 +637,12 @@ data file must be retired (see the relinquish cookie function below). Furthermore, note that this does not cancel the asynchronous read or write operation started by the read/alloc and write functions, so the page -invalidation functions must use: +invalidation functions must use:: bool fscache_check_page_write(struct fscache_cookie *cookie, struct page *page); -to see if a page is being written to the cache, and: +to see if a page is being written to the cache, and:: void fscache_wait_on_page_write(struct fscache_cookie *cookie, struct page *page); @@ -660,7 +653,7 @@ to wait for it to finish if it is. When releasepage() is being implemented, a special FS-Cache function exists to manage the heuristics of coping with vmscan trying to eject pages, which may conflict with the cache trying to write pages to the cache (which may itself -need to allocate memory): +need to allocate memory):: bool fscache_maybe_release_page(struct fscache_cookie *cookie, struct page *page, @@ -676,12 +669,12 @@ storage request to complete, or it may attempt to cancel the storage request - in which case the page will not be stored in the cache this time. -BULK INODE PAGE UNCACHE +Bulk Image Page Uncache ----------------------- A convenience routine is provided to perform an uncache on all the pages attached to an inode. This assumes that the pages on the inode correspond on a -1:1 basis with the pages in the cache. +1:1 basis with the pages in the cache:: void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie, struct inode *inode); @@ -692,12 +685,11 @@ written to the cache and for the cache to finish with the page generally. No error is returned. -=============================== -INDEX AND DATA FILE CONSISTENCY +Index and Data File consistency =============================== To find out whether auxiliary data for an object is up to data within the -cache, the following function can be called: +cache, the following function can be called:: int fscache_check_consistency(struct fscache_cookie *cookie, const void *aux_data); @@ -708,7 +700,7 @@ data buffer first. It returns 0 if it is and -ESTALE if it isn't; it may also return -ENOMEM and -ERESTARTSYS. To request an update of the index data for an index or other object, the -following function should be called: +following function should be called:: void fscache_update_cookie(struct fscache_cookie *cookie, const void *aux_data); @@ -721,8 +713,7 @@ Note that partial updates may happen automatically at other times, such as when data blocks are added to a data file object. -================= -COOKIE ENABLEMENT +Cookie Enablement ================= Cookies exist in one of two states: enabled and disabled. If a cookie is @@ -731,7 +722,7 @@ invalidate its state; allocate, read or write backing pages - though it is still possible to uncache pages and relinquish the cookie. The initial enablement state is set by fscache_acquire_cookie(), but the cookie -can be enabled or disabled later. To disable a cookie, call: +can be enabled or disabled later. To disable a cookie, call:: void fscache_disable_cookie(struct fscache_cookie *cookie, const void *aux_data, @@ -746,7 +737,7 @@ All possible failures are handled internally. The caller should consider calling fscache_uncache_all_inode_pages() afterwards to make sure all page markings are cleared up. -Cookies can be enabled or reenabled with: +Cookies can be enabled or reenabled with:: void fscache_enable_cookie(struct fscache_cookie *cookie, const void *aux_data, @@ -771,13 +762,12 @@ In both cases, the cookie's auxiliary data buffer is updated from aux_data if that is non-NULL inside the enablement lock before proceeding. -=============================== -MISCELLANEOUS COOKIE OPERATIONS +Miscellaneous Cookie operations =============================== There are a number of operations that can be used to control cookies: - (*) Cookie pinning: + * Cookie pinning:: int fscache_pin_cookie(struct fscache_cookie *cookie); void fscache_unpin_cookie(struct fscache_cookie *cookie); @@ -790,7 +780,7 @@ There are a number of operations that can be used to control cookies: -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or -EIO if there's any other problem. - (*) Data space reservation: + * Data space reservation:: int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size); @@ -809,11 +799,10 @@ There are a number of operations that can be used to control cookies: make space if it's not in use. -===================== -COOKIE UNREGISTRATION +Cookie Unregistration ===================== -To get rid of a cookie, this function should be called. +To get rid of a cookie, this function should be called:: void fscache_relinquish_cookie(struct fscache_cookie *cookie, const void *aux_data, @@ -835,16 +824,14 @@ the cookies for "child" indices, objects and pages have been relinquished first. -================== -INDEX INVALIDATION +Index Invalidation ================== There is no direct way to invalidate an index subtree. To do this, the caller should relinquish and retire the cookie they have, and then acquire a new one. -====================== -DATA FILE INVALIDATION +Data File Invalidation ====================== Sometimes it will be necessary to invalidate an object that contains data. @@ -853,7 +840,7 @@ change - at which point the netfs has to throw away all the state it had for an inode and reload from the server. To indicate that a cache object should be invalidated, the following function -can be called: +can be called:: void fscache_invalidate(struct fscache_cookie *cookie); @@ -868,13 +855,12 @@ auxiliary data update operation as it is very likely these will have changed. Using the following function, the netfs can wait for the invalidation operation to have reached a point at which it can start submitting ordinary operations -once again: +once again:: void fscache_wait_on_invalidate(struct fscache_cookie *cookie); -=========================== -FS-CACHE SPECIFIC PAGE FLAG +FS-cache Specific Page Flag =========================== FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is @@ -898,7 +884,7 @@ was given under certain circumstances. This bit does not overlap with such as PG_private. This means that FS-Cache can be used with a filesystem that uses the block buffering code. -There are a number of operations defined on this flag: +There are a number of operations defined on this flag:: int PageFsCache(struct page *page); void SetPageFsCache(struct page *page) diff --git a/Documentation/filesystems/caching/object.txt b/Documentation/filesystems/caching/object.rst index 100ff41127e4..ce0e043ccd33 100644 --- a/Documentation/filesystems/caching/object.txt +++ b/Documentation/filesystems/caching/object.rst @@ -1,10 +1,12 @@ - ==================================================== - IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT - ==================================================== +.. SPDX-License-Identifier: GPL-2.0 + +==================================================== +In-Kernel Cache Object Representation and Management +==================================================== By: David Howells <dhowells@redhat.com> -Contents: +.. Contents: (*) Representation @@ -18,8 +20,7 @@ Contents: (*) The set of events. -============== -REPRESENTATION +Representation ============== FS-Cache maintains an in-kernel representation of each object that a netfs is @@ -38,7 +39,7 @@ or even by no objects (it may not be cached). Furthermore, both cookies and objects are hierarchical. The two hierarchies correspond, but the cookies tree is a superset of the union of the object trees -of multiple caches: +of multiple caches:: NETFS INDEX TREE : CACHE 1 : CACHE 2 : : @@ -89,8 +90,7 @@ pointers to the cookies. The cookies themselves and any objects attached to those cookies are hidden from it. -=============================== -OBJECT MANAGEMENT STATE MACHINE +Object Management State Machine =============================== Within FS-Cache, each active object is managed by its own individual state @@ -124,7 +124,7 @@ is not masked, the object will be queued for processing (by calling fscache_enqueue_object()). -PROVISION OF CPU TIME +Provision of CPU Time --------------------- The work to be done by the various states was given CPU time by the threads of @@ -141,7 +141,7 @@ because: workqueues don't necessarily have the right numbers of threads. -LOCKING SIMPLIFICATION +Locking Simplification ---------------------- Because only one worker thread may be operating on any particular object's @@ -151,8 +151,7 @@ from the cache backend's representation (fscache_object) - which may be requested from either end. -================= -THE SET OF STATES +The Set of States ================= The object state machine has a set of states that it can be in. There are @@ -275,19 +274,17 @@ memory and potentially deletes stuff from disk: this state. -THE SET OF EVENTS +The Set of Events ----------------- There are a number of events that can be raised to an object state machine: - (*) FSCACHE_OBJECT_EV_UPDATE - + FSCACHE_OBJECT_EV_UPDATE The netfs requested that an object be updated. The state machine will ask the cache backend to update the object, and the cache backend will ask the netfs for details of the change through its cookie definition ops. - (*) FSCACHE_OBJECT_EV_CLEARED - + FSCACHE_OBJECT_EV_CLEARED This is signalled in two circumstances: (a) when an object's last child object is dropped and @@ -296,20 +293,16 @@ There are a number of events that can be raised to an object state machine: This is used to proceed from the dying state. - (*) FSCACHE_OBJECT_EV_ERROR - + FSCACHE_OBJECT_EV_ERROR This is signalled when an I/O error occurs during the processing of some object. - (*) FSCACHE_OBJECT_EV_RELEASE - (*) FSCACHE_OBJECT_EV_RETIRE - + FSCACHE_OBJECT_EV_RELEASE, FSCACHE_OBJECT_EV_RETIRE These are signalled when the netfs relinquishes a cookie it was using. The event selected depends on whether the netfs asks for the backing object to be retired (deleted) or retained. - (*) FSCACHE_OBJECT_EV_WITHDRAW - + FSCACHE_OBJECT_EV_WITHDRAW This is signalled when the cache backend wants to withdraw an object. This means that the object will have to be detached from the netfs's cookie. diff --git a/Documentation/filesystems/caching/operations.txt b/Documentation/filesystems/caching/operations.rst index d8976c434718..f7ddcc028939 100644 --- a/Documentation/filesystems/caching/operations.txt +++ b/Documentation/filesystems/caching/operations.rst @@ -1,10 +1,12 @@ - ================================ - ASYNCHRONOUS OPERATIONS HANDLING - ================================ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +Asynchronous Operations Handling +================================ By: David Howells <dhowells@redhat.com> -Contents: +.. Contents: (*) Overview. @@ -17,8 +19,7 @@ Contents: (*) Asynchronous callback. -======== -OVERVIEW +Overview ======== FS-Cache has an asynchronous operations handling facility that it uses for its @@ -33,11 +34,10 @@ backend for completion. To make use of this facility, <linux/fscache-cache.h> should be #included. -=============================== -OPERATION RECORD INITIALISATION +Operation Record Initialisation =============================== -An operation is recorded in an fscache_operation struct: +An operation is recorded in an fscache_operation struct:: struct fscache_operation { union { @@ -50,7 +50,7 @@ An operation is recorded in an fscache_operation struct: }; Someone wanting to issue an operation should allocate something with this -struct embedded in it. They should initialise it by calling: +struct embedded in it. They should initialise it by calling:: void fscache_operation_init(struct fscache_operation *op, fscache_operation_release_t release); @@ -67,8 +67,7 @@ FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the operation and waited for afterwards. -========== -PARAMETERS +Parameters ========== There are a number of parameters that can be set in the operation record's flag @@ -87,7 +86,7 @@ operations: If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags before submitting the operation, and the operating thread must wait for it - to be cleared before proceeding: + to be cleared before proceeding:: wait_on_bit(&op->flags, FSCACHE_OP_WAITING, TASK_UNINTERRUPTIBLE); @@ -101,7 +100,7 @@ operations: page to a netfs page after the backing fs has read the page in. If this option is used, op->fast_work and op->processor must be - initialised before submitting the operation: + initialised before submitting the operation:: INIT_WORK(&op->fast_work, do_some_work); @@ -114,7 +113,7 @@ operations: pages that have just been fetched from a remote server. If this option is used, op->slow_work and op->processor must be - initialised before submitting the operation: + initialised before submitting the operation:: fscache_operation_init_slow(op, processor) @@ -132,8 +131,7 @@ Furthermore, operations may be one of two types: operations running at the same time. -========= -PROCEDURE +Procedure ========= Operations are used through the following procedure: @@ -143,7 +141,7 @@ Operations are used through the following procedure: generic op embedded within. (2) The submitting thread must then submit the operation for processing using - one of the following two functions: + one of the following two functions:: int fscache_submit_op(struct fscache_object *object, struct fscache_operation *op); @@ -164,7 +162,7 @@ Operations are used through the following procedure: operation of conflicting exclusivity is in progress on the object. If the operation is asynchronous, the manager will retain a reference to - it, so the caller should put their reference to it by passing it to: + it, so the caller should put their reference to it by passing it to:: void fscache_put_operation(struct fscache_operation *op); @@ -179,12 +177,12 @@ Operations are used through the following procedure: (4) The operation holds an effective lock upon the object, preventing other exclusive ops conflicting until it is released. The operation can be enqueued for further immediate asynchronous processing by adjusting the - CPU time provisioning option if necessary, eg: + CPU time provisioning option if necessary, eg:: op->flags &= ~FSCACHE_OP_TYPE; op->flags |= ~FSCACHE_OP_FAST; - and calling: + and calling:: void fscache_enqueue_operation(struct fscache_operation *op) @@ -192,13 +190,12 @@ Operations are used through the following procedure: pools. -===================== -ASYNCHRONOUS CALLBACK +Asynchronous Callback ===================== When used in asynchronous mode, the worker thread pool will invoke the processor method with a pointer to the operation. This should then get at the -container struct by using container_of(): +container struct by using container_of():: static void fscache_write_op(struct fscache_operation *_op) { diff --git a/Documentation/filesystems/cifs/cifsroot.txt b/Documentation/filesystems/cifs/cifsroot.rst index 947b7ec6ce9e..4930bb443134 100644 --- a/Documentation/filesystems/cifs/cifsroot.txt +++ b/Documentation/filesystems/cifs/cifsroot.rst @@ -1,7 +1,11 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================================== Mounting root file system via SMB (cifs.ko) =========================================== Written 2019 by Paulo Alcantara <palcantara@suse.de> + Written 2019 by Aurelien Aptel <aaptel@suse.com> The CONFIG_CIFS_ROOT option enables experimental root file system @@ -32,7 +36,7 @@ Server configuration ==================== To enable SMB1+UNIX extensions you will need to set these global -settings in Samba smb.conf: +settings in Samba smb.conf:: [global] server min protocol = NT1 @@ -41,12 +45,16 @@ settings in Samba smb.conf: Kernel command line =================== -root=/dev/cifs +:: + + root=/dev/cifs This is just a virtual device that basically tells the kernel to mount the root file system via SMB protocol. -cifsroot=//<server-ip>/<share>[,options] +:: + + cifsroot=//<server-ip>/<share>[,options] Enables the kernel to mount the root file system via SMB that are located in the <server-ip> and <share> specified in this option. @@ -65,33 +73,33 @@ options Examples ======== -Export root file system as a Samba share in smb.conf file. +Export root file system as a Samba share in smb.conf file:: -... -[linux] - path = /path/to/rootfs - read only = no - guest ok = yes - force user = root - force group = root - browseable = yes - writeable = yes - admin users = root - public = yes - create mask = 0777 - directory mask = 0777 -... + ... + [linux] + path = /path/to/rootfs + read only = no + guest ok = yes + force user = root + force group = root + browseable = yes + writeable = yes + admin users = root + public = yes + create mask = 0777 + directory mask = 0777 + ... -Restart smb service. +Restart smb service:: -# systemctl restart smb + # systemctl restart smb Test it under QEMU on a kernel built with CONFIG_CIFS_ROOT and -CONFIG_IP_PNP options enabled. +CONFIG_IP_PNP options enabled:: -# qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \ - -kernel /path/to/linux/arch/x86/boot/bzImage -nographic \ - -append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3" + # qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \ + -kernel /path/to/linux/arch/x86/boot/bzImage -nographic \ + -append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3" 1: https://wiki.samba.org/index.php/UNIX_Extensions diff --git a/Documentation/filesystems/coda.rst b/Documentation/filesystems/coda.rst new file mode 100644 index 000000000000..84c860c89887 --- /dev/null +++ b/Documentation/filesystems/coda.rst @@ -0,0 +1,1670 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +Coda Kernel-Venus Interface +=========================== + +.. Note:: + + This is one of the technical documents describing a component of + Coda -- this document describes the client kernel-Venus interface. + +For more information: + + http://www.coda.cs.cmu.edu + +For user level software needed to run Coda: + + ftp://ftp.coda.cs.cmu.edu + +To run Coda you need to get a user level cache manager for the client, +named Venus, as well as tools to manipulate ACLs, to log in, etc. The +client needs to have the Coda filesystem selected in the kernel +configuration. + +The server needs a user level server and at present does not depend on +kernel support. + + The Venus kernel interface + + Peter J. Braam + + v1.0, Nov 9, 1997 + + This document describes the communication between Venus and kernel + level filesystem code needed for the operation of the Coda file sys- + tem. This document version is meant to describe the current interface + (version 1.0) as well as improvements we envisage. + +.. Table of Contents + + 1. Introduction + + 2. Servicing Coda filesystem calls + + 3. The message layer + + 3.1 Implementation details + + 4. The interface at the call level + + 4.1 Data structures shared by the kernel and Venus + 4.2 The pioctl interface + 4.3 root + 4.4 lookup + 4.5 getattr + 4.6 setattr + 4.7 access + 4.8 create + 4.9 mkdir + 4.10 link + 4.11 symlink + 4.12 remove + 4.13 rmdir + 4.14 readlink + 4.15 open + 4.16 close + 4.17 ioctl + 4.18 rename + 4.19 readdir + 4.20 vget + 4.21 fsync + 4.22 inactive + 4.23 rdwr + 4.24 odymount + 4.25 ody_lookup + 4.26 ody_expand + 4.27 prefetch + 4.28 signal + + 5. The minicache and downcalls + + 5.1 INVALIDATE + 5.2 FLUSH + 5.3 PURGEUSER + 5.4 ZAPFILE + 5.5 ZAPDIR + 5.6 ZAPVNODE + 5.7 PURGEFID + 5.8 REPLACE + + 6. Initialization and cleanup + + 6.1 Requirements + +1. Introduction +=============== + + A key component in the Coda Distributed File System is the cache + manager, Venus. + + When processes on a Coda enabled system access files in the Coda + filesystem, requests are directed at the filesystem layer in the + operating system. The operating system will communicate with Venus to + service the request for the process. Venus manages a persistent + client cache and makes remote procedure calls to Coda file servers and + related servers (such as authentication servers) to service these + requests it receives from the operating system. When Venus has + serviced a request it replies to the operating system with appropriate + return codes, and other data related to the request. Optionally the + kernel support for Coda may maintain a minicache of recently processed + requests to limit the number of interactions with Venus. Venus + possesses the facility to inform the kernel when elements from its + minicache are no longer valid. + + This document describes precisely this communication between the + kernel and Venus. The definitions of so called upcalls and downcalls + will be given with the format of the data they handle. We shall also + describe the semantic invariants resulting from the calls. + + Historically Coda was implemented in a BSD file system in Mach 2.6. + The interface between the kernel and Venus is very similar to the BSD + VFS interface. Similar functionality is provided, and the format of + the parameters and returned data is very similar to the BSD VFS. This + leads to an almost natural environment for implementing a kernel-level + filesystem driver for Coda in a BSD system. However, other operating + systems such as Linux and Windows 95 and NT have virtual filesystem + with different interfaces. + + To implement Coda on these systems some reverse engineering of the + Venus/Kernel protocol is necessary. Also it came to light that other + systems could profit significantly from certain small optimizations + and modifications to the protocol. To facilitate this work as well as + to make future ports easier, communication between Venus and the + kernel should be documented in great detail. This is the aim of this + document. + +2. Servicing Coda filesystem calls +=================================== + + The service of a request for a Coda file system service originates in + a process P which accessing a Coda file. It makes a system call which + traps to the OS kernel. Examples of such calls trapping to the kernel + are ``read``, ``write``, ``open``, ``close``, ``create``, ``mkdir``, + ``rmdir``, ``chmod`` in a Unix ontext. Similar calls exist in the Win32 + environment, and are named ``CreateFile``. + + Generally the operating system handles the request in a virtual + filesystem (VFS) layer, which is named I/O Manager in NT and IFS + manager in Windows 95. The VFS is responsible for partial processing + of the request and for locating the specific filesystem(s) which will + service parts of the request. Usually the information in the path + assists in locating the correct FS drivers. Sometimes after extensive + pre-processing, the VFS starts invoking exported routines in the FS + driver. This is the point where the FS specific processing of the + request starts, and here the Coda specific kernel code comes into + play. + + The FS layer for Coda must expose and implement several interfaces. + First and foremost the VFS must be able to make all necessary calls to + the Coda FS layer, so the Coda FS driver must expose the VFS interface + as applicable in the operating system. These differ very significantly + among operating systems, but share features such as facilities to + read/write and create and remove objects. The Coda FS layer services + such VFS requests by invoking one or more well defined services + offered by the cache manager Venus. When the replies from Venus have + come back to the FS driver, servicing of the VFS call continues and + finishes with a reply to the kernel's VFS. Finally the VFS layer + returns to the process. + + As a result of this design a basic interface exposed by the FS driver + must allow Venus to manage message traffic. In particular Venus must + be able to retrieve and place messages and to be notified of the + arrival of a new message. The notification must be through a mechanism + which does not block Venus since Venus must attend to other tasks even + when no messages are waiting or being processed. + + **Interfaces of the Coda FS Driver** + + Furthermore the FS layer provides for a special path of communication + between a user process and Venus, called the pioctl interface. The + pioctl interface is used for Coda specific services, such as + requesting detailed information about the persistent cache managed by + Venus. Here the involvement of the kernel is minimal. It identifies + the calling process and passes the information on to Venus. When + Venus replies the response is passed back to the caller in unmodified + form. + + Finally Venus allows the kernel FS driver to cache the results from + certain services. This is done to avoid excessive context switches + and results in an efficient system. However, Venus may acquire + information, for example from the network which implies that cached + information must be flushed or replaced. Venus then makes a downcall + to the Coda FS layer to request flushes or updates in the cache. The + kernel FS driver handles such requests synchronously. + + Among these interfaces the VFS interface and the facility to place, + receive and be notified of messages are platform specific. We will + not go into the calls exported to the VFS layer but we will state the + requirements of the message exchange mechanism. + + +3. The message layer +===================== + + At the lowest level the communication between Venus and the FS driver + proceeds through messages. The synchronization between processes + requesting Coda file service and Venus relies on blocking and waking + up processes. The Coda FS driver processes VFS- and pioctl-requests + on behalf of a process P, creates messages for Venus, awaits replies + and finally returns to the caller. The implementation of the exchange + of messages is platform specific, but the semantics have (so far) + appeared to be generally applicable. Data buffers are created by the + FS Driver in kernel memory on behalf of P and copied to user memory in + Venus. + + The FS Driver while servicing P makes upcalls to Venus. Such an + upcall is dispatched to Venus by creating a message structure. The + structure contains the identification of P, the message sequence + number, the size of the request and a pointer to the data in kernel + memory for the request. Since the data buffer is re-used to hold the + reply from Venus, there is a field for the size of the reply. A flags + field is used in the message to precisely record the status of the + message. Additional platform dependent structures involve pointers to + determine the position of the message on queues and pointers to + synchronization objects. In the upcall routine the message structure + is filled in, flags are set to 0, and it is placed on the *pending* + queue. The routine calling upcall is responsible for allocating the + data buffer; its structure will be described in the next section. + + A facility must exist to notify Venus that the message has been + created, and implemented using available synchronization objects in + the OS. This notification is done in the upcall context of the process + P. When the message is on the pending queue, process P cannot proceed + in upcall. The (kernel mode) processing of P in the filesystem + request routine must be suspended until Venus has replied. Therefore + the calling thread in P is blocked in upcall. A pointer in the + message structure will locate the synchronization object on which P is + sleeping. + + Venus detects the notification that a message has arrived, and the FS + driver allow Venus to retrieve the message with a getmsg_from_kernel + call. This action finishes in the kernel by putting the message on the + queue of processing messages and setting flags to READ. Venus is + passed the contents of the data buffer. The getmsg_from_kernel call + now returns and Venus processes the request. + + At some later point the FS driver receives a message from Venus, + namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS + driver looks at the contents of the message and decides if: + + + * the message is a reply for a suspended thread P. If so it removes + the message from the processing queue and marks the message as + WRITTEN. Finally, the FS driver unblocks P (still in the kernel + mode context of Venus) and the sendmsg_to_kernel call returns to + Venus. The process P will be scheduled at some point and continues + processing its upcall with the data buffer replaced with the reply + from Venus. + + * The message is a ``downcall``. A downcall is a request from Venus to + the FS Driver. The FS driver processes the request immediately + (usually a cache eviction or replacement) and when it finishes + sendmsg_to_kernel returns. + + Now P awakes and continues processing upcall. There are some + subtleties to take account of. First P will determine if it was woken + up in upcall by a signal from some other source (for example an + attempt to terminate P) or as is normally the case by Venus in its + sendmsg_to_kernel call. In the normal case, the upcall routine will + deallocate the message structure and return. The FS routine can proceed + with its processing. + + + **Sleeping and IPC arrangements** + + In case P is woken up by a signal and not by Venus, it will first look + at the flags field. If the message is not yet READ, the process P can + handle its signal without notifying Venus. If Venus has READ, and + the request should not be processed, P can send Venus a signal message + to indicate that it should disregard the previous message. Such + signals are put in the queue at the head, and read first by Venus. If + the message is already marked as WRITTEN it is too late to stop the + processing. The VFS routine will now continue. (-- If a VFS request + involves more than one upcall, this can lead to complicated state, an + extra field "handle_signals" could be added in the message structure + to indicate points of no return have been passed.--) + + + +3.1. Implementation details +---------------------------- + + The Unix implementation of this mechanism has been through the + implementation of a character device associated with Coda. Venus + retrieves messages by doing a read on the device, replies are sent + with a write and notification is through the select system call on the + file descriptor for the device. The process P is kept waiting on an + interruptible wait queue object. + + In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl + call is used. The DeviceIoControl call is designed to copy buffers + from user memory to kernel memory with OPCODES. The sendmsg_to_kernel + is issued as a synchronous call, while the getmsg_from_kernel call is + asynchronous. Windows EventObjects are used for notification of + message arrival. The process P is kept waiting on a KernelEvent + object in NT and a semaphore in Windows 95. + + +4. The interface at the call level +=================================== + + + This section describes the upcalls a Coda FS driver can make to Venus. + Each of these upcalls make use of two structures: inputArgs and + outputArgs. In pseudo BNF form the structures take the following + form:: + + + struct inputArgs { + u_long opcode; + u_long unique; /* Keep multiple outstanding msgs distinct */ + u_short pid; /* Common to all */ + u_short pgid; /* Common to all */ + struct CodaCred cred; /* Common to all */ + + <union "in" of call dependent parts of inputArgs> + }; + + struct outputArgs { + u_long opcode; + u_long unique; /* Keep multiple outstanding msgs distinct */ + u_long result; + + <union "out" of call dependent parts of inputArgs> + }; + + + + Before going on let us elucidate the role of the various fields. The + inputArgs start with the opcode which defines the type of service + requested from Venus. There are approximately 30 upcalls at present + which we will discuss. The unique field labels the inputArg with a + unique number which will identify the message uniquely. A process and + process group id are passed. Finally the credentials of the caller + are included. + + Before delving into the specific calls we need to discuss a variety of + data structures shared by the kernel and Venus. + + + + +4.1. Data structures shared by the kernel and Venus +---------------------------------------------------- + + + The CodaCred structure defines a variety of user and group ids as + they are set for the calling process. The vuid_t and vgid_t are 32 bit + unsigned integers. It also defines group membership in an array. On + Unix the CodaCred has proven sufficient to implement good security + semantics for Coda but the structure may have to undergo modification + for the Windows environment when these mature:: + + struct CodaCred { + vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid */ + vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */ + vgid_t cr_groups[NGROUPS]; /* Group membership for caller */ + }; + + + .. Note:: + + It is questionable if we need CodaCreds in Venus. Finally Venus + doesn't know about groups, although it does create files with the + default uid/gid. Perhaps the list of group membership is superfluous. + + + The next item is the fundamental identifier used to identify Coda + files, the ViceFid. A fid of a file uniquely defines a file or + directory in the Coda filesystem within a cell [1]_:: + + typedef struct ViceFid { + VolumeId Volume; + VnodeId Vnode; + Unique_t Unique; + } ViceFid; + + .. [1] A cell is agroup of Coda servers acting under the aegis of a single + system control machine or SCM. See the Coda Administration manual + for a detailed description of the role of the SCM. + + Each of the constituent fields: VolumeId, VnodeId and Unique_t are + unsigned 32 bit integers. We envisage that a further field will need + to be prefixed to identify the Coda cell; this will probably take the + form of a Ipv6 size IP address naming the Coda cell through DNS. + + The next important structure shared between Venus and the kernel is + the attributes of the file. The following structure is used to + exchange information. It has room for future extensions such as + support for device files (currently not present in Coda):: + + + struct coda_timespec { + int64_t tv_sec; /* seconds */ + long tv_nsec; /* nanoseconds */ + }; + + struct coda_vattr { + enum coda_vtype va_type; /* vnode type (for create) */ + u_short va_mode; /* files access mode and type */ + short va_nlink; /* number of references to file */ + vuid_t va_uid; /* owner user id */ + vgid_t va_gid; /* owner group id */ + long va_fsid; /* file system id (dev for now) */ + long va_fileid; /* file id */ + u_quad_t va_size; /* file size in bytes */ + long va_blocksize; /* blocksize preferred for i/o */ + struct coda_timespec va_atime; /* time of last access */ + struct coda_timespec va_mtime; /* time of last modification */ + struct coda_timespec va_ctime; /* time file changed */ + u_long va_gen; /* generation number of file */ + u_long va_flags; /* flags defined for file */ + dev_t va_rdev; /* device special file represents */ + u_quad_t va_bytes; /* bytes of disk space held by file */ + u_quad_t va_filerev; /* file modification number */ + u_int va_vaflags; /* operations flags, see below */ + long va_spare; /* remain quad aligned */ + }; + + +4.2. The pioctl interface +-------------------------- + + + Coda specific requests can be made by application through the pioctl + interface. The pioctl is implemented as an ordinary ioctl on a + fictitious file /coda/.CONTROL. The pioctl call opens this file, gets + a file handle and makes the ioctl call. Finally it closes the file. + + The kernel involvement in this is limited to providing the facility to + open and close and pass the ioctl message and to verify that a path in + the pioctl data buffers is a file in a Coda filesystem. + + The kernel is handed a data packet of the form:: + + struct { + const char *path; + struct ViceIoctl vidata; + int follow; + } data; + + + + where:: + + + struct ViceIoctl { + caddr_t in, out; /* Data to be transferred in, or out */ + short in_size; /* Size of input buffer <= 2K */ + short out_size; /* Maximum size of output buffer, <= 2K */ + }; + + + + The path must be a Coda file, otherwise the ioctl upcall will not be + made. + + .. Note:: The data structures and code are a mess. We need to clean this up. + + +**We now proceed to document the individual calls**: + + +4.3. root +---------- + + + Arguments + in + + empty + + out:: + + struct cfs_root_out { + ViceFid VFid; + } cfs_root; + + + + Description + This call is made to Venus during the initialization of + the Coda filesystem. If the result is zero, the cfs_root structure + contains the ViceFid of the root of the Coda filesystem. If a non-zero + result is generated, its value is a platform dependent error code + indicating the difficulty Venus encountered in locating the root of + the Coda filesystem. + +4.4. lookup +------------ + + + Summary + Find the ViceFid and type of an object in a directory if it exists. + + Arguments + in:: + + struct cfs_lookup_in { + ViceFid VFid; + char *name; /* Place holder for data. */ + } cfs_lookup; + + + + out:: + + struct cfs_lookup_out { + ViceFid VFid; + int vtype; + } cfs_lookup; + + + + Description + This call is made to determine the ViceFid and filetype of + a directory entry. The directory entry requested carries name name + and Venus will search the directory identified by cfs_lookup_in.VFid. + The result may indicate that the name does not exist, or that + difficulty was encountered in finding it (e.g. due to disconnection). + If the result is zero, the field cfs_lookup_out.VFid contains the + targets ViceFid and cfs_lookup_out.vtype the coda_vtype giving the + type of object the name designates. + + The name of the object is an 8 bit character string of maximum length + CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator.) + + It is extremely important to realize that Venus bitwise ors the field + cfs_lookup.vtype with CFS_NOCACHE to indicate that the object should + not be put in the kernel name cache. + + .. Note:: + + The type of the vtype is currently wrong. It should be + coda_vtype. Linux does not take note of CFS_NOCACHE. It should. + + +4.5. getattr +------------- + + + Summary Get the attributes of a file. + + Arguments + in:: + + struct cfs_getattr_in { + ViceFid VFid; + struct coda_vattr attr; /* XXXXX */ + } cfs_getattr; + + + + out:: + + struct cfs_getattr_out { + struct coda_vattr attr; + } cfs_getattr; + + + + Description + This call returns the attributes of the file identified by fid. + + Errors + Errors can occur if the object with fid does not exist, is + unaccessible or if the caller does not have permission to fetch + attributes. + + .. Note:: + + Many kernel FS drivers (Linux, NT and Windows 95) need to acquire + the attributes as well as the Fid for the instantiation of an internal + "inode" or "FileHandle". A significant improvement in performance on + such systems could be made by combining the lookup and getattr calls + both at the Venus/kernel interaction level and at the RPC level. + + The vattr structure included in the input arguments is superfluous and + should be removed. + + +4.6. setattr +------------- + + + Summary + Set the attributes of a file. + + Arguments + in:: + + struct cfs_setattr_in { + ViceFid VFid; + struct coda_vattr attr; + } cfs_setattr; + + + + + out + + empty + + Description + The structure attr is filled with attributes to be changed + in BSD style. Attributes not to be changed are set to -1, apart from + vtype which is set to VNON. Other are set to the value to be assigned. + The only attributes which the FS driver may request to change are the + mode, owner, groupid, atime, mtime and ctime. The return value + indicates success or failure. + + Errors + A variety of errors can occur. The object may not exist, may + be inaccessible, or permission may not be granted by Venus. + + +4.7. access +------------ + + + Arguments + in:: + + struct cfs_access_in { + ViceFid VFid; + int flags; + } cfs_access; + + + + out + + empty + + Description + Verify if access to the object identified by VFid for + operations described by flags is permitted. The result indicates if + access will be granted. It is important to remember that Coda uses + ACLs to enforce protection and that ultimately the servers, not the + clients enforce the security of the system. The result of this call + will depend on whether a token is held by the user. + + Errors + The object may not exist, or the ACL describing the protection + may not be accessible. + + +4.8. create +------------ + + + Summary + Invoked to create a file + + Arguments + in:: + + struct cfs_create_in { + ViceFid VFid; + struct coda_vattr attr; + int excl; + int mode; + char *name; /* Place holder for data. */ + } cfs_create; + + + + + out:: + + struct cfs_create_out { + ViceFid VFid; + struct coda_vattr attr; + } cfs_create; + + + + Description + This upcall is invoked to request creation of a file. + The file will be created in the directory identified by VFid, its name + will be name, and the mode will be mode. If excl is set an error will + be returned if the file already exists. If the size field in attr is + set to zero the file will be truncated. The uid and gid of the file + are set by converting the CodaCred to a uid using a macro CRTOUID + (this macro is platform dependent). Upon success the VFid and + attributes of the file are returned. The Coda FS Driver will normally + instantiate a vnode, inode or file handle at kernel level for the new + object. + + + Errors + A variety of errors can occur. Permissions may be insufficient. + If the object exists and is not a file the error EISDIR is returned + under Unix. + + .. Note:: + + The packing of parameters is very inefficient and appears to + indicate confusion between the system call creat and the VFS operation + create. The VFS operation create is only called to create new objects. + This create call differs from the Unix one in that it is not invoked + to return a file descriptor. The truncate and exclusive options, + together with the mode, could simply be part of the mode as it is + under Unix. There should be no flags argument; this is used in open + (2) to return a file descriptor for READ or WRITE mode. + + The attributes of the directory should be returned too, since the size + and mtime changed. + + +4.9. mkdir +----------- + + + Summary + Create a new directory. + + Arguments + in:: + + struct cfs_mkdir_in { + ViceFid VFid; + struct coda_vattr attr; + char *name; /* Place holder for data. */ + } cfs_mkdir; + + + + out:: + + struct cfs_mkdir_out { + ViceFid VFid; + struct coda_vattr attr; + } cfs_mkdir; + + + + + Description + This call is similar to create but creates a directory. + Only the mode field in the input parameters is used for creation. + Upon successful creation, the attr returned contains the attributes of + the new directory. + + Errors + As for create. + + .. Note:: + + The input parameter should be changed to mode instead of + attributes. + + The attributes of the parent should be returned since the size and + mtime changes. + + +4.10. link +----------- + + + Summary + Create a link to an existing file. + + Arguments + in:: + + struct cfs_link_in { + ViceFid sourceFid; /* cnode to link *to* */ + ViceFid destFid; /* Directory in which to place link */ + char *tname; /* Place holder for data. */ + } cfs_link; + + + + out + + empty + + Description + This call creates a link to the sourceFid in the directory + identified by destFid with name tname. The source must reside in the + target's parent, i.e. the source must be have parent destFid, i.e. Coda + does not support cross directory hard links. Only the return value is + relevant. It indicates success or the type of failure. + + Errors + The usual errors can occur. + + +4.11. symlink +-------------- + + + Summary + create a symbolic link + + Arguments + in:: + + struct cfs_symlink_in { + ViceFid VFid; /* Directory to put symlink in */ + char *srcname; + struct coda_vattr attr; + char *tname; + } cfs_symlink; + + + + out + + none + + Description + Create a symbolic link. The link is to be placed in the + directory identified by VFid and named tname. It should point to the + pathname srcname. The attributes of the newly created object are to + be set to attr. + + .. Note:: + + The attributes of the target directory should be returned since + its size changed. + + +4.12. remove +------------- + + + Summary + Remove a file + + Arguments + in:: + + struct cfs_remove_in { + ViceFid VFid; + char *name; /* Place holder for data. */ + } cfs_remove; + + + + out + + none + + Description + Remove file named cfs_remove_in.name in directory + identified by VFid. + + + .. Note:: + + The attributes of the directory should be returned since its + mtime and size may change. + + +4.13. rmdir +------------ + + + Summary + Remove a directory + + Arguments + in:: + + struct cfs_rmdir_in { + ViceFid VFid; + char *name; /* Place holder for data. */ + } cfs_rmdir; + + + + out + + none + + Description + Remove the directory with name name from the directory + identified by VFid. + + .. Note:: The attributes of the parent directory should be returned since + its mtime and size may change. + + +4.14. readlink +--------------- + + + Summary + Read the value of a symbolic link. + + Arguments + in:: + + struct cfs_readlink_in { + ViceFid VFid; + } cfs_readlink; + + + + out:: + + struct cfs_readlink_out { + int count; + caddr_t data; /* Place holder for data. */ + } cfs_readlink; + + + + Description + This routine reads the contents of symbolic link + identified by VFid into the buffer data. The buffer data must be able + to hold any name up to CFS_MAXNAMLEN (PATH or NAM??). + + Errors + No unusual errors. + + +4.15. open +----------- + + + Summary + Open a file. + + Arguments + in:: + + struct cfs_open_in { + ViceFid VFid; + int flags; + } cfs_open; + + + + out:: + + struct cfs_open_out { + dev_t dev; + ino_t inode; + } cfs_open; + + + + Description + This request asks Venus to place the file identified by + VFid in its cache and to note that the calling process wishes to open + it with flags as in open(2). The return value to the kernel differs + for Unix and Windows systems. For Unix systems the Coda FS Driver is + informed of the device and inode number of the container file in the + fields dev and inode. For Windows the path of the container file is + returned to the kernel. + + + .. Note:: + + Currently the cfs_open_out structure is not properly adapted to + deal with the Windows case. It might be best to implement two + upcalls, one to open aiming at a container file name, the other at a + container file inode. + + +4.16. close +------------ + + + Summary + Close a file, update it on the servers. + + Arguments + in:: + + struct cfs_close_in { + ViceFid VFid; + int flags; + } cfs_close; + + + + out + + none + + Description + Close the file identified by VFid. + + .. Note:: + + The flags argument is bogus and not used. However, Venus' code + has room to deal with an execp input field, probably this field should + be used to inform Venus that the file was closed but is still memory + mapped for execution. There are comments about fetching versus not + fetching the data in Venus vproc_vfscalls. This seems silly. If a + file is being closed, the data in the container file is to be the new + data. Here again the execp flag might be in play to create confusion: + currently Venus might think a file can be flushed from the cache when + it is still memory mapped. This needs to be understood. + + +4.17. ioctl +------------ + + + Summary + Do an ioctl on a file. This includes the pioctl interface. + + Arguments + in:: + + struct cfs_ioctl_in { + ViceFid VFid; + int cmd; + int len; + int rwflag; + char *data; /* Place holder for data. */ + } cfs_ioctl; + + + + out:: + + + struct cfs_ioctl_out { + int len; + caddr_t data; /* Place holder for data. */ + } cfs_ioctl; + + + + Description + Do an ioctl operation on a file. The command, len and + data arguments are filled as usual. flags is not used by Venus. + + .. Note:: + + Another bogus parameter. flags is not used. What is the + business about PREFETCHING in the Venus code? + + + +4.18. rename +------------- + + + Summary + Rename a fid. + + Arguments + in:: + + struct cfs_rename_in { + ViceFid sourceFid; + char *srcname; + ViceFid destFid; + char *destname; + } cfs_rename; + + + + out + + none + + Description + Rename the object with name srcname in directory + sourceFid to destname in destFid. It is important that the names + srcname and destname are 0 terminated strings. Strings in Unix + kernels are not always null terminated. + + +4.19. readdir +-------------- + + + Summary + Read directory entries. + + Arguments + in:: + + struct cfs_readdir_in { + ViceFid VFid; + int count; + int offset; + } cfs_readdir; + + + + + out:: + + struct cfs_readdir_out { + int size; + caddr_t data; /* Place holder for data. */ + } cfs_readdir; + + + + Description + Read directory entries from VFid starting at offset and + read at most count bytes. Returns the data in data and returns + the size in size. + + + .. Note:: + + This call is not used. Readdir operations exploit container + files. We will re-evaluate this during the directory revamp which is + about to take place. + + +4.20. vget +----------- + + + Summary + instructs Venus to do an FSDB->Get. + + Arguments + in:: + + struct cfs_vget_in { + ViceFid VFid; + } cfs_vget; + + + + out:: + + struct cfs_vget_out { + ViceFid VFid; + int vtype; + } cfs_vget; + + + + Description + This upcall asks Venus to do a get operation on an fsobj + labelled by VFid. + + .. Note:: + + This operation is not used. However, it is extremely useful + since it can be used to deal with read/write memory mapped files. + These can be "pinned" in the Venus cache using vget and released with + inactive. + + +4.21. fsync +------------ + + + Summary + Tell Venus to update the RVM attributes of a file. + + Arguments + in:: + + struct cfs_fsync_in { + ViceFid VFid; + } cfs_fsync; + + + + out + + none + + Description + Ask Venus to update RVM attributes of object VFid. This + should be called as part of kernel level fsync type calls. The + result indicates if the syncing was successful. + + .. Note:: Linux does not implement this call. It should. + + +4.22. inactive +--------------- + + + Summary + Tell Venus a vnode is no longer in use. + + Arguments + in:: + + struct cfs_inactive_in { + ViceFid VFid; + } cfs_inactive; + + + + out + + none + + Description + This operation returns EOPNOTSUPP. + + .. Note:: This should perhaps be removed. + + +4.23. rdwr +----------- + + + Summary + Read or write from a file + + Arguments + in:: + + struct cfs_rdwr_in { + ViceFid VFid; + int rwflag; + int count; + int offset; + int ioflag; + caddr_t data; /* Place holder for data. */ + } cfs_rdwr; + + + + + out:: + + struct cfs_rdwr_out { + int rwflag; + int count; + caddr_t data; /* Place holder for data. */ + } cfs_rdwr; + + + + Description + This upcall asks Venus to read or write from a file. + + + .. Note:: + + It should be removed since it is against the Coda philosophy that + read/write operations never reach Venus. I have been told the + operation does not work. It is not currently used. + + + +4.24. odymount +--------------- + + + Summary + Allows mounting multiple Coda "filesystems" on one Unix mount point. + + Arguments + in:: + + struct ody_mount_in { + char *name; /* Place holder for data. */ + } ody_mount; + + + + out:: + + struct ody_mount_out { + ViceFid VFid; + } ody_mount; + + + + Description + Asks Venus to return the rootfid of a Coda system named + name. The fid is returned in VFid. + + .. Note:: + + This call was used by David for dynamic sets. It should be + removed since it causes a jungle of pointers in the VFS mounting area. + It is not used by Coda proper. Call is not implemented by Venus. + + +4.25. ody_lookup +----------------- + + + Summary + Looks up something. + + Arguments + in + + irrelevant + + + out + + irrelevant + + + .. Note:: Gut it. Call is not implemented by Venus. + + +4.26. ody_expand +----------------- + + + Summary + expands something in a dynamic set. + + Arguments + in + + irrelevant + + out + + irrelevant + + .. Note:: Gut it. Call is not implemented by Venus. + + +4.27. prefetch +--------------- + + + Summary + Prefetch a dynamic set. + + Arguments + + in + + Not documented. + + out + + Not documented. + + Description + Venus worker.cc has support for this call, although it is + noted that it doesn't work. Not surprising, since the kernel does not + have support for it. (ODY_PREFETCH is not a defined operation). + + + .. Note:: Gut it. It isn't working and isn't used by Coda. + + + +4.28. signal +------------- + + + Summary + Send Venus a signal about an upcall. + + Arguments + in + + none + + out + + not applicable. + + Description + This is an out-of-band upcall to Venus to inform Venus + that the calling process received a signal after Venus read the + message from the input queue. Venus is supposed to clean up the + operation. + + Errors + No reply is given. + + .. Note:: + + We need to better understand what Venus needs to clean up and if + it is doing this correctly. Also we need to handle multiple upcall + per system call situations correctly. It would be important to know + what state changes in Venus take place after an upcall for which the + kernel is responsible for notifying Venus to clean up (e.g. open + definitely is such a state change, but many others are maybe not). + + +5. The minicache and downcalls +=============================== + + + The Coda FS Driver can cache results of lookup and access upcalls, to + limit the frequency of upcalls. Upcalls carry a price since a process + context switch needs to take place. The counterpart of caching the + information is that Venus will notify the FS Driver that cached + entries must be flushed or renamed. + + The kernel code generally has to maintain a structure which links the + internal file handles (called vnodes in BSD, inodes in Linux and + FileHandles in Windows) with the ViceFid's which Venus maintains. The + reason is that frequent translations back and forth are needed in + order to make upcalls and use the results of upcalls. Such linking + objects are called cnodes. + + The current minicache implementations have cache entries which record + the following: + + 1. the name of the file + + 2. the cnode of the directory containing the object + + 3. a list of CodaCred's for which the lookup is permitted. + + 4. the cnode of the object + + The lookup call in the Coda FS Driver may request the cnode of the + desired object from the cache, by passing its name, directory and the + CodaCred's of the caller. The cache will return the cnode or indicate + that it cannot be found. The Coda FS Driver must be careful to + invalidate cache entries when it modifies or removes objects. + + When Venus obtains information that indicates that cache entries are + no longer valid, it will make a downcall to the kernel. Downcalls are + intercepted by the Coda FS Driver and lead to cache invalidations of + the kind described below. The Coda FS Driver does not return an error + unless the downcall data could not be read into kernel memory. + + +5.1. INVALIDATE +---------------- + + + No information is available on this call. + + +5.2. FLUSH +----------- + + + + Arguments + None + + Summary + Flush the name cache entirely. + + Description + Venus issues this call upon startup and when it dies. This + is to prevent stale cache information being held. Some operating + systems allow the kernel name cache to be switched off dynamically. + When this is done, this downcall is made. + + +5.3. PURGEUSER +--------------- + + + Arguments + :: + + struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */ + struct CodaCred cred; + } cfs_purgeuser; + + + + Description + Remove all entries in the cache carrying the Cred. This + call is issued when tokens for a user expire or are flushed. + + +5.4. ZAPFILE +------------- + + + Arguments + :: + + struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */ + ViceFid CodaFid; + } cfs_zapfile; + + + + Description + Remove all entries which have the (dir vnode, name) pair. + This is issued as a result of an invalidation of cached attributes of + a vnode. + + .. Note:: + + Call is not named correctly in NetBSD and Mach. The minicache + zapfile routine takes different arguments. Linux does not implement + the invalidation of attributes correctly. + + + +5.5. ZAPDIR +------------ + + + Arguments + :: + + struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */ + ViceFid CodaFid; + } cfs_zapdir; + + + + Description + Remove all entries in the cache lying in a directory + CodaFid, and all children of this directory. This call is issued when + Venus receives a callback on the directory. + + +5.6. ZAPVNODE +-------------- + + + + Arguments + :: + + struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */ + struct CodaCred cred; + ViceFid VFid; + } cfs_zapvnode; + + + + Description + Remove all entries in the cache carrying the cred and VFid + as in the arguments. This downcall is probably never issued. + + +5.7. PURGEFID +-------------- + + + Arguments + :: + + struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */ + ViceFid CodaFid; + } cfs_purgefid; + + + + Description + Flush the attribute for the file. If it is a dir (odd + vnode), purge its children from the namecache and remove the file from the + namecache. + + + +5.8. REPLACE +------------- + + + Summary + Replace the Fid's for a collection of names. + + Arguments + :: + + struct cfs_replace_out { /* cfs_replace is a venus->kernel call */ + ViceFid NewFid; + ViceFid OldFid; + } cfs_replace; + + + + Description + This routine replaces a ViceFid in the name cache with + another. It is added to allow Venus during reintegration to replace + locally allocated temp fids while disconnected with global fids even + when the reference counts on those fids are not zero. + + +6. Initialization and cleanup +============================== + + + This section gives brief hints as to desirable features for the Coda + FS Driver at startup and upon shutdown or Venus failures. Before + entering the discussion it is useful to repeat that the Coda FS Driver + maintains the following data: + + + 1. message queues + + 2. cnodes + + 3. name cache entries + + The name cache entries are entirely private to the driver, so they + can easily be manipulated. The message queues will generally have + clear points of initialization and destruction. The cnodes are + much more delicate. User processes hold reference counts in Coda + filesystems and it can be difficult to clean up the cnodes. + + It can expect requests through: + + 1. the message subsystem + + 2. the VFS layer + + 3. pioctl interface + + Currently the pioctl passes through the VFS for Coda so we can + treat these similarly. + + +6.1. Requirements +------------------ + + + The following requirements should be accommodated: + + 1. The message queues should have open and close routines. On Unix + the opening of the character devices are such routines. + + - Before opening, no messages can be placed. + + - Opening will remove any old messages still pending. + + - Close will notify any sleeping processes that their upcall cannot + be completed. + + - Close will free all memory allocated by the message queues. + + + 2. At open the namecache shall be initialized to empty state. + + 3. Before the message queues are open, all VFS operations will fail. + Fortunately this can be achieved by making sure than mounting the + Coda filesystem cannot succeed before opening. + + 4. After closing of the queues, no VFS operations can succeed. Here + one needs to be careful, since a few operations (lookup, + read/write, readdir) can proceed without upcalls. These must be + explicitly blocked. + + 5. Upon closing the namecache shall be flushed and disabled. + + 6. All memory held by cnodes can be freed without relying on upcalls. + + 7. Unmounting the file system can be done without relying on upcalls. + + 8. Mounting the Coda filesystem should fail gracefully if Venus cannot + get the rootfid or the attributes of the rootfid. The latter is + best implemented by Venus fetching these objects before attempting + to mount. + + .. Note:: + + NetBSD in particular but also Linux have not implemented the + above requirements fully. For smooth operation this needs to be + corrected. + + + diff --git a/Documentation/filesystems/coda.txt b/Documentation/filesystems/coda.txt deleted file mode 100644 index 1711ad48e38a..000000000000 --- a/Documentation/filesystems/coda.txt +++ /dev/null @@ -1,1676 +0,0 @@ -NOTE: -This is one of the technical documents describing a component of -Coda -- this document describes the client kernel-Venus interface. - -For more information: - http://www.coda.cs.cmu.edu -For user level software needed to run Coda: - ftp://ftp.coda.cs.cmu.edu - -To run Coda you need to get a user level cache manager for the client, -named Venus, as well as tools to manipulate ACLs, to log in, etc. The -client needs to have the Coda filesystem selected in the kernel -configuration. - -The server needs a user level server and at present does not depend on -kernel support. - - - - - - - - The Venus kernel interface - Peter J. Braam - v1.0, Nov 9, 1997 - - This document describes the communication between Venus and kernel - level filesystem code needed for the operation of the Coda file sys- - tem. This document version is meant to describe the current interface - (version 1.0) as well as improvements we envisage. - ______________________________________________________________________ - - Table of Contents - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 1. Introduction - - 2. Servicing Coda filesystem calls - - 3. The message layer - - 3.1 Implementation details - - 4. The interface at the call level - - 4.1 Data structures shared by the kernel and Venus - 4.2 The pioctl interface - 4.3 root - 4.4 lookup - 4.5 getattr - 4.6 setattr - 4.7 access - 4.8 create - 4.9 mkdir - 4.10 link - 4.11 symlink - 4.12 remove - 4.13 rmdir - 4.14 readlink - 4.15 open - 4.16 close - 4.17 ioctl - 4.18 rename - 4.19 readdir - 4.20 vget - 4.21 fsync - 4.22 inactive - 4.23 rdwr - 4.24 odymount - 4.25 ody_lookup - 4.26 ody_expand - 4.27 prefetch - 4.28 signal - - 5. The minicache and downcalls - - 5.1 INVALIDATE - 5.2 FLUSH - 5.3 PURGEUSER - 5.4 ZAPFILE - 5.5 ZAPDIR - 5.6 ZAPVNODE - 5.7 PURGEFID - 5.8 REPLACE - - 6. Initialization and cleanup - - 6.1 Requirements - - - ______________________________________________________________________ - 0wpage - - 11.. IInnttrroodduuccttiioonn - - - - A key component in the Coda Distributed File System is the cache - manager, _V_e_n_u_s. - - - When processes on a Coda enabled system access files in the Coda - filesystem, requests are directed at the filesystem layer in the - operating system. The operating system will communicate with Venus to - service the request for the process. Venus manages a persistent - client cache and makes remote procedure calls to Coda file servers and - related servers (such as authentication servers) to service these - requests it receives from the operating system. When Venus has - serviced a request it replies to the operating system with appropriate - return codes, and other data related to the request. Optionally the - kernel support for Coda may maintain a minicache of recently processed - requests to limit the number of interactions with Venus. Venus - possesses the facility to inform the kernel when elements from its - minicache are no longer valid. - - This document describes precisely this communication between the - kernel and Venus. The definitions of so called upcalls and downcalls - will be given with the format of the data they handle. We shall also - describe the semantic invariants resulting from the calls. - - Historically Coda was implemented in a BSD file system in Mach 2.6. - The interface between the kernel and Venus is very similar to the BSD - VFS interface. Similar functionality is provided, and the format of - the parameters and returned data is very similar to the BSD VFS. This - leads to an almost natural environment for implementing a kernel-level - filesystem driver for Coda in a BSD system. However, other operating - systems such as Linux and Windows 95 and NT have virtual filesystem - with different interfaces. - - To implement Coda on these systems some reverse engineering of the - Venus/Kernel protocol is necessary. Also it came to light that other - systems could profit significantly from certain small optimizations - and modifications to the protocol. To facilitate this work as well as - to make future ports easier, communication between Venus and the - kernel should be documented in great detail. This is the aim of this - document. - - 0wpage - - 22.. SSeerrvviicciinngg CCooddaa ffiilleessyysstteemm ccaallllss - - The service of a request for a Coda file system service originates in - a process PP which accessing a Coda file. It makes a system call which - traps to the OS kernel. Examples of such calls trapping to the kernel - are _r_e_a_d_, _w_r_i_t_e_, _o_p_e_n_, _c_l_o_s_e_, _c_r_e_a_t_e_, _m_k_d_i_r_, _r_m_d_i_r_, _c_h_m_o_d in a Unix - context. Similar calls exist in the Win32 environment, and are named - _C_r_e_a_t_e_F_i_l_e_, . - - Generally the operating system handles the request in a virtual - filesystem (VFS) layer, which is named I/O Manager in NT and IFS - manager in Windows 95. The VFS is responsible for partial processing - of the request and for locating the specific filesystem(s) which will - service parts of the request. Usually the information in the path - assists in locating the correct FS drivers. Sometimes after extensive - pre-processing, the VFS starts invoking exported routines in the FS - driver. This is the point where the FS specific processing of the - request starts, and here the Coda specific kernel code comes into - play. - - The FS layer for Coda must expose and implement several interfaces. - First and foremost the VFS must be able to make all necessary calls to - the Coda FS layer, so the Coda FS driver must expose the VFS interface - as applicable in the operating system. These differ very significantly - among operating systems, but share features such as facilities to - read/write and create and remove objects. The Coda FS layer services - such VFS requests by invoking one or more well defined services - offered by the cache manager Venus. When the replies from Venus have - come back to the FS driver, servicing of the VFS call continues and - finishes with a reply to the kernel's VFS. Finally the VFS layer - returns to the process. - - As a result of this design a basic interface exposed by the FS driver - must allow Venus to manage message traffic. In particular Venus must - be able to retrieve and place messages and to be notified of the - arrival of a new message. The notification must be through a mechanism - which does not block Venus since Venus must attend to other tasks even - when no messages are waiting or being processed. - - - - - - - Interfaces of the Coda FS Driver - - Furthermore the FS layer provides for a special path of communication - between a user process and Venus, called the pioctl interface. The - pioctl interface is used for Coda specific services, such as - requesting detailed information about the persistent cache managed by - Venus. Here the involvement of the kernel is minimal. It identifies - the calling process and passes the information on to Venus. When - Venus replies the response is passed back to the caller in unmodified - form. - - Finally Venus allows the kernel FS driver to cache the results from - certain services. This is done to avoid excessive context switches - and results in an efficient system. However, Venus may acquire - information, for example from the network which implies that cached - information must be flushed or replaced. Venus then makes a downcall - to the Coda FS layer to request flushes or updates in the cache. The - kernel FS driver handles such requests synchronously. - - Among these interfaces the VFS interface and the facility to place, - receive and be notified of messages are platform specific. We will - not go into the calls exported to the VFS layer but we will state the - requirements of the message exchange mechanism. - - 0wpage - - 33.. TThhee mmeessssaaggee llaayyeerr - - - - At the lowest level the communication between Venus and the FS driver - proceeds through messages. The synchronization between processes - requesting Coda file service and Venus relies on blocking and waking - up processes. The Coda FS driver processes VFS- and pioctl-requests - on behalf of a process P, creates messages for Venus, awaits replies - and finally returns to the caller. The implementation of the exchange - of messages is platform specific, but the semantics have (so far) - appeared to be generally applicable. Data buffers are created by the - FS Driver in kernel memory on behalf of P and copied to user memory in - Venus. - - The FS Driver while servicing P makes upcalls to Venus. Such an - upcall is dispatched to Venus by creating a message structure. The - structure contains the identification of P, the message sequence - number, the size of the request and a pointer to the data in kernel - memory for the request. Since the data buffer is re-used to hold the - reply from Venus, there is a field for the size of the reply. A flags - field is used in the message to precisely record the status of the - message. Additional platform dependent structures involve pointers to - determine the position of the message on queues and pointers to - synchronization objects. In the upcall routine the message structure - is filled in, flags are set to 0, and it is placed on the _p_e_n_d_i_n_g - queue. The routine calling upcall is responsible for allocating the - data buffer; its structure will be described in the next section. - - A facility must exist to notify Venus that the message has been - created, and implemented using available synchronization objects in - the OS. This notification is done in the upcall context of the process - P. When the message is on the pending queue, process P cannot proceed - in upcall. The (kernel mode) processing of P in the filesystem - request routine must be suspended until Venus has replied. Therefore - the calling thread in P is blocked in upcall. A pointer in the - message structure will locate the synchronization object on which P is - sleeping. - - Venus detects the notification that a message has arrived, and the FS - driver allow Venus to retrieve the message with a getmsg_from_kernel - call. This action finishes in the kernel by putting the message on the - queue of processing messages and setting flags to READ. Venus is - passed the contents of the data buffer. The getmsg_from_kernel call - now returns and Venus processes the request. - - At some later point the FS driver receives a message from Venus, - namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS - driver looks at the contents of the message and decides if: - - - +o the message is a reply for a suspended thread P. If so it removes - the message from the processing queue and marks the message as - WRITTEN. Finally, the FS driver unblocks P (still in the kernel - mode context of Venus) and the sendmsg_to_kernel call returns to - Venus. The process P will be scheduled at some point and continues - processing its upcall with the data buffer replaced with the reply - from Venus. - - +o The message is a _d_o_w_n_c_a_l_l. A downcall is a request from Venus to - the FS Driver. The FS driver processes the request immediately - (usually a cache eviction or replacement) and when it finishes - sendmsg_to_kernel returns. - - Now P awakes and continues processing upcall. There are some - subtleties to take account of. First P will determine if it was woken - up in upcall by a signal from some other source (for example an - attempt to terminate P) or as is normally the case by Venus in its - sendmsg_to_kernel call. In the normal case, the upcall routine will - deallocate the message structure and return. The FS routine can proceed - with its processing. - - - - - - - - Sleeping and IPC arrangements - - In case P is woken up by a signal and not by Venus, it will first look - at the flags field. If the message is not yet READ, the process P can - handle its signal without notifying Venus. If Venus has READ, and - the request should not be processed, P can send Venus a signal message - to indicate that it should disregard the previous message. Such - signals are put in the queue at the head, and read first by Venus. If - the message is already marked as WRITTEN it is too late to stop the - processing. The VFS routine will now continue. (-- If a VFS request - involves more than one upcall, this can lead to complicated state, an - extra field "handle_signals" could be added in the message structure - to indicate points of no return have been passed.--) - - - - 33..11.. IImmpplleemmeennttaattiioonn ddeettaaiillss - - The Unix implementation of this mechanism has been through the - implementation of a character device associated with Coda. Venus - retrieves messages by doing a read on the device, replies are sent - with a write and notification is through the select system call on the - file descriptor for the device. The process P is kept waiting on an - interruptible wait queue object. - - In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl - call is used. The DeviceIoControl call is designed to copy buffers - from user memory to kernel memory with OPCODES. The sendmsg_to_kernel - is issued as a synchronous call, while the getmsg_from_kernel call is - asynchronous. Windows EventObjects are used for notification of - message arrival. The process P is kept waiting on a KernelEvent - object in NT and a semaphore in Windows 95. - - 0wpage - - 44.. TThhee iinntteerrffaaccee aatt tthhee ccaallll lleevveell - - - This section describes the upcalls a Coda FS driver can make to Venus. - Each of these upcalls make use of two structures: inputArgs and - outputArgs. In pseudo BNF form the structures take the following - form: - - - struct inputArgs { - u_long opcode; - u_long unique; /* Keep multiple outstanding msgs distinct */ - u_short pid; /* Common to all */ - u_short pgid; /* Common to all */ - struct CodaCred cred; /* Common to all */ - - <union "in" of call dependent parts of inputArgs> - }; - - struct outputArgs { - u_long opcode; - u_long unique; /* Keep multiple outstanding msgs distinct */ - u_long result; - - <union "out" of call dependent parts of inputArgs> - }; - - - - Before going on let us elucidate the role of the various fields. The - inputArgs start with the opcode which defines the type of service - requested from Venus. There are approximately 30 upcalls at present - which we will discuss. The unique field labels the inputArg with a - unique number which will identify the message uniquely. A process and - process group id are passed. Finally the credentials of the caller - are included. - - Before delving into the specific calls we need to discuss a variety of - data structures shared by the kernel and Venus. - - - - - 44..11.. DDaattaa ssttrruuccttuurreess sshhaarreedd bbyy tthhee kkeerrnneell aanndd VVeennuuss - - - The CodaCred structure defines a variety of user and group ids as - they are set for the calling process. The vuid_t and vgid_t are 32 bit - unsigned integers. It also defines group membership in an array. On - Unix the CodaCred has proven sufficient to implement good security - semantics for Coda but the structure may have to undergo modification - for the Windows environment when these mature. - - struct CodaCred { - vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid */ - vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */ - vgid_t cr_groups[NGROUPS]; /* Group membership for caller */ - }; - - - - NNOOTTEE It is questionable if we need CodaCreds in Venus. Finally Venus - doesn't know about groups, although it does create files with the - default uid/gid. Perhaps the list of group membership is superfluous. - - - The next item is the fundamental identifier used to identify Coda - files, the ViceFid. A fid of a file uniquely defines a file or - directory in the Coda filesystem within a _c_e_l_l. (-- A _c_e_l_l is a - group of Coda servers acting under the aegis of a single system - control machine or SCM. See the Coda Administration manual for a - detailed description of the role of the SCM.--) - - - typedef struct ViceFid { - VolumeId Volume; - VnodeId Vnode; - Unique_t Unique; - } ViceFid; - - - - Each of the constituent fields: VolumeId, VnodeId and Unique_t are - unsigned 32 bit integers. We envisage that a further field will need - to be prefixed to identify the Coda cell; this will probably take the - form of a Ipv6 size IP address naming the Coda cell through DNS. - - The next important structure shared between Venus and the kernel is - the attributes of the file. The following structure is used to - exchange information. It has room for future extensions such as - support for device files (currently not present in Coda). - - - - - - - - - - - - - - - - - struct coda_timespec { - int64_t tv_sec; /* seconds */ - long tv_nsec; /* nanoseconds */ - }; - - struct coda_vattr { - enum coda_vtype va_type; /* vnode type (for create) */ - u_short va_mode; /* files access mode and type */ - short va_nlink; /* number of references to file */ - vuid_t va_uid; /* owner user id */ - vgid_t va_gid; /* owner group id */ - long va_fsid; /* file system id (dev for now) */ - long va_fileid; /* file id */ - u_quad_t va_size; /* file size in bytes */ - long va_blocksize; /* blocksize preferred for i/o */ - struct coda_timespec va_atime; /* time of last access */ - struct coda_timespec va_mtime; /* time of last modification */ - struct coda_timespec va_ctime; /* time file changed */ - u_long va_gen; /* generation number of file */ - u_long va_flags; /* flags defined for file */ - dev_t va_rdev; /* device special file represents */ - u_quad_t va_bytes; /* bytes of disk space held by file */ - u_quad_t va_filerev; /* file modification number */ - u_int va_vaflags; /* operations flags, see below */ - long va_spare; /* remain quad aligned */ - }; - - - - - 44..22.. TThhee ppiiooccttll iinntteerrffaaccee - - - Coda specific requests can be made by application through the pioctl - interface. The pioctl is implemented as an ordinary ioctl on a - fictitious file /coda/.CONTROL. The pioctl call opens this file, gets - a file handle and makes the ioctl call. Finally it closes the file. - - The kernel involvement in this is limited to providing the facility to - open and close and pass the ioctl message _a_n_d to verify that a path in - the pioctl data buffers is a file in a Coda filesystem. - - The kernel is handed a data packet of the form: - - struct { - const char *path; - struct ViceIoctl vidata; - int follow; - } data; - - - - where - - - struct ViceIoctl { - caddr_t in, out; /* Data to be transferred in, or out */ - short in_size; /* Size of input buffer <= 2K */ - short out_size; /* Maximum size of output buffer, <= 2K */ - }; - - - - The path must be a Coda file, otherwise the ioctl upcall will not be - made. - - NNOOTTEE The data structures and code are a mess. We need to clean this - up. - - We now proceed to document the individual calls: - - 0wpage - - 44..33.. rroooott - - - AArrgguummeennttss - - iinn empty - - oouutt - - struct cfs_root_out { - ViceFid VFid; - } cfs_root; - - - - DDeessccrriippttiioonn This call is made to Venus during the initialization of - the Coda filesystem. If the result is zero, the cfs_root structure - contains the ViceFid of the root of the Coda filesystem. If a non-zero - result is generated, its value is a platform dependent error code - indicating the difficulty Venus encountered in locating the root of - the Coda filesystem. - - 0wpage - - 44..44.. llooookkuupp - - - SSuummmmaarryy Find the ViceFid and type of an object in a directory if it - exists. - - AArrgguummeennttss - - iinn - - struct cfs_lookup_in { - ViceFid VFid; - char *name; /* Place holder for data. */ - } cfs_lookup; - - - - oouutt - - struct cfs_lookup_out { - ViceFid VFid; - int vtype; - } cfs_lookup; - - - - DDeessccrriippttiioonn This call is made to determine the ViceFid and filetype of - a directory entry. The directory entry requested carries name name - and Venus will search the directory identified by cfs_lookup_in.VFid. - The result may indicate that the name does not exist, or that - difficulty was encountered in finding it (e.g. due to disconnection). - If the result is zero, the field cfs_lookup_out.VFid contains the - targets ViceFid and cfs_lookup_out.vtype the coda_vtype giving the - type of object the name designates. - - The name of the object is an 8 bit character string of maximum length - CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator.) - - It is extremely important to realize that Venus bitwise ors the field - cfs_lookup.vtype with CFS_NOCACHE to indicate that the object should - not be put in the kernel name cache. - - NNOOTTEE The type of the vtype is currently wrong. It should be - coda_vtype. Linux does not take note of CFS_NOCACHE. It should. - - 0wpage - - 44..55.. ggeettaattttrr - - - SSuummmmaarryy Get the attributes of a file. - - AArrgguummeennttss - - iinn - - struct cfs_getattr_in { - ViceFid VFid; - struct coda_vattr attr; /* XXXXX */ - } cfs_getattr; - - - - oouutt - - struct cfs_getattr_out { - struct coda_vattr attr; - } cfs_getattr; - - - - DDeessccrriippttiioonn This call returns the attributes of the file identified by - fid. - - EErrrroorrss Errors can occur if the object with fid does not exist, is - unaccessible or if the caller does not have permission to fetch - attributes. - - NNoottee Many kernel FS drivers (Linux, NT and Windows 95) need to acquire - the attributes as well as the Fid for the instantiation of an internal - "inode" or "FileHandle". A significant improvement in performance on - such systems could be made by combining the _l_o_o_k_u_p and _g_e_t_a_t_t_r calls - both at the Venus/kernel interaction level and at the RPC level. - - The vattr structure included in the input arguments is superfluous and - should be removed. - - 0wpage - - 44..66.. sseettaattttrr - - - SSuummmmaarryy Set the attributes of a file. - - AArrgguummeennttss - - iinn - - struct cfs_setattr_in { - ViceFid VFid; - struct coda_vattr attr; - } cfs_setattr; - - - - - oouutt - empty - - DDeessccrriippttiioonn The structure attr is filled with attributes to be changed - in BSD style. Attributes not to be changed are set to -1, apart from - vtype which is set to VNON. Other are set to the value to be assigned. - The only attributes which the FS driver may request to change are the - mode, owner, groupid, atime, mtime and ctime. The return value - indicates success or failure. - - EErrrroorrss A variety of errors can occur. The object may not exist, may - be inaccessible, or permission may not be granted by Venus. - - 0wpage - - 44..77.. aacccceessss - - - SSuummmmaarryy - - AArrgguummeennttss - - iinn - - struct cfs_access_in { - ViceFid VFid; - int flags; - } cfs_access; - - - - oouutt - empty - - DDeessccrriippttiioonn Verify if access to the object identified by VFid for - operations described by flags is permitted. The result indicates if - access will be granted. It is important to remember that Coda uses - ACLs to enforce protection and that ultimately the servers, not the - clients enforce the security of the system. The result of this call - will depend on whether a _t_o_k_e_n is held by the user. - - EErrrroorrss The object may not exist, or the ACL describing the protection - may not be accessible. - - 0wpage - - 44..88.. ccrreeaattee - - - SSuummmmaarryy Invoked to create a file - - AArrgguummeennttss - - iinn - - struct cfs_create_in { - ViceFid VFid; - struct coda_vattr attr; - int excl; - int mode; - char *name; /* Place holder for data. */ - } cfs_create; - - - - - oouutt - - struct cfs_create_out { - ViceFid VFid; - struct coda_vattr attr; - } cfs_create; - - - - DDeessccrriippttiioonn This upcall is invoked to request creation of a file. - The file will be created in the directory identified by VFid, its name - will be name, and the mode will be mode. If excl is set an error will - be returned if the file already exists. If the size field in attr is - set to zero the file will be truncated. The uid and gid of the file - are set by converting the CodaCred to a uid using a macro CRTOUID - (this macro is platform dependent). Upon success the VFid and - attributes of the file are returned. The Coda FS Driver will normally - instantiate a vnode, inode or file handle at kernel level for the new - object. - - - EErrrroorrss A variety of errors can occur. Permissions may be insufficient. - If the object exists and is not a file the error EISDIR is returned - under Unix. - - NNOOTTEE The packing of parameters is very inefficient and appears to - indicate confusion between the system call creat and the VFS operation - create. The VFS operation create is only called to create new objects. - This create call differs from the Unix one in that it is not invoked - to return a file descriptor. The truncate and exclusive options, - together with the mode, could simply be part of the mode as it is - under Unix. There should be no flags argument; this is used in open - (2) to return a file descriptor for READ or WRITE mode. - - The attributes of the directory should be returned too, since the size - and mtime changed. - - 0wpage - - 44..99.. mmkkddiirr - - - SSuummmmaarryy Create a new directory. - - AArrgguummeennttss - - iinn - - struct cfs_mkdir_in { - ViceFid VFid; - struct coda_vattr attr; - char *name; /* Place holder for data. */ - } cfs_mkdir; - - - - oouutt - - struct cfs_mkdir_out { - ViceFid VFid; - struct coda_vattr attr; - } cfs_mkdir; - - - - - DDeessccrriippttiioonn This call is similar to create but creates a directory. - Only the mode field in the input parameters is used for creation. - Upon successful creation, the attr returned contains the attributes of - the new directory. - - EErrrroorrss As for create. - - NNOOTTEE The input parameter should be changed to mode instead of - attributes. - - The attributes of the parent should be returned since the size and - mtime changes. - - 0wpage - - 44..1100.. lliinnkk - - - SSuummmmaarryy Create a link to an existing file. - - AArrgguummeennttss - - iinn - - struct cfs_link_in { - ViceFid sourceFid; /* cnode to link *to* */ - ViceFid destFid; /* Directory in which to place link */ - char *tname; /* Place holder for data. */ - } cfs_link; - - - - oouutt - empty - - DDeessccrriippttiioonn This call creates a link to the sourceFid in the directory - identified by destFid with name tname. The source must reside in the - target's parent, i.e. the source must be have parent destFid, i.e. Coda - does not support cross directory hard links. Only the return value is - relevant. It indicates success or the type of failure. - - EErrrroorrss The usual errors can occur.0wpage - - 44..1111.. ssyymmlliinnkk - - - SSuummmmaarryy create a symbolic link - - AArrgguummeennttss - - iinn - - struct cfs_symlink_in { - ViceFid VFid; /* Directory to put symlink in */ - char *srcname; - struct coda_vattr attr; - char *tname; - } cfs_symlink; - - - - oouutt - none - - DDeessccrriippttiioonn Create a symbolic link. The link is to be placed in the - directory identified by VFid and named tname. It should point to the - pathname srcname. The attributes of the newly created object are to - be set to attr. - - EErrrroorrss - - NNOOTTEE The attributes of the target directory should be returned since - its size changed. - - 0wpage - - 44..1122.. rreemmoovvee - - - SSuummmmaarryy Remove a file - - AArrgguummeennttss - - iinn - - struct cfs_remove_in { - ViceFid VFid; - char *name; /* Place holder for data. */ - } cfs_remove; - - - - oouutt - none - - DDeessccrriippttiioonn Remove file named cfs_remove_in.name in directory - identified by VFid. - - EErrrroorrss - - NNOOTTEE The attributes of the directory should be returned since its - mtime and size may change. - - 0wpage - - 44..1133.. rrmmddiirr - - - SSuummmmaarryy Remove a directory - - AArrgguummeennttss - - iinn - - struct cfs_rmdir_in { - ViceFid VFid; - char *name; /* Place holder for data. */ - } cfs_rmdir; - - - - oouutt - none - - DDeessccrriippttiioonn Remove the directory with name name from the directory - identified by VFid. - - EErrrroorrss - - NNOOTTEE The attributes of the parent directory should be returned since - its mtime and size may change. - - 0wpage - - 44..1144.. rreeaaddlliinnkk - - - SSuummmmaarryy Read the value of a symbolic link. - - AArrgguummeennttss - - iinn - - struct cfs_readlink_in { - ViceFid VFid; - } cfs_readlink; - - - - oouutt - - struct cfs_readlink_out { - int count; - caddr_t data; /* Place holder for data. */ - } cfs_readlink; - - - - DDeessccrriippttiioonn This routine reads the contents of symbolic link - identified by VFid into the buffer data. The buffer data must be able - to hold any name up to CFS_MAXNAMLEN (PATH or NAM??). - - EErrrroorrss No unusual errors. - - 0wpage - - 44..1155.. ooppeenn - - - SSuummmmaarryy Open a file. - - AArrgguummeennttss - - iinn - - struct cfs_open_in { - ViceFid VFid; - int flags; - } cfs_open; - - - - oouutt - - struct cfs_open_out { - dev_t dev; - ino_t inode; - } cfs_open; - - - - DDeessccrriippttiioonn This request asks Venus to place the file identified by - VFid in its cache and to note that the calling process wishes to open - it with flags as in open(2). The return value to the kernel differs - for Unix and Windows systems. For Unix systems the Coda FS Driver is - informed of the device and inode number of the container file in the - fields dev and inode. For Windows the path of the container file is - returned to the kernel. - EErrrroorrss - - NNOOTTEE Currently the cfs_open_out structure is not properly adapted to - deal with the Windows case. It might be best to implement two - upcalls, one to open aiming at a container file name, the other at a - container file inode. - - 0wpage - - 44..1166.. cclloossee - - - SSuummmmaarryy Close a file, update it on the servers. - - AArrgguummeennttss - - iinn - - struct cfs_close_in { - ViceFid VFid; - int flags; - } cfs_close; - - - - oouutt - none - - DDeessccrriippttiioonn Close the file identified by VFid. - - EErrrroorrss - - NNOOTTEE The flags argument is bogus and not used. However, Venus' code - has room to deal with an execp input field, probably this field should - be used to inform Venus that the file was closed but is still memory - mapped for execution. There are comments about fetching versus not - fetching the data in Venus vproc_vfscalls. This seems silly. If a - file is being closed, the data in the container file is to be the new - data. Here again the execp flag might be in play to create confusion: - currently Venus might think a file can be flushed from the cache when - it is still memory mapped. This needs to be understood. - - 0wpage - - 44..1177.. iiooccttll - - - SSuummmmaarryy Do an ioctl on a file. This includes the pioctl interface. - - AArrgguummeennttss - - iinn - - struct cfs_ioctl_in { - ViceFid VFid; - int cmd; - int len; - int rwflag; - char *data; /* Place holder for data. */ - } cfs_ioctl; - - - - oouutt - - - struct cfs_ioctl_out { - int len; - caddr_t data; /* Place holder for data. */ - } cfs_ioctl; - - - - DDeessccrriippttiioonn Do an ioctl operation on a file. The command, len and - data arguments are filled as usual. flags is not used by Venus. - - EErrrroorrss - - NNOOTTEE Another bogus parameter. flags is not used. What is the - business about PREFETCHING in the Venus code? - - - 0wpage - - 44..1188.. rreennaammee - - - SSuummmmaarryy Rename a fid. - - AArrgguummeennttss - - iinn - - struct cfs_rename_in { - ViceFid sourceFid; - char *srcname; - ViceFid destFid; - char *destname; - } cfs_rename; - - - - oouutt - none - - DDeessccrriippttiioonn Rename the object with name srcname in directory - sourceFid to destname in destFid. It is important that the names - srcname and destname are 0 terminated strings. Strings in Unix - kernels are not always null terminated. - - EErrrroorrss - - 0wpage - - 44..1199.. rreeaaddddiirr - - - SSuummmmaarryy Read directory entries. - - AArrgguummeennttss - - iinn - - struct cfs_readdir_in { - ViceFid VFid; - int count; - int offset; - } cfs_readdir; - - - - - oouutt - - struct cfs_readdir_out { - int size; - caddr_t data; /* Place holder for data. */ - } cfs_readdir; - - - - DDeessccrriippttiioonn Read directory entries from VFid starting at offset and - read at most count bytes. Returns the data in data and returns - the size in size. - - EErrrroorrss - - NNOOTTEE This call is not used. Readdir operations exploit container - files. We will re-evaluate this during the directory revamp which is - about to take place. - - 0wpage - - 44..2200.. vvggeett - - - SSuummmmaarryy instructs Venus to do an FSDB->Get. - - AArrgguummeennttss - - iinn - - struct cfs_vget_in { - ViceFid VFid; - } cfs_vget; - - - - oouutt - - struct cfs_vget_out { - ViceFid VFid; - int vtype; - } cfs_vget; - - - - DDeessccrriippttiioonn This upcall asks Venus to do a get operation on an fsobj - labelled by VFid. - - EErrrroorrss - - NNOOTTEE This operation is not used. However, it is extremely useful - since it can be used to deal with read/write memory mapped files. - These can be "pinned" in the Venus cache using vget and released with - inactive. - - 0wpage - - 44..2211.. ffssyynncc - - - SSuummmmaarryy Tell Venus to update the RVM attributes of a file. - - AArrgguummeennttss - - iinn - - struct cfs_fsync_in { - ViceFid VFid; - } cfs_fsync; - - - - oouutt - none - - DDeessccrriippttiioonn Ask Venus to update RVM attributes of object VFid. This - should be called as part of kernel level fsync type calls. The - result indicates if the syncing was successful. - - EErrrroorrss - - NNOOTTEE Linux does not implement this call. It should. - - 0wpage - - 44..2222.. iinnaaccttiivvee - - - SSuummmmaarryy Tell Venus a vnode is no longer in use. - - AArrgguummeennttss - - iinn - - struct cfs_inactive_in { - ViceFid VFid; - } cfs_inactive; - - - - oouutt - none - - DDeessccrriippttiioonn This operation returns EOPNOTSUPP. - - EErrrroorrss - - NNOOTTEE This should perhaps be removed. - - 0wpage - - 44..2233.. rrddwwrr - - - SSuummmmaarryy Read or write from a file - - AArrgguummeennttss - - iinn - - struct cfs_rdwr_in { - ViceFid VFid; - int rwflag; - int count; - int offset; - int ioflag; - caddr_t data; /* Place holder for data. */ - } cfs_rdwr; - - - - - oouutt - - struct cfs_rdwr_out { - int rwflag; - int count; - caddr_t data; /* Place holder for data. */ - } cfs_rdwr; - - - - DDeessccrriippttiioonn This upcall asks Venus to read or write from a file. - - EErrrroorrss - - NNOOTTEE It should be removed since it is against the Coda philosophy that - read/write operations never reach Venus. I have been told the - operation does not work. It is not currently used. - - - 0wpage - - 44..2244.. ooddyymmoouunntt - - - SSuummmmaarryy Allows mounting multiple Coda "filesystems" on one Unix mount - point. - - AArrgguummeennttss - - iinn - - struct ody_mount_in { - char *name; /* Place holder for data. */ - } ody_mount; - - - - oouutt - - struct ody_mount_out { - ViceFid VFid; - } ody_mount; - - - - DDeessccrriippttiioonn Asks Venus to return the rootfid of a Coda system named - name. The fid is returned in VFid. - - EErrrroorrss - - NNOOTTEE This call was used by David for dynamic sets. It should be - removed since it causes a jungle of pointers in the VFS mounting area. - It is not used by Coda proper. Call is not implemented by Venus. - - 0wpage - - 44..2255.. ooddyy__llooookkuupp - - - SSuummmmaarryy Looks up something. - - AArrgguummeennttss - - iinn irrelevant - - - oouutt - irrelevant - - DDeessccrriippttiioonn - - EErrrroorrss - - NNOOTTEE Gut it. Call is not implemented by Venus. - - 0wpage - - 44..2266.. ooddyy__eexxppaanndd - - - SSuummmmaarryy expands something in a dynamic set. - - AArrgguummeennttss - - iinn irrelevant - - oouutt - irrelevant - - DDeessccrriippttiioonn - - EErrrroorrss - - NNOOTTEE Gut it. Call is not implemented by Venus. - - 0wpage - - 44..2277.. pprreeffeettcchh - - - SSuummmmaarryy Prefetch a dynamic set. - - AArrgguummeennttss - - iinn Not documented. - - oouutt - Not documented. - - DDeessccrriippttiioonn Venus worker.cc has support for this call, although it is - noted that it doesn't work. Not surprising, since the kernel does not - have support for it. (ODY_PREFETCH is not a defined operation). - - EErrrroorrss - - NNOOTTEE Gut it. It isn't working and isn't used by Coda. - - - 0wpage - - 44..2288.. ssiiggnnaall - - - SSuummmmaarryy Send Venus a signal about an upcall. - - AArrgguummeennttss - - iinn none - - oouutt - not applicable. - - DDeessccrriippttiioonn This is an out-of-band upcall to Venus to inform Venus - that the calling process received a signal after Venus read the - message from the input queue. Venus is supposed to clean up the - operation. - - EErrrroorrss No reply is given. - - NNOOTTEE We need to better understand what Venus needs to clean up and if - it is doing this correctly. Also we need to handle multiple upcall - per system call situations correctly. It would be important to know - what state changes in Venus take place after an upcall for which the - kernel is responsible for notifying Venus to clean up (e.g. open - definitely is such a state change, but many others are maybe not). - - 0wpage - - 55.. TThhee mmiinniiccaacchhee aanndd ddoowwnnccaallllss - - - The Coda FS Driver can cache results of lookup and access upcalls, to - limit the frequency of upcalls. Upcalls carry a price since a process - context switch needs to take place. The counterpart of caching the - information is that Venus will notify the FS Driver that cached - entries must be flushed or renamed. - - The kernel code generally has to maintain a structure which links the - internal file handles (called vnodes in BSD, inodes in Linux and - FileHandles in Windows) with the ViceFid's which Venus maintains. The - reason is that frequent translations back and forth are needed in - order to make upcalls and use the results of upcalls. Such linking - objects are called ccnnooddeess. - - The current minicache implementations have cache entries which record - the following: - - 1. the name of the file - - 2. the cnode of the directory containing the object - - 3. a list of CodaCred's for which the lookup is permitted. - - 4. the cnode of the object - - The lookup call in the Coda FS Driver may request the cnode of the - desired object from the cache, by passing its name, directory and the - CodaCred's of the caller. The cache will return the cnode or indicate - that it cannot be found. The Coda FS Driver must be careful to - invalidate cache entries when it modifies or removes objects. - - When Venus obtains information that indicates that cache entries are - no longer valid, it will make a downcall to the kernel. Downcalls are - intercepted by the Coda FS Driver and lead to cache invalidations of - the kind described below. The Coda FS Driver does not return an error - unless the downcall data could not be read into kernel memory. - - - 55..11.. IINNVVAALLIIDDAATTEE - - - No information is available on this call. - - - 55..22.. FFLLUUSSHH - - - - AArrgguummeennttss None - - SSuummmmaarryy Flush the name cache entirely. - - DDeessccrriippttiioonn Venus issues this call upon startup and when it dies. This - is to prevent stale cache information being held. Some operating - systems allow the kernel name cache to be switched off dynamically. - When this is done, this downcall is made. - - - 55..33.. PPUURRGGEEUUSSEERR - - - AArrgguummeennttss - - struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */ - struct CodaCred cred; - } cfs_purgeuser; - - - - DDeessccrriippttiioonn Remove all entries in the cache carrying the Cred. This - call is issued when tokens for a user expire or are flushed. - - - 55..44.. ZZAAPPFFIILLEE - - - AArrgguummeennttss - - struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */ - ViceFid CodaFid; - } cfs_zapfile; - - - - DDeessccrriippttiioonn Remove all entries which have the (dir vnode, name) pair. - This is issued as a result of an invalidation of cached attributes of - a vnode. - - NNOOTTEE Call is not named correctly in NetBSD and Mach. The minicache - zapfile routine takes different arguments. Linux does not implement - the invalidation of attributes correctly. - - - - 55..55.. ZZAAPPDDIIRR - - - AArrgguummeennttss - - struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */ - ViceFid CodaFid; - } cfs_zapdir; - - - - DDeessccrriippttiioonn Remove all entries in the cache lying in a directory - CodaFid, and all children of this directory. This call is issued when - Venus receives a callback on the directory. - - - 55..66.. ZZAAPPVVNNOODDEE - - - - AArrgguummeennttss - - struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */ - struct CodaCred cred; - ViceFid VFid; - } cfs_zapvnode; - - - - DDeessccrriippttiioonn Remove all entries in the cache carrying the cred and VFid - as in the arguments. This downcall is probably never issued. - - - 55..77.. PPUURRGGEEFFIIDD - - - SSuummmmaarryy - - AArrgguummeennttss - - struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */ - ViceFid CodaFid; - } cfs_purgefid; - - - - DDeessccrriippttiioonn Flush the attribute for the file. If it is a dir (odd - vnode), purge its children from the namecache and remove the file from the - namecache. - - - - 55..88.. RREEPPLLAACCEE - - - SSuummmmaarryy Replace the Fid's for a collection of names. - - AArrgguummeennttss - - struct cfs_replace_out { /* cfs_replace is a venus->kernel call */ - ViceFid NewFid; - ViceFid OldFid; - } cfs_replace; - - - - DDeessccrriippttiioonn This routine replaces a ViceFid in the name cache with - another. It is added to allow Venus during reintegration to replace - locally allocated temp fids while disconnected with global fids even - when the reference counts on those fids are not zero. - - 0wpage - - 66.. IInniittiiaalliizzaattiioonn aanndd cclleeaannuupp - - - This section gives brief hints as to desirable features for the Coda - FS Driver at startup and upon shutdown or Venus failures. Before - entering the discussion it is useful to repeat that the Coda FS Driver - maintains the following data: - - - 1. message queues - - 2. cnodes - - 3. name cache entries - - The name cache entries are entirely private to the driver, so they - can easily be manipulated. The message queues will generally have - clear points of initialization and destruction. The cnodes are - much more delicate. User processes hold reference counts in Coda - filesystems and it can be difficult to clean up the cnodes. - - It can expect requests through: - - 1. the message subsystem - - 2. the VFS layer - - 3. pioctl interface - - Currently the _p_i_o_c_t_l passes through the VFS for Coda so we can - treat these similarly. - - - 66..11.. RReeqquuiirreemmeennttss - - - The following requirements should be accommodated: - - 1. The message queues should have open and close routines. On Unix - the opening of the character devices are such routines. - - +o Before opening, no messages can be placed. - - +o Opening will remove any old messages still pending. - - +o Close will notify any sleeping processes that their upcall cannot - be completed. - - +o Close will free all memory allocated by the message queues. - - - 2. At open the namecache shall be initialized to empty state. - - 3. Before the message queues are open, all VFS operations will fail. - Fortunately this can be achieved by making sure than mounting the - Coda filesystem cannot succeed before opening. - - 4. After closing of the queues, no VFS operations can succeed. Here - one needs to be careful, since a few operations (lookup, - read/write, readdir) can proceed without upcalls. These must be - explicitly blocked. - - 5. Upon closing the namecache shall be flushed and disabled. - - 6. All memory held by cnodes can be freed without relying on upcalls. - - 7. Unmounting the file system can be done without relying on upcalls. - - 8. Mounting the Coda filesystem should fail gracefully if Venus cannot - get the rootfid or the attributes of the rootfid. The latter is - best implemented by Venus fetching these objects before attempting - to mount. - - NNOOTTEE NetBSD in particular but also Linux have not implemented the - above requirements fully. For smooth operation this needs to be - corrected. - - - diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs.rst index 16e606c11f40..f8941954c667 100644 --- a/Documentation/filesystems/configfs/configfs.txt +++ b/Documentation/filesystems/configfs.rst @@ -1,5 +1,6 @@ - -configfs - Userspace-driven kernel object configuration. +======================================================= +Configfs - Userspace-driven Kernel Object Configuration +======================================================= Joel Becker <joel.becker@oracle.com> @@ -9,7 +10,8 @@ Copyright (c) 2005 Oracle Corporation, Joel Becker <joel.becker@oracle.com> -[What is configfs?] +What is configfs? +================= configfs is a ram-based filesystem that provides the converse of sysfs's functionality. Where sysfs is a filesystem-based view of @@ -35,10 +37,11 @@ kernel modules backing the items must respond to this. Both sysfs and configfs can and should exist together on the same system. One is not a replacement for the other. -[Using configfs] +Using configfs +============== configfs can be compiled as a module or into the kernel. You can access -it by doing +it by doing:: mount -t configfs none /config @@ -56,28 +59,29 @@ values. Don't mix more than one attribute in one attribute file. There are two types of configfs attributes: * Normal attributes, which similar to sysfs attributes, are small ASCII text -files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably -only one value per file should be used, and the same caveats from sysfs apply. -Configfs expects write(2) to store the entire buffer at once. When writing to -normal configfs attributes, userspace processes should first read the entire -file, modify the portions they wish to change, and then write the entire -buffer back. + files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably + only one value per file should be used, and the same caveats from sysfs apply. + Configfs expects write(2) to store the entire buffer at once. When writing to + normal configfs attributes, userspace processes should first read the entire + file, modify the portions they wish to change, and then write the entire + buffer back. * Binary attributes, which are somewhat similar to sysfs binary attributes, -but with a few slight changes to semantics. The PAGE_SIZE limitation does not -apply, but the whole binary item must fit in single kernel vmalloc'ed buffer. -The write(2) calls from user space are buffered, and the attributes' -write_bin_attribute method will be invoked on the final close, therefore it is -imperative for user-space to check the return code of close(2) in order to -verify that the operation finished successfully. -To avoid a malicious user OOMing the kernel, there's a per-binary attribute -maximum buffer value. + but with a few slight changes to semantics. The PAGE_SIZE limitation does not + apply, but the whole binary item must fit in single kernel vmalloc'ed buffer. + The write(2) calls from user space are buffered, and the attributes' + write_bin_attribute method will be invoked on the final close, therefore it is + imperative for user-space to check the return code of close(2) in order to + verify that the operation finished successfully. + To avoid a malicious user OOMing the kernel, there's a per-binary attribute + maximum buffer value. When an item needs to be destroyed, remove it with rmdir(2). An item cannot be destroyed if any other item has a link to it (via symlink(2)). Links can be removed via unlink(2). -[Configuring FakeNBD: an Example] +Configuring FakeNBD: an Example +=============================== Imagine there's a Network Block Device (NBD) driver that allows you to access remote block devices. Call it FakeNBD. FakeNBD uses configfs @@ -86,14 +90,14 @@ sysadmins use to configure FakeNBD, but somehow that program has to tell the driver about it. Here's where configfs comes in. When the FakeNBD driver is loaded, it registers itself with configfs. -readdir(3) sees this just fine: +readdir(3) sees this just fine:: # ls /config fakenbd A fakenbd connection can be created with mkdir(2). The name is arbitrary, but likely the tool will make some use of the name. Perhaps -it is a uuid or a disk name: +it is a uuid or a disk name:: # mkdir /config/fakenbd/disk1 # ls /config/fakenbd/disk1 @@ -102,7 +106,7 @@ it is a uuid or a disk name: The target attribute contains the IP address of the server FakeNBD will connect to. The device attribute is the device on the server. Predictably, the rw attribute determines whether the connection is -read-only or read-write. +read-only or read-write:: # echo 10.0.0.1 > /config/fakenbd/disk1/target # echo /dev/sda1 > /config/fakenbd/disk1/device @@ -111,7 +115,8 @@ read-only or read-write. That's it. That's all there is. Now the device is configured, via the shell no less. -[Coding With configfs] +Coding With configfs +==================== Every object in configfs is a config_item. A config_item reflects an object in the subsystem. It has attributes that match values on that @@ -130,7 +135,10 @@ appears as a directory at the top of the configfs filesystem. A subsystem is also a config_group, and can do everything a config_group can. -[struct config_item] +struct config_item +================== + +:: struct config_item { char *ci_name; @@ -168,7 +176,10 @@ By itself, a config_item cannot do much more than appear in configfs. Usually a subsystem wants the item to display and/or store attributes, among other things. For that, it needs a type. -[struct config_item_type] +struct config_item_type +======================= + +:: struct configfs_item_operations { void (*release)(struct config_item *); @@ -192,7 +203,10 @@ allocated dynamically will need to provide the ct_item_ops->release() method. This method is called when the config_item's reference count reaches zero. -[struct configfs_attribute] +struct configfs_attribute +========================= + +:: struct configfs_attribute { char *ca_name; @@ -214,7 +228,10 @@ be called whenever userspace asks for a read(2) on the attribute. If an attribute is writable and provides a ->store method, that method will be be called whenever userspace asks for a write(2) on the attribute. -[struct configfs_bin_attribute] +struct configfs_bin_attribute +============================= + +:: struct configfs_bin_attribute { struct configfs_attribute cb_attr; @@ -240,11 +257,12 @@ will happen for write(2). The reads/writes are bufferred so only a single read/write will occur; the attributes' need not concern itself with it. -[struct config_group] +struct config_group +=================== A config_item cannot live in a vacuum. The only way one can be created is via mkdir(2) on a config_group. This will trigger creation of a -child item. +child item:: struct config_group { struct config_item cg_item; @@ -264,7 +282,7 @@ The config_group structure contains a config_item. Properly configuring that item means that a group can behave as an item in its own right. However, it can do more: it can create child items or groups. This is accomplished via the group operations specified on the group's -config_item_type. +config_item_type:: struct configfs_group_operations { struct config_item *(*make_item)(struct config_group *group, @@ -279,7 +297,8 @@ config_item_type. }; A group creates child items by providing the -ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new +ct_group_ops->make_item() method. If provided, this method is called from +mkdir(2) in the group's directory. The subsystem allocates a new config_item (or more likely, its container structure), initializes it, and returns it to configfs. Configfs will then populate the filesystem tree to reflect the new item. @@ -296,13 +315,14 @@ upon item allocation. If a subsystem has no work to do, it may omit the ct_group_ops->drop_item() method, and configfs will call config_item_put() on the item on behalf of the subsystem. -IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2) -is called, configfs WILL remove the item from the filesystem tree -(assuming that it has no children to keep it busy). The subsystem is -responsible for responding to this. If the subsystem has references to -the item in other threads, the memory is safe. It may take some time -for the item to actually disappear from the subsystem's usage. But it -is gone from configfs. +Important: + drop_item() is void, and as such cannot fail. When rmdir(2) + is called, configfs WILL remove the item from the filesystem tree + (assuming that it has no children to keep it busy). The subsystem is + responsible for responding to this. If the subsystem has references to + the item in other threads, the memory is safe. It may take some time + for the item to actually disappear from the subsystem's usage. But it + is gone from configfs. When drop_item() is called, the item's linkage has already been torn down. It no longer has a reference on its parent and has no place in @@ -319,10 +339,11 @@ is implemented in the configfs rmdir(2) code. ->drop_item() will not be called, as the item has not been dropped. rmdir(2) will fail, as the directory is not empty. -[struct configfs_subsystem] +struct configfs_subsystem +========================= A subsystem must register itself, usually at module_init time. This -tells configfs to make the subsystem appear in the file tree. +tells configfs to make the subsystem appear in the file tree:: struct configfs_subsystem { struct config_group su_group; @@ -332,17 +353,19 @@ tells configfs to make the subsystem appear in the file tree. int configfs_register_subsystem(struct configfs_subsystem *subsys); void configfs_unregister_subsystem(struct configfs_subsystem *subsys); - A subsystem consists of a toplevel config_group and a mutex. +A subsystem consists of a toplevel config_group and a mutex. The group is where child config_items are created. For a subsystem, this group is usually defined statically. Before calling configfs_register_subsystem(), the subsystem must have initialized the group via the usual group _init() functions, and it must also have initialized the mutex. - When the register call returns, the subsystem is live, and it + +When the register call returns, the subsystem is live, and it will be visible via configfs. At that point, mkdir(2) can be called and the subsystem must be ready for it. -[An Example] +An Example +========== The best example of these basic concepts is the simple_children subsystem/group and the simple_child item in @@ -350,7 +373,8 @@ samples/configfs/configfs_sample.c. It shows a trivial object displaying and storing an attribute, and a simple group creating and destroying these children. -[Hierarchy Navigation and the Subsystem Mutex] +Hierarchy Navigation and the Subsystem Mutex +============================================ There is an extra bonus that configfs provides. The config_groups and config_items are arranged in a hierarchy due to the fact that they @@ -375,7 +399,8 @@ be in its parent's cg_children list for the same duration. This allows a subsystem to trust ci_parent and cg_children while they hold the mutex. -[Item Aggregation Via symlink(2)] +Item Aggregation Via symlink(2) +=============================== configfs provides a simple group via the group->item parent/child relationship. Often, however, a larger environment requires aggregation @@ -403,7 +428,8 @@ A config_item cannot be removed while it links to any other item, nor can it be removed while an item links to it. Dangling symlinks are not allowed in configfs. -[Automatically Created Subgroups] +Automatically Created Subgroups +=============================== A new config_group may want to have two types of child config_items. While this could be codified by magic names in ->make_item(), it is much @@ -433,7 +459,8 @@ As a consequence of this, default groups cannot be removed directly via rmdir(2). They also are not considered when rmdir(2) on the parent group is checking for children. -[Dependent Subsystems] +Dependent Subsystems +==================== Sometimes other drivers depend on particular configfs items. For example, ocfs2 mounts depend on a heartbeat region item. If that @@ -460,9 +487,11 @@ succeeds, then heartbeat knows the region is safe to give to ocfs2. If it fails, it was being torn down anyway, and heartbeat can gracefully pass up an error. -[Committable Items] +Committable Items +================= -NOTE: Committable items are currently unimplemented. +Note: + Committable items are currently unimplemented. Some config_items cannot have a valid initial state. That is, no default values can be specified for the item's attributes such that the @@ -504,5 +533,3 @@ As rmdir(2) does not work in the "live" directory, an item must be shutdown, or "uncommitted". Again, this is done via rename(2), this time from the "live" directory back to the "pending" one. The subsystem is notified by the ct_group_ops->uncommit_object() method. - - diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index 679729442fd2..735f3859b19f 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -74,7 +74,7 @@ are zeroed out and converted to written extents before being returned to avoid exposure of uninitialized data through mmap. These filesystems may be used for inspiration: -- ext2: see Documentation/filesystems/ext2.txt +- ext2: see Documentation/filesystems/ext2.rst - ext4: see Documentation/filesystems/ext4/ - xfs: see Documentation/admin-guide/xfs.rst diff --git a/Documentation/filesystems/debugfs.rst b/Documentation/filesystems/debugfs.rst index 6c032db235a5..1da7a4b7383d 100644 --- a/Documentation/filesystems/debugfs.rst +++ b/Documentation/filesystems/debugfs.rst @@ -166,16 +166,17 @@ file:: }; struct debugfs_regset32 { - struct debugfs_reg32 *regs; + const struct debugfs_reg32 *regs; int nregs; void __iomem *base; + struct device *dev; /* Optional device for Runtime PM */ }; debugfs_create_regset32(const char *name, umode_t mode, struct dentry *parent, struct debugfs_regset32 *regset); - void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, + void debugfs_print_regs32(struct seq_file *s, const struct debugfs_reg32 *regs, int nregs, void __iomem *base, char *prefix); The "base" argument may be 0, but you may want to build the reg32 array diff --git a/Documentation/filesystems/devpts.rst b/Documentation/filesystems/devpts.rst new file mode 100644 index 000000000000..a03248ddfb4c --- /dev/null +++ b/Documentation/filesystems/devpts.rst @@ -0,0 +1,36 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +The Devpts Filesystem +===================== + +Each mount of the devpts filesystem is now distinct such that ptys +and their indicies allocated in one mount are independent from ptys +and their indicies in all other mounts. + +All mounts of the devpts filesystem now create a ``/dev/pts/ptmx`` node +with permissions ``0000``. + +To retain backwards compatibility the a ptmx device node (aka any node +created with ``mknod name c 5 2``) when opened will look for an instance +of devpts under the name ``pts`` in the same directory as the ptmx device +node. + +As an option instead of placing a ``/dev/ptmx`` device node at ``/dev/ptmx`` +it is possible to place a symlink to ``/dev/pts/ptmx`` at ``/dev/ptmx`` or +to bind mount ``/dev/ptx/ptmx`` to ``/dev/ptmx``. If you opt for using +the devpts filesystem in this manner devpts should be mounted with +the ``ptmxmode=0666``, or ``chmod 0666 /dev/pts/ptmx`` should be called. + +Total count of pty pairs in all instances is limited by sysctls:: + + kernel.pty.max = 4096 - global limit + kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace + kernel.pty.nr - current count of ptys + +Per-instance limit could be set by adding mount option ``max=<count>``. + +This feature was added in kernel 3.4 together with +``sysctl kernel.pty.reserve``. + +In kernels older than 3.4 sysctl ``kernel.pty.max`` works as per-instance limit. diff --git a/Documentation/filesystems/devpts.txt b/Documentation/filesystems/devpts.txt deleted file mode 100644 index 9f94fe276dea..000000000000 --- a/Documentation/filesystems/devpts.txt +++ /dev/null @@ -1,26 +0,0 @@ -Each mount of the devpts filesystem is now distinct such that ptys -and their indicies allocated in one mount are independent from ptys -and their indicies in all other mounts. - -All mounts of the devpts filesystem now create a /dev/pts/ptmx node -with permissions 0000. - -To retain backwards compatibility the a ptmx device node (aka any node -created with "mknod name c 5 2") when opened will look for an instance -of devpts under the name "pts" in the same directory as the ptmx device -node. - -As an option instead of placing a /dev/ptmx device node at /dev/ptmx -it is possible to place a symlink to /dev/pts/ptmx at /dev/ptmx or -to bind mount /dev/ptx/ptmx to /dev/ptmx. If you opt for using -the devpts filesystem in this manner devpts should be mounted with -the ptmxmode=0666, or chmod 0666 /dev/pts/ptmx should be called. - -Total count of pty pairs in all instances is limited by sysctls: -kernel.pty.max = 4096 - global limit -kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace -kernel.pty.nr - current count of ptys - -Per-instance limit could be set by adding mount option "max=<count>". -This feature was added in kernel 3.4 together with sysctl kernel.pty.reserve. -In kernels older than 3.4 sysctl kernel.pty.max works as per-instance limit. diff --git a/Documentation/filesystems/dnotify.txt b/Documentation/filesystems/dnotify.rst index 15156883d321..a28a1f9ef79c 100644 --- a/Documentation/filesystems/dnotify.txt +++ b/Documentation/filesystems/dnotify.rst @@ -1,5 +1,8 @@ - Linux Directory Notification - ============================ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +Linux Directory Notification +============================ Stephen Rothwell <sfr@canb.auug.org.au> @@ -12,6 +15,7 @@ being delivered using signals. The application decides which "events" it wants to be notified about. The currently defined events are: + ========= ===================================================== DN_ACCESS A file in the directory was accessed (read) DN_MODIFY A file in the directory was modified (write,truncate) DN_CREATE A file was created in the directory @@ -19,6 +23,7 @@ The currently defined events are: DN_RENAME A file in the directory was renamed DN_ATTRIB A file in the directory had its attributes changed (chmod,chown) + ========= ===================================================== Usually, the application must reregister after each notification, but if DN_MULTISHOT is or'ed with the event mask, then the registration will @@ -36,7 +41,7 @@ especially important if DN_MULTISHOT is specified. Note that SIGRTMIN is often blocked, so it is better to use (at least) SIGRTMIN + 1. Implementation expectations (features and bugs :-)) ---------------------------- +--------------------------------------------------- The notification should work for any local access to files even if the actual file system is on a remote server. This implies that remote @@ -67,4 +72,4 @@ See tools/testing/selftests/filesystems/dnotify_test.c for an example. NOTE ---- Beginning with Linux 2.6.13, dnotify has been replaced by inotify. -See Documentation/filesystems/inotify.txt for more information on it. +See Documentation/filesystems/inotify.rst for more information on it. diff --git a/Documentation/filesystems/efivarfs.rst b/Documentation/filesystems/efivarfs.rst index 90ac65683e7e..0551985821b8 100644 --- a/Documentation/filesystems/efivarfs.rst +++ b/Documentation/filesystems/efivarfs.rst @@ -24,3 +24,20 @@ files that are not well-known standardized variables are created as immutable files. This doesn't prevent removal - "chattr -i" will work - but it does prevent this kind of failure from being accomplished accidentally. + +.. warning :: + When a content of an UEFI variable in /sys/firmware/efi/efivars is + displayed, for example using "hexdump", pay attention that the first + 4 bytes of the output represent the UEFI variable attributes, + in little-endian format. + + Practically the output of each efivar is composed of: + + +-----------------------------------+ + |4_bytes_of_attributes + efivar_data| + +-----------------------------------+ + +*See also:* + +- Documentation/admin-guide/acpi/ssdt-overlays.rst +- Documentation/ABI/stable/sysfs-firmware-efi-vars diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.rst index ac87e6fda842..2a572e7edc08 100644 --- a/Documentation/filesystems/fiemap.txt +++ b/Documentation/filesystems/fiemap.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ============ Fiemap Ioctl ============ @@ -10,9 +12,9 @@ returns a list of extents. Request Basics -------------- -A fiemap request is encoded within struct fiemap: +A fiemap request is encoded within struct fiemap:: -struct fiemap { + struct fiemap { __u64 fm_start; /* logical offset (inclusive) at * which to start mapping (in) */ __u64 fm_length; /* logical length of mapping which @@ -23,7 +25,7 @@ struct fiemap { __u32 fm_extent_count; /* size of fm_extents array (in) */ __u32 fm_reserved; struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ -}; + }; fm_start, and fm_length specify the logical range within the file @@ -51,12 +53,12 @@ nothing to prevent the file from changing between calls to FIEMAP. The following flags can be set in fm_flags: -* FIEMAP_FLAG_SYNC -If this flag is set, the kernel will sync the file before mapping extents. +FIEMAP_FLAG_SYNC + If this flag is set, the kernel will sync the file before mapping extents. -* FIEMAP_FLAG_XATTR -If this flag is set, the extents returned will describe the inodes -extended attribute lookup tree, instead of its data tree. +FIEMAP_FLAG_XATTR + If this flag is set, the extents returned will describe the inodes + extended attribute lookup tree, instead of its data tree. Extent Mapping @@ -75,18 +77,18 @@ complete the requested range and will not have the FIEMAP_EXTENT_LAST flag set (see the next section on extent flags). Each extent is described by a single fiemap_extent structure as -returned in fm_extents. - -struct fiemap_extent { - __u64 fe_logical; /* logical offset in bytes for the start of - * the extent */ - __u64 fe_physical; /* physical offset in bytes for the start - * of the extent */ - __u64 fe_length; /* length in bytes for the extent */ - __u64 fe_reserved64[2]; - __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ - __u32 fe_reserved[3]; -}; +returned in fm_extents:: + + struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent */ + __u64 fe_length; /* length in bytes for the extent */ + __u64 fe_reserved64[2]; + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved[3]; + }; All offsets and lengths are in bytes and mirror those on disk. It is valid for an extents logical offset to start before the request or its logical @@ -114,26 +116,27 @@ worry about all present and future flags which might imply unaligned data. Note that the opposite is not true - it would be valid for FIEMAP_EXTENT_NOT_ALIGNED to appear alone. -* FIEMAP_EXTENT_LAST -This is generally the last extent in the file. A mapping attempt past -this extent may return nothing. Some implementations set this flag to -indicate this extent is the last one in the range queried by the user -(via fiemap->fm_length). +FIEMAP_EXTENT_LAST + This is generally the last extent in the file. A mapping attempt past + this extent may return nothing. Some implementations set this flag to + indicate this extent is the last one in the range queried by the user + (via fiemap->fm_length). + +FIEMAP_EXTENT_UNKNOWN + The location of this extent is currently unknown. This may indicate + the data is stored on an inaccessible volume or that no storage has + been allocated for the file yet. -* FIEMAP_EXTENT_UNKNOWN -The location of this extent is currently unknown. This may indicate -the data is stored on an inaccessible volume or that no storage has -been allocated for the file yet. +FIEMAP_EXTENT_DELALLOC + This will also set FIEMAP_EXTENT_UNKNOWN. -* FIEMAP_EXTENT_DELALLOC - - This will also set FIEMAP_EXTENT_UNKNOWN. -Delayed allocation - while there is data for this extent, its -physical location has not been allocated yet. + Delayed allocation - while there is data for this extent, its + physical location has not been allocated yet. -* FIEMAP_EXTENT_ENCODED -This extent does not consist of plain filesystem blocks but is -encoded (e.g. encrypted or compressed). Reading the data in this -extent via I/O to the block device will have undefined results. +FIEMAP_EXTENT_ENCODED + This extent does not consist of plain filesystem blocks but is + encoded (e.g. encrypted or compressed). Reading the data in this + extent via I/O to the block device will have undefined results. Note that it is *always* undefined to try to update the data in-place by writing to the indicated location without the @@ -145,32 +148,32 @@ unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is clear; user applications must not try reading or writing to the filesystem via the block device under any other circumstances. -* FIEMAP_EXTENT_DATA_ENCRYPTED - - This will also set FIEMAP_EXTENT_ENCODED -The data in this extent has been encrypted by the file system. +FIEMAP_EXTENT_DATA_ENCRYPTED + This will also set FIEMAP_EXTENT_ENCODED + The data in this extent has been encrypted by the file system. -* FIEMAP_EXTENT_NOT_ALIGNED -Extent offsets and length are not guaranteed to be block aligned. +FIEMAP_EXTENT_NOT_ALIGNED + Extent offsets and length are not guaranteed to be block aligned. -* FIEMAP_EXTENT_DATA_INLINE +FIEMAP_EXTENT_DATA_INLINE This will also set FIEMAP_EXTENT_NOT_ALIGNED -Data is located within a meta data block. + Data is located within a meta data block. -* FIEMAP_EXTENT_DATA_TAIL +FIEMAP_EXTENT_DATA_TAIL This will also set FIEMAP_EXTENT_NOT_ALIGNED -Data is packed into a block with data from other files. + Data is packed into a block with data from other files. -* FIEMAP_EXTENT_UNWRITTEN -Unwritten extent - the extent is allocated but its data has not been -initialized. This indicates the extent's data will be all zero if read -through the filesystem but the contents are undefined if read directly from -the device. +FIEMAP_EXTENT_UNWRITTEN + Unwritten extent - the extent is allocated but its data has not been + initialized. This indicates the extent's data will be all zero if read + through the filesystem but the contents are undefined if read directly from + the device. -* FIEMAP_EXTENT_MERGED -This will be set when a file does not support extents, i.e., it uses a block -based addressing scheme. Since returning an extent for each block back to -userspace would be highly inefficient, the kernel will try to merge most -adjacent blocks into 'extents'. +FIEMAP_EXTENT_MERGED + This will be set when a file does not support extents, i.e., it uses a block + based addressing scheme. Since returning an extent for each block back to + userspace would be highly inefficient, the kernel will try to merge most + adjacent blocks into 'extents'. VFS -> File System Implementation @@ -179,23 +182,23 @@ VFS -> File System Implementation File systems wishing to support fiemap must implement a ->fiemap callback on their inode_operations structure. The fs ->fiemap call is responsible for defining its set of supported fiemap flags, and calling a helper function on -each discovered extent: +each discovered extent:: -struct inode_operations { + struct inode_operations { ... int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); ->fiemap is passed struct fiemap_extent_info which describes the -fiemap request: +fiemap request:: -struct fiemap_extent_info { + struct fiemap_extent_info { unsigned int fi_flags; /* Flags as passed from user */ unsigned int fi_extents_mapped; /* Number of mapped extents */ unsigned int fi_extents_max; /* Size of fiemap_extent array */ struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */ -}; + }; It is intended that the file system should not need to access any of this structure directly. Filesystem handlers should be tolerant to signals and return @@ -203,9 +206,9 @@ EINTR once fatal signal received. Flag checking should be done at the beginning of the ->fiemap callback via the -fiemap_check_flags() helper: +fiemap_check_flags() helper:: -int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); The struct fieinfo should be passed in as received from ioctl_fiemap(). The set of fiemap flags which the fs understands should be passed via fs_flags. If @@ -216,10 +219,10 @@ ioctl_fiemap(). For each extent in the request range, the file system should call -the helper function, fiemap_fill_next_extent(): +the helper function, fiemap_fill_next_extent():: -int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, - u64 phys, u64 len, u32 flags, u32 dev); + int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags, u32 dev); fiemap_fill_next_extent() will use the passed values to populate the next free extent in the fm_extents array. 'General' extent flags will diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.rst index 46dfc6b038c3..cbf8e57376bf 100644 --- a/Documentation/filesystems/files.txt +++ b/Documentation/filesystems/files.rst @@ -1,5 +1,8 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== File management in the Linux kernel ------------------------------------ +=================================== This document describes how locking for files (struct file) and file descriptor table (struct files) works. @@ -34,7 +37,7 @@ appear atomic. Here are the locking rules for the fdtable structure - 1. All references to the fdtable must be done through - the files_fdtable() macro : + the files_fdtable() macro:: struct fdtable *fdt; @@ -61,7 +64,8 @@ the fdtable structure - 4. To look up the file structure given an fd, a reader must use either fcheck() or fcheck_files() APIs. These take care of barrier requirements due to lock-free lookup. - An example : + + An example:: struct file *file; @@ -77,7 +81,7 @@ the fdtable structure - of the fd (fget()/fget_light()) are lock-free, it is possible that look-up may race with the last put() operation on the file structure. This is avoided using atomic_long_inc_not_zero() - on ->f_count : + on ->f_count:: rcu_read_lock(); file = fcheck_files(files, fd); @@ -106,7 +110,8 @@ the fdtable structure - holding files->file_lock. If ->file_lock is dropped, then another thread expand the files thereby creating a new fdtable and making the earlier fdtable pointer stale. - For example : + + For example:: spin_lock(&files->file_lock); fd = locate_fd(files, file, start); diff --git a/Documentation/filesystems/fuse-io.txt b/Documentation/filesystems/fuse-io.rst index 07b8f73f100f..255a368fe534 100644 --- a/Documentation/filesystems/fuse-io.txt +++ b/Documentation/filesystems/fuse-io.rst @@ -1,3 +1,9 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +Fuse I/O Modes +============== + Fuse supports the following I/O modes: - direct-io diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index e7b46dac7079..17795341e0a3 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -24,6 +24,22 @@ algorithms work. splice locking directory-locking + devpts + dnotify + fiemap + files + locks + mandatory-locking + mount_api + quota + seq_file + sharedsubtree + sysfs-pci + sysfs-tagging + + automount-support + + caching/index porting @@ -57,7 +73,10 @@ Documentation for filesystem implementations. befs bfs btrfs + cifs/cifsroot ceph + coda + configfs cramfs debugfs dlmfs @@ -73,6 +92,7 @@ Documentation for filesystem implementations. hfsplus hpfs fuse + fuse-io inotify isofs nilfs2 @@ -88,6 +108,7 @@ Documentation for filesystem implementations. ramfs-rootfs-initramfs relay romfs + spufs/index squashfs sysfs sysv-fs @@ -97,4 +118,6 @@ Documentation for filesystem implementations. udf virtiofs vfat + xfs-delayed-logging-design + xfs-self-describing-metadata zonefs diff --git a/Documentation/filesystems/locks.txt b/Documentation/filesystems/locks.rst index 5368690f412e..c5ae858b1aac 100644 --- a/Documentation/filesystems/locks.txt +++ b/Documentation/filesystems/locks.rst @@ -1,4 +1,8 @@ - File Locking Release Notes +.. SPDX-License-Identifier: GPL-2.0 + +========================== +File Locking Release Notes +========================== Andy Walker <andy@lysaker.kvaerner.no> @@ -6,7 +10,7 @@ 1. What's New? --------------- +============== 1.1 Broken Flock Emulation -------------------------- @@ -25,7 +29,7 @@ anyway (see the file "Documentation/process/changes.rst".) --------------------------- 1.2.1 Typical Problems - Sendmail ---------------------------------- +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Because sendmail was unable to use the old flock() emulation, many sendmail installations use fcntl() instead of flock(). This is true of Slackware 3.0 for example. This gave rise to some other subtle problems if sendmail was @@ -37,7 +41,7 @@ to lock solid with deadlocked processes. 1.2.2 The Solution ------------------- +^^^^^^^^^^^^^^^^^^ The solution I have chosen, after much experimentation and discussion, is to make flock() and fcntl() locks oblivious to each other. Both can exists, and neither will have any effect on the other. @@ -54,7 +58,7 @@ fcntl(), with all the problems that implies. --------------------------------------- Mandatory locking, as described in -'Documentation/filesystems/mandatory-locking.txt' was prior to this release a +'Documentation/filesystems/mandatory-locking.rst' was prior to this release a general configuration option that was valid for all mounted filesystems. This had a number of inherent dangers, not the least of which was the ability to freeze an NFS server by asking it to read a file for which a mandatory lock diff --git a/Documentation/filesystems/mandatory-locking.txt b/Documentation/filesystems/mandatory-locking.rst index a251ca33164a..9ce73544a8f0 100644 --- a/Documentation/filesystems/mandatory-locking.txt +++ b/Documentation/filesystems/mandatory-locking.rst @@ -1,8 +1,13 @@ - Mandatory File Locking For The Linux Operating System +.. SPDX-License-Identifier: GPL-2.0 + +===================================================== +Mandatory File Locking For The Linux Operating System +===================================================== Andy Walker <andy@lysaker.kvaerner.no> 15 April 1996 + (Updated September 2007) 0. Why you should avoid mandatory locking @@ -53,15 +58,17 @@ possible on existing user code. The scheme is based on marking individual files as candidates for mandatory locking, and using the existing fcntl()/lockf() interface for applying locks just as if they were normal, advisory locks. -Note 1: In saying "file" in the paragraphs above I am actually not telling -the whole truth. System V locking is based on fcntl(). The granularity of -fcntl() is such that it allows the locking of byte ranges in files, in addition -to entire files, so the mandatory locking rules also have byte level -granularity. +.. Note:: + + 1. In saying "file" in the paragraphs above I am actually not telling + the whole truth. System V locking is based on fcntl(). The granularity of + fcntl() is such that it allows the locking of byte ranges in files, in + addition to entire files, so the mandatory locking rules also have byte + level granularity. -Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite -borrowing the fcntl() locking scheme from System V. The mandatory locking -scheme is defined by the System V Interface Definition (SVID) Version 3. + 2. POSIX.1 does not specify any scheme for mandatory locking, despite + borrowing the fcntl() locking scheme from System V. The mandatory locking + scheme is defined by the System V Interface Definition (SVID) Version 3. 2. Marking a file for mandatory locking --------------------------------------- diff --git a/Documentation/filesystems/mount_api.txt b/Documentation/filesystems/mount_api.rst index 87c14bbb2b35..dea22d64f060 100644 --- a/Documentation/filesystems/mount_api.txt +++ b/Documentation/filesystems/mount_api.rst @@ -1,8 +1,10 @@ - ==================== - FILESYSTEM MOUNT API - ==================== +.. SPDX-License-Identifier: GPL-2.0 -CONTENTS +==================== +fILESYSTEM Mount API +==================== + +.. CONTENTS (1) Overview. @@ -21,8 +23,7 @@ CONTENTS (8) Parameter helper functions. -======== -OVERVIEW +Overview ======== The creation of new mounts is now to be done in a multistep process: @@ -43,7 +44,7 @@ The creation of new mounts is now to be done in a multistep process: (7) Destroy the context. -To support this, the file_system_type struct gains two new fields: +To support this, the file_system_type struct gains two new fields:: int (*init_fs_context)(struct fs_context *fc); const struct fs_parameter_description *parameters; @@ -57,12 +58,11 @@ Note that security initialisation is done *after* the filesystem is called so that the namespaces may be adjusted first. -====================== -THE FILESYSTEM CONTEXT +The Filesystem context ====================== The creation and reconfiguration of a superblock is governed by a filesystem -context. This is represented by the fs_context structure: +context. This is represented by the fs_context structure:: struct fs_context { const struct fs_context_operations *ops; @@ -86,78 +86,106 @@ context. This is represented by the fs_context structure: The fs_context fields are as follows: - (*) const struct fs_context_operations *ops + * :: + + const struct fs_context_operations *ops These are operations that can be done on a filesystem context (see below). This must be set by the ->init_fs_context() file_system_type operation. - (*) struct file_system_type *fs_type + * :: + + struct file_system_type *fs_type A pointer to the file_system_type of the filesystem that is being constructed or reconfigured. This retains a reference on the type owner. - (*) void *fs_private + * :: + + void *fs_private A pointer to the file system's private data. This is where the filesystem will need to store any options it parses. - (*) struct dentry *root + * :: + + struct dentry *root A pointer to the root of the mountable tree (and indirectly, the superblock thereof). This is filled in by the ->get_tree() op. If this is set, an active reference on root->d_sb must also be held. - (*) struct user_namespace *user_ns - (*) struct net *net_ns + * :: + + struct user_namespace *user_ns + struct net *net_ns There are a subset of the namespaces in use by the invoking process. They retain references on each namespace. The subscribed namespaces may be replaced by the filesystem to reflect other sources, such as the parent mount superblock on an automount. - (*) const struct cred *cred + * :: + + const struct cred *cred The mounter's credentials. This retains a reference on the credentials. - (*) char *source + * :: + + char *source This specifies the source. It may be a block device (e.g. /dev/sda1) or something more exotic, such as the "host:/path" that NFS desires. - (*) char *subtype + * :: + + char *subtype This is a string to be added to the type displayed in /proc/mounts to qualify it (used by FUSE). This is available for the filesystem to set if desired. - (*) void *security + * :: + + void *security A place for the LSMs to hang their security data for the superblock. The relevant security operations are described below. - (*) void *s_fs_info + * :: + + void *s_fs_info The proposed s_fs_info for a new superblock, set in the superblock by sget_fc(). This can be used to distinguish superblocks. - (*) unsigned int sb_flags - (*) unsigned int sb_flags_mask + * :: + + unsigned int sb_flags + unsigned int sb_flags_mask Which bits SB_* flags are to be set/cleared in super_block::s_flags. - (*) unsigned int s_iflags + * :: + + unsigned int s_iflags These will be bitwise-OR'd with s->s_iflags when a superblock is created. - (*) enum fs_context_purpose + * :: + + enum fs_context_purpose This indicates the purpose for which the context is intended. The available values are: - FS_CONTEXT_FOR_MOUNT, -- New superblock for explicit mount - FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount - FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount + ========================== ====================================== + FS_CONTEXT_FOR_MOUNT, New superblock for explicit mount + FS_CONTEXT_FOR_SUBMOUNT New automatic submount of extant mount + FS_CONTEXT_FOR_RECONFIGURE Change an existing mount + ========================== ====================================== The mount context is created by calling vfs_new_fs_context() or vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the @@ -176,11 +204,10 @@ mount context. For instance, NFS might pin the appropriate protocol version module. -================================= -THE FILESYSTEM CONTEXT OPERATIONS +The Filesystem Context Operations ================================= -The filesystem context points to a table of operations: +The filesystem context points to a table of operations:: struct fs_context_operations { void (*free)(struct fs_context *fc); @@ -195,24 +222,32 @@ The filesystem context points to a table of operations: These operations are invoked by the various stages of the mount procedure to manage the filesystem context. They are as follows: - (*) void (*free)(struct fs_context *fc); + * :: + + void (*free)(struct fs_context *fc); Called to clean up the filesystem-specific part of the filesystem context when the context is destroyed. It should be aware that parts of the context may have been removed and NULL'd out by ->get_tree(). - (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc); + * :: + + int (*dup)(struct fs_context *fc, struct fs_context *src_fc); Called when a filesystem context has been duplicated to duplicate the filesystem-private data. An error may be returned to indicate failure to do this. - [!] Note that even if this fails, put_fs_context() will be called + .. Warning:: + + Note that even if this fails, put_fs_context() will be called immediately thereafter, so ->dup() *must* make the filesystem-private data safe for ->free(). - (*) int (*parse_param)(struct fs_context *fc, - struct struct fs_parameter *param); + * :: + + int (*parse_param)(struct fs_context *fc, + struct struct fs_parameter *param); Called when a parameter is being added to the filesystem context. param points to the key name and maybe a value object. VFS-specific options @@ -224,7 +259,9 @@ manage the filesystem context. They are as follows: If successful, 0 should be returned or a negative error code otherwise. - (*) int (*parse_monolithic)(struct fs_context *fc, void *data); + * :: + + int (*parse_monolithic)(struct fs_context *fc, void *data); Called when the mount(2) system call is invoked to pass the entire data page in one go. If this is expected to be just a list of "key[=val]" @@ -236,7 +273,9 @@ manage the filesystem context. They are as follows: finds it's the standard key-val list then it may pass it off to generic_parse_monolithic(). - (*) int (*get_tree)(struct fs_context *fc); + * :: + + int (*get_tree)(struct fs_context *fc); Called to get or create the mountable root and superblock, using the information stored in the filesystem context (reconfiguration goes via a @@ -249,7 +288,9 @@ manage the filesystem context. They are as follows: The phase on a userspace-driven context will be set to only allow this to be called once on any particular context. - (*) int (*reconfigure)(struct fs_context *fc); + * :: + + int (*reconfigure)(struct fs_context *fc); Called to effect reconfiguration of a superblock using information stored in the filesystem context. It may detach any resources it desires from @@ -259,19 +300,20 @@ manage the filesystem context. They are as follows: On success it should return 0. In the case of an error, it should return a negative error code. - [NOTE] reconfigure is intended as a replacement for remount_fs. + .. Note:: reconfigure is intended as a replacement for remount_fs. -=========================== -FILESYSTEM CONTEXT SECURITY +Filesystem context Security =========================== The filesystem context contains a security pointer that the LSMs can use for building up a security context for the superblock to be mounted. There are a number of operations used by the new mount code for this purpose: - (*) int security_fs_context_alloc(struct fs_context *fc, - struct dentry *reference); + * :: + + int security_fs_context_alloc(struct fs_context *fc, + struct dentry *reference); Called to initialise fc->security (which is preset to NULL) and allocate any resources needed. It should return 0 on success or a negative error @@ -283,22 +325,28 @@ number of operations used by the new mount code for this purpose: non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case it indicates the automount point. - (*) int security_fs_context_dup(struct fs_context *fc, - struct fs_context *src_fc); + * :: + + int security_fs_context_dup(struct fs_context *fc, + struct fs_context *src_fc); Called to initialise fc->security (which is preset to NULL) and allocate any resources needed. The original filesystem context is pointed to by src_fc and may be used for reference. It should return 0 on success or a negative error code on failure. - (*) void security_fs_context_free(struct fs_context *fc); + * :: + + void security_fs_context_free(struct fs_context *fc); Called to clean up anything attached to fc->security. Note that the contents may have been transferred to a superblock and the pointer cleared during get_tree. - (*) int security_fs_context_parse_param(struct fs_context *fc, - struct fs_parameter *param); + * :: + + int security_fs_context_parse_param(struct fs_context *fc, + struct fs_parameter *param); Called for each mount parameter, including the source. The arguments are as for the ->parse_param() method. It should return 0 to indicate that @@ -310,7 +358,9 @@ number of operations used by the new mount code for this purpose: (provided the value pointer is NULL'd out). If it is stolen, 1 must be returned to prevent it being passed to the filesystem. - (*) int security_fs_context_validate(struct fs_context *fc); + * :: + + int security_fs_context_validate(struct fs_context *fc); Called after all the options have been parsed to validate the collection as a whole and to do any necessary allocation so that @@ -320,36 +370,43 @@ number of operations used by the new mount code for this purpose: In the case of reconfiguration, the target superblock will be accessible via fc->root. - (*) int security_sb_get_tree(struct fs_context *fc); + * :: + + int security_sb_get_tree(struct fs_context *fc); Called during the mount procedure to verify that the specified superblock is allowed to be mounted and to transfer the security data there. It should return 0 or a negative error code. - (*) void security_sb_reconfigure(struct fs_context *fc); + * :: + + void security_sb_reconfigure(struct fs_context *fc); Called to apply any reconfiguration to an LSM's context. It must not fail. Error checking and resource allocation must be done in advance by the parameter parsing and validation hooks. - (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint, - unsigned int mnt_flags); + * :: + + int security_sb_mountpoint(struct fs_context *fc, + struct path *mountpoint, + unsigned int mnt_flags); Called during the mount procedure to verify that the root dentry attached to the context is permitted to be attached to the specified mountpoint. It should return 0 on success or a negative error code on failure. -========================== -VFS FILESYSTEM CONTEXT API +VFS Filesystem context API ========================== There are four operations for creating a filesystem context and one for destroying a context: - (*) struct fs_context *fs_context_for_mount( - struct file_system_type *fs_type, - unsigned int sb_flags); + * :: + + struct fs_context *fs_context_for_mount(struct file_system_type *fs_type, + unsigned int sb_flags); Allocate a filesystem context for the purpose of setting up a new mount, whether that be with a new superblock or sharing an existing one. This @@ -359,7 +416,9 @@ destroying a context: fs_type specifies the filesystem type that will manage the context and sb_flags presets the superblock flags stored therein. - (*) struct fs_context *fs_context_for_reconfigure( + * :: + + struct fs_context *fs_context_for_reconfigure( struct dentry *dentry, unsigned int sb_flags, unsigned int sb_flags_mask); @@ -369,7 +428,9 @@ destroying a context: configured. sb_flags and sb_flags_mask indicate which superblock flags need changing and to what. - (*) struct fs_context *fs_context_for_submount( + * :: + + struct fs_context *fs_context_for_submount( struct file_system_type *fs_type, struct dentry *reference); @@ -382,7 +443,9 @@ destroying a context: Note that it's not a requirement that the reference dentry be of the same filesystem type as fs_type. - (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc); + * :: + + struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc); Duplicate a filesystem context, copying any options noted and duplicating or additionally referencing any resources held therein. This is available @@ -392,14 +455,18 @@ destroying a context: The purpose in the new context is inherited from the old one. - (*) void put_fs_context(struct fs_context *fc); + * :: + + void put_fs_context(struct fs_context *fc); Destroy a filesystem context, releasing any resources it holds. This calls the ->free() operation. This is intended to be called by anyone who created a filesystem context. - [!] filesystem contexts are not refcounted, so this causes unconditional - destruction. + .. Warning:: + + filesystem contexts are not refcounted, so this causes unconditional + destruction. In all the above operations, apart from the put op, the return is a mount context pointer or a negative error code. @@ -407,8 +474,10 @@ context pointer or a negative error code. For the remaining operations, if an error occurs, a negative error code will be returned. - (*) int vfs_parse_fs_param(struct fs_context *fc, - struct fs_parameter *param); + * :: + + int vfs_parse_fs_param(struct fs_context *fc, + struct fs_parameter *param); Supply a single mount parameter to the filesystem context. This include the specification of the source/device which is specified as the "source" @@ -423,53 +492,64 @@ returned. The parameter value is typed and can be one of: - fs_value_is_flag, Parameter not given a value. - fs_value_is_string, Value is a string - fs_value_is_blob, Value is a binary blob - fs_value_is_filename, Value is a filename* + dirfd - fs_value_is_file, Value is an open file (file*) + ==================== ============================= + fs_value_is_flag Parameter not given a value + fs_value_is_string Value is a string + fs_value_is_blob Value is a binary blob + fs_value_is_filename Value is a filename* + dirfd + fs_value_is_file Value is an open file (file*) + ==================== ============================= If there is a value, that value is stored in a union in the struct in one of param->{string,blob,name,file}. Note that the function may steal and clear the pointer, but then becomes responsible for disposing of the object. - (*) int vfs_parse_fs_string(struct fs_context *fc, const char *key, - const char *value, size_t v_size); + * :: + + int vfs_parse_fs_string(struct fs_context *fc, const char *key, + const char *value, size_t v_size); A wrapper around vfs_parse_fs_param() that copies the value string it is passed. - (*) int generic_parse_monolithic(struct fs_context *fc, void *data); + * :: + + int generic_parse_monolithic(struct fs_context *fc, void *data); Parse a sys_mount() data page, assuming the form to be a text list consisting of key[=val] options separated by commas. Each item in the list is passed to vfs_mount_option(). This is the default when the ->parse_monolithic() method is NULL. - (*) int vfs_get_tree(struct fs_context *fc); + * :: + + int vfs_get_tree(struct fs_context *fc); Get or create the mountable root and superblock, using the parameters in the filesystem context to select/configure the superblock. This invokes the ->get_tree() method. - (*) struct vfsmount *vfs_create_mount(struct fs_context *fc); + * :: + + struct vfsmount *vfs_create_mount(struct fs_context *fc); Create a mount given the parameters in the specified filesystem context. Note that this does not attach the mount to anything. -=========================== -SUPERBLOCK CREATION HELPERS +Superblock Creation Helpers =========================== A number of VFS helpers are available for use by filesystems for the creation or looking up of superblocks. - (*) struct super_block * - sget_fc(struct fs_context *fc, - int (*test)(struct super_block *sb, struct fs_context *fc), - int (*set)(struct super_block *sb, struct fs_context *fc)); + * :: + + struct super_block * + sget_fc(struct fs_context *fc, + int (*test)(struct super_block *sb, struct fs_context *fc), + int (*set)(struct super_block *sb, struct fs_context *fc)); This is the core routine. If test is non-NULL, it searches for an existing superblock matching the criteria held in the fs_context, using @@ -482,10 +562,12 @@ or looking up of superblocks. The following helpers all wrap sget_fc(): - (*) int vfs_get_super(struct fs_context *fc, - enum vfs_get_super_keying keying, - int (*fill_super)(struct super_block *sb, - struct fs_context *fc)) + * :: + + int vfs_get_super(struct fs_context *fc, + enum vfs_get_super_keying keying, + int (*fill_super)(struct super_block *sb, + struct fs_context *fc)) This creates/looks up a deviceless superblock. The keying indicates how many superblocks of this type may exist and in what manner they may be @@ -515,14 +597,14 @@ PARAMETER DESCRIPTION ===================== Parameters are described using structures defined in linux/fs_parser.h. -There's a core description struct that links everything together: +There's a core description struct that links everything together:: struct fs_parameter_description { const struct fs_parameter_spec *specs; const struct fs_parameter_enum *enums; }; -For example: +For example:: enum { Opt_autocell, @@ -539,10 +621,12 @@ For example: The members are as follows: - (1) const struct fs_parameter_specification *specs; + (1) :: + + const struct fs_parameter_specification *specs; Table of parameter specifications, terminated with a null entry, where the - entries are of type: + entries are of type:: struct fs_parameter_spec { const char *name; @@ -558,6 +642,7 @@ The members are as follows: The 'type' field indicates the desired value type and must be one of: + ======================= ======================= ===================== TYPE NAME EXPECTED VALUE RESULT IN ======================= ======================= ===================== fs_param_is_flag No value n/a @@ -573,19 +658,23 @@ The members are as follows: fs_param_is_blockdev Blockdev path * Needs lookup fs_param_is_path Path * Needs lookup fs_param_is_fd File descriptor result->int_32 + ======================= ======================= ===================== Note that if the value is of fs_param_is_bool type, fs_parse() will try to match any string value against "0", "1", "no", "yes", "false", "true". Each parameter can also be qualified with 'flags': + ======================= ================================================ fs_param_v_optional The value is optional fs_param_neg_with_no result->negated set if key is prefixed with "no" fs_param_neg_with_empty result->negated set if value is "" fs_param_deprecated The parameter is deprecated. + ======================= ================================================ These are wrapped with a number of convenience wrappers: + ======================= =============================================== MACRO SPECIFIES ======================= =============================================== fsparam_flag() fs_param_is_flag @@ -602,9 +691,10 @@ The members are as follows: fsparam_bdev() fs_param_is_blockdev fsparam_path() fs_param_is_path fsparam_fd() fs_param_is_fd + ======================= =============================================== all of which take two arguments, name string and option number - for - example: + example:: static const struct fs_parameter_spec afs_param_specs[] = { fsparam_flag ("autocell", Opt_autocell), @@ -618,10 +708,12 @@ The members are as follows: of arguments to specify the type and the flags for anything that doesn't match one of the above macros. - (2) const struct fs_parameter_enum *enums; + (2) :: + + const struct fs_parameter_enum *enums; Table of enum value names to integer mappings, terminated with a null - entry. This is of type: + entry. This is of type:: struct fs_parameter_enum { u8 opt; @@ -630,7 +722,7 @@ The members are as follows: }; Where the array is an unsorted list of { parameter ID, name }-keyed - elements that indicate the value to map to, e.g.: + elements that indicate the value to map to, e.g.:: static const struct fs_parameter_enum afs_param_enums[] = { { Opt_bar, "x", 1}, @@ -648,18 +740,19 @@ CONFIG_VALIDATE_FS_PARSER=y) and will allow the description to be queried from userspace using the fsinfo() syscall. -========================== -PARAMETER HELPER FUNCTIONS +Parameter Helper Functions ========================== A number of helper functions are provided to help a filesystem or an LSM process the parameters it is given. - (*) int lookup_constant(const struct constant_table tbl[], - const char *name, int not_found); + * :: + + int lookup_constant(const struct constant_table tbl[], + const char *name, int not_found); Look up a constant by name in a table of name -> integer mappings. The - table is an array of elements of the following type: + table is an array of elements of the following type:: struct constant_table { const char *name; @@ -669,9 +762,11 @@ process the parameters it is given. If a match is found, the corresponding value is returned. If a match isn't found, the not_found value is returned instead. - (*) bool validate_constant_table(const struct constant_table *tbl, - size_t tbl_size, - int low, int high, int special); + * :: + + bool validate_constant_table(const struct constant_table *tbl, + size_t tbl_size, + int low, int high, int special); Validate a constant table. Checks that all the elements are appropriately ordered, that there are no duplicates and that the values are between low @@ -682,16 +777,20 @@ process the parameters it is given. If all is good, true is returned. If the table is invalid, errors are logged to dmesg and false is returned. - (*) bool fs_validate_description(const struct fs_parameter_description *desc); + * :: + + bool fs_validate_description(const struct fs_parameter_description *desc); This performs some validation checks on a parameter description. It returns true if the description is good and false if it is not. It will log errors to dmesg if validation fails. - (*) int fs_parse(struct fs_context *fc, - const struct fs_parameter_description *desc, - struct fs_parameter *param, - struct fs_parse_result *result); + * :: + + int fs_parse(struct fs_context *fc, + const struct fs_parameter_description *desc, + struct fs_parameter *param, + struct fs_parse_result *result); This is the main interpreter of parameters. It uses the parameter description to look up a parameter by key name and to convert that to an @@ -711,14 +810,16 @@ process the parameters it is given. parameter is matched, but the value is erroneous, -EINVAL will be returned; otherwise the parameter's option number will be returned. - (*) int fs_lookup_param(struct fs_context *fc, - struct fs_parameter *value, - bool want_bdev, - struct path *_path); + * :: + + int fs_lookup_param(struct fs_context *fc, + struct fs_parameter *value, + bool want_bdev, + struct path *_path); This takes a parameter that carries a string or filename type and attempts to do a path lookup on it. If the parameter expects a blockdev, a check is made that the inode actually represents one. - Returns 0 if successful and *_path will be set; returns a negative error - code if not. + Returns 0 if successful and ``*_path`` will be set; returns a negative + error code if not. diff --git a/Documentation/filesystems/orangefs.rst b/Documentation/filesystems/orangefs.rst index e41369709c5b..463e37694250 100644 --- a/Documentation/filesystems/orangefs.rst +++ b/Documentation/filesystems/orangefs.rst @@ -119,9 +119,7 @@ it comes to that question:: /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf -Create an /etc/pvfs2tab file:: - -Localhost is fine for your pvfs2tab file: +Create an /etc/pvfs2tab file (localhost is fine):: echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \ /etc/pvfs2tab diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 092b7b44d158..430963e0e8c3 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -543,6 +543,7 @@ encoded manner. The codes are the following: hg huge page advise flag nh no huge page advise flag mg mergable advise flag + bt - arm64 BTI guarded page == ======================================= Note that there is no guarantee that every flag and associated mnemonic will @@ -1870,7 +1871,7 @@ unbindable mount is unbindable For more information on mount propagation see: - Documentation/filesystems/sharedsubtree.txt + Documentation/filesystems/sharedsubtree.rst 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm diff --git a/Documentation/filesystems/quota.txt b/Documentation/filesystems/quota.rst index 32874b06ebe9..a30cdd47c652 100644 --- a/Documentation/filesystems/quota.txt +++ b/Documentation/filesystems/quota.rst @@ -1,4 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 +=============== Quota subsystem =============== @@ -39,6 +41,7 @@ Currently, the interface supports only one message type QUOTA_NL_C_WARNING. This command is used to send a notification about any of the above mentioned events. Each message has six attributes. These are (type of the argument is in parentheses): + QUOTA_NL_A_QTYPE (u32) - type of quota being exceeded (one of USRQUOTA, GRPQUOTA) QUOTA_NL_A_EXCESS_ID (u64) @@ -48,20 +51,34 @@ in parentheses): - UID of a user who caused the event QUOTA_NL_A_WARNING (u32) - what kind of limit is exceeded: - QUOTA_NL_IHARDWARN - inode hardlimit - QUOTA_NL_ISOFTLONGWARN - inode softlimit is exceeded longer - than given grace period - QUOTA_NL_ISOFTWARN - inode softlimit - QUOTA_NL_BHARDWARN - space (block) hardlimit - QUOTA_NL_BSOFTLONGWARN - space (block) softlimit is exceeded - longer than given grace period. - QUOTA_NL_BSOFTWARN - space (block) softlimit + + QUOTA_NL_IHARDWARN + inode hardlimit + QUOTA_NL_ISOFTLONGWARN + inode softlimit is exceeded longer + than given grace period + QUOTA_NL_ISOFTWARN + inode softlimit + QUOTA_NL_BHARDWARN + space (block) hardlimit + QUOTA_NL_BSOFTLONGWARN + space (block) softlimit is exceeded + longer than given grace period. + QUOTA_NL_BSOFTWARN + space (block) softlimit + - four warnings are also defined for the event when user stops exceeding some limit: - QUOTA_NL_IHARDBELOW - inode hardlimit - QUOTA_NL_ISOFTBELOW - inode softlimit - QUOTA_NL_BHARDBELOW - space (block) hardlimit - QUOTA_NL_BSOFTBELOW - space (block) softlimit + + QUOTA_NL_IHARDBELOW + inode hardlimit + QUOTA_NL_ISOFTBELOW + inode softlimit + QUOTA_NL_BHARDBELOW + space (block) hardlimit + QUOTA_NL_BSOFTBELOW + space (block) softlimit + QUOTA_NL_A_DEV_MAJOR (u32) - major number of a device with the affected filesystem QUOTA_NL_A_DEV_MINOR (u32) diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.rst b/Documentation/filesystems/ramfs-rootfs-initramfs.rst index 6c576e241d86..3fddacc6bf14 100644 --- a/Documentation/filesystems/ramfs-rootfs-initramfs.rst +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.rst @@ -71,7 +71,7 @@ be allowed write access to a ramfs mount. A ramfs derivative called tmpfs was created to add size limits, and the ability to write the data to swap space. Normal users can be allowed write access to -tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. +tmpfs mounts. See Documentation/filesystems/tmpfs.rst for more information. What is rootfs? --------------- diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.rst index d412b236a9d6..fab302046b13 100644 --- a/Documentation/filesystems/seq_file.txt +++ b/Documentation/filesystems/seq_file.rst @@ -1,6 +1,11 @@ -The seq_file interface +.. SPDX-License-Identifier: GPL-2.0 + +====================== +The seq_file Interface +====================== Copyright 2003 Jonathan Corbet <corbet@lwn.net> + This file is originally from the LWN.net Driver Porting series at http://lwn.net/Articles/driver-porting/ @@ -43,7 +48,7 @@ loadable module which creates a file called /proc/sequence. The file, when read, simply produces a set of increasing integer values, one per line. The sequence will continue until the user loses patience and finds something better to do. The file is seekable, in that one can do something like the -following: +following:: dd if=/proc/sequence of=out1 count=1 dd if=/proc/sequence skip=1 of=out2 count=1 @@ -55,16 +60,18 @@ wanting to see the full source for this module can find it at http://lwn.net/Articles/22359/). Deprecated create_proc_entry +============================ Note that the above article uses create_proc_entry which was removed in -kernel 3.10. Current versions require the following update +kernel 3.10. Current versions require the following update:: -- entry = create_proc_entry("sequence", 0, NULL); -- if (entry) -- entry->proc_fops = &ct_file_ops; -+ entry = proc_create("sequence", 0, NULL, &ct_file_ops); + - entry = create_proc_entry("sequence", 0, NULL); + - if (entry) + - entry->proc_fops = &ct_file_ops; + + entry = proc_create("sequence", 0, NULL, &ct_file_ops); The iterator interface +====================== Modules implementing a virtual file with seq_file must implement an iterator object that allows stepping through the data of interest @@ -99,7 +106,7 @@ position. The pos passed to start() will always be either zero, or the most recent pos used in the previous session. For our simple sequence example, -the start() function looks like: +the start() function looks like:: static void *ct_seq_start(struct seq_file *s, loff_t *pos) { @@ -129,7 +136,7 @@ move the iterator forward to the next position in the sequence. The example module can simply increment the position by one; more useful modules will do what is needed to step through some data structure. The next() function returns a new iterator, or NULL if the sequence is -complete. Here's the example version: +complete. Here's the example version:: static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos) { @@ -141,10 +148,10 @@ complete. Here's the example version: The stop() function closes a session; its job, of course, is to clean up. If dynamic memory is allocated for the iterator, stop() is the place to free it; if a lock was taken by start(), stop() must release -that lock. The value that *pos was set to by the last next() call +that lock. The value that ``*pos`` was set to by the last next() call before stop() is remembered, and used for the first start() call of the next session unless lseek() has been called on the file; in that -case next start() will be asked to start at position zero. +case next start() will be asked to start at position zero:: static void ct_seq_stop(struct seq_file *s, void *v) { @@ -152,7 +159,7 @@ case next start() will be asked to start at position zero. } Finally, the show() function should format the object currently pointed to -by the iterator for output. The example module's show() function is: +by the iterator for output. The example module's show() function is:: static int ct_seq_show(struct seq_file *s, void *v) { @@ -169,7 +176,7 @@ generated output before returning SEQ_SKIP, that output will be dropped. We will look at seq_printf() in a moment. But first, the definition of the seq_file iterator is finished by creating a seq_operations structure with -the four functions we have just defined: +the four functions we have just defined:: static const struct seq_operations ct_seq_ops = { .start = ct_seq_start, @@ -194,6 +201,7 @@ other locks while the iterator is active. Formatted output +================ The seq_file code manages positioning within the output created by the iterator and getting it into the user's buffer. But, for that to work, that @@ -203,7 +211,7 @@ been defined which make this task easy. Most code will simply use seq_printf(), which works pretty much like printk(), but which requires the seq_file pointer as an argument. -For straight character output, the following functions may be used: +For straight character output, the following functions may be used:: seq_putc(struct seq_file *m, char c); seq_puts(struct seq_file *m, const char *s); @@ -213,7 +221,7 @@ The first two output a single character and a string, just like one would expect. seq_escape() is like seq_puts(), except that any character in s which is in the string esc will be represented in octal form in the output. -There are also a pair of functions for printing filenames: +There are also a pair of functions for printing filenames:: int seq_path(struct seq_file *m, const struct path *path, const char *esc); @@ -226,8 +234,10 @@ the path relative to the current process's filesystem root. If a different root is desired, it can be used with seq_path_root(). If it turns out that path cannot be reached from root, seq_path_root() returns SEQ_SKIP. -A function producing complicated output may want to check +A function producing complicated output may want to check:: + bool seq_has_overflowed(struct seq_file *m); + and avoid further seq_<output> calls if true is returned. A true return from seq_has_overflowed means that the seq_file buffer will @@ -236,6 +246,7 @@ buffer and retry printing. Making it all work +================== So far, we have a nice set of functions which can produce output within the seq_file system, but we have not yet turned them into a file that a user @@ -244,7 +255,7 @@ creation of a set of file_operations which implement the operations on that file. The seq_file interface provides a set of canned operations which do most of the work. The virtual file author still must implement the open() method, however, to hook everything up. The open function is often a single -line, as in the example module: +line, as in the example module:: static int ct_open(struct inode *inode, struct file *file) { @@ -263,7 +274,7 @@ by the iterator functions. There is also a wrapper function to seq_open() called seq_open_private(). It kmallocs a zero filled block of memory and stores a pointer to it in the private field of the seq_file structure, returning 0 on success. The -block size is specified in a third parameter to the function, e.g.: +block size is specified in a third parameter to the function, e.g.:: static int ct_open(struct inode *inode, struct file *file) { @@ -273,7 +284,7 @@ block size is specified in a third parameter to the function, e.g.: There is also a variant function, __seq_open_private(), which is functionally identical except that, if successful, it returns the pointer to the allocated -memory block, allowing further initialisation e.g.: +memory block, allowing further initialisation e.g.:: static int ct_open(struct inode *inode, struct file *file) { @@ -295,7 +306,7 @@ frees the memory allocated in the corresponding open. The other operations of interest - read(), llseek(), and release() - are all implemented by the seq_file code itself. So a virtual file's -file_operations structure will look like: +file_operations structure will look like:: static const struct file_operations ct_file_ops = { .owner = THIS_MODULE, @@ -309,7 +320,7 @@ There is also a seq_release_private() which passes the contents of the seq_file private field to kfree() before releasing the structure. The final step is the creation of the /proc file itself. In the example -code, that is done in the initialization code in the usual way: +code, that is done in the initialization code in the usual way:: static int ct_init(void) { @@ -325,9 +336,10 @@ And that is pretty much it. seq_list +======== If your file will be iterating through a linked list, you may find these -routines useful: +routines useful:: struct list_head *seq_list_start(struct list_head *head, loff_t pos); @@ -338,15 +350,16 @@ routines useful: These helpers will interpret pos as a position within the list and iterate accordingly. Your start() and next() functions need only invoke the -seq_list_* helpers with a pointer to the appropriate list_head structure. +``seq_list_*`` helpers with a pointer to the appropriate list_head structure. The extra-simple version +======================== For extremely simple virtual files, there is an even easier interface. A module can define only the show() function, which should create all the output that the virtual file will contain. The file's open() method then -calls: +calls:: int single_open(struct file *file, int (*show)(struct seq_file *m, void *p), diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.rst index 8ccfbd55244b..d83395354250 100644 --- a/Documentation/filesystems/sharedsubtree.txt +++ b/Documentation/filesystems/sharedsubtree.rst @@ -1,7 +1,10 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== Shared Subtrees ---------------- +=============== -Contents: +.. Contents: 1) Overview 2) Features 3) Setting mount states @@ -41,31 +44,38 @@ replicas continue to be exactly same. Here is an example: - Let's say /mnt has a mount that is shared. - mount --make-shared /mnt + Let's say /mnt has a mount that is shared:: + + mount --make-shared /mnt Note: mount(8) command now supports the --make-shared flag, so the sample 'smount' program is no longer needed and has been removed. - # mount --bind /mnt /tmp + :: + + # mount --bind /mnt /tmp + The above command replicates the mount at /mnt to the mountpoint /tmp and the contents of both the mounts remain identical. - #ls /mnt - a b c + :: - #ls /tmp - a b c + #ls /mnt + a b c - Now let's say we mount a device at /tmp/a - # mount /dev/sd0 /tmp/a + #ls /tmp + a b c - #ls /tmp/a - t1 t2 t3 + Now let's say we mount a device at /tmp/a:: - #ls /mnt/a - t1 t2 t3 + # mount /dev/sd0 /tmp/a + + #ls /tmp/a + t1 t2 t3 + + #ls /mnt/a + t1 t2 t3 Note that the mount has propagated to the mount at /mnt as well. @@ -123,14 +133,15 @@ replicas continue to be exactly same. 2d) A unbindable mount is a unbindable private mount - let's say we have a mount at /mnt and we make it unbindable + let's say we have a mount at /mnt and we make it unbindable:: + + # mount --make-unbindable /mnt - # mount --make-unbindable /mnt + Let's try to bind mount this mount somewhere else:: - Let's try to bind mount this mount somewhere else. - # mount --bind /mnt /tmp - mount: wrong fs type, bad option, bad superblock on /mnt, - or too many mounted file systems + # mount --bind /mnt /tmp + mount: wrong fs type, bad option, bad superblock on /mnt, + or too many mounted file systems Binding a unbindable mount is a invalid operation. @@ -138,12 +149,12 @@ replicas continue to be exactly same. 3) Setting mount states The mount command (util-linux package) can be used to set mount - states: + states:: - mount --make-shared mountpoint - mount --make-slave mountpoint - mount --make-private mountpoint - mount --make-unbindable mountpoint + mount --make-shared mountpoint + mount --make-slave mountpoint + mount --make-private mountpoint + mount --make-unbindable mountpoint 4) Use cases @@ -154,9 +165,10 @@ replicas continue to be exactly same. Solution: - The system administrator can make the mount at /cdrom shared - mount --bind /cdrom /cdrom - mount --make-shared /cdrom + The system administrator can make the mount at /cdrom shared:: + + mount --bind /cdrom /cdrom + mount --make-shared /cdrom Now any process that clones off a new namespace will have a mount at /cdrom which is a replica of the same mount in the @@ -172,14 +184,14 @@ replicas continue to be exactly same. Solution: To begin with, the administrator can mark the entire mount tree - as shareable. + as shareable:: - mount --make-rshared / + mount --make-rshared / A new process can clone off a new namespace. And mark some part - of its namespace as slave + of its namespace as slave:: - mount --make-rslave /myprivatetree + mount --make-rslave /myprivatetree Hence forth any mounts within the /myprivatetree done by the process will not show up in any other namespace. However mounts @@ -206,13 +218,13 @@ replicas continue to be exactly same. versions of the file depending on the path used to access that file. - An example is: + An example is:: - mount --make-shared / - mount --rbind / /view/v1 - mount --rbind / /view/v2 - mount --rbind / /view/v3 - mount --rbind / /view/v4 + mount --make-shared / + mount --rbind / /view/v1 + mount --rbind / /view/v2 + mount --rbind / /view/v3 + mount --rbind / /view/v4 and if /usr has a versioning filesystem mounted, then that mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and @@ -224,8 +236,8 @@ replicas continue to be exactly same. filesystem is being requested and return the corresponding inode. -5) Detailed semantics: -------------------- +5) Detailed semantics +--------------------- The section below explains the detailed semantics of bind, rbind, move, mount, umount and clone-namespace operations. @@ -235,6 +247,7 @@ replicas continue to be exactly same. 5a) Mount states A given mount can be in one of the following states + 1) shared 2) slave 3) shared and slave @@ -252,7 +265,8 @@ replicas continue to be exactly same. A 'shared mount' is defined as a vfsmount that belongs to a 'peer group'. - For example: + For example:: + mount --make-shared /mnt mount --bind /mnt /tmp @@ -270,7 +284,7 @@ replicas continue to be exactly same. A slave mount as the name implies has a master mount from which mount/unmount events are received. Events do not propagate from the slave mount to the master. Only a shared mount can be made - a slave by executing the following command + a slave by executing the following command:: mount --make-slave mount @@ -290,8 +304,10 @@ replicas continue to be exactly same. peer group. Only a slave vfsmount can be made as 'shared and slave' by - either executing the following command + either executing the following command:: + mount --make-shared mount + or by moving the slave vfsmount under a shared vfsmount. (4) Private mount @@ -307,30 +323,32 @@ replicas continue to be exactly same. State diagram: + The state diagram below explains the state transition of a mount, - in response to various commands. - ------------------------------------------------------------------------ - | |make-shared | make-slave | make-private |make-unbindab| - --------------|------------|--------------|--------------|-------------| - |shared |shared |*slave/private| private | unbindable | - | | | | | | - |-------------|------------|--------------|--------------|-------------| - |slave |shared | **slave | private | unbindable | - | |and slave | | | | - |-------------|------------|--------------|--------------|-------------| - |shared |shared | slave | private | unbindable | - |and slave |and slave | | | | - |-------------|------------|--------------|--------------|-------------| - |private |shared | **private | private | unbindable | - |-------------|------------|--------------|--------------|-------------| - |unbindable |shared |**unbindable | private | unbindable | - ------------------------------------------------------------------------ - - * if the shared mount is the only mount in its peer group, making it - slave, makes it private automatically. Note that there is no master to - which it can be slaved to. - - ** slaving a non-shared mount has no effect on the mount. + in response to various commands:: + + ----------------------------------------------------------------------- + | |make-shared | make-slave | make-private |make-unbindab| + --------------|------------|--------------|--------------|-------------| + |shared |shared |*slave/private| private | unbindable | + | | | | | | + |-------------|------------|--------------|--------------|-------------| + |slave |shared | **slave | private | unbindable | + | |and slave | | | | + |-------------|------------|--------------|--------------|-------------| + |shared |shared | slave | private | unbindable | + |and slave |and slave | | | | + |-------------|------------|--------------|--------------|-------------| + |private |shared | **private | private | unbindable | + |-------------|------------|--------------|--------------|-------------| + |unbindable |shared |**unbindable | private | unbindable | + ------------------------------------------------------------------------ + + * if the shared mount is the only mount in its peer group, making it + slave, makes it private automatically. Note that there is no master to + which it can be slaved to. + + ** slaving a non-shared mount has no effect on the mount. Apart from the commands listed below, the 'move' operation also changes the state of a mount depending on type of the destination mount. Its @@ -338,31 +356,32 @@ replicas continue to be exactly same. 5b) Bind semantics - Consider the following command + Consider the following command:: - mount --bind A/a B/b + mount --bind A/a B/b where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B' is the destination mount and 'b' is the dentry in the destination mount. The outcome depends on the type of mount of 'A' and 'B'. The table - below contains quick reference. - --------------------------------------------------------------------------- - | BIND MOUNT OPERATION | - |************************************************************************** - |source(A)->| shared | private | slave | unbindable | - | dest(B) | | | | | - | | | | | | | - | v | | | | | - |************************************************************************** - | shared | shared | shared | shared & slave | invalid | - | | | | | | - |non-shared| shared | private | slave | invalid | - *************************************************************************** + below contains quick reference:: + + -------------------------------------------------------------------------- + | BIND MOUNT OPERATION | + |************************************************************************| + |source(A)->| shared | private | slave | unbindable | + | dest(B) | | | | | + | | | | | | | + | v | | | | | + |************************************************************************| + | shared | shared | shared | shared & slave | invalid | + | | | | | | + |non-shared| shared | private | slave | invalid | + ************************************************************************** Details: - 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C' + 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C' which is clone of 'A', is created. Its root dentry is 'a' . 'C' is mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ... are created and mounted at the dentry 'b' on all mounts where 'B' @@ -371,7 +390,7 @@ replicas continue to be exactly same. 'B'. And finally the peer-group of 'C' is merged with the peer group of 'A'. - 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C' + 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C' which is clone of 'A', is created. Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ... are created and mounted at the dentry 'b' on all mounts where 'B' @@ -379,7 +398,7 @@ replicas continue to be exactly same. 'C', 'C1', .., 'Cn' with exactly the same configuration as the propagation tree for 'B'. - 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new + 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new mount 'C' which is clone of 'A', is created. Its root dentry is 'a' . 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ... are created and mounted at the dentry 'b' on all mounts where @@ -389,19 +408,19 @@ replicas continue to be exactly same. is made the slave of mount 'Z'. In other words, mount 'C' is in the state 'slave and shared'. - 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a + 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a invalid operation. - 5. 'A' is a private mount and 'B' is a non-shared(private or slave or + 5. 'A' is a private mount and 'B' is a non-shared(private or slave or unbindable) mount. A new mount 'C' which is clone of 'A', is created. Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'. - 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C' + 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C' which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'. 'C' is made a member of the peer-group of 'A'. - 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A + 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A new mount 'C' which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of @@ -409,7 +428,7 @@ replicas continue to be exactly same. mount/unmount on 'A' do not propagate anywhere else. Similarly mount/unmount on 'C' do not propagate anywhere else. - 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a + 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a invalid operation. A unbindable mount cannot be bind mounted. 5c) Rbind semantics @@ -422,7 +441,9 @@ replicas continue to be exactly same. then the subtree under the unbindable mount is pruned in the new location. - eg: let's say we have the following mount tree. + eg: + + let's say we have the following mount tree:: A / \ @@ -430,12 +451,12 @@ replicas continue to be exactly same. / \ / \ D E F G - Let's say all the mount except the mount C in the tree are - of a type other than unbindable. + Let's say all the mount except the mount C in the tree are + of a type other than unbindable. - If this tree is rbound to say Z + If this tree is rbound to say Z - We will have the following tree at the new location. + We will have the following tree at the new location:: Z | @@ -457,24 +478,26 @@ replicas continue to be exactly same. the dentry in the destination mount. The outcome depends on the type of the mount of 'A' and 'B'. The table - below is a quick reference. - --------------------------------------------------------------------------- - | MOVE MOUNT OPERATION | - |************************************************************************** - | source(A)->| shared | private | slave | unbindable | - | dest(B) | | | | | - | | | | | | | - | v | | | | | - |************************************************************************** - | shared | shared | shared |shared and slave| invalid | - | | | | | | - |non-shared| shared | private | slave | unbindable | - *************************************************************************** - NOTE: moving a mount residing under a shared mount is invalid. + below is a quick reference:: + + --------------------------------------------------------------------------- + | MOVE MOUNT OPERATION | + |************************************************************************** + | source(A)->| shared | private | slave | unbindable | + | dest(B) | | | | | + | | | | | | | + | v | | | | | + |************************************************************************** + | shared | shared | shared |shared and slave| invalid | + | | | | | | + |non-shared| shared | private | slave | unbindable | + *************************************************************************** + + .. Note:: moving a mount residing under a shared mount is invalid. Details follow: - 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is + 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An' are created and mounted at dentry 'b' on all mounts that receive propagation from mount 'B'. A new propagation tree is created in the @@ -483,7 +506,7 @@ replicas continue to be exactly same. propagation tree is appended to the already existing propagation tree of 'A'. - 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is + 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that receive propagation from mount 'B'. The mount 'A' becomes a shared mount and a @@ -491,7 +514,7 @@ replicas continue to be exactly same. 'B'. This new propagation tree contains all the new mounts 'A1', 'A2'... 'An'. - 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The + 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that receive propagation from mount 'B'. A new propagation tree is created @@ -501,32 +524,32 @@ replicas continue to be exactly same. 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also becomes 'shared'. - 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation + 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation is invalid. Because mounting anything on the shared mount 'B' can create new mounts that get mounted on the mounts that receive propagation from 'B'. And since the mount 'A' is unbindable, cloning it to mount at other mountpoints is not possible. - 5. 'A' is a private mount and 'B' is a non-shared(private or slave or + 5. 'A' is a private mount and 'B' is a non-shared(private or slave or unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. - 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A' + 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a shared mount. - 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. + 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a slave mount of mount 'Z'. - 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount + 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a unbindable mount. 5e) Mount semantics - Consider the following command + Consider the following command:: - mount device B/b + mount device B/b 'B' is the destination mount and 'b' is the dentry in the destination mount. @@ -537,9 +560,9 @@ replicas continue to be exactly same. 5f) Unmount semantics - Consider the following command + Consider the following command:: - umount A + umount A where 'A' is a mount mounted on mount 'B' at dentry 'b'. @@ -592,10 +615,12 @@ replicas continue to be exactly same. A. What is the result of the following command sequence? - mount --bind /mnt /mnt - mount --make-shared /mnt - mount --bind /mnt /tmp - mount --move /tmp /mnt/1 + :: + + mount --bind /mnt /mnt + mount --make-shared /mnt + mount --bind /mnt /tmp + mount --move /tmp /mnt/1 what should be the contents of /mnt /mnt/1 /mnt/1/1 should be? Should they all be identical? or should /mnt and /mnt/1 be @@ -604,23 +629,27 @@ replicas continue to be exactly same. B. What is the result of the following command sequence? - mount --make-rshared / - mkdir -p /v/1 - mount --rbind / /v/1 + :: + + mount --make-rshared / + mkdir -p /v/1 + mount --rbind / /v/1 what should be the content of /v/1/v/1 be? C. What is the result of the following command sequence? - mount --bind /mnt /mnt - mount --make-shared /mnt - mkdir -p /mnt/1/2/3 /mnt/1/test - mount --bind /mnt/1 /tmp - mount --make-slave /mnt - mount --make-shared /mnt - mount --bind /mnt/1/2 /tmp1 - mount --make-slave /mnt + :: + + mount --bind /mnt /mnt + mount --make-shared /mnt + mkdir -p /mnt/1/2/3 /mnt/1/test + mount --bind /mnt/1 /tmp + mount --make-slave /mnt + mount --make-shared /mnt + mount --bind /mnt/1/2 /tmp1 + mount --make-slave /mnt At this point we have the first mount at /tmp and its root dentry is 1. Let's call this mount 'A' @@ -668,7 +697,8 @@ replicas continue to be exactly same. step 1: let's say the root tree has just two directories with - one vfsmount. + one vfsmount:: + root / \ tmp usr @@ -676,14 +706,17 @@ replicas continue to be exactly same. And we want to replicate the tree at multiple mountpoints under /root/tmp - step2: - mount --make-shared /root + step 2: + :: - mkdir -p /tmp/m1 - mount --rbind /root /tmp/m1 + mount --make-shared /root - the new tree now looks like this: + mkdir -p /tmp/m1 + + mount --rbind /root /tmp/m1 + + the new tree now looks like this:: root / \ @@ -697,11 +730,13 @@ replicas continue to be exactly same. it has two vfsmounts - step3: + step 3: + :: + mkdir -p /tmp/m2 mount --rbind /root /tmp/m2 - the new tree now looks like this: + the new tree now looks like this:: root / \ @@ -724,6 +759,7 @@ replicas continue to be exactly same. it has 6 vfsmounts step 4: + :: mkdir -p /tmp/m3 mount --rbind /root /tmp/m3 @@ -740,7 +776,8 @@ replicas continue to be exactly same. step 1: let's say the root tree has just two directories with - one vfsmount. + one vfsmount:: + root / \ tmp usr @@ -748,17 +785,20 @@ replicas continue to be exactly same. How do we set up the same tree at multiple locations under /root/tmp - step2: - mount --bind /root/tmp /root/tmp + step 2: + :: - mount --make-rshared /root - mount --make-unbindable /root/tmp - mkdir -p /tmp/m1 + mount --bind /root/tmp /root/tmp - mount --rbind /root /tmp/m1 + mount --make-rshared /root + mount --make-unbindable /root/tmp - the new tree now looks like this: + mkdir -p /tmp/m1 + + mount --rbind /root /tmp/m1 + + the new tree now looks like this:: root / \ @@ -768,11 +808,13 @@ replicas continue to be exactly same. / \ tmp usr - step3: + step 3: + :: + mkdir -p /tmp/m2 mount --rbind /root /tmp/m2 - the new tree now looks like this: + the new tree now looks like this:: root / \ @@ -782,12 +824,13 @@ replicas continue to be exactly same. / \ / \ tmp usr tmp usr - step4: + step 4: + :: mkdir -p /tmp/m3 mount --rbind /root /tmp/m3 - the new tree now looks like this: + the new tree now looks like this:: root / \ @@ -801,25 +844,31 @@ replicas continue to be exactly same. 8A) Datastructure - 4 new fields are introduced to struct vfsmount - ->mnt_share - ->mnt_slave_list - ->mnt_slave - ->mnt_master + 4 new fields are introduced to struct vfsmount: + + * ->mnt_share + * ->mnt_slave_list + * ->mnt_slave + * ->mnt_master - ->mnt_share links together all the mount to/from which this vfsmount + ->mnt_share + links together all the mount to/from which this vfsmount send/receives propagation events. - ->mnt_slave_list links all the mounts to which this vfsmount propagates + ->mnt_slave_list + links all the mounts to which this vfsmount propagates to. - ->mnt_slave links together all the slaves that its master vfsmount + ->mnt_slave + links together all the slaves that its master vfsmount propagates to. - ->mnt_master points to the master vfsmount from which this vfsmount + ->mnt_master + points to the master vfsmount from which this vfsmount receives propagation. - ->mnt_flags takes two more flags to indicate the propagation status of + ->mnt_flags + takes two more flags to indicate the propagation status of the vfsmount. MNT_SHARE indicates that the vfsmount is a shared vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be replicated. @@ -842,7 +891,7 @@ replicas continue to be exactly same. A example propagation tree looks as shown in the figure below. [ NOTE: Though it looks like a forest, if we consider all the shared - mounts as a conceptual entity called 'pnode', it becomes a tree] + mounts as a conceptual entity called 'pnode', it becomes a tree]:: A <--> B <--> C <---> D @@ -864,14 +913,19 @@ replicas continue to be exactly same. A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G' E's ->mnt_share links with ->mnt_share of K - 'E', 'K', 'F', 'G' have their ->mnt_master point to struct - vfsmount of 'A' + + 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A' + 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K' + K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N' C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K' + J and K's ->mnt_master points to struct vfsmount of C + and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I' + 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'. @@ -903,6 +957,7 @@ replicas continue to be exactly same. Prepare phase: for each mount in the source tree: + a) Create the necessary number of mount trees to be attached to each of the mounts that receive propagation from the destination mount. @@ -929,11 +984,12 @@ replicas continue to be exactly same. Abort phase delete all the newly created trees. - NOTE: all the propagation related functionality resides in the file - pnode.c + .. Note:: + all the propagation related functionality resides in the file pnode.c ------------------------------------------------------------------------ version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com) + version 0.2 (Incorporated comments from Al Viro) diff --git a/Documentation/filesystems/spufs/index.rst b/Documentation/filesystems/spufs/index.rst new file mode 100644 index 000000000000..5ed4a8494967 --- /dev/null +++ b/Documentation/filesystems/spufs/index.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +SPU Filesystem +============== + + +.. toctree:: + :maxdepth: 1 + + spufs + spu_create + spu_run diff --git a/Documentation/filesystems/spufs/spu_create.rst b/Documentation/filesystems/spufs/spu_create.rst new file mode 100644 index 000000000000..83108c099696 --- /dev/null +++ b/Documentation/filesystems/spufs/spu_create.rst @@ -0,0 +1,131 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========== +spu_create +========== + +Name +==== + spu_create - create a new spu context + + +Synopsis +======== + + :: + + #include <sys/types.h> + #include <sys/spu.h> + + int spu_create(const char *pathname, int flags, mode_t mode); + +Description +=========== + The spu_create system call is used on PowerPC machines that implement + the Cell Broadband Engine Architecture in order to access Synergistic + Processor Units (SPUs). It creates a new logical context for an SPU in + pathname and returns a handle to associated with it. pathname must + point to a non-existing directory in the mount point of the SPU file + system (spufs). When spu_create is successful, a directory gets cre- + ated on pathname and it is populated with files. + + The returned file handle can only be passed to spu_run(2) or closed, + other operations are not defined on it. When it is closed, all associ- + ated directory entries in spufs are removed. When the last file handle + pointing either inside of the context directory or to this file + descriptor is closed, the logical SPU context is destroyed. + + The parameter flags can be zero or any bitwise or'd combination of the + following constants: + + SPU_RAWIO + Allow mapping of some of the hardware registers of the SPU into + user space. This flag requires the CAP_SYS_RAWIO capability, see + capabilities(7). + + The mode parameter specifies the permissions used for creating the new + directory in spufs. mode is modified with the user's umask(2) value + and then used for both the directory and the files contained in it. The + file permissions mask out some more bits of mode because they typically + support only read or write access. See stat(2) for a full list of the + possible mode values. + + +Return Value +============ + spu_create returns a new file descriptor. It may return -1 to indicate + an error condition and set errno to one of the error codes listed + below. + + +Errors +====== + EACCES + The current user does not have write access on the spufs mount + point. + + EEXIST An SPU context already exists at the given path name. + + EFAULT pathname is not a valid string pointer in the current address + space. + + EINVAL pathname is not a directory in the spufs mount point. + + ELOOP Too many symlinks were found while resolving pathname. + + EMFILE The process has reached its maximum open file limit. + + ENAMETOOLONG + pathname was too long. + + ENFILE The system has reached the global open file limit. + + ENOENT Part of pathname could not be resolved. + + ENOMEM The kernel could not allocate all resources required. + + ENOSPC There are not enough SPU resources available to create a new + context or the user specific limit for the number of SPU con- + texts has been reached. + + ENOSYS the functionality is not provided by the current system, because + either the hardware does not provide SPUs or the spufs module is + not loaded. + + ENOTDIR + A part of pathname is not a directory. + + + +Notes +===== + spu_create is meant to be used from libraries that implement a more + abstract interface to SPUs, not to be used from regular applications. + See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- + ommended libraries. + + +Files +===== + pathname must point to a location beneath the mount point of spufs. By + convention, it gets mounted in /spu. + + +Conforming to +============= + This call is Linux specific and only implemented by the ppc64 architec- + ture. Programs using this system call are not portable. + + +Bugs +==== + The code does not yet fully implement all features lined out here. + + +Author +====== + Arnd Bergmann <arndb@de.ibm.com> + +See Also +======== + capabilities(7), close(2), spu_run(2), spufs(7) diff --git a/Documentation/filesystems/spufs/spu_run.rst b/Documentation/filesystems/spufs/spu_run.rst new file mode 100644 index 000000000000..7fdb1c31cb91 --- /dev/null +++ b/Documentation/filesystems/spufs/spu_run.rst @@ -0,0 +1,138 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======= +spu_run +======= + + +Name +==== + spu_run - execute an spu context + + +Synopsis +======== + + :: + + #include <sys/spu.h> + + int spu_run(int fd, unsigned int *npc, unsigned int *event); + +Description +=========== + The spu_run system call is used on PowerPC machines that implement the + Cell Broadband Engine Architecture in order to access Synergistic Pro- + cessor Units (SPUs). It uses the fd that was returned from spu_cre- + ate(2) to address a specific SPU context. When the context gets sched- + uled to a physical SPU, it starts execution at the instruction pointer + passed in npc. + + Execution of SPU code happens synchronously, meaning that spu_run does + not return while the SPU is still running. If there is a need to exe- + cute SPU code in parallel with other code on either the main CPU or + other SPUs, you need to create a new thread of execution first, e.g. + using the pthread_create(3) call. + + When spu_run returns, the current value of the SPU instruction pointer + is written back to npc, so you can call spu_run again without updating + the pointers. + + event can be a NULL pointer or point to an extended status code that + gets filled when spu_run returns. It can be one of the following con- + stants: + + SPE_EVENT_DMA_ALIGNMENT + A DMA alignment error + + SPE_EVENT_SPE_DATA_SEGMENT + A DMA segmentation error + + SPE_EVENT_SPE_DATA_STORAGE + A DMA storage error + + If NULL is passed as the event argument, these errors will result in a + signal delivered to the calling process. + +Return Value +============ + spu_run returns the value of the spu_status register or -1 to indicate + an error and set errno to one of the error codes listed below. The + spu_status register value contains a bit mask of status codes and + optionally a 14 bit code returned from the stop-and-signal instruction + on the SPU. The bit masks for the status codes are: + + 0x02 + SPU was stopped by stop-and-signal. + + 0x04 + SPU was stopped by halt. + + 0x08 + SPU is waiting for a channel. + + 0x10 + SPU is in single-step mode. + + 0x20 + SPU has tried to execute an invalid instruction. + + 0x40 + SPU has tried to access an invalid channel. + + 0x3fff0000 + The bits masked with this value contain the code returned from + stop-and-signal. + + There are always one or more of the lower eight bits set or an error + code is returned from spu_run. + +Errors +====== + EAGAIN or EWOULDBLOCK + fd is in non-blocking mode and spu_run would block. + + EBADF fd is not a valid file descriptor. + + EFAULT npc is not a valid pointer or status is neither NULL nor a valid + pointer. + + EINTR A signal occurred while spu_run was in progress. The npc value + has been updated to the new program counter value if necessary. + + EINVAL fd is not a file descriptor returned from spu_create(2). + + ENOMEM Insufficient memory was available to handle a page fault result- + ing from an MFC direct memory access. + + ENOSYS the functionality is not provided by the current system, because + either the hardware does not provide SPUs or the spufs module is + not loaded. + + +Notes +===== + spu_run is meant to be used from libraries that implement a more + abstract interface to SPUs, not to be used from regular applications. + See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- + ommended libraries. + + +Conforming to +============= + This call is Linux specific and only implemented by the ppc64 architec- + ture. Programs using this system call are not portable. + + +Bugs +==== + The code does not yet fully implement all features lined out here. + + +Author +====== + Arnd Bergmann <arndb@de.ibm.com> + +See Also +======== + capabilities(7), close(2), spu_create(2), spufs(7) diff --git a/Documentation/filesystems/spufs.txt b/Documentation/filesystems/spufs/spufs.rst index eb9e3aa63026..8a42859bb100 100644 --- a/Documentation/filesystems/spufs.txt +++ b/Documentation/filesystems/spufs/spufs.rst @@ -1,12 +1,18 @@ -SPUFS(2) Linux Programmer's Manual SPUFS(2) +.. SPDX-License-Identifier: GPL-2.0 +===== +spufs +===== +Name +==== -NAME spufs - the SPU file system -DESCRIPTION +Description +=========== + The SPU file system is used on PowerPC machines that implement the Cell Broadband Engine Architecture in order to access Synergistic Processor Units (SPUs). @@ -21,7 +27,9 @@ DESCRIPTION ally add or remove files. -MOUNT OPTIONS +Mount Options +============= + uid=<uid> set the user owning the mount point, the default is 0 (root). @@ -29,7 +37,9 @@ MOUNT OPTIONS set the group owning the mount point, the default is 0 (root). -FILES +Files +===== + The files in spufs mostly follow the standard behavior for regular sys- tem calls like read(2) or write(2), but often support only a subset of the operations supported on regular file systems. This list details the @@ -125,14 +135,12 @@ FILES space is available for writing. - /mbox_stat - /ibox_stat - /wbox_stat + /mbox_stat, /ibox_stat, /wbox_stat Read-only files that contain the length of the current queue, i.e. how many words can be read from mbox or ibox or how many words can be written to wbox without blocking. The files can be read only in 4-byte units and return a big-endian binary integer number. The possible - operations on an open *box_stat file are: + operations on an open ``*box_stat`` file are: read(2) If a count smaller than four is requested, read returns -1 and @@ -143,12 +151,7 @@ FILES in EAGAIN. - /npc - /decr - /decr_status - /spu_tag_mask - /event_mask - /srr0 + /npc, /decr, /decr_status, /spu_tag_mask, /event_mask, /srr0 Internal registers of the SPU. The representation is an ASCII string with the numeric value of the next instruction to be executed. These can be used in read/write mode for debugging, but normal operation of @@ -157,17 +160,14 @@ FILES The contents of these files are: + =================== =================================== npc Next Program Counter - decr SPU Decrementer - decr_status Decrementer Status - spu_tag_mask MFC tag mask for SPU DMA - event_mask Event mask for SPU interrupts - srr0 Interrupt Return address register + =================== =================================== The possible operations on an open npc, decr, decr_status, @@ -206,8 +206,7 @@ FILES from the data buffer, updating the value of the fpcr register. - /signal1 - /signal2 + /signal1, /signal2 The two signal notification channels of an SPU. These are read-write files that operate on a 32 bit word. Writing to one of these files triggers an interrupt on the SPU. The value written to the signal @@ -233,8 +232,7 @@ FILES file. - /signal1_type - /signal2_type + /signal1_type, /signal2_type These two files change the behavior of the signal1 and signal2 notifi- cation files. The contain a numerical ASCII string which is read as either "1" or "0". In mode 0 (overwrite), the hardware replaces the @@ -259,263 +257,17 @@ FILES the previous setting. -EXAMPLES +Examples +======== /etc/fstab entry none /spu spufs gid=spu 0 0 -AUTHORS +Authors +======= Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>, Ulrich Weigand <Ulrich.Weigand@de.ibm.com> -SEE ALSO +See Also +======== capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7) - - - -Linux 2005-09-28 SPUFS(2) - ------------------------------------------------------------------------------- - -SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2) - - - -NAME - spu_run - execute an spu context - - -SYNOPSIS - #include <sys/spu.h> - - int spu_run(int fd, unsigned int *npc, unsigned int *event); - -DESCRIPTION - The spu_run system call is used on PowerPC machines that implement the - Cell Broadband Engine Architecture in order to access Synergistic Pro- - cessor Units (SPUs). It uses the fd that was returned from spu_cre- - ate(2) to address a specific SPU context. When the context gets sched- - uled to a physical SPU, it starts execution at the instruction pointer - passed in npc. - - Execution of SPU code happens synchronously, meaning that spu_run does - not return while the SPU is still running. If there is a need to exe- - cute SPU code in parallel with other code on either the main CPU or - other SPUs, you need to create a new thread of execution first, e.g. - using the pthread_create(3) call. - - When spu_run returns, the current value of the SPU instruction pointer - is written back to npc, so you can call spu_run again without updating - the pointers. - - event can be a NULL pointer or point to an extended status code that - gets filled when spu_run returns. It can be one of the following con- - stants: - - SPE_EVENT_DMA_ALIGNMENT - A DMA alignment error - - SPE_EVENT_SPE_DATA_SEGMENT - A DMA segmentation error - - SPE_EVENT_SPE_DATA_STORAGE - A DMA storage error - - If NULL is passed as the event argument, these errors will result in a - signal delivered to the calling process. - -RETURN VALUE - spu_run returns the value of the spu_status register or -1 to indicate - an error and set errno to one of the error codes listed below. The - spu_status register value contains a bit mask of status codes and - optionally a 14 bit code returned from the stop-and-signal instruction - on the SPU. The bit masks for the status codes are: - - 0x02 SPU was stopped by stop-and-signal. - - 0x04 SPU was stopped by halt. - - 0x08 SPU is waiting for a channel. - - 0x10 SPU is in single-step mode. - - 0x20 SPU has tried to execute an invalid instruction. - - 0x40 SPU has tried to access an invalid channel. - - 0x3fff0000 - The bits masked with this value contain the code returned from - stop-and-signal. - - There are always one or more of the lower eight bits set or an error - code is returned from spu_run. - -ERRORS - EAGAIN or EWOULDBLOCK - fd is in non-blocking mode and spu_run would block. - - EBADF fd is not a valid file descriptor. - - EFAULT npc is not a valid pointer or status is neither NULL nor a valid - pointer. - - EINTR A signal occurred while spu_run was in progress. The npc value - has been updated to the new program counter value if necessary. - - EINVAL fd is not a file descriptor returned from spu_create(2). - - ENOMEM Insufficient memory was available to handle a page fault result- - ing from an MFC direct memory access. - - ENOSYS the functionality is not provided by the current system, because - either the hardware does not provide SPUs or the spufs module is - not loaded. - - -NOTES - spu_run is meant to be used from libraries that implement a more - abstract interface to SPUs, not to be used from regular applications. - See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- - ommended libraries. - - -CONFORMING TO - This call is Linux specific and only implemented by the ppc64 architec- - ture. Programs using this system call are not portable. - - -BUGS - The code does not yet fully implement all features lined out here. - - -AUTHOR - Arnd Bergmann <arndb@de.ibm.com> - -SEE ALSO - capabilities(7), close(2), spu_create(2), spufs(7) - - - -Linux 2005-09-28 SPU_RUN(2) - ------------------------------------------------------------------------------- - -SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2) - - - -NAME - spu_create - create a new spu context - - -SYNOPSIS - #include <sys/types.h> - #include <sys/spu.h> - - int spu_create(const char *pathname, int flags, mode_t mode); - -DESCRIPTION - The spu_create system call is used on PowerPC machines that implement - the Cell Broadband Engine Architecture in order to access Synergistic - Processor Units (SPUs). It creates a new logical context for an SPU in - pathname and returns a handle to associated with it. pathname must - point to a non-existing directory in the mount point of the SPU file - system (spufs). When spu_create is successful, a directory gets cre- - ated on pathname and it is populated with files. - - The returned file handle can only be passed to spu_run(2) or closed, - other operations are not defined on it. When it is closed, all associ- - ated directory entries in spufs are removed. When the last file handle - pointing either inside of the context directory or to this file - descriptor is closed, the logical SPU context is destroyed. - - The parameter flags can be zero or any bitwise or'd combination of the - following constants: - - SPU_RAWIO - Allow mapping of some of the hardware registers of the SPU into - user space. This flag requires the CAP_SYS_RAWIO capability, see - capabilities(7). - - The mode parameter specifies the permissions used for creating the new - directory in spufs. mode is modified with the user's umask(2) value - and then used for both the directory and the files contained in it. The - file permissions mask out some more bits of mode because they typically - support only read or write access. See stat(2) for a full list of the - possible mode values. - - -RETURN VALUE - spu_create returns a new file descriptor. It may return -1 to indicate - an error condition and set errno to one of the error codes listed - below. - - -ERRORS - EACCES - The current user does not have write access on the spufs mount - point. - - EEXIST An SPU context already exists at the given path name. - - EFAULT pathname is not a valid string pointer in the current address - space. - - EINVAL pathname is not a directory in the spufs mount point. - - ELOOP Too many symlinks were found while resolving pathname. - - EMFILE The process has reached its maximum open file limit. - - ENAMETOOLONG - pathname was too long. - - ENFILE The system has reached the global open file limit. - - ENOENT Part of pathname could not be resolved. - - ENOMEM The kernel could not allocate all resources required. - - ENOSPC There are not enough SPU resources available to create a new - context or the user specific limit for the number of SPU con- - texts has been reached. - - ENOSYS the functionality is not provided by the current system, because - either the hardware does not provide SPUs or the spufs module is - not loaded. - - ENOTDIR - A part of pathname is not a directory. - - - -NOTES - spu_create is meant to be used from libraries that implement a more - abstract interface to SPUs, not to be used from regular applications. - See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- - ommended libraries. - - -FILES - pathname must point to a location beneath the mount point of spufs. By - convention, it gets mounted in /spu. - - -CONFORMING TO - This call is Linux specific and only implemented by the ppc64 architec- - ture. Programs using this system call are not portable. - - -BUGS - The code does not yet fully implement all features lined out here. - - -AUTHOR - Arnd Bergmann <arndb@de.ibm.com> - -SEE ALSO - capabilities(7), close(2), spu_run(2), spufs(7) - - - -Linux 2005-09-28 SPU_CREATE(2) diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.rst index 06f1d64c6f70..a265f3e2cc80 100644 --- a/Documentation/filesystems/sysfs-pci.txt +++ b/Documentation/filesystems/sysfs-pci.rst @@ -1,8 +1,11 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================ Accessing PCI device resources through sysfs --------------------------------------------- +============================================ sysfs, usually mounted at /sys, provides access to PCI resources on platforms -that support it. For example, a given bus might look like this: +that support it. For example, a given bus might look like this:: /sys/devices/pci0000:17 |-- 0000:17:00.0 @@ -30,8 +33,9 @@ This bus contains a single function device in slot 0. The domain and bus numbers are reproduced for convenience. Under the device directory are several files, each with their own function. + =================== ===================================================== file function - ---- -------- + =================== ===================================================== class PCI class (ascii, ro) config PCI config space (binary, rw) device PCI device (ascii, ro) @@ -40,13 +44,16 @@ files, each with their own function. local_cpus nearby CPU mask (cpumask, ro) remove remove device from kernel's list (ascii, wo) resource PCI resource host addresses (ascii, ro) - resource0..N PCI resource N, if present (binary, mmap, rw[1]) + resource0..N PCI resource N, if present (binary, mmap, rw\ [1]_) resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap) revision PCI revision (ascii, ro) rom PCI ROM resource, if present (binary, ro) subsystem_device PCI subsystem device (ascii, ro) subsystem_vendor PCI subsystem vendor (ascii, ro) vendor PCI vendor (ascii, ro) + =================== ===================================================== + +:: ro - read only file rw - file is readable and writable @@ -56,7 +63,7 @@ files, each with their own function. binary - file contains binary data cpumask - file contains a cpumask type -[1] rw for RESOURCE_IO (I/O port) regions only +.. [1] rw for RESOURCE_IO (I/O port) regions only The read only files are informational, writes to them will be ignored, with the exception of the 'rom' file. Writable files can be used to perform @@ -67,11 +74,11 @@ don't support mmapping of certain resources, so be sure to check the return value from any attempted mmap. The most notable of these are I/O port resources, which also provide read/write access. -The 'enable' file provides a counter that indicates how many times the device +The 'enable' file provides a counter that indicates how many times the device has been enabled. If the 'enable' file currently returns '4', and a '1' is echoed into it, it will then return '5'. Echoing a '0' into it will decrease the count. Even when it returns to 0, though, some of the initialisation -may not be reversed. +may not be reversed. The 'rom' file is special in that it provides read-only access to the device's ROM file, if available. It's disabled by default, however, so applications @@ -93,7 +100,7 @@ Accessing legacy resources through sysfs Legacy I/O port and ISA memory resources are also provided in sysfs if the underlying platform supports them. They're located in the PCI class hierarchy, -e.g. +e.g.:: /sys/class/pci_bus/0000:17/ |-- bridge -> ../../../devices/pci0000:17 diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.rst index c7c8e6438958..8888a05c398e 100644 --- a/Documentation/filesystems/sysfs-tagging.txt +++ b/Documentation/filesystems/sysfs-tagging.rst @@ -1,5 +1,8 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= Sysfs tagging -------------- +============= (Taken almost verbatim from Eric Biederman's netns tagging patch commit msg) @@ -18,25 +21,28 @@ in the directories and applications only see a limited set of the network devices. Each sysfs directory entry may be tagged with a namespace via the -void *ns member of its kernfs_node. If a directory entry is tagged, -then kernfs_node->flags will have a flag between KOBJ_NS_TYPE_NONE +``void *ns member`` of its ``kernfs_node``. If a directory entry is tagged, +then ``kernfs_node->flags`` will have a flag between KOBJ_NS_TYPE_NONE and KOBJ_NS_TYPES, and ns will point to the namespace to which it belongs. -Each sysfs superblock's kernfs_super_info contains an array void -*ns[KOBJ_NS_TYPES]. When a task in a tagging namespace +Each sysfs superblock's kernfs_super_info contains an array +``void *ns[KOBJ_NS_TYPES]``. When a task in a tagging namespace kobj_nstype first mounts sysfs, a new superblock is created. It will be differentiated from other sysfs mounts by having its -s_fs_info->ns[kobj_nstype] set to the new namespace. Note that +``s_fs_info->ns[kobj_nstype]`` set to the new namespace. Note that through bind mounting and mounts propagation, a task can easily view the contents of other namespaces' sysfs mounts. Therefore, when a namespace exits, it will call kobj_ns_exit() to invalidate any kernfs_node->ns pointers pointing to it. Users of this interface: -- define a type in the kobj_ns_type enumeration. -- call kobj_ns_type_register() with its kobj_ns_type_operations which has + +- define a type in the ``kobj_ns_type`` enumeration. +- call kobj_ns_type_register() with its ``kobj_ns_type_operations`` which has + - current_ns() which returns current's namespace - netlink_ns() which returns a socket's namespace - initial_ns() which returns the initial namesapce + - call kobj_ns_exit() when an individual tag is no longer valid diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst index 290891c3fecb..ab0f7795792b 100644 --- a/Documentation/filesystems/sysfs.rst +++ b/Documentation/filesystems/sysfs.rst @@ -20,7 +20,7 @@ a means to export kernel data structures, their attributes, and the linkages between them to userspace. sysfs is tied inherently to the kobject infrastructure. Please read -Documentation/kobject.txt for more information concerning the kobject +Documentation/core-api/kobject.rst for more information concerning the kobject interface. diff --git a/Documentation/filesystems/xfs-delayed-logging-design.txt b/Documentation/filesystems/xfs-delayed-logging-design.rst index 9a6dd289b17b..464405d2801e 100644 --- a/Documentation/filesystems/xfs-delayed-logging-design.txt +++ b/Documentation/filesystems/xfs-delayed-logging-design.rst @@ -1,8 +1,11 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== XFS Delayed Logging Design --------------------------- +========================== Introduction to Re-logging in XFS ---------------------------------- +================================= XFS logging is a combination of logical and physical logging. Some objects, such as inodes and dquots, are logged in logical format where the details @@ -25,7 +28,7 @@ changes in the new transaction that is written to the log. That is, if we have a sequence of changes A through to F, and the object was written to disk after change D, we would see in the log the following series of transactions, their contents and the log sequence number (LSN) of the -transaction: +transaction:: Transaction Contents LSN A A X @@ -85,7 +88,7 @@ IO permanently. Hence the XFS journalling subsystem can be considered to be IO bound. Delayed Logging: Concepts -------------------------- +========================= The key thing to note about the asynchronous logging combined with the relogging technique XFS uses is that we can be relogging changed objects @@ -154,9 +157,10 @@ The fundamental requirements for delayed logging in XFS are simple: 6. No performance regressions for synchronous transaction workloads. Delayed Logging: Design ------------------------ +======================= Storing Changes +--------------- The problem with accumulating changes at a logical level (i.e. just using the existing log item dirty region tracking) is that when it comes to writing the @@ -194,30 +198,30 @@ asynchronous transactions to the log. The differences between the existing formatting method and the delayed logging formatting can be seen in the diagram below. -Current format log vector: +Current format log vector:: -Object +---------------------------------------------+ -Vector 1 +----+ -Vector 2 +----+ -Vector 3 +----------+ + Object +---------------------------------------------+ + Vector 1 +----+ + Vector 2 +----+ + Vector 3 +----------+ -After formatting: +After formatting:: -Log Buffer +-V1-+-V2-+----V3----+ + Log Buffer +-V1-+-V2-+----V3----+ -Delayed logging vector: +Delayed logging vector:: -Object +---------------------------------------------+ -Vector 1 +----+ -Vector 2 +----+ -Vector 3 +----------+ + Object +---------------------------------------------+ + Vector 1 +----+ + Vector 2 +----+ + Vector 3 +----------+ -After formatting: +After formatting:: -Memory Buffer +-V1-+-V2-+----V3----+ -Vector 1 +----+ -Vector 2 +----+ -Vector 3 +----------+ + Memory Buffer +-V1-+-V2-+----V3----+ + Vector 1 +----+ + Vector 2 +----+ + Vector 3 +----------+ The memory buffer and associated vector need to be passed as a single object, but still need to be associated with the parent object so if the object is @@ -242,6 +246,7 @@ relogged in memory. Tracking Changes +---------------- Now that we can record transactional changes in memory in a form that allows them to be used without limitations, we need to be able to track and accumulate @@ -278,6 +283,7 @@ done for convenience/sanity of the developers. Delayed Logging: Checkpoints +---------------------------- When we have a log synchronisation event, commonly known as a "log force", all the items in the CIL must be written into the log via the log buffers. @@ -341,7 +347,7 @@ Hence log vectors need to be able to be chained together to allow them to be detached from the log items. That is, when the CIL is flushed the memory buffer and log vector attached to each log item needs to be attached to the checkpoint context so that the log item can be released. In diagrammatic form, -the CIL would look like this before the flush: +the CIL would look like this before the flush:: CIL Head | @@ -362,7 +368,7 @@ the CIL would look like this before the flush: -> vector array And after the flush the CIL head is empty, and the checkpoint context log -vector list would look like: +vector list would look like:: Checkpoint Context | @@ -411,6 +417,7 @@ compare" situation that can be done after a working and reviewed implementation is in the dev tree.... Delayed Logging: Checkpoint Sequencing +-------------------------------------- One of the key aspects of the XFS transaction subsystem is that it tags committed transactions with the log sequence number of the transaction commit. @@ -474,6 +481,7 @@ force the log at the LSN of that transaction) and so the higher level code behaves the same regardless of whether delayed logging is being used or not. Delayed Logging: Checkpoint Log Space Accounting +------------------------------------------------ The big issue for a checkpoint transaction is the log space reservation for the transaction. We don't know how big a checkpoint transaction is going to be @@ -491,7 +499,7 @@ the size of the transaction and the number of regions being logged (the number of log vectors in the transaction). An example of the differences would be logging directory changes versus logging -inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then +inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then there are lots of transactions that only contain an inode core and an inode log format structure. That is, two vectors totaling roughly 150 bytes. If we modify 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each @@ -565,6 +573,7 @@ which is once every 30s. Delayed Logging: Log Item Pinning +--------------------------------- Currently log items are pinned during transaction commit while the items are still locked. This happens just after the items are formatted, though it could @@ -605,6 +614,7 @@ object, we have a race with CIL being flushed between the check and the pin lock to guarantee that we pin the items correctly. Delayed Logging: Concurrent Scalability +--------------------------------------- A fundamental requirement for the CIL is that accesses through transaction commits must scale to many concurrent commits. The current transaction commit @@ -683,8 +693,9 @@ woken by the wrong event. Lifecycle Changes +----------------- -The existing log item life cycle is as follows: +The existing log item life cycle is as follows:: 1. Transaction allocate 2. Transaction reserve @@ -729,7 +740,7 @@ at the same time. If the log item is in the AIL or between steps 6 and 7 and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 are entered and completed is the object considered clean. -With delayed logging, there are new steps inserted into the life cycle: +With delayed logging, there are new steps inserted into the life cycle:: 1. Transaction allocate 2. Transaction reserve diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs-self-describing-metadata.rst index 8db0121d0980..51cdafe01ab1 100644 --- a/Documentation/filesystems/xfs-self-describing-metadata.txt +++ b/Documentation/filesystems/xfs-self-describing-metadata.rst @@ -1,8 +1,11 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ XFS Self Describing Metadata ----------------------------- +============================ Introduction ------------- +============ The largest scalability problem facing XFS is not one of algorithmic scalability, but of verification of the filesystem structure. Scalabilty of the @@ -34,7 +37,7 @@ required for basic forensic analysis of the filesystem structure. Self Describing Metadata ------------------------- +======================== One of the problems with the current metadata format is that apart from the magic number in the metadata block, we have no other way of identifying what it @@ -142,7 +145,7 @@ modification occurred between the corruption being written and when it was detected. Runtime Validation ------------------- +================== Validation of self-describing metadata takes place at runtime in two places: @@ -183,18 +186,18 @@ error occurs during this process, the buffer is again marked with a EFSCORRUPTED error for the higher layers to catch. Structures ----------- +========== -A typical on-disk structure needs to contain the following information: +A typical on-disk structure needs to contain the following information:: -struct xfs_ondisk_hdr { - __be32 magic; /* magic number */ - __be32 crc; /* CRC, not logged */ - uuid_t uuid; /* filesystem identifier */ - __be64 owner; /* parent object */ - __be64 blkno; /* location on disk */ - __be64 lsn; /* last modification in log, not logged */ -}; + struct xfs_ondisk_hdr { + __be32 magic; /* magic number */ + __be32 crc; /* CRC, not logged */ + uuid_t uuid; /* filesystem identifier */ + __be64 owner; /* parent object */ + __be64 blkno; /* location on disk */ + __be64 lsn; /* last modification in log, not logged */ + }; Depending on the metadata, this information may be part of a header structure separate to the metadata contents, or may be distributed through an existing @@ -214,24 +217,24 @@ level of information is generally provided. For example: well. hence the additional metadata headers change the overall format of the metadata. -A typical buffer read verifier is structured as follows: +A typical buffer read verifier is structured as follows:: -#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) + #define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) -static void -xfs_foo_read_verify( - struct xfs_buf *bp) -{ - struct xfs_mount *mp = bp->b_mount; + static void + xfs_foo_read_verify( + struct xfs_buf *bp) + { + struct xfs_mount *mp = bp->b_mount; - if ((xfs_sb_version_hascrc(&mp->m_sb) && - !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), - XFS_FOO_CRC_OFF)) || - !xfs_foo_verify(bp)) { - XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); - xfs_buf_ioerror(bp, EFSCORRUPTED); - } -} + if ((xfs_sb_version_hascrc(&mp->m_sb) && + !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), + XFS_FOO_CRC_OFF)) || + !xfs_foo_verify(bp)) { + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); + xfs_buf_ioerror(bp, EFSCORRUPTED); + } + } The code ensures that the CRC is only checked if the filesystem has CRCs enabled by checking the superblock of the feature bit, and then if the CRC verifies OK @@ -239,83 +242,83 @@ by checking the superblock of the feature bit, and then if the CRC verifies OK The verifier function will take a couple of different forms, depending on whether the magic number can be used to determine the format of the block. In -the case it can't, the code is structured as follows: +the case it can't, the code is structured as follows:: -static bool -xfs_foo_verify( - struct xfs_buf *bp) -{ - struct xfs_mount *mp = bp->b_mount; - struct xfs_ondisk_hdr *hdr = bp->b_addr; + static bool + xfs_foo_verify( + struct xfs_buf *bp) + { + struct xfs_mount *mp = bp->b_mount; + struct xfs_ondisk_hdr *hdr = bp->b_addr; - if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) - return false; + if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) + return false; - if (!xfs_sb_version_hascrc(&mp->m_sb)) { - if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) - return false; - if (bp->b_bn != be64_to_cpu(hdr->blkno)) - return false; - if (hdr->owner == 0) - return false; - } + if (!xfs_sb_version_hascrc(&mp->m_sb)) { + if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) + return false; + if (bp->b_bn != be64_to_cpu(hdr->blkno)) + return false; + if (hdr->owner == 0) + return false; + } - /* object specific verification checks here */ + /* object specific verification checks here */ - return true; -} + return true; + } If there are different magic numbers for the different formats, the verifier -will look like: - -static bool -xfs_foo_verify( - struct xfs_buf *bp) -{ - struct xfs_mount *mp = bp->b_mount; - struct xfs_ondisk_hdr *hdr = bp->b_addr; - - if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { - if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) - return false; - if (bp->b_bn != be64_to_cpu(hdr->blkno)) - return false; - if (hdr->owner == 0) - return false; - } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) - return false; - - /* object specific verification checks here */ - - return true; -} +will look like:: + + static bool + xfs_foo_verify( + struct xfs_buf *bp) + { + struct xfs_mount *mp = bp->b_mount; + struct xfs_ondisk_hdr *hdr = bp->b_addr; + + if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { + if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) + return false; + if (bp->b_bn != be64_to_cpu(hdr->blkno)) + return false; + if (hdr->owner == 0) + return false; + } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) + return false; + + /* object specific verification checks here */ + + return true; + } Write verifiers are very similar to the read verifiers, they just do things in -the opposite order to the read verifiers. A typical write verifier: +the opposite order to the read verifiers. A typical write verifier:: -static void -xfs_foo_write_verify( - struct xfs_buf *bp) -{ - struct xfs_mount *mp = bp->b_mount; - struct xfs_buf_log_item *bip = bp->b_fspriv; + static void + xfs_foo_write_verify( + struct xfs_buf *bp) + { + struct xfs_mount *mp = bp->b_mount; + struct xfs_buf_log_item *bip = bp->b_fspriv; - if (!xfs_foo_verify(bp)) { - XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); - xfs_buf_ioerror(bp, EFSCORRUPTED); - return; - } + if (!xfs_foo_verify(bp)) { + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); + xfs_buf_ioerror(bp, EFSCORRUPTED); + return; + } - if (!xfs_sb_version_hascrc(&mp->m_sb)) - return; + if (!xfs_sb_version_hascrc(&mp->m_sb)) + return; - if (bip) { - struct xfs_ondisk_hdr *hdr = bp->b_addr; - hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); - } - xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); -} + if (bip) { + struct xfs_ondisk_hdr *hdr = bp->b_addr; + hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); + } + xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); + } This will verify the internal structure of the metadata before we go any further, detecting corruptions that have occurred as the metadata has been @@ -324,7 +327,7 @@ update the LSN field (when it was last modified) and calculate the CRC on the metadata. Once this is done, we can issue the IO. Inodes and Dquots ------------------ +================= Inodes and dquots are special snowflakes. They have per-object CRC and self-identifiers, but they are packed so that there are multiple objects per @@ -347,4 +350,3 @@ XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of the unlinked list modifications check or update CRCs, neither during unlink nor log recovery. So, it's gone unnoticed until now. This won't matter immediately - repair will probably complain about it - but it needs to be fixed. - diff --git a/Documentation/i2c/i2c.svg b/Documentation/i2c/i2c_bus.svg index 5979405ad1c3..3170de976373 100644 --- a/Documentation/i2c/i2c.svg +++ b/Documentation/i2c/i2c_bus.svg @@ -9,7 +9,7 @@ xmlns="http://www.w3.org/2000/svg" xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" - sodipodi:docname="i2c.svg" + sodipodi:docname="i2c_bus.svg" inkscape:version="0.92.3 (2405546, 2018-03-11)" version="1.1" id="svg2" diff --git a/Documentation/i2c/summary.rst b/Documentation/i2c/summary.rst index ce7230025b33..136c4e333be7 100644 --- a/Documentation/i2c/summary.rst +++ b/Documentation/i2c/summary.rst @@ -34,7 +34,7 @@ Terminology Using the terminology from the official documentation, the I2C bus connects one or more *master* chips and one or more *slave* chips. -.. kernel-figure:: i2c.svg +.. kernel-figure:: i2c_bus.svg :alt: Simple I2C bus with one master and 3 slaves Simple I2C bus diff --git a/Documentation/ia64/irq-redir.rst b/Documentation/ia64/irq-redir.rst index 39bf94484a15..6bbbbe4f73ef 100644 --- a/Documentation/ia64/irq-redir.rst +++ b/Documentation/ia64/irq-redir.rst @@ -7,7 +7,7 @@ IRQ affinity on IA64 platforms By writing to /proc/irq/IRQ#/smp_affinity the interrupt routing can be controlled. The behavior on IA64 platforms is slightly different from -that described in Documentation/IRQ-affinity.txt for i386 systems. +that described in Documentation/core-api/irq/irq-affinity.rst for i386 systems. Because of the usage of SAPIC mode and physical destination mode the IRQ target is one particular CPU and cannot be a mask of several diff --git a/Documentation/iio/iio_configfs.rst b/Documentation/iio/iio_configfs.rst index ecbfdb3afef7..6e38cbbd2981 100644 --- a/Documentation/iio/iio_configfs.rst +++ b/Documentation/iio/iio_configfs.rst @@ -9,7 +9,7 @@ Configfs is a filesystem-based manager of kernel objects. IIO uses some objects that could be easily configured using configfs (e.g.: devices, triggers). -See Documentation/filesystems/configfs/configfs.txt for more information +See Documentation/filesystems/configfs.rst for more information about how configfs works. 2. Usage diff --git a/Documentation/futex-requeue-pi.txt b/Documentation/locking/futex-requeue-pi.rst index 14ab5787b9a7..14ab5787b9a7 100644 --- a/Documentation/futex-requeue-pi.txt +++ b/Documentation/locking/futex-requeue-pi.rst diff --git a/Documentation/hwspinlock.txt b/Documentation/locking/hwspinlock.rst index 6f03713b7003..6f03713b7003 100644 --- a/Documentation/hwspinlock.txt +++ b/Documentation/locking/hwspinlock.rst diff --git a/Documentation/locking/index.rst b/Documentation/locking/index.rst index 5d6800a723dc..d785878cad65 100644 --- a/Documentation/locking/index.rst +++ b/Documentation/locking/index.rst @@ -16,6 +16,13 @@ locking rt-mutex spinlocks ww-mutex-design + preempt-locking + pi-futex + futex-requeue-pi + hwspinlock + percpu-rw-semaphore + robust-futexes + robust-futex-ABI .. only:: subproject and html diff --git a/Documentation/locking/locktorture.rst b/Documentation/locking/locktorture.rst index 5bcb99ba7bd9..8012a74555e7 100644 --- a/Documentation/locking/locktorture.rst +++ b/Documentation/locking/locktorture.rst @@ -110,7 +110,7 @@ stutter same period of time. Defaults to "stutter=5", so as to run and pause for (roughly) five-second intervals. Specifying "stutter=0" causes the test to run continuously - without pausing, which is the old default behavior. + without pausing. shuffle_interval The number of seconds to keep the test threads affinitied diff --git a/Documentation/locking/locktypes.rst b/Documentation/locking/locktypes.rst index 09f45ce38d26..1b577a8bf982 100644 --- a/Documentation/locking/locktypes.rst +++ b/Documentation/locking/locktypes.rst @@ -13,6 +13,7 @@ The kernel provides a variety of locking primitives which can be divided into two categories: - Sleeping locks + - CPU local locks - Spinning locks This document conceptually describes these lock types and provides rules @@ -44,9 +45,23 @@ Sleeping lock types: On PREEMPT_RT kernels, these lock types are converted to sleeping locks: + - local_lock - spinlock_t - rwlock_t + +CPU local locks +--------------- + + - local_lock + +On non-PREEMPT_RT kernels, local_lock functions are wrappers around +preemption and interrupt disabling primitives. Contrary to other locking +mechanisms, disabling preemption or interrupts are pure CPU local +concurrency control mechanisms and not suited for inter-CPU concurrency +control. + + Spinning locks -------------- @@ -67,6 +82,7 @@ can have suffixes which apply further protections: _irqsave/restore() Save and disable / restore interrupt disabled state =================== ==================================================== + Owner semantics =============== @@ -139,6 +155,56 @@ implementation, thus changing the fairness: writer from starving readers. +local_lock +========== + +local_lock provides a named scope to critical sections which are protected +by disabling preemption or interrupts. + +On non-PREEMPT_RT kernels local_lock operations map to the preemption and +interrupt disabling and enabling primitives: + + =========================== ====================== + local_lock(&llock) preempt_disable() + local_unlock(&llock) preempt_enable() + local_lock_irq(&llock) local_irq_disable() + local_unlock_irq(&llock) local_irq_enable() + local_lock_save(&llock) local_irq_save() + local_lock_restore(&llock) local_irq_save() + =========================== ====================== + +The named scope of local_lock has two advantages over the regular +primitives: + + - The lock name allows static analysis and is also a clear documentation + of the protection scope while the regular primitives are scopeless and + opaque. + + - If lockdep is enabled the local_lock gains a lockmap which allows to + validate the correctness of the protection. This can detect cases where + e.g. a function using preempt_disable() as protection mechanism is + invoked from interrupt or soft-interrupt context. Aside of that + lockdep_assert_held(&llock) works as with any other locking primitive. + +local_lock and PREEMPT_RT +------------------------- + +PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing +semantics: + + - All spinlock_t changes also apply to local_lock. + +local_lock usage +---------------- + +local_lock should be used in situations where disabling preemption or +interrupts is the appropriate form of concurrency control to protect +per-CPU data structures on a non PREEMPT_RT kernel. + +local_lock is not suitable to protect against preemption or interrupts on a +PREEMPT_RT kernel due to the PREEMPT_RT specific spinlock_t semantics. + + raw_spinlock_t and spinlock_t ============================= @@ -258,10 +324,82 @@ implementation, thus changing semantics: PREEMPT_RT caveats ================== +local_lock on RT +---------------- + +The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few +implications. For example, on a non-PREEMPT_RT kernel the following code +sequence works as expected:: + + local_lock_irq(&local_lock); + raw_spin_lock(&lock); + +and is fully equivalent to:: + + raw_spin_lock_irq(&lock); + +On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq() +is mapped to a per-CPU spinlock_t which neither disables interrupts nor +preemption. The following code sequence works perfectly correct on both +PREEMPT_RT and non-PREEMPT_RT kernels:: + + local_lock_irq(&local_lock); + spin_lock(&lock); + +Another caveat with local locks is that each local_lock has a specific +protection scope. So the following substitution is wrong:: + + func1() + { + local_irq_save(flags); -> local_lock_irqsave(&local_lock_1, flags); + func3(); + local_irq_restore(flags); -> local_lock_irqrestore(&local_lock_1, flags); + } + + func2() + { + local_irq_save(flags); -> local_lock_irqsave(&local_lock_2, flags); + func3(); + local_irq_restore(flags); -> local_lock_irqrestore(&local_lock_2, flags); + } + + func3() + { + lockdep_assert_irqs_disabled(); + access_protected_data(); + } + +On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel +local_lock_1 and local_lock_2 are distinct and cannot serialize the callers +of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel +because local_lock_irqsave() does not disable interrupts due to the +PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is:: + + func1() + { + local_irq_save(flags); -> local_lock_irqsave(&local_lock, flags); + func3(); + local_irq_restore(flags); -> local_lock_irqrestore(&local_lock, flags); + } + + func2() + { + local_irq_save(flags); -> local_lock_irqsave(&local_lock, flags); + func3(); + local_irq_restore(flags); -> local_lock_irqrestore(&local_lock, flags); + } + + func3() + { + lockdep_assert_held(&local_lock); + access_protected_data(); + } + + spinlock_t and rwlock_t ----------------------- -These changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels +The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels have a few implications. For example, on a non-PREEMPT_RT kernel the following code sequence works as expected:: @@ -282,9 +420,61 @@ local_lock mechanism. Acquiring the local_lock pins the task to a CPU, allowing things like per-CPU interrupt disabled locks to be acquired. However, this approach should be used only where absolutely necessary. +A typical scenario is protection of per-CPU variables in thread context:: -raw_spinlock_t --------------- + struct foo *p = get_cpu_ptr(&var1); + + spin_lock(&p->lock); + p->count += this_cpu_read(var2); + +This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel +this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does +not allow to acquire p->lock because get_cpu_ptr() implicitly disables +preemption. The following substitution works on both kernels:: + + struct foo *p; + + migrate_disable(); + p = this_cpu_ptr(&var1); + spin_lock(&p->lock); + p->count += this_cpu_read(var2); + +On a non-PREEMPT_RT kernel migrate_disable() maps to preempt_disable() +which makes the above code fully equivalent. On a PREEMPT_RT kernel +migrate_disable() ensures that the task is pinned on the current CPU which +in turn guarantees that the per-CPU access to var1 and var2 are staying on +the same CPU. + +The migrate_disable() substitution is not valid for the following +scenario:: + + func() + { + struct foo *p; + + migrate_disable(); + p = this_cpu_ptr(&var1); + p->val = func2(); + +While correct on a non-PREEMPT_RT kernel, this breaks on PREEMPT_RT because +here migrate_disable() does not protect against reentrancy from a +preempting task. A correct substitution for this case is:: + + func() + { + struct foo *p; + + local_lock(&foo_lock); + p = this_cpu_ptr(&var1); + p->val = func2(); + +On a non-PREEMPT_RT kernel this protects against reentrancy by disabling +preemption. On a PREEMPT_RT kernel this is achieved by acquiring the +underlying per-CPU spinlock. + + +raw_spinlock_t on RT +-------------------- Acquiring a raw_spinlock_t disables preemption and possibly also interrupts, so the critical section must avoid acquiring a regular @@ -325,22 +515,25 @@ Lock type nesting rules The most basic rules are: - - Lock types of the same lock category (sleeping, spinning) can nest - arbitrarily as long as they respect the general lock ordering rules to - prevent deadlocks. + - Lock types of the same lock category (sleeping, CPU local, spinning) + can nest arbitrarily as long as they respect the general lock ordering + rules to prevent deadlocks. + + - Sleeping lock types cannot nest inside CPU local and spinning lock types. - - Sleeping lock types cannot nest inside spinning lock types. + - CPU local and spinning lock types can nest inside sleeping lock types. - - Spinning lock types can nest inside sleeping lock types. + - Spinning lock types can nest inside all lock types These constraints apply both in PREEMPT_RT and otherwise. The fact that PREEMPT_RT changes the lock category of spinlock_t and -rwlock_t from spinning to sleeping means that they cannot be acquired while -holding a raw spinlock. This results in the following nesting ordering: +rwlock_t from spinning to sleeping and substitutes local_lock with a +per-CPU spinlock_t means that they cannot be acquired while holding a raw +spinlock. This results in the following nesting ordering: 1) Sleeping locks - 2) spinlock_t and rwlock_t + 2) spinlock_t, rwlock_t, local_lock 3) raw_spinlock_t and bit spinlocks Lockdep will complain if these constraints are violated, both in diff --git a/Documentation/percpu-rw-semaphore.txt b/Documentation/locking/percpu-rw-semaphore.rst index 247de6410855..247de6410855 100644 --- a/Documentation/percpu-rw-semaphore.txt +++ b/Documentation/locking/percpu-rw-semaphore.rst diff --git a/Documentation/pi-futex.txt b/Documentation/locking/pi-futex.rst index c33ba2befbf8..c33ba2befbf8 100644 --- a/Documentation/pi-futex.txt +++ b/Documentation/locking/pi-futex.rst diff --git a/Documentation/preempt-locking.txt b/Documentation/locking/preempt-locking.rst index dce336134e54..dce336134e54 100644 --- a/Documentation/preempt-locking.txt +++ b/Documentation/locking/preempt-locking.rst diff --git a/Documentation/robust-futex-ABI.txt b/Documentation/locking/robust-futex-ABI.rst index f24904f1c16f..f24904f1c16f 100644 --- a/Documentation/robust-futex-ABI.txt +++ b/Documentation/locking/robust-futex-ABI.rst diff --git a/Documentation/robust-futexes.txt b/Documentation/locking/robust-futexes.rst index 6361fb01c9c1..6361fb01c9c1 100644 --- a/Documentation/robust-futexes.txt +++ b/Documentation/locking/robust-futexes.rst diff --git a/Documentation/locking/rt-mutex.rst b/Documentation/locking/rt-mutex.rst index c365dc302081..3b5097a380e6 100644 --- a/Documentation/locking/rt-mutex.rst +++ b/Documentation/locking/rt-mutex.rst @@ -4,7 +4,7 @@ RT-mutex subsystem with PI support RT-mutexes with priority inheritance are used to support PI-futexes, which enable pthread_mutex_t priority inheritance attributes -(PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details +(PTHREAD_PRIO_INHERIT). [See Documentation/locking/pi-futex.rst for more details about PI-futexes.] This technology was developed in the -rt tree and streamlined for diff --git a/Documentation/maintainer/maintainer-entry-profile.rst b/Documentation/maintainer/maintainer-entry-profile.rst index 11ebe3682771..77e43c8b24b4 100644 --- a/Documentation/maintainer/maintainer-entry-profile.rst +++ b/Documentation/maintainer/maintainer-entry-profile.rst @@ -7,7 +7,7 @@ The Maintainer Entry Profile supplements the top-level process documents (submitting-patches, submitting drivers...) with subsystem/device-driver-local customs as well as details about the patch submission life-cycle. A contributor uses this document to level set -their expectations and avoid common mistakes, maintainers may use these +their expectations and avoid common mistakes; maintainers may use these profiles to look across subsystems for opportunities to converge on common practices. @@ -26,7 +26,7 @@ Example questions to consider: - Does the subsystem have a patchwork instance? Are patchwork state changes notified? - Any bots or CI infrastructure that watches the list, or automated - testing feedback that the subsystem gates acceptance? + testing feedback that the subsystem uses to gate acceptance? - Git branches that are pulled into -next? - What branch should contributors submit against? - Links to any other Maintainer Entry Profiles? For example a @@ -54,8 +54,8 @@ One of the common misunderstandings of submitters is that patches can be sent at any time before the merge window closes and can still be considered for the next -rc1. The reality is that most patches need to be settled in soaking in linux-next in advance of the merge window -opening. Clarify for the submitter the key dates (in terms rc release -week) that patches might considered for merging and when patches need to +opening. Clarify for the submitter the key dates (in terms of -rc release +week) that patches might be considered for merging and when patches need to wait for the next -rc. At a minimum: - Last -rc for new feature submissions: @@ -70,8 +70,8 @@ wait for the next -rc. At a minimum: - Last -rc to merge features: Deadline for merge decisions Indicate to contributors the point at which an as yet un-applied patch set will need to wait for the NEXT+1 merge window. Of course there is no - obligation to ever except any given patchset, but if the review has not - concluded by this point the expectation the contributor should wait and + obligation to ever accept any given patchset, but if the review has not + concluded by this point the expectation is the contributor should wait and resubmit for the following merge window. Optional: diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index e1c355e84edd..eaabc3134294 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -620,7 +620,7 @@ because the CPUs that the Linux kernel supports don't do writes until they are certain (1) that the write will actually happen, (2) of the location of the write, and (3) of the value to be written. But please carefully read the "CONTROL DEPENDENCIES" section and the -Documentation/RCU/rcu_dereference.txt file: The compiler can and does +Documentation/RCU/rcu_dereference.rst file: The compiler can and does break dependencies in a great many highly creative ways. CPU 1 CPU 2 diff --git a/Documentation/misc-devices/index.rst b/Documentation/misc-devices/index.rst index c1dcd2628911..1ecc05fbe6f4 100644 --- a/Documentation/misc-devices/index.rst +++ b/Documentation/misc-devices/index.rst @@ -21,4 +21,5 @@ fit into other categories. lis3lv02d max6875 mic/index + uacce xilinx_sdfec diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst index f78d7bf27ff5..8f0347b9fb3d 100644 --- a/Documentation/networking/scaling.rst +++ b/Documentation/networking/scaling.rst @@ -81,7 +81,7 @@ of queues to IRQs can be determined from /proc/interrupts. By default, an IRQ may be handled on any CPU. Because a non-negligible part of packet processing takes place in receive interrupt handling, it is advantageous to spread receive interrupts between CPUs. To manually adjust the IRQ -affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems +affinity of each interrupt see Documentation/core-api/irq/irq-affinity.rst. Some systems will be running irqbalance, a daemon that dynamically optimizes IRQ assignments and as a result may override any manual settings. @@ -160,7 +160,7 @@ can be configured for each receive queue using a sysfs file entry:: This file implements a bitmap of CPUs. RPS is disabled when it is zero (the default), in which case packets are processed on the interrupting -CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to +CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are assigned to the bitmap. diff --git a/Documentation/nvdimm/maintainer-entry-profile.rst b/Documentation/nvdimm/maintainer-entry-profile.rst index efe37adadcea..9da748e42623 100644 --- a/Documentation/nvdimm/maintainer-entry-profile.rst +++ b/Documentation/nvdimm/maintainer-entry-profile.rst @@ -4,15 +4,15 @@ LIBNVDIMM Maintainer Entry Profile Overview -------- The libnvdimm subsystem manages persistent memory across multiple -architectures. The mailing list, is tracked by patchwork here: +architectures. The mailing list is tracked by patchwork here: https://patchwork.kernel.org/project/linux-nvdimm/list/ ...and that instance is configured to give feedback to submitters on patch acceptance and upstream merge. Patches are merged to either the -'libnvdimm-fixes', or 'libnvdimm-for-next' branch. Those branches are +'libnvdimm-fixes' or 'libnvdimm-for-next' branch. Those branches are available here: https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/ -In general patches can be submitted against the latest -rc, however if +In general patches can be submitted against the latest -rc; however, if the incoming code change is dependent on other pending changes then the patch should be based on the libnvdimm-for-next branch. However, since persistent memory sits at the intersection of storage and memory there @@ -35,12 +35,12 @@ getting the test environment set up. ACPI Device Specific Methods (_DSM) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Before patches enabling for a new _DSM family will be considered it must +Before patches enabling a new _DSM family will be considered, it must be assigned a format-interface-code from the NVDIMM Sub-team of the ACPI Specification Working Group. In general, the stance of the subsystem is -to push back on the proliferation of NVDIMM command sets, do strongly +to push back on the proliferation of NVDIMM command sets, so do strongly consider implementing support for an existing command set. See -drivers/acpi/nfit/nfit.h for the set of support command sets. +drivers/acpi/nfit/nfit.h for the set of supported command sets. Key Cycle Dates @@ -48,7 +48,7 @@ Key Cycle Dates New submissions can be sent at any time, but if they intend to hit the next merge window they should be sent before -rc4, and ideally stabilized in the libnvdimm-for-next branch by -rc6. Of course if a -patch set requires more than 2 weeks of review -rc4 is already too late +patch set requires more than 2 weeks of review, -rc4 is already too late and some patches may require multiple development cycles to review. diff --git a/Documentation/power/suspend-and-cpuhotplug.rst b/Documentation/power/suspend-and-cpuhotplug.rst index 572d968c5375..ebedb6c75db9 100644 --- a/Documentation/power/suspend-and-cpuhotplug.rst +++ b/Documentation/power/suspend-and-cpuhotplug.rst @@ -48,7 +48,7 @@ More details follow:: | | v - disable_nonboot_cpus() + freeze_secondary_cpus() /* start */ | v @@ -83,7 +83,7 @@ More details follow:: Release cpu_add_remove_lock | v - /* disable_nonboot_cpus() complete */ + /* freeze_secondary_cpus() complete */ | v Do suspend @@ -93,7 +93,7 @@ More details follow:: Resuming back is likewise, with the counterparts being (in the order of execution during resume): -* enable_nonboot_cpus() which involves:: +* thaw_secondary_cpus() which involves:: | Acquire cpu_add_remove_lock | Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug diff --git a/Documentation/powerpc/cxl.rst b/Documentation/powerpc/cxl.rst index 920546d81326..d2d77057610e 100644 --- a/Documentation/powerpc/cxl.rst +++ b/Documentation/powerpc/cxl.rst @@ -133,6 +133,7 @@ User API ======== 1. AFU character devices +^^^^^^^^^^^^^^^^^^^^^^^^ For AFUs operating in AFU directed mode, two character device files will be created. /dev/cxl/afu0.0m will correspond to a @@ -395,6 +396,7 @@ read 2. Card character device (powerVM guest only) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In a powerVM guest, an extra character device is created for the card. The device is only used to write (flash) a new image on the diff --git a/Documentation/powerpc/firmware-assisted-dump.rst b/Documentation/powerpc/firmware-assisted-dump.rst index b3f3ee135dbe..20ea8cdee0aa 100644 --- a/Documentation/powerpc/firmware-assisted-dump.rst +++ b/Documentation/powerpc/firmware-assisted-dump.rst @@ -344,7 +344,7 @@ Here is the list of files under powerpc debugfs: NOTE: - Please refer to Documentation/filesystems/debugfs.txt on + Please refer to Documentation/filesystems/debugfs.rst on how to mount the debugfs filesystem. diff --git a/Documentation/process/adding-syscalls.rst b/Documentation/process/adding-syscalls.rst index 1c3a840d06b9..a6b4a3a5bf3f 100644 --- a/Documentation/process/adding-syscalls.rst +++ b/Documentation/process/adding-syscalls.rst @@ -33,7 +33,7 @@ interface. to a somewhat opaque API. - If you're just exposing runtime system information, a new node in sysfs - (see ``Documentation/filesystems/sysfs.txt``) or the ``/proc`` filesystem may + (see ``Documentation/filesystems/sysfs.rst``) or the ``/proc`` filesystem may be more appropriate. However, access to these mechanisms requires that the relevant filesystem is mounted, which might not always be the case (e.g. in a namespaced/sandboxed/chrooted environment). Avoid adding any API to diff --git a/Documentation/process/index.rst b/Documentation/process/index.rst index 6399d92f0b21..f07c9250c3ac 100644 --- a/Documentation/process/index.rst +++ b/Documentation/process/index.rst @@ -61,6 +61,7 @@ lack of a better place. botching-up-ioctls clang-format ../riscv/patch-acceptance + unaligned-memory-access .. only:: subproject and html diff --git a/Documentation/process/submit-checklist.rst b/Documentation/process/submit-checklist.rst index 8e56337d422d..3f8e9d5d95c2 100644 --- a/Documentation/process/submit-checklist.rst +++ b/Documentation/process/submit-checklist.rst @@ -107,7 +107,7 @@ and elsewhere regarding submitting Linux kernel patches. and why. 26) If any ioctl's are added by the patch, then also update - ``Documentation/ioctl/ioctl-number.rst``. + ``Documentation/userspace-api/ioctl/ioctl-number.rst``. 27) If your modified source code depends on or uses any of the kernel APIs or features that are related to the following ``Kconfig`` symbols, diff --git a/Documentation/unaligned-memory-access.txt b/Documentation/process/unaligned-memory-access.rst index 1ee82419d8aa..1ee82419d8aa 100644 --- a/Documentation/unaligned-memory-access.txt +++ b/Documentation/process/unaligned-memory-access.rst diff --git a/Documentation/s390/vfio-ap.rst b/Documentation/s390/vfio-ap.rst index b5c51f7c748d..367e27ec3c50 100644 --- a/Documentation/s390/vfio-ap.rst +++ b/Documentation/s390/vfio-ap.rst @@ -484,7 +484,7 @@ CARD.DOMAIN TYPE MODE 05.00ff CEX5A Accelerator =========== ===== ============ -Guest2 +Guest3 ------ =========== ===== ============ CARD.DOMAIN TYPE MODE diff --git a/Documentation/scheduler/sched-domains.rst b/Documentation/scheduler/sched-domains.rst index f7504226f445..5c4b7f4f0062 100644 --- a/Documentation/scheduler/sched-domains.rst +++ b/Documentation/scheduler/sched-domains.rst @@ -19,10 +19,12 @@ CPUs". Each scheduling domain must have one or more CPU groups (struct sched_group) which are organised as a circular one way linked list from the ->groups pointer. The union of cpumasks of these groups MUST be the same as the -domain's span. The intersection of cpumasks from any two of these groups -MUST be the empty set. The group pointed to by the ->groups pointer MUST -contain the CPU to which the domain belongs. Groups may be shared among -CPUs as they contain read only data after they have been set up. +domain's span. The group pointed to by the ->groups pointer MUST contain the CPU +to which the domain belongs. Groups may be shared among CPUs as they contain +read only data after they have been set up. The intersection of cpumasks from +any two of these groups may be non empty. If this is the case the SD_OVERLAP +flag is set on the corresponding scheduling domain and its groups may not be +shared between CPUs. Balancing within a sched domain occurs between groups. That is, each group is treated as one entity. The load of a group is defined as the sum of the diff --git a/Documentation/digsig.txt b/Documentation/security/digsig.rst index f6a8902d3ef7..f6a8902d3ef7 100644 --- a/Documentation/digsig.txt +++ b/Documentation/security/digsig.rst diff --git a/Documentation/security/index.rst b/Documentation/security/index.rst index fc503dd689a7..8129405eb2cc 100644 --- a/Documentation/security/index.rst +++ b/Documentation/security/index.rst @@ -15,3 +15,4 @@ Security Documentation self-protection siphash tpm/index + digsig diff --git a/Documentation/security/lsm.rst b/Documentation/security/lsm.rst index aadf47c808c0..6a2a2e973080 100644 --- a/Documentation/security/lsm.rst +++ b/Documentation/security/lsm.rst @@ -35,47 +35,50 @@ desired model of security. Linus also suggested the possibility of migrating the Linux capabilities code into such a module. The Linux Security Modules (LSM) project was started by WireX to develop -such a framework. LSM is a joint development effort by several security +such a framework. LSM was a joint development effort by several security projects, including Immunix, SELinux, SGI and Janus, and several individuals, including Greg Kroah-Hartman and James Morris, to develop a -Linux kernel patch that implements this framework. The patch is -currently tracking the 2.4 series and is targeted for integration into -the 2.5 development series. This technical report provides an overview -of the framework and the example capabilities security module provided -by the LSM kernel patch. +Linux kernel patch that implements this framework. The work was +incorporated in the mainstream in December of 2003. This technical +report provides an overview of the framework and the capabilities +security module. LSM Framework ============= -The LSM kernel patch provides a general kernel framework to support +The LSM framework provides a general kernel framework to support security modules. In particular, the LSM framework is primarily focused on supporting access control modules, although future development is -likely to address other security needs such as auditing. By itself, the +likely to address other security needs such as sandboxing. By itself, the framework does not provide any additional security; it merely provides -the infrastructure to support security modules. The LSM kernel patch -also moves most of the capabilities logic into an optional security -module, with the system defaulting to the traditional superuser logic. +the infrastructure to support security modules. The LSM framework is +optional, requiring `CONFIG_SECURITY` to be enabled. The capabilities +logic is implemented as a security module. This capabilities module is discussed further in `LSM Capabilities Module`_. -The LSM kernel patch adds security fields to kernel data structures and -inserts calls to hook functions at critical points in the kernel code to -manage the security fields and to perform access control. It also adds -functions for registering and unregistering security modules, and adds a -general :c:func:`security()` system call to support new system calls -for security-aware applications. - -The LSM security fields are simply ``void*`` pointers. For process and -program execution security information, security fields were added to +The LSM framework includes security fields in kernel data structures and +calls to hook functions at critical points in the kernel code to +manage the security fields and to perform access control. +It also adds functions for registering security modules. +An interface `/sys/kernel/security/lsm` reports a comma separated list +of security modules that are active on the system. + +The LSM security fields are simply ``void*`` pointers. +The data is referred to as a blob, which may be managed by +the framework or by the individual security modules that use it. +Security blobs that are used by more than one security module are +typically managed by the framework. +For process and +program execution security information, security fields are included in :c:type:`struct task_struct <task_struct>` and -:c:type:`struct linux_binprm <linux_binprm>`. For filesystem -security information, a security field was added to :c:type:`struct +:c:type:`struct cred <cred>`. +For filesystem +security information, a security field is included in :c:type:`struct super_block <super_block>`. For pipe, file, and socket security -information, security fields were added to :c:type:`struct inode -<inode>` and :c:type:`struct file <file>`. For packet and -network device security information, security fields were added to -:c:type:`struct sk_buff <sk_buff>` and :c:type:`struct -net_device <net_device>`. For System V IPC security information, +information, security fields are included in :c:type:`struct inode +<inode>` and :c:type:`struct file <file>`. +For System V IPC security information, security fields were added to :c:type:`struct kern_ipc_perm <kern_ipc_perm>` and :c:type:`struct msg_msg <msg_msg>`; additionally, the definitions for :c:type:`struct @@ -84,118 +87,45 @@ were moved to header files (``include/linux/msg.h`` and ``include/linux/shm.h`` as appropriate) to allow the security modules to use these definitions. -Each LSM hook is a function pointer in a global table, security_ops. -This table is a :c:type:`struct security_operations -<security_operations>` structure as defined by -``include/linux/security.h``. Detailed documentation for each hook is -included in this header file. At present, this structure consists of a -collection of substructures that group related hooks based on the kernel -object (e.g. task, inode, file, sk_buff, etc) as well as some top-level -hook function pointers for system operations. This structure is likely -to be flattened in the future for performance. The placement of the hook -calls in the kernel code is described by the "called:" lines in the -per-hook documentation in the header file. The hook calls can also be -easily found in the kernel code by looking for the string -"security_ops->". - -Linus mentioned per-process security hooks in his original remarks as a -possible alternative to global security hooks. However, if LSM were to -start from the perspective of per-process hooks, then the base framework -would have to deal with how to handle operations that involve multiple -processes (e.g. kill), since each process might have its own hook for -controlling the operation. This would require a general mechanism for -composing hooks in the base framework. Additionally, LSM would still -need global hooks for operations that have no process context (e.g. -network input operations). Consequently, LSM provides global security -hooks, but a security module is free to implement per-process hooks -(where that makes sense) by storing a security_ops table in each -process' security field and then invoking these per-process hooks from -the global hooks. The problem of composition is thus deferred to the -module. - -The global security_ops table is initialized to a set of hook functions -provided by a dummy security module that provides traditional superuser -logic. A :c:func:`register_security()` function (in -``security/security.c``) is provided to allow a security module to set -security_ops to refer to its own hook functions, and an -:c:func:`unregister_security()` function is provided to revert -security_ops to the dummy module hooks. This mechanism is used to set -the primary security module, which is responsible for making the final -decision for each hook. - -LSM also provides a simple mechanism for stacking additional security -modules with the primary security module. It defines -:c:func:`register_security()` and -:c:func:`unregister_security()` hooks in the :c:type:`struct -security_operations <security_operations>` structure and -provides :c:func:`mod_reg_security()` and -:c:func:`mod_unreg_security()` functions that invoke these hooks -after performing some sanity checking. A security module can call these -functions in order to stack with other modules. However, the actual -details of how this stacking is handled are deferred to the module, -which can implement these hooks in any way it wishes (including always -returning an error if it does not wish to support stacking). In this -manner, LSM again defers the problem of composition to the module. - -Although the LSM hooks are organized into substructures based on kernel -object, all of the hooks can be viewed as falling into two major +For packet and +network device security information, security fields were added to +:c:type:`struct sk_buff <sk_buff>` and +:c:type:`struct scm_cookie <scm_cookie>`. +Unlike the other security module data, the data used here is a +32-bit integer. The security modules are required to map or otherwise +associate these values with real security attributes. + +LSM hooks are maintained in lists. A list is maintained for each +hook, and the hooks are called in the order specified by CONFIG_LSM. +Detailed documentation for each hook is +included in the `include/linux/lsm_hooks.h` header file. + +The LSM framework provides for a close approximation of +general security module stacking. It defines +security_add_hooks() to which each security module passes a +:c:type:`struct security_hooks_list <security_hooks_list>`, +which are added to the lists. +The LSM framework does not provide a mechanism for removing hooks that +have been registered. The SELinux security module has implemented +a way to remove itself, however the feature has been deprecated. + +The hooks can be viewed as falling into two major categories: hooks that are used to manage the security fields and hooks that are used to perform access control. Examples of the first category -of hooks include the :c:func:`alloc_security()` and -:c:func:`free_security()` hooks defined for each kernel data -structure that has a security field. These hooks are used to allocate -and free security structures for kernel objects. The first category of -hooks also includes hooks that set information in the security field -after allocation, such as the :c:func:`post_lookup()` hook in -:c:type:`struct inode_security_ops <inode_security_ops>`. -This hook is used to set security information for inodes after -successful lookup operations. An example of the second category of hooks -is the :c:func:`permission()` hook in :c:type:`struct -inode_security_ops <inode_security_ops>`. This hook checks -permission when accessing an inode. +of hooks include the security_inode_alloc() and security_inode_free() +These hooks are used to allocate +and free security structures for inode objects. +An example of the second category of hooks +is the security_inode_permission() hook. +This hook checks permission when accessing an inode. LSM Capabilities Module ======================= -The LSM kernel patch moves most of the existing POSIX.1e capabilities -logic into an optional security module stored in the file -``security/capability.c``. This change allows users who do not want to -use capabilities to omit this code entirely from their kernel, instead -using the dummy module for traditional superuser logic or any other -module that they desire. This change also allows the developers of the -capabilities logic to maintain and enhance their code more freely, -without needing to integrate patches back into the base kernel. - -In addition to moving the capabilities logic, the LSM kernel patch could -move the capability-related fields from the kernel data structures into -the new security fields managed by the security modules. However, at -present, the LSM kernel patch leaves the capability fields in the kernel -data structures. In his original remarks, Linus suggested that this -might be preferable so that other security modules can be easily stacked -with the capabilities module without needing to chain multiple security -structures on the security field. It also avoids imposing extra overhead -on the capabilities module to manage the security fields. However, the -LSM framework could certainly support such a move if it is determined to -be desirable, with only a few additional changes described below. - -At present, the capabilities logic for computing process capabilities on -:c:func:`execve()` and :c:func:`set\*uid()`, checking -capabilities for a particular process, saving and checking capabilities -for netlink messages, and handling the :c:func:`capget()` and -:c:func:`capset()` system calls have been moved into the -capabilities module. There are still a few locations in the base kernel -where capability-related fields are directly examined or modified, but -the current version of the LSM patch does allow a security module to -completely replace the assignment and testing of capabilities. These few -locations would need to be changed if the capability-related fields were -moved into the security field. The following is a list of known -locations that still perform such direct examination or modification of -capability-related fields: - -- ``fs/open.c``::c:func:`sys_access()` - -- ``fs/lockd/host.c``::c:func:`nlm_bind_host()` - -- ``fs/nfsd/auth.c``::c:func:`nfsd_setuser()` - -- ``fs/proc/array.c``::c:func:`task_cap()` +The POSIX.1e capabilities logic is maintained as a security module +stored in the file ``security/commoncap.c``. The capabilities +module uses the order field of the :c:type:`lsm_info` description +to identify it as the first security module to be registered. +The capabilities security module does not use the general security +blobs, unlike other modules. The reasons are historical and are +based on overhead, complexity and performance concerns. diff --git a/Documentation/sphinx/requirements.txt b/Documentation/sphinx/requirements.txt index 14e29a0ae480..489f6626de67 100644 --- a/Documentation/sphinx/requirements.txt +++ b/Documentation/sphinx/requirements.txt @@ -1,3 +1,3 @@ docutils -Sphinx==1.7.9 +Sphinx==2.4.4 sphinx_rtd_theme diff --git a/Documentation/trace/coresight/coresight-ect.rst b/Documentation/trace/coresight/coresight-ect.rst index ecc1e57012ef..a93e52abcf46 100644 --- a/Documentation/trace/coresight/coresight-ect.rst +++ b/Documentation/trace/coresight/coresight-ect.rst @@ -1,4 +1,5 @@ .. SPDX-License-Identifier: GPL-2.0 + ============================================= CoreSight Embedded Cross Trigger (CTI & CTM). ============================================= diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst index 4a2ebe0bd19b..f792b1959a33 100644 --- a/Documentation/trace/events.rst +++ b/Documentation/trace/events.rst @@ -527,8 +527,8 @@ The following commands are supported: See Documentation/trace/histogram.rst for details and examples. -6.3 In-kernel trace event API ------------------------------ +7. In-kernel trace event API +============================ In most cases, the command-line interface to trace events is more than sufficient. Sometimes, however, applications might find the need for @@ -560,8 +560,8 @@ following: - tracing synthetic events from in-kernel code - the low-level "dynevent_cmd" API -6.3.1 Dyamically creating synthetic event definitions ------------------------------------------------------ +7.1 Dyamically creating synthetic event definitions +--------------------------------------------------- There are a couple ways to create a new synthetic event from a kernel module or other kernel code. @@ -666,8 +666,8 @@ registered by calling the synth_event_gen_cmd_end() function:: At this point, the event object is ready to be used for tracing new events. -6.3.3 Tracing synthetic events from in-kernel code --------------------------------------------------- +7.2 Tracing synthetic events from in-kernel code +------------------------------------------------ To trace a synthetic event, there are several options. The first option is to trace the event in one call, using synth_event_trace() @@ -678,8 +678,8 @@ synth_event_trace_start() and synth_event_trace_end() along with synth_event_add_next_val() or synth_event_add_val() to add the values piecewise. -6.3.3.1 Tracing a synthetic event all at once ---------------------------------------------- +7.2.1 Tracing a synthetic event all at once +------------------------------------------- To trace a synthetic event all at once, the synth_event_trace() or synth_event_trace_array() functions can be used. @@ -780,8 +780,8 @@ remove the event:: ret = synth_event_delete("schedtest"); -6.3.3.1 Tracing a synthetic event piecewise -------------------------------------------- +7.2.2 Tracing a synthetic event piecewise +----------------------------------------- To trace a synthetic using the piecewise method described above, the synth_event_trace_start() function is used to 'open' the synthetic @@ -864,8 +864,8 @@ Note that synth_event_trace_end() must be called at the end regardless of whether any of the add calls failed (say due to a bad field name being passed in). -6.3.4 Dyamically creating kprobe and kretprobe event definitions ----------------------------------------------------------------- +7.3 Dyamically creating kprobe and kretprobe event definitions +-------------------------------------------------------------- To create a kprobe or kretprobe trace event from kernel code, the kprobe_event_gen_cmd_start() or kretprobe_event_gen_cmd_start() @@ -941,8 +941,8 @@ used to give the kprobe event file back and delete the event:: ret = kprobe_event_delete("gen_kprobe_test"); -6.3.4 The "dynevent_cmd" low-level API --------------------------------------- +7.4 The "dynevent_cmd" low-level API +------------------------------------ Both the in-kernel synthetic event and kprobe interfaces are built on top of a lower-level "dynevent_cmd" interface. This interface is diff --git a/Documentation/trace/ftrace-design.rst b/Documentation/trace/ftrace-design.rst index a8e22e0db63c..6893399157f0 100644 --- a/Documentation/trace/ftrace-design.rst +++ b/Documentation/trace/ftrace-design.rst @@ -229,14 +229,6 @@ Adding support for it is easy: just define the macro in asm/ftrace.h and pass the return address pointer as the 'retp' argument to ftrace_push_return_trace(). -HAVE_FTRACE_NMI_ENTER ---------------------- - -If you can't trace NMI functions, then skip this option. - -<details to be filled> - - HAVE_SYSCALL_TRACEPOINTS ------------------------ diff --git a/Documentation/translations/it_IT/doc-guide/kernel-doc.rst b/Documentation/translations/it_IT/doc-guide/kernel-doc.rst index a4ecd8f27631..524ad86cadbb 100644 --- a/Documentation/translations/it_IT/doc-guide/kernel-doc.rst +++ b/Documentation/translations/it_IT/doc-guide/kernel-doc.rst @@ -515,6 +515,22 @@ internal: *[source-pattern ...]* .. kernel-doc:: drivers/gpu/drm/i915/intel_audio.c :internal: +identifiers: *[ function/type ...]* + Include la documentazione per ogni *function* e *type* in *source*. + Se non vengono esplicitamente specificate le funzioni da includere, allora + verranno incluse tutte quelle disponibili in *source*. + + Esempi:: + + .. kernel-doc:: lib/bitmap.c + :identifiers: bitmap_parselist bitmap_parselist_user + + .. kernel-doc:: lib/idr.c + :identifiers: + +functions: *[ function ...]* + Questo è uno pseudonimo, deprecato, per la direttiva 'identifiers'. + doc: *title* Include la documentazione del paragrafo ``DOC:`` identificato dal titolo (*title*) all'interno del file sorgente (*source*). Gli spazi in *title* sono @@ -528,15 +544,6 @@ doc: *title* .. kernel-doc:: drivers/gpu/drm/i915/intel_audio.c :doc: High Definition Audio over HDMI and Display Port -functions: *function* *[...]* - Dal file sorgente (*source*) include la documentazione per le funzioni - elencate (*function*). - - Esempio:: - - .. kernel-doc:: lib/bitmap.c - :functions: bitmap_parselist bitmap_parselist_user - Senza alcuna opzione, la direttiva kernel-doc include tutti i commenti di documentazione presenti nel file sorgente (*source*). diff --git a/Documentation/translations/it_IT/kernel-hacking/hacking.rst b/Documentation/translations/it_IT/kernel-hacking/hacking.rst index 24c592852bf1..6aab27a8d323 100644 --- a/Documentation/translations/it_IT/kernel-hacking/hacking.rst +++ b/Documentation/translations/it_IT/kernel-hacking/hacking.rst @@ -627,6 +627,24 @@ Alcuni manutentori e sviluppatori potrebbero comunque richiedere :c:func:`EXPORT_SYMBOL_GPL()` quando si aggiungono nuove funzionalità o interfacce. +:c:func:`EXPORT_SYMBOL_NS()` +---------------------------- + +Definita in ``include/linux/export.h`` + +Questa è una variate di `EXPORT_SYMBOL()` che permette di specificare uno +spazio dei nomi. Lo spazio dei nomi è documentato in +:doc:`../core-api/symbol-namespaces` + +:c:func:`EXPORT_SYMBOL_NS_GPL()` +-------------------------------- + +Definita in ``include/linux/export.h`` + +Questa è una variate di `EXPORT_SYMBOL_GPL()` che permette di specificare uno +spazio dei nomi. Lo spazio dei nomi è documentato in +:doc:`../core-api/symbol-namespaces` + Procedure e convenzioni ======================= diff --git a/Documentation/translations/it_IT/kernel-hacking/locking.rst b/Documentation/translations/it_IT/kernel-hacking/locking.rst index b9a6be4b8499..4615df5723fb 100644 --- a/Documentation/translations/it_IT/kernel-hacking/locking.rst +++ b/Documentation/translations/it_IT/kernel-hacking/locking.rst @@ -159,17 +159,17 @@ Sincronizzazione in contesto utente Se avete una struttura dati che verrà utilizzata solo dal contesto utente, allora, per proteggerla, potete utilizzare un semplice mutex (``include/linux/mutex.h``). Questo è il caso più semplice: inizializzate il -mutex; invocate :c:func:`mutex_lock_interruptible()` per trattenerlo e -:c:func:`mutex_unlock()` per rilasciarlo. C'è anche :c:func:`mutex_lock()` +mutex; invocate mutex_lock_interruptible() per trattenerlo e +mutex_unlock() per rilasciarlo. C'è anche mutex_lock() ma questa dovrebbe essere evitata perché non ritorna in caso di segnali. Per esempio: ``net/netfilter/nf_sockopt.c`` permette la registrazione -di nuove chiamate per :c:func:`setsockopt()` e :c:func:`getsockopt()` -usando la funzione :c:func:`nf_register_sockopt()`. La registrazione e +di nuove chiamate per setsockopt() e getsockopt() +usando la funzione nf_register_sockopt(). La registrazione e la rimozione vengono eseguite solamente quando il modulo viene caricato o scaricato (e durante l'avvio del sistema, qui non abbiamo concorrenza), e la lista delle funzioni registrate viene consultata solamente quando -:c:func:`setsockopt()` o :c:func:`getsockopt()` sono sconosciute al sistema. +setsockopt() o getsockopt() sono sconosciute al sistema. In questo caso ``nf_sockopt_mutex`` è perfetto allo scopo, in particolar modo visto che setsockopt e getsockopt potrebbero dormire. @@ -179,19 +179,19 @@ Sincronizzazione fra il contesto utente e i softirq Se un softirq condivide dati col contesto utente, avete due problemi. Primo, il contesto utente corrente potrebbe essere interroto da un softirq, e secondo, la sezione critica potrebbe essere eseguita da un altro -processore. Questo è quando :c:func:`spin_lock_bh()` +processore. Questo è quando spin_lock_bh() (``include/linux/spinlock.h``) viene utilizzato. Questo disabilita i softirq -sul processore e trattiene il *lock*. Invece, :c:func:`spin_unlock_bh()` fa +sul processore e trattiene il *lock*. Invece, spin_unlock_bh() fa l'opposto. (Il suffisso '_bh' è un residuo storico che fa riferimento al "Bottom Halves", il vecchio nome delle interruzioni software. In un mondo perfetto questa funzione si chiamerebbe 'spin_lock_softirq()'). -Da notare che in questo caso potete utilizzare anche :c:func:`spin_lock_irq()` -o :c:func:`spin_lock_irqsave()`, queste fermano anche le interruzioni hardware: +Da notare che in questo caso potete utilizzare anche spin_lock_irq() +o spin_lock_irqsave(), queste fermano anche le interruzioni hardware: vedere :ref:`Contesto di interruzione hardware <it_hardirq-context>`. Questo funziona alla perfezione anche sui sistemi monoprocessore: gli spinlock -svaniscono e questa macro diventa semplicemente :c:func:`local_bh_disable()` +svaniscono e questa macro diventa semplicemente local_bh_disable() (``include/linux/interrupt.h``), la quale impedisce ai softirq d'essere eseguiti. @@ -224,8 +224,8 @@ Differenti tasklet/timer ~~~~~~~~~~~~~~~~~~~~~~~~ Se un altro tasklet/timer vuole condividere dati col vostro tasklet o timer, -allora avrete bisogno entrambe di :c:func:`spin_lock()` e -:c:func:`spin_unlock()`. Qui :c:func:`spin_lock_bh()` è inutile, siete già +allora avrete bisogno entrambe di spin_lock() e +spin_unlock(). Qui spin_lock_bh() è inutile, siete già in un tasklet ed avete la garanzia che nessun altro verrà eseguito sullo stesso processore. @@ -243,13 +243,13 @@ processore (vedere :ref:`Dati per processore <it_per-cpu>`). Se siete arrivati fino a questo punto nell'uso dei softirq, probabilmente tenete alla scalabilità delle prestazioni abbastanza da giustificarne la complessità aggiuntiva. -Dovete utilizzare :c:func:`spin_lock()` e :c:func:`spin_unlock()` per +Dovete utilizzare spin_lock() e spin_unlock() per proteggere i dati condivisi. Diversi Softirqs ~~~~~~~~~~~~~~~~ -Dovete utilizzare :c:func:`spin_lock()` e :c:func:`spin_unlock()` per +Dovete utilizzare spin_lock() e spin_unlock() per proteggere i dati condivisi, che siano timer, tasklet, diversi softirq o lo stesso o altri softirq: uno qualsiasi di essi potrebbe essere in esecuzione su un diverso processore. @@ -270,40 +270,40 @@ Se un gestore di interruzioni hardware condivide dati con un softirq, allora avrete due preoccupazioni. Primo, il softirq può essere interrotto da un'interruzione hardware, e secondo, la sezione critica potrebbe essere eseguita da un'interruzione hardware su un processore diverso. Questo è il caso -dove :c:func:`spin_lock_irq()` viene utilizzato. Disabilita le interruzioni -sul processore che l'esegue, poi trattiene il lock. :c:func:`spin_unlock_irq()` +dove spin_lock_irq() viene utilizzato. Disabilita le interruzioni +sul processore che l'esegue, poi trattiene il lock. spin_unlock_irq() fa l'opposto. -Il gestore d'interruzione hardware non usa :c:func:`spin_lock_irq()` perché -i softirq non possono essere eseguiti quando il gestore d'interruzione hardware -è in esecuzione: per questo si può usare :c:func:`spin_lock()`, che è un po' +Il gestore d'interruzione hardware non ha bisogno di usare spin_lock_irq() +perché i softirq non possono essere eseguiti quando il gestore d'interruzione +hardware è in esecuzione: per questo si può usare spin_lock(), che è un po' più veloce. L'unica eccezione è quando un altro gestore d'interruzioni -hardware utilizza lo stesso *lock*: :c:func:`spin_lock_irq()` impedirà a questo +hardware utilizza lo stesso *lock*: spin_lock_irq() impedirà a questo secondo gestore di interrompere quello in esecuzione. Questo funziona alla perfezione anche sui sistemi monoprocessore: gli spinlock -svaniscono e questa macro diventa semplicemente :c:func:`local_irq_disable()` +svaniscono e questa macro diventa semplicemente local_irq_disable() (``include/asm/smp.h``), la quale impedisce a softirq/tasklet/BH d'essere eseguiti. -:c:func:`spin_lock_irqsave()` (``include/linux/spinlock.h``) è una variante che +spin_lock_irqsave() (``include/linux/spinlock.h``) è una variante che salva lo stato delle interruzioni in una variabile, questa verrà poi passata -a :c:func:`spin_unlock_irqrestore()`. Questo significa che lo stesso codice +a spin_unlock_irqrestore(). Questo significa che lo stesso codice potrà essere utilizzato in un'interruzione hardware (dove le interruzioni sono già disabilitate) e in un softirq (dove la disabilitazione delle interruzioni è richiesta). Da notare che i softirq (e quindi tasklet e timer) sono eseguiti al ritorno -da un'interruzione hardware, quindi :c:func:`spin_lock_irq()` interrompe +da un'interruzione hardware, quindi spin_lock_irq() interrompe anche questi. Tenuto conto di questo si può dire che -:c:func:`spin_lock_irqsave()` è la funzione di sincronizzazione più generica +spin_lock_irqsave() è la funzione di sincronizzazione più generica e potente. Sincronizzazione fra due gestori d'interruzioni hardware -------------------------------------------------------- Condividere dati fra due gestori di interruzione hardware è molto raro, ma se -succede, dovreste usare :c:func:`spin_lock_irqsave()`: è una specificità +succede, dovreste usare spin_lock_irqsave(): è una specificità dell'architettura il fatto che tutte le interruzioni vengano interrotte quando si eseguono di gestori di interruzioni. @@ -317,11 +317,11 @@ Pete Zaitcev ci offre il seguente riassunto: il mutex e dormire (``copy_from_user*(`` o ``kmalloc(x,GFP_KERNEL)``). - Altrimenti (== i dati possono essere manipolati da un'interruzione) usate - :c:func:`spin_lock_irqsave()` e :c:func:`spin_unlock_irqrestore()`. + spin_lock_irqsave() e spin_unlock_irqrestore(). - Evitate di trattenere uno spinlock per più di 5 righe di codice incluse le chiamate a funzione (ad eccezione di quell per l'accesso come - :c:func:`readb()`). + readb()). Tabella dei requisiti minimi ---------------------------- @@ -334,7 +334,7 @@ processore alla volta, ma se deve condividere dati con un altro thread, allora la sincronizzazione è necessaria). Ricordatevi il suggerimento qui sopra: potete sempre usare -:c:func:`spin_lock_irqsave()`, che è un sovrainsieme di tutte le altre funzioni +spin_lock_irqsave(), che è un sovrainsieme di tutte le altre funzioni per spinlock. ============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ============== @@ -378,13 +378,13 @@ protetti dal *lock* quando qualche altro thread lo sta già facendo trattenendo il *lock*. Potrete acquisire il *lock* più tardi se vi serve accedere ai dati protetti da questo *lock*. -La funzione :c:func:`spin_trylock()` non ritenta di acquisire il *lock*, +La funzione spin_trylock() non ritenta di acquisire il *lock*, se ci riesce al primo colpo ritorna un valore diverso da zero, altrimenti se fallisce ritorna 0. Questa funzione può essere utilizzata in un qualunque -contesto, ma come :c:func:`spin_lock()`: dovete disabilitare i contesti che +contesto, ma come spin_lock(): dovete disabilitare i contesti che potrebbero interrompervi e quindi trattenere lo spinlock. -La funzione :c:func:`mutex_trylock()` invece di sospendere il vostro processo +La funzione mutex_trylock() invece di sospendere il vostro processo ritorna un valore diverso da zero se è possibile trattenere il lock al primo colpo, altrimenti se fallisce ritorna 0. Nonostante non dorma, questa funzione non può essere usata in modo sicuro in contesti di interruzione hardware o @@ -506,7 +506,7 @@ della memoria che il suo contenuto sono protetti dal *lock*. Questo caso è semplice dato che copiamo i dati dall'utente e non permettiamo mai loro di accedere direttamente agli oggetti. -C'è una piccola ottimizzazione qui: nella funzione :c:func:`cache_add()` +C'è una piccola ottimizzazione qui: nella funzione cache_add() impostiamo i campi dell'oggetto prima di acquisire il *lock*. Questo è sicuro perché nessun altro potrà accedervi finché non lo inseriremo nella memoria. @@ -514,7 +514,7 @@ nella memoria. Accesso dal contesto utente --------------------------- -Ora consideriamo il caso in cui :c:func:`cache_find()` può essere invocata +Ora consideriamo il caso in cui cache_find() può essere invocata dal contesto d'interruzione: sia hardware che software. Un esempio potrebbe essere un timer che elimina oggetti dalla memoria. @@ -583,15 +583,15 @@ sono quelle rimosse, mentre quelle ``+`` sono quelle aggiunte. return ret; } -Da notare che :c:func:`spin_lock_irqsave()` disabiliterà le interruzioni +Da notare che spin_lock_irqsave() disabiliterà le interruzioni se erano attive, altrimenti non farà niente (quando siamo già in un contesto d'interruzione); dunque queste funzioni possono essere chiamante in sicurezza da qualsiasi contesto. -Sfortunatamente, :c:func:`cache_add()` invoca :c:func:`kmalloc()` con +Sfortunatamente, cache_add() invoca kmalloc() con l'opzione ``GFP_KERNEL`` che è permessa solo in contesto utente. Ho supposto -che :c:func:`cache_add()` venga chiamata dal contesto utente, altrimenti -questa opzione deve diventare un parametro di :c:func:`cache_add()`. +che cache_add() venga chiamata dal contesto utente, altrimenti +questa opzione deve diventare un parametro di cache_add(). Esporre gli oggetti al di fuori del file ---------------------------------------- @@ -610,7 +610,7 @@ Il secondo problema è il problema del ciclo di vita: se un'altra struttura mantiene un puntatore ad un oggetto, presumibilmente si aspetta che questo puntatore rimanga valido. Sfortunatamente, questo è garantito solo mentre si trattiene il *lock*, altrimenti qualcuno potrebbe chiamare -:c:func:`cache_delete()` o peggio, aggiungere un oggetto che riutilizza lo +cache_delete() o peggio, aggiungere un oggetto che riutilizza lo stesso indirizzo. Dato che c'è un solo *lock*, non potete trattenerlo a vita: altrimenti @@ -710,9 +710,9 @@ Ecco il codice:: } Abbiamo incapsulato il contatore di riferimenti nelle tipiche funzioni -di 'get' e 'put'. Ora possiamo ritornare l'oggetto da :c:func:`cache_find()` +di 'get' e 'put'. Ora possiamo ritornare l'oggetto da cache_find() col vantaggio che l'utente può dormire trattenendo l'oggetto (per esempio, -:c:func:`copy_to_user()` per copiare il nome verso lo spazio utente). +copy_to_user() per copiare il nome verso lo spazio utente). Un altro punto da notare è che ho detto che il contatore dovrebbe incrementarsi per ogni puntatore ad un oggetto: quindi il contatore di riferimenti è 1 @@ -727,8 +727,8 @@ Ci sono un certo numbero di operazioni atomiche definite in ``include/asm/atomic.h``: queste sono garantite come atomiche su qualsiasi processore del sistema, quindi non sono necessari i *lock*. In questo caso è più semplice rispetto all'uso degli spinlock, benché l'uso degli spinlock -sia più elegante per casi non banali. Le funzioni :c:func:`atomic_inc()` e -:c:func:`atomic_dec_and_test()` vengono usate al posto dei tipici operatori di +sia più elegante per casi non banali. Le funzioni atomic_inc() e +atomic_dec_and_test() vengono usate al posto dei tipici operatori di incremento e decremento, e i *lock* non sono più necessari per proteggere il contatore stesso. @@ -820,7 +820,7 @@ al nome di cambiare abbiamo tre possibilità: - Si può togliere static da ``cache_lock`` e dire agli utenti che devono trattenere il *lock* prima di modificare il nome di un oggetto. -- Si può fornire una funzione :c:func:`cache_obj_rename()` che prende il +- Si può fornire una funzione cache_obj_rename() che prende il *lock* e cambia il nome per conto del chiamante; si dirà poi agli utenti di usare questa funzione. @@ -878,11 +878,11 @@ Da notare che ho deciso che il contatore di popolarità dovesse essere protetto da ``cache_lock`` piuttosto che dal *lock* dell'oggetto; questo perché è logicamente parte dell'infrastruttura (come :c:type:`struct list_head <list_head>` nell'oggetto). In questo modo, -in :c:func:`__cache_add()`, non ho bisogno di trattenere il *lock* di ogni +in __cache_add(), non ho bisogno di trattenere il *lock* di ogni oggetto mentre si cerca il meno popolare. Ho anche deciso che il campo id è immutabile, quindi non ho bisogno di -trattenere il lock dell'oggetto quando si usa :c:func:`__cache_find()` +trattenere il lock dell'oggetto quando si usa __cache_find() per leggere questo campo; il *lock* dell'oggetto è usato solo dal chiamante che vuole leggere o scrivere il campo name. @@ -907,7 +907,7 @@ Questo è facile da diagnosticare: non è uno di quei problemi che ti tengono sveglio 5 notti a parlare da solo. Un caso un pochino più complesso; immaginate d'avere una spazio condiviso -fra un softirq ed il contesto utente. Se usate :c:func:`spin_lock()` per +fra un softirq ed il contesto utente. Se usate spin_lock() per proteggerlo, il contesto utente potrebbe essere interrotto da un softirq mentre trattiene il lock, da qui il softirq rimarrà in attesa attiva provando ad acquisire il *lock* già trattenuto nel contesto utente. @@ -1006,12 +1006,12 @@ potreste fare come segue:: spin_unlock_bh(&list_lock); Primo o poi, questo esploderà su un sistema multiprocessore perché un -temporizzatore potrebbe essere già partiro prima di :c:func:`spin_lock_bh()`, -e prenderà il *lock* solo dopo :c:func:`spin_unlock_bh()`, e cercherà +temporizzatore potrebbe essere già partiro prima di spin_lock_bh(), +e prenderà il *lock* solo dopo spin_unlock_bh(), e cercherà di eliminare il suo oggetto (che però è già stato eliminato). Questo può essere evitato controllando il valore di ritorno di -:c:func:`del_timer()`: se ritorna 1, il temporizzatore è stato già +del_timer(): se ritorna 1, il temporizzatore è stato già rimosso. Se 0, significa (in questo caso) che il temporizzatore è in esecuzione, quindi possiamo fare come segue:: @@ -1032,9 +1032,9 @@ esecuzione, quindi possiamo fare come segue:: spin_unlock_bh(&list_lock); Un altro problema è l'eliminazione dei temporizzatori che si riavviano -da soli (chiamando :c:func:`add_timer()` alla fine della loro esecuzione). +da soli (chiamando add_timer() alla fine della loro esecuzione). Dato che questo è un problema abbastanza comune con una propensione -alle corse critiche, dovreste usare :c:func:`del_timer_sync()` +alle corse critiche, dovreste usare del_timer_sync() (``include/linux/timer.h``) per gestire questo caso. Questa ritorna il numero di volte che il temporizzatore è stato interrotto prima che fosse in grado di fermarlo senza che si riavviasse. @@ -1116,7 +1116,7 @@ chiamata ``list``:: wmb(); list->next = new; -La funzione :c:func:`wmb()` è una barriera di sincronizzazione delle +La funzione wmb() è una barriera di sincronizzazione delle scritture. Questa garantisce che la prima operazione (impostare l'elemento ``next`` del nuovo elemento) venga completata e vista da tutti i processori prima che venga eseguita la seconda operazione (che sarebbe quella di mettere @@ -1127,7 +1127,7 @@ completamente il nuovo elemento; oppure che lo vedano correttamente e quindi il puntatore ``next`` deve puntare al resto della lista. Fortunatamente, c'è una funzione che fa questa operazione sulle liste -:c:type:`struct list_head <list_head>`: :c:func:`list_add_rcu()` +:c:type:`struct list_head <list_head>`: list_add_rcu() (``include/linux/list.h``). Rimuovere un elemento dalla lista è anche più facile: sostituiamo il puntatore @@ -1138,7 +1138,7 @@ l'elemento o lo salteranno. list->next = old->next; -La funzione :c:func:`list_del_rcu()` (``include/linux/list.h``) fa esattamente +La funzione list_del_rcu() (``include/linux/list.h``) fa esattamente questo (la versione normale corrompe il vecchio oggetto, e non vogliamo che accada). @@ -1146,9 +1146,9 @@ Anche i lettori devono stare attenti: alcuni processori potrebbero leggere attraverso il puntatore ``next`` il contenuto dell'elemento successivo troppo presto, ma non accorgersi che il contenuto caricato è sbagliato quando il puntatore ``next`` viene modificato alla loro spalle. Ancora una volta -c'è una funzione che viene in vostro aiuto :c:func:`list_for_each_entry_rcu()` +c'è una funzione che viene in vostro aiuto list_for_each_entry_rcu() (``include/linux/list.h``). Ovviamente, gli scrittori possono usare -:c:func:`list_for_each_entry()` dato che non ci possono essere due scrittori +list_for_each_entry() dato che non ci possono essere due scrittori in contemporanea. Il nostro ultimo dilemma è il seguente: quando possiamo realmente distruggere @@ -1156,15 +1156,15 @@ l'elemento rimosso? Ricordate, un lettore potrebbe aver avuto accesso a questo elemento proprio ora: se eliminiamo questo elemento ed il puntatore ``next`` cambia, il lettore salterà direttamente nella spazzatura e scoppierà. Dobbiamo aspettare finché tutti i lettori che stanno attraversando la lista abbiano -finito. Utilizziamo :c:func:`call_rcu()` per registrare una funzione di +finito. Utilizziamo call_rcu() per registrare una funzione di richiamo che distrugga l'oggetto quando tutti i lettori correnti hanno terminato. In alternative, potrebbe essere usata la funzione -:c:func:`synchronize_rcu()` che blocca l'esecuzione finché tutti i lettori +synchronize_rcu() che blocca l'esecuzione finché tutti i lettori non terminano di ispezionare la lista. Ma come fa l'RCU a sapere quando i lettori sono finiti? Il meccanismo è il seguente: innanzi tutto i lettori accedono alla lista solo fra la coppia -:c:func:`rcu_read_lock()`/:c:func:`rcu_read_unlock()` che disabilita la +rcu_read_lock()/rcu_read_unlock() che disabilita la prelazione così che i lettori non vengano sospesi mentre stanno leggendo la lista. @@ -1253,12 +1253,12 @@ codice RCU è un po' più ottimizzato di così, ma questa è l'idea di fondo. } Da notare che i lettori modificano il campo popularity nella funzione -:c:func:`__cache_find()`, e ora non trattiene alcun *lock*. Una soluzione +__cache_find(), e ora non trattiene alcun *lock*. Una soluzione potrebbe essere quella di rendere la variabile ``atomic_t``, ma per l'uso che ne abbiamo fatto qui, non ci interessano queste corse critiche perché un risultato approssimativo è comunque accettabile, quindi non l'ho cambiato. -Il risultato è che la funzione :c:func:`cache_find()` non ha bisogno di alcuna +Il risultato è che la funzione cache_find() non ha bisogno di alcuna sincronizzazione con le altre funzioni, quindi è veloce su un sistema multi-processore tanto quanto lo sarebbe su un sistema mono-processore. @@ -1271,9 +1271,9 @@ riferimenti. Ora, dato che il '*lock* di lettura' di un RCU non fa altro che disabilitare la prelazione, un chiamante che ha sempre la prelazione disabilitata fra le -chiamate :c:func:`cache_find()` e :c:func:`object_put()` non necessita +chiamate cache_find() e object_put() non necessita di incrementare e decrementare il contatore di riferimenti. Potremmo -esporre la funzione :c:func:`__cache_find()` dichiarandola non-static, +esporre la funzione __cache_find() dichiarandola non-static, e quel chiamante potrebbe usare direttamente questa funzione. Il beneficio qui sta nel fatto che il contatore di riferimenti no @@ -1293,10 +1293,10 @@ singolo contatore. Facile e pulito. Se questo dovesse essere troppo lento (solitamente non lo è, ma se avete dimostrato che lo è devvero), potreste usare un contatore per ogni processore e quindi non sarebbe più necessaria la mutua esclusione. Vedere -:c:func:`DEFINE_PER_CPU()`, :c:func:`get_cpu_var()` e :c:func:`put_cpu_var()` +DEFINE_PER_CPU(), get_cpu_var() e put_cpu_var() (``include/linux/percpu.h``). -Il tipo di dato ``local_t``, la funzione :c:func:`cpu_local_inc()` e tutte +Il tipo di dato ``local_t``, la funzione cpu_local_inc() e tutte le altre funzioni associate, sono di particolare utilità per semplici contatori per-processore; su alcune architetture sono anche più efficienti (``include/asm/local.h``). @@ -1324,11 +1324,11 @@ da un'interruzione software. Il gestore d'interruzione non utilizza alcun enable_irq(irq); spin_unlock(&lock); -La funzione :c:func:`disable_irq()` impedisce al gestore d'interruzioni +La funzione disable_irq() impedisce al gestore d'interruzioni d'essere eseguito (e aspetta che finisca nel caso fosse in esecuzione su un altro processore). Lo spinlock, invece, previene accessi simultanei. Naturalmente, questo è più lento della semplice chiamata -:c:func:`spin_lock_irq()`, quindi ha senso solo se questo genere di accesso +spin_lock_irq(), quindi ha senso solo se questo genere di accesso è estremamente raro. .. _`it_sleeping-things`: @@ -1336,7 +1336,7 @@ Naturalmente, questo è più lento della semplice chiamata Quali funzioni possono essere chiamate in modo sicuro dalle interruzioni? ========================================================================= -Molte funzioni del kernel dormono (in sostanza, chiamano ``schedule()``) +Molte funzioni del kernel dormono (in sostanza, chiamano schedule()) direttamente od indirettamente: non potete chiamarle se trattenere uno spinlock o avete la prelazione disabilitata, mai. Questo significa che dovete necessariamente essere nel contesto utente: chiamarle da un @@ -1354,23 +1354,23 @@ dormire. - Accessi allo spazio utente: - - :c:func:`copy_from_user()` + - copy_from_user() - - :c:func:`copy_to_user()` + - copy_to_user() - - :c:func:`get_user()` + - get_user() - - :c:func:`put_user()` + - put_user() -- :c:func:`kmalloc(GFP_KERNEL) <kmalloc>` +- kmalloc(GFP_KERNEL) <kmalloc>` -- :c:func:`mutex_lock_interruptible()` and - :c:func:`mutex_lock()` +- mutex_lock_interruptible() and + mutex_lock() - C'è anche :c:func:`mutex_trylock()` che però non dorme. + C'è anche mutex_trylock() che però non dorme. Comunque, non deve essere usata in un contesto d'interruzione dato che la sua implementazione non è sicura in quel contesto. - Anche :c:func:`mutex_unlock()` non dorme mai. Non può comunque essere + Anche mutex_unlock() non dorme mai. Non può comunque essere usata in un contesto d'interruzione perché un mutex deve essere rilasciato dallo stesso processo che l'ha acquisito. @@ -1380,11 +1380,11 @@ Alcune funzioni che non dormono Alcune funzioni possono essere chiamate tranquillamente da qualsiasi contesto, o trattenendo un qualsiasi *lock*. -- :c:func:`printk()` +- printk() -- :c:func:`kfree()` +- kfree() -- :c:func:`add_timer()` e :c:func:`del_timer()` +- add_timer() e del_timer() Riferimento per l'API dei Mutex =============================== @@ -1444,14 +1444,14 @@ prelazione bh Bottom Half: per ragioni storiche, le funzioni che contengono '_bh' nel loro nome ora si riferiscono a qualsiasi interruzione software; per esempio, - :c:func:`spin_lock_bh()` blocca qualsiasi interuzione software sul processore + spin_lock_bh() blocca qualsiasi interuzione software sul processore corrente. I *Bottom Halves* sono deprecati, e probabilmente verranno sostituiti dai tasklet. In un dato momento potrà esserci solo un *bottom half* in esecuzione. contesto d'interruzione Non è il contesto utente: qui si processano le interruzioni hardware e - software. La macro :c:func:`in_interrupt()` ritorna vero. + software. La macro in_interrupt() ritorna vero. contesto utente Il kernel che esegue qualcosa per conto di un particolare processo (per @@ -1461,12 +1461,12 @@ contesto utente che hardware. interruzione hardware - Richiesta di interruzione hardware. :c:func:`in_irq()` ritorna vero in un + Richiesta di interruzione hardware. in_irq() ritorna vero in un gestore d'interruzioni hardware. interruzione software / softirq - Gestore di interruzioni software: :c:func:`in_irq()` ritorna falso; - :c:func:`in_softirq()` ritorna vero. I tasklet e le softirq sono entrambi + Gestore di interruzioni software: in_irq() ritorna falso; + in_softirq() ritorna vero. I tasklet e le softirq sono entrambi considerati 'interruzioni software'. In soldoni, un softirq è uno delle 32 interruzioni software che possono diff --git a/Documentation/translations/it_IT/process/2.Process.rst b/Documentation/translations/it_IT/process/2.Process.rst index 9af4d01617c4..30dc172f06b0 100644 --- a/Documentation/translations/it_IT/process/2.Process.rst +++ b/Documentation/translations/it_IT/process/2.Process.rst @@ -23,18 +23,18 @@ ogni due o tre mesi viene effettuata un rilascio importante del kernel. I rilasci più recenti sono stati: ====== ================= - 4.11 Aprile 30, 2017 - 4.12 Luglio 2, 2017 - 4.13 Settembre 3, 2017 - 4.14 Novembre 12, 2017 - 4.15 Gennaio 28, 2018 - 4.16 Aprile 1, 2018 + 5.0 3 marzo, 2019 + 5.1 5 maggio, 2019 + 5.2 7 luglio, 2019 + 5.3 15 settembre, 2019 + 5.4 24 novembre, 2019 + 5.5 6 gennaio, 2020 ====== ================= -Ciascun rilascio 4.x è un importante rilascio del kernel con nuove +Ciascun rilascio 5.x è un importante rilascio del kernel con nuove funzionalità, modifiche interne dell'API, e molto altro. Un tipico -rilascio 4.x contiene quasi 13,000 gruppi di modifiche con ulteriori -modifiche a parecchie migliaia di linee di codice. La 4.x. è pertanto la +rilascio contiene quasi 13,000 gruppi di modifiche con ulteriori +modifiche a parecchie migliaia di linee di codice. La 5.x. è pertanto la linea di confine nello sviluppo del kernel Linux; il kernel utilizza un sistema di sviluppo continuo che integra costantemente nuove importanti modifiche. @@ -55,8 +55,8 @@ verrà descritto dettagliatamente più avanti). La finestra di inclusione resta attiva approssimativamente per due settimane. Al termine di questo periodo, Linus Torvald dichiarerà che la finestra è chiusa e rilascerà il primo degli "rc" del kernel. -Per il kernel che è destinato ad essere 2.6.40, per esempio, il rilascio -che emerge al termine della finestra d'inclusione si chiamerà 2.6.40-rc1. +Per il kernel che è destinato ad essere 5.6, per esempio, il rilascio +che emerge al termine della finestra d'inclusione si chiamerà 5.6-rc1. Questo rilascio indica che il momento di aggiungere nuovi componenti è passato, e che è iniziato il periodo di stabilizzazione del prossimo kernel. @@ -76,22 +76,23 @@ Mentre le correzioni si aprono la loro strada all'interno del ramo principale, il ritmo delle modifiche rallenta col tempo. Linus rilascia un nuovo kernel -rc circa una volta alla settimana; e ne usciranno circa 6 o 9 prima che il kernel venga considerato sufficientemente stabile e che il rilascio -finale 2.6.x venga fatto. A quel punto tutto il processo ricomincerà. +finale venga fatto. A quel punto tutto il processo ricomincerà. -Esempio: ecco com'è andato il ciclo di sviluppo della versione 4.16 +Esempio: ecco com'è andato il ciclo di sviluppo della versione 5.4 (tutte le date si collocano nel 2018) ============== ======================================= - Gennaio 28 4.15 rilascio stabile - Febbraio 11 4.16-rc1, finestra di inclusione chiusa - Febbraio 18 4.16-rc2 - Febbraio 25 4.16-rc3 - Marzo 4 4.16-rc4 - Marzo 11 4.16-rc5 - Marzo 18 4.16-rc6 - Marzo 25 4.16-rc7 - Aprile 1 4.17 rilascio stabile + 15 settembre 5.3 rilascio stabile + 30 settembre 5.4-rc1, finestra di inclusione chiusa + 6 ottobre 5.4-rc2 + 13 ottobre 5.4-rc3 + 20 ottobre 5.4-rc4 + 27 ottobre 5.4-rc5 + 3 novembre 5.4-rc6 + 10 novembre 5.4-rc7 + 17 novembre 5.4-rc8 + 24 novembre 5.4 rilascio stabile ============== ======================================= In che modo gli sviluppatori decidono quando chiudere il ciclo di sviluppo e @@ -108,43 +109,44 @@ tipo di perfezione difficilmente viene raggiunta; esistono troppe variabili in un progetto di questa portata. Arriva un punto dove ritardare il rilascio finale peggiora la situazione; la quantità di modifiche in attesa della prossima finestra di inclusione crescerà enormemente, creando ancor più -regressioni al giro successivo. Quindi molti kernel 4.x escono con una +regressioni al giro successivo. Quindi molti kernel 5.x escono con una manciata di regressioni delle quali, si spera, nessuna è grave. Una volta che un rilascio stabile è fatto, il suo costante mantenimento è affidato al "squadra stabilità", attualmente composta da Greg Kroah-Hartman. Questa squadra rilascia occasionalmente degli aggiornamenti relativi al -rilascio stabile usando la numerazione 4.x.y. Per essere presa in +rilascio stabile usando la numerazione 5.x.y. Per essere presa in considerazione per un rilascio d'aggiornamento, una modifica deve: (1) correggere un baco importante (2) essere già inserita nel ramo principale per il prossimo sviluppo del kernel. Solitamente, passato il loro rilascio iniziale, i kernel ricevono aggiornamenti per più di un ciclo di sviluppo. -Quindi, per esempio, la storia del kernel 4.13 appare così: +Quindi, per esempio, la storia del kernel 5.2 appare così (anno 2019): ============== =============================== - Settembre 3 4.13 rilascio stabile - Settembre 13 4.13.1 - Settembre 20 4.13.2 - Settembre 27 4.13.3 - Ottobre 5 4.13.4 - Ottobre 12 4.13.5 + 15 settembre 5.2 rilascio stabile FIXME settembre è sbagliato + 14 luglio 5.2.1 + 21 luglio 5.2.2 + 26 luglio 5.2.3 + 28 luglio 5.2.4 + 31 luglio 5.2.5 ... ... - Novembre 24 4.13.16 + 11 ottobre 5.2.21 ============== =============================== -La 4.13.16 fu l'aggiornamento finale per la versione 4.13. +La 5.2.21 fu l'aggiornamento finale per la versione 5.2. Alcuni kernel sono destinati ad essere kernel a "lungo termine"; questi riceveranno assistenza per un lungo periodo di tempo. Al momento in cui scriviamo, i manutentori dei kernel stabili a lungo termine sono: - ====== ====================== ========================================== - 3.16 Ben Hutchings (kernel stabile molto più a lungo termine) - 4.1 Sasha Levin - 4.4 Greg Kroah-Hartman (kernel stabile molto più a lungo termine) - 4.9 Greg Kroah-Hartman - 4.14 Greg Kroah-Hartman - ====== ====================== ========================================== + ====== ================================ ========================================== + 3.16 Ben Hutchings (kernel stabile molto più a lungo termine) + 4.4 Greg Kroah-Hartman e Sasha Levin (kernel stabile molto più a lungo termine) + 4.9 Greg Kroah-Hartman e Sasha Levin + 4.14 Greg Kroah-Hartman e Sasha Levin + 4.19 Greg Kroah-Hartman e Sasha Levin + 5.4i Greg Kroah-Hartman e Sasha Levin + ====== ================================ ========================================== Questa selezione di kernel di lungo periodo sono puramente dovuti ai loro @@ -229,12 +231,13 @@ Come le modifiche finiscono nel Kernel -------------------------------------- Esiste una sola persona che può inserire le patch nel repositorio principale -del kernel: Linus Torvalds. Ma, di tutte le 9500 patch che entrarono nella -versione 2.6.38 del kernel, solo 112 (circa l'1,3%) furono scelte direttamente -da Linus in persona. Il progetto del kernel è cresciuto fino a raggiungere -una dimensione tale per cui un singolo sviluppatore non può controllare e -selezionare indipendentemente ogni modifica senza essere supportato. -La via scelta dagli sviluppatori per indirizzare tale crescita è stata quella +del kernel: Linus Torvalds. Ma, per esempio, di tutte le 9500 patch +che entrarono nella versione 2.6.38 del kernel, solo 112 (circa +l'1,3%) furono scelte direttamente da Linus in persona. Il progetto +del kernel è cresciuto fino a raggiungere una dimensione tale per cui +un singolo sviluppatore non può controllare e selezionare +indipendentemente ogni modifica senza essere supportato. La via +scelta dagli sviluppatori per indirizzare tale crescita è stata quella di utilizzare un sistema di "sottotenenti" basato sulla fiducia. Il codice base del kernel è spezzato in una serie si sottosistemi: rete, diff --git a/Documentation/translations/it_IT/process/adding-syscalls.rst b/Documentation/translations/it_IT/process/adding-syscalls.rst index c3a3439595a6..bff0a82bf127 100644 --- a/Documentation/translations/it_IT/process/adding-syscalls.rst +++ b/Documentation/translations/it_IT/process/adding-syscalls.rst @@ -39,7 +39,7 @@ vostra interfaccia. un qualche modo opaca. - Se dovete esporre solo delle informazioni sul sistema, un nuovo nodo in - sysfs (vedere ``Documentation/filesystems/sysfs.txt``) o + sysfs (vedere ``Documentation/filesystems/sysfs.rst``) o in procfs potrebbe essere sufficiente. Tuttavia, l'accesso a questi meccanismi richiede che il filesystem sia montato, il che potrebbe non essere sempre vero (per esempio, in ambienti come namespace/sandbox/chroot). diff --git a/Documentation/translations/it_IT/process/coding-style.rst b/Documentation/translations/it_IT/process/coding-style.rst index 8725f2b9e960..6f4f85832dee 100644 --- a/Documentation/translations/it_IT/process/coding-style.rst +++ b/Documentation/translations/it_IT/process/coding-style.rst @@ -313,7 +313,7 @@ che conta gli utenti attivi, dovreste chiamarla ``count_active_users()`` o qualcosa di simile, **non** dovreste chiamarla ``cntusr()``. Codificare il tipo di funzione nel suo nome (quella cosa chiamata notazione -ungherese) fa male al cervello - il compilatore conosce comunque il tipo e +ungherese) è stupido - il compilatore conosce comunque il tipo e può verificarli, e inoltre confonde i programmatori. Non c'è da sorprendersi che MicroSoft faccia programmi bacati. @@ -825,8 +825,8 @@ linguaggio assembler. Agli sviluppatori del kernel piace essere visti come dotti. Tenete un occhio di riguardo per l'ortografia e farete una belle figura. In inglese, evitate -l'uso di parole mozzate come ``dont``: usate ``do not`` oppure ``don't``. -Scrivete messaggi concisi, chiari, e inequivocabili. +l'uso incorretto di abbreviazioni come ``dont``: usate ``do not`` oppure +``don't``. Scrivete messaggi concisi, chiari, e inequivocabili. I messaggi del kernel non devono terminare con un punto fermo. diff --git a/Documentation/translations/it_IT/process/deprecated.rst b/Documentation/translations/it_IT/process/deprecated.rst index 776f26732a94..e108eaf82cf6 100644 --- a/Documentation/translations/it_IT/process/deprecated.rst +++ b/Documentation/translations/it_IT/process/deprecated.rst @@ -34,6 +34,33 @@ interfaccia come 'vecchia', questa non è una soluzione completa. L'interfaccia deve essere rimossa dal kernel, o aggiunta a questo documento per scoraggiarne l'uso. +BUG() e BUG_ON() +---------------- +Al loro posto usate WARN() e WARN_ON() per gestire le +condizioni "impossibili" e gestitele come se fosse possibile farlo. +Nonostante le funzioni della famiglia BUG() siano state progettate +per asserire "situazioni impossibili" e interrompere in sicurezza un +thread del kernel, queste si sono rivelate essere troppo rischiose +(per esempio, in quale ordine rilasciare i *lock*? Ci sono stati che +sono stati ripristinati?). Molto spesso l'uso di BUG() +destabilizza il sistema o lo corrompe del tutto, il che rende +impossibile un'attività di debug o anche solo leggere un rapporto +circa l'errore. Linus ha un'opinione molto critica al riguardo: +`email 1 +<https://lore.kernel.org/lkml/CA+55aFy6jNLsywVYdGp83AMrXBo_P-pkjkphPGrO=82SPKCpLQ@mail.gmail.com/>`_, +`email 2 +<https://lore.kernel.org/lkml/CAHk-=whDHsbK3HTOpTF=ue_o04onRwTEaK_ZoJp_fjbqq4+=Jw@mail.gmail.com/>`_ + +Tenete presente che la famiglia di funzioni WARN() dovrebbe essere +usato solo per situazioni che si suppone siano "impossibili". Se +volete avvisare gli utenti riguardo a qualcosa di possibile anche se +indesiderato, usare le funzioni della famiglia pr_warn(). Chi +amministra il sistema potrebbe aver attivato l'opzione sysctl +*panic_on_warn* per essere sicuri che il sistema smetta di funzionare +in caso si verifichino delle condizioni "inaspettate". (per esempio, +date un'occhiata al questo `commit +<https://git.kernel.org/linus/d4689846881d160a4d12a514e991a740bcb5d65a>`_) + Calcoli codificati negli argomenti di un allocatore ---------------------------------------------------- Il calcolo dinamico delle dimensioni (specialmente le moltiplicazioni) non @@ -68,52 +95,81 @@ Invece, usate la seguente funzione:: header = kzalloc(struct_size(header, item, count), GFP_KERNEL); -Per maggiori dettagli fate riferimento a :c:func:`array_size`, -:c:func:`array3_size`, e :c:func:`struct_size`, così come la famiglia di -funzioni :c:func:`check_add_overflow` e :c:func:`check_mul_overflow`. +Per maggiori dettagli fate riferimento a array_size(), +array3_size(), e struct_size(), così come la famiglia di +funzioni check_add_overflow() e check_mul_overflow(). simple_strtol(), simple_strtoll(), simple_strtoul(), simple_strtoull() ---------------------------------------------------------------------- -Le funzioni :c:func:`simple_strtol`, :c:func:`simple_strtoll`, -:c:func:`simple_strtoul`, e :c:func:`simple_strtoull` ignorano volutamente +Le funzioni simple_strtol(), simple_strtoll(), +simple_strtoul(), e simple_strtoull() ignorano volutamente i possibili overflow, e questo può portare il chiamante a generare risultati -inaspettati. Le rispettive funzioni :c:func:`kstrtol`, :c:func:`kstrtoll`, -:c:func:`kstrtoul`, e :c:func:`kstrtoull` sono da considerarsi le corrette +inaspettati. Le rispettive funzioni kstrtol(), kstrtoll(), +kstrtoul(), e kstrtoull() sono da considerarsi le corrette sostitute; tuttavia va notato che queste richiedono che la stringa sia terminata con il carattere NUL o quello di nuova riga. strcpy() -------- -La funzione :c:func:`strcpy` non fa controlli agli estremi del buffer +La funzione strcpy() non fa controlli agli estremi del buffer di destinazione. Questo può portare ad un overflow oltre i limiti del buffer e generare svariati tipi di malfunzionamenti. Nonostante l'opzione `CONFIG_FORTIFY_SOURCE=y` e svariate opzioni del compilatore aiutano a ridurne il rischio, non c'è alcuna buona ragione per continuare ad usare -questa funzione. La versione sicura da usare è :c:func:`strscpy`. +questa funzione. La versione sicura da usare è strscpy(). strncpy() su stringe terminate con NUL -------------------------------------- -L'utilizzo di :c:func:`strncpy` non fornisce alcuna garanzia sul fatto che +L'utilizzo di strncpy() non fornisce alcuna garanzia sul fatto che il buffer di destinazione verrà terminato con il carattere NUL. Questo potrebbe portare a diversi overflow di lettura o altri malfunzionamenti causati, appunto, dalla mancanza del terminatore. Questa estende la terminazione nel buffer di destinazione quando la stringa d'origine è più corta; questo potrebbe portare ad una penalizzazione delle prestazioni per chi usa solo stringe terminate. La versione sicura da usare è -:c:func:`strscpy`. (chi usa :c:func:`strscpy` e necessita di estendere la -terminazione con NUL deve aggiungere una chiamata a :c:func:`memset`) +strscpy(). (chi usa strscpy() e necessita di estendere la +terminazione con NUL deve aggiungere una chiamata a memset()) -Se il chiamate no usa stringhe terminate con NUL, allore :c:func:`strncpy()` +Se il chiamate no usa stringhe terminate con NUL, allore strncpy()() può continuare ad essere usata, ma i buffer di destinazione devono essere marchiati con l'attributo `__nonstring <https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html>`_ per evitare avvisi durante la compilazione. strlcpy() --------- -La funzione :c:func:`strlcpy`, per prima cosa, legge interamente il buffer di +La funzione strlcpy(), per prima cosa, legge interamente il buffer di origine, magari leggendo più di quanto verrà effettivamente copiato. Questo è inefficiente e può portare a overflow di lettura quando la stringa non è -terminata con NUL. La versione sicura da usare è :c:func:`strscpy`. +terminata con NUL. La versione sicura da usare è strscpy(). + +Segnaposto %p nella stringa di formato +-------------------------------------- + +Tradizionalmente, l'uso del segnaposto "%p" nella stringa di formato +esponne un indirizzo di memoria in dmesg, proc, sysfs, eccetera. Per +evitare che questi indirizzi vengano sfruttati da malintenzionati, +tutto gli usi di "%p" nel kernel rappresentano l'hash dell'indirizzo, +rendendolo di fatto inutilizzabile. Nuovi usi di "%p" non dovrebbero +essere aggiunti al kernel. Per una rappresentazione testuale di un +indirizzo usate "%pS", l'output è migliore perché mostrerà il nome del +simbolo. Per tutto il resto, semplicemente non usate "%p". + +Parafrasando la `guida +<https://lore.kernel.org/lkml/CA+55aFwQEd_d40g4mUCSsVRZzrFPUJt74vc6PPpb675hYNXcKw@mail.gmail.com/>`_ +di Linus: + +- Se il valore hash di "%p" è inutile, chiediti se il puntatore stesso + è importante. Forse dovrebbe essere rimosso del tutto? +- Se credi davvero che il vero valore del puntatore sia importante, + perché alcuni stati del sistema o i livelli di privilegi di un + utente sono considerati "special"? Se pensi di poterlo giustificare + (in un commento e nel messaggio del commit) abbastanza bene da + affrontare il giudizio di Linus, allora forse potrai usare "%px", + assicurandosi anche di averne il permesso. + +Infine, sappi che un cambio in favore di "%p" con hash `non verrà +accettato +<https://lore.kernel.org/lkml/CA+55aFwieC1-nAs+NFq9RTwaR8ef9hWa4MjNBWL41F-8wM49eA@mail.gmail.com/>`_. Vettori a dimensione variabile (VLA) ------------------------------------ @@ -127,3 +183,47 @@ Questo può portare a dei malfunzionamenti, potrebbe sovrascrivere dati importanti alla fine dello stack (quando il kernel è compilato senza `CONFIG_THREAD_INFO_IN_TASK=y`), o sovrascrivere un pezzo di memoria adiacente allo stack (quando il kernel è compilato senza `CONFIG_VMAP_STACK=y`). + +Salto implicito nell'istruzione switch-case +------------------------------------------- + +Il linguaggio C permette ai casi di un'istruzione `switch` di saltare al +prossimo caso quando l'istruzione "break" viene omessa alla fine del caso +corrente. Tuttavia questo rende il codice ambiguo perché non è sempre ovvio se +l'istruzione "break" viene omessa intenzionalmente o è un baco. Per esempio, +osservando il seguente pezzo di codice non è chiaro se lo stato +`STATE_ONE` è stato progettato apposta per eseguire anche `STATE_TWO`:: + + switch (value) { + case STATE_ONE: + do_something(); + case STATE_TWO: + do_other(); + break; + default: + WARN("unknown state"); + } + +Dato che c'è stata una lunga lista di problemi `dovuti alla mancanza dell'istruzione +"break" <https://cwe.mitre.org/data/definitions/484.html>`_, oggigiorno non +permettiamo più che vi sia un "salto implicito" (*fall-through*). Per +identificare un salto implicito intenzionale abbiamo adottato la pseudo +parola chiave 'fallthrough' che viene espansa nell'estensione di gcc +`__attribute__((fallthrough))` `Statement Attributes +<https://gcc.gnu.org/onlinedocs/gcc/Statement-Attributes.html>`_. +(Quando la sintassi C17/C18 `[[fallthrough]]` sarà più comunemente +supportata dai compilatori C, analizzatori statici, e dagli IDE, +allora potremo usare quella sintassi per la pseudo parola chiave) + +Quando la sintassi [[fallthrough]] sarà più comunemente supportata dai +compilatori, analizzatori statici, e ambienti di sviluppo IDE, +allora potremo usarla anche noi. + +Ne consegue che tutti i blocchi switch/case devono finire in uno dei seguenti +modi: + +* ``break;`` +* `fallthrough;`` +* ``continue;`` +* ``goto <label>;`` +* ``return [expression];`` diff --git a/Documentation/translations/it_IT/process/email-clients.rst b/Documentation/translations/it_IT/process/email-clients.rst index 224ab031ffd3..89abf6d325f2 100644 --- a/Documentation/translations/it_IT/process/email-clients.rst +++ b/Documentation/translations/it_IT/process/email-clients.rst @@ -1,12 +1,334 @@ .. include:: ../disclaimer-ita.rst -:Original: :ref:`Documentation/process/email-clients.rst <email_clients>` - -.. _it_email_clients: +:Original: :doc:`../../../process/email-clients` +:Translator: Alessia Mantegazza <amantegazza@vaga.pv.it> Informazioni sui programmi di posta elettronica per Linux ========================================================= -.. warning:: +Git +--- + +Oggigiorno, la maggior parte degli sviluppatori utilizza ``git send-email`` +al posto dei classici programmi di posta elettronica. Le pagine man sono +abbastanza buone. Dal lato del ricevente, i manutentori utilizzano ``git am`` +per applicare le patch. + +Se siete dei novelli utilizzatori di ``git`` allora inviate la patch a voi +stessi. Salvatela come testo includendo tutte le intestazioni. Poi eseguite +il comando ``git am messaggio-formato-testo.txt`` e revisionatene il risultato +con ``git log``. Quando tutto funziona correttamente, allora potete inviare +la patch alla lista di discussione più appropriata. + +Panoramica delle opzioni +------------------------ + +Le patch per il kernel vengono inviate per posta elettronica, preferibilmente +come testo integrante del messaggio. Alcuni manutentori accettano gli +allegati, ma in questo caso gli allegati devono avere il *content-type* +impostato come ``text/plain``. Tuttavia, generalmente gli allegati non sono +ben apprezzati perché rende più difficile citare porzioni di patch durante il +processo di revisione. + +I programmi di posta elettronica che vengono usati per inviare le patch per il +kernel Linux dovrebbero inviarle senza alterazioni. Per esempio, non +dovrebbero modificare o rimuovere tabulazioni o spazi, nemmeno all'inizio o +alla fine delle righe. + +Non inviate patch con ``format=flowed``. Questo potrebbe introdurre +interruzioni di riga inaspettate e indesiderate. + +Non lasciate che il vostro programma di posta vada a capo automaticamente. +Questo può corrompere le patch. + +I programmi di posta non dovrebbero modificare la codifica dei caratteri nel +testo. Le patch inviate per posta elettronica dovrebbero essere codificate in +ASCII o UTF-8. +Se configurate il vostro programma per inviare messaggi codificati con UTF-8 +eviterete possibili problemi di codifica. + +I programmi di posta dovrebbero generare e mantenere le intestazioni +"References" o "In-Reply-To:" cosicché la discussione non venga interrotta. + +Di solito, il copia-e-incolla (o taglia-e-incolla) non funziona con le patch +perché le tabulazioni vengono convertite in spazi. Usando xclipboard, xclip +e/o xcutsel potrebbe funzionare, ma è meglio che lo verifichiate o meglio +ancora: non usate il copia-e-incolla. + +Non usate firme PGP/GPG nei messaggi che contengono delle patch. Questo +impedisce il corretto funzionamento di alcuni script per leggere o applicare +patch (questo si dovrebbe poter correggere). + +Prima di inviare le patch sulle liste di discussione Linux, può essere una +buona idea quella di inviare la patch a voi stessi, salvare il messaggio +ricevuto, e applicarlo ai sorgenti con successo. + + +Alcuni suggerimenti per i programmi di posta elettronica (MUA) +-------------------------------------------------------------- + +Qui troverete alcuni suggerimenti per configurare i vostri MUA allo scopo +di modificare ed inviare patch per il kernel Linux. Tuttavia, questi +suggerimenti non sono da considerarsi come un riassunto di una configurazione +completa. + +Legenda: + +- TUI = interfaccia utente testuale (*text-based user interface*) +- GUI = interfaccia utente grafica (*graphical user interface*) + +Alpine (TUI) +************ + +Opzioni per la configurazione: + +Nella sezione :menuselection:`Sending Preferences`: + +- :menuselection:`Do Not Send Flowed Text` deve essere ``enabled`` +- :menuselection:`Strip Whitespace Before Sending` deve essere ``disabled`` + +Quando state scrivendo un messaggio, il cursore dev'essere posizionato +dove volete che la patch inizi, poi premendo :kbd:`CTRL-R` vi verrà chiesto +di selezionare il file patch da inserire nel messaggio. + +Claws Mail (GUI) +**************** + +Funziona. Alcune persone riescono ad usarlo con successo per inviare le patch. + +Per inserire una patch usate :menuselection:`Messaggio-->Inserisci file` +(:kbd:`CTRL-I`) oppure un editor esterno. + +Se la patch che avete inserito dev'essere modificata usato la finestra di +scrittura di Claws, allora assicuratevi che l'"auto-interruzione" sia +disabilitata :menuselection:`Configurazione-->Preferenze-->Composizione-->Interruzione riga`. + +Evolution (GUI) +*************** + +Alcune persone riescono ad usarlo con successo per inviare le patch. + +Quando state scrivendo una lettera selezionate: Preformattato + da :menuselection:`Formato-->Stile del paragrafo-->Preformattato` + (:kbd:`CTRL-7`) o dalla barra degli strumenti + +Poi per inserire la patch usate: +:menuselection:`Inserisci--> File di testo...` (:kbd:`ALT-N x`) + +Potete anche eseguire ``diff -Nru old.c new.c | xclip``, selezionare +:menuselection:`Preformattato`, e poi usare il tasto centrale del mouse. + +Kmail (GUI) +*********** + +Alcune persone riescono ad usarlo con successo per inviare le patch. + +La configurazione base che disabilita la composizione di messaggi HTML è +corretta; non abilitatela. + +Quando state scrivendo un messaggio, nel menu opzioni, togliete la selezione a +"A capo automatico". L'unico svantaggio sarà che qualsiasi altra cosa scriviate +nel messaggio non verrà mandata a capo in automatico ma dovrete farlo voi. +Il modo più semplice per ovviare a questo problema è quello di scrivere il +messaggio con l'opzione abilitata e poi di salvarlo nelle bozze. Riaprendo ora +il messaggio dalle bozze le andate a capo saranno parte integrante del +messaggio, per cui togliendo l'opzione "A capo automatico" non perderete nulla. + +Alla fine del vostro messaggio, appena prima di inserire la vostra patch, +aggiungete il delimitatore di patch: tre trattini (``---``). + +Ora, dal menu :menuselection:`Messaggio`, selezionate :menuselection:`Inserisci file di testo...` +quindi scegliete la vostra patch. +Come soluzione aggiuntiva potreste personalizzare la vostra barra degli +strumenti aggiungendo un'icona per :menuselection:`Inserisci file di testo...`. + +Allargate la finestra di scrittura abbastanza da evitare andate a capo. +Questo perché in Kmail 1.13.5 (KDE 4.5.4), Kmail aggiunge andate a capo +automaticamente al momento dell'invio per tutte quelle righe che graficamente, +nella vostra finestra di composizione, si sono estete su una riga successiva. +Disabilitare l'andata a capo automatica non è sufficiente. Dunque, se la vostra +patch contiene delle righe molto lunghe, allora dovrete allargare la finestra +di composizione per evitare che quelle righe vadano a capo. Vedere: +https://bugs.kde.org/show_bug.cgi?id=174034 + +Potete firmare gli allegati con GPG, ma per le patch si preferisce aggiungerle +al testo del messaggio per cui non usate la firma GPG. Firmare le patch +inserite come testo del messaggio le rende più difficili da estrarre dalla loro +codifica a 7-bit. + +Se dovete assolutamente inviare delle patch come allegati invece di integrarle +nel testo del messaggio, allora premete il tasto destro sull'allegato e +selezionate :menuselection:`Proprietà`, e poi attivate +:menuselection:`Suggerisci visualizzazione automatica` per far si che +l'allegato sia più leggibile venendo visualizzato come parte del messaggio. + +Per salvare le patch inviate come parte di un messaggio, selezionate il +messaggio che la contiene, premete il tasto destro e selezionate +:menuselection:`Salva come`. Se il messaggio fu ben preparato, allora potrete +usarlo interamente senza alcuna modifica. +I messaggi vengono salvati con permessi di lettura-scrittura solo per l'utente, +nel caso in cui vogliate copiarli altrove per renderli disponibili ad altri +gruppi o al mondo, ricordatevi di usare ``chmod`` per cambiare i permessi. + +Lotus Notes (GUI) +***************** + +Scappate finché potete. + +IBM Verse (Web GUI) +******************* + +Vedi il commento per Lotus Notes. + +Mutt (TUI) +********** + +Un sacco di sviluppatori Linux usano ``mutt``, per cui deve funzionare +abbastanza bene. + +Mutt non ha un proprio editor, quindi qualunque sia il vostro editor dovrete +configurarlo per non aggiungere automaticamente le andate a capo. Molti +editor hanno un'opzione :menuselection:`Inserisci file` che inserisce il +contenuto di un file senza alterarlo. + +Per usare ``vim`` come editor per mutt:: + + set editor="vi" + +Se per inserire la patch nel messaggio usate xclip, scrivete il comando:: + + :set paste + +prima di premere il tasto centrale o shift-insert. Oppure usate il +comando:: + + :r filename + +(a)llega funziona bene senza ``set paste`` + +Potete generare le patch con ``git format-patch`` e usare Mutt per inviarle:: + + $ mutt -H 0001-some-bug-fix.patch + +Opzioni per la configurazione: + +Tutto dovrebbe funzionare già nella configurazione base. +Tuttavia, è una buona idea quella di impostare ``send_charset``:: + + set send_charset="us-ascii:utf-8" + +Mutt è molto personalizzabile. Qui di seguito trovate la configurazione minima +per iniziare ad usare Mutt per inviare patch usando Gmail:: + + # .muttrc + # ================ IMAP ==================== + set imap_user = 'yourusername@gmail.com' + set imap_pass = 'yourpassword' + set spoolfile = imaps://imap.gmail.com/INBOX + set folder = imaps://imap.gmail.com/ + set record="imaps://imap.gmail.com/[Gmail]/Sent Mail" + set postponed="imaps://imap.gmail.com/[Gmail]/Drafts" + set mbox="imaps://imap.gmail.com/[Gmail]/All Mail" + + # ================ SMTP ==================== + set smtp_url = "smtp://username@smtp.gmail.com:587/" + set smtp_pass = $imap_pass + set ssl_force_tls = yes # Require encrypted connection + + # ================ Composition ==================== + set editor = `echo \$EDITOR` + set edit_headers = yes # See the headers when editing + set charset = UTF-8 # value of $LANG; also fallback for send_charset + # Sender, email address, and sign-off line must match + unset use_domain # because joe@localhost is just embarrassing + set realname = "YOUR NAME" + set from = "username@gmail.com" + set use_from = yes + +La documentazione di Mutt contiene molte più informazioni: + + https://gitlab.com/muttmua/mutt/-/wikis/UseCases/Gmail + + http://www.mutt.org/doc/manual/ + +Pine (TUI) +********** + +Pine aveva alcuni problemi con gli spazi vuoti, ma questi dovrebbero essere +stati risolti. + +Se potete usate alpine (il successore di pine). + +Opzioni di configurazione: + +- Nelle versioni più recenti è necessario avere ``quell-flowed-text`` +- l'opzione ``no-strip-whitespace-before-send`` è necessaria + +Sylpheed (GUI) +************** + +- funziona bene per aggiungere testo in linea (o usando allegati) +- permette di utilizzare editor esterni +- è lento su cartelle grandi +- non farà l'autenticazione TSL SMTP su una connessione non SSL +- ha un utile righello nella finestra di scrittura +- la rubrica non comprende correttamente il nome da visualizzare e + l'indirizzo associato + +Thunderbird (GUI) +***************** + +Thunderbird è un clone di Outlook a cui piace maciullare il testo, ma esistono +modi per impedirglielo. + +- permettere l'uso di editor esterni: + La cosa più semplice da fare con Thunderbird e le patch è quello di usare + l'estensione "external editor" e di usare il vostro ``$EDITOR`` preferito per + leggere/includere patch nel vostro messaggio. Per farlo, scaricate ed + installate l'estensione e aggiungete un bottone per chiamarla rapidamente + usando :menuselection:`Visualizza-->Barra degli strumenti-->Personalizza...`; + una volta fatto potrete richiamarlo premendo sul bottone mentre siete nella + finestra :menuselection:`Scrivi` + + Tenete presente che "external editor" richiede che il vostro editor non + faccia alcun fork, in altre parole, l'editor non deve ritornare prima di + essere stato chiuso. Potreste dover passare dei parametri aggiuntivi al + vostro editor oppure cambiargli la configurazione. Per esempio, usando + gvim dovrete aggiungere l'opzione -f ``/usr/bin/gvim -f`` (Se il binario + si trova in ``/usr/bin``) nell'apposito campo nell'interfaccia di + configurazione di :menuselection:`external editor`. Se usate altri editor + consultate il loro manuale per sapere come configurarli. + +Per rendere l'editor interno un po' più sensato, fate così: + +- Modificate le impostazioni di Thunderbird per far si che non usi + ``format=flowed``. Andate in :menuselection:`Modifica-->Preferenze-->Avanzate-->Editor di configurazione` + per invocare il registro delle impostazioni. + +- impostate ``mailnews.send_plaintext_flowed`` a ``false`` + +- impostate ``mailnews.wraplength`` da ``72`` a ``0`` + +- :menuselection:`Visualizza-->Corpo del messaggio come-->Testo semplice` + +- :menuselection:`Visualizza-->Codifica del testo-->Unicode` + + +TkRat (GUI) +*********** + +Funziona. Usare "Inserisci file..." o un editor esterno. + +Gmail (Web GUI) +*************** + +Non funziona per inviare le patch. + +Il programma web Gmail converte automaticamente i tab in spazi. + +Allo stesso tempo aggiunge andata a capo ogni 78 caratteri. Comunque +il problema della conversione fra spazi e tab può essere risolto usando +un editor esterno. - TODO ancora da tradurre +Un altro problema è che Gmail usa la codifica base64 per tutti quei messaggi +che contengono caratteri non ASCII. Questo include cose tipo i nomi europei. diff --git a/Documentation/translations/it_IT/process/index.rst b/Documentation/translations/it_IT/process/index.rst index 012de0f3154a..c4c867132c88 100644 --- a/Documentation/translations/it_IT/process/index.rst +++ b/Documentation/translations/it_IT/process/index.rst @@ -59,6 +59,7 @@ perché non si è trovato un posto migliore. magic-number volatile-considered-harmful clang-format + ../riscv/patch-acceptance .. only:: subproject and html diff --git a/Documentation/translations/it_IT/process/management-style.rst b/Documentation/translations/it_IT/process/management-style.rst index 07e68bfb8402..c709285138a7 100644 --- a/Documentation/translations/it_IT/process/management-style.rst +++ b/Documentation/translations/it_IT/process/management-style.rst @@ -1,12 +1,293 @@ .. include:: ../disclaimer-ita.rst -:Original: :ref:`Documentation/process/management-style.rst <managementstyle>` +:Original: :doc:`../../../process/management-style` +:Translator: Alessia Mantegazza <amantegazza@vaga.pv.it> -.. _it_managementstyle: +Il modello di gestione del kernel Linux +======================================= -Tipo di gestione del kernel Linux -================================= +Questo breve documento descrive il modello di gestione del kernel Linux. +Per certi versi, esso rispecchia il documento +:ref:`translations/it_IT/process/coding-style.rst <it_codingstyle>`, +ed è principalmente scritto per evitare di rispondere [#f1]_ in continuazione +alle stesse identiche (o quasi) domande. -.. warning:: +Il modello di gestione è qualcosa di molto personale e molto più difficile da +qualificare rispetto a delle semplici regole di codifica, quindi questo +documento potrebbe avere più o meno a che fare con la realtà. È cominciato +come un gioco, ma ciò non significa che non possa essere vero. +Lo dovrete decidere voi stessi. - TODO ancora da tradurre +In ogni caso, quando si parla del "dirigente del kernel", ci si riferisce +sempre alla persona che dirige tecnicamente, e non a coloro che +tradizionalmente hanno un ruolo direttivo all'interno delle aziende. Se vi +occupate di convalidare acquisti o avete una qualche idea sul budget del vostro +gruppo, probabilmente non siete un dirigente del kernel. Quindi i suggerimenti +qui indicati potrebbero fare al caso vostro, oppure no. + +Prima di tutto, suggerirei di acquistare "Le sette regole per avere successo", +e di non leggerlo. Bruciatelo, è un grande gesto simbolico. + +.. [#f1] Questo documento non fa molto per risponde alla domanda, ma rende + così dannatamente ovvio a chi la pone che non abbiamo la minima idea + di come rispondere. + +Comunque, partiamo: + +.. _it_decisions: + +1) Le decisioni +--------------- + +Tutti pensano che i dirigenti decidano, e che questo prendere decisioni +sia importante. Più grande e dolorosa è la decisione, più importante deve +essere il dirigente che la prende. Questo è molto profondo ed ovvio, ma non è +del tutto vero. + +Il gioco consiste nell'"evitare" di dover prendere decisioni. In particolare +se qualcuno vi chiede di "Decidere" tra (a) o (b), e vi dice che ha +davvero bisogno di voi per questo, come dirigenti siete nei guai. +Le persone che gestite devono conoscere i dettagli più di quanto li conosciate +voi, quindi se vengono da voi per una decisione tecnica, siete fottuti. +Non sarete chiaramente competente per prendere quella decisione per loro. + +(Corollario: se le persone che gestite non conoscono i dettagli meglio di voi, +anche in questo caso sarete fregati, tuttavia per altre ragioni. Ossia state +facendo il lavoro sbagliato, e che invece dovrebbero essere "loro" a gestirvi) + +Quindi il gioco si chiama "evitare" decisioni, almeno le più grandi e +difficili. Prendere decisioni piccoli e senza conseguenze va bene, e vi fa +sembrare competenti in quello che state facendo, quindi quello che un dirigente +del kernel ha bisogno di fare è trasformare le decisioni grandi e difficili +in minuzie delle quali nessuno importa. + +Ciò aiuta a capire che la differenza chiave tra una grande decisione ed una +piccola sta nella possibilità di modificare tale decisione in seguito. +Qualsiasi decisione importante può essere ridotta in decisioni meno importanti, +ma dovete assicurarvi che possano essere reversibili in caso di errori +(presenti o futuri). Improvvisamente, dovrete essere doppiamente dirigenti +per **due** decisioni non sequenziali - quella sbagliata **e** quella giusta. + +E le persone vedranno tutto ciò come prova di vera capacità di comando +(*cough* cavolata *cough*) + +Così la chiave per evitare le decisioni difficili diviene l'evitare +di fare cose che non possono essere disfatte. Non infilatevi in un angolo +dal quale non potrete sfuggire. Un topo messo all'angolo può rivelarsi +pericoloso - un dirigente messo all'angolo è solo pietoso. + +**In ogni caso** dato che nessuno è stupido al punto da lasciare veramente ad +un dirigente del kernel un enorme responsabilità, solitamente è facile fare +marcia indietro. Annullare una decisione è molto facile: semplicemente dite a +tutti che siete stati degli scemi incompetenti, dite che siete dispiaciuti, ed +annullate tutto l'inutile lavoro sul quale gli altri hanno lavorato nell'ultimo +anno. Improvvisamente la decisione che avevate preso un anno fa non era poi +così grossa, dato che può essere facilmente annullata. + +È emerso che alcune persone hanno dei problemi con questo tipo di approccio, +questo per due ragioni: + + - ammettere di essere degli idioti è più difficile di quanto sembri. A tutti + noi piace mantenere le apparenze, ed uscire allo scoperto in pubblico per + ammettere che ci si è sbagliati è qualcosa di davvero impegnativo. + - avere qualcuno che ti dice che ciò su cui hai lavorato nell'ultimo anno + non era del tutto valido, può rivelarsi difficile anche per un povero ed + umile ingegnere, e mentre il **lavoro** vero era abbastanza facile da + cancellare, dall'altro canto potreste aver irrimediabilmente perso la + fiducia di quell'ingegnere. E ricordate che l'"irrevocabile" era quello + che avevamo cercato di evitare fin dall'inizio, e la vostra decisione + ha finito per esserlo. + +Fortunatamente, entrambe queste ragioni posso essere mitigate semplicemente +ammettendo fin dal principio che non avete una cavolo di idea, dicendo +agli altri in anticipo che la vostra decisione è puramente ipotetica, e che +potrebbe essere sbagliata. Dovreste sempre riservarvi il diritto di cambiare +la vostra opinione, e rendere gli altri ben **consapevoli** di ciò. +Ed è molto più facile ammettere di essere stupidi quando non avete **ancora** +fatto quella cosa stupida. + +Poi, quando è realmente emersa la vostra stupidità, le persone semplicemente +roteeranno gli occhi e diranno "Uffa, no, ancora". + +Questa ammissione preventiva di incompetenza potrebbe anche portare le persone +che stanno facendo il vero lavoro, a pensarci due volte. Dopo tutto, se +**loro** non sono certi se sia una buona idea, voi, sicuro come la morte, +non dovreste incoraggiarli promettendogli che ciò su cui stanno lavorando +verrà incluso. Fate si che ci pensino due volte prima che si imbarchino in un +grosso lavoro. + +Ricordate: loro devono sapere più cose sui dettagli rispetto a voi, e +solitamente pensano di avere già la risposta a tutto. La miglior cosa che +potete fare in qualità di dirigente è di non instillare troppa fiducia, ma +invece fornire una salutare dose di pensiero critico su quanto stanno facendo. + +Comunque, un altro modo di evitare una decisione è quello di lamentarsi +malinconicamente dicendo : "non possiamo farli entrambi e basta?" e con uno +sguardo pietoso. Fidatevi, funziona. Se non è chiaro quale sia il miglior +approccio, lo scopriranno. La risposta potrebbe essere data dal fatto che +entrambe i gruppi di lavoro diventano frustati al punto di rinunciarvi. + +Questo può suonare come un fallimento, ma di solito questo è un segno che +c'era qualcosa che non andava in entrambe i progetti, e il motivo per +il quale le persone coinvolte non abbiano potuto decidere era che entrambe +sbagliavano. Voi ne uscirete freschi come una rosa, e avrete evitato un'altra +decisione con la quale avreste potuto fregarvi. + + +2) Le persone +------------- + +Ci sono molte persone stupide, ed essere un dirigente significa che dovrete +scendere a patti con questo, e molto più importate, che **loro** devono avere +a che fare con **voi**. + +Ne emerge che mentre è facile annullare degli errori tecnici, non è invece +così facile rimuovere i disordini della personalità. Dovrete semplicemente +convivere con i loro, ed i vostri, problemi. + +Comunque, al fine di preparavi in qualità di dirigenti del kernel, è meglio +ricordare di non abbattere alcun ponte, bombardare alcun paesano innocente, +o escludere troppi sviluppatori kernel. Ne emerge che escludere le persone +è piuttosto facile, mentre includerle nuovamente è difficile. Così +"l'esclusione" immediatamente cade sotto il titolo di "non reversibile", e +diviene un no-no secondo la sezione :ref:`it_decisions`. + +Esistono alcune semplici regole qui: + + (1) non chiamate le persone teste di c*** (al meno, non in pubblico) + (2) imparate a scusarvi quando dimenticate la regola (1) + +Il problema del punto numero 1 è che è molto facile da rispettare, dato che +è possibile dire "sei una testa di c***" in milioni di modi differenti [#f2]_, +a volte senza nemmeno pensarci, e praticamente sempre con la calda convinzione +di essere nel giusto. + +E più convinti sarete che avete ragione (e diciamolo, potete chiamare +praticamente **tutti** testa di c**, e spesso **sarete** nel giusto), più +difficile sarà scusarvi successivamente. + +Per risolvere questo problema, avete due possibilità: + + - diventare davvero bravi nello scusarsi + - essere amabili così che nessuno finirà col sentirsi preso di mira. Siate + creativi abbastanza, e potrebbero esserne divertiti. + +L'opzione dell'essere immancabilmente educati non esiste proprio. Nessuno +si fiderà di qualcuno che chiaramente sta nascondendo il suo vero carattere. + +.. [#f2] Paul Simon cantava: "50 modi per lasciare il vostro amante", perché, + molto francamente, "Un milione di modi per dire ad uno sviluppatore + Testa di c***" non avrebbe funzionato. Ma sono sicuro che ci abbia + pensato. + + +3) Le persone II - quelle buone +------------------------------- + +Mentre emerge che la maggior parte delle persone sono stupide, il corollario +a questo è il triste fatto che anche voi siete fra queste, e che mentre +possiamo tutti crogiolarci nella sicurezza di essere migliori della media +delle persone (diciamocelo, nessuno crede di essere nelle media o sotto di +essa), dovremmo anche ammettere che non siamo il "coltello più affilato" del +circondario, e che ci saranno altre persone che sono meno stupide di quanto +lo siete voi. + +Molti reagiscono male davanti alle persone intelligenti. Altri le usano a +proprio vantaggio. + +Assicuratevi che voi, in qualità di manutentori del kernel, siate nel secondo +gruppo. Inchinatevi dinanzi a loro perché saranno le persone che vi renderanno +il lavoro più facile. In particolare, prenderanno le decisioni per voi, che è +l'oggetto di questo gioco. + +Quindi quando trovate qualcuno più sveglio di voi, prendetevela comoda. +Le vostre responsabilità dirigenziali si ridurranno in gran parte nel dire +"Sembra una buona idea - Vai", oppure "Sembra buono, ma invece circa questo e +quello?". La seconda versione in particolare è una gran modo per imparare +qualcosa di nuovo circa "questo e quello" o di sembrare **extra** dirigenziali +sottolineando qualcosa alla quale i più svegli non avevano pensato. In +entrambe i casi, vincete. + +Una cosa alla quale dovete fare attenzione è che l'essere grandi in qualcosa +non si traduce automaticamente nell'essere grandi anche in altre cose. Quindi +dovreste dare una spintarella alle persone in una specifica direzione, ma +diciamocelo, potrebbero essere bravi in ciò che fanno e far schifo in tutto +il resto. La buona notizia è che le persone tendono a gravitare attorno a ciò +in cui sono bravi, quindi non state facendo nulla di irreversibile quando li +spingete verso una certa direzione, solo non spingete troppo. + + +4) Addossare le colpe +--------------------- + +Le cose andranno male, e le persone vogliono qualcuno da incolpare. Sarete voi. + +Non è poi così difficile accettare la colpa, specialmente se le persone +riescono a capire che non era **tutta** colpa vostra. Il che ci porta +sulla miglior strada per assumersi la colpa: fatelo per qualcun'altro. +Vi sentirete bene nel assumervi la responsabilità, e loro si sentiranno +bene nel non essere incolpati, e coloro che hanno perso i loro 36GB di +pornografia a causa della vostra incompetenza ammetteranno a malincuore che +almeno non avete cercato di fare il furbetto. + +Successivamente fate in modo che gli sviluppatori che in realtà hanno fallito +(se riuscite a trovarli) sappiano **in privato** che sono "fottuti". +Questo non per fargli sapere che la prossima volta possono evitarselo ma per +fargli capire che sono in debito. E, forse cosa più importante, sono loro che +devono sistemare la cosa. Perché, ammettiamolo, è sicuro non sarete voi a +farlo. + +Assumersi la colpa è anche ciò che vi rendere dirigenti in prima battuta. +È parte di ciò che spinge gli altri a fidarsi di voi, e vi garantisce +la gloria potenziale, perché siete gli unici a dire "Ho fatto una cavolata". +E se avete seguito le regole precedenti, sarete decisamente bravi nel dirlo. + + +5) Le cose da evitare +--------------------- + +Esiste una cosa che le persone odiano più che essere chiamate "teste di c****", +ed è essere chiamate "teste di c****" con fare da bigotto. Se per il primo +caso potrete comunque scusarvi, per il secondo non ve ne verrà data nemmeno +l'opportunità. Probabilmente smetteranno di ascoltarvi anche se tutto sommato +state svolgendo un buon lavoro. + +Tutti crediamo di essere migliori degli altri, il che significa che quando +qualcuno inizia a darsi delle arie, ci da **davvero** fastidio. Potreste anche +essere moralmente ed intellettualmente superiore a tutti quelli attorno a voi, +ma non cercate di renderlo ovvio per gli altri a meno che non **vogliate** +veramente far arrabbiare qualcuno [#f3]_. + +Allo stesso modo evitate di essere troppo gentili e pacati. Le buone maniere +facilmente finiscono per strabordare e nascondere i problemi, e come si usa +dire, "su internet nessuno può sentire la vostra pacatezza". Usate argomenti +diretti per farvi capire, non potete sperare che la gente capisca in altro +modo. + +Un po' di umorismo può aiutare a smorzare sia la franchezza che la moralità. +Andare oltre i limiti al punto d'essere ridicolo può portare dei punti a casa +senza renderlo spiacevole per i riceventi, i quali penseranno che stavate +facendo gli scemi. Può anche aiutare a lasciare andare quei blocchi mentali +che abbiamo nei confronti delle critiche. + +.. [#f3] Suggerimento: i forum di discussione su internet, che non sono + collegati col vostro lavoro, sono ottimi modi per sfogare la frustrazione + verso altre persone. Di tanto in tanto scrivete messaggi offensivi col ghigno + in faccia per infiammare qualche discussione: vi sentirete purificati. Solo + cercate di non cagare troppo vicino a casa. + +6) Perché io? +------------- + +Dato che la vostra responsabilità principale è quella di prendervi le colpe +d'altri, e rendere dolorosamente ovvio a tutti che siete degli incompetenti, +la domanda naturale che ne segue sarà : perché dovrei fare tutto ciò? + +Innanzitutto, potreste diventare o no popolari al punto da avere la fila di +ragazzine (o ragazzini, evitiamo pregiudizi o sessismo) che gridano e bussano +alla porta del vostro camerino, ma comunque **proverete** un immenso senso di +realizzazione personale dall'essere "in carica". Dimenticate il fatto che voi +state discutendo con tutti e che cercate di inseguirli il più velocemente che +potete. Tutti continueranno a pensare che voi siete la persona in carica. + +È un bel lavoro se riuscite ad adattarlo a voi. diff --git a/Documentation/translations/it_IT/process/submit-checklist.rst b/Documentation/translations/it_IT/process/submit-checklist.rst index 995ee69fab11..3e575502690f 100644 --- a/Documentation/translations/it_IT/process/submit-checklist.rst +++ b/Documentation/translations/it_IT/process/submit-checklist.rst @@ -117,7 +117,7 @@ sottomissione delle patch, in particolare sorgenti che ne spieghi la logica: cosa fanno e perché. 25) Se la patch aggiunge nuove chiamate ioctl, allora aggiornate - ``Documentation/ioctl/ioctl-number.rst``. + ``Documentation/userspace-api/ioctl/ioctl-number.rst``. 26) Se il codice che avete modificato dipende o usa una qualsiasi interfaccia o funzionalità del kernel che è associata a uno dei seguenti simboli diff --git a/Documentation/translations/it_IT/riscv/patch-acceptance.rst b/Documentation/translations/it_IT/riscv/patch-acceptance.rst new file mode 100644 index 000000000000..edf67252b3fb --- /dev/null +++ b/Documentation/translations/it_IT/riscv/patch-acceptance.rst @@ -0,0 +1,40 @@ +.. include:: ../disclaimer-ita.rst + +:Original: :doc:`../../../riscv/patch-acceptance` +:Translator: Federico Vaga <federico.vaga@vaga.pv.it> + +arch/riscv linee guida alla manutenzione per gli sviluppatori +============================================================= + +Introduzione +------------ + +L'insieme di istruzioni RISC-V sono sviluppate in modo aperto: le +bozze in fase di sviluppo sono disponibili a tutti per essere +revisionate e per essere sperimentare nelle implementazioni. Le bozze +dei nuovi moduli o estensioni possono cambiare in fase di sviluppo - a +volte in modo incompatibile rispetto a bozze precedenti. Questa +flessibilità può portare a dei problemi di manutenzioni per il +supporto RISC-V nel kernel Linux. I manutentori Linux non amano +l'abbandono del codice, e il processo di sviluppo del kernel +preferisce codice ben revisionato e testato rispetto a quello +sperimentale. Desideriamo estendere questi stessi principi al codice +relativo all'architettura RISC-V che verrà accettato per l'inclusione +nel kernel. + +In aggiunta alla lista delle verifiche da fare prima di inviare una patch +------------------------------------------------------------------------- + +Accetteremo le patch per un nuovo modulo o estensione se la fondazione +RISC-V li classifica come "Frozen" o "Retified". (Ovviamente, gli +sviluppatori sono liberi di mantenere una copia del kernel Linux +contenente il codice per una bozza di estensione). + +In aggiunta, la specifica RISC-V permette agli implementatori di +creare le proprie estensioni. Queste estensioni non passano +attraverso il processo di revisione della fondazione RISC-V. Per +questo motivo, al fine di evitare complicazioni o problemi di +prestazioni, accetteremo patch solo per quelle estensioni che sono +state ufficialmente accettate dalla fondazione RISC-V. (Ovviamente, +gli implementatori sono liberi di mantenere una copia del kernel Linux +contenente il codice per queste specifiche estensioni). diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt index 2e831ece6e26..e50fe6541335 100644 --- a/Documentation/translations/ko_KR/memory-barriers.txt +++ b/Documentation/translations/ko_KR/memory-barriers.txt @@ -641,7 +641,7 @@ P 는 짝수 번호 캐시 라인에 저장되어 있고, 변수 B 는 홀수 리눅스 커널이 지원하는 CPU 들은 (1) 쓰기가 정말로 일어날지, (2) 쓰기가 어디에 이루어질지, 그리고 (3) 쓰여질 값을 확실히 알기 전까지는 쓰기를 수행하지 않기 때문입니다. 하지만 "컨트롤 의존성" 섹션과 -Documentation/RCU/rcu_dereference.txt 파일을 주의 깊게 읽어 주시기 바랍니다: +Documentation/RCU/rcu_dereference.rst 파일을 주의 깊게 읽어 주시기 바랍니다: 컴파일러는 매우 창의적인 많은 방법으로 종속성을 깰 수 있습니다. CPU 1 CPU 2 diff --git a/Documentation/translations/zh_CN/IRQ.txt b/Documentation/translations/zh_CN/IRQ.txt index 956026d5cf82..9aec8dca4fcf 100644 --- a/Documentation/translations/zh_CN/IRQ.txt +++ b/Documentation/translations/zh_CN/IRQ.txt @@ -1,4 +1,4 @@ -Chinese translated version of Documentation/IRQ.txt +Chinese translated version of Documentation/core-api/irq/index.rst If you have any comment or update to the content, please contact the original document maintainer directly. However, if you have a problem @@ -9,7 +9,7 @@ or if there is a problem with the translation. Maintainer: Eric W. Biederman <ebiederman@xmission.com> Chinese maintainer: Fu Wei <tekkamanninja@gmail.com> --------------------------------------------------------------------- -Documentation/IRQ.txt 的中文翻译 +Documentation/core-api/irq/index.rst 的中文翻译 如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文 交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻 diff --git a/Documentation/translations/zh_CN/filesystems/debugfs.rst b/Documentation/translations/zh_CN/filesystems/debugfs.rst new file mode 100644 index 000000000000..f8a28793c277 --- /dev/null +++ b/Documentation/translations/zh_CN/filesystems/debugfs.rst @@ -0,0 +1,221 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: ../disclaimer-zh_CN.rst + +:Original: :ref:`Documentation/filesystems/debugfs.txt <debugfs_index>` + +======= +Debugfs +======= + +译者 +:: + + 中文版维护者: 罗楚成 Chucheng Luo <luochucheng@vivo.com> + 中文版翻译者: 罗楚成 Chucheng Luo <luochucheng@vivo.com> + 中文版校译者: 罗楚成 Chucheng Luo <luochucheng@vivo.com> + + + +版权所有2020 罗楚成 <luochucheng@vivo.com> + + +Debugfs是内核开发人员在用户空间获取信息的简单方法。与/proc不同,proc只提供进程 +信息。也不像sysfs,具有严格的“每个文件一个值“的规则。debugfs根本没有规则,开发 +人员可以在这里放置他们想要的任何信息。debugfs文件系统也不能用作稳定的ABI接口。 +从理论上讲,debugfs导出文件的时候没有任何约束。但是[1]实际情况并不总是那么 +简单。即使是debugfs接口,也最好根据需要进行设计,并尽量保持接口不变。 + + +Debugfs通常使用以下命令安装:: + + mount -t debugfs none /sys/kernel/debug + +(或等效的/etc/fstab行)。 +debugfs根目录默认仅可由root用户访问。要更改对文件树的访问,请使用“ uid”,“ gid” +和“ mode”挂载选项。请注意,debugfs API仅按照GPL协议导出到模块。 + +使用debugfs的代码应包含<linux/debugfs.h>。然后,首先是创建至少一个目录来保存 +一组debugfs文件:: + + struct dentry *debugfs_create_dir(const char *name, struct dentry *parent); + +如果成功,此调用将在指定的父目录下创建一个名为name的目录。如果parent参数为空, +则会在debugfs根目录中创建。创建目录成功时,返回值是一个指向dentry结构体的指针。 +该dentry结构体的指针可用于在目录中创建文件(以及最后将其清理干净)。ERR_PTR +(-ERROR)返回值表明出错。如果返回ERR_PTR(-ENODEV),则表明内核是在没有debugfs +支持的情况下构建的,并且下述函数都不会起作用。 + +在debugfs目录中创建文件的最通用方法是:: + + struct dentry *debugfs_create_file(const char *name, umode_t mode, + struct dentry *parent, void *data, + const struct file_operations *fops); + +在这里,name是要创建的文件的名称,mode描述了访问文件应具有的权限,parent指向 +应该保存文件的目录,data将存储在产生的inode结构体的i_private字段中,而fops是 +一组文件操作函数,这些函数中实现文件操作的具体行为。至少,read()和/或 +write()操作应提供;其他可以根据需要包括在内。同样的,返回值将是指向创建文件 +的dentry指针,错误时返回ERR_PTR(-ERROR),系统不支持debugfs时返回值为ERR_PTR +(-ENODEV)。创建一个初始大小的文件,可以使用以下函数代替:: + + struct dentry *debugfs_create_file_size(const char *name, umode_t mode, + struct dentry *parent, void *data, + const struct file_operations *fops, + loff_t file_size); + +file_size是初始文件大小。其他参数跟函数debugfs_create_file的相同。 + +在许多情况下,没必要自己去创建一组文件操作;对于一些简单的情况,debugfs代码提供 +了许多帮助函数。包含单个整数值的文件可以使用以下任何一项创建:: + + void debugfs_create_u8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + void debugfs_create_u16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); + struct dentry *debugfs_create_u32(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + void debugfs_create_u64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); + +这些文件支持读取和写入给定值。如果某个文件不支持写入,只需根据需要设置mode +参数位。这些文件中的值以十进制表示;如果需要使用十六进制,可以使用以下函数 +替代:: + + void debugfs_create_x8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + void debugfs_create_x16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); + void debugfs_create_x32(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + void debugfs_create_x64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); + +这些功能只有在开发人员知道导出值的大小的时候才有用。某些数据类型在不同的架构上 +有不同的宽度,这样会使情况变得有些复杂。在这种特殊情况下可以使用以下函数:: + + void debugfs_create_size_t(const char *name, umode_t mode, + struct dentry *parent, size_t *value); + +不出所料,此函数将创建一个debugfs文件来表示类型为size_t的变量。 + +同样地,也有导出无符号长整型变量的函数,分别以十进制和十六进制表示如下:: + + struct dentry *debugfs_create_ulong(const char *name, umode_t mode, + struct dentry *parent, + unsigned long *value); + void debugfs_create_xul(const char *name, umode_t mode, + struct dentry *parent, unsigned long *value); + +布尔值可以通过以下方式放置在debugfs中:: + + struct dentry *debugfs_create_bool(const char *name, umode_t mode, + struct dentry *parent, bool *value); + + +读取结果文件将产生Y(对于非零值)或N,后跟换行符写入的时候,它只接受大写或小写 +值或1或0。任何其他输入将被忽略。 + +同样,atomic_t类型的值也可以放置在debugfs中:: + + void debugfs_create_atomic_t(const char *name, umode_t mode, + struct dentry *parent, atomic_t *value) + +读取此文件将获得atomic_t值,写入此文件将设置atomic_t值。 + +另一个选择是通过以下结构体和函数导出一个任意二进制数据块:: + + struct debugfs_blob_wrapper { + void *data; + unsigned long size; + }; + + struct dentry *debugfs_create_blob(const char *name, umode_t mode, + struct dentry *parent, + struct debugfs_blob_wrapper *blob); + +读取此文件将返回由指针指向debugfs_blob_wrapper结构体的数据。一些驱动使用“blobs” +作为一种返回几行(静态)格式化文本的简单方法。这个函数可用于导出二进制信息,但 +似乎在主线中没有任何代码这样做。请注意,使用debugfs_create_blob()命令创建的 +所有文件是只读的。 + +如果您要转储一个寄存器块(在开发过程中经常会这么做,但是这样的调试代码很少上传 +到主线中。Debugfs提供两个函数:一个用于创建仅寄存器文件,另一个把一个寄存器块 +插入一个顺序文件中:: + + struct debugfs_reg32 { + char *name; + unsigned long offset; + }; + + struct debugfs_regset32 { + struct debugfs_reg32 *regs; + int nregs; + void __iomem *base; + }; + + struct dentry *debugfs_create_regset32(const char *name, umode_t mode, + struct dentry *parent, + struct debugfs_regset32 *regset); + + void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, + int nregs, void __iomem *base, char *prefix); + +“base”参数可能为0,但您可能需要使用__stringify构建reg32数组,实际上有许多寄存器 +名称(宏)是寄存器块在基址上的字节偏移量。 + +如果要在debugfs中转储u32数组,可以使用以下函数创建文件:: + + void debugfs_create_u32_array(const char *name, umode_t mode, + struct dentry *parent, + u32 *array, u32 elements); + +“array”参数提供数据,而“elements”参数为数组中元素的数量。注意:数组创建后,数组 +大小无法更改。 + +有一个函数来创建与设备相关的seq_file:: + + struct dentry *debugfs_create_devm_seqfile(struct device *dev, + const char *name, + struct dentry *parent, + int (*read_fn)(struct seq_file *s, + void *data)); + +“dev”参数是与此debugfs文件相关的设备,并且“read_fn”是一个函数指针,这个函数在 +打印seq_file内容的时候被回调。 + +还有一些其他的面向目录的函数:: + + struct dentry *debugfs_rename(struct dentry *old_dir, + struct dentry *old_dentry, + struct dentry *new_dir, + const char *new_name); + + struct dentry *debugfs_create_symlink(const char *name, + struct dentry *parent, + const char *target); + +调用debugfs_rename()将为现有的debugfs文件重命名,可能同时切换目录。 new_name +函数调用之前不能存在;返回值为old_dentry,其中包含更新的信息。可以使用 +debugfs_create_symlink()创建符号链接。 + +所有debugfs用户必须考虑的一件事是: + +debugfs不会自动清除在其中创建的任何目录。如果一个模块在不显式删除debugfs目录的 +情况下卸载模块,结果将会遗留很多野指针,从而导致系统不稳定。因此,所有debugfs +用户-至少是那些可以作为模块构建的用户-必须做模块卸载的时候准备删除在此创建的 +所有文件和目录。一份文件可以通过以下方式删除:: + + void debugfs_remove(struct dentry *dentry); + +dentry值可以为NULL或错误值,在这种情况下,不会有任何文件被删除。 + +很久以前,内核开发者使用debugfs时需要记录他们创建的每个dentry指针,以便最后所有 +文件都可以被清理掉。但是,现在debugfs用户能调用以下函数递归清除之前创建的文件:: + + void debugfs_remove_recursive(struct dentry *dentry); + +如果将对应顶层目录的dentry传递给以上函数,则该目录下的整个层次结构将会被删除。 + +注释: +[1] http://lwn.net/Articles/309298/ diff --git a/Documentation/translations/zh_CN/filesystems/index.rst b/Documentation/translations/zh_CN/filesystems/index.rst index 14f155edaf69..186501d13bc1 100644 --- a/Documentation/translations/zh_CN/filesystems/index.rst +++ b/Documentation/translations/zh_CN/filesystems/index.rst @@ -24,4 +24,5 @@ Linux Kernel中的文件系统 :maxdepth: 2 virtiofs + debugfs diff --git a/Documentation/translations/zh_CN/filesystems/sysfs.txt b/Documentation/translations/zh_CN/filesystems/sysfs.txt index ee1f37da5b23..fcf620049d11 100644 --- a/Documentation/translations/zh_CN/filesystems/sysfs.txt +++ b/Documentation/translations/zh_CN/filesystems/sysfs.txt @@ -1,4 +1,4 @@ -Chinese translated version of Documentation/filesystems/sysfs.txt +Chinese translated version of Documentation/filesystems/sysfs.rst If you have any comment or update to the content, please contact the original document maintainer directly. However, if you have a problem @@ -10,7 +10,7 @@ Maintainer: Patrick Mochel <mochel@osdl.org> Mike Murphy <mamurph@cs.clemson.edu> Chinese maintainer: Fu Wei <tekkamanninja@gmail.com> --------------------------------------------------------------------- -Documentation/filesystems/sysfs.txt 的中文翻译 +Documentation/filesystems/sysfs.rst 的中文翻译 如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文 交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻 @@ -40,7 +40,7 @@ sysfs 是一个最初基于 ramfs 且位于内存的文件系统。它提供导 数据结构及其属性,以及它们之间的关联到用户空间的方法。 sysfs 始终与 kobject 的底层结构紧密相关。请阅读 -Documentation/kobject.txt 文档以获得更多关于 kobject 接口的 +Documentation/core-api/kobject.rst 文档以获得更多关于 kobject 接口的 信息。 @@ -281,7 +281,7 @@ drivers/ 包含了每个已为特定总线上的设备而挂载的驱动程序 假定驱动没有跨越多个总线类型)。 fs/ 包含了一个为文件系统设立的目录。现在每个想要导出属性的文件系统必须 -在 fs/ 下创建自己的层次结构(参见Documentation/filesystems/fuse.txt)。 +在 fs/ 下创建自己的层次结构(参见Documentation/filesystems/fuse.rst)。 dev/ 包含两个子目录: char/ 和 block/。在这两个子目录中,有以 <major>:<minor> 格式命名的符号链接。这些符号链接指向 sysfs 目录 diff --git a/Documentation/translations/zh_CN/process/submit-checklist.rst b/Documentation/translations/zh_CN/process/submit-checklist.rst index 8738c55e42a2..50386e0e42e7 100644 --- a/Documentation/translations/zh_CN/process/submit-checklist.rst +++ b/Documentation/translations/zh_CN/process/submit-checklist.rst @@ -97,7 +97,7 @@ Linux内核补丁提交清单 24) 所有内存屏障例如 ``barrier()``, ``rmb()``, ``wmb()`` 都需要源代码中的注 释来解释它们正在执行的操作及其原因的逻辑。 -25) 如果补丁添加了任何ioctl,那么也要更新 ``Documentation/ioctl/ioctl-number.rst`` +25) 如果补丁添加了任何ioctl,那么也要更新 ``Documentation/userspace-api/ioctl/ioctl-number.rst`` 26) 如果修改后的源代码依赖或使用与以下 ``Kconfig`` 符号相关的任何内核API或 功能,则在禁用相关 ``Kconfig`` 符号和/或 ``=m`` (如果该选项可用)的情况 diff --git a/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt b/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt index 9c39ee58ea50..a96abcdec777 100644 --- a/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt +++ b/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt @@ -488,7 +488,7 @@ struct v4l2_subdev *sd = v4l2_i2c_new_subdev(v4l2_dev, adapter, 这个函数会加载给定的模块(如果没有模块需要加载,可以为 NULL), 并用给定的 i2c 适配器结构体指针(i2c_adapter)和 器件地址(chip/address) -作为参数调用 i2c_new_device()。如果一切顺利,则就在 v4l2_device +作为参数调用 i2c_new_client_device()。如果一切顺利,则就在 v4l2_device 中注册了子设备。 你也可以利用 v4l2_i2c_new_subdev()的最后一个参数,传递一个可能的 diff --git a/Documentation/usb/gadget_configfs.rst b/Documentation/usb/gadget_configfs.rst index 54fb08baae22..158e48dab586 100644 --- a/Documentation/usb/gadget_configfs.rst +++ b/Documentation/usb/gadget_configfs.rst @@ -24,7 +24,7 @@ Linux provides a number of functions for gadgets to use. Creating a gadget means deciding what configurations there will be and which functions each configuration will provide. -Configfs (please see `Documentation/filesystems/configfs/*`) lends itself nicely +Configfs (please see `Documentation/filesystems/configfs.rst`) lends itself nicely for the purpose of telling the kernel about the above mentioned decision. This document is about how to do it. @@ -354,7 +354,7 @@ the directories in general can be named at will. A group can have a number of its default sub-groups created automatically. For more information on configfs please see -`Documentation/filesystems/configfs/*`. +`Documentation/filesystems/configfs.rst`. The concepts described above translate to USB gadgets like this: diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index f759edafd938..52bf58417653 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -146,6 +146,7 @@ Code Seq# Include File Comments 'H' 40-4F sound/hdspm.h conflict! 'H' 40-4F sound/hdsp.h conflict! 'H' 90 sound/usb/usx2y/usb_stream.h +'H' 00-0F uapi/misc/habanalabs.h conflict! 'H' A0 uapi/linux/usb/cdc-wdm.h 'H' C0-F0 net/bluetooth/hci.h conflict! 'H' C0-DF net/bluetooth/hidp/hidp.h conflict! diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst index c3129b9ba5cb..57c01f531e61 100644 --- a/Documentation/virt/kvm/amd-memory-encryption.rst +++ b/Documentation/virt/kvm/amd-memory-encryption.rst @@ -74,7 +74,7 @@ should point to a file descriptor that is opened on the ``/dev/sev`` device, if needed (see individual commands). On output, ``error`` is zero on success, or an error code. Error codes -are defined in ``<linux/psp-dev.h>`. +are defined in ``<linux/psp-dev.h>``. KVM implements the following commands to support common lifecycle events of SEV guests, such as launching, running, snapshotting, migrating and decommissioning. diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index efbbe570aa9b..d2c1cbce1018 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -2572,13 +2572,15 @@ list in 4.68. :Parameters: None :Returns: 0 on success, -1 on error -This signals to the host kernel that the specified guest is being paused by -userspace. The host will set a flag in the pvclock structure that is checked -from the soft lockup watchdog. The flag is part of the pvclock structure that -is shared between guest and host, specifically the second bit of the flags +This ioctl sets a flag accessible to the guest indicating that the specified +vCPU has been paused by the host userspace. + +The host will set a flag in the pvclock structure that is checked from the +soft lockup watchdog. The flag is part of the pvclock structure that is +shared between guest and host, specifically the second bit of the flags field of the pvclock_vcpu_time_info structure. It will be set exclusively by the host and read/cleared exclusively by the guest. The guest operation of -checking and clearing the flag must an atomic operation so +checking and clearing the flag must be an atomic operation so load-link/store-conditional, or equivalent must be used. There are two cases where the guest will clear the flag: when the soft lockup watchdog timer resets itself or when a soft lockup is detected. This ioctl can be called any time diff --git a/Documentation/virt/kvm/arm/pvtime.rst b/Documentation/virt/kvm/arm/pvtime.rst index 2357dd2d8655..687b60d76ca9 100644 --- a/Documentation/virt/kvm/arm/pvtime.rst +++ b/Documentation/virt/kvm/arm/pvtime.rst @@ -76,5 +76,5 @@ It is advisable that one or more 64k pages are set aside for the purpose of these structures and not used for other purposes, this enables the guest to map the region using 64k pages and avoids conflicting attributes with other memory. -For the user space interface see Documentation/virt/kvm/devices/vcpu.txt +For the user space interface see Documentation/virt/kvm/devices/vcpu.rst section "3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL". diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst index 9963e680770a..ca374d3fe085 100644 --- a/Documentation/virt/kvm/devices/vcpu.rst +++ b/Documentation/virt/kvm/devices/vcpu.rst @@ -110,5 +110,5 @@ Returns: Specifies the base address of the stolen time structure for this VCPU. The base address must be 64 byte aligned and exist within a valid guest memory -region. See Documentation/virt/kvm/arm/pvtime.txt for more information +region. See Documentation/virt/kvm/arm/pvtime.rst for more information including the layout of the stolen time structure. diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/hypercalls.rst index dbaf207e560d..ed4fddd364ea 100644 --- a/Documentation/virt/kvm/hypercalls.rst +++ b/Documentation/virt/kvm/hypercalls.rst @@ -22,7 +22,7 @@ S390: number in R1. For further information on the S390 diagnose call as supported by KVM, - refer to Documentation/virt/kvm/s390-diag.txt. + refer to Documentation/virt/kvm/s390-diag.rst. PowerPC: It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers. @@ -30,7 +30,7 @@ PowerPC: KVM hypercalls uses 4 byte opcode, that are patched with 'hypercall-instructions' property inside the device tree's /hypervisor node. - For more information refer to Documentation/virt/kvm/ppc-pv.txt + For more information refer to Documentation/virt/kvm/ppc-pv.rst MIPS: KVM hypercalls use the HYPCALL instruction with code 0 and the hypercall diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst index 60981887d20b..46126ecc70f7 100644 --- a/Documentation/virt/kvm/mmu.rst +++ b/Documentation/virt/kvm/mmu.rst @@ -319,7 +319,7 @@ Handling a page fault is performed as follows: - If both P bit and R/W bit of error code are set, this could possibly be handled as a "fast page fault" (fixed without taking the MMU lock). See - the description in Documentation/virt/kvm/locking.txt. + the description in Documentation/virt/kvm/locking.rst. - if needed, walk the guest page tables to determine the guest translation (gva->gpa or ngpa->gpa) diff --git a/Documentation/virt/kvm/review-checklist.rst b/Documentation/virt/kvm/review-checklist.rst index 1f86a9d3f705..dc01aea4057b 100644 --- a/Documentation/virt/kvm/review-checklist.rst +++ b/Documentation/virt/kvm/review-checklist.rst @@ -10,7 +10,7 @@ Review checklist for kvm patches 2. Patches should be against kvm.git master branch. 3. If the patch introduces or modifies a new userspace API: - - the API must be documented in Documentation/virt/kvm/api.txt + - the API must be documented in Documentation/virt/kvm/api.rst - the API must be discoverable using KVM_CHECK_EXTENSION 4. New state must include support for save/restore. diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index e8d943b21cf9..611140ffef7e 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -31,6 +31,7 @@ descriptions of data structures and algorithms. active_mm balance cleancache + free_page_reporting frontswap highmem hmm diff --git a/Documentation/vm/page_frags.rst b/Documentation/vm/page_frags.rst index 637cc49d1b2f..7d6f9385d129 100644 --- a/Documentation/vm/page_frags.rst +++ b/Documentation/vm/page_frags.rst @@ -26,7 +26,7 @@ to be disabled when executing the fragment allocation. The network stack uses two separate caches per CPU to handle fragment allocation. The netdev_alloc_cache is used by callers making use of the -__netdev_alloc_frag and __netdev_alloc_skb calls. The napi_alloc_cache is +netdev_alloc_frag and __netdev_alloc_skb calls. The napi_alloc_cache is used by callers of the __napi_alloc_frag and __napi_alloc_skb calls. The main difference between these two calls is the context in which they may be called. The "netdev" prefixed functions are usable in any context as these diff --git a/Documentation/vm/zswap.rst b/Documentation/vm/zswap.rst index f8c6a79d7c70..d8d9fa4a1f0d 100644 --- a/Documentation/vm/zswap.rst +++ b/Documentation/vm/zswap.rst @@ -140,10 +140,10 @@ without any real benefit but with a performance drop for the system), a special parameter has been introduced to implement a sort of hysteresis to refuse taking pages into zswap pool until it has sufficient space if the limit has been hit. To set the threshold at which zswap would start accepting pages -again after it became full, use the sysfs ``accept_threhsold_percent`` +again after it became full, use the sysfs ``accept_threshold_percent`` attribute, e. g.:: - echo 80 > /sys/module/zswap/parameters/accept_threhsold_percent + echo 80 > /sys/module/zswap/parameters/accept_threshold_percent Setting this parameter to 100 will disable the hysteresis. diff --git a/Documentation/watchdog/convert_drivers_to_kernel_api.rst b/Documentation/watchdog/convert_drivers_to_kernel_api.rst index dd934cc08e40..a1c3f038ce0e 100644 --- a/Documentation/watchdog/convert_drivers_to_kernel_api.rst +++ b/Documentation/watchdog/convert_drivers_to_kernel_api.rst @@ -2,7 +2,7 @@ Converting old watchdog drivers to the watchdog framework ========================================================= -by Wolfram Sang <w.sang@pengutronix.de> +by Wolfram Sang <wsa@kernel.org> Before the watchdog framework came into the kernel, every driver had to implement the API on its own. Now, as the framework factored out the common @@ -115,7 +115,7 @@ Add the watchdog operations --------------------------- All possible callbacks are defined in 'struct watchdog_ops'. You can find it -explained in 'watchdog-kernel-api.txt' in this directory. start(), stop() and +explained in 'watchdog-kernel-api.txt' in this directory. start() and owner must be set, the rest are optional. You will easily find corresponding functions in the old driver. Note that you will now get a pointer to the watchdog_device as a parameter to these functions, so you probably have to diff --git a/Documentation/watchdog/watchdog-kernel-api.rst b/Documentation/watchdog/watchdog-kernel-api.rst index 864edbe932c1..068a55ee0d4a 100644 --- a/Documentation/watchdog/watchdog-kernel-api.rst +++ b/Documentation/watchdog/watchdog-kernel-api.rst @@ -123,8 +123,8 @@ The list of watchdog operations is defined as:: struct module *owner; /* mandatory operations */ int (*start)(struct watchdog_device *); - int (*stop)(struct watchdog_device *); /* optional operations */ + int (*stop)(struct watchdog_device *); int (*ping)(struct watchdog_device *); unsigned int (*status)(struct watchdog_device *); int (*set_timeout)(struct watchdog_device *, unsigned int); diff --git a/Documentation/x86/x86_64/uefi.rst b/Documentation/x86/x86_64/uefi.rst index 88c3ba32546f..3b894103a734 100644 --- a/Documentation/x86/x86_64/uefi.rst +++ b/Documentation/x86/x86_64/uefi.rst @@ -36,7 +36,7 @@ Mechanics elilo bootloader with x86_64 support, elilo configuration file, kernel image built in first step and corresponding - initrd. Instructions on building elilo and its dependencies + initrd. Instructions on building elilo and its dependencies can be found in the elilo sourceforge project. - Boot to EFI shell and invoke elilo choosing the kernel image built |