author | Linus Torvalds <torvalds@linux-foundation.org> | 2021-06-29 13:36:06 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2021-06-29 13:36:06 -0700 |
commit | 3563f55ce65462063543dfa6a8d8c7fbfb9d7772 (patch) | |
tree | f43239057f966f6511e360930071e02b72f72297 /Documentation | |
parent | 1dfb0f47aca11350f45f8c04c3b83f0e829adfa9 (diff) | |
parent | 22b65d31ad9d10cdd726239966b6d6f67db8f251 (diff) |
Merge tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add hybrid processors support to the intel_pstate driver and
make it work with more processor models when HWP is disabled, make the
intel_idle driver use special C6 idle state parameters when package
C-states are disabled, add cooling support to the tegra30 devfreq
driver, rework the TEO (timer events oriented) cpuidle governor,
extend the OPP (operating performance points) framework to use the
required-opps DT property in more cases, fix some issues and clean up
a number of assorted pieces of code.
Specifics:
- Make intel_pstate support hybrid processors using abstract
performance units in the HWP interface (Rafael Wysocki).
- Add Icelake servers and Cometlake support in no-HWP mode to
intel_pstate (Giovanni Gherdovich).
- Make cpufreq_online() error path be consistent with the CPU device
removal path in cpufreq (Rafael Wysocki).
- Clean up 3 cpufreq drivers and the statistics code (Hailong Liu,
Randy Dunlap, Shaokun Zhang).
- Make intel_idle use special idle state parameters for C6 when
package C-states are disabled (Chen Yu).
- Rework the TEO (timer events oriented) cpuidle governor to address
some theoretical shortcomings in it (Rafael Wysocki).
- Drop unneeded semicolon from the TEO governor (Wan Jiabing).
- Modify the runtime PM framework to accept unassigned suspend and
resume callback pointers (Ulf Hansson).
- Improve pm_runtime_get_sync() documentation (Krzysztof Kozlowski).
- Improve device performance states support in the generic power
domains (genpd) framework (Ulf Hansson).
- Fix some documentation issues in genpd (Yang Yingliang).
- Make the operating performance points (OPP) framework use the
required-opps DT property in use cases that are not related to
genpd (Hsin-Yi Wang).
- Make lazy_link_required_opp_table() use list_del_init instead of
list_del/INIT_LIST_HEAD (Yang Yingliang).
- Simplify wake IRQs handling in the core system-wide sleep support
code and clean up some coding style inconsistencies in it (Tian
Tao, Zhen Lei).
- Add cooling support to the tegra30 devfreq driver and improve its
DT bindings (Dmitry Osipenko).
- Fix some assorted issues in the devfreq core and drivers (Chanwoo
Choi, Dong Aisheng, YueHaibing)"
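One of the cleanups listed above replaces a list_del() followed by INIT_LIST_HEAD() with a single list_del_init() call. The sketch below is a simplified userspace model of the kernel's circular doubly linked list, written only to illustrate why the two forms leave an entry in the same state. It is not the code from include/linux/list.h: the real list_del() poisons the stale pointers with LIST_POISON values, whereas this model (an assumption made for portability) sets them to NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace model of the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Make the node point at itself, i.e. an empty list or an unlinked entry. */
static inline void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* Insert entry right after head. */
static inline void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink entry; its own pointers are left unusable (NULL in this model). */
static inline void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = NULL;
	entry->prev = NULL;
}

/* Unlink entry and reinitialize it in one step. */
static inline void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* A list is empty when its head points back at itself. */
static inline int list_empty(const struct list_head *head)
{
	return head->next == head;
}
```

In this model, calling list_del(&a) and then INIT_LIST_HEAD(&a) leaves a in the same self-pointing state as a single list_del_init(&a), which is why the shorter form can be substituted.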
* tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits)
PM / devfreq: passive: Fix get_target_freq when not using required-opp
cpufreq: Make cpufreq_online() call driver->offline() on errors
opp: Allow required-opps to be used for non genpd use cases
cpuidle: teo: remove unneeded semicolon in teo_select()
dt-bindings: devfreq: tegra30-actmon: Add cooling-cells
dt-bindings: devfreq: tegra30-actmon: Convert to schema
PM / devfreq: userspace: Use DEVICE_ATTR_RW macro
PM: runtime: Clarify documentation when callbacks are unassigned
PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
PM: runtime: Improve path in rpm_idle() when no callback
PM: hibernate: remove leading spaces before tabs
PM: sleep: remove trailing spaces and tabs
PM: domains: Drop/restore performance state votes for devices at runtime PM
PM: domains: Return early if perf state is already set for the device
PM: domains: Split code in dev_pm_genpd_set_performance_state()
cpuidle: teo: Use kerneldoc documentation in admin-guide
cpuidle: teo: Rework most recent idle duration values treatment
cpuidle: teo: Change the main idle state selection logic
cpuidle: teo: Cosmetic modification of teo_select()
cpuidle: teo: Cosmetic modifications of teo_update()
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/admin-guide/pm/cpuidle.rst | 77 |
-rw-r--r-- | Documentation/admin-guide/pm/intel_pstate.rst | 6 |
-rw-r--r-- | Documentation/devicetree/bindings/arm/tegra/nvidia,tegra30-actmon.txt | 57 |
-rw-r--r-- | Documentation/devicetree/bindings/devfreq/nvidia,tegra30-actmon.yaml | 126 |
-rw-r--r-- | Documentation/power/runtime_pm.rst | 15 |
5 files changed, 148 insertions, 133 deletions
diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst
index 10fde58d0869..aec2cd2aaea7 100644
--- a/Documentation/admin-guide/pm/cpuidle.rst
+++ b/Documentation/admin-guide/pm/cpuidle.rst
@@ -347,81 +347,8 @@ for tickless systems. It follows the same basic strategy as the ``menu`` `one
 <menu-gov_>`_: it always tries to find the deepest idle state suitable for the
 given conditions. However, it applies a different approach to that problem.
 
-First, it does not use sleep length correction factors, but instead it attempts
-to correlate the observed idle duration values with the available idle states
-and use that information to pick up the idle state that is most likely to
-"match" the upcoming CPU idle interval. Second, it does not take the tasks
-that were running on the given CPU in the past and are waiting on some I/O
-operations to complete now at all (there is no guarantee that they will run on
-the same CPU when they become runnable again) and the pattern detection code in
-it avoids taking timer wakeups into account. It also only uses idle duration
-values less than the current time till the closest timer (with the scheduler
-tick excluded) for that purpose.
-
-Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain
-the *sleep length*, which is the time until the closest timer event with the
-assumption that the scheduler tick will be stopped (that also is the upper bound
-on the time until the next CPU wakeup). That value is then used to preselect an
-idle state on the basis of three metrics maintained for each idle state provided
-by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``.
-
-The ``hits`` and ``misses`` metrics measure the likelihood that a given idle
-state will "match" the observed (post-wakeup) idle duration if it "matches" the
-sleep length. They both are subject to decay (after a CPU wakeup) every time
-the target residency of the idle state corresponding to them is less than or
-equal to the sleep length and the target residency of the next idle state is
-greater than the sleep length (that is, when the idle state corresponding to
-them "matches" the sleep length). The ``hits`` metric is increased if the
-former condition is satisfied and the target residency of the given idle state
-is less than or equal to the observed idle duration and the target residency of
-the next idle state is greater than the observed idle duration at the same time
-(that is, it is increased when the given idle state "matches" both the sleep
-length and the observed idle duration). In turn, the ``misses`` metric is
-increased when the given idle state "matches" the sleep length only and the
-observed idle duration is too short for its target residency.
-
-The ``early_hits`` metric measures the likelihood that a given idle state will
-"match" the observed (post-wakeup) idle duration if it does not "match" the
-sleep length. It is subject to decay on every CPU wakeup and it is increased
-when the idle state corresponding to it "matches" the observed (post-wakeup)
-idle duration and the target residency of the next idle state is less than or
-equal to the sleep length (i.e. the idle state "matching" the sleep length is
-deeper than the given one).
-
-The governor walks the list of idle states provided by the ``CPUIdle`` driver
-and finds the last (deepest) one with the target residency less than or equal
-to the sleep length. Then, the ``hits`` and ``misses`` metrics of that idle
-state are compared with each other and it is preselected if the ``hits`` one is
-greater (which means that that idle state is likely to "match" the observed idle
-duration after CPU wakeup). If the ``misses`` one is greater, the governor
-preselects the shallower idle state with the maximum ``early_hits`` metric
-(or if there are multiple shallower idle states with equal ``early_hits``
-metric which also is the maximum, the shallowest of them will be preselected).
-[If there is a wakeup latency constraint coming from the `PM QoS framework
-<cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the
-target residency within the sleep length, the deepest idle state with the exit
-latency within the constraint is preselected without consulting the ``hits``,
-``misses`` and ``early_hits`` metrics.]
-
-Next, the governor takes several idle duration values observed most recently
-into consideration and if at least a half of them are greater than or equal to
-the target residency of the preselected idle state, that idle state becomes the
-final candidate to ask for. Otherwise, the average of the most recent idle
-duration values below the target residency of the preselected idle state is
-computed and the governor walks the idle states shallower than the preselected
-one and finds the deepest of them with the target residency within that average.
-That idle state is then taken as the final candidate to ask for.
-
-Still, at this point the governor may need to refine the idle state selection if
-it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
-generally happens if the target residency of the idle state selected so far is
-less than the tick period and the tick has not been stopped already (in a
-previous iteration of the idle loop). Then, like in the ``menu`` governor
-`case <menu-gov_>`_, the sleep length used in the previous computations may not
-reflect the real time until the closest timer event and if it really is greater
-than that time, a shallower state with a suitable target residency may need to
-be selected.
-
+.. kernel-doc:: drivers/cpuidle/governors/teo.c
+   :doc: teo-description
 
 .. _idle-states-representation:
 
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index 7a7d4b041eac..d5043cd8d2f5 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -365,6 +365,9 @@ argument is passed to the kernel in the command line.
 	inclusive) including both turbo and non-turbo P-states (see
 	`Turbo P-states Support`_).
 
+	This attribute is present only if the value exposed by it is the same
+	for all of the CPUs in the system.
+
 	The value of this attribute is not affected by the ``no_turbo``
 	setting described `below <no_turbo_attr_>`_.
 
@@ -374,6 +377,9 @@ argument is passed to the kernel in the command line.
 	Ratio of the `turbo range <turbo_>`_ size to the size of the entire
 	range of supported P-states, in percent.
 
+	This attribute is present only if the value exposed by it is the same
+	for all of the CPUs in the system.
+
 	This attribute is read-only.
 
 .. _no_turbo_attr:
diff --git a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra30-actmon.txt b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra30-actmon.txt
deleted file mode 100644
index 897eedfa2bc8..000000000000
--- a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra30-actmon.txt
+++ /dev/null
@@ -1,57 +0,0 @@
-NVIDIA Tegra Activity Monitor
-
-The activity monitor block collects statistics about the behaviour of other
-components in the system. This information can be used to derive the rate at
-which the external memory needs to be clocked in order to serve all requests
-from the monitored clients.
-
-Required properties:
-- compatible: should be "nvidia,tegra<chip>-actmon"
-- reg: offset and length of the register set for the device
-- interrupts: standard interrupt property
-- clocks: Must contain a phandle and clock specifier pair for each entry in
-clock-names. See ../../clock/clock-bindings.txt for details.
-- clock-names: Must include the following entries:
-  - actmon
-  - emc
-- resets: Must contain an entry for each entry in reset-names. See
-../../reset/reset.txt for details.
-- reset-names: Must include the following entries:
-  - actmon
-- operating-points-v2: See ../bindings/opp/opp.txt for details.
-- interconnects: Should contain entries for memory clients sitting on
-                 MC->EMC memory interconnect path.
-- interconnect-names: Should include name of the interconnect path for each
-                      interconnect entry. Consult TRM documentation for
-                      information about available memory clients, see MEMORY
-                      CONTROLLER section.
-
-For each opp entry in 'operating-points-v2' table:
-- opp-supported-hw: bitfield indicating SoC speedo ID mask
-- opp-peak-kBps: peak bandwidth of the memory channel
-
-Example:
-	dfs_opp_table: opp-table {
-		compatible = "operating-points-v2";
-
-		opp@12750000 {
-			opp-hz = /bits/ 64 <12750000>;
-			opp-supported-hw = <0x000F>;
-			opp-peak-kBps = <51000>;
-		};
-		...
-	};
-
-	actmon@6000c800 {
-		compatible = "nvidia,tegra124-actmon";
-		reg = <0x0 0x6000c800 0x0 0x400>;
-		interrupts = <GIC_SPI 45 IRQ_TYPE_LEVEL_HIGH>;
-		clocks = <&tegra_car TEGRA124_CLK_ACTMON>,
-			 <&tegra_car TEGRA124_CLK_EMC>;
-		clock-names = "actmon", "emc";
-		resets = <&tegra_car 119>;
-		reset-names = "actmon";
-		operating-points-v2 = <&dfs_opp_table>;
-		interconnects = <&mc TEGRA124_MC_MPCORER &emc>;
-		interconnect-names = "cpu";
-	};
diff --git a/Documentation/devicetree/bindings/devfreq/nvidia,tegra30-actmon.yaml b/Documentation/devicetree/bindings/devfreq/nvidia,tegra30-actmon.yaml
new file mode 100644
index 000000000000..e3379d106728
--- /dev/null
+++ b/Documentation/devicetree/bindings/devfreq/nvidia,tegra30-actmon.yaml
@@ -0,0 +1,126 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/devfreq/nvidia,tegra30-actmon.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra30 Activity Monitor
+
+maintainers:
+  - Dmitry Osipenko <digetx@gmail.com>
+  - Jon Hunter <jonathanh@nvidia.com>
+  - Thierry Reding <thierry.reding@gmail.com>
+
+description: |
+  The activity monitor block collects statistics about the behaviour of other
+  components in the system. This information can be used to derive the rate at
+  which the external memory needs to be clocked in order to serve all requests
+  from the monitored clients.
+
+properties:
+  compatible:
+    enum:
+      - nvidia,tegra30-actmon
+      - nvidia,tegra114-actmon
+      - nvidia,tegra124-actmon
+      - nvidia,tegra210-actmon
+
+  reg:
+    maxItems: 1
+
+  clocks:
+    maxItems: 2
+
+  clock-names:
+    items:
+      - const: actmon
+      - const: emc
+
+  resets:
+    maxItems: 1
+
+  reset-names:
+    items:
+      - const: actmon
+
+  interrupts:
+    maxItems: 1
+
+  interconnects:
+    minItems: 1
+    maxItems: 12
+
+  interconnect-names:
+    minItems: 1
+    maxItems: 12
+    description:
+      Should include name of the interconnect path for each interconnect
+      entry. Consult TRM documentation for information about available
+      memory clients, see MEMORY CONTROLLER and ACTIVITY MONITOR sections.
+
+  operating-points-v2:
+    description:
+      Should contain freqs and voltages and opp-supported-hw property, which
+      is a bitfield indicating SoC speedo ID mask.
+
+  "#cooling-cells":
+    const: 2
+
+required:
+  - compatible
+  - reg
+  - clocks
+  - clock-names
+  - resets
+  - reset-names
+  - interrupts
+  - interconnects
+  - interconnect-names
+  - operating-points-v2
+  - "#cooling-cells"
+
+additionalProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/memory/tegra30-mc.h>
+
+    mc: memory-controller@7000f000 {
+        compatible = "nvidia,tegra30-mc";
+        reg = <0x7000f000 0x400>;
+        clocks = <&clk 32>;
+        clock-names = "mc";
+
+        interrupts = <0 77 4>;
+
+        #iommu-cells = <1>;
+        #reset-cells = <1>;
+        #interconnect-cells = <1>;
+    };
+
+    emc: external-memory-controller@7000f400 {
+        compatible = "nvidia,tegra30-emc";
+        reg = <0x7000f400 0x400>;
+        interrupts = <0 78 4>;
+        clocks = <&clk 57>;
+
+        nvidia,memory-controller = <&mc>;
+        operating-points-v2 = <&dvfs_opp_table>;
+        power-domains = <&domain>;
+
+        #interconnect-cells = <0>;
+    };
+
+    actmon@6000c800 {
+        compatible = "nvidia,tegra30-actmon";
+        reg = <0x6000c800 0x400>;
+        interrupts = <0 45 4>;
+        clocks = <&clk 119>, <&clk 57>;
+        clock-names = "actmon", "emc";
+        resets = <&rst 119>;
+        reset-names = "actmon";
+        operating-points-v2 = <&dvfs_opp_table>;
+        interconnects = <&mc TEGRA30_MC_MPCORER &emc>;
+        interconnect-names = "cpu-read";
+        #cooling-cells = <2>;
+    };
diff --git a/Documentation/power/runtime_pm.rst b/Documentation/power/runtime_pm.rst
index 18ae21bf7f92..d6bf84f061f4 100644
--- a/Documentation/power/runtime_pm.rst
+++ b/Documentation/power/runtime_pm.rst
@@ -378,7 +378,11 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
 
   `int pm_runtime_get_sync(struct device *dev);`
     - increment the device's usage counter, run pm_runtime_resume(dev) and
-      return its result
+      return its result;
+      note that it does not drop the device's usage counter on errors, so
+      consider using pm_runtime_resume_and_get() instead of it, especially
+      if its return value is checked by the caller, as this is likely to
+      result in cleaner code.
 
   `int pm_runtime_get_if_in_use(struct device *dev);`
     - return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the
@@ -827,6 +831,15 @@ or driver about runtime power changes. Instead, the driver for the device's
 parent must take responsibility for telling the device's driver when the
 parent's power state changes.
 
+Note that, in some cases it may not be desirable for subsystems/drivers to call
+pm_runtime_no_callbacks() for their devices. This could be because a subset of
+the runtime PM callbacks needs to be implemented, a platform dependent PM
+domain could get attached to the device or that the device is power managed
+through a supplier device link. For these reasons and to avoid boilerplate code
+in subsystems/drivers, the PM core allows runtime PM callbacks to be
+unassigned. More precisely, if a callback pointer is NULL, the PM core will act
+as though there was a callback and it returned 0.
+
 9. Autosuspend, or automatically-delayed suspends
 =================================================
 
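The two runtime_pm.rst additions above describe behaviour that is easy to get wrong: pm_runtime_get_sync() leaves the usage counter raised when resume fails, and a NULL callback pointer is treated as a callback that returned 0. The following is a minimal userspace sketch of those two rules only. Every identifier in it (struct model_dev, the model_* functions, the -5 error value) is hypothetical and chosen for illustration; it does not reproduce drivers/base/power/runtime.c, and it ignores locking, disable depth and device status.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device; not the kernel type. */
struct model_dev {
	int usage_count;                              /* models the usage counter */
	int (*runtime_resume)(struct model_dev *dev); /* may be left NULL */
};

/* A NULL callback behaves as if a callback existed and returned 0. */
static inline int model_run_resume(struct model_dev *dev)
{
	if (!dev->runtime_resume)
		return 0;
	return dev->runtime_resume(dev);
}

/* Models pm_runtime_get_sync(): the counter stays raised even on failure. */
static inline int model_get_sync(struct model_dev *dev)
{
	dev->usage_count++;
	return model_run_resume(dev);
}

/* Models pm_runtime_resume_and_get(): the counter is dropped on failure. */
static inline int model_resume_and_get(struct model_dev *dev)
{
	int ret = model_get_sync(dev);

	if (ret < 0) {
		dev->usage_count--; /* the step a caller of get_sync must do by hand */
		return ret;
	}
	return 0;
}

/* Hypothetical callback that always fails, to exercise the error path. */
static inline int model_failing_resume(struct model_dev *dev)
{
	(void)dev;
	return -5; /* stands in for an errno such as -EIO */
}
```

With a device whose callback is model_failing_resume, model_get_sync() returns the error and leaves usage_count at 1, while model_resume_and_get() returns the same error with usage_count back at 0. With the callback left NULL, both succeed and the counter is raised, matching the "act as though there was a callback and it returned 0" wording.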