summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
AgeCommit message (Collapse)Author
2020-12-24drm/i915/gt: Refactor heartbeat request construction and submissionChris Wilson
Pull the individual strands of creating a custom heartbeat requests into a pair of common functions. This will reduce the number of changes we will need to make in future. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20201224160213.29521-1-chris@chris-wilson.co.uk
2020-10-07drm/i915/gt: Track the most recent pulse for the heartbeatChris Wilson
Since we track the idle_pulse for flushing the barriers and avoid re-emitting the pulse upon idling if no futher action is required, this also impacts the heartbeat. Before emitting a fresh heartbeat, we look at the engine idle status and assume that if the pulse was the last request emitted along the heartbeat, the engine is idling and a heartbeat pulse not required. This assumption fails, but we can reuse the idle pulse as the heartbeat if we are yet to emit one, and so track the status of that pulse for our engine health check. This impacts tgl/rcs0 as we rely on the heartbeat for our healthcheck for the normal preemption detection mechanism is disabled by default. Testcase: igt/gem_exec_schedule/preempt-hang/rcs0 #tgl Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20201006094653.7558-1-chris@chris-wilson.co.uk
2020-09-29drm/i915/gt: Always send a pulse down the engine after disabling heartbeatChris Wilson
Currently, we check we can send a pulse prior to disabling the heartbeat to verify that we can change the heartbeat, but since we may re-evaluate execution upon changing the heartbeat interval we need another pulse afterwards to refresh execution. v2: Tvrtko asked if we could reduce the double pulse to a single, which opened up a discussion of how we should handle the pulse-error after attempting to change the property, and the desire to serialise adjustment of the property with its validating pulse, and unwind upon failure. Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: <stable@vger.kernel.org> # v5.7+ Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200928221510.26044-2-chris@chris-wilson.co.uk
2020-07-02drm/i915/gt: Move the heartbeat into the high priority system wqChris Wilson
As we ensure that the heartbeat is reasonably fast (and should not block), move the heartbeat work into the system_highpri_wq to avoid having this essential task be blocked behind other slow work, such as our own retire_work_handler. References: https://gitlab.freedesktop.org/drm/intel/-/issues/2119 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200702095219.963-2-chris@chris-wilson.co.uk
2020-07-02drm/i915/gt: Harden the heartbeat against a stuck driverChris Wilson
If the driver gets stuck holding the kernel timeline, we cannot issue a heartbeat and so fail to discover that the driver is indeed stuck and do not issue a GPU reset (which would hopefully unstick the driver!). Switch to using a trylock so that we can query if the heartbeat's timeline mutex is locked elsewhere, and then use the timer to probe if it remains stuck at the same spot for consecutive heartbeats, indicating that the mutex has not been released and the engine has not progressed. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200702095219.963-1-chris@chris-wilson.co.uk
2020-06-22drm/i915/params: switch to device specific parametersJani Nikula
Start using device specific parameters instead of module parameters for most things. The module parameters become the immutable initial values for i915 parameters. The device specific parameters in i915->params start life as a copy of i915_modparams. Any later changes are only reflected in the debugfs. The stragglers are: * i915.force_probe and i915.modeset. Needed before dev_priv is available. This is fine because the parameters are read-only and never modified. * i915.verbose_state_checks. Passing dev_priv to I915_STATE_WARN and I915_STATE_WARN_ON would result in massive and ugly churn. This is handled by not exposing the parameter via debugfs, and leaving the parameter writable in sysfs. This may be fixed up in follow-up work. * i915.inject_probe_failure. Only makes sense in terms of the module, not the device. This is handled by not exposing the parameter via debugfs. v2: Fix uc i915 lookup code (Michał Winiarski) Cc: Juha-Pekka Heikkilä <juha-pekka.heikkila@intel.com> Cc: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Acked-by: Michał Winiarski <michal.winiarski@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20200618150402.14022-1-jani.nikula@intel.com
2020-06-15drm/i915/gt: Add a safety submission flush in the heartbeatChris Wilson
Just in case everything fails (like for example "missed interrupt syndrome" on Sandybridge), always flush the submission tasklet from the heartbeat. This papers over such issues, but will still appear as a second long glitch, and prevents us from detecting it unless we happen to be performing a timed test. v2: We rely on flush_submission() synchronizing with the tasklet on another CPU. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200615165013.22973-3-chris@chris-wilson.co.uk
2020-05-28drm/i915/gt: Don't declare hangs if engine is stalledChris Wilson
If the ring submission is stalled on an external request, nothing can be submitted, not even the heartbeat in the kernel context. Since nothing is running, resetting the engine/device does not unblock the system and is pointless. We can see if the heartbeat is supposed to be running before declaring foul. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200528074109.28235-2-chris@chris-wilson.co.uk
2020-03-17drm/i915/gt: Always reschedule the new heartbeatChris Wilson
In order to better respond to new heartbeat intervals given via sysfs, always reprogramme an active heartbeat upon change (i.e. use mod_delayed_work to reschedule rather than queue_delayed_work which ignores an already active work.) Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200317163208.30010-1-chris@chris-wilson.co.uk
2020-02-18drm/i915/gt: Fix up missing error propagation for heartbeat pulsesChris Wilson
Just missed setting err along an interruptible error path for the intel_engine_pulse(). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200218162150.1300405-4-chris@chris-wilson.co.uk
2020-01-06drm/i915: Merge i915_request.flags with i915_request.fence.flagsChris Wilson
As we already have a flags field buried within i915_request, reuse it! Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200106114234.2529613-3-chris@chris-wilson.co.uk
2019-11-25drm/i915: Serialise with engine-pm around requests on the kernel_contextChris Wilson
As the engine->kernel_context is used within the engine-pm barrier, we have to be careful when emitting requests outside of the barrier, as the strict timeline locking rules do not apply. Instead, we must ensure the engine_park() cannot be entered as we build the request, which is simplest by taking an explicit engine-pm wakeref around the request construction. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191125105858.1718307-1-chris@chris-wilson.co.uk
2019-11-07drm/i915/gt: Cleanup heartbeat systole firstChris Wilson
Before we grab the engine wakeref, tidy up the previous heartbeat request. If we then abort because the engine powerwell is off, we ensure the request is freed as we know we will not have freed it when cancelling the work (as the work is running!). Fixes: 841e86728615 ("drm/i915/gt: Only drop heartbeat.systole if the sole owner") References: 058179e72e09 ("drm/i915/gt: Replace hangcheck by heartbeats") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191106223410.30334-1-chris@chris-wilson.co.uk
2019-11-06drm/i915/gt: Only drop heartbeat.systole if the sole ownerChris Wilson
Mika spotted that only using cancel_delayed_work() could mean that we attempted to clear the heartbeat.systole while the worker was still running. Rectify the situation by only touching the systole from outside the worker if we suceeded in cancelling the worker before it could run. The worker is expected to clean up by itself upon idling. Reported-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Fixes: 058179e72e09 ("drm/i915/gt: Replace hangcheck by heartbeats") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191106133129.17732-1-chris@chris-wilson.co.uk
2019-10-26drm/i915: Encapsulate kconfig constant values inside boolean predicatesChris Wilson
Avoid angering clang and smatch by using a constant value in a '&&' test, by forcing that constant value into a boolean. E.g., drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c:159:13: warning: use of logical '&&' with constant operand [-Wconstant-logical-operand] if (!delay && CONFIG_DRM_I915_PREEMPT_TIMEOUT) { ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191025135943.12524-1-chris@chris-wilson.co.uk
2019-10-23drm/i915/gt: Replace hangcheck by heartbeatsChris Wilson
Replace sampling the engine state every so often with a periodic heartbeat request to measure the health of an engine. This is coupled with the forced-preemption to allow long running requests to survive so long as they do not block other users. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Jon Bloomfield <jon.bloomfield@intel.com> Reviewed-by: Jon Bloomfield <jon.bloomfield@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191023133108.21401-5-chris@chris-wilson.co.uk
2019-10-21drm/i915/gt: Introduce barrier pulses along enginesChris Wilson
To flush idle barriers, and even inflight requests, we want to send a preemptive 'pulse' along an engine. We use a no-op request along the pinned kernel_context at high priority so that it should run or else kick off the stuck requests. We can use this to ensure idle barriers are immediately flushed, as part of a context cancellation mechanism, or as part of a heartbeat mechanism to detect and reset a stuck GPU. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191021174339.5389-1-chris@chris-wilson.co.uk