path: root/fs
2022-01-07  fscache: Implement functions to add/remove a cache  (David Howells)

Implement functions to allow the cache backend to add or remove a cache:

 (1) Declare a cache to be live:

	int fscache_add_cache(struct fscache_cache *cache,
			      const struct fscache_cache_ops *ops,
			      void *cache_priv);

     Take a previously acquired cache cookie, set the operations table and
     private data, and mark the cache open for access.

 (2) Withdraw a cache from service:

	void fscache_withdraw_cache(struct fscache_cache *cache);

     This marks the cache as withdrawn and thus prevents further
     cache-level and volume-level accesses.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819596022.215744.8799712491432238827.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906896599.143852.17049208999019262884.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967097870.1823006.3470041000971522030.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021505541.640689.1819714759326331054.stgit@warthog.procyon.org.uk/ # v4
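To illustrate the intended call sequence, here is a minimal sketch of a cache backend bringing a cache online and taking it offline again. The backend type, names and error handling are illustrative assumptions, not the actual cachefiles code:

	#include <linux/fscache-cache.h>

	struct mybackend {
		struct fscache_cache *cache;	/* hypothetical backend state */
	};

	static const struct fscache_cache_ops mybackend_cache_ops = {
		.name = "mybackend",
		/* ... backend methods ... */
	};

	static int mybackend_bring_cache_online(struct mybackend *be)
	{
		struct fscache_cache *cache;
		int ret;

		/* Get (or create) the cache cookie by name. */
		cache = fscache_acquire_cache("mycache");
		if (IS_ERR(cache))
			return PTR_ERR(cache);

		/* Set the ops table and private data; mark the cache live. */
		ret = fscache_add_cache(cache, &mybackend_cache_ops, be);
		if (ret < 0) {
			fscache_relinquish_cache(cache);
			return ret;
		}
		be->cache = cache;
		return 0;
	}

	static void mybackend_take_cache_offline(struct mybackend *be)
	{
		/* Bar further cache- and volume-level accesses... */
		fscache_withdraw_cache(be->cache);
		/* ...then give back the cache cookie. */
		fscache_relinquish_cache(be->cache);
	}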
2022-01-07  fscache: Implement cookie-level access helpers  (David Howells)

Add a number of helper functions to manage access to a cookie, pinning the cache object in place for the duration to prevent cache withdrawal from removing it:

 (1) void fscache_init_access_gate(struct fscache_cookie *cookie);

     This function initialises the access count when a cache binds to a
     cookie.  An extra ref is taken on the access count to prevent wakeups
     while the cache is active.  We're only interested in the wakeup when a
     cookie is being withdrawn and we're waiting for it to quiesce - at
     which point the counter will be decremented before the wait.

     The FSCACHE_COOKIE_NACC_ELEVATED flag is set on the cookie to keep
     track of the extra ref in order to handle a race between
     relinquishment and withdrawal both trying to drop the extra ref.

 (2) bool fscache_begin_cookie_access(struct fscache_cookie *cookie,
				      enum fscache_access_trace why);

     This function attempts to begin access upon a cookie, pinning it in
     place if it's cached.  If successful, it returns true and leaves the
     access count incremented.

 (3) void fscache_end_cookie_access(struct fscache_cookie *cookie,
				    enum fscache_access_trace why);

     This function drops the access count obtained by (2), permitting
     object withdrawal to take place when it reaches zero.

A tracepoint is provided to track changes to the access counter on a cookie.

Changes
=======
ver #2:
 - Don't hold n_accesses elevated whilst cache is bound to a cookie, but
   rather add a flag that prevents the state machine from being queued when
   n_accesses reaches 0.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819595085.215744.1706073049250505427.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906895313.143852.10141619544149102193.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095980.1823006.1133648159424418877.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021503063.640689.8870918985269528670.stgit@warthog.procyon.org.uk/ # v4
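A sketch of the usage pattern these helpers enable inside fscache; the trace-reason value is taken as a parameter here since the real enum values are defined by the tracepoint this patch adds:

	static void example_operate_on_cookie(struct fscache_cookie *cookie,
					      enum fscache_access_trace why)
	{
		/* Pin the backing object; fails if not cached or withdrawing. */
		if (!fscache_begin_cookie_access(cookie, why))
			return;

		/* ... perform the cache operation; withdrawal is held off ... */

		/* Unpin; a withdrawal waiting for the count to hit 0 may proceed. */
		fscache_end_cookie_access(cookie, why);
	}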
2022-01-07  fscache: Implement volume-level access helpers  (David Howells)

Add a pair of helper functions to manage access to a volume, pinning the volume in place for the duration to prevent cache withdrawal from removing it:

	bool fscache_begin_volume_access(struct fscache_volume *volume,
					 enum fscache_access_trace why);
	void fscache_end_volume_access(struct fscache_volume *volume,
				       enum fscache_access_trace why);

The way the access gate on the volume works/will work is:

 (1) If the cache tests as not live (state is not FSCACHE_CACHE_IS_ACTIVE), then we return false to indicate access was not permitted.

 (2) If the cache tests as live, then we increment the volume's n_accesses count and then recheck the cache liveness, ending the access if it ceased to be live.

 (3) When we end the access, we decrement the volume's n_accesses and wake up any waiters if it reaches 0.

 (4) Whilst the cache is caching, the volume's n_accesses is kept artificially incremented to prevent wakeups from happening.

 (5) When the cache is taken offline, the state is changed to prevent new accesses, the volume's n_accesses is decremented and we wait for it to become 0.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819594158.215744.8285859817391683254.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906894315.143852.5454793807544710479.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095028.1823006.9173132503876627466.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021501546.640689.9631510472149608443.stgit@warthog.procyon.org.uk/ # v4
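Steps (1)-(3) above amount to something like the following simplified sketch; this is not the exact implementation, and the fscache_cache_is_live() liveness test and barrier placement are assumptions made for illustration:

	static bool example_begin_volume_access(struct fscache_volume *volume,
						enum fscache_access_trace why)
	{
		if (!fscache_cache_is_live(volume->cache))
			return false;			/* (1) cache not live */

		atomic_inc(&volume->n_accesses);	/* (2) pin the volume */
		smp_mb__after_atomic();
		if (fscache_cache_is_live(volume->cache))
			return true;

		/* Raced with withdrawal: (3) drop the pin and wake waiters. */
		fscache_end_volume_access(volume, why);
		return false;
	}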
2022-01-07  fscache: Implement cache-level access helpers  (David Howells)

Add a pair of functions to pin/unpin a cache that we're wanting to do a high-level access to (such as creating or removing a volume):

	bool fscache_begin_cache_access(struct fscache_cache *cache,
					enum fscache_access_trace why);
	void fscache_end_cache_access(struct fscache_cache *cache,
				      enum fscache_access_trace why);

The way the access gate works/will work is:

 (1) If the cache tests as not live (state is not FSCACHE_CACHE_IS_ACTIVE), then we return false to indicate access was not permitted.

 (2) If the cache tests as live, then we increment the n_accesses count and then recheck the liveness, ending the access if it ceased to be live.

 (3) When we end the access, we decrement n_accesses and wake up any waiters if it reaches 0.

 (4) Whilst the cache is caching, n_accesses is kept artificially incremented to prevent wakeups from happening.

 (5) When the cache is taken offline, the state is changed to prevent new accesses, n_accesses is decremented and we wait for n_accesses to become 0.

Note that some of this is implemented in a later patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819593239.215744.7537428720603638088.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906893368.143852.14164004598465617981.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967093977.1823006.6967886507023056409.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021499995.640689.18286203753480287850.stgit@warthog.procyon.org.uk/ # v4
2022-01-07  fscache: Implement cookie registration  (David Howells)

Add functions to the fscache API to allow data file cookies to be acquired and relinquished by the network filesystem. It is intended that the filesystem will create such cookies per-inode under a volume.

To request a cookie, the filesystem should call:

	struct fscache_cookie *
	fscache_acquire_cookie(struct fscache_volume *volume,
			       u8 advice,
			       const void *index_key, size_t index_key_len,
			       const void *aux_data, size_t aux_data_len,
			       loff_t object_size)

The filesystem must first have created a volume cookie, which is passed in here. If it passes in NULL then the function will just return a NULL cookie.

A binary key should be passed in index_key and is of size index_key_len. This is saved in the cookie and is used to locate the associated data in the cache.

A coherency data buffer of size aux_data_len will be allocated and initialised from the buffer pointed to by aux_data. This is used to validate cache objects when they're opened and is stored on disk with them when they're committed. The data is stored in the cookie and will be updateable by various functions in later patches.

The object_size must also be given. This is also used to perform a coherency check and to size the backing storage appropriately.

This function disallows a cookie from being acquired twice in parallel, though it will cause the second user to wait if the first is busy relinquishing its cookie.

When a network filesystem has finished with a cookie, it should call:

	void fscache_relinquish_cookie(struct fscache_cookie *cookie,
				       bool retire)

If retire is true, any backing data will be discarded immediately.

Changes
=======
ver #3:
 - fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
   to round up to.
 - When comparing cookies, simply see if the attributes are the same rather
   than subtracting them to produce a strcmp-style return[1].
 - Add a check to see if the cookie is still hashed at the point of freeing.
ver #2:
 - Don't hold n_accesses elevated whilst cache is bound to a cookie, but
   rather add a flag that prevents the state machine from being queued when
   n_accesses reaches 0.
 - Remove the unused cookie pointer field from the fscache_acquire
   tracepoint.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [1]
Link: https://lore.kernel.org/r/163819590658.215744.14934902514281054323.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906891983.143852.6219772337558577395.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967088507.1823006.12659006350221417165.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021498432.640689.12743483856927722772.stgit@warthog.procyon.org.uk/ # v4
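For example, a network filesystem might acquire a per-inode cookie under its volume cookie roughly as follows; the inode type, key layout and field names are illustrative assumptions:

	struct mynetfs_inode {			/* hypothetical netfs inode */
		struct inode vfs_inode;
		u64 vnode_id;
		u32 data_version;
		struct fscache_cookie *cookie;
	};

	static void mynetfs_cache_inode(struct mynetfs_inode *ni,
					struct fscache_volume *volcookie)
	{
		__be64 key = cpu_to_be64(ni->vnode_id);		/* binary index key */
		__be32 aux = cpu_to_be32(ni->data_version);	/* coherency data */

		ni->cookie = fscache_acquire_cookie(volcookie, 0 /* advice */,
						    &key, sizeof(key),
						    &aux, sizeof(aux),
						    i_size_read(&ni->vfs_inode));
	}

	/* On inode eviction, when finished with the cookie:
	 *	fscache_relinquish_cookie(ni->cookie, false);
	 */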
2022-01-07  fscache: Implement volume registration  (David Howells)

Add functions to the fscache API to allow volumes to be acquired and relinquished by the network filesystem. A volume is an index of data storage cache objects. A volume is represented by a volume cookie in the API. A filesystem would typically create a volume for a superblock and then create per-inode cookies within it.

To request a volume, the filesystem calls:

	struct fscache_volume *
	fscache_acquire_volume(const char *volume_key,
			       const char *cache_name,
			       const void *coherency_data,
			       size_t coherency_len)

The volume_key is a printable string used to match the volume in the cache. It should not contain any '/' characters. For AFS, for example, this would be "afs,<cellname>,<volume_id>", e.g. "afs,example.com,523001".

The cache_name can be NULL, but if not it should be a string indicating the name of the cache to use if there's more than one available.

The coherency data, if given, is an arbitrarily-sized blob that's attached to the volume and is compared when the volume is looked up. If it doesn't match, the old volume is judged to be out of date and it and everything within it is discarded.

Acquiring a volume twice concurrently is disallowed, though the function will wait if an old volume cookie is being relinquished.

When a network filesystem has finished with a volume, it should return the volume cookie by calling:

	void fscache_relinquish_volume(struct fscache_volume *volume,
				       const void *coherency_data,
				       bool invalidate)

If invalidate is true, the entire volume will be discarded; if false, the volume will be synced and the coherency data will be updated.

Changes
=======
ver #4:
 - Removed an extraneous param from kdoc on fscache_relinquish_volume()[3].
ver #3:
 - fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
   to round up to.
 - When comparing cookies, simply see if the attributes are the same rather
   than subtracting them to produce a strcmp-style return[2].
 - Make the coherency data an arbitrary blob rather than a u64, but don't
   store it for the moment.
ver #2:
 - Fix error check[1].
 - Make fscache_acquire_volume() return errors, including EBUSY if a
   conflicting volume cookie already exists. No error is printed now -
   that's left to the netfs.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/20211203095608.GC2480@kili/ [1]
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [2]
Link: https://lore.kernel.org/r/20211220224646.30e8205c@canb.auug.org.au/ [3]
Link: https://lore.kernel.org/r/163819588944.215744.1629085755564865996.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906890630.143852.13972180614535611154.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967086836.1823006.8191672796841981763.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021495816.640689.4403156093668590217.stgit@warthog.procyon.org.uk/ # v4
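At mount time a filesystem might then do something like this sketch (the volume key is the AFS-style example from above; error handling abbreviated):

	static struct fscache_volume *mynetfs_acquire_volume(void)
	{
		struct fscache_volume *volcookie;

		volcookie = fscache_acquire_volume("afs,example.com,523001",
						   NULL,     /* any available cache */
						   NULL, 0); /* no coherency data */
		if (IS_ERR(volcookie))
			return NULL;	/* e.g. -EBUSY: conflicting volume cookie */
		return volcookie;
	}

	/* At unmount: sync the volume and update its coherency data.
	 *	fscache_relinquish_volume(volcookie, NULL, false);
	 */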
2022-01-07  fscache: Implement cache registration  (David Howells)

Implement a register of caches and provide functions to manage it.

Two functions are provided for the cache backend to use:

 (1) Acquire a cache cookie:

	struct fscache_cache *fscache_acquire_cache(const char *name)

     This gets the cache cookie for a cache of the specified name and moves
     it to the preparation state. If a nameless cache cookie exists, that
     will be given this name and used.

 (2) Relinquish a cache cookie:

	void fscache_relinquish_cache(struct fscache_cache *cache);

     This relinquishes a cache cookie, cleans it and makes it available if
     it's still referenced by a network filesystem.

Note that network filesystems don't deal with cache cookies directly, but rather go straight to the volume registration.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819587157.215744.13523139317322503286.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906889665.143852.10378009165231294456.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967085081.1823006.2218944206363626210.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021494847.640689.10109692261640524343.stgit@warthog.procyon.org.uk/ # v4
2022-01-07  fscache: Implement a hash function  (David Howells)

Implement a function to generate hashes. It needs to be stable over time and endianness-independent as the hashes will appear on disk in future patches. It can assume that its input is a multiple of four bytes in size and alignment.

This is borrowed from the VFS and simplified. le32_to_cpu() is added to make it endianness-independent.

Changes
=======
ver #3:
 - Read the data being hashed in an endianness-independent way[1].
 - Change the size parameter to be in bytes rather than words.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com [1]
Link: https://lore.kernel.org/r/163819586113.215744.1699465806130102367.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906888735.143852.10944614318596881429.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967082342.1823006.8915671045444488742.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021493624.640689.9990442668811178628.stgit@warthog.procyon.org.uk/ # v4
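The shape of such a hash is sketched below: reading the input as little-endian 32-bit words makes the result identical on big- and little-endian hosts. The mixing step here is illustrative only, not the exact function fscache uses:

	#include <linux/types.h>

	static unsigned int example_stable_hash(unsigned int salt,
						const void *data, size_t len)
	{
		const __le32 *p = data;
		unsigned int x = 0, y = salt;
		size_t n = len / sizeof(__le32);	/* input is a multiple of 4 bytes */

		while (n--) {
			/* le32_to_cpu() is a no-op on LE, a byte-swap on BE. */
			x ^= le32_to_cpu(*p++);
			y = (y << 7) | (y >> 25);	/* rotate; illustrative mix */
			x += y;
		}
		return x;
	}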
2022-01-07  fscache: Introduce new driver  (David Howells)

Introduce the basic skeleton of the new, rewritten fscache driver.

Changes
=======
ver #3:
 - Use remove_proc_subtree(), not remove_proc_entry(), to remove a
   populated dir.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819584034.215744.4290533472390439030.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906887770.143852.3577888294989185666.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967080039.1823006.5702921801104057922.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021491014.640689.4292699878317589512.stgit@warthog.procyon.org.uk/ # v4
2022-01-07  netfs: Pass a flag to ->prepare_write() to say if there's no alloc'd space  (David Howells)

Pass a flag to ->prepare_write() to indicate if there's definitely no space allocated in the cache yet (for instance if we've already checked as we were asked to do a read).

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819583123.215744.12783808230464471417.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906886835.143852.6689886781122679769.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967079100.1823006.12889542712309574359.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021489334.640689.3131206613015409076.stgit@warthog.procyon.org.uk/ # v4
2022-01-07  fscache: Remove the contents of the fscache driver, pending rewrite  (David Howells)

Remove the code that comprises the fscache driver as it's going to be substantially rewritten, with the majority of the code being erased in the rewrite.

A small piece of linux/fscache.h is left as that is #included by a bunch of network filesystems.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819578724.215744.18210619052245724238.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906884814.143852.6727245089843862889.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967077097.1823006.1377665951499979089.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021485548.640689.13876080567388696162.stgit@warthog.procyon.org.uk/ # v4
2022-01-07  cachefiles: Delete the cachefiles driver pending rewrite  (David Howells)

Delete the code from the cachefiles driver to make it easier to rewrite and resubmit in a logical manner.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819577641.215744.12718114397770666596.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906883770.143852.4149714614981373410.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967076066.1823006.7175712134577687753.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021483619.640689.7586546280515844702.stgit@warthog.procyon.org.uk/ # v4
2022-01-07  fscache, cachefiles: Disable configuration  (David Howells)

Disable fscache and cachefiles in Kconfig whilst it is rewritten.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819576672.215744.12444272479560406780.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906882835.143852.11073015983885872901.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967075113.1823006.277316290062782998.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021481179.640689.2004199594774033658.stgit@warthog.procyon.org.uk/ # v4
2022-01-06  NFSv42: Fallocate and clone should also request 'blocks used'  (Trond Myklebust)

Both fallocate and clone can end up updating the blocks used attribute.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  NFSv4: Allow writebacks to request 'blocks used'  (Trond Myklebust)

When doing a non-pNFS write, allow the writeback code to specify that it also needs to update 'blocks used'.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  NFS: use default_groups in kobj_type  (Greg Kroah-Hartman)

There are currently 2 ways to create a set of sysfs files for a kobj_type: through the default_attrs field, and the default_groups field. Move the NFS code to use the default_groups field, which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type"), so that we can soon get rid of the obsolete default_attrs field.

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
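The conversion looks roughly like this (attribute names are illustrative); the ATTRIBUTE_GROUPS() helper generates foo_groups[] from foo_attrs[]:

	static struct attribute *foo_attrs[] = {
		&foo_bar_attribute.attr,
		NULL,
	};
	ATTRIBUTE_GROUPS(foo);		/* defines foo_groups from foo_attrs */

	static struct kobj_type foo_ktype = {
		.sysfs_ops	= &kobj_sysfs_ops,
		.default_groups	= foo_groups,	/* was: .default_attrs = foo_attrs */
	};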
2022-01-06  NFS: Fix the verifier for case-sensitive filesystems in nfs_atomic_open()  (Trond Myklebust)

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  NFS: Add a helper to remove case-insensitive aliases  (Trond Myklebust)

When dealing with case-insensitive names, the client has no idea how the server performs the mapping, so it cannot collapse the dentries into a single representative. So both rename and unlink need to deal with the fact that there could be several dentries representing the file, and have to somehow force them to be revalidated. Use d_prune_aliases() as a big-hammer approach.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  NFS: Invalidate negative dentries on all case-insensitive directory changes  (Trond Myklebust)

If we create a file, rename it, or hardlink it, then we need to assume that cached negative dentries need to be revalidated.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  NFSv4: Just don't cache negative dentries on case-insensitive servers  (Trond Myklebust)

If the directory contents change, we cannot rely on the negative dentry being cacheable.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  NFSv4: Add some support for case-insensitive filesystems  (Trond Myklebust)

Add capabilities to allow the NFS client to recognise when it is dealing with case-insensitive and case-preserving filesystems.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  NFSv4.1: Fix uninitialised variable in devicenotify  (Trond Myklebust)

When decode_devicenotify_args() exits with no entries, we need to ensure that the struct cb_devicenotifyargs is initialised to { 0, NULL } in order to avoid problems in nfs4_callback_devicenotify().

Reported-by: <rtm@csail.mit.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  nfs: nfs4client: check the return value of kstrdup()  (Xiaoke Wang)

kstrdup() returns NULL when some internal memory error happens; it is better to check its return value so as to catch the memory error in time.

Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  NFSv4: only print the label when it's queried  (Olga Kornievskaia)

When the bitmask of the attributes doesn't include the security label, don't bother printing it. Since the label might not be null-terminated, adjust the printing format accordingly.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  nfs41: pnfs: filelayout: Replace one-element array with flexible-array member  (Gustavo A. R. Silva)

There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2].

Refactor the code a bit according to the use of a flexible-array member in struct nfs4_file_layout_dsaddr instead of a one-element array, and use the struct_size() helper.

This helps with the ongoing efforts to globally enable -Warray-bounds and get us closer to being able to tighten the FORTIFY_SOURCE routines on memcpy().

This issue was found with the help of Coccinelle and audited and fixed, manually.

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/109
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
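The shape of the change, with unrelated fields omitted (a before/after sketch, not the exact diff):

	/* Before: one-element trailing array, sized with open-coded arithmetic. */
	struct nfs4_file_layout_dsaddr {
		u32			ds_num;
		struct nfs4_pnfs_ds	*ds_list[1];
	};
	/* kzalloc(sizeof(*dsaddr) + (num - 1) * sizeof(dsaddr->ds_list[0]), gfp) */

	/* After: flexible array member, sized with the struct_size() helper. */
	struct nfs4_file_layout_dsaddr {
		u32			ds_num;
		struct nfs4_pnfs_ds	*ds_list[];
	};
	/* kzalloc(struct_size(dsaddr, ds_list, num), gfp) */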
2022-01-06  NFS: Ensure the server has an up to date ctime before renaming  (Trond Myklebust)

Renaming a file is required by POSIX to update the file ctime, so ensure that the file data is synced to disk so that we don't clobber the updated ctime by writing back after renaming the file.

Fixes: f2c2c552f119 ("NFS: Move delegation recall into the NFSv4 callback for rename_setup()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  NFS: Ensure the server has an up to date ctime before hardlinking  (Trond Myklebust)

Creating a hard link is required by POSIX to update the file ctime, so ensure that the file data is synced to disk so that we don't clobber the updated ctime by writing back after creating the hard link.

Fixes: 9f7682728728 ("NFS: Move the delegation return down into nfs4_proc_link()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  NFS: don't store 'struct cred *' in struct nfs_access_entry  (NeilBrown)

Storing the 'struct cred *' in nfs_access_entry is problematic. An active 'cred' can keep a 'struct key *' active, and a quota is imposed on the number of such keys that a user can maintain. Cached 'nfs_access_entry' structs have indefinite lifetime, and having these keep 'struct key's alive imposes on that quota.

So remove the 'struct cred *' and replace it with the fields we need: kuid_t, kgid_t, and struct group_info *.

This makes the 'struct nfs_access_entry' 64 bits larger.

A new function, access_cmp(), is introduced which is identical to cred_fscmp() except that the second arg is an 'nfs_access_entry' rather than a 'cred'.

Fixes: b68572e07c58 ("NFS: change access cache to use 'struct cred'.")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
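An access_cmp() of this shape closely mirrors cred_fscmp(); the following is a sketch assuming nfs_access_entry now carries fsuid, fsgid and group_info as described (not the verbatim patch code):

	static int access_cmp(const struct cred *a, const struct nfs_access_entry *b)
	{
		const struct group_info *ga = a->group_info, *gb = b->group_info;
		int g;

		if (!uid_eq(a->fsuid, b->fsuid))
			return uid_lt(a->fsuid, b->fsuid) ? -1 : 1;
		if (!gid_eq(a->fsgid, b->fsgid))
			return gid_lt(a->fsgid, b->fsgid) ? -1 : 1;

		/* Like cred_fscmp(): order group lists by length, then content. */
		if (ga == gb)
			return 0;
		if (!ga)
			return -1;
		if (!gb)
			return 1;
		if (ga->ngroups < gb->ngroups)
			return -1;
		if (ga->ngroups > gb->ngroups)
			return 1;
		for (g = 0; g < ga->ngroups; g++) {
			if (gid_lt(ga->gid[g], gb->gid[g]))
				return -1;
			if (gid_gt(ga->gid[g], gb->gid[g]))
				return 1;
		}
		return 0;
	}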
2022-01-06  NFS: pass cred explicitly for access tests  (NeilBrown)

Storing the 'struct cred *' in nfs_access_entry is problematic. An active 'cred' can keep a 'struct key *' active, and a quota is imposed on the number of such keys that a user can maintain. Cached 'nfs_access_entry' structs have indefinite lifetime, and having these keep 'struct key's alive imposes on that quota.

So a future patch will remove the ->cred ref from nfs_access_entry. To prepare, change various functions to not assume there is a 'cred' in the nfs_access_entry, but to pass the cred around explicitly.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  NFS: change nfs_access_get_cached to only report the mask  (NeilBrown)

Currently the nfs_access_get_cached family of functions report a 'struct nfs_access_entry' as the result, with both .mask and .cred set. However the .cred is never used. This is probably just as well, as there is no guarantee that it won't be freed before use.

Change to only report the 'mask' - as this is all that is used or needed.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06  xfs: warn about inodes with project id of -1  (Darrick J. Wong)

Inodes aren't supposed to have a project id of -1U (aka 4294967295) but the kernel hasn't always validated FSSETXATTR correctly. Flag this as something for the sysadmin to check out.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06  xfs: hold quota inode ILOCK_EXCL until the end of dqalloc  (Darrick J. Wong)

Online fsck depends on callers holding ILOCK_EXCL from the time they decide to update a block mapping until after they've updated the reverse mapping records to guarantee the stability of both mapping records.

Unfortunately, the quota code drops ILOCK_EXCL at the first transaction roll in the dquot allocation process, which breaks that assertion. This leads to sporadic failures in the online rmap repair code if the repair code grabs the AGF after bmapi_write maps a new block into the quota file's data fork but before it can finish the deferred rmap update.

Fix this by rewriting the function to hold the ILOCK until after the transaction commit like all other bmap updates do, and get rid of the dqread wrapper that does nothing but complicate the codebase.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06  xfs: Remove redundant assignment of mp  (Jiapeng Chong)

The variable mp is being initialized to log->l_mp, but this value is never read as it is overwritten later on. Remove the redundant assignment.

Cleans up the following clang-analyzer warning:

fs/xfs/xfs_log_recover.c:3543:20: warning: Value stored to 'mp' during its initialization is never read [clang-analyzer-deadcode.DeadStores]

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06  xfs: reduce kvmalloc overhead for CIL shadow buffers  (Dave Chinner)

Oh, let me count the ways that the kvmalloc API sucks dog eggs.

The problem is when we are logging lots of large objects, we hit kvmalloc really damn hard with costly order allocations, and behaviour utterly sucks:

	- 49.73% xlog_cil_commit
	   - 31.62% kvmalloc_node
	      - 29.96% __kmalloc_node
	         - 29.38% kmalloc_large_node
	            - 29.33% __alloc_pages
	               - 24.33% __alloc_pages_slowpath.constprop.0
	                  - 18.35% __alloc_pages_direct_compact
	                     - 17.39% try_to_compact_pages
	                        - compact_zone_order
	                           - 15.26% compact_zone
	                                5.29% __pageblock_pfn_to_page
	                                3.71% PageHuge
	                              - 1.44% isolate_migratepages_block
	                                   0.71% set_pfnblock_flags_mask
	                          1.11% get_pfnblock_flags_mask
	                     - 0.81% get_page_from_freelist
	                        - 0.59% _raw_spin_lock_irqsave
	                           - do_raw_spin_lock
	                                __pv_queued_spin_lock_slowpath
	                  - 3.24% try_to_free_pages
	                     - 3.14% shrink_node
	                        - 2.94% shrink_slab.constprop.0
	                           - 0.89% super_cache_count
	                              - 0.66% xfs_fs_nr_cached_objects
	                                 - 0.65% xfs_reclaim_inodes_count
	                                      0.55% xfs_perag_get_tag
	                             0.58% kfree_rcu_shrink_count
	                  - 2.09% get_page_from_freelist
	                     - 1.03% _raw_spin_lock_irqsave
	                        - do_raw_spin_lock
	                             __pv_queued_spin_lock_slowpath
	               - 4.88% get_page_from_freelist
	                  - 3.66% _raw_spin_lock_irqsave
	                     - do_raw_spin_lock
	                          __pv_queued_spin_lock_slowpath
	      - 1.63% __vmalloc_node
	         - __vmalloc_node_range
	            - 1.10% __alloc_pages_bulk
	               - 0.93% __alloc_pages
	                  - 0.92% get_page_from_freelist
	                     - 0.89% rmqueue_bulk
	                        - 0.69% _raw_spin_lock
	                           - do_raw_spin_lock
	                                __pv_queued_spin_lock_slowpath
	     13.73% memcpy_erms
	   - 2.22% kvfree

On this workload, that's almost a dozen CPUs all trying to compact and reclaim memory inside kvmalloc_node at the same time. Yet it is regularly falling back to vmalloc despite all that compaction, page and shrinker reclaim that direct reclaim is doing. Copying all the metadata is taking far less CPU time than allocating the storage!

Direct reclaim should be considered extremely harmful.

This is a high frequency, high throughput, CPU usage and latency sensitive allocation. We've got memory there, and we're using kvmalloc to allow memory allocation to avoid doing lots of work to try to do contiguous allocations.

Except it still does *lots of costly work* that is unnecessary.

Worse: the only way to avoid the slowpath page allocation trying to do compaction on costly allocations is to turn off direct reclaim (i.e. remove __GFP_DIRECT_RECLAIM from the gfp flags).

Unfortunately, the stupid kvmalloc API then says "oh, this isn't a GFP_KERNEL allocation context, so you only get kmalloc!". This cuts off the vmalloc fallback, and this leads to almost instant OOM problems which ends up in filesystems deadlocks, shutdowns and/or kernel crashes.

I want some basic kvmalloc behaviour:

 - kmalloc for a contiguous range with fail fast semantics
 - no compaction direct reclaim if the allocation enters the slow path
 - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails

The really, really stupid part about this is these kvmalloc() calls are run under memalloc_nofs task context, so all the allocations are always reduced to GFP_NOFS regardless of the fact that kvmalloc requires GFP_KERNEL to be passed in. IOWs, we're already telling kvmalloc to behave differently to the gfp flags we pass in, but it still won't allow vmalloc to be run with anything other than GFP_KERNEL.

So, this patch open codes the kvmalloc() in the commit path to have the above described behaviour.

The result is we more than halve the CPU time spent doing kvmalloc() in this path, and transaction commits with 64kB objects in them more than double. i.e. we get ~5x reduction in CPU usage per costly-sized kvmalloc() invocation and the profile looks like this:

	- 37.60% xlog_cil_commit
	     16.01% memcpy_erms
	   - 8.45% __kmalloc
	      - 8.04% kmalloc_order_trace
	         - 8.03% kmalloc_order
	            - 7.93% alloc_pages
	               - 7.90% __alloc_pages
	                  - 4.05% __alloc_pages_slowpath.constprop.0
	                     - 2.18% get_page_from_freelist
	                     - 1.77% wake_all_kswapds
	                          ....
	                        - __wake_up_common_lock
	                           - 0.94% _raw_spin_lock_irqsave
	                  - 3.72% get_page_from_freelist
	                     - 2.43% _raw_spin_lock_irqsave
	   - 5.72% vmalloc
	      - 5.72% __vmalloc_node_range
	         - 4.81% __get_vm_area_node.constprop.0
	            - 3.26% alloc_vmap_area
	               - 2.52% _raw_spin_lock
	            - 1.46% _raw_spin_lock
	           0.56% __alloc_pages_bulk
	   - 4.66% kvfree
	      - 3.25% vfree
	         - __vfree
	            - 3.23% __vunmap
	               - 1.95% remove_vm_area
	                  - 1.06% free_vmap_area_noflush
	                     - 0.82% _raw_spin_lock
	                  - 0.68% _raw_spin_lock
	               - 0.92% _raw_spin_lock
	      - 1.40% kfree
	         - 1.36% __free_pages
	            - 1.35% __free_pages_ok
	                 1.02% _raw_spin_lock_irqsave

It's worth noting that over 50% of the CPU time spent allocating these shadow buffers is now spent on spinlocks. So the shadow buffer allocation overhead is greatly reduced by getting rid of direct reclaim from kmalloc, and could probably be made even less costly if vmalloc() didn't use global spinlocks to protect its structures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
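The open-coded policy boils down to something like this sketch; the flag choice illustrates the description above and is not the exact xfs helper:

	static void *cil_shadow_alloc(size_t size)
	{
		/* Fail-fast kmalloc: no direct reclaim (hence no compaction),
		 * no retries, no allocation-failure warnings. */
		gfp_t flags = (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) |
			      __GFP_NORETRY | __GFP_NOWARN;
		void *p;

		p = kmalloc(size, flags);
		if (p)
			return p;

		/* Contiguous allocation gave up quickly; fall back to a normal
		 * GFP_KERNEL vmalloc (reduced to GFP_NOFS by the enclosing
		 * memalloc_nofs task context). */
		return vmalloc(size);
	}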
2022-01-06  xfs: sysfs: use default_groups in kobj_type  (Greg Kroah-Hartman)

There are currently 2 ways to create a set of sysfs files for a kobj_type: through the default_attrs field, and the default_groups field. Move the xfs sysfs code to use the default_groups field, which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type"), so that we can soon get rid of the obsolete default_attrs field.

Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-05  debugfs: lockdown: Allow reading debugfs files that are not world readable  (Michal Suchanek)

When the kernel is locked down, it allows reading only debugfs files with mode 444. Mode 400 is also valid but is not allowed. Make the 444 into a mask.

Fixes: 5496197f9b08 ("debugfs: Restrict debugfs when the kernel is locked down")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Link: https://lore.kernel.org/r/20220104170505.10248-1-msuchanek@suse.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
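The essence of the fix, as a sketch: test the permission bits as a mask rather than comparing for exactly 0444, so 0400 and 0440 are accepted too while any write or execute bit still blocks access:

	static bool lockdown_readable(const struct inode *inode)
	{
		/* Before: allowed only if the mode was exactly 0444. */
		/*	return (inode->i_mode & 07777) == 0444;		*/

		/* After: allow any subset of r--r--r-- (0444, 0440, 0400, ...). */
		return (inode->i_mode & 07777 & ~0444) == 0;
	}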
2022-01-05  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)

No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05  io_uring: remove redundant tab space  (GuoYong Zheng)

When showing fdinfo, SqMask is followed by two tab spaces, which is inconsistent with the other parameters. Remove one, so it lines up nicely.

Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
Link: https://lore.kernel.org/r/1641377585-1891-1-git-send-email-zhenggy@chinatelecom.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-05  io_uring: remove unused function parameter  (GuoYong Zheng)

Parameter res2 is not used in __io_complete_rw(); remove it.

Fixes: 6b19b766e8f0 ("fs: get rid of the res2 iocb->ki_complete argument")
Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
Link: https://lore.kernel.org/r/1641377522-1851-1-git-send-email-zhenggy@chinatelecom.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-05  block: move rq_list macros to blk-mq.h  (Keith Busch)

Move the request list macros to the header file that defines the struct they operate on.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220105170518.3181469-2-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-04  f2fs: remove redundant invalidation of compressed pages  (Fengnan Chang)

Compressed pages are also invalidated in the truncate-block process, so remove the redundant invalidation of compressed pages in f2fs_evict_inode().

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04  f2fs: Simplify bool conversion  (Yang Li)

Fix the following coccicheck warning:

./fs/f2fs/sysfs.c:491:41-46: WARNING: conversion to bool not needed here

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04  f2fs: don't drop compressed page cache in .{invalidate,release}page  (Chao Yu)

For a compressed inode, in .{invalidate,release}page we will call f2fs_invalidate_compress_pages() to drop all compressed page cache of the current inode.

But we don't need to drop compressed page cache synchronously in .invalidatepage, because all truncation paths of compressed physical blocks have been covered by f2fs_invalidate_compress_page().

And we also don't need to drop compressed page cache synchronously in .releasepage, because, if there is out-of-memory, we can count on page cache reclaim on sbi->compress_inode.

BTW, this patch may fix the issue reported below:
https://lore.kernel.org/linux-f2fs-devel/20211202092812.197647-1-changfengnan@vivo.com/T/#u

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04  f2fs: fix to reserve space for IO align feature  (Chao Yu)

https://bugzilla.kernel.org/show_bug.cgi?id=204137

With the below script, we will hit a panic during new segment allocation:

	DISK=bingo.img
	MOUNT_DIR=/mnt/f2fs

	dd if=/dev/zero of=$DISK bs=1M count=105
	mkfs.f2fs -a 1 -o 19 -t 1 -z 1 -f -q $DISK

	mount -t f2fs $DISK $MOUNT_DIR -o "noinline_dentry,flush_merge,noextent_cache,mode=lfs,io_bits=7,fsync_mode=strict"

	for (( i = 0; i < 4096; i++ )); do
		name=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 10`
		mkdir $MOUNT_DIR/$name
	done

	umount $MOUNT_DIR
	rm $DISK

--- Core dump ---
Call Trace:
 allocate_segment_by_default+0x9d/0x100 [f2fs]
 f2fs_allocate_data_block+0x3c0/0x5c0 [f2fs]
 do_write_page+0x62/0x110 [f2fs]
 f2fs_outplace_write_data+0x43/0xc0 [f2fs]
 f2fs_do_write_data_page+0x386/0x560 [f2fs]
 __write_data_page+0x706/0x850 [f2fs]
 f2fs_write_cache_pages+0x267/0x6a0 [f2fs]
 f2fs_write_data_pages+0x19c/0x2e0 [f2fs]
 do_writepages+0x1c/0x70
 __filemap_fdatawrite_range+0xaa/0xe0
 filemap_fdatawrite+0x1f/0x30
 f2fs_sync_dirty_inodes+0x74/0x1f0 [f2fs]
 block_operations+0xdc/0x350 [f2fs]
 f2fs_write_checkpoint+0x104/0x1150 [f2fs]
 f2fs_sync_fs+0xa2/0x120 [f2fs]
 f2fs_balance_fs_bg+0x33c/0x390 [f2fs]
 f2fs_write_node_pages+0x4c/0x1f0 [f2fs]
 do_writepages+0x1c/0x70
 __writeback_single_inode+0x45/0x320
 writeback_sb_inodes+0x273/0x5c0
 wb_writeback+0xff/0x2e0
 wb_workfn+0xa1/0x370
 process_one_work+0x138/0x350
 worker_thread+0x4d/0x3d0
 kthread+0x109/0x140
 ret_from_fork+0x25/0x30

The root cause here is that, with the IO alignment feature enabled, in the worst case we need F2FS_IO_SIZE() free blocks' worth of space for a single 4k write, because the IO alignment feature fills in dummy pages to keep IO aligned. So we can easily run out of free segments during a non-inline directory's data writeback, even in the process of foreground GC.

In order to fix this issue, I just propose to reserve additional free space for the IO alignment feature to handle the worst-case free space usage ratio during FGGC.

Fixes: 0a595ebaaa6b ("f2fs: support IO alignment for DATA and NODE writes")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04  f2fs: fix to check available space of CP area correctly in update_ckpt_flags()  (Chao Yu)

Otherwise, the nat_bits area may be persisted across the boundary of the CP area during nat_bits rebuilding.

Fixes: 94c821fb286b ("f2fs: rebuild nat_bits during umount")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04  f2fs: support fault injection for f2fs_trylock_op()  (Chao Yu)

This patch adds support for injecting faults into f2fs_trylock_op().

Usage:
 a) echo 65536 > /sys/fs/f2fs/<dev>/inject_type
or
 b) mount -o fault_type=65536 <dev> <mountpoint>

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04  f2fs: clean up __find_inline_xattr() with __find_xattr()  (Chao Yu)

Just cleanup, no logic change.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04  f2fs: fix to do sanity check on last xattr entry in __f2fs_setxattr()  (Chao Yu)

As Wenqing Liu reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215235

- Overview
page fault in f2fs_setxattr() when mounting and operating on a corrupted image

- Reproduce
tested on kernel 5.16-rc3, 5.15.X under root

1. unzip tmp7.zip
2. ./single.sh f2fs 7

Sometimes the script needs to be run several times.

- Kernel dump
loop0: detected capacity change from 0 to 131072
F2FS-fs (loop0): Found nat_bits in checkpoint
F2FS-fs (loop0): Mounted with checkpoint version = 7548c2ee
BUG: unable to handle page fault for address: ffffe47bc7123f48
RIP: 0010:kfree+0x66/0x320
Call Trace:
 __f2fs_setxattr+0x2aa/0xc00 [f2fs]
 f2fs_setxattr+0xfa/0x480 [f2fs]
 __f2fs_set_acl+0x19b/0x330 [f2fs]
 __vfs_removexattr+0x52/0x70
 __vfs_removexattr_locked+0xb1/0x140
 vfs_removexattr+0x56/0x100
 removexattr+0x57/0x80
 path_removexattr+0xa3/0xc0
 __x64_sys_removexattr+0x17/0x20
 do_syscall_64+0x37/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The root cause is that in __f2fs_setxattr() we missed doing a sanity check on the last xattr entry, resulting in an out-of-bounds memory access while updating inconsistent xattr data of the target inode. After the fix, it can detect such xattr inconsistency as below:

F2FS-fs (loop11): inode (7) has invalid last xattr entry, entry_size: 60676
F2FS-fs (loop11): inode (8) has corrupted xattr
F2FS-fs (loop11): inode (8) has corrupted xattr
F2FS-fs (loop11): inode (8) has invalid last xattr entry, entry_size: 47736

Cc: stable@vger.kernel.org
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
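The added check walks the entry list and verifies that every entry, including the last, terminates inside the xattr block; roughly like the following sketch, which uses f2fs's existing xattr macros but a hypothetical helper name:

	static int check_xattr_entries(struct inode *inode,
				       struct f2fs_xattr_entry *entry,
				       void *last_base_addr)
	{
		while (!IS_XATTR_LAST_ENTRY(entry)) {
			/* Both the entry header and the full entry must fit
			 * inside the block; otherwise the image is corrupt. */
			if ((void *)entry + sizeof(__u32) > last_base_addr ||
			    (void *)XATTR_NEXT_ENTRY(entry) > last_base_addr) {
				f2fs_err(F2FS_I_SB(inode),
					 "inode (%lu) has invalid last xattr entry, entry_size: %zu",
					 inode->i_ino, ENTRY_SIZE(entry));
				return -EFSCORRUPTED;
			}
			entry = XATTR_NEXT_ENTRY(entry);
		}
		return 0;
	}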
2022-01-04  f2fs: do not bother checkpoint by f2fs_get_node_info  (Jaegeuk Kim)

This patch tries to mitigate lock contention on nat_tree_lock between f2fs_write_checkpoint() and f2fs_get_node_info(). The idea is: if a checkpoint is currently running, other threads that try to grab nat_tree_lock would be better off waiting for the checkpoint to finish.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04  f2fs: avoid down_write on nat_tree_lock during checkpoint  (Jaegeuk Kim)

Let's cache the nat entry only if there's no lock contention.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>