Michael Meffie [Fri, 12 Jun 2015 00:28:43 +0000 (20:28 -0400)]
libafs: reset all the volumes with fs flushall
Fix a logic bug in fs flushall in which only the first volume in each
hash chain is reset (invalidated). Instead, reset all the volumes in
the volume hash.
Also, when flushing a single volume with fs flushvolume, don't bother
searching all the hash chains, instead start on the hash chain
containing the volume being flushed.
Change-Id: I7be67fdb310b4845d02dc916f4400f83cc649cb8
Reviewed-on: http://gerrit.openafs.org/11892 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Mark Vitale <mvitale@sinenomine.net> Reviewed-by: Chas Williams <3chas3@gmail.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Benjamin Kaduk [Mon, 9 Feb 2015 17:09:32 +0000 (12:09 -0500)]
pagsh: do not call set[ug]id()
Supposedly calling setuid(getuid()) and setgid(getgid()) would
help pick up a new group list on some systems, in the depths
of history. In the absence of reason to believe this is still
the case, drop the calls to avoid scary warnings about unchecked
return values.
Benjamin Kaduk [Mon, 9 Feb 2015 15:38:04 +0000 (10:38 -0500)]
Avoid unsafe scanf("%s")
Reading user input into a fixed-length buffer just to check the
first character is silly and an easy buffer overrun. gcc on
Ubuntu 13.03 warns about the unchecked return value for scanf(),
but scanf("%s") is guaranteed to either succeed or get EOF/EINTR/etc..
In any case, we don't need to use scanf() at all, here -- reuse an
idiom from BSD cp(1) and loop around getchar to read the user's
response, eliminating the fixed-length buffer entirely. A separate
initial loop is needed to skip leading whitespace, which is done
implicitly by scanf().
Change-Id: Ic5ed65e80146aa3d08a4b03c213f748ef088156b
Reviewed-on: http://gerrit.openafs.org/11758 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Chas Williams <3chas3@gmail.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Reviewed-by: Michael Meffie <mmeffie@sinenomine.net> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Benjamin Kaduk [Wed, 27 May 2015 20:13:13 +0000 (16:13 -0400)]
afs: Do not supply bogus poll vnodeops for FBSD
We currently provide one which just always returns 1, but the
kernel provides a vop_nopoll which conceptually is the same thing.
That one, however, provides some feature checks and fails when
consumers ask for fancy features that are not portable.
Benjamin Kaduk [Fri, 6 Feb 2015 19:15:11 +0000 (14:15 -0500)]
Ignore return values more harder
Building on Ubuntu 14.04 with gcc 4.8.2-19ubuntu1, we encounter
fatal warnings about unchecked return values in uss, which is
now always built, as of 00a33b26d74aa067086ddc340efb82184715857f.
Change-Id: I997dcb683e33902c2765121c70bdcf21e9d5e892
Reviewed-on: http://gerrit.openafs.org/11757 Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Chas Williams <3chas3@gmail.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Marc Dionne [Wed, 22 Apr 2015 18:06:12 +0000 (15:06 -0300)]
Linux: mmap: Apply recursion check only to recursion cases
The CPageWrite flag was originally added to prevent a scenario
where a thread doing "writepage" would realize that the cache
was too full and that some of its contents need to be written
back to the server. Before writing back it would ask the OS to
flush any dirty VM associated with the vcache entries that are
to be written, to make sure the data is not stale. This flush
could itself trigger writeback, leading to deadly recursion.
One such scenario is a process doing mmap writes to a file larger
than the cache.
With some kernel versions and some callers of writepage, this
can cause the mapping to be marked as being in an error state,
leading to EIO errors passed back to user space.
Make the recursion check more specific to only bail when the
calling thread is one that was originally seen writing. A list
of current writers is maintained instead of a single state flag.
This lets other threads (like the flusher thread) go on with
writeback to the same file, and limits the WRITEPAGE_ACTIVATE
return case to call sites that can deal with it.
In testing this helps avoid EIO errors when writing large
chunks of data through mmap.
Thanks to Yadav Yadavendra for extensive analysis and testing.
Marc Dionne [Mon, 20 Apr 2015 13:41:53 +0000 (10:41 -0300)]
Linux 4.1: Don't define or use ->write directly
We no longer have to define a ->write operation, and we can't
expect the underlying disk cache filesystem to have one. Use
the new __vfs_read/write helpers that will select the operation
to use based on what's available for that particular filesystem.
Reviewed-on: http://gerrit.openafs.org/11849 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Chas Williams <3chas3@gmail.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
(cherry picked from commit 5c1237432edf4600111845d175c92252430d5f76)
Change-Id: I21bca85637e07d0e03ef471896d0454eeef68a14
Reviewed-on: http://gerrit.openafs.org/11873 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Daria Brashear <shadow@your-file-system.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Marc Dionne [Mon, 20 Apr 2015 13:37:40 +0000 (10:37 -0300)]
Linux 4.1: No need for do_sync_read
Make the test here a bit more specific. do_sync_read no longer
exists, but we don't use it for new kernels. Trying to define it
here in terms of generic_file_read is not helpful as that doesn't
exist anymore.
Reviewed-on: http://gerrit.openafs.org/11848 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Chas Williams <3chas3@gmail.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
(cherry picked from commit fcfa5ae2468d878db962a93d6013fcd3042e6c13)
Change-Id: I87bf0fc856d244d15bdae300f0cd6b80ecb63797
Reviewed-on: http://gerrit.openafs.org/11872 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Daria Brashear <shadow@your-file-system.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Benjamin Kaduk [Wed, 20 May 2015 14:57:53 +0000 (10:57 -0400)]
afsio: switch BreakUpPath to strdup
The current version of BreakUpPath is slightly broken, since
commit 4e68282e26b0c4569d25d076d54274f0da47a691 -- it has two
output parameters but takes only one length parameter for the
size of the output buffers passed in. The callers ended up using
the shorter of the buffer lengths in question, so there is not
a risk of a buffer overrun, but long paths would not be properly
handled.
There is not really any need to pass in a length at all, since
what is going on is conceptually strdup, and there is no real
need to use strlcpy at all. Make the change from strlcpy to
str(n)dup, and adjust callers to free the outputs as appropriate.
While here, convert writeFile() to use goto and a cleanup handler
to avoid leaks.
Jeffrey Altman [Mon, 12 Nov 2012 03:00:07 +0000 (22:00 -0500)]
afsio: process windows file paths consistently
Windows file paths can use either '\' or '/' as a path
separator. libafscp on the other hand requires '/' and argv[0]
will always use '\'.
Introduce a new function ConvertAFSPath() which converts the
input path to '/' and converts \\afs to /afs. A future commit
should access the registry and make use of the NetbiosName and
MountRoot values to perform the conversion correctly.
Simon Wilkinson [Fri, 30 Mar 2012 18:35:51 +0000 (19:35 +0100)]
venus: Make clang happy with strlcpy use
clang now expects that strlcpy will always be used to prevent overflow
of the destination string, and gives a warning if the size parameter is
based solely on the length of the source string.
Modify the BreakUpPath function so that it takes the size of the
destination string as an argument, and uses this to limit the amount of
data pasted into it.
Ben Kaduk [Wed, 11 Feb 2015 22:47:10 +0000 (17:47 -0500)]
Remove spurious NULL checks
clang 3.5 is more aggressive about these checks than the previous
FreeBSD system compiler, so new warnings (which became errors)
appeared on FreeBSD 11-CURRENT.
In afs_dcache.c, checking &tdc->f for NULL-ness has no effect.
The struct fcache f member of struct dcache is an ordinary structure
element; its address will be the value of tdc plus the offset of
f within struct dcache, which will not be NULL even if tdc is NULL.
In ubik_db_if.c, udbHandle is a file-scope global and thus has
allocated storage; the address of a member variable will never
be NULL. The 0 it was compared against was spelled RX_SECIDX_NULL,
which shows the intended check, which is for the value of the
uh_scIndex member variable, not its address.
In afscp_server.c, srv->conns can never be NULL since conns is a member
variable of struct afscp_server (of array type, containing pointers
to struct rx_connection). Comparing the array member variable against
NULL is comparing the address of the array, which is never NULL since
it is not allocated separately from struct afscp_server.
In fssync-debug.c, state.vop->partName is never NULL because
common_volop_prolog always allocates for state.vop, and the
partName member variable of struct fssync_state is of array type,
and thus is not separately allocated from the containing structure.
Reviewed-on: http://gerrit.openafs.org/11739 Reviewed-by: Chas Williams - CONTRACTOR <chas@cmf.nrl.navy.mil> Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Tested-by: BuildBot <buildbot@rampaginggeek.com>
(cherry picked from commit fb499c2406450fa5dc423a0b038266d3b8e79e33)
Change-Id: I13799a3362508672136f8c603eabdfc0f3ee072d
Reviewed-on: http://gerrit.openafs.org/11843 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Benjamin Kaduk [Wed, 22 Apr 2015 17:43:43 +0000 (13:43 -0400)]
kauth: fix clock skew detection
Commit 5b3c1042969daec38ccb260e61d665eda0c713ea changed/removed some
uses of abs() on unsigned time values. While the previous use of abs()
was indeed incorrect, the result wasn't necessarily much better, even
though it built with recent compilers, since it only checked for skew
in one direction.
Define and use a macro to correctly evaluate the conditionals in 64-bit
precision, avoiding C's integer promotion rules which prefer unsigned types
(Date) to signed types of the same width (time_t on 32-bit systems).
Reviewed-on: http://gerrit.openafs.org/11850 Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
(cherry picked from commit 810f0ccd0354dac30af024ca7b5acf3ebabf5f4b)
Change-Id: I29337e1ecd410fcf7733408287930c50c055ff90
Reviewed-on: http://gerrit.openafs.org/11863 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Daria Brashear <shadow@your-file-system.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Ben Kaduk [Fri, 13 Feb 2015 14:47:20 +0000 (09:47 -0500)]
Fix incorrect uses of abs()
abs(3) is a function of one variable of type int returning int.
labs(3) is a function of one variable of type long returning long.
labs(3) should be used when the input is of type long, as in
kaprocs.c.
Calling anything from the abs(3) family on a variable of unsigned
type is a bogus type pun, and a logical operation which is a no-op.
(Unsigned values are never negative and thus the absolute value
function is the identity over the entire range of values representable
in an unsigned type.) Just remove the use of abs() for unsigned
values, as in kaprocs.c, krb_udp.c, and vldb_check.c
While in kaprocs.c, wrap a long line that was touched for the
conversion to labs(3), spell the argument to time(3) as NULL
instead of 0, remove unneeded parentheses, and correct the spelling
of "reserved".
Reviewed-on: http://gerrit.openafs.org/11745 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Chas Williams - CONTRACTOR <chas@cmf.nrl.navy.mil> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
(cherry picked from commit 5b3c1042969daec38ccb260e61d665eda0c713ea)
Change-Id: I82038e41346479dad39466907b95f2d7540f6258
Reviewed-on: http://gerrit.openafs.org/11842 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Marc Dionne [Wed, 22 Apr 2015 18:06:12 +0000 (15:06 -0300)]
Linux: mmap: Apply recursion check only to recursion cases
The CPageWrite flag was originally added to prevent a scenario
where a thread doing "writepage" would realize that the cache
was too full and that some of its contents need to be written
back to the server. Before writing back it would ask the OS to
flush any dirty VM associated with the vcache entries that are
to be written, to make sure the data is not stale. This flush
could itself trigger writeback, leading to deadly recursion.
One such scenario is a process doing mmap writes to a file larger
than the cache.
With some kernel versions and some callers of writepage, this
can cause the mapping to be marked as being in an error state,
leading to EIO errors passed back to user space.
Make the recursion check more specific to only bail when the
calling thread is one that was originally seen writing. A list
of current writers is maintained instead of a single state flag.
This lets other threads (like the flusher thread) go on with
writeback to the same file, and limits the WRITEPAGE_ACTIVATE
return case to call sites that can deal with it.
In testing this helps avoid EIO errors when writing large
chunks of data through mmap.
Thanks to Yadav Yadavendra for extensive analysis and testing.
Simon Wilkinson [Fri, 23 Mar 2012 21:26:14 +0000 (21:26 +0000)]
opr: Add new softsig implementation
Signals and pthreaded applications are a poor match. OpenAFS has had
the softsig system (currently in src/util/softsig.c) in an attempt to
alleviate some of these problems. However, that implementation itself
has a number of problems. It uses signal functions that are unsafe in
pthreaded applications, and uses pthread_kill within its signal
handlers. Over the years it has been responsible for a number of
portability bugs.
The old implementation continues to receive signals in the main thread
of the application. However, the handler code is run within a seperate
signal handler thread. When the main thread receives a signal a stub
handler is invoked, which simply pthread_kill()s the signal handler
thread.
The new implementation simplifies things by only receiving signals in
the handler thread. It uses only pthread-compatible signal functions,
and invokes no code from within async signal handlers.
Benjamin Kaduk [Wed, 20 May 2015 14:57:53 +0000 (10:57 -0400)]
afsio: switch BreakUpPath to strdup
The current version of BreakUpPath is slightly broken, since
commit 4e68282e26b0c4569d25d076d54274f0da47a691 -- it has two
output parameters but takes only one length parameter for the
size of the output buffers passed in. The callers ended up using
the shorter of the buffer lengths in question, so there is not
a risk of a buffer overrun, but long paths would not be properly
handled.
There is not really any need to pass in a length at all, since
what is going on is conceptually strdup, and there is no real
need to use strlcpy at all. Make the change from strlcpy to
str(n)dup, and adjust callers to free the outputs as appropriate.
While here, convert writeFile() to use goto and a cleanup handler
to avoid leaks.
Marc Dionne [Mon, 20 Apr 2015 13:41:53 +0000 (10:41 -0300)]
Linux 4.1: Don't define or use ->write directly
We no longer have to define a ->write operation, and we can't
expect the underlying disk cache filesystem to have one. Use
the new __vfs_read/write helpers that will select the operation
to use based on what's available for that particular filesystem.
Change-Id: Iab923235308ff57348ffc2dc6d718dd64040656b
Reviewed-on: http://gerrit.openafs.org/11849 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Chas Williams <3chas3@gmail.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Marc Dionne [Mon, 20 Apr 2015 13:37:40 +0000 (10:37 -0300)]
Linux 4.1: No need for do_sync_read
Make the test here a bit more specific. do_sync_read no longer
exists, but we don't use it for new kernels. Trying to define it
here in terms of generic_file_read is not helpful as that doesn't
exist anymore.
Change-Id: Iffb059716165436c3439e66db15002cdec5dfc16
Reviewed-on: http://gerrit.openafs.org/11848 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Chas Williams <3chas3@gmail.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Benjamin Kaduk [Wed, 22 Apr 2015 17:43:43 +0000 (13:43 -0400)]
kauth: fix clock skew detection
Commit 5b3c1042969daec38ccb260e61d665eda0c713ea changed/removed some
uses of abs() on unsigned time values. While the previous use of abs()
was indeed incorrect, the result wasn't necessarily much better, even
though it built with recent compilers, since it only checked for skew
in one direction.
Define and use a macro to correctly evaluate the conditionals in 64-bit
precision, avoiding C's integer promotion rules which prefer unsigned types
(Date) to signed types of the same width (time_t on 32-bit systems).
Perry Ruiter [Wed, 22 Apr 2015 16:58:48 +0000 (09:58 -0700)]
afsd: Update list of supported flags
afsd.c starts with a block comment listing the flags supported by the
afsd command. As the code has evolved this list has not been kept up
to date. Bring the list up to date. Some obsolete options no longer
have any backing code. These are marked OBSOLETE. Some obsolete
options have code that says they are now deprecated. These are
marked IGNORED.
Additionally fix a typo in backuptree's help text.
This change reworks numerous places which formerly used potentially
large on-stack buffers (of size AFSDIR_PATH_MAX) for constructing or
storing pathnames. Instead, these buffers are now allocated from the
heap, either by using asprintf() to build a pathname in a correctly
sized buffer or, where necessary, using malloc() to allocate a buffer
of size AFSDIR_PATH_MAX.
A few occurrances of AFSDIR_PATH_MAX-sized buffers are not changed;
these are generally either globals or are contained within another
data structure that is already allocated on the heap.
[kaduk@mit.edu convert to cleanup-handler memory management where
appropriate]
Change-Id: Ib1986187a1c467e867d50280aaf1d8a86d9108c8
Reviewed-on: http://gerrit.openafs.org/9985 Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Michael Meffie <mmeffie@sinenomine.net> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Benjamin Kaduk [Wed, 18 Mar 2015 17:11:44 +0000 (13:11 -0400)]
Mark Linux 2.4 as unsupported
The Linux 2.4 series (and older) will not be supported platforms
for OpenAFS 1.8 and later. Detect these systems at configure time
and direct users of those systems to the OpenAFS 1.6 series of releases.
These systems are believed to not be in common use with OpenAFS,
and retaining support for the LinuxThreads threading implementation
they require presents a maintenance burden that the project is
not equipped to deliver. The project will be able to move forward
more quickly by desupporting these systems.
Code conditional on these old systems can be removed in subsequent
commits.
Change-Id: I679fc2390b35851f3b0457a846047c812bc03dba
Reviewed-on: http://gerrit.openafs.org/11799 Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Reviewed-by: Chas Williams <3chas3@gmail.com> Reviewed-by: Daria Brashear <shadow@your-file-system.com> Tested-by: Daria Brashear <shadow@your-file-system.com>
Jeffrey Altman [Thu, 22 Jan 2015 06:14:28 +0000 (01:14 -0500)]
ubik: SDISK_Begin no quorum, wrong db, no transaction
When processing an DISK_Begin RPC verify that there is an active quorum
and that the local database is current. Otherwise, fail the RPC with
a UNOQUORUM error.
The returned error must be UNOQUORUM instead of USYNC becase the returned
error code will be returned by the coordinator's ContactQuorum_iterate()
to the client that triggered the write transaction. Most ubik clients
will only retry if the error is UNOQUORUM.
FIXES 131997
Change-Id: Icaa30e6aca82e7e7d33e9171a4f023970aba61df
Reviewed-on: http://gerrit.openafs.org/11689 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Daria Brashear <shadow@your-file-system.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Jeffrey Hutzelman <jhutz@cmu.edu> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
(cherry picked from commit d47beca13236c64ed935fabeff9d1001e8a8871f)
Reviewed-on: http://gerrit.openafs.org/11773 Reviewed-by: Chas Williams <3chas3@gmail.com> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Michael Meffie [Fri, 14 Nov 2014 03:28:08 +0000 (22:28 -0500)]
libafs: remove "Please install afsd with check server daemon" warning
Apparently, ancient versions of afsd did not start the check server
daemon (AFSOP_START_CS). The afs_Daemon tries to detect when the check
server daemon is not running and issues a warning to upgrade afsd. The
afs_Daemon waits for the cache initialization to complete (AFSOP_GO)
before detecting if the cache server daemon is started.
Unfortunately, when running with memcache, the cache initialization is
fast enough to race with the start of the check server daemon, and the
"Please install afsd with check server daemon" message is sometimes
printed to the syslog.
Since all modern versions of afsd do start the check server daemon, this
error message is no longer needed, so just remove the message and the
flag used to print it on only once.
Reviewed-on: http://gerrit.openafs.org/11602 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
(cherry picked from commit 8ce37d0d4aa4e6107f79efaf5027f31ea5a17604)
Change-Id: I292052c9ba629c85ddc4b76c4b3db7d54ce1d852
Reviewed-on: http://gerrit.openafs.org/11680 Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Andrew Deason [Tue, 10 Jun 2014 19:47:31 +0000 (14:47 -0500)]
doc: Document fs listquota 2TB partition limit
We have previously documented that volumes over 2TB can result in
inaccuracies, but this documentation does not say how the 'partition'
field in "fs listquota" can be inaccurate. It is confusing to see a
usage of 0% for a partition that you know is being used, so try to
briefly explain in what way this field is inaccurate.
The reason we _under_-report the partition usage is that the
fileserver actually gives back PartBlocksAvail and PartMaxBlocks (not
"blocks used" and "blocks total"). So 1TB used and 4TB total is
truncated to 2TB and given back as 2TB free and 2TB total. One we hit
3TB used we'll report it as 1TB free 2TB total (50%) when the actual
usage is 75%.
Reviewed-on: http://gerrit.openafs.org/11245 Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Tested-by: BuildBot <buildbot@rampaginggeek.com>
(cherry picked from commit cd8f24d9a1ba8563c6bef2b8d30885a753e8d30c)
Change-Id: I2bd72cca994414a88073d26d44bef49e9cac3be1
Reviewed-on: http://gerrit.openafs.org/11626 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Reviewed-by: Chas Williams <3chas3@gmail.com> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Stephan Wiesand [Wed, 1 Apr 2015 13:52:46 +0000 (15:52 +0200)]
Make OpenAFS 1.6.11.1
Update configure version strings for 1.6.11.1. Note that macos kext
can be of form XXXX.YY[.ZZ[(d|a|b|fc)NNN]] where d dev, a alpha,
b beta, f final candidate so we have no way to represent 1.6.11.1.
Switch to 1.6.12 dev 1 for macos.
Anders Kaseorg [Mon, 23 Feb 2015 04:43:49 +0000 (23:43 -0500)]
Treat Linux 4 (and greater) as Linux 2.6/3
In an age where Linux version numbers are determined by Google+ polls,
it’s clear that they aren’t going to be very useful for marking major
API compatibility boundaries like they were in the days of 2.2/2.4.
Reviewed-on: http://gerrit.openafs.org/11755 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de> Reviewed-by: Marc Dionne <marc.c.dionne@gmail.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
(cherry picked from commit a5b091e1ec69d4a43d6f1b1efc93134ef7ed2167)
Change-Id: I5b0da6b43e3cbf5d9a6fa883a09deccb359e53e9
Reviewed-on: http://gerrit.openafs.org/11760 Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Marc Dionne <marc.c.dionne@gmail.com> Reviewed-by: Daria Brashear <shadow@your-file-system.com> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Benjamin Kaduk [Wed, 25 Mar 2015 04:26:42 +0000 (00:26 -0400)]
FBSD: do not set -mno-align-long-strings
The new clang imported for FreeBSD 10.1 has stopped accepting
this argument as a no-op. Fix the kernel module build by
stopping passing it on the compiler command line.
Michael Meffie [Mon, 17 Nov 2014 16:23:38 +0000 (11:23 -0500)]
vos: remaddrs sub-command
Introduce the vos remaddrs sub-command for removing multi-homed server
entries from the vldb. The remaddrs sub-command completes the listaddrs
and setaddrs command suite and allows vos changeaddr to be deprecated
completely.
Benjamin Kaduk [Wed, 18 Mar 2015 14:35:33 +0000 (10:35 -0400)]
Do not redeclare mutexes for darwin
Partially revert commit e2e93aa8920c0b1bfc672a555a59eb4e15dbeaae,
which added local declarations for des_init_mutex, des_random_mutex,
and rxkad_random_mutex to a number of files in libadmin, apparently
to fix the build on macos 10.3. That OS is long EoL-ed, and
more recent versions of OS X include toolchains that do not
need these extra declarations. In particular, the extra declarations
can be harmful when these files start to pull in more symbols
from our libraries (e.g., libafscp), since the details of the
linking process can cause that to generate duplicate symbol errors.
There is no longer any need to have local declarations of these
symbols for OS X, so just remove them.
Mark Vitale [Mon, 9 Feb 2015 23:16:16 +0000 (18:16 -0500)]
pioctl.c: restore required result variable
Commit b9fb9c62a6779aa997259ddf2a83a90b08e04d5f refactored lpioctl()
so that LINUX would have its own implementation. This also simplified
the other lpioctl() implementations by removing superfluous variable
'rval'.
Unfortunately, 'rval' was actually required for both DARWIN and SUN511.
On both of these platforms, the address of 'errcode' is passed
to the respective ioctl_*() routine so its value may be passed back
to lpioctl(). Therefore, 'errcode' must not also be used for the
return value from these functions; doing so results in the return
value from the function overwriting the intended value of 'errcode' upon
return to lpioctl().
In the case of Solaris 11, ioctl_sun_afs_syscall() always returns zero
(as long as the ioctl device 'dev/afs' opened successfully).
So 'errcode' was always being set to zero, even if the pioctl had
actually failed. For example, without this fix, 'fs listcells'
loops forever on Solaris 11, listing an infinite number of "cells",
because it will never "see" the EDOM that informs it of the last defined
cell.
Ben Kaduk [Fri, 13 Feb 2015 14:47:20 +0000 (09:47 -0500)]
Fix incorrect uses of abs()
abs(3) is a function of one variable of type int returning int.
labs(3) is a function of one variable of type long returning long.
labs(3) should be used when the input is of type long, as in
kaprocs.c.
Calling anything from the abs(3) family on a variable of unsigned
type is a bogus type pun, and a logical operation which is a no-op.
(Unsigned values are never negative and thus the absolute value
function is the identity over the entire range of values representable
in an unsigned type.) Just remove the use of abs() for unsigned
values, as in kaprocs.c, krb_udp.c, and vldb_check.c
While in kaprocs.c, wrap a long line that was touched for the
conversion to labs(3), spell the argument to time(3) as NULL
instead of 0, remove unneeded parentheses, and correct the spelling
of "reserved".
Change-Id: I0897b250fd885a1230d1622015eec9afe3450b46
Reviewed-on: http://gerrit.openafs.org/11745 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Chas Williams - CONTRACTOR <chas@cmf.nrl.navy.mil> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
Ben Kaduk [Wed, 11 Feb 2015 22:47:10 +0000 (17:47 -0500)]
Remove spurious NULL checks
clang 3.5 is more aggressive about these checks than the previous
FreeBSD system compiler, so new warnings (which became errors)
appeared on FreeBSD 11-CURRENT.
In afs_dcache.c, checking &tdc->f for NULL-ness has no effect.
The struct fcache f member of struct dcache is an ordinary structure
element; its address will be the value of tdc plus the offset of
f within struct dcache, which will not be NULL even if tdc is NULL.
In ubik_db_if.c, udbHandle is a file-scope global and thus has
allocated storage; the address of a member variable will never
be NULL. The 0 it was compared against was spelled RX_SECIDX_NULL,
which shows the intended check, which is for the value of the
uh_scIndex member variable, not its address.
In afscp_server.c, srv->conns can never be NULL since conns is a member
variable of struct afscp_server (of array type, containing pointers
to struct rx_connection). Comparing the array member variable against
NULL is comparing the address of the array, which is never NULL since
it is not allocated separately from struct afscp_server.
In fssync-debug.c, state.vop->partName is never NULL because
common_volop_prolog always allocates for state.vop, and the
partName member variable of struct fssync_state is of array type,
and thus is not separately allocated from the containing structure.
Change-Id: I03e1332d8a3320f1a4d303b444985648a207116e
Reviewed-on: http://gerrit.openafs.org/11739 Reviewed-by: Chas Williams - CONTRACTOR <chas@cmf.nrl.navy.mil> Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Tested-by: BuildBot <buildbot@rampaginggeek.com>
Michael Meffie [Tue, 16 Dec 2014 21:13:01 +0000 (16:13 -0500)]
vlserver: do not perform ChangeAddr on mh entries, except for removal
Fix a long standing bug in the ChangeAddr RPC which damages the vldb,
When vos changeaddr is run with -oldaddr and -newaddr, and the -oldaddr
is present in an multi-homed entry, instead of changing the address in
the mh entry, the server slot is "downgraded" to a single homed entry
and the mh entry is orphaned in the vldb.
Instead, if the -oldaddr is in a multi-home entry, refuse to change the
address with a VL entry not found error and log the event.
Multi-homed addresses can be changed manually using the vos setaddrs
command which calls the RegisterAddrs() RPC.
Jeffrey Altman [Thu, 22 Jan 2015 06:14:28 +0000 (01:14 -0500)]
ubik: SDISK_Begin no quorum, wrong db, no transaction
When processing an DISK_Begin RPC verify that there is an active quorum
and that the local database is current. Otherwise, fail the RPC with
a UNOQUORUM error.
The returned error must be UNOQUORUM instead of USYNC becase the returned
error code will be returned by the coordinator's ContactQuorum_iterate()
to the client that triggered the write transaction. Most ubik clients
will only retry if the error is UNOQUORUM.
Anders Kaseorg [Mon, 23 Feb 2015 04:43:49 +0000 (23:43 -0500)]
Treat Linux 4 (and greater) as Linux 2.6/3
In an age where Linux version numbers are determined by Google+ polls,
it’s clear that they aren’t going to be very useful for marking major
API compatibility boundaries like they were in the days of 2.2/2.4.
Change-Id: I56e0e88eb178573c3eb280d5a5a01d8b8a20a363
Reviewed-on: http://gerrit.openafs.org/11755 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de> Reviewed-by: Marc Dionne <marc.c.dionne@gmail.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Benjamin Kaduk [Wed, 18 Feb 2015 20:52:30 +0000 (15:52 -0500)]
Merge branch 'master' into experimental
Pick up the spanish translation, the changelog from master, and
CellServDB updates. Some of the new patches will not apply against
this new upstream version, which will be addressed in a follow-up
commit.
Andrew Deason [Wed, 4 Feb 2015 16:25:38 +0000 (10:25 -0600)]
rx: Zero unitialized uio structs
We use some uio structures that were allocated on the stack, but we
only initialize them by initializing individual fields. On some
platforms (Solaris is one known example, but probably not the only
one), there are additional fields we do not initialize. Since we
cannot be certain of what any additional fields there may be, just
zero the whole thing.
This is basically the same change as
I0eae0b49a70aee19f3a9ec118b03cfb3a6bd03a3, but in the rx subtree.
Reviewed-on: http://gerrit.openafs.org/11711 Tested-by: Andrew Deason <adeason@sinenomine.net> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com> Reviewed-by: Daria Brashear <shadow@your-file-system.com>
(cherry picked from commit a762e6871ad6837ee126cec9e63d99388b4bf119)
Change-Id: Ie6a2cce500d6a0a7a09c305296f4b34d122d3108
Reviewed-on: http://gerrit.openafs.org/11714 Tested-by: BuildBot <buildbot@rampaginggeek.com> Tested-by: Andrew Deason <adeason@sinenomine.net> Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Chas Williams - CONTRACTOR <chas@cmf.nrl.navy.mil> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Marc Dionne [Thu, 18 Dec 2014 13:43:22 +0000 (08:43 -0500)]
Linux: d_splice_alias may drop inode reference on error
d_splice_alias now drops the inode reference on error, so we
need to grab an extra one to make sure that the inode doesn't
go away, and release it when done if there was no error.
For kernels that may not drop the reference, provide an
additional iput() within an ifdef. This could be hooked up
to a configure option to allow building a module for a kernel
that is known not to drop the reference on error. That hook
is not provided here. Affected kernels should be the early
3.17 ones (3.17 - 3.17.2); 3.16 and older kernels should not
return errors here.
[kaduk@mit.edu add configure option to control behavior, which
is mandatory on non-buildbot linux systems]
Reviewed-on: http://gerrit.openafs.org/11643 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Michael Laß <lass@mail.uni-paderborn.de> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
(cherry picked from commit 15260c7fdc5ac8fe9fb1797c8e383c665e9e0ccd)
Andrew Deason [Fri, 30 Jan 2015 19:29:57 +0000 (13:29 -0600)]
afs: Zero uninitialized uio structs
In several places in the code, we allocate a 'struct uio' on the
stack, or allocate one from non-zeroed memory. In most of these
places, we initialize the structure by assigning individual fields to
certain values. However, this leaves any remaining fields assigned to
random garbage, if there are any additional fields in the struct uio
that we don't know about.
One such platform is Solaris, which has a field called uio_extflg,
which exists in Solaris 11, Solaris 10, and possibly further back.
One of the flags defined for this field in Solaris 11 is UIO_XUIO,
which indicates that the structure is actually an xuio_t, which is
larger than a normal uio_t and contains additional fields. So when we
allocate a uio on the stack without initializing it, it can randomly
appear to be an xuio_t, depending on what garbage was on the stack at
the time. An xuio_t is a kind of extensible structure, which is used
for things like async I/O or DMA, that kind of thing.
One of the places we make use of such a uio_t is in afs_ustrategy,
which we go through for cache reads and writes on most Unix platforms
(but not Linux). When handling a read (reading from the disk cache
into a mapped page), a copy of our stack-allocated uio eventually gets
passed to VOP_READ. So VOP_READ for the cache filesystem will randomly
interpret our uio_t as an xuio_t.
In many scenarios, this (amazingly) does not cause any problems, since
generally, Solaris code will not notice if something is flagged as an
xuio_t, unless it is specifically written to handle specific xuio_t
types. ZFS is one of the apparent few filesystem implementations that
can handle xuio_t's, and will detect and specially handle a
UIOTYPE_ZEROCOPY xuio_t differently than a regular uio_t.
If ZFS gets a UIOTYPE_ZEROCOPY xuio_t, it appears to ignore the uio
buffers passed in, and supplies its own buffers from its cache. This
means that our VOP_READ request will return success, and act like it
serviced the read just fine. However, the actual buffer that we passed
in will remain untouched, and so we will return the page to the VFS
filled with garbage data.
The way this typically manifests is that seemingly random pages will
contain random data. This seems to happen very rarely, though it may
not always be obvious what is going on when this occurs.
It is also worth noting that the above description on Solaris only
happens with Solaris 11 and newer, and only with a ZFS disk cache.
Anything older than Solaris 11 does not have the xuio_t framework
(though other uio_extflg values can cause performance degradations),
and all known non-ZFS local disk filesystems do not interpret special
xuio_t structures (networked filesystems might have xuio_t handling,
but they shouldn't be used for a cache).
Bugs similar to this may also exist on other Unix clients, but at
least this specific scenario should not occur on Linux (since we don't
use afs_ustrategy), and newer Darwin (since we get a uio allocated for
us).
To fix this, zero out the entire uio structure before we use it, for
all instances where we allocate a uio from the stack or from
non-zeroed memory. Also zero out the accompanying iovec in many
places, just to be safe. Some of these may not actually need to be
zeroed (since we do actually initialize the whole thing, or a platform
doesn't have any additional unknown uio fields), but it seems
worthwhile to err on the side of caution.
Thanks to Oracle for their assistance on this issue, and thanks to the
organization experiencing this issue for their patience and
persistence.
1.6 note: This differs noticeably from the master commit in two
places:
- src/afs/NBSD/osi_vnodeops.c: On master there is no stack-allocated
uio struct here.
- src/afs/VNOPS/afs_vnop_write.c and afs_vnop_read.c: On master,
these code paths are structured quite differently, and are handled
in afs_osi_uio.c instead.
Reviewed-on: http://gerrit.openafs.org/11705 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Reviewed-by: Daria Brashear <shadow@your-file-system.com>
(cherry picked from commit 5ef1de5eddccce0e7b135bb9ed4decaa62fc19ce)
Change-Id: I8dbf60637774dff81ff839ccd78f58b3b1e85c5b
Reviewed-on: http://gerrit.openafs.org/11713 Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Andrew Deason [Fri, 30 Jan 2015 19:08:19 +0000 (13:08 -0600)]
SOLARIS: Avoid uninitialized caller_context_t
Currently we pass a caller_context_t* to some of Solaris' VFS
functions (VOP_SETATTR, VOP_READ, VOP_WRITE, VOP_RWLOCK,
VOP_RWUNLOCK), but the pointer we pass is to uninitialized memory.
This code was added in commit 51d76681, and this particular argument
is mentioned in
<https://lists.openafs.org/pipermail/openafs-info/2004-March/012657.html>,
where the author doesn't really know what the argument is for.
Over 10 years later, it's still not obvious what this argument does,
since I cannot find any documentation for it. However, browsing
publicly-available Illumos/OpenSolaris source suggests this is used
for things like non-blocking operations for network filesystems, and
is only interpreted by certain filesystems in certain codepaths.
In any case, it's clear that we're not supposed to be passing in an
uninitialized structure, since the struct has actual members that are
sometimes interpreted by lower levels. Other callers in
Illumos/OpenSolaris source seem to just pass NULL here if they don't
need any special behavior. So, just pass NULL.
I am not aware of any issues caused by passing in this uninitialized
struct, and browsing Illumos source and discussing the issue with
Oracle engineers suggest there would currently not be any issues with
the cache filesystems we would be using.
However, it's always possible that issues could arise from this in the
future, or there are issues we don't know about. Any such issues would
almost certainly appear to be non-deterministic and be a nightmare to
track down. So just pass NULL, to avoid the potential issues.
Reviewed-on: http://gerrit.openafs.org/11704 Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Daria Brashear <shadow@your-file-system.com>
(cherry picked from commit b9647ac1062509d6a3997ca575ab1542d04677a2)
Change-Id: I5d247cfa6ada3773d20e3938957dcc31c8664bb2
Reviewed-on: http://gerrit.openafs.org/11712 Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Reviewed-by: Andrew Deason <adeason@sinenomine.net> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com> Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Michael Meffie [Mon, 29 Sep 2014 16:14:24 +0000 (12:14 -0400)]
vos: preserve cloneId and backupId when restoring
Preserve the volume clone and backup ids in the volume header when
restoring over an existing volume, instead of always setting the clone
and backup ids to zero.
For example, before this change, restoring over a volume resets the
ROnly and Backup ids reported in the volume header section of vos
examine.
$ vos examine xyzzy
xyzzy 536871023 RW 3 K On-line
myhost /vicepa
RWrite 536871023 ROnly 536871024 Backup 536871025
...
RWrite: 536871023 ROnly: 536871024 Backup: 536871025
number of sites -> 2
server myhost partition /vicepa RW Site
server myhost partition /vicepa RO Site
$ cat /tmp/xyzzy.dump | vos restore myhost a xyzzy -overwrite incremental
Restoring volume xyzzy Id 536871023 on server myhost partition /vicepa .. done
Restored volume xyzzy on myhost /vicepa
$ vos examine xyzzy
xyzzy 536871023 RW 3 K On-line
myhost /vicepa
RWrite 536871023 ROnly 0 Backup 0
...
RWrite: 536871023 ROnly: 536871024 Backup: 536871025
number of sites -> 2
server myhost partition /vicepa RW Site
server myhost partition /vicepa RO Site
Benjamin Kaduk [Wed, 10 Dec 2014 19:07:14 +0000 (14:07 -0500)]
Handle backupDate of zero
In older versions of OpenAFS (prior to 2001), the backupDate was
never set. Try to provide somewhat more reasonable behavior in
this case, by using a different date in that case.
Andrew Deason [Thu, 30 Jan 2014 19:38:01 +0000 (13:38 -0600)]
libafscp: Remove comment with dead code
You're not supposed to write the length of the submitted data on the
split rx stream for a StoreData operation; the fileserver knows how
much data to read from the "Length" parameter of the StoreData RPC.
For a FetchData, putting the data length over the split rx stream is
required, since we can't get the "OUT" arguments before reading the
file data. But for a StoreData, this is unnecessary, since the length
is right there in the arguments.
So just get rid of this commented-out code; it's clearly wrong and
this commit explains why.
Andrew Deason [Thu, 30 Jan 2014 06:02:24 +0000 (00:02 -0600)]
rx: Set lastBusy on RX_CALL_TIMEOUT
Currently, if a server RPC hangs forever, the client call will error
out with RX_CALL_TIMEOUT (if idle/dead timeouts are configured). If we
later try to make a new call on that conn, the server will respond
with BUSY packets, and we'll have to wait until we RX_CALL_TIMEOUT
again. After that we'll set lastBusy and avoid the call channel, but
that extra delay with the BUSY packets is avoidable.
So, avoid this extra delay by setting lastBusy when we kill a call
with RX_CALL_TIMEOUT, so a future rx_NewCall will avoid the call
channel. It makes sense to set lastBusy here, since the call channel
is more likely to be busy than the other call channels.
Andrew Deason [Thu, 30 Jan 2014 06:40:57 +0000 (00:40 -0600)]
rx: Remove RX_CALL_BUSY
Commit 23d6287f7f494383891a497038e8c0e870e824bf introduced the
behavior where a client can immediately retry a call if it receives a
"busy" packet from the server (meaning, the call channel is already in
use). This happened via Rx returning the error code RX_CALL_BUSY, and
the caller was supposed to immediately retry the call, so Rx could
reissue the RPC on a different call channel.
However, this behavior makes it more likely for the server to process
an RPC that the client thinks has not been processed. Say the client
issues an RPC, the server replies with a "busy" packet, and the client
resends the original packet before it sees the "busy" packet. In
this case, the server will get the resent packet for the RPC request
and process it, but the client will think the call has failed (and
presumably will retry the call on a new channel). For calls that are
non-idempotent (e.g. MakeDir), this can result in incorrect errors
(e.g. EEXIST) as well as incorrect cache state in the client.
There may be some ways to mitigate at least some of the problems here,
but this kind of "instant" retry behavior is often not really that
helpful. Calls that take a very long time to run on the server are
very rare (and usually indicate some other problem), while the
occasional short-lived "busy" packet is relatively common (sometimes
the server just hasn't cleaned up the call by the time we issue a new
call). So just get rid of the retrying behavior to ensure we don't
continue to encounter any problems like this.
To get rid of this behavior, we remove the RX_CALL_BUSY code, and all
code dealing with processing it. This means removing the RX_CALL_BUSY
handling from the client, as well as removing
rx_SetBusyChannelError(). This effectively reverts most of 23d6287f7f494383891a497038e8c0e870e824bf, and a few other commits
related to RX_CALL_BUSY.
With this change, if all we get from the server are BUSY packets when
we try to issue an RPC, the call will eventually error out with
RX_CALL_TIMEOUT (or hang forever, if no timeouts are configured). This
can be thought of intuitively as similar to "idle dead" behavior,
since we are just waiting for the server to proceed with processing
the call. So, if "idle dead" is configured, we still timeout after the
"idle dead" timeout. And if no idle or hard dead timeout is
configured, we will hang forever; just like if the server started
processing the call but then hangs forever.
Note that not all of 23d6287f7f494383891a497038e8c0e870e824bf is
reverted. Namely, the logic to have rx_NewCall try to pick the "least
busy" channel is retained.
Thanks to Simon Wilkinson for bringing up and discussing this issue in
this thread:
<http://thread.gmane.org/gmane.comp.file-systems.openafs.devel/10931>
<https://lists.openafs.org/pipermail/openafs-devel/2013-April/019297.html>
Andrew Deason [Thu, 30 Jan 2014 06:39:39 +0000 (00:39 -0600)]
rx: Remove RX_CALL_IDLE
After change Ie0497d24f1bf4ad7d30ab59061f96c3298f47d17, RX_CALL_IDLE
is not generated by Rx anymore; "idle dead" timeouts just cause
RX_CALL_TIMEOUT errors. Any code dealing with it is thus now dead code
(this value was deliberately never sent over the wire), so remove the
dead code.
Andrew Deason [Thu, 30 Jan 2014 06:36:22 +0000 (00:36 -0600)]
rx: Remove idleDeadDetection
After change Ie0497d24f1bf4ad7d30ab59061f96c3298f47d17,
testing for idleDeadDetection is equivalent to testing if idleDeadTime
is non-zero. The idleDeadDetection field is thus redundant, so remove
it.
Andrew Deason [Mon, 27 Jan 2014 06:36:14 +0000 (00:36 -0600)]
rx: Rely on remote startWait idleness for idleDead
This commit removes the functionality introduced in c26dc0e6aaefedc55ed5c35a5744b5c01ba39ea1 (which is also modified by a
few later commits), as well as 05f3a0d1e0359f604cc6162708f3f381eabcd1d7. Instead we modify the
startWait check in rxi_CheckCall to apply to both "reading" and
"writing" to enforce "idle dead" timeouts.
Why do this? First, let's start out with the following:
If an Rx call gets permanently "stuck", what happens? What should
happen?
Here, "stuck" means that either the server or client hangs while
processing the call. The server or client is waiting for something to
complete before it issues the next rx_Read() or rx_Write() call. In
various situations over the years, this has happened because the
server or client is waiting for a lock, waiting for local disk I/O to
complete, or waiting for some other arbitrary event to occur.
Currently, what happens with such a "hanging" call is a little
complex, and has several different results in different situations.
The behavior of a call in this "stuck" situation is handled by the
"idle dead" timeout of an Rx call/connection. This timeout is enforced
in rxi_CheckCall, in two different conditionals (if an "idle dead"
timeout is configured):
The first of these handles the case where we are waiting to rx_Read()
from a call for too long (the other side of the call needs to give us
more data). The second handles the case where we are waiting to
rx_Write() for too long (the other side of the call needs to read some
of the data we sent previously).
This second case was added by commit c26dc0e6aaefedc55ed5c35a5744b5c01ba39ea1, but it has the general
problem that this check does not check if anyone is actually trying to
write to the call, and just tries to keep track of the last time we
wrote to the call. So, we may have written some data to the call
successfully, and then we went off to do something else. We can then
kill the call later for taking too long to write to, even though
nobody is trying to write to it. This results in a few problems:
(1) When the fileserver is writing to the client, it may need to wait
for various locks and it may need to wait for local disk I/O to
complete. If this takes too long for any reason, the fileserver
will kill the call (currently with VNOSERVICE), but the thread
for servicing the call will still keep running until whatever the
fileserver was waiting for finishes.
(2) lastSendData is set whenever we send any ACK besides an
RX_ACK_PING_RESPONSE (as of commit 658d2f47281306dfd46c5eddcecaeadc3e3e7fa9). If we are the server,
and we send any such ACK (in particular, RX_ACK_REQUESTED is
common), the "idle dead" timer starts. This means the server can
easily kill a call for idleness even if the server has never sent
the client anything, and even if the server is still actively
reading from the client.
(3) When a client tries to issue an RPC for the server, the "idle
dead" timeout effectively becomes a hard dead timeout, since we
will write the RPC arguments to the Rx stream, and then wait for
the server to respond with the output arguments. During this
time, our 'lastSendData' is the last time we sent our arguments
to the server, and so the call must finish before
'call->lastSendData + idleDeadTime' is in the past.
In addition to this "idle dead" processing, commit 05f3a0d1e0359f604cc6162708f3f381eabcd1d7 appears to attempt to provide
"idle dead"-like behavior by disabling Rx keepalives at certain points
(when we're waiting for disk I/O), controlled by the application
process (currently only the fileserver). The idea is that if
keepalives are disabled, the server will just appear unreachable to
the client, and so if disk I/O takes too long, the client will just
kill the call because it looks like the server is gone. However, this
also has some problems:
(A) Clients send their own keepalives, and the server will still
respond to them. So, the server will not appear to be
inaccessible anyway. But even if it did work:
(B) This approach only accounts for delays in disk I/O, and not
anywhere else (we could hang for a wide variety of reasons). It
also requires the fileserver to decide when it's okay for a call
to be killed due to "idle dead" and when it's not, which
currently seems to be decided somewhat arbitrarily.
(C) This doesn't really let the client dictate its own "idle dead"
timeout for idleness specifically; it just looks like the server
went away.
(D) The fileserver would appear to be unreachable in this situation,
but it's not actually unreachable. This can be confusing to
clients, since distinguishing between a server that is completely
down vs just taking too long is an important distinction.
(E) As noted in (1) above, the fileserver thread will still keep
waiting for whatever it has been waiting for, even though the
call has been killed and is thus useless.
So instead of all of this stuff, just modify the rxi_CheckCall "idle
dead" check to depend on the call->startWait parameter instead. This
parameter will be set whenever anyone is waiting for something to
proceed in the call, whether that is waiting to read data or write
data. This should make "idle dead" processing much simpler, as it is
reduced to effectively: if we've been waiting for longer than N
seconds, kill the call.
This involves ripping out much of the code related to lastSendData and
rx_KeepAlive*. This means removing the call->lastSendData field and
the rx_SetServerIdleDeadErr function, since those were only used for
the behavior in c26dc0e6aaefedc55ed5c35a5744b5c01ba39ea1. This also
means removing rx_KeepAliveOn and rx_KeepAliveOff, since those were
only used for the behavior in 05f3a0d1e0359f604cc6162708f3f381eabcd1d7. This commit also removes the
only known use of the VNOSERVICE error code, so add some comments
saying what this code was used for (especially since it is quite
different from other V* error codes).
Note that the behavior in (1) could actually be desirable in some
situations. In environments that have clients without "idle dead"
functionality, and those clients cannot be upgraded or reconfigured,
this commit means those clients may hang forever if the server hangs
forever. Some sites may want the fileserver to be able to kill such
hanging calls, so the client will not hang (even if it doesn't free up
the fileserver thread). However, such behavior should really be a
special case for such sites, and not be the default behavior (or only
behavior) for all sites. The fileserver should just be concerned with
maintaining its own threads and availability, and clients should
manage their own timeouts and handle hanging servers.
Thanks to Markus Koeberl, who originally brought attention to some of
the problematic behavior here, and helped investigate what was going
on in the fileserver.