Simon Wilkinson [Wed, 13 Jul 2011 13:44:39 +0000 (14:44 +0100)]
rxkad: Suppress warnings for ticket5.c
rxkad's ticket5.c includes v5gen.c, a generated file from Heimdal.
This file contains a load of set-but-unused variable warnings. As we
currently have no way of portably suppressing just these warnings,
turn off warnings-as-errors for ticket5.c
Simon Wilkinson [Wed, 13 Jul 2011 13:42:11 +0000 (14:42 +0100)]
volser: fix set-but-unused variable warning
restorevol reads the magic number from the dump, then does nothing
with it. Rather than not reading it at all, just mark the variable that
it is read into as unused to supress the compiler warning.
Throughout cm_server.c, input parameters to functions that
are protected by cm_serverLock are dereferenced by assignment
during variable initialization prior to the cm_serverLock being
obtained. As a result there is a race which can result in
either list corruption or dereferencing freed memory.
Simon Wilkinson [Wed, 13 Jul 2011 13:35:48 +0000 (14:35 +0100)]
vol: Initialise list before error exit when cloning
The inode list wasn't being initialised before the first call into the
error handler. This makes it possible that we end up trying to discard
items from an uninitialised list, with all the chaos that would cause.
Fix things so that this list is correctly set up.
Simon Wilkinson [Wed, 13 Jul 2011 13:33:57 +0000 (14:33 +0100)]
volser: Actually return errors from ListOneVolume
The return code from GetVolInfo was being thrown away, and success
returned to the caller, regardless of the success of this function.
As GetVolInfo's exit codes aren't suitable for sending over the wire,
just return ENODEV if this function returns failure.
Simon Wilkinson [Wed, 13 Jul 2011 13:31:15 +0000 (14:31 +0100)]
Mark nearInode as unused
When we're building an inode fileserver, we use the nearInode hint.
The IH_CREATE macro just throws this hint away if we're building namei,
which leads to compiler warnings about set-but-unused variables. Just
flag nearInode as being potentially unused in order to suppress these
warnings.
Simon Wilkinson [Wed, 13 Jul 2011 13:23:22 +0000 (14:23 +0100)]
Don't split int64s when we don't need to
Now that we're always using an int64, and never a hyper_t, to represent
64bit integers, we can just print them out and assign them using the
native tools, rather than having to call SplitInt64. Simplify our code
to do so, which also avoids some gcc-4.6.0 warnings.
Simon Wilkinson [Wed, 13 Jul 2011 13:02:54 +0000 (14:02 +0100)]
uss: Remove unused variables
Remove assorted unused variables, both those used to capture error
returns, and so unused (but initialised) string pointers, to make
gcc 4.6.0 happier.
Simon Wilkinson [Wed, 13 Jul 2011 13:00:30 +0000 (14:00 +0100)]
libadmin: Remove unused error codes
A number of functions in the libadmin vos implementation set up
error values, and assign them to 0, but never actually use them
for anything (either further assignment, or returning to the user)
So, just remove these unecessary variables, and make gcc 4.6.0 a
little happier.
Simon Wilkinson [Wed, 13 Jul 2011 12:59:05 +0000 (13:59 +0100)]
bozo: Remove unused error codes assignments
This removes a couple of unreported error code assignments. Firstly,
the return from 'setsid' was being assigned to 'ec' and promptly
ignored, and secondly, the response from SendNotifierData was
being ignored. As there is nothing sensible to do with these error
codes, just ignore them properly.
Simon Wilkinson [Wed, 13 Jul 2011 12:57:12 +0000 (13:57 +0100)]
afsmonitor: Fix set-but-unused variable warnings
Tidy up the afsmonitor code to remove gcc 4.6.0's set-but-unused
variable warnings. These are all assignments to error code
values which are never checked, or reported.
Simon Wilkinson [Wed, 13 Jul 2011 12:55:39 +0000 (13:55 +0100)]
libafs: Remove support for length optimisation
At one point afs_StoreAllSegments had an optimisation to speed up
stores. However, that optimistation used the chunkLength without
taking appropriate locks, and was disabled. The variable assignments
which still exist from this code cause errors with gcc 4.6.0, so just
remove them.
Simon Wilkinson [Wed, 13 Jul 2011 12:48:07 +0000 (13:48 +0100)]
libafs: Remove unused DNLC LRU code
The LRU code in osi_dnlc_lookup has never been enabled in OpenAFS,
and causes compilation errors with gcc 4.6.0 - just remove the unused
code and its associated variables.
Simon Wilkinson [Wed, 13 Jul 2011 12:45:33 +0000 (13:45 +0100)]
libafs: Remove unused NAT markeddown code
Remove unused code which used to retry once when a server was
marked down due to a bad NAT. This code has never been enabled
in OpenAFS, and causes compile errors with gcc 4.6.0
util: introduce a common interface for setting thread names
A previous change added support for setting thread names/titles to
viced; this change moves the #ifdef spaghetti to src/util in
preparation for calling it from other places where it would be
useful. Two functions are defined, one for setting an arbitrary
thread's name (as might be done by the spawning thread) and one
for setting the current thread's name; the latter is also defined as
a macro for non-pthreads compilations so that it can be called
unconditionally (the interface does not reference any
pthread-specific data types). Note that some platforms, Mac OS X
in particular, do not allow setting the name of a different thread.
The two functions are defined as no-ops for Windows as our pthreads
emulation layer for Windows does not provide the needed mechanism.
Simon Wilkinson [Wed, 13 Jul 2011 10:53:57 +0000 (11:53 +0100)]
Add make dist and make srpm targets
Add targets to generate distribution tarballs, and srpms, from a tree.
These will generate packages for whatever the current HEAD of the tree
is - if the HEAD is a release tag, then the packages will be named for
that release, if the HEAD is between releases, then git describe will
be used to create an appropriate version identifier.
The tarballs are generated from the current git repository contents,
anything not checked in will not be included.
Remove pre-existing assert macro in hcrypto header.
The config.h header for hcrypto defines an assert macro for
use by RX. OpenBSD already has an assert macro definition so
this new one causes screaming by the compiler about
re-definition. This patch adds the directives to remove any
pre-existing definition of assert, if one exists, prior to
defining the new one.
Simon Wilkinson [Tue, 12 Jul 2011 00:45:10 +0000 (01:45 +0100)]
rpms: Fix handling of x86 architectures
Once upon a time, our specfile would assume that if you were
building for i386 you were building userspace, and that i586 or i686
implied doing a kernel only build. This is no longer the case, and
now everything on modern Fedora is built for i686, so we should adapt
the spec file for this.
Windows: always open dscp in smb_ReceiveNTTranCreate
There were two code paths in smb_ReceiveNTTranCreate that included
asserts in case the directory cm_scache_t object had not been
evaluated. RT129299 contains a report that at least one of
them had been tripped in production. There is no reason to avoid
evaluating the directory scp. It must exist in the cache and
obtaining a reference in all cases simplifies the logic of this
overly complex function.
viced: If platform supports setting a thread title, do so
Some pthread libraries support setting a name or title for individual
threads (analogous to setproctitle() for processes). This can be useful
for debugging and is sometimes published for use by utilities like ps
(again like setproctitle() for processes). The two most common variants
of this have the same signature with slightly different function names.
If either one is present, use it in viced (which already assigns a thread
name when compiled for LWP but ignores it in pthreads compilations).
Create a new lock daemon thread which performs regular
cm_LockCheck() calls. If a lock is deleted check the cm_scache_t
to see if the matching file server lock should be dropped. If yes,
drop it.
This effectively caches file server locks for two seconds after
they are released to provide a chance for subsequent local lock
requests on the same file to avoid a file server RPC. It also
ensures that windows processes do not thrash the file server and
force callback breaks.
OpenBSD: Add <sys/queue.h> header for <sys/lockf.h>
On OpenBSD, the <sys/lockf.h> header requires the TAILQ_* macros
which are defined in <sys/queue.h>. The latter is not automatically
included by <sys/lockf.h> . This patch makes sure that it is
available by putting it into the OpenBSD-specific param.h files
(so as not to impact any other OS).
Make viced.c look more like other source files by indenting nested
preprocessor directives. In a few case it made more sense to
eliminate the nesting. This should otherwise be a whitespace-only
chnage.
Change-Id: I895ea2f754f90a15daa73cea24d3da9576fff9c9
Reviewed-on: http://gerrit.openafs.org/4959 Reviewed-by: Simon Wilkinson <sxw@inf.ed.ac.uk> Tested-by: Simon Wilkinson <sxw@inf.ed.ac.uk> Reviewed-by: Derrick Brashear <shadow@dementia.org>
When rx was converted to use pthreads, the code that allocates
a call to a connection channel in rxi_ReceivePacket() was not
made thread safe. The code prior to this patchset permitted a race
in the server connection case. The rx_connection channel assignment
in rxi_ReceivePacket() and the call destruction in rxi_FreeCall()
and rxi_DestroyConnectionNoLock() did not consistently protect the
rx_connection channel array using the conn_call_lock.
This race could result in rxi_ReceivePacket() operating on a
rx_call which was disconnected from the previously assigned
rx_connection.
In addition, the code in rxi_ReceivePacket() that was intended
to protect the allocation of a call using rxi_NewCall() to the
connection channel array was racy with itself.
This patchset consistently applies the conn_call_lock to protect
the allocation / deallocation of calls to the connection channel
array and in the process simplifies the logic in rxi_ReceivePacket()
as it is no longer necessary to protect against a null call pointer
since the race can no longer be lost.
Andrew Deason [Tue, 22 Sep 2009 20:45:09 +0000 (15:45 -0500)]
Add documentation for AFS::ukernel
Add some documentation explaining some of the minor quirks in
AFS::ukernel; specifically how some functions look a bit different
than in plain libuafs.
Windows: Refactor cm_Unlock*() to avoid code duplication
cm_Unlock() and cm_UnlockByKey() duplicate a significant amount
of code. Refactor it into a new static function, cm_IntUnlock()
which handles the process of downgrading or releasing a file
server lock depending upon the lock state of the cm_scache_t
object.
Windows: Do not probe new servers from cm_UpdateVolumeLocation
cm_NewServer() can result in a call to cm_UpdateVolumeLocation()
if a server probe is performed. In order to avoid recursive
calls to cm_UpdateVolumeLocation() do not probe new servers from
within cm_UpdateVolumeLocation().
Andrew Deason [Wed, 6 Jul 2011 19:21:53 +0000 (14:21 -0500)]
afs: Consolidate afs_calc_inum
Instead of having two separate afs_calc_inum functions, just have one
afs_calc_inum, and split off the md5 inode code into its own function
under a LINUX20 ifdef.
Andrew Deason [Tue, 5 Jul 2011 22:13:57 +0000 (17:13 -0500)]
uss: Suppress more warnings from lex.yy.c
Specify -Wold-style-definition when compiling lex.yy.c. This allows us
to compile when --enable-checking is specified and our lex generates
code with old-style function definitions.
Andrew Deason [Wed, 22 Jun 2011 18:44:38 +0000 (13:44 -0500)]
afs: Ensure afs_calc_inum yields nonzero ino
afs_calc_inum can currently yield an inode of 0 if MD5-based inode
numbers are turned on. Some userspace applications (and for some
platforms, maybe even the kernel) make certain assumptions about the
inode number for a file; in particular for example, 'ls' will not
display a file with inode 0 in a normal directory listing.
So, read the md5 digest until we get a non-zero result. Fall back to
the non-md5 calculation if we still somehow end up with a 0.
While this case may at first glance seem to be extremely rare, in
practice it can occur, as the current calculation for volume 538313506, vnode 26178 does actually yield a 0.
Andrew Deason [Fri, 1 Jul 2011 21:58:06 +0000 (16:58 -0500)]
vol: Don't always FDH_REALLYCLOSE on linktable ops
If we dec a linktable entry or get a free tag from the link table,
there is no reason to FDH_REALLYCLOSE the linktable fd handle.
FDH_REALLYCLOSE is the same as FDH_CLOSE, except that it tells the
ihandle package that the file handle will not be used again soon. If
we dec a linktable entry or get a free tag, there is no reason to
think that, so just FDH_CLOSE the handle instead.
Andrew Deason [Fri, 1 Jul 2011 19:25:05 +0000 (14:25 -0500)]
DAFS: Do not clear salv state on fssync salvage
When a volume is put into an error state via the FSYNC_VOL_FORCE_ERROR
command, we clear the salvage state informaton on it, since we're
forcing it offline and thus inaccessible. However, if we are forcing
it to an error state because the volume needs salvaging, we just
salvage it. In this case, do not clear the salvage state, since we
need to know if we've already requested or scheduled a salvage so we
can correctly keep track of the number of salvages performed.
.epub is generated using dbtoepub which is still considered alpha
software apparently and installed in a non-standard place. for now,
use the docbook stylesheet location to find it. .mobi is generated
using kindlegen from the .epub in order to have a real toc.
there is some preprocessing with a custom stylesheet to make
things "look right". see mobi-fixup.xsl.in
Andrew Deason [Wed, 29 Jun 2011 18:51:22 +0000 (13:51 -0500)]
SOLARIS: Granular multiPage detection
Currently, a struct vcache has a multiPage counter, indicating how
many afs_getpage requests are in-flight for that vcache that involve
retrieving multiple pages. Any dcache associated with such vcaches are
then avoided when choosing dcache entries to evict from the cache,
since we may deadlock when trying to evict a dcache entry from one of
the earlier afs_GetOnePage calls in a particular afs_getpage request.
This behavior can cause the client to become unusable if the cache
becomes full, and the only items in the cache are dcache entries in a
file that has an in-flight multi-page afs_getpage request. Since, in
that case, we cannot kick out any entries from the cache, and so we
wait forever to wait for the cache utilization to go down.
To prevent this from occurring, record exactly which ranges in the
file have in-flight multi-page afs_getpage requests, and just avoid
dcache entries in those ranges. This way afs_GetDownD can evict dcache
entries in the same file, but still avoid entries that would cause a
deadlock.
Also add some comments explaining this situation a bit more.
Andrew Deason [Fri, 24 Jun 2011 21:23:13 +0000 (16:23 -0500)]
Remove nonsensical bozon-lock defines
Currently there are two preprocessor defines related to bozon locks:
AFS_BOZONLOCK_ENV, and AFS_NOBOZO_LOCK. The former creates the pvnLock
member of a struct vcache, and controls calls to e.g. afs_BozonLock in
cross-platform code. The latter, if defined, turns calls to e.g.
afs_BozonLock into no-ops.
It doesn't make any sense to have both of these, since if
AFS_BOZONLOCK_ENV and AFS_NOBOZO_LOCK are defined, the pvnLock member
exists but is never used, since afs_BozonLock &co are no-ops. On
Solaris, the only platform where AFS_NOBOZO_LOCK is currently defined
(DUX used to define it before DUX was dropped), this is the case.
So to make things a bit more clear, get rid of the AFS_NOBOZO_LOCK
define, and just use AFS_BOZONLOCK_ENV to dictate whether we do
anything with bozon locks (ppc_darwin_80 appears to be the only
platform at this time).
Remove AFS_BOZONLOCK_ENV from Solaris param files, since it doesn't
use bozon locks. Remove all references to pvnLock in Solaris-specific
code.
Andrew Deason [Fri, 24 Jun 2011 20:25:46 +0000 (15:25 -0500)]
Remove support for Solaris pre-8
Remove support for all Solaris and SunOS platforms prior to Solaris 8,
since Solaris 7 reached end-of-life in August of 2008. Remove all
non-documentation references to sunx86_57 and earlier, sun4x_57 and
earlier, and AFS_SUN57_ENV and earlier.
References to AFS_SUN58_ENV have been changed to AFS_SUN5_ENV where
appropriate, and AFS_SUN5_ENV now implies Solaris 8.
AFS_SUN57_64BIT_ENV has been renamed to AFS_SUN5_64BIT_ENV.
Jeff Blaine [Thu, 16 Jun 2011 23:58:49 +0000 (19:58 -0400)]
Introduce TAP tests of man pages for command_subcommand
Introduces the first batch of man page testing as part of
the TAP tests. We would like to fail, for example, when
someone has added a new command to vos but not AHEM documented
it.
For now, the tests consist of checking to ensure that for
every subcommand listed in the output of "command help"
(e.g. vos help), fail the test if there is not a man page
for those (e.g. vos_delentry.1 etc).
Copy any of the -man-t tests and edit to make a new one
All tests make use of a simple new Perl library stored
in tests-lib/perl5 (a new area, not just named 'lib'
because I didn't want it to be confused with a s test
for a 'lib' in the src).
Jeff Blaine [Fri, 17 Jun 2011 21:24:54 +0000 (17:24 -0400)]
Styling tweak for generated HTML man pages
Prefer "Georgia" as a typeface over the less readable
Times New Roman, but with Times New Roman then "serif"
as fallbacks. Georgia is available everywhere.
Provide 10px top/bottom and 30px left/right margin on
the main body for readability. Margins are good.
Jeffrey Altman [Tue, 28 Jun 2011 13:35:02 +0000 (09:35 -0400)]
rx: race in rx_multi processing
multi_Init() registers an arrival procedure which is called when
the first response packet for the call arrives. If the call times
out the multi_Body loop will call rx_EndCall() and then set
multi_h->calls[multi_i] to NULL. If the first data packet of the
call arrives before rx_EndCall() is executed, then the arrival
procedure, multi_Ready(), will be executed adding the call to the
firstNotReady list. When the multi_Body loop attempts to process
the call from the firstNotReady list it attempts to dereference
the NULL multi_call. This race was introduced by be4abb4ec83a47477b254f2b3375742c4efbb063.
multi_h->calls[multi_i] is set to NULL as an indicator to
multi_Finalize() that rx_EndCall() has already been processed
for the call. When rx_EndCall() is executed the arrival
procedure is cleared.
If rx_EndCall() has already been processed, the fact that
the arrival procedure has been executed must be ignored. Add
an additional check in multi_Body for a non-NULL call pointer
to skip the startProc and rx_FlushWrite processing on the
no longer existent call.
Note that it is not safe to hold onto the call reference after
rx_EndCall() has been processed since the call slot may be
reused for a new RPC before the multi processing on all calls
is complete.
Jeff Blaine [Thu, 9 Jun 2011 00:56:46 +0000 (20:56 -0400)]
Change -n to -dryrun for backup subcommands
Change -n to -dryrun for "don't do it, show it though" operation
to be in line with agreement on -dryrun in place of -noexecute
or -n. Updated man page POD sources to reflect the changes
and updated README to remove these specific todo line items.
Jeffrey Altman [Mon, 30 May 2011 04:15:43 +0000 (00:15 -0400)]
vos: refactor ListAddrs
refactor ListAddrs to be more readable. Clarify that -uuid and
-host cannot be issued at the same time. Rename 'nentries' to
'max_index' so it is clear that ubik_VL_GetAddrs() is issued
to set an upper-bound for the number of subsequent ubik_VL_GetAddrsU()
calls that are issued when neither -host nor -uuid are specified.
Jeffrey Altman [Mon, 27 Jun 2011 13:31:54 +0000 (09:31 -0400)]
Windows: MergeStatus before SyncOpDone
cm_SyncOp/cm_SyncOpDone is used to synchronize the RPC processing
to ensure that calls which are in conflict cannot occur at the
same time but also to ensure that the ordering of operations
is consistent. cm_MergeStatus() was in many cases executed after
cm_SyncOpDone() removed the synchronization barrier which in turn
permitted status information to be applied out of order. Side
effects could have included data loss due to client side file
truncation. More commonly two StoreData RPCs would have their
status information applied out of order forcing the cache manager
to invalidate all of the cached data for the file.
Jeffrey Altman [Thu, 23 Jun 2011 21:51:22 +0000 (17:51 -0400)]
Windows: TRANS2_FIND_FIRST2 for _._AFS_IOCTL_._
smb_T2SearchDirSingle() must not fail directory search requests
for the _._AFS_IOCTL_._ file. Although this file does not actually
exist, it is successfully processed by CreateFile operations.
Therefore, an explicit search for it should return a valid answer.
Jeffrey Altman [Fri, 24 Jun 2011 03:49:32 +0000 (23:49 -0400)]
Windows: Fix SMB_COM_NEGOTIATE for MS11-043
MS11-043 adds response validation for SMB_COM_NEGOTIATE messages
received by the SMB Redirector. OpenAFS failed to properly specify
a Challenge and DomainName in the response when the security mode
is SMB_AUTH_NONE (or share with password). This patchset corrects
smb_ReceiveNegotiate() so that it adheres to the protocol specification.
Revert "Rx: When call receive is done, send ack all packet"
This reverts commit 3cd3715e608b801b4848399e42cb47464e6e3cc3,
which replaces an ack with an ackall; ackall processing does
not actually mark all packets acked when it is received, so
it is insufficient.
It would seem xsltproc -> fop -> pdf is the "modern" way to generate
pdf from docbook now. The hard part is finding the stylesheets.
This should work for fedora, sles and debian. Additionally, it brings
some consistency--xsltproc for all the conversions. You can still
override via configure options if you prefer something else.
"local" links to section heads inside the same pod page should be written
L</OPTIONS> instead of L<OPTIONS>. the other broken links are assorted
typos and capitalization changes.
Andrew Deason [Tue, 21 Jun 2011 21:25:14 +0000 (16:25 -0500)]
DAFS: Do not attach a specialStatus'd vol
If we encounter a preattached volume during GetVolume, we currently
ignore vp->specialStatus before trying to attach. However, we will
generally always fail to attach due to a conflicting vol op, but even
if we don't, GetVolume always returns an error later on if
vp->specialStatus is set. So, same some processing and attempted
attachments by bailing out sooner if vp->specialStatus is set.
Andrew Deason [Tue, 21 Jun 2011 23:08:21 +0000 (18:08 -0500)]
salvager: Clear summary in RecordHeader
Not every field in the summary header in RecordHeader is set, leaving
some used uninitialized when we copy to the given volumeSummaryp (like
'deleted'). Zero out the header before we do anything.
Andrew Deason [Tue, 21 Jun 2011 22:51:32 +0000 (17:51 -0500)]
Build a separate copy of vlib for dasalvager
Currently dasalvager links to vlib.a. But vlib.a is built without any
DAFS defines, and so the size of a struct DiskPartition64 is different
(since dasalvager is built with AFS_DEMAND_ATTACH_UTIL). Build our own
copies of the volume package files instead, with
AFS_DEMAND_ATTACH_UTIL defined.
Andrew Deason [Tue, 21 Jun 2011 23:33:16 +0000 (18:33 -0500)]
dir: Fix DRead
DRead was missing a return statement in one of the cases where we
found the buffer we were looking for, so we locked the buffer but kept
looking. Return it instead.
Andrew Deason [Tue, 21 Jun 2011 19:58:42 +0000 (14:58 -0500)]
vol: Do not overwrite specialStatus in attach2
attach2 wants to set specialStatus to VBUSY in certain conditions
(such as, it discovers a conflicting vol op where VVolOpSetVBusy_r is
true). However, specialStatus may already be set to something else,
like VMOVED if the volume is being moved off of the server. This can
happen if the volserver has checked out and FSYNC_VOL_MOVE'd a
preattached volume but hasn't deleted or checked the volume back in
yet.
So, if specialStatus is already set, don't touch it, so we don't start
reporting VBUSY errors to clients when we should be reporting VMOVED,
or some other error code previously set.
Simon Wilkinson [Sat, 18 Jun 2011 14:50:08 +0000 (15:50 +0100)]
rx: Exit fast restart on non-duplicate ACK
The current code only exits fast restart when we receive an ACK
packet that contains no missing chunks at all. On a network that is
dropping a reasonable chunk of its packets, this means that we spend
most of the call in fast recovery. (I originally found this by running
with the intentionally drop packets feature set to 10%)
TCP's fast retransmit behaviour is that we stay in fast recovery until
we receive our first non-duplicate acknowledgement. In TCP that means an
acknowledgement that moves the window. In RX, it is an acknowledgment
that ACKs a new packet.
Simon Wilkinson [Sat, 18 Jun 2011 12:17:07 +0000 (13:17 +0100)]
rx: Don't limit the # of packets sent in recovery
The RX transmit engine limits the number of packets sent whilst in
loss recovery to one per invocation of the transmit engine. As the
engine cannot be called by the application thread whilst in recovery,
this means that we end up being limited to one packet per ACK received,
which means that despite a growing congestion window we'll only send
one packet per RTT (in effect, a congenstion window of 1).
This will remain the case until we exit recovery, and all of a sudden
can send a large number of packets. If this is larger than the current
capacity of the network, we'll probably end straight back in recovery
again.
Let the congestion window do its job, by removing this arbitrary limit.
Simon Wilkinson [Sat, 18 Jun 2011 12:01:35 +0000 (13:01 +0100)]
rx: Don't wait for TQ busy when entering recovery
Two different threads can cause a call to enter recovery. The event
thread will move a call into recovery as a result of a timeout, or
the listener thread will move it there following a fast retransmit.
In both of these cases, recovery looks different. In the case of
a timeout, we enter slow start, starting as if we were begininning
transmission for the first time. Following fast retransmit, we enter
fast recovery, with different starting parameters than those coming
from slow start.
As a reslt, the current behaviour, where either call sitting in
FAST_RECOVERY_WAIT causes the other to simply return is inappropriate.
Further investigation indiciates that FAST_RECOVER_WAIT is actually
uncessary. There is no harm caused to a thread which is currently
blocked on the network in the middle of a transmit, in adjusting the
window size underneath it. As both of these states collapse the window,
that thread will simply cease sending earlier.
So, simplify the code, and remove the potential race between event and
listener by removing the FAST_RECOVER_WAIT state.
Simon Wilkinson [Sat, 18 Jun 2011 11:43:44 +0000 (12:43 +0100)]
rx: Enter loss recovery when we retransmit
Since I mistakenly wrote commit 36e2d13b, RX hasn't entered congestion
avoidance when a loss event occurs. This is bad, because on todays
networks the majority of packet losses are due to some form of
congestion.
Now that the timeout code has been restructured, the chances of entering
the retransmit routine in error are much much smaller, so this code
needs to be restored.
This change reverts 36e2d13b55085c996d38b30d003296c602ef8ee3. However,
the original RX code has the problem that it assumes that all forms of
fast recovery are the same - in particular, that the call settings that
result from entering fast recovery due to a fast retransmit are
identical to those resulting from a timeout. This is not the case, and
this will be fixed in a later change.
Simon Wilkinson [Sat, 18 Jun 2011 10:58:57 +0000 (11:58 +0100)]
rx: Add Karn-style backoffs to RX retransmits
When we retransmit a packet, we may be doing so because the RTT of the
connection has grown dramatically larger than earlier within the call.
However, RX doesn't permit all ACKs to retransmitted packets to be
counted within the RTT calculation.
So, adopt the same approach as Karn developed for TCP, and as described
in detail in RFC2988. When a retransmit event occurs, backoff the
connection RTT by doubling its value, and hold at this doubled value
until either another retransmit occurs (in which case we back off again,
up to a predetermined ceiling), or we receive an ACK packet which we
can use within the RTT calculation, in which case we drop back down to
the newly measured value.
This change replaces the per-packet backoff strategy originally
implemented in RX (which, whilst allowing resent packets more chance of
arriving, doesn't help with computing a correct RTT).
Simon Wilkinson [Sat, 18 Jun 2011 10:48:45 +0000 (11:48 +0100)]
rx: Make clock_Add correctly add to itself
With the existing clock_Add code, the following:
struct clock a = {2, 800000};
clock_Add(&a, &a);
gives a clock value of {6, 600000}, rather than the expected {5, 60000}.
This is because the ordering of instructions leads it to double count
the carry on the seconds field. Reorder the instructions so that the
carry is correctly applied.
Simon Wilkinson [Sat, 18 Jun 2011 10:35:30 +0000 (11:35 +0100)]
rx: Remove resending logic into its own function
Create a new function, rxi_Resend, which is the entry point to running
the transmit queue as a result of a resend event. This concentrates all
of the resend logic into one place, removes the need for
rxi_StartUnlocked, and means that rxi_Start's arguments don't need to
match those of an event handler.
Simon Wilkinson [Sat, 18 Jun 2011 09:46:53 +0000 (10:46 +0100)]
rx: Change the way that the RTT timer is applied
RX maintains a retryTime for every packet that it has transmitted,
which is held as the time that that packet was sent, plus the smoothed
RTT of the connection. If a packet is in the queue with a retryTime
older than the current time, then it is resent at the first opportunity.
In some circumstances, this first opportunity will be as a result of
the resend event timer expiring, in others it will happen as part of
a normal queue run.
There are a number of problems with this approach on congested networks.
Firstly, on a network with a large window size, which is in "normal"
flow, it means that we will never actually perform fast retransmit as
the timeout for this packet will have expired before we have received
any further ACKs. This is because, on a network with a relatively stable
RTT the ACK for packet n+1, n+2, or n+3 cannot arrive before the
expected time of arrival of the ACK for packet n. As we retry
immediately this expected time of arrival has passed, we never have the
opportunity of using these later ACKs to learn that packet n is lost.
Secondly, the fact that we may resend packets from a "normal" queue run,
rather than as a result of a resend event, means that there is no clear
entry point for resends. As resends should be assumed to be a result of
network congestion, and result in both the call throttling back, and the
RTT being increased, this lack of a clean entry point makes things
tricky.
As a solution, this patch changes the way in which retransmit times are
applied to use the algorithm described in RFC2988.
*) Whenever we send a new packet, we start a timer for the current call
rto value if one isn't already running.
*) Whenever we receive an ACK that acknowledges new data, and we have
packets that are sent but not yet acknowledged, we restart the
retransmit timer using the current rto value.
This alogrithm solves the first problem, as it means that if the
connection is still flowing, we will continue to receive ACKs, and we
can enter fast retransmit.
In implementation terms, we longer track a retryTime per packet, and
instead simply record if a packet has been sent or not. Packets which
have been sent may only be resent as a result of a resend timer
expiring, or of entering fast retransmit, so solving the second issue.
Simon Wilkinson [Fri, 17 Jun 2011 21:06:54 +0000 (22:06 +0100)]
rx: Compute smoothed RTT per call, not per peer.
RX uses the TCP RTT smoothing algorithm as described in RFC2988.
However, the TCP algorithm is designed to accept samples from a
single connection, accepting a new sample once per RTT.
RFC2988 suggests that "when multiple samples are taken
per RTT the [ alogrithm ] may keep an inadequate RTT history."
In RX's implementation, we use a single instance of this alogrithm
per peer, and input all of the samples from all of the active calls
and connections into this same instance. This leads to us taking
a significantly (potentially many magnitudes) larger number of samples
per RTT, and rapidly losing the RTT history. With RX's implementation,
short lived network events may easily bias the RTT, and cause large
numbers of packets to time out.
This change fixes this by moving the RTT calculation onto a per call
basis. We still update the peer with our caclulated value, so that new
calls may be created with an RTT corresponding to the current value for
the connection, rather than having to start high and converge downwards.
Simon Wilkinson [Tue, 21 Jun 2011 08:34:50 +0000 (09:34 +0100)]
rx: Make testclient build on Unix
The "testclient" utility is built as part of the build on Windows.
Fix it so that it actually builds on Unix, so we can test changes to
testclient there.
Simon Wilkinson [Sat, 14 May 2011 07:55:50 +0000 (08:55 +0100)]
rx: Reverse the consumption order of idle queue
Currently, the rx server thread idle queue is used in an LRU manner.
This means that we round robin requests between all of the threads
configured on a given system, which means that we end up thrashing
CPU caches on machines whose workload doesn't require that all of
the configured threads be used.
Change this so that we always use the most recently idle thread. This
isn't as "fair" to all of our waiting threads, but should mean that we
scale better on SMP machines, as a thread that is recently idle is
likely to have been recently scheduled.
Simon Wilkinson [Fri, 17 Jun 2011 19:35:59 +0000 (20:35 +0100)]
rx: Remove incorrect backoff code
The ACK packet handling routine contains code which causes the
RTT to backoff if the selective ACK response indicates that there is
a missing packet. The comment justifies this code as being in line
with Phil Karn's work on TCP.
However, the TCP behaviour is that we backoff when we enter resend. Both
TCP and RX have difficulty computing RTTs for resent packets due to the
ambiguous ACK problem. Whilst RX is slightly better than TCP in this
regard, we can't always tell whether an ACK refers to the original, or
resent packet, so resent packets are unable to contribute to the RTT.
This means that if the RTT ends up too low for the connection, and we
start resending every packet, the RTT will never grow to account for
this, as we never feed it any packet samples.
Karn's solution to this was to backoff (double) the RTT value when we
resend a packet, and then to not drop it back down until we receive an
ACK that we can count. This means that we will always get a new sample
for the connection, and the RTT will grow again.
The original author confirms that the current behaviour in RX is
incorrect, so simply remove it with this patchset.
Simon Wilkinson [Fri, 17 Jun 2011 18:38:29 +0000 (19:38 +0100)]
rx: Account for delayed ACKS when computing RTO
RX currently only soft ACKs every second packet, therefore a soft ACK
may be delayed by a period of time (currently 100ms, although RX did
expose this as a public variable in earlier versions).
RTT values are computed using only non-delayed ACKs, so the timeout
is a smoothed average of the exact time taken to send and directly
ACK a packet. Therefore, if the peer ends up using a delayed ACK for
the packet, using just the RTT will cause that packet to be timed out.
A while ago, this was dealt with by padding the calculated RTT with an
additional 350ms. This was then removed, and changed to a 350ms minimum
value. When this caused large numbers of spurious resends, the padding
was restored, but with a 20ms default value. As noted above, 20ms is
too low, as we may wait for up to 100ms before sending an ACK.
This patch changes minPeerTimeout so that it does what it says on
the tin - sets a minimum value below which the peer timout may not
fall. It then adds to either this value, or the calculated one, 200ms
of padding. This makes our padding identical to TCPs, and allows some
future leway as to the softAckDelay value.