Simon Wilkinson [Sat, 23 Oct 2010 13:51:56 +0000 (14:51 +0100)]
rx: More improvments to RTT calculation
Move the decision about whether a packet contributes to the peer's
rount trip time into the CalculateRoundTripTime function, and improve
the criteria used.
Previously, we only computed the RTT if we had not retransmitted. This
is bad, because it means that places where we have backed off in order
to retransmit never actually lengthen the RTT, and so the RTT is kept
artificially low, and we see a large number of retransmits. Instead,
use the serial of the ACK packet to determine which transmission is
being acknowledged, and if it is the first, or the last, transmission
use the appropriate sent time to calculate the RTT.
If we have no serial in the ACK (for a delayed ack, for example), or
if the serial doesn't match (where a single acknowledgement is soft
acking a number of packets), fall back to only using the ack if the
packet has not be retransmitted.
Also, avoid multiple counting of packets which have arrived as part
of a jumbogram by only permitting the last packet in a jumbogram to
contribute to the RTT. This avoids giving the RTT of jumbograms more
weight than those of normal packets - doing so would pull down the
RTT, as it in effect favours packets which have not be retransmitted.
Jeffrey Altman [Thu, 21 Oct 2010 18:13:03 +0000 (14:13 -0400)]
Rx: Treat rx_minPeerTimeout not as a minimum but as padding
An improved RTT and timeout calculation algorithm is being
developed but until we have it, treat rx_minPeerTimeout not as
a minimum value for the timeout but as padding to be added to
the measured RTT when computing the peer timeout value.
With this change rx does not begin to send large numbers of
resends when the RTT begins to exceed the rx_minPeerTimeout
value. Timeout triggered resends at the moment can force rx
into fast recovery mode which in turn kills performance. It
is better to avoid that problem for now.
Jeffrey Altman [Thu, 21 Oct 2010 18:23:18 +0000 (14:23 -0400)]
Rx: Fix socket() handling so errors are properly detected
socket() returns an osi_socket which on Windows is an
unsigned type (HANDLE). Therefore, tests of osi_socket < 0
will never identify when the INVALID_SOCKET value is returned.
On Windows, the OSI_NULLSOCKET is assigned to INVALID_SOCKET.
Replace all comparisons of (osi_socket < 0) with
(osi_socket == OSI_NULLSOCKET) as a means of detecting errors.
In addition, do not pass socket() the protocol value 0 when
IPPROTO_UDP is what is desired.
Finally, perror() on Windows never reports any error from Winsock.
perror() is a CRT function. To get the real socket error
WSAGetLastError() must be called and its value be written to
stderr.
maxDgramPackets is initially assigned this value after correcting
for the wire endian. This compare is harmless on little endian
since the network endian value will typically be huge and redundant
on big endian machines.
Simon Wilkinson [Mon, 11 Oct 2010 17:25:38 +0000 (13:25 -0400)]
rx: Simplify round trip time calculation
Move the logic for deciding whether to compute RTT out of PeerNetStats
and into the callers. This means that we can share decisions about
whether a packet is ACK'd or not, and avoid uneccessary multiple tests
and function calls.
This change also stops us from computing RTT times for packets outside
of the set of explicit ACKs that we have received. This means that we
no longer compute RTTs for packets that are on the transmit queue, but
not yet on the wire.
Jeffrey Altman [Sat, 16 Oct 2010 17:14:03 +0000 (13:14 -0400)]
Rx: Do not compute RTT on non-last packets of a jumbogram
A jumbogram is constructed as a series of rx packets that are
all sent at once and acknowledged at the same time. Computing the
RTT for all of the packets that makes up the jumbogram provides
the jumbogram RTT more weight than for a non-jumbogram packet.
To restore fairness, only compute the RTT for the last packet of
a jumbogram. The non-last packets with have the RX_JUMBO_PACKET flag
set in the packet header.
Simon Wilkinson [Mon, 11 Oct 2010 17:14:02 +0000 (13:14 -0400)]
Rx: Reject out of order ACK packets
Our RX implementation virtually guarantees that we will see out of
order ACK packets, even on well behaved networks, as we send acks
simultaneously from multiple threads.
Currently we only reject out-of-order ACKS which change the window
position (so a window that advances, can never go back). However,
we fail to deal with the explicit acknowledgement portion of the ACK
packet in the same way...
For example, if we have a packet A that acknowledges packets 1 and 2,
and then a packet B acknowledging 1,2,3 and 4. If B arrives before A,
then we mark 1, 2, 3, 4 as acknowledged, and then treat the arrival of
A as nAcking 3 and 4. This has the same effect as an explicitly stated
nack, triggers an early and unnecessary resend and may, in some situations,
cause the call to go into congestion avoidance.
We can solve this using the previousPacket field of the ACK. This
indicates the last packet seen by the peer. In the same way as
firstPacket, this should never go backwards, and so can be used to
detect out of order acknowledgements, and reject them.
Andrew Deason [Fri, 15 Oct 2010 21:35:32 +0000 (16:35 -0500)]
pts: Specifically check for group id 0
For consistency with the code checking user ids in createuser, check
for a specified group id of 0 specifically and give a slightly
different error message for it.
Russ Allbery [Thu, 14 Oct 2010 20:41:45 +0000 (13:41 -0700)]
Return SRV record ports in network byte order
Convert the port extracted from the SRV record return to network byte
order before assigning it to the port array.
The port in a SRV record is extracted by pulling out the high byte
and low byte and then mathematically combining them, which implicity
converts from network byte order to host byte order. However, the
callers of afsconf_LookupServer expect the port array to be returned
in network byte order since ports are assigned without modification
to the .sin_port field of a struct sockaddr_in. See also the byte
order of the default afsdbPort value.
Reported by Jan Christoph Nordholz (Debian Bug#600228).
Marc Dionne [Wed, 13 Oct 2010 23:11:25 +0000 (19:11 -0400)]
Linux: fix statfs configure test
The change to the statfs configure test that was made for 2.6.36
broke the test for older kernels. The new test is based on a call,
and that will generate a warning but not an error when the arguments
don't match the prototype.
Take another tack, and revert to the old style test, but with the
simple_statfs function instead of vfs_statfs.
Tom Keiser [Wed, 13 Oct 2010 05:10:09 +0000 (01:10 -0400)]
don't release Volume lightweight ref too early
FSYNC_com_VolOff was releasing its lightweight ref before the error handling
code for VGetVolumeByVp_r was executed; this code needs to dereference the
Volume pointer for some of its logic. This was unsafe since
VCancelReservation_r() could have resulted in the Volume object being freed.
Move VCancelReservation_r() below the error handling block. NB: the error
handling block now relies upon the goto done/deny to cancel its lightweight
ref.
Simon Wilkinson [Sun, 10 Oct 2010 12:04:41 +0000 (08:04 -0400)]
rx: Don't malloc the xmit list
Building the transmit list happens in a time critical section of
code. Using malloc to allocate the list which holds the packets to
be transmitted slows down this critical section. Instead, just
allocate the space as part of the call structure.
Locking of xmitList is somewhat tricksy, as the call->lock is
dropped over calls to sendmsg(). However, the xmitList is protected
by the TQ_BUSY call flag, which prevents multiple threads from
usign the transmit queue, and hence the xmitList, simultaneously.
Andrew Deason [Fri, 8 Oct 2010 16:51:30 +0000 (11:51 -0500)]
RX: Force sane timeout values
Currently we do not check the specified timeout values when someone
changes a connection's dead, idle, or hard dead time. However, if the
conn's dead time is larger than the other two times, a loss of network
activity will result in one of the other timeouts getting triggered
first.
To prevent this and possibly other problems from happening, force a
connection's timeouts to always obey the relationship
secondsUntilDead <= idleDeadTime <= hardDeadTime, by checking these
values whenever they are changed.
Andrew Deason [Wed, 6 Oct 2010 22:24:02 +0000 (17:24 -0500)]
RX: Adjust all timeouts for RTT
Previously only the deadTime RX network timeout was getting adjusted
for the peer's rtt and rtt_dev values. Do this for the idle and hard
timeouts as well, since a higher RTT is going to make everything
potentially take longer.
Tom Keiser [Wed, 13 Oct 2010 06:15:36 +0000 (02:15 -0400)]
update fssync-debug to handle the VOL_LOCKED flag
Allow fssync-debug to dump the VOL_LOCKED flag, rather than the
current behavior of printing absolutely nothing when this flag
is asserted. In addition, increase the flag buffer size since
it turns out we would truncate if all nine flags were asserted
at once.
There doesn't seem to be a need to limit the rx message size when
using rx_WritevAlloc. If there arent enough rx buffers to hold
the entire message at once, it will simply return less space.
Simon Wilkinson [Tue, 5 Oct 2010 20:21:38 +0000 (21:21 +0100)]
rx: Don't call gettimeofday for every packet ack
Every time we receive an ACK packet, we call gettimeofday() for
every entry in the transmit queue that's permanently ack'd by that
packet. Instead, just make a note of the time when we start
processing the packet queue, and use it for every packet in the
queue.
This shaves around 5% off rxperf's runtime with a window size of 128.
Jeffrey Altman [Sat, 9 Oct 2010 07:06:07 +0000 (03:06 -0400)]
Windows: Do not issue RXAFS change RPCs on known RO volumes
If the cm_scache_t is known to be on a RO volume, do not permit
RXAFS_xxx RPCs that would attempt to make a change to the volume
to be issued to the file server. Instead, return CM_ERROR_READONLY
immediately. This avoids triggering the abort threshold for
the current connection on the file server.
Phillip Moore [Thu, 7 Oct 2010 23:25:09 +0000 (19:25 -0400)]
Extract the .version file when building the srpm file
If you are building the source and binary rpms from a released
tarball, instead of a real git repo, the .version file is required by
build-tools/git-version. With out this, the version defaults to
UNKNOWN, and although the source rpm will build, it won't compile.
Simon Wilkinson [Sat, 2 Oct 2010 03:17:56 +0000 (23:17 -0400)]
rx: Reduce dependence on call->lock
This patch reduces our dependence on call->lock, by allowing more
of the reader thread to run lock free. Doing so requires that
call->mode only be set by the reader thread. As a result, call->mode
can only be set to RX_CALL_ERROR by rxi_CallError(). The mode is
set to RX_CALL_ERROR by the reader thread immediately after regaining
the call->lock when it has been dropped.
Phillip Moore [Tue, 5 Oct 2010 16:46:35 +0000 (12:46 -0400)]
Quick Start Guide updated for RHEL rpms, and CLI syntax
The names of the RHEL rpms have been updated to reflect what is
actually published on openafs.org, and the CLI syntax of the commands
run using -noauth have had the unnecessary -cell options removed.
Christof Hanke [Tue, 5 Oct 2010 15:01:41 +0000 (17:01 +0200)]
volserver: Do not return ENOMEM on AIX from XVolListPartitions
When calling "vos partinfo" or "vos listpart" towards a server
running AIX with no partitions attached, it would return a
ENOMEM, because unlike on linux, malloc(0) returns NULL on AIX.
Thus, just don't do any malloc, when we have no partitions anyway.
Andrew Deason [Sun, 3 Oct 2010 23:27:19 +0000 (18:27 -0500)]
vol: Log ignored dirs that look like partitions
If we see a /vicep*-like directory when we VAttachPartitions, and we
ignore it because it lacks an AlwaysAttach, log that we ignored it.
This may make things less confusing to admins that just try to create
a /vicep* directory and don't know why it doesn't work.
Andrew Deason [Fri, 10 Sep 2010 20:52:34 +0000 (15:52 -0500)]
vos release: Force full dump on RO_DONTUSE sites
When releasing a volume, currently we perform an incremental dump on
RO_DONTUSE sites if the volume exists on the remote site. Since
RO_DONTUSE implies that we do not expect a site to exist there, delete
the extant volume if we find one and force a full dump.
If we perform an incremental dump, we run the risk of incrementally
dumping to a temporary volume that the administrator is not aware of,
and doesn't have any actual data but has a last update time late
enough that it may be missing some data after the incremental dump. So
to avoid that, force a full dump every time.
sin_family isnt sent over the network and therefore doesnt need
htons(). sin_family is essential the same as domain, and no one
does socket(htons(AF_INET), ...)
Windows: Ensure that cm_NameI errors are acted upon promptly
There are many cases in the SMB server where an error from cm_NameI()
was either ignored or not acted upon until several other operations
are performed that could result in the same error being repeated.
This is a mistake which did not have negative side effects until
additional checks for callback status were added recently.
At present, if a CM_ERROR_ACCESS error is returned and ignored,
subsequent attempts to operate on the same cm_scache_t will result
in additional queries to the file server that will also end in an
abort response. This can trigger the file server to delay responses
to the client.
This patchset ensures that all cm_NameI() errors are acted upon
promptly.
Windows: Fix Parent(path) computation to permit mp and symlink creation
The parent path computation was leaving trailing slashes on the
path names which prevented the creation of mount points and symlinks
when UNC paths were used that contained mount points.
Jeffrey Altman [Sat, 2 Oct 2010 03:47:11 +0000 (23:47 -0400)]
Rx: raise rx_minPeerTimeout to 20ms
At 2ms it is possible for the packet we are sending to be resent
just about immediately as the retryTime computation occurs before
the send takes place and not afterwards. If the network send blocks,
the retryTime may have already expired.
We do not want rx_minPeerTimeout to be too large though because the
value will end up being used as the backoff time period when the
actual RTT for the connection is less than the rx_minPeerTimeout
value. Extensive testing shows 20ms to be an adequate compromise
that avoids the vast majority of unnecessary resends without
unnecessarily slowing down the connection if a packet is in fact
lost.
Rx: When call receive is done, send ack all packet
When all of the packets for a call have been received,
immediately send an ack all packet to the peer. This permits
the peer to free the contents of the transmit queue and cancel
all pending resend events.
Jeffrey Altman [Sat, 2 Oct 2010 04:49:38 +0000 (00:49 -0400)]
Windows: Pass Volume Root Fid to cm_Analyze after RXAFS_GetVolumeStatus
RXAFS_GetVolumeStatus can return VNOVOL, VMOVED, etc. In order to
process them and update volume state a fid must be passed to cm_Analyze().
Use the volume root fid.
If a volume lookup returns VL_NOENT or VL_BADNAME, cache the negative
response for five minutes. This prevents volume lookup storms caused
by the same volume lookup being performed repeated during a short
time period. This can happen if mount points to volumes that do not
exist are present in a directory that is being evaluated by Windows
Explorer or Common Control File Dialogs.
This functionality is implemented by storing the most recent update
time for the volume group as part of the cm_volume_t. A non-existing
volume group is identified with a new CM_VOLUMEFLAG_NOEXIST flag.
The presence of the lastUpdateTime value also permits volume location
information to expire at lastUpdateTime + lifetime instead of expiring
all volume information simultaneously each lifetime period.
Andrew Deason [Tue, 14 Sep 2010 14:45:10 +0000 (10:45 -0400)]
DAFS: Raise LogLevel for per-chain vol stats
Only report detailed per-chain volume statistics on shutdown/SIGXCPU
if LogLevel is 125 (or 25 for smaller per-chain stats). If a
fileserver is configured with a large -vhashsize, printing out stats
for each chain can take awhile and use up a nontrivial amount of disk
space for logging, so only print out these stats if we're asked for
them.
configure: --with-linux-kernel-packaging should default to disabled
the test for this build feature is reversed. by default, the value for
with_linux_kernel_packaging will not be defined which makes the existing
test pick MPS='SP' instead of LINUX_WHICH_MODULES. based on the configure
help messages, this would appear to be an opt-in not an opt-out.
...
Optional Packages:
...
--with-linux-kernel-packaging
use standard naming conventions to aid Linux kernel
build packaging (disables MPS, sets the kernel
module name to openafs.ko, and installs kernel
modules into the standard Linux location)
...
Simon Wilkinson [Sun, 26 Sep 2010 14:48:54 +0000 (15:48 +0100)]
RX: Tidy reader data locking
Data which is accessed only by the reader thread doesn't need to be
protected by call->lock
Remove the call->lock protection where it isn't required, which makes
certain read/write calls lock free.
Stop rx_ResetCall from manipulating reader thread data. This data will
be zero'd and cleared when the reader thread calls rx_EndCall, and
doesn't need to be reset by the Listener thread.
The change which made rx_ResetCall reset reader thread information
was originally part of 559ea99b. It caused race conditions that were
fixed by adding additional lock protection in d0cc6e, 4dadd2 and 423ab97e. This commit reverts portions of all of those changes. It
is safe to not clear the iovc in ResetCall because any NewCall must
be balanced by a corresponding EndCall in the reader thread, and
EndCall does the appropriate freeing of reader elements.
Ben Kaduk [Wed, 29 Sep 2010 00:03:25 +0000 (20:03 -0400)]
More FBSD syscall tweaking
We're now properly registered in syscalls.master for HEAD
(i.e. proto-9.0) and RELENG_8 (proto-8.2), which means that
afs3_syscall is prototyped in sys/sysproto.h . Accordingly,
don't declare it in afs_prototypes.h for those cases.
Also add FBSD82_ENV checks for the new syscall-registration code,
and cast afs3_syscall to sy_call_t* for the sysent structure.
Simon Wilkinson [Mon, 27 Sep 2010 22:50:23 +0000 (23:50 +0100)]
rx: Limit window size to max acks
The RX ack packet can only acknowledge 255 packets at once. In the
current implementation, this limits our maximum window size to 255,
as we can't acknowledge any packets we receive outside of that window
size.
Contains DKMS robustness fixes, improvements to the defaults for the
module build, and cleanup of the openafs-client init script. Updates
the build system for the new demand-attach binary naming and for the
changes to supported configure options. Fixes some issues with
afs-newcell. Forces disabling of the Linux syscall probing in kernel
module builds, since no supported Debian kernel allows this and it
causes problems. Update debhelper to V8, which allows simplification
of debian/rules and debian/module/rules.
Michael Meffie [Thu, 23 Sep 2010 14:15:57 +0000 (10:15 -0400)]
scout: display fetch and store counts as unsigned
Fetches and stores are already defined as unsigned, so format
them as unsigned values when displaying in scout. This fixes
the bug where scout shows those counts as negative values on
busy servers which have been running for a while.
Simon Wilkinson [Thu, 23 Sep 2010 16:41:47 +0000 (17:41 +0100)]
rx: Big windows make us sad
The commit which took our Window size to 128 caused rxperf to run
40 times slower than before. All of the recent rx improvements have
reduced this to being around 2x slower than before, but we're still
not ready for large window sizes.
As 1.6 is nearing release, reset back to the old, fast, window size
of 32. We can revist this as further performance improvements and
restructuring happen on master.
The previous value, 350ms, is historical. Now that networks are
so much faster, an artificially high timeout value when backed off
results in an extremely long delay before communication can resume.
Rx: Do not hold call lock across memcpy in rx_ReadProc/rx_WriteProc
1.4.x does not hold the call lock across memcpy operations in
rx_ReadProc, rx_ReadProc32, rx_WriteProc, rx_WriteProc32. The
claim is that the call curpos, curlen, and nLeft fields which
refer to the current packet being processed will not be touched
by any other thread. Therefore it is safe to drop the call lock
to permit another thread to add packets to the call while the memcpy
is performed in parallel.
This patchset continues to hold the call lock longer than the
original implementation but does drop it for the length of time
it takes to copy data from the packet buffer to the application
buffer.
Windows: Export additional RX debugging variables from afsrpc.dll
Export
rxi_nRecvFrags @2008 DATA
rxi_nSendFrags @2009 DATA
rx_initReceiveWindow @2010 DATA
rx_initSendWindow @2011 DATA
rx_intentionallyDroppedPacketsPer100 @2012 DATA
rx_intentionallyDroppedOnReadPer100 @2013 DATA
so they can be referenced from pthreaded builds of src/rx/test tools.
Exported variables must be present in both FREE and CHECKED builds.
Rx: PrintTheseStats should not be dependent on RXDEBUG
When RXDEBUG is not defined, PrintTheseStats generates an error
even though the statistics are in fact available. The global
variable rx_packetTypes was not being defined without RXDEBUG.
Make rx_packetTypes defined always and permit statistics to
always be printed.
The global dataPacketsReSent statistic should be the sum of all
peer->reSends and dataPacketsSent should not include the count of
resent packets. Prior to this patchset, dataPacketsSent included
the resent packets and dataPacketsReSent was computed as the number
of requests for Ack instead of the number of packets resent.
If a packet is missing, the peer timeout is backed off to provide
a new starting point for timeout computation. The backoff state
must be stored in the peer object to ensure that multiple failures
do not result in more than one backoff before a successfully received
packet is available for recomputation.
Rx: only compute peer bytes sent and received if rx_stats_active
Computing the bytes sent and received is an expensive operation.
If rx statistics collection has been disabled we should not collect
the peer data. The most expensive operation is the rx_FindPeer()
call that is performed during rxi_ReadPacket(). rxi_ReadPacket()
is processed by the rx listener thread which must be as fast as
possible.
rxi_ReceiveAckPacket can acquire and drop the conn_data_lock several
times and acquires and drops the peer_lock unnecessarily. This patchset
adds a variable to track whether the conn_data_lock is held in order
to avoid the need to drop it and reacquire it based upon conditional
operations. It also relocates the peer->maxPacketSize computations
in order to consolidate the work performed under the peer_lock.
rxperf made assumptions that it was built against LWP, used buffer
sizes for read/write that were too small, made use of non-portable
types, and set signal handlers that are unsupported.
Andrew Deason [Wed, 15 Sep 2010 16:19:33 +0000 (12:19 -0400)]
libafs: Fix pioctl get/putInt alignment issues
We don't know if the buffer for pioctl data is aligned to anything, so
we can't just dereference the given pointer as an int or anything
else. So, just memcpy the data in for ints and such; conveniently,
afs_pd_getBytes and afs_pd_putBytes can do this for us, so just use
that.
Marc Dionne [Fri, 10 Sep 2010 23:55:39 +0000 (19:55 -0400)]
vlserver: Set but not used variables
Remove some variables that are set but never used in the vlserver
directory:
- n1,n2,n3 and n4 in vlclient.c appear to have never been used even
in the original IBM code
- some variables in vldb_check.c that are no longer used after some
recent changes
Andrew Deason [Tue, 14 Sep 2010 16:15:22 +0000 (12:15 -0400)]
volser: Delete timed-out temporary volumes
When a transaction times out on a volume, delete the volume if it is a
temporary volume (destroyMe is set). This prevents half-created
volumes from accumulating, which can take up space and screw up
certain vol ops in some versions.