Simon Wilkinson [Wed, 13 Jul 2011 10:53:57 +0000 (11:53 +0100)]
Add make dist and make srpm targets
Add targets to generate distribution tarballs, and srpms, from a tree.
These will generate packages for whatever the current HEAD of the tree
is - if the HEAD is a release tag, then the packages will be named for
that release, if the HEAD is between releases, then git describe will
be used to create an appropriate version identifier.
The tarballs are generated from the current git repository contents,
anything not checked in will not be included.
Simon Wilkinson [Wed, 13 Jul 2011 13:35:48 +0000 (14:35 +0100)]
vol: Initialise list before error exit when cloning
The inode list wasn't being initialised before the first call into the
error handler. This makes it possible that we end up trying to discard
items from an uninitialised list, with all the chaos that would cause.
Fix things so that this list is correctly set up.
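The pattern behind this fix can be sketched as follows. This is a minimal illustration and not the vol package code: the list type, item_list_init(), item_list_discard() and clone_sketch() are all hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical singly linked list standing in for the inode list. */
struct item {
    struct item *next;
};

struct item_list {
    struct item *head;
};

void item_list_init(struct item_list *l)
{
    l->head = NULL;
}

/* Frees every item and returns how many were discarded. */
int item_list_discard(struct item_list *l)
{
    int n = 0;

    while (l->head != NULL) {
        struct item *it = l->head;
        l->head = it->next;
        free(it);
        n++;
    }
    return n;
}

/* Sketch of a clone-style routine with a single shared exit path. */
int clone_sketch(int fail_early)
{
    struct item_list inodes;
    struct item *it;
    int code = 0;

    /* Initialise before the first possible jump to the exit path. */
    item_list_init(&inodes);

    if (fail_early) {
        code = -1;
        goto done;
    }

    it = malloc(sizeof(*it));
    if (it == NULL) {
        code = -1;
        goto done;
    }
    it->next = inodes.head;
    inodes.head = it;

  done:
    /* Safe on every path because the list is always initialised. */
    item_list_discard(&inodes);
    return code;
}
```

If item_list_init() were called after the first goto, the early failure path would walk an uninitialised head pointer.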
Simon Wilkinson [Wed, 13 Jul 2011 13:33:57 +0000 (14:33 +0100)]
volser: Actually return errors from ListOneVolume
The return code from GetVolInfo was being thrown away, and success
returned to the caller, regardless of the success of this function.
As GetVolInfo's exit codes aren't suitable for sending over the wire,
just return ENODEV if this function returns failure.
Throughout cm_server.c, input parameters to functions that
are protected by cm_serverLock are dereferenced by assignment
during variable initialization prior to the cm_serverLock being
obtained. As a result there is a race which can result in
either list corruption or dereferencing freed memory.
When rx was converted to use pthreads, the code that allocates
a call to a connection channel in rxi_ReceivePacket() was not
made thread safe. The code prior to this patchset permitted a race
in the server connection case. The rx_connection channel assignment
in rxi_ReceivePacket() and the call destruction in rxi_FreeCall()
and rxi_DestroyConnectionNoLock() did not consistently protect the
rx_connection channel array using the conn_call_lock.
This race could result in rxi_ReceivePacket() operating on a
rx_call which was disconnected from the previously assigned
rx_connection.
In addition, the code in rxi_ReceivePacket() that was intended
to protect the allocation of a call using rxi_NewCall() to the
connection channel array was racy with itself.
This patchset consistently applies the conn_call_lock to protect
the allocation / deallocation of calls to the connection channel
array and in the process simplifies the logic in rxi_ReceivePacket()
as it is no longer necessary to protect against a null call pointer
since the race can no longer be lost.
OpenBSD: Add <sys/queue.h> header for <sys/lockf.h>
On OpenBSD, the <sys/lockf.h> header requires the TAILQ_* macros
which are defined in <sys/queue.h>. The latter is not automatically
included by <sys/lockf.h>. This patch makes sure that it is
available by putting it into the OpenBSD-specific param.h files
(so as not to impact any other OS).
Windows: always open dscp in smb_ReceiveNTTranCreate
There were two code paths in smb_ReceiveNTTranCreate that included
asserts in case the directory cm_scache_t object had not been
evaluated. RT129299 contains a report that at least one of
them had been tripped in production. There is no reason to avoid
evaluating the directory scp. It must exist in the cache and
obtaining a reference in all cases simplifies the logic of this
overly complex function.
Windows: Refactor cm_Unlock*() to avoid code duplication
cm_Unlock() and cm_UnlockByKey() duplicate a significant amount
of code. Refactor it into a new static function, cm_IntUnlock()
which handles the process of downgrading or releasing a file
server lock depending upon the lock state of the cm_scache_t
object.
Windows: Do not probe new servers from cm_UpdateVolumeLocation
cm_NewServer() can result in a call to cm_UpdateVolumeLocation()
if a server probe is performed. In order to avoid recursive
calls to cm_UpdateVolumeLocation() do not probe new servers from
within cm_UpdateVolumeLocation().
Andrew Deason [Fri, 1 Jul 2011 21:58:06 +0000 (16:58 -0500)]
vol: Don't always FDH_REALLYCLOSE on linktable ops
If we dec a linktable entry or get a free tag from the link table,
there is no reason to FDH_REALLYCLOSE the linktable fd handle.
FDH_REALLYCLOSE is the same as FDH_CLOSE, except that it tells the
ihandle package that the file handle will not be used again soon. If
we dec a linktable entry or get a free tag, there is no reason to
think that, so just FDH_CLOSE the handle instead.
Andrew Deason [Fri, 1 Jul 2011 19:25:05 +0000 (14:25 -0500)]
DAFS: Do not clear salv state on fssync salvage
When a volume is put into an error state via the FSYNC_VOL_FORCE_ERROR
command, we clear the salvage state information on it, since we're
forcing it offline and thus inaccessible. However, if we are forcing
it to an error state because the volume needs salvaging, we just
salvage it. In this case, do not clear the salvage state, since we
need to know if we've already requested or scheduled a salvage so we
can correctly keep track of the number of salvages performed.
Andrew Deason [Wed, 29 Jun 2011 18:51:22 +0000 (13:51 -0500)]
SOLARIS: Granular multiPage detection
Currently, a struct vcache has a multiPage counter, indicating how
many afs_getpage requests are in-flight for that vcache that involve
retrieving multiple pages. Any dcache entries associated with such vcaches are
then avoided when choosing dcache entries to evict from the cache,
since we may deadlock when trying to evict a dcache entry from one of
the earlier afs_GetOnePage calls in a particular afs_getpage request.
This behavior can cause the client to become unusable if the cache
becomes full, and the only items in the cache are dcache entries in a
file that has an in-flight multi-page afs_getpage request. In that
case, we cannot kick out any entries from the cache, and so we
wait forever for the cache utilization to go down.
To prevent this from occurring, record exactly which ranges in the
file have in-flight multi-page afs_getpage requests, and just avoid
dcache entries in those ranges. This way afs_GetDownD can evict dcache
entries in the same file, but still avoid entries that would cause a
deadlock.
Also add some comments explaining this situation a bit more.
Revert "Rx: When call receive is done, send ack all packet"
This reverts commit 3cd3715e608b801b4848399e42cb47464e6e3cc3,
which replaces an ack with an ackall; ackall processing does
not actually mark all packets acked when it is received, so
it is insufficient.
Andrew Deason [Tue, 21 Jun 2011 21:25:14 +0000 (16:25 -0500)]
DAFS: Do not attach a specialStatus'd vol
If we encounter a preattached volume during GetVolume, we currently
ignore vp->specialStatus before trying to attach. However, we will
generally always fail to attach due to a conflicting vol op, but even
if we don't, GetVolume always returns an error later on if
vp->specialStatus is set. So, save some processing and attempted
attachments by bailing out sooner if vp->specialStatus is set.
Andrew Deason [Tue, 21 Jun 2011 23:08:21 +0000 (18:08 -0500)]
salvager: Clear summary in RecordHeader
Not every field in the summary header in RecordHeader is set, leaving
some used uninitialized when we copy to the given volumeSummaryp (like
'deleted'). Zero out the header before we do anything.
Andrew Deason [Tue, 21 Jun 2011 22:51:32 +0000 (17:51 -0500)]
Build a separate copy of vlib for dasalvager
Currently dasalvager links to vlib.a. But vlib.a is built without any
DAFS defines, and so the size of a struct DiskPartition64 is different
(since dasalvager is built with AFS_DEMAND_ATTACH_UTIL). Build our own
copies of the volume package files instead, with
AFS_DEMAND_ATTACH_UTIL defined.
Andrew Deason [Tue, 21 Jun 2011 19:58:42 +0000 (14:58 -0500)]
vol: Do not overwrite specialStatus in attach2
attach2 wants to set specialStatus to VBUSY in certain conditions
(such as, it discovers a conflicting vol op where VVolOpSetVBusy_r is
true). However, specialStatus may already be set to something else,
like VMOVED if the volume is being moved off of the server. This can
happen if the volserver has checked out and FSYNC_VOL_MOVE'd a
preattached volume but hasn't deleted or checked the volume back in
yet.
So, if specialStatus is already set, don't touch it, so we don't start
reporting VBUSY errors to clients when we should be reporting VMOVED,
or some other error code previously set.
Simon Wilkinson [Sat, 18 Jun 2011 14:50:08 +0000 (15:50 +0100)]
rx: Exit fast restart on non-duplicate ACK
The current code only exits fast restart when we receive an ACK
packet that contains no missing chunks at all. On a network that is
dropping a reasonable chunk of its packets, this means that we spend
most of the call in fast recovery. (I originally found this by running
with the intentional packet drop feature set to 10%.)
TCP's fast retransmit behaviour is that we stay in fast recovery until
we receive our first non-duplicate acknowledgement. In TCP that means an
acknowledgement that moves the window. In RX, it is an acknowledgment
that ACKs a new packet.
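The difference between the two exit conditions can be sketched as below. The struct and function names are hypothetical and only model the decision, not the RX ACK processing itself.

```c
/* Hypothetical summary of one received ACK packet. */
struct ack_summary {
    unsigned newly_acked;   /* packets acknowledged for the first time */
    unsigned missing;       /* gaps still reported by this ACK */
};

/* Old rule: leave recovery only when the ACK reports no gaps at all. */
int exit_recovery_old(const struct ack_summary *a)
{
    return a->missing == 0;
}

/* New rule: leave recovery on the first non-duplicate ACK. */
int exit_recovery_new(const struct ack_summary *a)
{
    return a->newly_acked > 0;
}
```

On a lossy link most ACKs still report some gap, so the old rule almost never fires, while the new rule fires as soon as any new packet is acknowledged.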
Simon Wilkinson [Sat, 18 Jun 2011 12:17:07 +0000 (13:17 +0100)]
rx: Don't limit the # of packets sent in recovery
The RX transmit engine limits the number of packets sent whilst in
loss recovery to one per invocation of the transmit engine. As the
engine cannot be called by the application thread whilst in recovery,
this means that we end up being limited to one packet per ACK received,
which means that despite a growing congestion window we'll only send
one packet per RTT (in effect, a congestion window of 1).
This will remain the case until we exit recovery, and all of a sudden
can send a large number of packets. If this is larger than the current
capacity of the network, we'll probably end up straight back in recovery
again.
Let the congestion window do its job, by removing this arbitrary limit.
Simon Wilkinson [Sat, 18 Jun 2011 12:01:35 +0000 (13:01 +0100)]
rx: Don't wait for TQ busy when entering recovery
Two different threads can cause a call to enter recovery. The event
thread will move a call into recovery as a result of a timeout, or
the listener thread will move it there following a fast retransmit.
In both of these cases, recovery looks different. In the case of
a timeout, we enter slow start, starting as if we were beginning
transmission for the first time. Following fast retransmit, we enter
fast recovery, with different starting parameters than those coming
from slow start.
As a result, the current behaviour, where either call sitting in
FAST_RECOVER_WAIT causes the other to simply return, is inappropriate.
Further investigation indicates that FAST_RECOVER_WAIT is actually
unnecessary. There is no harm caused to a thread which is currently
blocked on the network in the middle of a transmit, in adjusting the
window size underneath it. As both of these states collapse the window,
that thread will simply cease sending earlier.
So, simplify the code, and remove the potential race between event and
listener by removing the FAST_RECOVER_WAIT state.
Simon Wilkinson [Sat, 18 Jun 2011 11:43:44 +0000 (12:43 +0100)]
rx: Enter loss recovery when we retransmit
Since I mistakenly wrote commit 36e2d13b, RX hasn't entered congestion
avoidance when a loss event occurs. This is bad, because on today's
networks the majority of packet losses are due to some form of
congestion.
Now that the timeout code has been restructured, the chances of entering
the retransmit routine in error are much much smaller, so this code
needs to be restored.
This change reverts 36e2d13b55085c996d38b30d003296c602ef8ee3. However,
the original RX code has the problem that it assumes that all forms of
fast recovery are the same - in particular, that the call settings that
result from entering fast recovery due to a fast retransmit are
identical to those resulting from a timeout. This is not the case, and
this will be fixed in a later change.
Simon Wilkinson [Sat, 18 Jun 2011 10:58:57 +0000 (11:58 +0100)]
rx: Add Karn-style backoffs to RX retransmits
When we retransmit a packet, we may be doing so because the RTT of the
connection has grown dramatically larger than earlier within the call.
However, RX doesn't permit all ACKs to retransmitted packets to be
counted within the RTT calculation.
So, adopt the same approach as Karn developed for TCP, and as described
in detail in RFC2988. When a retransmit event occurs, back off the
connection RTT by doubling its value, and hold at this doubled value
until either another retransmit occurs (in which case we back off again,
up to a predetermined ceiling), or we receive an ACK packet which we
can use within the RTT calculation, in which case we drop back down to
the newly measured value.
This change replaces the per-packet backoff strategy originally
implemented in RX (which, whilst allowing resent packets more chance of
arriving, doesn't help with computing a correct RTT).
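A minimal sketch of the backoff rule described above, under stated assumptions: struct karn_timer, the two functions and the 60 second ceiling are hypothetical, and times are plain milliseconds rather than RX clock values.

```c
/* Assumed ceiling for the backed-off timeout; the real value may differ. */
#define KARN_CEILING_MS 60000u

/* Hypothetical per-call timer state. */
struct karn_timer {
    unsigned rto_ms;    /* timeout currently in force */
};

/* A retransmit event doubles the timeout, up to the ceiling. */
void karn_on_retransmit(struct karn_timer *t)
{
    if (t->rto_ms > KARN_CEILING_MS / 2)
        t->rto_ms = KARN_CEILING_MS;
    else
        t->rto_ms *= 2;
}

/* An ACK that can be used as an RTT sample ends the backoff and
 * drops the timeout to the newly measured value. */
void karn_on_usable_ack(struct karn_timer *t, unsigned measured_rto_ms)
{
    t->rto_ms = measured_rto_ms;
}
```

Starting from 500 ms, two retransmits in a row give 1000 ms and then 2000 ms; the first usable ACK afterwards replaces that with whatever the fresh measurement yields.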
Simon Wilkinson [Sat, 18 Jun 2011 10:48:45 +0000 (11:48 +0100)]
rx: Make clock_Add correctly add to itself
With the existing clock_Add code, the following:
struct clock a = {2, 800000};
clock_Add(&a, &a);
gives a clock value of {6, 600000}, rather than the expected {5, 600000}.
This is because the ordering of instructions leads it to double count
the carry on the seconds field. Reorder the instructions so that the
carry is correctly applied.
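The effect of the ordering can be reproduced with the sketch below. The two functions are hypothetical reconstructions written for illustration (the real clock_Add may be a macro and differ in form); only the order of the seconds addition and the microsecond carry matters here.

```c
/* Same two-field layout as in the example above. */
struct clock {
    int sec;
    int usec;
};

/* Carry first: when c1 and c2 alias, the carry has already bumped
 * the seconds field by the time it is added to itself. */
void clock_add_carry_first(struct clock *c1, const struct clock *c2)
{
    c1->usec += c2->usec;
    if (c1->usec >= 1000000) {
        c1->usec -= 1000000;
        c1->sec++;
    }
    c1->sec += c2->sec;
}

/* Seconds first: the carry is applied once, after both fields
 * have been summed, so aliasing is harmless. */
void clock_add_seconds_first(struct clock *c1, const struct clock *c2)
{
    c1->sec += c2->sec;
    c1->usec += c2->usec;
    if (c1->usec >= 1000000) {
        c1->usec -= 1000000;
        c1->sec++;
    }
}
```

With a = {2, 800000}, the first version yields {6, 600000} for a self-add and the second yields {5, 600000}, which is the correct sum of 2.8 s and 2.8 s.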
Simon Wilkinson [Sat, 18 Jun 2011 10:35:30 +0000 (11:35 +0100)]
rx: Remove resending logic into its own function
Create a new function, rxi_Resend, which is the entry point to running
the transmit queue as a result of a resend event. This concentrates all
of the resend logic into one place, removes the need for
rxi_StartUnlocked, and means that rxi_Start's arguments don't need to
match those of an event handler.
Simon Wilkinson [Mon, 25 Oct 2010 09:14:12 +0000 (10:14 +0100)]
rx: Don't let timeouts force fast recovery
The current RX implementation goes into fast recovery whenever a
timeout occurs. This is incredibly wasteful, particularly on fast
connections. So, remove this in favour of TCP style behaviour.
Simon Wilkinson [Mon, 25 Oct 2010 08:16:09 +0000 (09:16 +0100)]
rx: Fix resend accounting
rxi_Start flagged itself as 'resending' whenever it flushed the
transmit queue due to a resend event. However, it would flush the
entire transmit queue at this point, rather than only transmitting
packets that require a resend. When running with large window sizes
this results in a large number of packets erroneously being marked
as resent.
Instead, let SendXmitList decide whether a packet is being
retransmitted by using the presence of a serial number. This takes
advantage of the fact that a retransmitted packet must be the only
entry in a packet list - we just flag the packet list, instead of
having to maintain counters for each individual packet.
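As a rough sketch of the serial number test: the struct, the function name, and the convention that a serial of 0 means "never sent" are all assumptions made for illustration, not the actual rx_packet layout.

```c
#include <stddef.h>

/* Hypothetical packet record; serial == 0 is assumed to mean that
 * the packet has never been put on the wire. */
struct pkt {
    unsigned seq;
    unsigned serial;
};

/* A retransmitted packet is the only entry in its list, so looking
 * at the first entry is enough to flag the whole list. */
int list_is_resend(const struct pkt *list, size_t n)
{
    return n > 0 && list[0].serial != 0;
}
```

A list holding one previously sent packet is flagged as a resend, while a list of fresh packets flushed during the same queue run is not.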
Jeffrey Altman [Tue, 12 Oct 2010 14:53:43 +0000 (10:53 -0400)]
Rx: Consolidate wait for tq busy and make its use uniform
rxi_WaitforTQBusy() is now used wherever a wait for the transmit
queue is required. It returns either when the transmit queue is
no longer busy or when the call enters an error state.
Having made this change it is clear that call->currentPacket is
not always validated when the call->lock is reacquired which may be
true when rxi_WaitforTQBusy() is called.
Simon Wilkinson [Sat, 18 Jun 2011 09:46:53 +0000 (10:46 +0100)]
rx: Change the way that the RTT timer is applied
RX maintains a retryTime for every packet that it has transmitted,
which is held as the time that that packet was sent, plus the smoothed
RTT of the connection. If a packet is in the queue with a retryTime
older than the current time, then it is resent at the first opportunity.
In some circumstances, this first opportunity will be as a result of
the resend event timer expiring, in others it will happen as part of
a normal queue run.
There are a number of problems with this approach on congested networks.
Firstly, on a network with a large window size, which is in "normal"
flow, it means that we will never actually perform fast retransmit as
the timeout for this packet will have expired before we have received
any further ACKs. This is because, on a network with a relatively stable
RTT the ACK for packet n+1, n+2, or n+3 cannot arrive before the
expected time of arrival of the ACK for packet n. As we retry
immediately this expected time of arrival has passed, we never have the
opportunity of using these later ACKs to learn that packet n is lost.
Secondly, the fact that we may resend packets from a "normal" queue run,
rather than as a result of a resend event, means that there is no clear
entry point for resends. As resends should be assumed to be a result of
network congestion, and result in both the call throttling back, and the
RTT being increased, this lack of a clean entry point makes things
tricky.
As a solution, this patch changes the way in which retransmit times are
applied to use the algorithm described in RFC2988.
*) Whenever we send a new packet, we start a timer for the current call
rto value if one isn't already running.
*) Whenever we receive an ACK that acknowledges new data, and we have
packets that are sent but not yet acknowledged, we restart the
retransmit timer using the current rto value.
This algorithm solves the first problem, as it means that if the
connection is still flowing, we will continue to receive ACKs, and we
can enter fast retransmit.
In implementation terms, we no longer track a retryTime per packet, and
instead simply record if a packet has been sent or not. Packets which
have been sent may only be resent as a result of a resend timer
expiring, or of entering fast retransmit, so solving the second issue.
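The two timer rules above can be sketched as follows. The names and the millisecond clock are hypothetical. Stopping the timer once nothing is outstanding follows RFC2988 and is an assumption about RX here, since the text above does not state it.

```c
/* Hypothetical per-call retransmit timer. */
struct rexmit_timer {
    int running;                /* non-zero while the timer is armed */
    unsigned long expires_ms;   /* absolute expiry time */
    unsigned rto_ms;            /* current call rto value */
    unsigned unacked;           /* packets sent but not yet acknowledged */
};

/* Rule 1: sending a new packet starts the timer only if it is idle. */
void timer_on_send(struct rexmit_timer *t, unsigned long now_ms)
{
    t->unacked++;
    if (!t->running) {
        t->running = 1;
        t->expires_ms = now_ms + t->rto_ms;
    }
}

/* Rule 2: an ACK for new data restarts the timer while packets remain
 * outstanding; with nothing outstanding the timer is stopped. */
void timer_on_new_ack(struct rexmit_timer *t, unsigned long now_ms,
                      unsigned newly_acked)
{
    t->unacked = (newly_acked >= t->unacked) ? 0 : t->unacked - newly_acked;
    if (t->unacked > 0) {
        t->running = 1;
        t->expires_ms = now_ms + t->rto_ms;
    } else {
        t->running = 0;
    }
}
```

Because every new ACK pushes the expiry forward, a connection that is still flowing keeps the timer from firing and leaves room for fast retransmit to detect the loss first.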
Simon Wilkinson [Fri, 17 Jun 2011 21:06:54 +0000 (22:06 +0100)]
rx: Compute smoothed RTT per call, not per peer.
RX uses the TCP RTT smoothing algorithm as described in RFC2988.
However, the TCP algorithm is designed to accept samples from a
single connection, accepting a new sample once per RTT.
RFC2988 suggests that "when multiple samples are taken
per RTT the [algorithm] may keep an inadequate RTT history."
In RX's implementation, we use a single instance of this algorithm
per peer, and input all of the samples from all of the active calls
and connections into this same instance. This leads to us taking
a significantly (potentially many orders of magnitude) larger number of samples
per RTT, and rapidly losing the RTT history. With RX's implementation,
short lived network events may easily bias the RTT, and cause large
numbers of packets to time out.
This change fixes this by moving the RTT calculation onto a per call
basis. We still update the peer with our calculated value, so that new
calls may be created with an RTT corresponding to the current value for
the connection, rather than having to start high and converge downwards.
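A sketch of the RFC2988 estimator kept per call, with the peer used only to seed new calls. The struct and function names are hypothetical, times are doubles in milliseconds for readability, and copying the whole estimator to and from the peer is an assumption about how the seeding is done.

```c
/* Hypothetical RTT estimator state. */
struct rtt_est {
    double srtt_ms;     /* smoothed RTT */
    double rttvar_ms;   /* RTT variation */
    int have_sample;    /* zero until the first measurement */
};

/* RFC2988 update with alpha = 1/8 and beta = 1/4. */
void rtt_sample(struct rtt_est *e, double r_ms)
{
    double err;

    if (!e->have_sample) {
        e->srtt_ms = r_ms;
        e->rttvar_ms = r_ms / 2;
        e->have_sample = 1;
        return;
    }
    err = e->srtt_ms - r_ms;
    if (err < 0)
        err = -err;
    e->rttvar_ms = 0.75 * e->rttvar_ms + 0.25 * err;
    e->srtt_ms = 0.875 * e->srtt_ms + 0.125 * r_ms;
}

/* A new call starts from the peer's last published estimate. */
struct rtt_est rtt_new_call(const struct rtt_est *peer)
{
    return *peer;
}

/* A call publishes its estimate so that later calls start near it. */
void rtt_publish(struct rtt_est *peer, const struct rtt_est *call)
{
    *peer = *call;
}
```

Two calls seeded from the same peer then evolve independently: a burst of slow samples on one call no longer overwrites the history held by the other.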
Simon Wilkinson [Sun, 5 Jun 2011 10:04:12 +0000 (11:04 +0100)]
rx: Reorganise transmit queue walk
The transmit queue is stored in the order that we transmitted the
packets (by sequence number). This means that we can do all of the
ACK processing by just doing a single walk of this queue, rather
than having to walk the queue multiple times, once for each type of
ACK.
This clarifies the queue processing, and should reduce the amount of
time that we spend iterating large transmit queues.
Jeffrey Altman [Sun, 5 Jun 2011 22:41:24 +0000 (18:41 -0400)]
rx: Add RX_CALL_ACKALL_SENT flag and rxi_SendAck processing
3cd3715e608b801b4848399e42cb47464e6e3cc3 modified rxi_ReceiveDataPacket
to send an ACKALL whenever RX_CALL_RECEIVE_DONE is set on the call.
This produced the potential for a race with ACKs that set the
firstPacket value to 'rnext' when the receive queue for the call
has yet to be emptied. From the perspective of the receiver the ACK
was already processed and does not require a response since the
previously received ACKALL acknowledged the delivery of all data
packets to the application. When sending ACKs after ACKALL it is
therefore required that firstPacket be set to the sequence number
after the last unprocessed packet in the receive queue.
Thanks to Simon Wilkinson for his extensive assistance in identifying
the problem and the development of this patchset.
Jeffrey Altman [Sun, 5 Jun 2011 20:02:46 +0000 (16:02 -0400)]
rx: do not rxi_AckAll for one data packet call
rxi_ReceiveDataPacket() calls rxi_AckAll() when the call reaches
the RX_CALL_RECEIVE_DONE state to permit the caller to empty the
transmit queue. That reduces the memory consumption of the caller
and avoids unnecessary retransmits while the call is in progress.
If the call data consists of a single packet it is possible that
Ping ACK packets sent as part of connection establishment could
race with the ACKALL and be delivered out of order. If the Ping
ACK is delivered second, it will be ignored by the peer forcing
a two second delay in connection establishment. To avoid the race
do not send an ACKALL for a single packet call.
Simon Wilkinson [Sat, 14 May 2011 07:55:50 +0000 (08:55 +0100)]
rx: Reverse the consumption order of idle queue
Currently, the rx server thread idle queue is used in an LRU manner.
This means that we round robin requests between all of the threads
configured on a given system, which means that we end up thrashing
CPU caches on machines whose workload doesn't require that all of
the configured threads be used.
Change this so that we always use the most recently idle thread. This
isn't as "fair" to all of our waiting threads, but should mean that we
scale better on SMP machines, as a thread that is recently idle is
likely to have been recently scheduled.
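The change in consumption order amounts to treating the idle queue as a stack. The sketch below uses hypothetical names and leaves out the locking and condition variables that a real server thread pool needs.

```c
#include <stddef.h>

/* Hypothetical idle server thread record. */
struct worker {
    int id;
    struct worker *next;
};

/* Idle threads kept as a stack: the most recently idle is on top. */
struct idle_stack {
    struct worker *top;
};

/* Called when a thread finishes a request and becomes idle. */
void idle_push(struct idle_stack *s, struct worker *w)
{
    w->next = s->top;
    s->top = w;
}

/* Called when a request arrives; returns NULL if no thread is idle. */
struct worker *idle_pop(struct idle_stack *s)
{
    struct worker *w = s->top;

    if (w != NULL) {
        s->top = w->next;
        w->next = NULL;
    }
    return w;
}
```

Under a light load the same thread is popped and pushed repeatedly, so the remaining threads stay parked instead of taking turns and evicting each other's cached state.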
Simon Wilkinson [Fri, 17 Jun 2011 19:35:59 +0000 (20:35 +0100)]
rx: Remove incorrect backoff code
The ACK packet handling routine contains code which causes the
RTT to back off if the selective ACK response indicates that there is
a missing packet. The comment justifies this code as being in line
with Phil Karn's work on TCP.
However, the TCP behaviour is that we back off when we enter resend. Both
TCP and RX have difficulty computing RTTs for resent packets due to the
ambiguous ACK problem. Whilst RX is slightly better than TCP in this
regard, we can't always tell whether an ACK refers to the original, or
resent packet, so resent packets are unable to contribute to the RTT.
This means that if the RTT ends up too low for the connection, and we
start resending every packet, the RTT will never grow to account for
this, as we never feed it any packet samples.
Karn's solution to this was to back off (double) the RTT value when we
resend a packet, and then to not drop it back down until we receive an
ACK that we can count. This means that we will always get a new sample
for the connection, and the RTT will grow again.
The original author confirms that the current behaviour in RX is
incorrect, so simply remove it with this patchset.
Simon Wilkinson [Fri, 17 Jun 2011 18:38:29 +0000 (19:38 +0100)]
rx: Account for delayed ACKS when computing RTO
RX currently only soft ACKs every second packet, therefore a soft ACK
may be delayed by a period of time (currently 100ms, although RX did
expose this as a public variable in earlier versions).
RTT values are computed using only non-delayed ACKs, so the timeout
is a smoothed average of the exact time taken to send and directly
ACK a packet. Therefore, if the peer ends up using a delayed ACK for
the packet, using just the RTT will cause that packet to be timed out.
A while ago, this was dealt with by padding the calculated RTT with an
additional 350ms. This was then removed, and changed to a 350ms minimum
value. When this caused large numbers of spurious resends, the padding
was restored, but with a 20ms default value. As noted above, 20ms is
too low, as we may wait for up to 100ms before sending an ACK.
This patch changes minPeerTimeout so that it does what it says on
the tin - sets a minimum value below which the peer timeout may not
fall. It then adds to either this value, or the calculated one, 200ms
of padding. This makes our padding identical to TCP's, and allows some
future leeway as to the softAckDelay value.
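The resulting timeout calculation can be sketched as below. The function and macro names are hypothetical and times are plain milliseconds; only the 200ms padding and the "minimum, then pad" order come from the description above.

```c
/* Padding added on top of the timeout to cover delayed ACKs. */
#define ACK_DELAY_PADDING_MS 200u

/* Clamp the calculated timeout to the configured minimum, then pad. */
unsigned padded_timeout_ms(unsigned calculated_ms,
                           unsigned min_peer_timeout_ms)
{
    unsigned base = calculated_ms;

    if (base < min_peer_timeout_ms)
        base = min_peer_timeout_ms;
    return base + ACK_DELAY_PADDING_MS;
}
```

For a 5ms calculated value and a 20ms minimum this gives 220ms, comfortably above the 100ms soft ACK delay, whereas the earlier 20ms padding would have given 25ms and timed the packet out before a delayed ACK could arrive.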
Simon Wilkinson [Fri, 17 Jun 2011 18:12:09 +0000 (19:12 +0100)]
rx: Make rx_softAckDelay & rx_lastAckDelay private
The values of these two parameters directly affect the modifiers
that are needed in the peer's RTT calculations, and so can not
arbitrarily be changed by applications.
lastAckDelay has been 400ms since the first OpenAFS release, and
that value is used as a modifier when computing the timeout of the
last packet. It is likely that any change which made this value
longer than 400ms would have detrimental effects on deployed clients.
softAckDelay has been 100ms for a similar time period. We have
chopped and changed the value of minPeerTimeout, so it is unclear
what the maximal value for this parameter is. For much of OpenAFS's
life, minPeerTimeout was a 350ms padding value, which suggests that
copying TCP, and setting the maximal value at 200ms would be a safe
option. For now, however, leave it at 100ms to avoid unexpected
side effects.
hardAckDelay is not addressed by this patch set, as all ACK packets
sent from the application thread are marked as delayed, and so
currently have no part in computing RTT times. It is likely, however,
that any changes to the hard ACK timeout should be very carefully
considered.
Jeffrey Altman [Mon, 27 Jun 2011 13:31:54 +0000 (09:31 -0400)]
Windows: MergeStatus before SyncOpDone
cm_SyncOp/cm_SyncOpDone is used to synchronize the RPC processing
to ensure that calls which are in conflict cannot occur at the
same time but also to ensure that the ordering of operations
is consistent. cm_MergeStatus() was in many cases executed after
cm_SyncOpDone() removed the synchronization barrier which in turn
permitted status information to be applied out of order. Side
effects could have included data loss due to client side file
truncation. More commonly two StoreData RPCs would have their
status information applied out of order forcing the cache manager
to invalidate all of the cached data for the file.
Jeffrey Altman [Thu, 23 Jun 2011 21:51:22 +0000 (17:51 -0400)]
Windows: TRANS2_FIND_FIRST2 for _._AFS_IOCTL_._
smb_T2SearchDirSingle() must not fail directory search requests
for the _._AFS_IOCTL_._ file. Although this file does not actually
exist, it is successfully processed by CreateFile operations.
Therefore, an explicit search for it should return a valid answer.
Jeffrey Altman [Fri, 24 Jun 2011 03:49:32 +0000 (23:49 -0400)]
Windows: Fix SMB_COM_NEGOTIATE for MS11-043
MS11-043 adds response validation for SMB_COM_NEGOTIATE messages
received by the SMB Redirector. OpenAFS failed to properly specify
a Challenge and DomainName in the response when the security mode
is SMB_AUTH_NONE (or share with password). This patchset corrects
smb_ReceiveNegotiate() so that it adheres to the protocol specification.
Jeffrey Altman [Wed, 8 Jun 2011 06:22:41 +0000 (02:22 -0400)]
Windows: shell extension is multithreaded
Since the shell extension is multithreaded and it is possible
for more than one thread to be executing in the gui2fs.cpp module
at a time, it is not safe to use a single static 'space' buffer
by more than one thread at a time. Move the buffer into the
stack of each function that uses it so that we have thread safety.
Ben Kaduk [Wed, 30 Mar 2011 02:26:50 +0000 (22:26 -0400)]
Unbreak make dest for FBSD
It turns out that we do need an afs.rc.fbsd that is set up for
transarc paths in this directory. To get it to work properly
will require the user to symlink to it from a dir that gets
checked by rcorder, but them's the breaks.
Ben Kaduk [Fri, 17 Jun 2011 06:22:34 +0000 (02:22 -0400)]
FBSD: do not FlushAllVCaches
In normal operation, any AFS vcache with associated data will have
an associated vnode, which will be on the list of vnodes associated
with the /afs mountpoint. We already call FreeBSD's vflush() in
our afs_unmount, which walks the list of vnodes associated with the
mountpoint and calls vgonel() on them, which calls VOP_CLOSE and
VOP_RECLAIM on the vnode. Our implementation of VOP_RECLAIM already
calls FlushVCache, so in normal operation, FlushAllVCaches() will
be a no-op.
However, in the presence of bugs, it is actively harmful, causing
panics. For example, if a vnode has been reclaimed but FlushVCache
failed (which we cannot report back since the VFS will panic in this
case), and we attempt to flush it again, the associated vnode has
already been cleaned up and we will panic. Likewise if our list of
vcaches becomes corrupt and has a vcache with bad or missing vnode
for some other reason, we will panic.
Since there is no gain in normal operation and abnormal operation
is more likely to panic than save data, skip the extra flush.
Ben Kaduk [Tue, 7 Jun 2011 15:30:18 +0000 (11:30 -0400)]
Also install afszcm.cat for i386_fbsd
The change gerrit/4760 enabled the use of gencat to actually build
this file, but failed to also change installation logic, so it was
sitting unused in the build tree. Fix this, and install the file.
This allows us to remove a shell case statement which had formerly
been needed to enforce this restriction.
configure should attempt to find the XML tools we need to process
the documentation. If it can't, it should provide a safe default.
Still allow the user to override via command line.
Reviewed-on: http://gerrit.openafs.org/4766 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Simon Wilkinson <sxw@inf.ed.ac.uk> Reviewed-by: Derrick Brashear <shadow@dementia.org>
(cherry picked from commit cc2bc3e17ff5f7a10c515e309f8fec47a6fa14b6)
Jeff Blaine [Fri, 27 May 2011 19:49:52 +0000 (15:49 -0400)]
kvno invocation correction, language cleanup, afs/cell principal preferred
Properly show kvno command syntax, add information about preferring
'afs/cell' for the principal over 'afs', and changed "noted this down"
to "made note of"
Reviewed-on: http://gerrit.openafs.org/4740 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Simon Wilkinson <sxw@inf.ed.ac.uk> Reviewed-by: Derrick Brashear <shadow@dementia.org>
(cherry picked from commit 07f461e8e35147af605ebc86c139b31d2db0bb28)
Simon Wilkinson [Tue, 31 May 2011 07:31:55 +0000 (08:31 +0100)]
vos: print_addrs never receives multi-homed addrs
The magic address that tells the vlserver that a host is multi-homed,
and to look up the multi-homed address structure is an internal
implementation feature, which shouldn't be exposed to clients.
print_addrs is only ever called with the results of VL_GetAddrsU, which
has already converted any multi-homed pointers, so it doesn't need the
logic to handle them itself.
Michael Meffie [Fri, 24 Sep 2010 01:18:36 +0000 (21:18 -0400)]
xstat: cope with different size timeval structures
In xstat_fs_test and afsmonitor, try to display the xstat data
from the fileserver even if the fileserver has differently sized
timeval structures, or different word ordering, than the xstat
client program.
linux: rpm: Fix SELinux attributes on /afs when installing openafs-client package
Since the directory /afs isn't included in the package manifest, but
rather created in a script in the openafs-client package, it never
gets the appropriate SELinux attributes that are required to mount a
volume (mnt_t).
This change fixes the problem by running '/sbin/restorecon' (if it is
an executable that exists) on the /afs directory after the
openafs-client package is installed, right after the directory is
created.
Reviewed-on: http://gerrit.openafs.org/4763 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Simon Wilkinson <sxw@inf.ed.ac.uk> Reviewed-by: Derrick Brashear <shadow@dementia.org>
(cherry picked from commit b3232b2cb44a3df02a37efd852ecfef2f3a9e5cc)
Ben Kaduk [Tue, 31 May 2011 19:25:35 +0000 (15:25 -0400)]
Enable gencat for i386_fbsd_*
The machines certainly have a /usr/bin/gencat, and I see nothing
in history to indicate a reason for this prevention.
Allow the 32-bit machines to build afszcm.cat and make packaging
more uniform between architectures.
Reviewed-on: http://gerrit.openafs.org/4760 Reviewed-by: Simon Wilkinson <sxw@inf.ed.ac.uk> Reviewed-by: Derrick Brashear <shadow@dementia.org> Tested-by: Derrick Brashear <shadow@dementia.org>
(cherry picked from commit 55a41d00057106913ce2aba50772a56bc994a9a4)
Andrew Deason [Thu, 17 Mar 2011 21:32:00 +0000 (16:32 -0500)]
libafs: Do not osi_FlushPages for dirs
Directory contents are never mapped or stored in pages, so dealing
with page invalidation on directories is just overhead. So make
osi_FlushPages a no-op when we're given a directory, which can avoid a
lot of locks and other processing (particularly when we are called in
afs_getattr in BOZONLOCK_ENV).
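The early return described above can be sketched as follows. This is a minimal illustration, not the libafs code: the types `vtype_sketch` and `vcache_sketch`, the `flush_count` field, and the function `flush_pages` are hypothetical stand-ins.

```c
/* Hypothetical vnode types; only the directory/non-directory split matters. */
enum vtype_sketch { SK_VREG, SK_VDIR };

/* Hypothetical cache entry. flush_count records how often the expensive
 * path ran, standing in for the locking and page invalidation work. */
struct vcache_sketch {
    enum vtype_sketch type;
    int flush_count;
};

/* Returns immediately for directories, since their contents are never
 * mapped or stored in pages. */
static void flush_pages(struct vcache_sketch *vc)
{
    if (vc->type == SK_VDIR)
        return;
    vc->flush_count++;
}
```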
Christof Hanke [Wed, 25 May 2011 20:16:59 +0000 (22:16 +0200)]
autoconf: add test for typedef'd structs
AC_CHECK_LINUX_STRUCT does not work for structs which are typedef'd.
gcc will complain with "error: storage size of ‘_test’ isn’t known"
and fail the test.
Thus add a new test macro, AC_CHECK_LINUX_TYPED_STRUCT.
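The failure mode can be reproduced outside autoconf. The sketch below is only an illustration, assuming a struct that has a typedef name but no struct tag; `sample_t` and `probe` are hypothetical names and are not part of the macro.

```c
/* A struct that is only reachable through its typedef name. */
typedef struct {
    int field;
} sample_t;

/* Writing "struct sample_t _test;" here would not compile: no struct tag
 * named sample_t exists, so the type is incomplete and its storage size
 * is unknown. Declaring the variable with the typedef name works. */
static int probe(void)
{
    sample_t _test;

    _test.field = 1;
    return _test.field == 1;
}
```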
Ben Kaduk [Thu, 26 May 2011 05:11:14 +0000 (01:11 -0400)]
FBSD: VIMAGE support
Starting in FreeBSD 8.0, there is support for multiple virtual
network stacks (generally to be exposed to separate jail(8) environments).
It is enabled as a kernel configuration option, so our builds against
GENERIC have not failed, but we fail to build when options VIMAGE
is present. Fix our variable references accordingly.
Submitted-by: Hiroki Sato of freebsd.org
Reviewed-on: http://gerrit.openafs.org/4721 Reviewed-by: Derrick Brashear <shadow@dementia.org> Tested-by: BuildBot <buildbot@rampaginggeek.com>
(cherry picked from commit 9703b023cc0f5088eab5135acf7417e90ebbb2cd)
Derrick Brashear [Wed, 25 May 2011 19:31:40 +0000 (15:31 -0400)]
macos: disable bulkstat
1.6 only change. there's still an issue where multiple contexts can
reference a vnode which needs to be finalized; the fixup is successful,
but nothing hints to other threads that they must re-reference the vnode
before proceeding (there is no actual troublesome access while waiting
for the fixup, as the vnode will not have actually been CStatd yet)
Derrick Brashear [Tue, 24 May 2011 18:36:04 +0000 (14:36 -0400)]
des: generated files should not require objects needed in libdes
1.6 only change, since DES is dead. don't require the same misc.o
both in libdes and when building the generated files, as the
make dependencies then throw away valid input.
Replace uintptr_t type cast with uintptrsz in afs_vcache.c
A recent change (commit 80fe111f0044aa7a67215ad92210dc72cb7eb2c0)
to afs_vcache.c contains a call to afs_warn() whose second parameter
contains a "(uintptr_t)" type cast as part of a double type cast.
This presents an issue on some systems, such as OpenBSD, where this
object type is defined in a header that is not presently included.
This change modifies that type cast to instead use the AFS-internal
"(uintptrsz)" type which should provide the same effect.
Note that an earlier version of this patch attempted to remove the
"offending" type cast as redundant but it was pointed out that some
systems require this kind of cascading type cast when casting pointers
to integers to deal with possible size issues.
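A cascading cast of this kind can be sketched as below. The definition of uintptrsz is not shown in this log, so `ptr_sized_int` is a hypothetical stand-in, defined here via `uintptr_t` on the assumption that `<stdint.h>` is available; `format_pointer` is likewise a hypothetical helper and not the afs_warn() call itself.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a pointer-sized integer type. */
typedef uintptr_t ptr_sized_int;

/* Formats a pointer as hex using a two-step cast: first to an integer of
 * pointer width, then to the wider type the format string expects. */
static int format_pointer(const void *p, char *buf, size_t len)
{
    return snprintf(buf, len, "0x%llx",
                    (unsigned long long)(ptr_sized_int)p);
}
```

The inner cast keeps the pointer-to-integer conversion at matching width; the outer cast only widens an integer, which is always well defined.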
Andrew Deason [Tue, 10 May 2011 19:16:06 +0000 (14:16 -0500)]
libafs: Flush vcaches in afs_shutdown
Currently, a few platforms (linux, linux24, solaris, irix) flush all
vcaches during shutdown. However, they do this before calling
afs_shutdown(), resulting in afs_FlushVCache queueing VCBs and
possibly trying to give the callbacks back to the server.
Instead of this, perform the flushes in afs_shutdown itself, so we do
this after we try to give up all callbacks to all servers, and we do
this while afs_shuttingdown is set, so we don't try to queue VCBs.
This also consolidates some of the duplicated code to flush all
vcaches, and now does this for all platforms.
Derrick Brashear [Fri, 20 May 2011 18:13:01 +0000 (14:13 -0400)]
macos: bulkstat redux
simplify the logic which can require sleeps in various vcache
resolution paths. instead of the two-pass system we had before,
just guess using the even/odd hack what type a vnode will be.
if a vnode turns out to be a link and thus we are wrong, we
do a fixup later. other callers who "race" with bulkstat
(which is a supported feature, otherwise you'd have to block
callbacks) will also call through a fixup to get the correct
backing vnode type. this is necessary as the KPI doesn't
let us change the type of a vnode after it's been created.
side effect: eliminate many of the ugly cases where we had been
sleeping waiting for a vnode to be finalized even before bulkstat.
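The even/odd guess can be sketched as follows, assuming the AFS convention that directories carry odd vnode numbers while files and symlinks carry even ones. The enum and the function `guess_type` are hypothetical names, not taken from the macos code.

```c
/* Hypothetical result of the guess. A symlink cannot be told apart from
 * a regular file by number alone, which is why a later fixup is needed. */
enum vnode_guess { GUESS_DIR, GUESS_FILE_OR_LINK };

/* Guesses the vnode type from the parity of the vnode number. */
static enum vnode_guess guess_type(unsigned int vnode_number)
{
    return (vnode_number & 1u) ? GUESS_DIR : GUESS_FILE_OR_LINK;
}
```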
Derrick Brashear [Fri, 20 May 2011 18:10:49 +0000 (14:10 -0400)]
dynroot: mark vnode types on dynroot vnodes
when we create a vnode using a dynroot fid, we weren't bothering
to update the type from the default (typically VREG); most
dynroot vnodes are actually VDIR...
Michael Meffie [Wed, 18 May 2011 17:42:27 +0000 (13:42 -0400)]
volinfo: fix -filenames option check
Fix the logic for checking the presence of the volinfo -filenames
option. The original patch inadvertently added the -filenames
check as an if-else clause to the -orphaned flag check, which
prevents filenames from being printed when listing orphaned
vnodes.
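The difference between chaining the two checks and keeping them independent can be sketched as below. This is an illustration only: `struct options`, its fields, and `format_vnode` are hypothetical and do not mirror the volinfo source.

```c
#include <stdio.h>

/* Hypothetical flags standing in for the -orphaned and -filenames options. */
struct options {
    int orphaned_only;
    int show_filenames;
};

/* Formats one vnode line into buf. Returns 0 if the vnode was filtered
 * out and 1 if a line was produced. */
static int format_vnode(const struct options *opts, int is_orphan,
                        const char *name, char *buf, size_t len)
{
    if (opts->orphaned_only && !is_orphan)
        return 0;   /* filtered out by the orphan check */

    /* Independent check, not an "else" branch of the test above, so
     * names are still printed when listing orphaned vnodes. */
    if (opts->show_filenames)
        snprintf(buf, len, "vnode %s", name);
    else
        snprintf(buf, len, "vnode");
    return 1;
}
```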
Andrew Deason [Thu, 19 May 2011 22:02:35 +0000 (17:02 -0500)]
SOLARIS: Reset syscalls on mod_install failure
If our call to mod_install fails for any reason (for example, if the
afs entry is missing from /etc/name_to_sysnum), we may still have set
the sysent structures for setgroups and ioctl to point at libafs code.
So calls to those syscalls will cause a panic, since the code they
point to is no longer loaded.
To avoid this, just reset the sysent entries back to what they were if
we fail to load, just like we do when unloading the module.
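The save-and-restore pattern can be sketched in isolation. Everything below is hypothetical: `struct slot` stands in for a sysent entry, and `load_module` and its `install` callback stand in for the module load path and mod_install.

```c
typedef int (*handler_fn)(void);

/* Hypothetical stand-in for one system call table entry. */
struct slot {
    handler_fn handler;
};

static int stock_handler(void) { return 1; }
static int module_handler(void) { return 2; }

/* Points the slot at module code, then runs the install step. If the
 * install fails, the saved handler is put back so the slot never refers
 * to code that is about to be unloaded. */
static int load_module(struct slot *s, int (*install)(void))
{
    handler_fn saved = s->handler;
    int code;

    s->handler = module_handler;
    code = install();
    if (code != 0)
        s->handler = saved;   /* same reset as on module unload */
    return code;
}

/* Hypothetical install steps used to exercise both outcomes. */
static int install_ok(void) { return 0; }
static int install_fail(void) { return -1; }
```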
rx: always use/protect the xdr routines in the kernel
This clears up some warnings about duplicate symbols with Solaris 11
since the Solaris kernel already has these routines. Since we never
use the stock kernel versions of the xdr routines, perhaps we should
always use/protect our versions of the symbols.
Reviewed-on: http://gerrit.openafs.org/4252 Reviewed-by: Andrew Deason <adeason@sinenomine.net> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@openafs.org>
(cherry picked from commit 8336d31ac5092a16cfb206707e69c19f07f99241)
Jeffrey Altman [Wed, 18 May 2011 17:51:53 +0000 (13:51 -0400)]
auth: fall back to afs3-vlserver for afs3-prserver
If the DNS SRV lookup is for afs3-prserver or afs3-kaserver,
fall back to a lookup for afs3-vlserver since those services
are traditionally hosted on the same machine as the vlserver.
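The mapping from a failed lookup to its retry target can be sketched as below. The service names come from the message above; the helper `srv_fallback` is hypothetical and performs no DNS queries itself.

```c
#include <string.h>

/* Returns the service name to retry with when the SRV lookup for
 * `service` finds nothing, or NULL if no fallback applies. */
static const char *srv_fallback(const char *service)
{
    if (strcmp(service, "afs3-prserver") == 0 ||
        strcmp(service, "afs3-kaserver") == 0)
        return "afs3-vlserver";
    return NULL;
}
```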
Marc Dionne [Sun, 15 May 2011 00:57:12 +0000 (20:57 -0400)]
Linux: fix reading files larger than the chunk size
Commit 2571b6285d5da8ef62ab38c3a938258ddd7bac4e fixed an issue with
the use of tmpfs as a disk cache and ftruncate() on files in AFS.
But it introduced a problem reading larger files as reported in
RT ticket 129880.
What should be compared against the current cache file size is the
offset into the current chunk, not the overall offset for the whole
file.
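The corrected comparison can be sketched as follows. The sketch assumes every chunk is exactly `chunk_size` bytes; the helpers `chunk_offset` and `beyond_cached_data` are hypothetical names, not the functions touched by this commit.

```c
#include <stdint.h>

/* Offset of a file position within its chunk, assuming fixed-size chunks. */
static uint64_t chunk_offset(uint64_t file_offset, uint64_t chunk_size)
{
    return file_offset % chunk_size;
}

/* Reports whether the requested position lies beyond the data currently
 * held in the cache file for this chunk. Comparing the whole-file offset
 * against cache_file_size would be wrong for every chunk after the first. */
static int beyond_cached_data(uint64_t file_offset, uint64_t chunk_size,
                              uint64_t cache_file_size)
{
    return chunk_offset(file_offset, chunk_size) >= cache_file_size;
}
```

With a 256-byte chunk and 100 bytes cached, file offset 300 maps to chunk offset 44 and is within the cached data, whereas a comparison using 300 directly would report it as out of range.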
Andrew Deason [Tue, 10 May 2011 17:54:53 +0000 (12:54 -0500)]
libafs: Do not write-lock afs_xserver on ICBS
Our RXAFSCB_InitCallBackState* handler currently write-locks
afs_xserver when it clears the SCAPS_KNOWN flag for the relevant
server. However, the afs_xserver lock is for protecting the global
list and hash table of server structures, and is not necessary to
acquire in order to modify the flags of an individual server struct.
For instance, CkSrv_GetCaps does not acquire any locks to modify the
server flags.
Taking this lock conflicts with a read lock on afs_xserver acquired by
afs_FlushVCBs when it traverses the list of server structures.
afs_FlushVCBs may contact a server that then calls InitCallBackState
on us, causing a deadlock if ICBS waits for the afs_xserver lock.
So, avoid locking afs_xserver in this case, to avoid that deadlock.
Reviewed-on: http://gerrit.openafs.org/4639 Tested-by: Andrew Deason <adeason@sinenomine.net> Reviewed-by: Derrick Brashear <shadow@dementia.org>
(cherry picked from commit ae638fa383b8270fe2461a2ad91b9101c74f3593)
Andrew Deason [Fri, 6 May 2011 18:12:17 +0000 (13:12 -0500)]
dasalvager: unlink fsstate.dat when standalone
If the DAFS salvager is running in a standalone mode, unlink the
fileserver's fsstate.dat file if any volumes change. Otherwise, volume
data could have changed and the fileserver will retain callback
promises for the data in those volumes until it tries to attach the
volume. This way, callbacks are broken via callback state
reinitialization.
A better solution is to record which volumes have changed, and the
fileserver can break callbacks for them on startup. But this at least
eliminates a regression from non-DAFS behavior.
Reviewed-on: http://gerrit.openafs.org/4638 Tested-by: Andrew Deason <adeason@sinenomine.net> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Derrick Brashear <shadow@dementia.org>
(cherry picked from commit 38efda16a2c5c9e74b5a23b5bdd2818a3353eec2)
Marc Dionne [Sat, 14 May 2011 17:19:52 +0000 (13:19 -0400)]
Linux: fix permission op test for certain compilers
Some compilers complain that _inode is used uninitialised here.
Since this test requires -Werror, it causes the test to fail
and our permission op to be used in RCU mode, leading to lockups.
Initialise it to make the compilers happy.
Fixes a lockup seen on kernels 2.6.38+ on Gentoo and Debian.
Jeffrey Altman [Mon, 9 May 2011 14:46:46 +0000 (10:46 -0400)]
Windows: always try afs/cell@USER-REALM first
In the KFW_AFS library, always try afs/cell@USER-REALM
first, even when KFW_AFS_klog() is called with an explicit
realm mapping for the cell. An afs service principal from
the user's realm is always preferred: it needs no cross-realm
authentication and, if the realm is AD, makes it possible to avoid
the inclusion of a PAC.
Jeffrey Altman [Fri, 6 May 2011 13:49:52 +0000 (09:49 -0400)]
Windows: replace CYGWIN envvar with CYGWINDIR
The environment variable CYGWIN (starting with cygwin 1.7.1) is
now used by Cygwin to set configuration parameters for the cygwin
runtime library. OpenAFS used it to indicate the location of the
Cygwin install directory. Since there is a conflict, rename CYGWIN
to CYGWINDIR.