Jeffrey Altman [Mon, 21 Nov 2011 18:14:40 +0000 (13:14 -0500)]
Windows: cm_GetSCache do not release unheld lock
if cm_GetNewSCache() fails, an attempt would be made to
release cm_scacheLock which is not held. However, it should
be noted that cm_GetNewSCache() cannot fail without itself
triggering a panic.
Jeffrey Altman [Tue, 15 Nov 2011 23:35:26 +0000 (18:35 -0500)]
Windows: buf_CleanAsyncLocked dirty range only
buf_CleanAsyncLocked() should not instruct cm_BufWrite() to
write a full chunk if the current buffer is the only one that
is dirty. cm_BufWrite() will determine if it is appropriate
to fill a full chunk when storing. Instructing it to check
a full chunk forces it to do more work than necessary.
Jeffrey Altman [Wed, 16 Nov 2011 00:00:05 +0000 (19:00 -0500)]
Windows: cm_SetupStoreBIOD use firstModOffset chunk
When cm_SetupStoreBIOD attempts to store a chunk to the file
server it should not use *inOffsetp as the start of the range.
There is no guarantee that the buffer at *inOffsetp is dirty.
Instead use firstModOffset which refers to the first known
dirty buffer in the range specified by the caller. Attempt
to fill a chunk of consecutive dirty buffers from that point.
smb_ReceiveNTCreateX() calls cm_CheckNTOpen() which now
requires the smb_fid_t allocated fid value for use in share
mode locking. Move the allocation of the smb_fid earlier
in the function and apply necessary cleanup in error paths.
Jeffrey Altman [Sat, 12 Nov 2011 18:41:30 +0000 (13:41 -0500)]
Windows: fix locking hierarchy in service
The smb username lock and the daemon global lock can be requested
while the scache dirlock is held if there are no free buffers
and the service is forced to claw back extents from the redirector.
Adjust the locking hierarchy accordingly.
Jeffrey Altman [Sun, 28 Aug 2011 16:03:53 +0000 (12:03 -0400)]
Windows: afslogon network provider debug registry value
create a new TransarcAFSDaemon\NetworkProvider "Debug" value
to be used for activating the network provider debugging.
The overlapping use of TransarcAFSDaemon\Parameters "TraceOption"
is just too confusing.
Jeffrey Altman [Fri, 26 Aug 2011 17:57:15 +0000 (13:57 -0400)]
Windows: afslogon.dll is not a file system interface
Do not return a file system network type that corresponds
to a real file system inter since afslogon is in fact not
associated with a file system interface. We can't return
WNNC_NET_NONE (0) because that prevents NPLogonNotify()
from being executed. However, if we return an in use
file system value that can confuse the system when the
actual file system's network provider is also installed.
Jeffrey Altman [Fri, 26 Aug 2011 13:36:04 +0000 (09:36 -0400)]
Windows: torture error reporting
When LeaveThread() is called and GetLastError() has already
been called, pass the last error value to LeaveThread(). Otherwise,
the GetLastError() call in LeaveThread() may return an inaccurrate
result.
Jeffrey Altman [Tue, 23 Aug 2011 20:02:28 +0000 (16:02 -0400)]
Windows: change buf_Find*() signature to accept cm_fid_t
The buf_Find*() functions require a cm_fid_t to match with the
cm_buf_t objects not a cm_scache_t. Change the signature so
that the cm_scache_t is not required. It should be possible to
search for a buffer even if the cm_scache_t is not present in
the cache.
Jeffrey Altman [Fri, 19 Aug 2011 01:57:12 +0000 (21:57 -0400)]
Windows: be explicit when mapping sharing violation
Only one lock acquistion failure should be mapping to
CM_ERROR_SHARING_VIOLATION. That is CM_ERROR_LOCK_NOT_GRANTED.
Make it clear that is what we are doing.
Jeffrey Altman [Tue, 9 Aug 2011 18:26:33 +0000 (14:26 -0400)]
Windows: avoid duplicate volume update queries
If multiple volume update queries have stacked up in
cm_UpdateVolumeLocation() and the active query failed,
do not re-issued the blocked queries. Instead, prevent new
queries for 60 seconds and fail those that blocked during
the active query.
Andrew Deason [Fri, 24 Feb 2012 00:28:21 +0000 (18:28 -0600)]
Rewrite make_h_tree.pl in shell script
The current usage of make_h_tree.pl adds a build requirement of
/usr/bin/perl that we did not have prior to commit 1d6593e952ce82c778b1cd6e40c6e22ec756daf1. Do the same thing in a
bourne shell script instead, so we don't need perl.
Note that this is not as generalized as make_h_tree.pl, but it doesn't
need to be. Specifically, this does not strip a leading ../ from found
include directives (nothing in the tree that includes h/* files uses
this), and header filenames containing whitespace almost certainly do
not work correctly.
The h => sys mapping is also much more hardcoded, but that's all we
were using this for anyway.
Jeffrey Altman [Thu, 23 Feb 2012 14:31:31 +0000 (06:31 -0800)]
Windows: do not bugcheck in AFSExAllocatePoolWithTag
If the Bug Check flag is set, the call to AFSBreakPoint() in
AFSExAllocatePoolWithTag() will trigger. There is no need for
an explicit bug check test in AFSExAllocatePoolWithTag().
If AFSExAllocatePoolWithTag() returns NULL there is no need
to ASSERT() the return value since AFSBreakPoint() would already
have been called to signal a debugger.
Turning on BugCheck by default was a good idea because we needed
to track down the cause of exceptions that were otherwise being
thrown resulting in resource leaks. However, it is a bad idea
because it results in out of memory conditions throwing bug checks
that result in a BSOD.
Andrew Deason [Fri, 24 Feb 2012 00:28:21 +0000 (18:28 -0600)]
Rewrite make_h_tree.pl in shell script
The current usage of make_h_tree.pl adds a build requirement of
/usr/bin/perl that we did not have prior to commit 1d6593e952ce82c778b1cd6e40c6e22ec756daf1. Do the same thing in a
bourne shell script instead, so we don't need perl.
Note that this is not as generalized as make_h_tree.pl, but it doesn't
need to be. Specifically, this does not strip a leading ../ from found
include directives (nothing in the tree that includes h/* files uses
this), and header filenames containing whitespace almost certainly do
not work correctly.
The h => sys mapping is also much more hardcoded, but that's all we
were using this for anyway.
Andrew Deason [Thu, 23 Feb 2012 19:02:13 +0000 (13:02 -0600)]
salvager: Do not fork for single VG salvage
Currently we always fork a child in the salvager in order to salvage a
volume group. I believe this is in order to protect SEGV, exit(), etc
in one salvage operation from preventing salvaging anything else. When
salvaging a single volume group, though, there appears to be little
benefit.
In addition, we need to keep the VG salvaging code in the same process
as the cleanup code for single-volume salvages, so we can know which
volumes were deleted by SalvageVolumeGroup, so we know which volumes
to bring back online. So, do not fork for the singleVolumeNumber case.
Note that for DAFS, we already never fork for the entire salvage
operation when salvaging an individual volume group. So, this is
effectively a non-DAFS-only change.
Andrew Deason [Wed, 22 Feb 2012 00:05:32 +0000 (18:05 -0600)]
salvager: Remove VolumeSummary->fileName
The 'fileName' field in VolumeSummary serves two apparent purposes:
- Storing the filename of the volume header file (V0XXX.vol).
- Indicating whether or not a given VolumeSummary object is
referenced by any inodes on disk. fileName is set by
AskVolumeSummary/GetVolumeSummary, and is cleared in
SalvageFileSys1 when a matching inodeSummary entry is found.
This is very confusing. The first purpose is completely unnecessary;
we can always calculate the filename from the volume id for the
volume, and we already enforce the filename to be of that specific
format. The second purpose is very unclear in the current code, and
overloads the meaning of the field.
So instead, remove fileName entirely. Code that was using it to locate
the header file are changed to use VolumeExternalName_r. Code that was
using the field to determine if the volume is "unused" is changed to
use a field just called "unused", set to 0 or 1.
Andrew Deason [Tue, 21 Feb 2012 23:46:41 +0000 (17:46 -0600)]
salvager: Do not require MaybeZapVolume fileName
In MaybeZapVolume, currently we do not remove the volume header if the
given isp->volSummary->fileName is not set. This effectively means
that we only actually "zap" volumes for which we have just created the
header, or which are not referenced by any inodes.
For readonly volumes that have errors, we want to delete the volumes
instead of salvaging. Readonly volumes with valid headers will have
fileName as NULL, though (set back in SalvageFileSys1), so
MaybeZapVolume will refuse to remove them. What ends up happening is
that the headers will stay around, but since we do not finish checking
the volume, all of the inodes for the data in the volume will be
dec'd. This results in a volume whose header exists, but none of its
inodes (including special inodes) exist, so the volume will need to be
salvaged again, and during that salvage will be deleted (because there
are no inodes for the volume).
Avoid all this, and just delete volume headers for volumes that lack a
valid fileName. Instead try to avoid deleting headers with
volSummary->deleted set, just so we don't try to delete the same
headers twice.
Andrew Deason [Tue, 21 Feb 2012 23:40:46 +0000 (17:40 -0600)]
salvager: Do not set fileName on header fixup
Currently, SalvageVolumeHeaderFile will set isp->volSummary->fileName
to a new string whenever the volume header needs to be created or
re-written. When control reaches back to SalvageFileSys1, this can
cause DeleteExtraVolumeHeaderFile to delete the header, since
vsp->fileName is used as a sort of indicator to see whether or not a
volume has been referenced by the inode summary.
When we create a new header, we avoid this because we allocate a new
VolumeSummary struct, which is not caught by the last
DeleteExtraVolumeHeaderFile for loop in SalvageFileSys1. However, we
do delete the header when we simply re-write a header, since we use
the existing VolumeSummary struct. Set fileName in neither, for
consistency.
Andrew Deason [Wed, 22 Feb 2012 21:40:20 +0000 (15:40 -0600)]
LINUX: Use afs_convert_code in afs_notify_change
afs_notify_change currently just returns "-code". This can cause a
panic if the error code is negative, since we will return a positive
error code, which may get interpreted as a valid pointer value in
higher levels.
Specifically, if we hit afs_notify_change via something like this code
path:
Andrew Deason [Wed, 22 Feb 2012 21:36:37 +0000 (15:36 -0600)]
LINUX: move afs_notify_change to osi_vnodeops.c
afs_notify_change is almost always used solely in inode_operations
structs, and is more similar to the other per-vnode functions. Put it
with the other per-vnode functions for better organization, and so
they can use the same static functions.
Move the helper functions iattr2vattr and vattr2inode along with it.
Derrick Brashear [Wed, 22 Feb 2012 20:57:46 +0000 (15:57 -0500)]
libafs: retry retriable RPCs instead of abandoning
if we get e.g. an idle dead error we should retry
retriable actions, namely data stores. in order
for writing files to work correctly given how
the writeback code is structured it's important that
this not interfere with analyze's shouldRetry decision
on those RPCs
Andrew Deason [Fri, 17 Feb 2012 23:12:46 +0000 (17:12 -0600)]
viced: Relax "h_TossStuff_r failed" warnings
Currently, h_TossStuff_r bails out and logs a message if we detect
that somebody grabbed a reference or locked the host while we tried to
h_NBLock_r. The reasoning for this is that it is not legal for anyone
to h_Hold_r a host that has HOSTDELETED set (but the error is
detectable and recoverable); callers are supposed to check for
HOSTDELETED and not hold a host in that case.
However, HOSTDELETED may not be set when h_TossStuff_r is called,
since we call it if either HOSTDELETED _or_ CLIENTDELETED are set. If
CLIENTDELETED is set and HOSTDELETED is not, it's perfectly fine (and
necessary) for callers to grab a reference to the host. So, if that's
what is going on, don't log a message, since that's normal behavior.
Check for HOSTDELETED before we h_NBLock_r, since it is technically
possible (and legal) for someone to grab a reference to the host and
somehow set HOSTDELETED while we wait for h_NBLock_r to return. Also
log the flags when we see this message.
Andrew Deason [Fri, 17 Feb 2012 22:24:16 +0000 (16:24 -0600)]
viced: Remove extraneous h_AHTAHT_r in h_GetHost_r
We added this address to the host with an addInterfaceAddr_r call just
a few lines before, which adds the host to the address hash table.
Another call to h_AddHostToAddrHashTable_r is pure overhead and
confusing.
Andrew Deason [Fri, 17 Feb 2012 21:46:50 +0000 (15:46 -0600)]
viced: Set h_GetHost_r probefail if MPAA_r fails
Currently, in h_GetHost_r, if we get a connection whose address does
not match an extant host, but the reported uuid does, we ProbeUuid the
old host. If it fails, we call MultiProbeAlternateAddress_r and set
'probefail'. Later on, if 'probefail' is set, we always add the
connection address to the host, and remove the host->host,host->port
address from the host.
However, this is not always correct. Consider the following situation.
We have an existing host that has primary address 1.1.1.1, and also
has addresses 1.1.1.2 and 1.1.1.3 on the interface list but not on the
hash table. Say that host A stops responding on 1.1.1.1, and a
connection comes in from 1.1.1.2. We ProbeUuid 1.1.1.1 and get a
failure, so we call MultiProbeAlternateAddress_r.
MultiProbeAlternateAddress_r probes via rx_Multi the addresses 1.1.1.2
and 1.1.1.3. Say that 1.1.1.3 responds first, and responds
successfully, so MultiProbeAlternateAddress_r sets 1.1.1.3 to be the
primary address for the host.
After MultiProbeAlternateAddress_r returns, 'probefail' is set. A few
lines down, we see that oldHost->host does not match haddr, and
'probefail' is set, so we add 1.1.1.2 to the interface list, and
remove 1.1.1.3, and set 1.1.1.2 to be the primary address, even though
1.1.1.3 is the address we most recently 'know' is correct.
To fix this, only set 'probefail' if MultiProbeAlternateAddress_r also
fails after the failed ProbeUuid call. Conceptually this makes sense,
since if MultiProbeAlternateAddress_r succeeds, it found an address
that responds successfully to ProbeUuid, and it sets that address to
be the primary address. Therefore, after MultiProbeAlternateAddress_r
returns success, the situation is the same as if the 'good' address
was already the primary address, and the ProbeUuid call succeeded, so
'probefail' should be cleared.
Andrew Deason [Fri, 17 Feb 2012 19:14:31 +0000 (13:14 -0600)]
viced: Correctly update addrs on alt addr probe
The functions MultiBreakCallBackAlternateAddress_r and
MultiProbeAlternateAddress_r try to find a valid address in a host's
interface list of addrs. If they find one, they update host->host and
host->port. However, they do so just by changing those fields directly
and by calling h_DeleteHostFromAddrHashTable_r and
h_AddHostToAddrHashTable_r. This leaves the old host->host, host->port
on the interface list, and leaves it marked as 'valid'. Similarly, the
new host and port may still be marked as not 'valid'.
This can result in the host being on the addr hash table via an
address that is not on the host's interface list. After the above
situation occurs, we may call
and then update host->host and host->port, which happens in a variety
of places. Since host->host, host->port is not marked as valid in the
interface list, it is not removed from the addr hash table, but it is
removed from the interface list. Eventually, this can cause the host
to be referenced from the addr hash table even after it has been
freed.
Since this can result in hash table entries pointing to the 'wrong'
host, this can result in FileLog messages such as:
Sun Feb 5 03:16:35 2012 Removing address that does not belong to host 0xdeadbeefdead (1.2.3.4:7001).
To fix this, make MultiBreakCallBackAlternateAddress_r and
MultiProbeAlternateAddress_r update the address list the same way as
all of the code in host.c does; by adding the new address with
addInterfaceAddr_r, removing it with removeInterfaceAddr_r, and
updating host->host and host->port.
Andrew Deason [Thu, 16 Feb 2012 22:20:16 +0000 (16:20 -0600)]
viced: Delete dup host before probing old host
Currently, when the fileserver gets a new connection from an address
not on the addr hash table, we allocate a new host structure and add
that host to the addr hash table. If we then find that that host's
uuid matches the uuid of an extant host, we do the following:
- probe the old host with the uuid, and MultiProbeAlternateAddress_r
if the probe fails
- mark the duplicate host as HOSTDELETED
- manipulate the interface lists
Consider, for example, that we have an extant host ('oldHost') with
address 1.2.3.4:7001, but with 5.6.7.8:7001 on its alternate interface
list. At some point, the 1.2.3.4:7001 interface goes away or becomes
unreachable. A new connection comes in from that same host on
5.6.7.8:7001.
What will happen is we create a new host for address 5.6.7.8:7001, and
then detect the uuid collision. When we try to probe the old address
of 1.2.3.4:7001, it will fail, and we will try to
MultiProbeAlternateAddress_r. MultiProbeAlternateAddress_r will
determine that the alternate address 5.6.7.8:7001 responds
successfully to the probe, and it tries to set 5.6.7.8:7001 to be the
primary address of 'oldHost', and add 'oldHost' to the addr hash table
under 5.6.7.8:7001.
But the "new" host from the incoming connection is already hashed on
the address hash table under 5.6.7.8:7001, so the
h_AddHostToAddrHashTable_r call in MultiProbeAlternateAddress_r fails.
Since we later delete the new duplicate host, this results in
5.6.7.8:7001 being the primary address for the host, but that address
is not anywhere in the address hash table.
This behavior can be seen by the following pair of FileLog messages:
Wed Feb 1 11:02:38 2012 CB: ProbeUuid for 0xdeadbeefdead (1.2.3.4:7001) failed -01
Wed Feb 1 11:02:38 2012 h_AddHostToAddrHashTable_r: refusing to hash host beefdead, baadcafe (5.6.7.8:7001) already hashed
While those message do not necessarily indicate this problem, this
problem will result in those messages.
To fix this, mark the duplicate host as HOSTDELETED before we do any
probing on 'oldHost'. This way, if MultiProbeAlternateAddress_r tries
to add 'oldHost' to the addr hash table under 5.6.7.8:7001, it will be
able to do so successfully, since the old duplicate host is deleted.
Andrew Deason [Mon, 13 Feb 2012 20:11:36 +0000 (14:11 -0600)]
Rx: Avoid lastBusy/PEER_BUSY discrepancy
If an rx call has the RX_CALL_PEER_BUSY flag set, but the call's
conn->lastBusy is not set, we can easily cause an rx caller to loop
infinitely. rx_NewCall will see that lastBusy for a call channel is
not set, and will use that call channel, but rxi_CheckBusy will note
that the call appears busy and that there are non-busy call channels
on the same conn, and so will return RX_CALL_BUSY.
This can currently happen in rxi_ResetCall, since we set
RX_CALL_PEER_BUSY on the call again if the call had that flag set when
rxi_ResetCall was called. If we are calling rxi_ResetCall with
'newcall' set, the passed in call is unrelated to the new call, since
it was obtained from the free list. Thus, the busy-ness of the call
should be ignored. Fix this by only paying attention to the incoming
RX_CALL_PEER_BUSY flag if 'newcall' is not set.
Also prevent this from happening by clearing RX_CALL_PEER_BUSY in
rx_NewCall when we select a call and clear lastBusy for that call.
Derrick Brashear [Tue, 13 Dec 2011 16:24:16 +0000 (11:24 -0500)]
volser: allow clonevol purge id to be new id
effectively the same functionality that reclone already uses, but
for some reason we artificially limit it out of clone despite
the interface being there for it. it used to be there. put it back.
Andrew Deason [Wed, 8 Feb 2012 22:03:29 +0000 (16:03 -0600)]
RedHat: Fail openafs-client 'stop' on rmmod error
Currently, the openafs-client RPM init script ignores any error
reported by rmmod. If 'umount /afs' succeeds but rmmod does not, the
client may panic the machine if the client is started again (from e.g.
running the 'restart' init script method), since afsd will try to
initialize AFS with a libafs that has been shut down.
So, do not ignore errors from 'rmmod', and instead fail the 'stop'
method from the init script if we get an error.
Andrew Deason [Tue, 20 Dec 2011 22:44:42 +0000 (17:44 -0500)]
viced: Keep H_LOCK while locking host in h_Alloc_r
Currently in h_Alloc_r, we h_Lock_r the host, so we have it locked on
return. However, h_Lock_r drops the host glock, which is bad in this
situation since we have already added the host to the global hash
table, so other threads may see it. This can mean that by the time
h_Alloc_r returns, the returned host may have HOSTDELETED set, and/or
the addresses associated with the host may be completely different.
h_Alloc_r's caller, h_GetHost_r, seems to assume that the host is
still associated with the address of the passed-in connection. When
this is not true, this can result in the host structure getting into a
strange state, such as the primary addr/port may not be hashed. The
host may also have HOSTDELETED set, in which case we're not supposed
to be dealing with it at all.
To avoid these problems, lock host->lock directly in h_Alloc_r,
without going through h_Lock_r and dropping H_LOCK. Also do it as one
of the first things we do to initialize the host, just to make sure
that if anybody else happens to see the host, it is locked by us when
they do.
Tom Keiser [Wed, 1 Feb 2012 08:31:23 +0000 (03:31 -0500)]
com_err: correctly deal with lack of libintl
On machines lacking a libintl, _intlize() currently fails to initialize
the output error string--leading to tools (e.g., translate_et) returning
a null string; make afs_com_err fall back to returning the en/US canonical
error text when we don't have any i18n support...
Christof Hanke [Sun, 29 Jan 2012 17:08:57 +0000 (18:08 +0100)]
linux: fix probing for noop_fsync
Commit 267934d0e6910c8d8166a6e78f93c1bab40857b8 introduced
probing code to deal with the renameing of simple_fsync
inside the linux-kernel.
This test does not take different parameter-lists
for noop_fsync or simple_fsync resp. into account.
Fix this.
Reviewed-on: http://gerrit.openafs.org/6628 Reviewed-by: Marc Dionne <marc.c.dionne@gmail.com> Reviewed-by: Derrick Brashear <shadow@dementix.org> Tested-by: Derrick Brashear <shadow@dementix.org>
(cherry picked from commit 20e82cecd9008f9b3467c9a323c5c3abf27f3021)
Derrick Brashear [Wed, 22 Feb 2012 20:57:46 +0000 (15:57 -0500)]
libafs: retry retriable RPCs instead of abandoning
if we get e.g. an idle dead error we should retry
retriable actions, namely data stores. in order
for writing files to work correctly given how
the writeback code is structured it's important that
this not interfere with analyze's shouldRetry decision
on those RPCs
Jeffrey Altman [Tue, 21 Feb 2012 01:50:53 +0000 (20:50 -0500)]
Windows: AFSPerformObjectInvalidate hold ExtentsResource shared
The AFSPerformObjectInvalidate() was obtaining exclusive
access to the Fcb ExtentsResource even though it was not
tearing down the extents list. The ExtentsResource could
be held shared instead. Doing so will avoid the following
deadlock:
Thread 1 is attempting to perform a cache purge which cannot complete
until Thread 2 is finished but Thread 2 requires the ExtentsResource
which is held by Thread 1.
Jeffrey Altman [Mon, 20 Feb 2012 06:48:20 +0000 (01:48 -0500)]
Windows: fsLockCount not accurate
Prior to 1.6.2 the file server does not report an accurate value
for the lock state. In addition, callbacks are not broken when
locks are freed due to lease expiration.
Andrew Deason [Fri, 17 Feb 2012 23:12:46 +0000 (17:12 -0600)]
viced: Relax "h_TossStuff_r failed" warnings
Currently, h_TossStuff_r bails out and logs a message if we detect
that somebody grabbed a reference or locked the host while we tried to
h_NBLock_r. The reasoning for this is that it is not legal for anyone
to h_Hold_r a host that has HOSTDELETED set (but the error is
detectable and recoverable); callers are supposed to check for
HOSTDELETED and not hold a host in that case.
However, HOSTDELETED may not be set when h_TossStuff_r is called,
since we call it if either HOSTDELETED _or_ CLIENTDELETED are set. If
CLIENTDELETED is set and HOSTDELETED is not, it's perfectly fine (and
necessary) for callers to grab a reference to the host. So, if that's
what is going on, don't log a message, since that's normal behavior.
Check for HOSTDELETED before we h_NBLock_r, since it is technically
possible (and legal) for someone to grab a reference to the host and
somehow set HOSTDELETED while we wait for h_NBLock_r to return. Also
log the flags when we see this message.
Andrew Deason [Fri, 17 Feb 2012 22:24:16 +0000 (16:24 -0600)]
viced: Remove extraneous h_AHTAHT_r in h_GetHost_r
We added this address to the host with an addInterfaceAddr_r call just
a few lines before, which adds the host to the address hash table.
Another call to h_AddHostToAddrHashTable_r is pure overhead and
confusing.
Andrew Deason [Fri, 17 Feb 2012 21:46:50 +0000 (15:46 -0600)]
viced: Set h_GetHost_r probefail if MPAA_r fails
Currently, in h_GetHost_r, if we get a connection whose address does
not match an extant host, but the reported uuid does, we ProbeUuid the
old host. If it fails, we call MultiProbeAlternateAddress_r and set
'probefail'. Later on, if 'probefail' is set, we always add the
connection address to the host, and remove the host->host,host->port
address from the host.
However, this is not always correct. Consider the following situation.
We have an existing host that has primary address 1.1.1.1, and also
has addresses 1.1.1.2 and 1.1.1.3 on the interface list but not on the
hash table. Say that host A stops responding on 1.1.1.1, and a
connection comes in from 1.1.1.2. We ProbeUuid 1.1.1.1 and get a
failure, so we call MultiProbeAlternateAddress_r.
MultiProbeAlternateAddress_r probes via rx_Multi the addresses 1.1.1.2
and 1.1.1.3. Say that 1.1.1.3 responds first, and responds
successfully, so MultiProbeAlternateAddress_r sets 1.1.1.3 to be the
primary address for the host.
After MultiProbeAlternateAddress_r returns, 'probefail' is set. A few
lines down, we see that oldHost->host does not match haddr, and
'probefail' is set, so we add 1.1.1.2 to the interface list, and
remove 1.1.1.3, and set 1.1.1.2 to be the primary address, even though
1.1.1.3 is the address we most recently 'know' is correct.
To fix this, only set 'probefail' if MultiProbeAlternateAddress_r also
fails after the failed ProbeUuid call. Conceptually this makes sense,
since if MultiProbeAlternateAddress_r succeeds, it found an address
that responds successfully to ProbeUuid, and it sets that address to
be the primary address. Therefore, after MultiProbeAlternateAddress_r
returns success, the situation is the same as if the 'good' address
was already the primary address, and the ProbeUuid call succeeded, so
'probefail' should be cleared.
Andrew Deason [Fri, 17 Feb 2012 19:14:31 +0000 (13:14 -0600)]
viced: Correctly update addrs on alt addr probe
The functions MultiBreakCallBackAlternateAddress_r and
MultiProbeAlternateAddress_r try to find a valid address in a host's
interface list of addrs. If they find one, they update host->host and
host->port. However, they do so just by changing those fields directly
and by calling h_DeleteHostFromAddrHashTable_r and
h_AddHostToAddrHashTable_r. This leaves the old host->host, host->port
on the interface list, and leaves it marked as 'valid'. Similarly, the
new host and port may still be marked as not 'valid'.
This can result in the host being on the addr hash table via an
address that is not on the host's interface list. After the above
situation occurs, we may call
and then update host->host and host->port, which happens in a variety
of places. Since host->host, host->port is not marked as valid in the
interface list, it is not removed from the addr hash table, but it is
removed from the interface list. Eventually, this can cause the host
to be referenced from the addr hash table even after it has been
freed.
Since this can result in hash table entries pointing to the 'wrong'
host, this can result in FileLog messages such as:
Sun Feb 5 03:16:35 2012 Removing address that does not belong to host 0xdeadbeefdead (1.2.3.4:7001).
To fix this, make MultiBreakCallBackAlternateAddress_r and
MultiProbeAlternateAddress_r update the address list the same way as
all of the code in host.c does; by adding the new address with
addInterfaceAddr_r, removing it with removeInterfaceAddr_r, and
updating host->host and host->port.
Andrew Deason [Thu, 16 Feb 2012 22:20:16 +0000 (16:20 -0600)]
viced: Delete dup host before probing old host
Currently, when the fileserver gets a new connection from an address
not on the addr hash table, we allocate a new host structure and add
that host to the addr hash table. If we then find that that host's
uuid matches the uuid of an extant host, we do the following:
- probe the old host with the uuid, and MultiProbeAlternateAddress_r
if the probe fails
- mark the duplicate host as HOSTDELETED
- manipulate the interface lists
Consider, for example, that we have an extant host ('oldHost') with
address 1.2.3.4:7001, but with 5.6.7.8:7001 on its alternate interface
list. At some point, the 1.2.3.4:7001 interface goes away or becomes
unreachable. A new connection comes in from that same host on
5.6.7.8:7001.
What will happen is we create a new host for address 5.6.7.8:7001, and
then detect the uuid collision. When we try to probe the old address
of 1.2.3.4:7001, it will fail, and we will try to
MultiProbeAlternateAddress_r. MultiProbeAlternateAddress_r will
determine that the alternate address 5.6.7.8:7001 responds
successfully to the probe, and it tries to set 5.6.7.8:7001 to be the
primary address of 'oldHost', and add 'oldHost' to the addr hash table
under 5.6.7.8:7001.
But the "new" host from the incoming connection is already hashed on
the address hash table under 5.6.7.8:7001, so the
h_AddHostToAddrHashTable_r call in MultiProbeAlternateAddress_r fails.
Since we later delete the new duplicate host, this results in
5.6.7.8:7001 being the primary address for the host, but that address
is not anywhere in the address hash table.
This behavior can be seen by the following pair of FileLog messages:
Wed Feb 1 11:02:38 2012 CB: ProbeUuid for 0xdeadbeefdead (1.2.3.4:7001) failed -01
Wed Feb 1 11:02:38 2012 h_AddHostToAddrHashTable_r: refusing to hash host beefdead, baadcafe (5.6.7.8:7001) already hashed
While those message do not necessarily indicate this problem, this
problem will result in those messages.
To fix this, mark the duplicate host as HOSTDELETED before we do any
probing on 'oldHost'. This way, if MultiProbeAlternateAddress_r tries
to add 'oldHost' to the addr hash table under 5.6.7.8:7001, it will be
able to do so successfully, since the old duplicate host is deleted.
Andrew Deason [Mon, 13 Feb 2012 20:11:36 +0000 (14:11 -0600)]
Rx: Avoid lastBusy/PEER_BUSY discrepancy
If an rx call has the RX_CALL_PEER_BUSY flag set, but the call's
conn->lastBusy is not set, we can easily cause an rx caller to loop
infinitely. rx_NewCall will see that lastBusy for a call channel is
not set, and will use that call channel, but rxi_CheckBusy will note
that the call appears busy and that there are non-busy call channels
on the same conn, and so will return RX_CALL_BUSY.
This can currently happen in rxi_ResetCall, since we set
RX_CALL_PEER_BUSY on the call again if the call had that flag set when
rxi_ResetCall was called. If we are calling rxi_ResetCall with
'newcall' set, the passed in call is unrelated to the new call, since
it was obtained from the free list. Thus, the busy-ness of the call
should be ignored. Fix this by only paying attention to the incoming
RX_CALL_PEER_BUSY flag if 'newcall' is not set.
Also prevent this from happening by clearing RX_CALL_PEER_BUSY in
rx_NewCall when we select a call and clear lastBusy for that call.
Derrick Brashear [Tue, 13 Dec 2011 16:24:16 +0000 (11:24 -0500)]
volser: allow clonevol purge id to be new id
effectively the same functionality that reclone already uses, but
for some reason we artificially limit it out of clone despite
the interface being there for it. it used to be there. put it back.
Jeffrey Altman [Sun, 19 Feb 2012 00:57:25 +0000 (19:57 -0500)]
Windows: Dereg Lanman and Lsa reg values for afsredir
If the machine has been upgraded from an AFS SMB Server to the
AFS Redirector, the registry will have leftover configuration
for the "AFS" netbios name in the Lsa BackConnectionHostNames
value and the LanmanWorkstation ReconnectableServers and
ServersWithExtendedSessTimeout values. These values are not
useful with the AFS Redirector since \\AFS is owned by afsredir.sys
and not the SMB redirector. Remove the "AFS" netbios name from
these values when afsd_service.exe has started in redirector mode.
Ken Dreyer [Sat, 11 Feb 2012 16:43:30 +0000 (09:43 -0700)]
doc: replace hostnames with IETF example hostnames
There were several different real and made-up hostnames and company names used
throughout our documentation examples.
The IETF has reserved "example.com" and other "example" TLDs for use in
examples (RFC 2606). Replace almost all references to ABC Corporation, DEF
Corporation, and State University, as well as "abc.com", "bigcell.com",
"def.com", "def.gov", "ghi.com", "ghi.gov", "jkl.com", "mit.edu",
"stanford.edu", "state.edu", "stateu.edu", "uncc.edu", and "xyz.com".
Standardize on "Example Corporation", "Example Network", "Example
Organization" (example.com, example.net, and example.org).
The Scout documentation in the Admin Guide contains PNG images that contain
the old cell names, so I left those references until the images can be
replaced.
AFSPrimaryVolumeWorkerThread held the VolumeCB->ObjectInfoTree.TreeLock
exclusively across calls to AFSCleanupFcb() which in turn triggers
a file extent release to the service which can in turn result in
an object invalidation. Processing the invalidation requires shared
access to VolumeCB->ObjectInfoTree.TreeLock which results in a deadlock.
This patch alters the processing of AFSPrimaryVolumeWorkerThread
so that the VolumeCB->ObjectInfoTree.TreeLock is not held across
the AFSCleanupFcb() calls.
FIXES 130431
Change-Id: I3726df02ab47d2dcc83a32c75957a5dafcfbf20e
Reviewed-on: http://gerrit.openafs.org/6724 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Peter Scott <pscott@kerneldrivers.com> Reviewed-by: Jeffrey Altman <jaltman@secure-endpoints.com> Tested-by: Jeffrey Altman <jaltman@secure-endpoints.com>
Jeffrey Altman [Wed, 15 Feb 2012 05:06:47 +0000 (00:06 -0500)]
Windows: disable afsdhook.dll reload by daemon
The daemon thread's loading and unloading of afsdhook.dll every
second prevents the disk drive from sleeping and forces a search
of the PATH. Make the periodic reloading configurable and
disable it by default.
Andrew Deason [Wed, 8 Feb 2012 22:03:29 +0000 (16:03 -0600)]
RedHat: Fail openafs-client 'stop' on rmmod error
Currently, the openafs-client RPM init script ignores any error
reported by rmmod. If 'umount /afs' succeeds but rmmod does not, the
client may panic the machine if the client is started again (from e.g.
running the 'restart' init script method), since afsd will try to
initialize AFS with a libafs that has been shut down.
So, do not ignore errors from 'rmmod', and instead fail the 'stop'
method from the init script if we get an error.
Jeffrey Altman [Sat, 11 Feb 2012 22:31:00 +0000 (17:31 -0500)]
Windows: default cell grand.central.org
Change the default cell from openafs.org to grand.central.org
since there is no openafs.org cell. All openafs software is
distributed from the grand.central.org cell.
Jeffrey Altman [Sat, 11 Feb 2012 22:29:51 +0000 (17:29 -0500)]
Windows: reset version to 0.0.0 on master
Master does not track a particular version number.
For Windows builds on master, reset the version to
0.0.0 so that the builds are not confused with the actual
1.5.7600.
Jeffrey Altman [Fri, 10 Feb 2012 13:56:12 +0000 (08:56 -0500)]
Windows: Perform rename to self check earlier
In AFSSetRenameInfo(), the rename to itself check was performed
after the name collision check. Move the check earlier in the
routine to ensure that we catch the no-op before any real work
is done.