From d1c4dbf28ae28bbfac3d8bc96d0fa5ae3d422bfd Mon Sep 17 00:00:00 2001 From: Benjamin Kaduk Date: Sun, 25 Jun 2017 13:56:04 -0500 Subject: [PATCH] Put jhutz's ubik analysis in doc/txt A file in the source tree is much easier to locate than an old mailing list post; it's quite handy to have this at hand as a reference. Change-Id: I5267a2f86b36e92b05249364085bdd33aeb28d1b Reviewed-on: https://gerrit.openafs.org/12642 Tested-by: BuildBot Reviewed-by: Michael Meffie Reviewed-by: Stephan Wiesand Reviewed-by: Benjamin Kaduk --- doc/txt/ubik.txt | 1434 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1434 insertions(+) create mode 100644 doc/txt/ubik.txt diff --git a/doc/txt/ubik.txt b/doc/txt/ubik.txt new file mode 100644 index 000000000..e508e673a --- /dev/null +++ b/doc/txt/ubik.txt @@ -0,0 +1,1434 @@ +This file contains a threading analysis of Ubik prepared by +Jeffrey Hutzelman and sent ot the openafs-devel@openafs.org +mailing list in February 2011, archived at: +https://lists.openafs.org/pipermail/openafs-devel/2011-February/018329.html + +INTRODUCTION + + This document describes the structure of Ubik, with an eye toward + thread-safety and concurrency. It begins by discussing the elements, + usage, and locking considerations of major shared data structures. This + is followed by a discussion of each major subsystem. The emphasis is on + deficiencies in thread-safety and the changes needed to correct them. + + Most of this document focuses on the user-mode Ubik server code, which + comprises the bulk of the Ubik library code and also of the complexity. + A separate section near the end is dedicated to Ubik client code, which + itself is intended primarily for use in user-mode applications. The + OpenAFS cache manager includes its own mechanisms for accessing + replicated services, including the Ubik-managed VLDB, and for tracking + the servers that provide them. + + Occasionally, issues related to concurrency, performance, and modularity + (particularly as regards support for supporting multiple independent + databases in a single server process) are also discussed; however, these + discussions are peripheral and are included only to record data obtained + while examining the question of thread-safety and as a basis for future + work. Similarly, throughout this document are detailed discussions of + some parts of the code and certain data structures which, while not + always directly relevant to thread-safety, were produced in the course + of analyzing the Ubik library code. These are included to provide + additional background and as a basis for future work in documentation of + Ubik library internals and interfaces. + + +MAJOR DATA STRUCTURES + + This section calls out the major shared data structures used in the Ubik + server code and details what they are used for and how shared access to + them is protected. Much of this text is probably suitable as a starting + point for expanding on the documentation found in comments in ubik.p.h. + + struct ubik_dbase - database + + This structure contains nearly everything Ubik knows about an open + database (for an explanation of "nearly", see the section below on + globals), including database-wide parameters and state and a list of + active transactions. + + The following structure elements are set only when the database + structure is being initialized, after which they are never modified: + + + pathName - The base path used to construct database filenames + + physical later method pointers (read, write, truncate, sync, + stat, open, setlabel, getlabel, getnfiles), used to manipulate + the physical database files on disk. These currently always + point to the methods defined in phys.c, and in fact other + modules contain knowledge of the internal implementation of that + module. + + The primary synchronization mechanism used to protect this structure + is the database lock (versionLock), which is manipulated via the + DBHOLD() and DBRELE() macros. This lock protects the following + structure elements, as well as a number of globals as described + elsewhere in this document: + + + activeTrans - list of active transactions + + version - database version (*) + + tidCounter - transaction ID of latest transaction (*) + + writeTidCounter - transaction ID of latest write transaction (*) + + flags (*) + + readers + + (*) These fields are currently protected by the database lock, except + that the beacon thread does not hold that lock when examining them, + or the global 'ubik_epochTime', which holds the corresponding epoch. + The recommendation is for these fields, along with ubik_epochTime, to + be additionally protected by a second lock, allowing the beacon + thread to examine them without holding the database lock; see the + section on the BEACON package for details. + + The condition variable 'version_cond' is used to signal to that the + database version may have changed; it is broadcast in udisk_commit(), + in SDISK_SendFile(), and from the recovery thread; it is monitored by + ubik_WaitVersion(), which can be called by an application to wait for + a database version change (this is not currently used in OpenAFS). + This CV is associated with the database lock. When LWP is used, this + condition is signalled on &ubik_dbase->version. + + The condition variable 'flags_cond' is used by udisk_end() to signal + that DBWRITING flag has been cleared. This wakes threads waiting in + ubik.c:BeginTrans() to begin a new transaction. This CV is + associated with the database lock. When LWP is used, this condition + is signalled on &ubik_dbase->flags. + + When LWP is used, signals are also sometimes sent on &urecovery_state + and on &ubik_amSyncSite. However, nothing ever waits on these, and + no corresponding condition variables exist when pthreads is used. + + The 'cachedVersion' field holds the database version represented by + the application's cache; this is compared against the current + database version to discover if the cache is up to date. Both + cachedVersion and the application cache itself are protected by + cache_lock, an AFS reader/writer lock. + + When the application wishes to use the cache, it calls + ubik_CacheUpdate(), which verifies that the cache is current, + updating it if necessary (under a write lock), and then returns with + the application cache read-locked. The application can then rely on + the cache's contents not to change until it ends the transaction by + calling ubik_EndTrans() or ubik_AbortTrans(). When one of these + functions is called, the read lock is dropped, and a write lock may + be acquired temporarily to update both cachedVersion and the cache + (in the case of committing a write transaction) or to clear + cachedVersion, indicating the cache is invalid (in the case of + aborting a transaction). + + In the case of ubik_EndTrans() on a write transaction, the cache is + held write-locked until udisk_commit() has completed; this is also + done in SDISK_Commit(), which insures that the contents of the + on-disk database and page cache cannot change while an active + transaction holds the cache lock. Note that both the recovery thread + and SDISK_SendFile() may replace the database contents wholesale; + however, these operations first abort all outstanding transactions. + + Prior to the introduction of ubik_BeginTransReadAnyWrite(), an + application could choose to manage its own cache, updating it during + write transactions and when, at the beginning of a read transaction, + it was discovered to be out of date (due to a write done by another + server). Under this model, the application uses Ubik transaction + locks to protect not only the database but also its cache. However, + the use of ubik_BeginTransReadAnyWrite() allows some transactions to + proceed and read the database without acquiring any locks, which + makes it unsafe to depend on Ubik transaction locks to protect the + application cache. Applications which make use of this mode must set + ubik_SyncWriterCacheProc to point to a procedure to be called to + update the cache during ubik_EndTrans(), after udisk_commit() has + succeeded and before the cache_lock is released. These applications + must then refrain from ever updating the cache other than during that + procedure or during the callback passed to ubik_CheckCache(). + + + struct ubik_trans - active transaction + + This structure represents the state of an active transaction, + including transaction metadata and state and a list of all changes + made as part of the transaction. + + Transactions are organized into a linked list referenced by the + corresponding database structure, protected by the database lock. + Each transaction contains an upward pointer to the database, + a transaction type, and a transaction ID, all of which are immutable + for as long as the transaction exists. + + A transaction created via the REMOTE interface exists until ended by + a call to SDISK_Abort() or SDISK_ReleaseLocks() or aborted in + urecovery_CheckTid() due to a transaction ID mismatch or creation of + a new transaction (note that in this last case, cleanup will be + deferred if SDISK_Lock() is blocked on this transaction). In these + cases, ubik_currentTrans is cleared at the same time, under the + database lock, and further REMOTE calls will fail gracefully until + a new transaction is started. Thus, REMOTE transactions may be + considered valid only for as long as the database lock is held. + + Transactions created via ubik_BeginTrans(), ubik_BeginTransReadAny(), + or ubik_BeginTransReadAnyWrite() exist until ended by a call to + ubik_AbortTrans() or ubik_EndTrans(). They are never destroyed from + within the UBIK package. Code which discovers a transaction by + traversing the database's active transaction list may access the + transaction only as long as the database lock is held, since once it + is released, an application thread may end the transaction. + + Each transaction contains the following fields, which are protected + by the database lock: + + + locktype - type of lock held by this transaction + + minCommitTime - unused + + seekFile - file number of database file pointer + + seekPos - position of database file pointer + + flags + + File writes are represented by 'iovec_info', an XDR vector structure + (type iovec_wrt, a vector of struct ubik_iovec) containing the file, + position, and length of each write, and 'iovec_data', an XDR opaque + containing the corresponding data. These vectors and their contents + are protected by the database lock. + + File truncations are represented by a linked list of struct + ubik_trunc, each of which contains a file number and the new size of + that file. This list, and the records contained in it, are protected + by the database lock. + + While transactions are currently protected by the database lock, it + may be desirable in the future to introduce a per-transaction lock, + in order to facilitate increased concurrency. + + + struct ubik_server - server status + + This structure is used to track the set of peer servers, including + state related to elections and recovery. The global 'ubik_servers' + points to a linked list of these structures, which is constructed + during initialization. Once the list has been constructed, no + entries are ever added or removed; in addition, the following fields + are not changed after startup: + + + addr[0] - server's primary address + + magic - true if this is the magic server + + isClone - true if this server is a clone + + The following fields are used by the BEACON package to track beacons + to this server and their results. Currently, they are sometimes + accessed under the database lock, and sometimes without benefit of + any lock. They should probably be protected by the same lock as + globals private to the BEACON package. + + + lastVoteTime - last time this server voted yes for us + + lastBeaconSent - last time a beacon was sent to this server + + lastVote - true if its late vote response was yes + + up - true if this server is believed to be up + + beaconSinceDown - true if a beacon has successfully been sent to + this server since the last time it was down + + The following fields are used by the RECOVERY package to track the + state of each server. They are also examined and modified by the + various ContactQuorum_* functions. Currently, they are sometimes + accessed under the database lock, and sometimes without benefit of + any lock. They should be fully protected by the database lock. + + + version - server's database version + + currentDB - true if server's database is up-to-date + + The addr[] array is used to track all known addresses for the remote + server; this list is used when attempting to contact a down server + via an alternate address, and to identify the source of incoming + server-to-server RPCs. The first element of this array is always the + primary address and does not change after initialization; other + elements are updated during initialization and when a remote server + calls DISK_UpdateInterfaceAddr(). + + The system normally maintains an active Rx connection to the VOTE and + DISK services on each server; pointers to these are kept in the + 'vote_rxcid' and 'disk_rxcid' fields. + + Currently, the addr[] array, 'vote_rxcid', and 'disk_rxcid' are not + protected by any lock. Because these are shared among a number of + subsystems and it is usually desirable to avoid blocking on the + database lock when making RPCs, it is recommended to introduce a new + lock (the "server address lock") to protect these fields and the + globals 'ubikSecClass' and 'ubikSecIndex', which contain the security + classes used for setting up such connections. The new lock would + need to be held in the following cases: + + Before beginning any RPC using 'vote_rxcid' or 'disk_rxcid', for + the time it takes to call rx_GetConnection() or rx_NewCall() + + For the duration of ubikGetPrimaryInterfaceAddr() + + In ubeacon_Interact(), when building the set of connections to be + used for multi_VOTE_Beacon(). + + In updateUbikNetworkAddress(), first while constructing the set of + connections to be used for multi_DISK_UpdateInterfaceAddr(), and + then again each time a server's list of addresses is updated. + However, the lock must NOT be held while waiting for the RPC to + complete, to avoid deadlock when multiple servers are starting at + the same time. + + For the duration of SDISK_UpdateInterfaceAddr() + + For the duration of ubeacon_ReinitServer(), with the optional + exception of the call to afsconf_UpToDate(). + + In src/ubik/recovery.c:DoProbe(), first while constructing the set + of connections to be used for multi_DISK_Probe() and, if the probe + is successful, again while destroying the old connections for the + server under test and replacing them with new ones. See the + section on the RECOVERY package for additional notes. + + +GLOBAL VARIABLES + + The library has an unfortunate number of global variables which are not + part of any data structure. Nonetheless, some of these are + database-specific; see MULTI-DATABASE SUPPORT, below. + + Globals private to a particular subsystem are discussed in the + description of that subsystem. This section discusses global variables + which are shared and/or part of external interfaces. + + The following globals are used to convey configuration parameters from + the application to the Ubik library, and are intended to be set before + the first call to ubik_ServerInit(). They are not modified by Ubik, and + are not protected by locks: + + + ubik_debugFlag - print debug messages (unused) + + ubikPrimaryAddrOnly - use only primary interface + + ubik_nBuffers - number of DISK package page cache buffers + + ubik_SRXSecurityProc - callback to create server security class + + ubik_SRXSecurityRock - argument for ubik_SRXSecurityProc + + ubik_CRXSecurityProc - callback to create client security class + + ubik_CRXSecurityRock - argument for ubik_CRXSecurityProc + + ubik_SyncWriterCacheProc - callback to update application cache + + ubik_CheckRXSecurityProc - callback to check if caller is OK + + ubik_CheckRXSecurityRock - argument for ubik_CheckRXSecurityProc + + The following globals contain values which are computed during Ubik + initialization and not modified thereafter. They are not protected by + any lock, which is safe so long as suitable memory barriers exist across + creation of threads and Rx services (see the notes on initialization in + the UBIK package section). Note that some of these values are + database-specific and properly belong in the ubik_dbase structure: + + + ubik_servers - linked list of servers (see above) + + amIClone - true if this server is a clone + + ubik_callPortal - UDP port used for server-to-server RPCs + + ubik_quorum - number of servers to make a quorum + + ubik_dbase - pointer to the (only) database (see above) + + ubik_host[] - array of local addresses + + ubik_sc[] - security classes for Ubik RPC services + + The following globals are currently protected by the database lock, + though in most cases there are existing references which do not hold the + lock. Later sections of this document contain recommendations in each + case to establish new locks to protect these or to correct cases where + they are accessed without the correct lock. + + + ubik_currentTrans - pointer to REMOTE's current transaction + + ubik_epochTime - epoch for transaction IDs + + urecovery_state - recovery state bits + + ubik_dbVersion - sync site's database version + + ubik_dbTid - sync site's transaction ID + + ubikSecIndex - security class index for making Ubik RPCs + + ubikSecClass - security class for making Ubik RPCs + + The global 'ubik_amSyncSite' belongs to the BEACON package, and should + be protected by the same lock as that package's private globals. In + fact, it should probably be made private; the only references outside + BEACON are in SVOTE_Debug() and SVOTE_DebugOld() and can easily be moved + into ubeacon_Debug(). + + +MAJOR SUBSYSTEMS + + This section describes each of the major Ubik server subsystems. For + each subsystem, a brief overview is provided, along with a description + of that subsystem's handling of synchronization (or lack thereof). + Subsystems are described beginning with the lowest layer and working + toward the highest. + + PHYS - Physical I/O + + This package is responsible for handling I/O to database files. It + maintains a private, global array of MAXFDCACHE(4) fdcache + structures, each representing an open file descriptor on a database + file, either presently in use or idle and available for use. Entries + are indexed by database file ID, and each contain a file descriptor + number and a refcount. There is also a global 'inited' flag, and + a buffer used to construct pathnames when opening database files. + + The static function uphys_open() takes a database pointer and fileid + and attempts to open the corresponding file. On success, it returns + a file descriptor; on failure, it returns the same as open(2) (with + errno set). This prefers to reuse an already-open fd whose cache + refcount is 0; failing that, it opens a new fd and attempts to enter + it in the cache. It prefers first to use an unused cache entry, then + to reclaim an idle one whose refcount is 0; if neither is possible, + the newly-opened fd is returned without caching. Note that the + refcount on an active entry is always exactly 1; the same entry + cannot be referenced more than once. + + The function uphys_close() releases a file descriptor opened by + uphys_open(). If the file descriptor was cached, the cache entry is + dereferenced but the file is left open for future use; 0 is returned. + If the file descriptor is not found in the cache, then the return + value is that of an immediate call to close(2). This function also + handles cleanup when a cached fd is closed after the file to which it + refers is invalidated by uphys_invalidate(). This function is not + static, but may as well be -- it is used only by other functions in + the PHYS package. + + Functions in this package are called via method pointers in the + database structure, which are set during initialization and not + changed thereafter. These functions must be called with the database + lock held; this protects the file descriptor cache and other globals + mentioned above, and prevents conflicting access to database files. + + The requirement that functions in this package be called with the + database lock held means that concurrent access to database files is + not possible. This situation could be improved by the addition of + a new global lock, which would be held for most of the duration of + uphys_open(), uphys_close(), and uphys_invalidate(), but need _not_ + be held during actual file accesses. This would eliminate the + requirement that calls to this package be made with the database lock + held, allowing multiple calls to be active concurrently. However, it + would still be necessary for higher layers to employ appropriate + synchronization as needed to maintain database consistency. + + + DISK - Logical Database I/O and Transaction Management + + This package is responsible for logical database file I/O, logging, + and transaction management. It maintains a private cache of database + file page buffers. This entire cache, including buffer space, is + allocated and initialized the first time a transaction is started; + the global initd flag keeps track of whether or not this has been + done. The size of this cache defaults to 20 pages, but may be + overridden by setting the global 'ubik_nBuffers' (all OpenAFS Ubik + services do this); this global is not protected by any lock, but is + read exactly once, when the cache is initialized. + + Except for udisk_Debug() (see below), all udisk_* interfaces require + a pointer either to the database or to an active transaction, and + must be called with the database lock held. This protects the DISK + package's accesses of the active transaction list and associated + state variables as well as its calls into the physical access layer, + including insuring that operations requiring multiple calls into that + layer are performed atomically. + + In addition, the database lock protects the page cache hash table and + LRU queue, the buffer control structures, buffer contents, the + truncation record free list (freeTruncList), and global statistics. + However, it may be desirable in the future to introduce a separate + lock to protect these structures, in order to facilitate multiple + database support. + + udisk_LogOpcode, udisk_LogEnd, udisk_LogTruncate, udisk_LogWriteData + + These are internal interfaces for writing to the transaction log. + They each construct a log record and then directly call the + physical layer to append it to the log; the log file is _not_ + cached by the page buffer cache, and these functions do not touch + the cache. Each of these must be atomic with respect to other + writes to the log file, which is accomplished by insuring they are + called only with the database locked. + + DInit, DRead, DTrunc, DRelease, DFlush, DAbort, DSync, DNew + + These are internal interfaces for accessing the page buffer cache. + They must be called with the database lock held. + + DRead() and DNew() return a pointer to the page data buffer, not + to the buffer control structure; this pointer is later passed to + DRelease() to release the buffer. This mechanism depends on the + fact that buffer control structures are allocated in the same + order as the actual page data buffers. + + DRead() guarantees that a read transaction will always see a clean + copy of the requested page, without any modifications made by an + in-progress write transaction. However, this is true only if the + next layer insures that once DRead() is called with write intent + on a page, that no further calls to DRead() are made for that page + until the buffer is written and DRelease() is called with the + dirty flag. Presently, this is the case because udisk_read() and + udisk_write(), which are the only callers of DRead(), are both + called with the database lock held. + + udisk_read, udisk_truncate, udisk_write, udisk_begin, udisk_commit, + udisk_abort, udisk_end, udisk_Invalidate + + These are external interfaces provided by the DISK package to + higher layers. Presently, the locking requirements of these + functions and the lower-layer functions they call are satisfied by + the simple rule that these must always be called with the database + lock. In the future, support for multiple databases and/or + a greater level of concurrency may require relaxing this rule + and/or acquiring additional locks within these functions. + + In udisk_commit(), when committing the first write transaction + after becoming sync site, the database is relabelled. In this + case, the UBIK_RECLABELDB recovery state bit should be set before + propagating the label change to other servers, rather than after. + This is because because ContactQuorum_DISK_Setversion() will + release the database lock, at which point the recovery state may + be cleared by urecovery_ResetState() running in another thread. + It is important that a relabelling which occurs before recovery + state is cleared not result in the UBIK_RECLABELDB recovery state + bit being set after; otherwise, the server may fail to correctly + relabel the database the next time it becomes sync site. + + udisk_Debug + + The udisk_Debug() interface is called as part of preparing the + response to Ubik debugging requests. Unlike other udisk_* APIs, + this interface is not called on a specific database and does not + require the database lock to be held. It reports the version data + of the single database identified by the global 'ubik_dbase' and + traverses the buffer cache collecting statistics, both without + benefit of holding any locks. This may produce inconsistent data + but does not normally risk damaging data structures or accessing + invalid or uninitialized memory (but see below), and avoiding + locking may be of greater benefit than guaranteeing the return of + consistent data. + + The global 'ubik_dbase' pointer is set when the first (only) + database is initialized, and the buffer cache data structures are + allocated the first time a transaction is started, either locally + or via the REMOTE package. Once set, the pointers used by + udisk_Debug() are never changed, and the data structures they + point to remain valid and are never freed. The buffer cache is + traversed by iterating over the array of buffer cache elements + (this is done in other places as well), not by following a linked + list or other data structure that may reordered, and so should not + be dangerous. + + At a minimum, then, it should be arranged that udisk_Debug() + cannot be called before the database structure and page cache have + been initialized. This means that both of these actions must be + taken before any incoming RPCs are accepted. Practically, this + means that DInit() should be renamed udisk_init(), and called from + ubik_ServerInitCommon() during initialization. + + Full correctness demands that udisk_Debug actually hold the + database lock while copying the database version and examining the + page cache. However, it may be desirable to sacrifice this + correctness in order to reduce interaction between the debugging + interfaces and the rest of the system. + + Helper functions: + + unthread - remove transaction from activeTrans + + Dlru, Dmru - move page to front/end of LRUQ + + MatchBuffer - check if a buffer's dbase/file/page match + + FixupBucket - move page to the right bucket + + newslot - find and allocate an unused buffer + + DedupBuffer - throw away stale copies after page is flushed + + GetTrunc, PutTrunc - trunc record allocation + + FindTrunc - find active trunc record for file + + DoTruncs - apply truncations + + + LOCK - Database Transaction Locking + + This package is responsible for managing database locks held as part + of a transaction. It supports a single read/write lock, which is + represented by the global variable 'rwlock', a struct Lock. The + global 'rwlockinit' records whether Lock_Init() has been called on + this lock, and is protected by the database lock. Properly, this + lock should be treated as belonging to the database. + + ulock_getLock() expects to be called with the database lock held; it + depends on this to protect access to fields within the transaction + structure as well as to the global 'rwlockinit'. However, note that + the database lock must be temporarily released while obtaining the + read/write lock, and so ulock_getLock() should not be called in the + middle of a critical section. The 'await' parameter may in theory be + set to 0 to effect a non-blocking lock; however, this depends on + (incorrect) knowledge of the internals of struct Lock. Fortunately, + existing Ubik code always sets await=1. This feature should be fixed + or removed. + + ulock_relLock() releases the read/write lock. It expects to be + called with the database lock held. Unlock ulock_getLock(), this + function does not release the database lock. + + ulock_Debug() reports the state of the read/write lock, for debugging + purposes. It depends on knowledge of the internals of struct Lock, + and accesses both the Lock structure and the global rwlockinit + without benefit of holding any lock. + + It would probably be best to eliminate 'rwlockinit' in favor of + always initializing the global 'rwlock' during startup; ulock_Debug() + would then have no need to examine the flag. Further, ulock_Debug() + should at a minimum use the appropriate locks when looking inside + struct Lock; ideally, knowledge of that structure's internals should + be abstracted into appropriate interfaces in src/lwp/lock.[ch]. + + + VOTE - Voting Logic and VOTE_* RPCs + + This package is responsible for keeping track of who is currently + sync site and deciding how to vote and whether this server should try + to become sync site. It provides the VOTE RPC package, which is + primarily responsible for processing vote requests from peer servers + (VOTE_Beacon), but also provides debugging interfaces and an + interface (VOTE_GetSyncSite) to allow clients to discover the current + sync site. + + Voting decisions are made using state tracked in a set of global + variables, listed below. Currently there is no locking protecting + these variables; LWP-based Ubik depends on the cooperative nature of + LWP and on the fact that routines which examine and manipulate them + do not block. Introduction of a new "vote lock" is recommended to + protect the following globals: + + + ubik_lastYesTime - last time VOTE_Beacon voted "yes" + + lastYesHost - host voted for at ubik_lastYesTime + + lastYesClaim - time lastYesHost started its voting cycle + + lastYesState - whether lastYesHost claimed to be sync site + + lowestHost - lowest host seen + + lowestTime - last time lowestHost was updated/heard from + + syncHost - sync host, or 0 if none known + + syncTime - last time syncHost was updated + + ubik_dbVersion - sync site's database version + + ubik_dbTid - sync site's transaction ID + + The vote lock should be initialized in uvote_Init() and held for the + remainder of that function and also during of uvote_ShouldIRun(), + uvote_GetSyncSite(). It should also be held for the bulk of + SVOTE_Beacon() -- specifically, it should be acquired just before + beginning the lowest-host calculation, and released just before + calling urecovery_CheckTid(), in the event of a yes vote or else just + before returning. + + In addition, the vote lock would protect the globals 'ubik_dbVersion' + and 'ubik_dbTid', which represent this server's knowledge of the + state of the database on the sync site. This variable is set in two + places in the REMOTE package, which change the local and sync site + database versions at the same time. It is also examined in one place + in the REMOTE package, and compared to the database version in the + RECOVERY package. It is imperative that when the local and remote + versions are changed at the same time, these changes become visible + to the RECOVERY package at the same time. Thus, both the changes and + the comparison must be done under both locks. For this purpose, + three new functions should be provided by the VOTE package: + + + A function to safely set 'ubik_dbVersion' + + A function to compare 'ubik_dbVersion' to another value + + A function to check whether this is sync site and compare + 'ubik_dbVersion' to another value, without releasing the vote lock + in between (to be used in recovery). + + Existing accesses to 'ubik_dbVersion' in REMOTE and RECOVERY, all of + which are already made under the database lock, would be replaced + with calls to the new accessors. + + The result of these changes will require in several places that the + vote lock be acquired while the database lock is already held. + Therefore, the vote lock MUST NOT be held while acquiring the + database lock. + + SVOTE_Beacon() currently calls urecovery_CheckTid() without benefit + of the database lock, which must be held when calling that function. + This is fairly simple to fix; however, the vote lock must be released + first. One simple approach would be to move the call to + urecovery_CheckTid() to after the point where the vote lock will be + released, wrapping it in DBHOLD/DBRELE, and conditionalizing the + whole thing on 'vote' being nonzero. + + Separately, it may be worth examining whether it is appropriate to + call urecovery_CheckTid() only when voting yes, and not whenever + a beacon is received from a server claiming to be sync site. + + As with the debug functions in other packages, SVOTE_Debug() and + SVOTE_DebugOld() access global variables belong to this and other + modules without benefit of the appropriate locks. There is also very + little abstraction here. It is probably desirable to introduce Debug + interfaces for the VOTE and RECOVERY packages, to be called from + these functions. Full correctness requires... + + + Holding the vote lock while accessing the VOTE package globals + described above. + + Holding the database lock while accessing the database flags and + tidCounter, ubik_epochTime, ubik_currentTrans, and urecovery_state + + + BEACON - elections, tracking whether this is sync site + + This package is responsible for running elections, collecting votes, + and deciding whether and for how long this is the sync site. It also + contains routines responsible for setting up the list of servers and + acquiring alternate addresses for other servers during startup. + + For this purpose, it maintains a number of global variables and + ubik_server structure fields. Currently, these variables and fields + are sometimes accessed under the database lock, and sometimes without + benefit of any lock. In order to allow this package to operate with + a minimum of disruption due to database accesses, it is recommended + that a new lock be established to protect these. It is particularly + desirable to prevent loss of sync due to the potentially long-lived + RPCs the RECOVERY package must make under the database lock when + transferring full database copies between servers; for this reason, + it is important that both the BEACON thread and SVOTE_Beacon() be + able to operate without acquiring the database lock. The global + variables and ubik_server structure fields to be protected by the new + lock are the following: + + + ubik_amSyncSite - true if this is the sync site + + syncSiteUntil - time until which this is the sync site + + ts->lastVoteTime - last time this server voted yes for us + + ts->lastBeaconSent - last time a beacon was sent to this server + + ts->lastVote - true if its late vote response was yes + + ts->up - true if this server is believed to be up + + ts->beaconSinceDown - true if a beacon has successfully been sent + to this server since the last time it was down + + The new lock would need to be held in the following cases: + + In ubeacon_AmSyncSite(), from the first time 'ubik_amSyncSite' is + examined until after the potential call to urecovery_ResetState(). + + In ubeacon_Interact(), when building the list of servers to which + to send beacons (to protect access to the 'up' flag). Note that + the server address lock must also be held during this loop, and + must be acquired _after_ the beacon lock. + + In ubeacon_Interact(), when updating status associated with + a server after receiving its beacon response. + + In ubeacon_Interact(), when updating globals after an election. + However, it must be released before calling urecovery_ResetState(). + + In updateUbikNetworkAddress(), when marking a server down. + + In the various ContactQuorum_* functions, when checking whether + a server is up before calling it, and again when marking a server + down after a call fails. + + In ubik_EndTrans, when examining a server's 'beaconSinceDown' flag + and 'lastBeaconSent' timestamp to determine whether it has recently + gone down, requiring a brief hold until it either comes back up or + times out. + + The following global variables are set during initialization and do + not change thereafter: + + + nServers - number of voting servers + + amIMagic - true if this server gets the extra half vote + + ubik_singleServer - true if this is the only server + + Currently, all initialization of this module is done _after_ the Rx + services have been created and an Rx server thread started. This is + done to avoid a situation in which all servers are blocked in + updateUbikNetworkAddress(), which is responsible for exchanging + addresses with peer servers, while none of them is prepared to + handled the DISK_UpdateInterfaceAddr call made by that function. + + When using LWP, this early initialization of Rx is not a problem, + because the various initialization routines don't actually block until + updateUbikNetworkAddress() waits for RPCs to complete. However, when + using pthreads, it is a problem, because it means that incoming RPCs + may be processed when initialization is not complete. + + To address this problem and insure orderly initialization, it is + recommended to change the initialization sequence so that most work, + including the bulk of BEACON package initialization, happens before + Ubik's Rx services have been created. The only work that should be + deferred until after that point is updateUbikNetworkAddress(), + followed by starting the long-running threads belonging to the BEACON + and RECOVERY packages. This requires updateUbikNetworkAddress() to + be exported (and perhaps to undergo a name change). + + The major part of the work of this package is done in the function + ubeacon_Interact(), which is the body of a long-running thread known + as the beacon thread. This thread is responsible for periodically + sending "beacons" to other servers to request votes (when the sending + server is the sync site, these also include how long it will remain + so, the database version, and the active transaction ID). Beacons + are sent only from servers that are already sync site or believe they + are the best candidate, as determined by uvote_ShouldIRun(). + + On the sync site, the beacon thread is responsible for insuring that + new votes are collected frequently enough to avoid loss of quorum. + This means it must not block for an extended time; therefore, it must + not acquire the database lock, which in other threads may sometimes + be held during long-running operations (most notably, database file + transfers done under this lock by the recovery thread). Instead, + a new lock is proposed (the "version lock"), which is used to allow + the beacon thread to examine (but not modify) certain fields without + holding the database lock. + + The version lock should be acquired _in addition to_ the database + lock when modifying 'ubik_epochTime' or the database 'version', + 'flags', 'tidCounter', and 'writeTidCounter' fields; such + modifications occur at the following locations: + + in ubik.c:BeginTrans(), when beginning a new transaction + + in udisk_begin(), when setting DBWRITING + + in udisk_commit(), when updating the database version at the end of + a write transaction (note this includes the case of relabelling the + database on the first write transaction after becoming sync site) + + in udisk_end(), when clearing DBWRITING + + in recovery.c:InitializeDB(), when setting the initial version of + an empty database + + in urecovery_Interact(), when labelling a new database the first + time quorum is established + + in urecovery_Interact(), after fetching the latest database from + another server (whether successful or not; there are two cases) + + in SDISK_SendFile(), when updating the database version both before + and after receiving a new database from the sync site. Note the + lock must _not_ be held during the transfer. + + in SDISK_SetVersion() + + The version lock should be acquired by the beacon thread while + examining these fields, shortly before calling multi_VOTE_Beacon(). + Since it must release the lock before making that call, the database + version will need to be explicitly copied into a local variable, as + is already done with the current transaction ID. + + Note that if UBIK_PAUSE is defined, the DBVOTING flag poses an + additional problem, as it must be modified by the beacon thread. If + it is desirable to continue to support this variant, then it becomes + necessary for all accesses to the database flags to be protected by + the version lock. Other UBIK_PAUSE code may not have been reviewed + in depth and may pose additional problems. + + The function ubeacon_AmSyncSite() is called both from the beacon + thread and from various other points to determine whether they are + running on the sync site, and thus whether various operations are + safe and appropriate. Calls to this function from outside the beacon + thread (often, via urecovery_AllBetter()) must be made with the + database lock held, and for operations modifying the database or + relying on its contents to remain constant, this lock must not then + be released until the operation is complete. The does not guarantee + that this will remain sync site for the duration of the operation, + but does insure that any operations done after or as a result of + losing sync will in fact occur after the operation holding the lock. + + Of course, ubeacon_Interact() itself cannot call ubeacon_AmSyncSite() + with the database lock held, since it must not block on that lock + when running on the sync site (otherwise sync could be lost, as + described above). So, that portion of ubeacon_AmSyncSite() other + than the call to urecovery_ResetState() should be abstracted out into + a new function, which presumably returns separate flags indicating + both whether it is in fact running on the sync site (which becomes + the return value of ubeacon_AmSyncSite()) and whether a call to + urecovery_ResetState() is indicated. When the latter is set, the + beacon thread can acquire the database lock (since, if this is not + the sync site, blocking on this lock is not a problem) and call + urecovery_ResetState(). + + This difference in behavior is acceptable because the required + invariant is that when 'ubik_amSyncSite' is cleared, it cannot become + true again until urecovery_state has also cleared. This insures that + once a server discovers it is not sync site, urecovery_AllBetter() + cannot succeed on the basis of being sync site until sync is regained + and recovery has run, insuring the database is up to date. Since + only the beacon thread ever sets 'ubik_amSyncSite' true, that thread + can meet the invariant by calling urecovery_ResetState() before doing + anything else, while other threads must continue to hold the beacon + lock to prevent the beacon thread from setting 'ubik_amSyncSite'. + + ubeacon_Debug() reports the value of syncSiteUntil, without benefit + of obtaining any locks. It should also report the value of + 'ubik_amSyncSite', allowing that global to be made private to the + BEACON package. Full correctness requires that the beacon lock be + held while accessing these globals. + + + RECOVERY - Consistency, Down Server Recovery, Propagation, Log Replay + + This package is responsible for tracking whether the local database + is current and usable and for discovering when a server comes back up + after having been marked down. On the sync site, it is also + responsible for tracking the database versions on remote servers and + for recovering from inconsistencies. + + To carry out these tasks, the RECOVERY package maintains a single + global, 'urecovery_state'. This is not currently effectively + protected by any lock; however, it should be protected by the + database lock. This is particularly important because high-level + routines that manipulate the database depend on a check against this + field to determine that the database is up-to-date and safe to + modify, and that it will remain so until the operation is complete. + + The main work of this package is done by urecovery_Interact(), which + is the body of a long-running thread known as the recovery thread. + This thread wakes up periodically to discover if any down servers + have come back up and, on the sync site, to locate the latest + available database, fetch it if necessary, and make sure it has been + distributed to all servers. This work is done in several steps, + taken on in sequence every few seconds: + + + The first part of urecovery_Interact() is to identify servers which + are believed to be down, and determine whether any of them have + come back up. This is done every 30 seconds, even when this is not + the sync site. Currently, the database lock is held for the + duration of this loop, released only in DoProbe() while actually + waiting for the DISK_Probe() RPC to complete. This is mostly + unnecessary; the server probe loop needs this lock only to examine + the database 'currentDB' flag and to clear the UBIK_RECFOUNDDB bit, + and DoProbe() doesn't need it at all. However, the beacon lock is + needed to examine and update the database 'up' flag, and DoProbe() + needs to hold the server address lock when examining server + addresses and, if the server comes back up, to replace connections. + See the section on struct ubik_server for more details on this + lock. + + + The next step is to determine whether this is the sync site, and + update the UBIK_RECSYNCSITE bit appropriately. The database lock + should be held for the duration of this step, as + ubeacon_AmSyncSite() may end up calling urecovery_ResetState(), + which requires that lock be held. If this is not the sync site, + the cycle ends here. + + + Step three is to determine the location of the latest database. + This is done by making DISK_GetVersion() calls to all active + servers, keeping track of the newest version found and the server + which reported it. Once this is done, the best version is compared + to the local version, update the UBIK_RECHAVEDB bit (so that it is + true if and only if the local version is the latest) set the + UBIK_RECFOUNDDB bit to indicate the latest database has been + located, and clear the UBIK_RECSENTDB bit to force any servers with + out-of-date databases to be updated. + + In this section, it is not necessary to hold any locks while + collecting versions on remote servers, though as before, it is + necessary to hold the beacon lock while examining the server 'up' + flag, and the address lock to obtain a reference on each connection + as it is used (but this lock should not be held during the RPC!). + Once all remote versions have been fetched, the database lock is + needed while examining the local version and updating + 'urecovery_state'. Note that write transactions may happen while + versions are collected, resulting in some servers having newer + versions than were detected. However, if this happens, the result + is always consistent and ends with the sync site having the latest, + canonical version: + + + If the actual latest version X.N before the write was in an epoch + created by the current sync site, then that version must + necessarily already be present on the sync site, since writes are + always committed first on the sync site. Thus, when the sync + site updates the version counter, it produces a new version X.N+1 + which no other server can believe it already has; thus, there is + no possibility of inconsistency. + + + If the actual latest version X.N before the write was in an epoch + not created by the sync site, then it is possible that after + quorum was established but before any writes happened, a new + server came online which had a newer version X.N+1 (that was + committed to less than a quorum of sites before a crash). If + a write transaction begins before recovery detects this fact and + fetches the version X.N+1 database, then the new write will start + with the version X.N database, essentially creating a version + fork. However, when this happens, the resulting database will be + version Y.1 (with Y>X), because the first write on a new sync + site always results in a new database. Thus, the version X.N+1 + database will no longer be latest (because the local version, + which is examined last, is Y.1), and will end up being discarded. + This is no worse than if the server with the version X.N+1 + database had not come up until completely after the transaction + resulting in version Y.1. + + + Step four is to fetch the latest database, if the sync site does + not already have it. Since this step replaces the database + wholesale, there can be no active transactions while it runs, and + all other database activity must be locked out for the duration of + the transfer. This means the database lock must be held for this + entire step, starting from the check that the UBIK_RECSYNCSITE + and/or UBIK_RECFOUNDDB bits are still set. (Note that the existing + comment which claims that it is impossible for UBIK_RECFOUNDDB not + to be set at this point is true only if the database is not + released since checking or updating that bit in step three). In + addition, the address lock must be held while setting up the call + to be used for DISK_GetFile(). + + + Step five, again after verifying that this is still the sync site + and that the UBIK_RECHAVEDB bit is set, is to distribute the new + database to any servers which do not already have it. Before this + can be done, if the database was newly-initialized (with label + epoch 1), it is relabeled with version 2.1 before it is distributed + to other sites. This allows readany transactions to run on such + a database (though probably not successfully, since the database is + empty); it is deferred until this point in recovery to insure that + if any server in the quorum contains a valid database, that version + is used rather than the empty one. + + For this step, the database lock should be held starting at the + point at which the UBIK_RECSYNCSITE and/or UBIK_RECHAVEDB bits are + tested. In the event the DBWRITING flag is set, the lock must be + released while waiting for the active write transaction to end. In + this case, once the wait is completed, it is necessary to again + check urecovery_state to determine that this is still the sync site + and have an up-to-date database before proceeding. + + It is probably simplest for urecovery_Interact() to acquire the + database lock at the start of step 2 described above, release it for + the duration of the version-collection part of step 3 and while + waiting for write transactions to terminate in step 5, and otherwise + hold it until the end of the cycle. If no work is needed, this means + the lock will be held relatively briefly; if it is necessary to fetch + and or send full database versions, these operations themselves must + be done with the lock held, and there is little point in releasing it + between them. However, it is acceptable to release the lock between + any steps described above, provided that each time the database lock + is released and reacquired, 'urecovery_state' is checked to verify + that no regression has occurred. Note that outside of the recovery + thread, the only changes that may occur to 'urecovery_state' are that + it may be completely cleared by urecovery_ResetState(), and that the + UBIK_RECLABELDB bit may be set by udisk_commit(), on the sync site. + + The function urecovery_AllBetter() is called under the database lock + by routines which need to know whether there is a usable database. + As noted in the discussion of ubeacon_AmSyncSite() above, callers + must then continue to hold the database lock until any database + operation is complete. Further, callers which intend to write to the + database currently follow a call to urecovery_AllBetter() with a call + to ubeacon_AmSyncSite(), to determine whether a write is permissible. + Unfortunately, this is not safe, because it is possible (though + admittedly unlikely) for urecovery_AllBetter() to succeed on the + basis of being a non-sync-site with an up-to-date database, and then + for a subsequent call to ubeacon_AmSyncSite() to return a true result + (because the latter may block on acquiring the beacon lock, during + which time this may become the sync site). To insure that no write + operation is performed without an up-to-date, writeable database, + urecovery_AllBetter() should be modified to indicate whether a write + operation is permitted, which is the case only when UBIK_RECHAVEDB is + tested and found to be set. + + The functions urecovery_AbortAll(), urecovery_CheckTid(), and + urecovery_ResetState() all expect to be called with the database lock + held. + + The function urecovery_LostServer() need not be called with any + locks; it is merely a wrapper for ubeacon_ReinitServer(). + + The RECOVERY package is also responsible at startup for replaying the + transaction log and initializing an empty database. These tasks are + handled by urecovery_Initialize() and its helpers, ReplayLog() and + InitializeDB(). Because the helpers use PHYS package functions and + manipulate database structure fields, urecovery_Initialize() should + acquire the database lock before calling them. + + + REMOTE - remote database I/O (DISK) RPCs + + This package provides the DISK RPC package, which mediates remote + access to the local database copy, when another server is sync site. + It also includes DISK_UpdateInterfaceAddr(), which is used during + initialization to verify consistency of configuration across servers, + and DISK_Probe(), which is used by the RECOVERY module to determine + whether the server is up. + + As the functions in this package are RPC implementations, they may be + called at any time and are entirely responsible for their own + locking. There can be at most one active remote write transaction at + a time; this rule is enforced by this package, which maintains the + global 'ubik_currentTrans' as a pointer to the active transaction. + + For the most part, the RPCs in this module use a common prologue, + which includes acquiring the database lock, and do not release that + lock until just before returning. However, the common prologue + examines both 'ubik_currentTrans' and the transaction flags without + benefit of the database lock. This can be corrected fairly easily by + acquiring the database lock sooner; however, this requires referring + to the database via the ubik_dbase global rather than via the 'dbase' + pointer in 'ubik_currentTrans'. + + Note that SDISK_GetVersion() and SDISK_SetVersion() contain a sanity + check to ensure they are not running on the sync site, since these + are called only from that portion of the recovery loop which runs + only on the sync site. These calls to ubeacon_AmSyncSite() must be + made with the database lock held, for the results to be valid. The + same holds true for a commented-out check in SDISK_GetFile() (which + really should remain that way -- there's no reason not to allow any + authorized caller to do this at any time). + + Additionally, note that SDISK_Commit() acquires a write lock on the + application cache, to insure that no transactions are running which + depend on the application cache, page cache, or underlying database + to change. The locking order requires that this lock be acquired + before taking the database lock; thus, both locks must be acquired + earlier. + + SDISK_SendFile() is really quite bad in its handling of the database + lock under error conditions. Presently, an error condition may cause + it to release the lock too early nor not at all. + + SDISK_UpdateInterfaceAddr() is called by peer servers as part of + their startup process, to trade lists of additional addresses and + insure sufficient agreement on configuration to allow the new server + to start. It will need to acquire the lock protecting the addr[] + array attached to each host. + + + UBIK (high-level API) + + This package contains the high-level APIs exported by Ubik to + applications, including initialization, the APIs for creating and + using transactions, and various utilities. It also contains a number + of internal utilities. + + Initialization + + Initialization of Ubik is handled in ubik_ServerInitCommon(), + which is called by two API functions, ubik_ServerInit() and + ubik_ServerInitByInfo(), which provide different interfaces for + passing configuration information. This code handles + initialization of the database, locks, and various globals, calls + the initialization routines for each subsystem, makes sure Rx has + been initialized, and creates Ubik's Rx services and threads. + + Most of this is done before any Ubik-specific threads are created + and before Ubik's RPCs can be called, and so takes place without + holding any lock. However, since Rx may have been initialized and + started before Ubik, it is possible that some Rx worker threads + may have already been created. In practice, locks internal to Rx + should insure that any memory writes made before an Rx service is + created become visible to any worker running on behalf of that + service; nonetheless, this should not be depended on, and proper + locking should be used during all phases of initialization. + + Unfortunately, the current code actually starts Rx services before + initialization is complete, resulting in a number of possible + races. The general THREAD SAFETY section below contains + recommendations for reordering the startup process to eliminate + this problem, while still avoiding potential deadlock if multiple + servers start at once. + + ContactQuorum_* + + Nearly a third of ubik.c consists of various ContactQuorum_* + functions, which are used to make a particular RPC to every server + in the quorum. These used to be a single ContactQuorum() + function, but have been split into multiple functions to regain + type-safety. These retain a common prologue and epilogue, some + small part of which has been abstracted out, but most of which + remains duplicated in each of these functions. Indeed, the only + part unique to each function is the actual RPC call and, in the + case of ContactQuorum_DISK_WriteV(), code to fall back to multiple + calls to DISK_Write() on a per-server basis. + + The code common to all ContactQuorum_* functions will need to + acquire the beacon lock when checking whether a server is up (no + calls are made to servers which are not marked up), and again when + marking a server down. Note that when a server is marked down, + the 'up' and 'beaconSinceDown' flags must be cleared at the same + time, without releasing the beacon lock. In addition, the helper + function Quorum_StartIO() must obtain the server address lock + while it obtains a reference on the Rx connection that will be + used. + + All of the ContactQuorum_* functions must be called with the + database lock held; this is necessary to protect their access to + each server's 'version' and 'currentDB' fields and to the local + database version. However, these functions release the lock while + actually making RPCs. This behavior is new -- ContactQuorum() in + OpenAFS 1.4 did not release the database lock during RPCs -- but + it is safe, because no caller of ContactQuorum_*() depends on + state read under the database lock before the call to remain valid + after. + + Transactions + + The main Ubik transaction API consists of ubik_BeginTrans(), + ubik_BeginTransReadAny(), ubik_BeginTransReadAnyWrite(), + ubik_AbortTrans(), ubik_EndTrans(), ubik_Read(), ubik_Write(), + ubik_Flush(), ubik_Seek(), ubik_Tell(), ubik_Truncate(), and + ubik_SetLock(). The ubik_Begin* functions require a database + pointer; all others require an active transaction pointer. These + functions are called directly by the application without any + locks, and are responsible for their own locking. Note that many + of these examine the transaction type before acquiring any locks; + this is permissible because the transaction type is immutable once + the transaction is created (but it requires that the call be made + from the thread that created the transaction, or that the + application have used some mechanism to insure an appropriate + memory barrier exists; the same mechanism used to protect the + transaction pointer should suffice). + + BeginTrans() contains the common code responsible for starting + a new transaction; this is made available to applications as + ubik_BeginTrans(), ubik_BeginTransReadAny(), and + ubik_BeginTransReadAnyWrite(). If a write transaction is in + progress, this function waits for it to complete. However, since + doing so releases the database lock, it is necessary to call + urecovery_AllBetter() again once the lock has been reacquired. + Further, in the UBIK_PAUSE case, the DBVOTING flag may have been + set, so this also must be retested (see the BEACON package section + for more on the difficulties of UBIK_PAUSE and DBVOTING). + + Both ubik_Flush() and ubik_Write() examine and manipulate the + transaction I/O vector without benefit of any lock. These fields + ('iovec_info' and 'iovec_data') should be protected by the + database lock. In the case of ubik_Write(), this may require + releasing the lock before calling ubik_Flush(), or the + introduction of a private ubik_Flush_r() which assumes the lock is + already held (but note that in either case, the lock is released, + because ubik_Flush() calls ContactQuorum_* functions). + + Utilities + + The functions ubik_GetVersion() and ubik_WaitVersion() provide the + application with a way to discover the current database version + and to wait for it to change. These interfaces are not currently + used in OpenAFS. ubik_GetVersion() needs to acquire the database + lock while copying the database version. + + The internal function ubikGetPrimaryInterfaceAddr() is used by + Ubik RPCs to determine a peer server's primary address, given the + IP address from which a call arrived. This needs to hold the + server address lock, as described in the section on struct + ubik_server. + + Application Cache + + The section on struct ubik_dbase describes the operation of the + application cache. The function ubik_CheckCache() is provided to + allow an application to insure the cache is up to date and obtain + a read lock on the cache which lasts for the duration of the + transaction. + + Note that ubik_AbortTrans() currently always invalidates the + application cache by clearing ubik_dbase->cachedVersion, for which + it requires a write lock on the cache_lock. This means that + aborted transactions block waiting for completion of any + transactions which hold the cache_lock, which may well include all + outstanding transactions. In the case of read transactions, which + cannot have modified the database, this is unnecessary and may + significantly reduce concurrency. + + +THREAD SAFETY + + This section discusses changes needed to insure thread safety. + + The initialization sequence in ubik_ServerInitCommon() should be cleaned + up, to ensure that every subsystem is fully initialized before starting + any background threads or accepting RPCs. The correct initialization + sequence should look something like this: + + + Initialize error tables + + Allocate and initialize the database structure + + Initialize Rx + + Initialize the various subsystems: + + udisk_Init() (formerly disk.c:DInit) + + ulock_Init() (new, pulled out of ulock_getLock) + + uvote_Init() + + urecovery_Initialize() + + ubeacon_InitServerList* + + Create Rx services + + Start an Rx server thread + + updateUbikNetworkAddress() (rename and export from beacon.c) + + Start the beacon and recovery threads + + Initialization functions may need to be added for some subsystems which + do not currently have them: + + + src/ubik/disk.c:DInit() should be exported and renamed + udisk_Init(); it can then be called from ubik_ServerInitCommon() + instead of from udisk_begin(). + + + The lock initialization currently done in ulock_getLock() should + instead be done in a separate ulock_Init(), which should be called + from ubik_ServerInitCommon() + + A new lock is needed to protect state belonging to the VOTE package, + including the globals 'ubik_dbVersion' and 'ubik_dbTid', which represent + the sync site's database state. See the VOTE package section for + details on the use of the vote lock. + + A new lock is needed to protect state belonging to the BEACON package, + including the globals 'ubik_amSyncSite' and 'syncSiteUntil' and several + fields in the ubik_server structure. This lock also protects the + per-server 'up' flag, which is shared among several modules. In + addition, some beacon-specific fields are accessed directly by other + modules. Thus, this lock must be visible outside the BEACON module, or + else accessors must be provided for these items. See the BEACON package + section for details on the use of the beacon lock. + + A new lock is needed to provide additional protection for the database + version, transaction counter, and flags, so that these can safely be + examined (but not updated) by the beacon thread without the need to + acquire the database lock. See the BEACON package section for details + on the use of the version lock. + + A new lock is needed to protect per-server lists of non-primary + addresses, connections to the VOTE and DISK services on each server, and + the client-mode security class objects used to set up these connections. + See the section on struct ubik_server for details on the use of the + server address lock. + + The required locking order is described by the following list. Any of + these locks may be acquired singly; when acquiring multiple locks, they + should be acquired in the listed order: + + + application cache lock (dbase->cache_lock) + + database lock (DBHOLD/DBRELE) + + beacon lock + + vote lock + + version lock + + server address lock + + Some cleanup is required in various places where existing locking + conventions are not properly obeyed: + + + SVOTE_Beacon() needs to hold the database lock when calling + urecovery_CheckTid() + + + Management of the database lock in SDISK_SendFile() needs to be + cleaned up so that it is always released at the proper time. + + + urecovery_Interact() is not consistent about holding the database + lock when it examines and updates 'urecovery_state'. It also + examines the local database version at least once without benefit + of this lock. See the RECOVERY package section for details on + which locks are needed when by urecovery_Interact(). + + + ubik_Flush() and ubik_Write() need to acquire the database lock + before accessing the 'iovec_info' and 'iovec_data' fields. + + + ubik_GetVersion() needs to acquire the database lock before copying + the database version. + + The various debugging RPCs in the VOTE packages, and the data collection + routines they call in other packages, generally operate without benefit + of any locks. This is probably desirable, as it avoids any possibility + of debugging operations blocking on or interfering with the rest of the + system, and allows debugging data to be collected even in the event of + some unforeseen deadlock. It is also "safe", in that it does not risk + damage to data structures or access to uninitialized memory, provided + that data structure initialization is complete before these functions + are called. However, it does mean that these RPCs may return data that + is not entirely self-consistent. + + + Correctness requires various locks be held during parts of + SVOTE_Debug() and SVOTE_DebugOld(); see the VOTE package section. + + + SVOTE_Debug() also examines 'ubik_currentTrans' without benefit of + the database lock. This usage is not "safe", as that transaction + may be freed by another thread while SVOTE_Debug() is examining it. + + + SVOTE_XSDebug() should hold the server address lock while copying + out server addresses, and other locks while copying per-server + state. + + + udisk_Debug() should hold the database lock while examining the + database version and collecting page cache statistics. + + + ulock_Debug should use LOCK_LOCK/LOCK_UNLOCK when examining the + internals of struct Lock. Ideally, knowledge of these internals + should be abstracted to src/lwp/lock.h. + + + ubeacon_Debug() should hold the beacon lock while accessing the + value of 'syncSiteUntil' and, if the access is moved here from + SVOTE_Debug/SVOTE_DebugOld, 'ubik_amSyncSite'. + +CONCURRENCY THOUGHTS + + This section collections information discussed previously about changes + that may be required or desirable in order to improve concurrency. Note + that none of these changes are needed to insure thread safety, provided + the changes discussed in the previous section are made. + + + PHYS could be made more concurrent by maintaining a separate lock for + the fdcache, avoiding the need to DBHOLD while calling these. + However, higher-layer consistency probably requires that certain calls + or groups of calls to this package be atomic and isolated. + + + Could DISK be made more concurrent by maintaining a separate lock for + the page cache and or free truncation record list? + + + Could concurrency be improved by maintaining separate per-transaction + locks? If so, what is the locking hierarchy with respect to database + locks and the DISK and PHYS package locks? + + + Could concurrency be improved by maintaining more reader/writer state + and/or separate write locks to insure consistency of database file + access without requiring so much DBHOLD? + + + Alternately, could concurrency be improved by requiring that an active + transaction be accessed only by the thread that created it? + + +MULTI-DATABASE SUPPORT + + There isn't any. Some of the Ubik interfaces and even the organization + of some internal data structures makes it appear as though multiple + databases can be supported. However, this is not the case, for several + reasons: + + + Only a single set of peer servers is tracked, via a global linked list + of server structures. Each server structure includes state such as + the server's database version and whether it is up-to-date, so it is + not possible to track multiple databases even for a common set of + servers. + + The physical I/O package (phys.c) tracks cached open file descriptors + without regard to what database a file is part of; thus, it cannot + handle accessing multiple databases at the same time. + + There are a variety of globals containing state which needs to be + tracked on a per-database basis. + + There are a variety of globals which are effectively protected by + a per-database lock, or by no lock at all. + + The various RPCs used to communicate between servers assume a single + database and do not include a database identifier. Thus, in the + presence of multiple databases, it would be impossible to determine, + for example, for which database a DISK RPC was intended. + + The upshot of this is that considerable work would be needed to get Ubik + into a condition which would allow a single process to maintain multiple + independent Ubik databases at the same time. Note that this is + orthogonal to the question of whether a single multi-file database is + possible. + + +OTHER DATA STRUCTURES + + Some other data structures are mentioned in ubik.p.h or elsewhere, but + are of relatively little interest from the point of view of threading. + They are described here primarily as an alternative to deleting + descriptive text that was already written. + + struct ubik_hdr - on-disk database file header + + This structure represents the on-disk format of the Ubik database + file header (label). It is used only for local variables in + uphys_getlabel() and uphys_setlabel(), which read and write this + header, respectively. + + struct ubik_stat - database file status + + This structure is used to convey the size and modification timestamp + of a database file. It is used as the type of an output parameter of + uphys_stat(), and for local variables in callers of that function. + + struct ubik_stats - global statistics + + This structure is used to contain "miscellaneous" global statistics. + Currently that consists of a single counter which is incremented in + one place (an exceptional condition in ubik_EndTrans) and is never + read. The one increment is performed only when holding the database + lock; thus, this structure may be treated the same as other globals + protected by non-global locks. + +OTHER NOTES + + There appears to be nothing to insure that calls to disk.c:DRead() + during a write transaction won't find and use a "clean" cached page when + there is already a dirty page in the cache resulting from an earlier + write as part of the same transaction. This means that if a transaction + reads a page after writing it, it may read back the original data + instead of the new, and if a transaction writes a page more than once, + the later writes may corrupt the database. It seems likely that we are + merely lucky in this regard, and this problem should be fixed. + + src/ubik/beacon.c:ubeacon_ReinitServer() assumes 'ubik_CRXSecurityRock' + is in fact an AFS configuration directory pointer, suitable as an + argument to afsconf_UpToDate(). This is inappropriate; + 'ubik_CRXSecurityRock' is an opaque argument to + (*ubik_CRXSecurityProc)(), Ubik must not make assumptions about what it + is. + + It may be worth examining whether it is appropriate for SVOTE_Beacon() + to call urecovery_CheckTid() only when voting yes, and not whenever + a beacon is received from a server claiming to be sync site. + + Recently, code has been introduced which attempts to insure that the + recovery process does not truncate an old database file before a new one + has been fully transferred, since this could, depending on the timing of + a server failure, deprive the new quorum of a valid database. This + so-called "new recovery" code violates the abstraction provided by the + PHYS package, encoding specific knowledge of the underlying file store + into the REMOTE and RECOVERY packages. This means that the backend + provided by phys.c cannot be replaced with an alternative, since in that + case recovery would do the wrong thing with the transferred database. + + The DBWRITING loop in urecovery_Interact() is not portable, as it + depends the contents of the timeout after a call to select(2). This was + fine for IOMGR_Select(), which has defined behavior, but not for + select(2). + -- 2.39.5