From: Michael Meffie Date: Wed, 2 Aug 2017 00:10:32 +0000 (-0400) Subject: doc: relocate notes from arch to txt X-Git-Tag: upstream/1.8.0_pre2^3~11 X-Git-Url: https://git.michaelhowe.org/gitweb/?a=commitdiff_plain;h=c6f5ebc4cf95b0f1d3acc7a0a8678ba0d4378243;p=packages%2Fo%2Fopenafs.git doc: relocate notes from arch to txt The doc/txt directory has become the de facto home for text-based technical notes. Relocate the contents of the doc/arch directory to doc/txt. Relocate doc/examples to doc/txt/examples. Update the doc/README file to be more current and remove old work in progress comments. Change-Id: Iaa53e77eb1f7019d22af8380fa147305ac79d055 Reviewed-on: https://gerrit.openafs.org/12675 Tested-by: BuildBot Reviewed-by: Benjamin Kaduk --- diff --git a/doc/README b/doc/README index c9fd64c39..4988494d5 100644 --- a/doc/README +++ b/doc/README @@ -1,58 +1,22 @@ What's in the "doc" subdirectory -** doc/html -original IBM html doc, no longer used - ** doc/man-pages pod sources for man pages (converted from original IBM html source). ** doc/xml -xml sources for manuals (converted from original IBM html source). There is -some generated pdf/html content as well for the curious. - -Note that doc/xml/AdminReference uses doc/xml/AdminReference/pod2refentry to -convert the pod man pages to xml for printing. pod goes directly to html -just fine. - -The reference guide is now built by converting the existing pod documentation -to xml. However, the indexing information was lost during the initial pod -conversion. Someday we will need to try to get that back. +xml sources for manuals (converted from original IBM html source). +Note: The doc/xml/AdminRef uses doc/xml/AdminRef/pod2refentry to convert the +pod man pages to xml for printing. pod goes directly to html just fine. ** doc/pdf -old Transarc (and possibly pre-Transarc) protocol and API documentation for -which we have no other source +Old Transarc (and possibly pre-Transarc) protocol and API documentation for +which we have no other source. ** doc/txt -doc/examples -a few other miscellaneous files. - - -From: Russ Allbery - -The Administrative Reference has been converted into separate POD man pages -for each command, since that's basically what it already was (just in HTML). -Considerable work remains to update that POD documentation to reflect the -current behavior of OpenAFS (for example, there's no documentation of -dynroot, no mention of Kerberos v5, many fileserver options are -undocumented, the afsd switch documentation is out of date, and so forth). -I've collected as many of those deficiencies as I know of in -doc/man-pages/README. Any contributions to correct any of those deficiencies -are very welcome. This is one easy place to start. - -The other reference manuals (the Administrator's Guide, the Quick Start -Guide, and the User's Guide) are more manual-like in their structure. After -some on-list discussion, we picked DocBook as the format to use going -forward and the existing HTML files have been converted to DocBook with a -script. This means that the markup could use a lot of cleaning up and the -content is even less updated than the man pages. +Technical notes, Windows notes, and examples. -I did some *very* initial work on the Quick Start Guide, just to get the -makefile working and to try some simple modifications. Simon Wilkinson is -currently working on making more extensive modifications. If you want to -work on the Quick Start Guide, please coordinate with him to avoid duplicate -work.
+** doc/doxygen +Configuration files for the doxygen tool to generate documentation from +the annotated sources. See the 'dox' Makefile target in the top level +Makefile. -The Administrator's Guide and User's Guide have not yet been touched. -Of -those, the latter is probably in the best shape, in that the user commands -and behavior haven't changed as much. If you'd like to start working on one -of those, that would also be great. diff --git a/doc/arch/README b/doc/arch/README deleted file mode 100644 index 4c4690f9f..000000000 --- a/doc/arch/README +++ /dev/null @@ -1,13 +0,0 @@ - -- dafs-fsa.dot is a description of the finite-state machine for volume -states in the Demand Attach fileserver -- dafs-vnode-fsa.dot is a description of the finite-state machine -for vnodes in the Demand Attach fileserver. - -Both diagrams are in Dot (http://www.graphviz.org) format, -and can be converted to graphics formats via an -invocation like: - - dot -Tsvg dafs-fsa.dot > dafs-fsa.svg - - diff --git a/doc/arch/arch-overview.h b/doc/arch/arch-overview.h deleted file mode 100644 index 64c6cb834..000000000 --- a/doc/arch/arch-overview.h +++ /dev/null @@ -1,1224 +0,0 @@ -/*! - \addtogroup arch-overview Architectural Overview - \page title AFS-3 Programmer's Reference: Architectural Overview

-\author Edward R. Zayas -Transarc Corporation -\version 1.0 -\date 2 September 1991 22:53 Copyright 1991 Transarc Corporation All Rights -Reserved FS-00-D160 - - \page chap1 Chapter 1: Introduction - - \section sec1-1 Section 1.1: Goals and Background - -\par -This paper provides an architectural overview of Transarc's wide-area -distributed file system, AFS. Specifically, it covers the current level of -available software, the third-generation AFS-3 system. This document will -explore the technological climate in which AFS was developed, the nature of -problem(s) it addresses, and how its design attacks these problems in order to -realize the inherent benefits in such a file system. It also examines a set of -additional features for AFS, some of which are actively being considered. -\par -This document is a member of a reference suite providing programming -specifications as to the operation of and interfaces offered by the various AFS -system components. It is intended to serve as a high-level treatment of -distributed file systems in general and of AFS in particular. This document -should ideally be read before any of the others in the suite, as it provides -the organizational and philosophical framework in which they may best be -interpreted. - - \section sec1-2 Section 1.2: Document Layout - -\par -Chapter 2 provides a discussion of the technological background and -developments that created the environment in which AFS and related systems were -inspired. Chapter 3 examines the specific set of goals that AFS was designed to -meet, given the possibilities created by personal computing and advances in -communication technology. Chapter 4 presents the core AFS architecture and how -it addresses these goals. Finally, Chapter 5 considers how AFS functionality -may be improved by certain design changes. - - \section sec1-3 Section 1.3: Related Documents - -\par -The names of the other documents in the collection, along with brief summaries -of their contents, are listed below. -\li AFS-3 Programmer's Reference: File Server/Cache Manager Interface: This -document describes the File Server and Cache Manager agents, which provide the -backbone file management services for AFS.
The collection of File Servers for a -cell supplies centralized file storage for that site, and allows clients running -the Cache Manager component to access those files in a high-performance, secure -fashion. -\li AFS-3 Programmer's Reference: Volume Server/Volume Location Server -Interface: This document describes the services through which "containers" of -related user data are located and managed. -\li AFS-3 Programmer's Reference: Protection Server Interface: This paper -describes the server responsible for mapping printable user names to and from -their internal AFS identifiers. The Protection Server also allows users to -create, destroy, and manipulate "groups" of users, which are suitable for -placement on Access Control Lists (ACLs). -\li AFS-3 Programmer's Reference: BOS Server Interface: This paper covers the -"nanny" service which assists in the administrability of the AFS environment. -\li AFS-3 Programmer's Reference: Specification for the Rx Remote Procedure Call -Facility: This document specifies the design and operation of the remote -procedure call and lightweight process packages used by AFS. - - \page chap2 Chapter 2: Technological Background - -\par -Certain changes in technology over the past two decades greatly influenced the -nature of computational resources, and the manner in which they were used. -These developments created the conditions under which the notion of a -distributed file system (DFS) was born. This chapter describes these -technological changes, and explores how a distributed file system attempts to -capitalize on the new computing environment's strengths and minimize its -disadvantages. - - \section sec2-1 Section 2.1: Shift in Computational Idioms - -\par -By the beginning of the 1980s, new classes of computing engines and new methods -by which they may be interconnected were becoming firmly established. At this -time, a shift was occurring away from the conventional mainframe-based, -timeshared computing environment to one in which both workstation-class -machines and the smaller personal computers (PCs) were a strong presence. -\par -The new environment offered many benefits to its users when compared with -timesharing. These smaller, self-sufficient machines moved dedicated computing -power and cycles directly onto people's desks. Personal machines were powerful -enough to support a wide variety of applications, and allowed for a richer, -more intuitive, more graphically-based interface for them. Learning curves were -greatly reduced, cutting training costs and increasing new-employee -productivity. In addition, these machines provided a constant level of service -throughout the day. Since a personal machine was typically only executing -programs for a single human user, it did not suffer from timesharing's -load-based response time degradation. Expanding the computing services for an -organization was often accomplished by simply purchasing more of the relatively -cheap machines. Even small organizations could now afford their own computing -resources, over which they exercised full control. This provided more freedom -to tailor computing services to the specific needs of particular groups. -\par -However, many of the benefits offered by the timesharing systems were lost when -the computing idiom first shifted to include personal-style machines. One of -the prime casualties of this shift was the loss of the notion of a single name -space for all files.
Instead, workstation- and PC-based environments each had -independent and completely disconnected file systems. The standardized -mechanisms through which files could be transferred between machines (e.g., -FTP) were largely designed at a time when there were relatively few large -machines that were connected over slow links. Although the newer multi-megabit -per second communication pathways allowed for faster transfers, the problem of -resource location in this environment was still not addressed. There was no -longer a system-wide file system, or even a file location service, so -individual users were more isolated from the organization's collective data. -Overall, disk requirements ballooned, since lack of a shared file system was -often resolved by replicating all programs and data to each machine that needed -it. This proliferation of independent copies further complicated the problem of -version control and management in this distributed world. Since computers were -often no longer behind locked doors at a computer center, user authentication -and authorization tasks became more complex. Also, since organizational -managers were now in direct control of their computing facilities, they had to -also actively manage the hardware and software upon which they depended. -\par -Overall, many of the benefits of the proliferation of independent, -personal-style machines were partially offset by the communication and -organizational penalties they imposed. Collaborative work and dissemination of -information became more difficult now that the previously unified file system -was fragmented among hundreds of autonomous machines. - - \section sec2-2 Section 2.2: Distributed File Systems - -\par -As a response to the situation outlined above, the notion of a distributed file -system (DFS) was developed. Basically, a DFS provides a framework in which -access to files is permitted regardless of their locations. Specifically, a -distributed file system offers a single, common set of file system operations -through which those accesses are performed. -\par -There are two major variations on the core DFS concept, classified according to -the way in which file storage is managed. These high-level models are defined -below. -\li Peer-to-peer: In this symmetrical model, each participating machine -provides storage for a specific set of files on its own attached disk(s), and -allows others to access them remotely. Thus, each node in the DFS is capable of -both importing files (making reference to files resident on foreign machines) -and exporting files (allowing other machines to reference files located -locally). -\li Server-client: In this model, a set of machines designated as servers -provide the storage for all of the files in the DFS. All other machines, known -as clients, must direct their file references to these machines. Thus, servers -are the sole exporters of files in the DFS, and clients are the sole importers. - -\par -The notion of a DFS, whether organized using the peer-to-peer or server-client -discipline, may be used as a conceptual base upon which the advantages of -personal computing resources can be combined with the single-system benefits of -classical timeshared operation. -\par -Many distributed file systems have been designed and deployed, operating on the -fast local area networks available to connect machines within a single site. -These systems include DOMAIN [9], DS [15], RFS [16], and Sprite [10].
Perhaps -the most widespread of distributed file systems to date is a product from Sun -Microsystems, NFS [13] [14], extending the popular unix file system so that it -operates over local networks. - - \section sec2-3 Section 2.3: Wide-Area Distributed File Systems - -\par -Improvements in long-haul network technology are allowing for faster -interconnection bandwidths and smaller latencies between distant sites. -Backbone services have been set up across the country, and T1 (1.5 -megabit/second) links are increasingly available to a larger number of -locations. Long-distance channels are still at best approximately an order of -magnitude slower than the typical local area network, and often two orders of -magnitude slower. The narrowed difference between local-area and wide-area data -paths opens the window for the notion of a wide-area distributed file system -(WADFS). In a WADFS, the transparency of file access offered by a local-area -DFS is extended to cover machines across much larger distances. Wide-area file -system functionality facilitates collaborative work and dissemination of -information in this larger theater of operation. - - \page chap3 Chapter 3: AFS-3 Design Goals - - \section sec3-1 Section 3.1: Introduction - -\par -This chapter describes the goals for the AFS-3 system, the first commercial -WADFS in existence. -\par -The original AFS goals have been extended over the history of the project. The -initial AFS concept was intended to provide a single distributed file system -facility capable of supporting the computing needs of Carnegie Mellon -University, a community of roughly 10,000 people. It was expected that most CMU -users either had their own workstation-class machine on which to work, or had -access to such machines located in public clusters. After being successfully -implemented, deployed, and tuned in this capacity, it was recognized that the -basic design could be augmented to link autonomous AFS installations located -within the greater CMU campus. As described in Section 2.3, the long-haul -networking environment developed to a point where it was feasible to further -extend AFS so that it provided wide-area file service. The underlying AFS -communication component was adapted to better handle the widely-varying channel -characteristics encountered by intra-site and inter-site operations. -\par -A more detailed history of AFS evolution may be found in [3] and [18]. - - \section sec3-2 Section 3.2: System Goals - -\par -At a high level, the AFS designers chose to extend the single-machine unix -computing environment into a WADFS service. The unix system, in all of its -numerous incarnations, is an important computing standard, and is in very wide -use. Since AFS was originally intended to service the heavily unix-oriented CMU -campus, this decision served an important tactical purpose along with its -strategic ramifications. -\par -In addition, the server-client discipline described in Section 2.2 was chosen -as the organizational base for AFS. This provides the notion of a central file -store serving as the primary residence for files within a given organization. -These centrally-stored files are maintained by server machines and are made -accessible to computers running the AFS client software. -\par -Listed in the following sections are the primary goals for the AFS system. -Chapter 4 examines how the AFS design decisions, concepts, and implementation -meet this list of goals.
- - \subsection sec3-2-1 Section 3.2.1: Scale - -\par -AFS differs from other existing DFSs in that it has the specific goal of -supporting a very large user community with a small number of server machines. -Unlike the rule-of-thumb ratio of approximately 20 client machines for every -server machine (20:1) used by Sun Microsystems' widespread NFS distributed file -system, the AFS architecture aims at smoothly supporting client/server ratios -more along the lines of 200:1 within a single installation. In addition to -providing a DFS covering a single organization with tens of thousands of users, -AFS also aims at allowing thousands of independent, autonomous organizations to -join in the single, shared name space (see Section 3.2.2 below) without a -centralized control or coordination point. Thus, AFS envisions supporting the -file system needs of tens of millions of users at interconnected yet autonomous -sites. - - \subsection sec3-2-2 Section 3.2.2: Name Space - -\par -One of the primary strengths of the timesharing computing environment is the -fact that it implements a single name space for all files in the system. Users -can walk up to any terminal connected to a timesharing service and refer to its -files by the identical name. This greatly encourages collaborative work and -dissemination of information, as everyone has a common frame of reference. One -of the major AFS goals is the extension of this concept to a WADFS. Users -should be able to walk up to any machine acting as an AFS client, anywhere in -the world, and use the identical file name to refer to a given object. -\par -In addition to the common name space, it was also an explicit goal for AFS to -provide complete access transparency and location transparency for its files. -Access transparency is defined as the system's ability to use a single -mechanism to operate on a file, regardless of its location, local or remote. -Location transparency is defined as the inability to determine a file's -location from its name. A system offering location transparency may also -provide transparent file mobility, relocating files between server machines -without visible effect to the naming system. - - \subsection sec3-2-3 Section 3.2.3: Performance - -\par -Good system performance is a critical AFS goal, especially given the scale, -client-server ratio, and connectivity specifications described above. The AFS -architecture aims at providing file access characteristics which, on average, -are similar to those of local disk performance. - - \subsection sec3-2-4 Section 3.2.4: Security - -\par -A production WADFS, especially one which allows and encourages transparent file -access between different administrative domains, must be extremely conscious of -security issues. AFS assumes that server machines are "trusted" within their -own administrative domain, being kept behind locked doors and only directly -manipulated by reliable administrative personnel. On the other hand, AFS client -machines are assumed to exist in inherently insecure environments, such as -offices and dorm rooms. These client machines are recognized to be -unsupervisable, and fully accessible to their users. This situation makes AFS -servers open to attacks mounted by possibly modified client hardware, firmware, -operating systems, and application software. In addition, while an organization -may actively enforce the physical security of its own file servers to its -satisfaction, other organizations may be lax in comparison.
It is important to -partition the system's security mechanism so that a security breach in one -administrative domain does not allow unauthorized access to the facilities of -other autonomous domains. -\par -The AFS system is targeted to provide confidence in the ability to protect -system data from unauthorized access in the above environment, where untrusted -client hardware and software may attempt to perform direct remote file -operations from anywhere in the world, and where levels of physical security at -remote sites may not meet the standards of other sites. - - \subsection sec3-2-5 Section 3.2.5: Access Control - -\par -The standard unix access control mechanism associates mode bits with every file -and directory, applying them based on the user's numerical identifier and the -user's membership in various groups. This mechanism was considered too -coarse-grained by the AFS designers. It was seen as insufficient for specifying -the exact set of individuals and groups which may properly access any given -file, as well as the operations these principals may perform. The unix group -mechanism was also considered too coarse and inflexible. AFS was designed to -provide more flexible and finer-grained control of file access, improving the -ability to define the set of parties which may operate on files, and what their -specific access rights are. - - \subsection sec3-2-6 Section 3.2.6: Reliability - -\par -The crash of a server machine in any distributed file system causes the -information it hosts to become unavailable to the user community. The same -effect is observed when server and client machines are isolated across a -network partition. Given the potential size of the AFS user community, a single -server crash could potentially deny service to a very large number of people. -The AFS design reflects a desire to minimize the visibility and impact of these -inevitable server crashes. - - \subsection sec3-2-7 Section 3.2.7: Administrability - -\par -Driven once again by the projected scale of AFS operation, one of the system's -goals is to offer easy administrability. With the large projected user -population, the amount of file data expected to be resident in the shared file -store, and the number of machines in the environment, a WADFS could easily -become impossible to administer unless its design allowed for easy monitoring -and manipulation of system resources. It is also imperative to be able to apply -security and access control mechanisms to the administrative interface. - - \subsection sec3-2-8 Section 3.2.8: Interoperability/Coexistence - -\par -Many organizations currently employ other distributed file systems, most -notably Sun Microsystems' NFS, which is also an extension of the basic -single-machine unix system. It is unlikely that AFS will receive significant -use if it cannot operate concurrently with other DFSs without mutual -interference. Thus, coexistence with other DFSs is an explicit AFS goal. -\par -A related goal is to provide a way for other DFSs to interoperate with AFS to -various degrees, allowing AFS file operations to be executed from these -competing systems. This is advantageous, since it may extend the set of -machines which are capable of interacting with the AFS community. Hardware -platforms and/or operating systems to which AFS is not ported may thus be able -to use their native DFS system to perform AFS file references.
-\par -These two goals serve to extend AFS coverage, and to provide a migration path -by which potential clients may sample AFS capabilities, and gain experience -with AFS. This may result in data migration into native AFS systems, or the -impetus to acquire a native AFS implementation. - - \subsection sec3-2-9 Section 3.2.9: Heterogeneity/Portability - -\par -It is important for AFS to operate on a large number of hardware platforms and -operating systems, since a large community of unrelated organizations will most -likely utilize a wide variety of computing environments. The size of the -potential AFS user community will be unduly restricted if AFS executes on a -small number of platforms. Not only must AFS support a largely heterogeneous -computing base, it must also be designed to be easily portable to new hardware -and software releases in order to maintain this coverage over time. - - \page chap4 Chapter 4: AFS High-Level Design - - \section sec4-1 Section 4.1: Introduction - -\par -This chapter presents an overview of the system architecture for the AFS-3 -WADFS. Different treatments of the AFS system may be found in several -documents, including [3], [4], [5], and [2]. Certain system features discussed -here are examined in more detail in the set of accompanying AFS programmer -specification documents. -\par -After the architectural overview, the system goals enumerated in Chapter 3 are -revisited, and the contribution of the various AFS design decisions and -resulting features is noted. - - \section sec4-2 Section 4.2: The AFS System Architecture - - \subsection sec4-2-1 Section 4.2.1: Basic Organization - -\par -As stated in Section 3.2, a server-client organization was chosen for the AFS -system. A group of trusted server machines provides the primary disk space for -the central store managed by the organization controlling the servers. File -system operation requests for specific files and directories arrive at server -machines from machines running the AFS client software. If the client is -authorized to perform the operation, then the server proceeds to execute it. -\par -In addition to this basic file access functionality, AFS server machines also -provide related system services. These include authentication service, mapping -between printable and numerical user identifiers, file location service, time -service, and such administrative operations as disk management, system -reconfiguration, and tape backup. - - \subsection sec4-2-2 Section 4.2.2: Volumes - - \subsubsection sec4-2-2-1 Section 4.2.2.1: Definition - -\par -Disk partitions used for AFS storage do not directly host individual user files -and directories. Rather, connected subtrees of the system's directory structure -are placed into containers called volumes. Volumes vary in size dynamically as -the objects they house are inserted, overwritten, and deleted. Each volume has -an associated quota, or maximum permissible storage. A single unix disk -partition may thus host one or more volumes, and in fact may host as many -volumes as physically fit in the storage space. However, the practical maximum -is currently 3,500 volumes per disk partition. This limitation is imposed by -the salvager program, which examines and repairs file system metadata -structures. -\par -There are two ways to identify an AFS volume. The first option is a 32-bit -numerical value called the volume ID. The second is a human-readable character -string called the volume name.
-\par -Internally, a volume is organized as an array of mutable objects, representing -individual files and directories. The file system object associated with each -index in this internal array is assigned a uniquifier and a data version -number. A subset of these values is used to compose an AFS file identifier, or -FID. FIDs are not normally visible to user applications, but rather are used -internally by AFS. They consist of ordered triplets, whose components are the -volume ID, the index within the volume, and the uniquifier for the index. -\par -To understand AFS FIDs, let us consider the case where index i in volume v -refers to a file named example.txt. This file's uniquifier is currently set to -one (1), and its data version number is currently set to zero (0). The AFS -client software may then refer to this file with the following FID: (v, i, 1). -The next time a client overwrites the object identified with the (v, i, 1) FID, -the data version number for example.txt will be promoted to one (1). Thus, the -data version number serves to distinguish between different versions of the -same file. A higher data version number indicates a newer version of the file. -\par -Consider the result of deleting file (v, i, 1). This causes the body of -example.txt to be discarded, and marks index i in volume v as unused. Should -another program create a file, say a.out, within this volume, index i may be -reused. If it is, the creation operation will bump the index's uniquifier to -two (2), and the data version number is reset to zero (0). Any client caching a -FID for the deleted example.txt file thus cannot affect the completely -unrelated a.out file, since the uniquifiers differ. - - \subsubsection sec4-2-2-2 Section 4.2.2.2: Attachment - -\par -The connected subtrees contained within individual volumes are attached to -their proper places in the file space defined by a site, forming a single, -apparently seamless unix tree. These attachment points are called mount points. -These mount points are persistent file system objects, implemented as symbolic -links whose contents obey a stylized format. Thus, AFS mount points differ from -NFS-style mounts. In the NFS environment, the user dynamically mounts entire -remote disk partitions using any desired name. These mounts do not survive -client restarts, and do not insure a uniform namespace between different -machines. -\par -A single volume is chosen as the root of the AFS file space for a given -organization. By convention, this volume is named root.afs. Each client machine -belonging to this organization performs a unix mount() of this root volume (not -to be confused with an AFS mount point) on its empty /afs directory, thus -attaching the entire AFS name space at this point. - - \subsubsection sec4-2-2-3 Section 4.2.2.3: Administrative Uses - -\par -Volumes serve as the administrative unit for AFS file system data, providing -the basis for replication, relocation, and backup operations. - - \subsubsection sec4-2-2-4 Section 4.2.2.4: Replication - -Read-only snapshots of AFS volumes may be created by administrative personnel. -These clones may be deployed on up to eight disk partitions, on the same server -machine or across different servers. Each clone has the identical volume ID, -which must differ from its read-write parent. Thus, at most one clone of any -given volume v may reside on a given disk partition. File references to this -read-only clone volume may be serviced by any of the servers which host a copy.
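The FID scheme described in Section 4.2.2.1 can be summarized with a small sketch. The structure and helpers below are purely illustrative; the field names and functions are not the actual OpenAFS declarations:

    /* Illustrative sketch only: names are hypothetical, not the real
     * OpenAFS definitions. */
    #include <stdint.h>

    struct example_fid {
        uint32_t volume;      /* volume ID (v) */
        uint32_t vnode;       /* index within the volume (i) */
        uint32_t uniquifier;  /* bumped whenever the index slot is reused */
    };

    /* A cached FID still names the same object only if all three parts
     * match; a reused slot carries a new uniquifier, so a stale FID for a
     * deleted file (e.g. example.txt) cannot alias its replacement. */
    static int same_object(const struct example_fid *a,
                           const struct example_fid *b)
    {
        return a->volume == b->volume &&
               a->vnode == b->vnode &&
               a->uniquifier == b->uniquifier;
    }

    /* A higher data version number indicates a newer version of the file. */
    static int cached_copy_is_stale(uint32_t cached_dv, uint32_t server_dv)
    {
        return server_dv > cached_dv;
    }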
- - \subsubsection sec4-2-2-5 Section 4.2.2.5: Backup - -\par -Volumes serve as the unit of tape backup and restore operations. Backups are -accomplished by first creating an on-line backup volume for each volume to be -archived. This backup volume is organized as a copy-on-write shadow of the -original volume, capturing the volume's state at the instant that the backup -took place. Thus, the backup volume may be envisioned as being composed of a -set of object pointers back to the original image. The first update operation -on the file located in index i of the original volume triggers the -copy-on-write association. This causes the file's contents at the time of the -snapshot to be physically written to the backup volume before the newer version -of the file is stored in the parent volume. -\par -Thus, AFS on-line backup volumes typically consume little disk space. On -average, they are composed mostly of links and to a lesser extent the bodies of -those few files which have been modified since the last backup took place. -Also, the system does not have to be shut down to insure the integrity of the -backup images. Dumps are generated from the unchanging backup volumes, and are -transferred to tape at any convenient time before the next backup snapshot is -performed. - - \subsubsection sec4-2-2-6 Section 4.2.2.6: Relocation - -\par -Volumes may be moved transparently between disk partitions on a given file -server, or between different file server machines. The transparency of volume -motion comes from the fact that neither the user-visible names for the files -nor the internal AFS FIDs contain server-specific location information. -\par -Interruption to file service while a volume move is being executed is typically -on the order of a few seconds, regardless of the amount of data contained -within the volume. This derives from the staged algorithm used to move a volume -to a new server. First, a dump is taken of the volume's contents, and this -image is installed at the new site. The second stage involves actually locking -the original volume, taking an incremental dump to capture file updates since -the first stage. The third stage installs the changes at the new site, and the -fourth stage deletes the original volume. Further references to this volume -will resolve to its new location. - - \subsection sec4-2-3 Section 4.2.3: Authentication - -\par -AFS uses the Kerberos [22] [23] authentication system developed at MIT's -Project Athena to provide reliable identification of the principals attempting -to operate on the files in its central store. Kerberos provides for mutual -authentication, not only assuring AFS servers that they are interacting with -the stated user, but also assuring AFS clients that they are dealing with the -proper server entities and not imposters. Authentication information is -mediated through the use of tickets. Clients register passwords with the -authentication system, and use those passwords during authentication sessions -to secure these tickets. A ticket is an object which contains an encrypted -version of the user's name and other information. The file server machines may -request a caller to present their ticket in the course of a file system -operation. If the file server can successfully decrypt the ticket, then it -knows that it was created and delivered by the authentication system, and may -trust that the caller is the party identified within the ticket.
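The ticket check described in the preceding paragraph amounts to a simple decision at the file server, sketched below. All of the types and helper routines here are hypothetical placeholders rather than the real Kerberos or AFS interfaces:

    /* Schematic sketch of the server-side ticket check; not real APIs. */
    struct ticket_blob { unsigned char data[128]; };  /* opaque encrypted ticket */
    struct identity    { char name[64]; long expires; };

    /* Hypothetical stand-ins for the authentication machinery. */
    extern int  decrypt_with_service_key(const struct ticket_blob *t,
                                         struct identity *out);
    extern long current_time(void);

    /* Returns 1 if the caller may be trusted as out->name, 0 otherwise. */
    static int check_ticket(const struct ticket_blob *t, struct identity *out)
    {
        if (!decrypt_with_service_key(t, out))
            return 0;        /* not minted by the authentication service */
        if (out->expires <= current_time())
            return 0;        /* ticket has expired */
        return 1;            /* trust the identity named inside the ticket */
    }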
-\par -Such subjects as mutual authentication, encryption and decryption, and the use -of session keys are complex ones. Readers are directed to the above references -for a complete treatment of Kerberos-based authentication. - - \subsection sec4-2-4 Section 4.2.4: Authorization - - \subsubsection sec4-2-4-1 Section 4.2.4.1: Access Control Lists - -\par -AFS implements per-directory Access Control Lists (ACLs) to improve the ability -to specify which sets of users have access to the files within the directory, -and which operations they may perform. ACLs are used in addition to the -standard unix mode bits. ACLs are organized as lists of one or more (principal, -rights) pairs. A principal may be either the name of an individual user or a -group of individual users. There are seven expressible rights, as listed below. -\li Read (r): The ability to read the contents of the files in a directory. -\li Lookup (l): The ability to look up names in a directory. -\li Write (w): The ability to create new files and overwrite the contents of -existing files in a directory. -\li Insert (i): The ability to insert new files in a directory, but not to -overwrite existing files. -\li Delete (d): The ability to delete files in a directory. -\li Lock (k): The ability to acquire and release advisory locks on a given -directory. -\li Administer (a): The ability to change a directory's ACL. - - \subsubsection sec4-2-4-2 Section 4.2.4.2: AFS Groups - -\par -AFS users may create a certain number of groups, differing from the standard -unix notion of group. These AFS groups are objects that may be placed on ACLs, -and simply contain a list of AFS user names that are to be treated identically -for authorization purposes. For example, user erz may create a group called -erz:friends consisting of the kazar, vasilis, and mason users. Should erz wish -to grant read, lookup, and insert rights to this group in directory d, he -should create an entry reading (erz:friends, rli) in d's ACL. -\par -AFS offers three special, built-in groups, as described below. -\par -1. system:anyuser: Any individual who accesses AFS files is considered by the -system to be a member of this group, whether or not they hold an authentication -ticket. This group is unusual in that it doesn't have a stable membership. In -fact, it doesn't have an explicit list of members. Instead, the system:anyuser -"membership" grows and shrinks as file accesses occur, with users being -(conceptually) added and deleted automatically as they interact with the -system. -\par -The system:anyuser group is typically put on the ACL of those directories for -which some specific level of completely public access is desired, covering any -user at any AFS site. -\par -2. system:authuser: Any individual in possession of a valid Kerberos ticket -minted by the organization's authentication service is treated as a member of -this group. Just as with system:anyuser, this special group does not have a -stable membership. If a user acquires a ticket from the authentication service, -they are automatically "added" to the group. If the ticket expires or is -discarded by the user, then the given individual will automatically be -"removed" from the group. -\par -The system:authuser group is usually put on the ACL of those directories for -which some specific level of intra-site access is desired. Anyone holding a -valid ticket within the organization will be allowed to perform the set of -accesses specified by the ACL entry, regardless of their precise individual ID. -\par -3.
system:administrators: This built-in group defines the set of users capable -of performing certain important administrative operations within the cell. -Members of this group have explicit 'a' (ACL administration) rights on every -directory's ACL in the organization. Members of this group are the only ones -which may legally issue administrative commands to the file server machines -within the organization. This group is not like the other two described above -in that it does have a stable membership, where individuals are added and -deleted from the group explicitly. -\par -The system:administrators group is typically put on the ACL of those -directories which contain sensitive administrative information, or on those -places where only administrators are allowed to make changes. All members of -this group have implicit rights to change the ACL on any AFS directory within -their organization. Thus, they don't have to actually appear on an ACL, or have -'a' rights enabled in their ACL entry if they do appear, to be able to modify -the ACL. - - \subsection sec4-2-5 Section 4.2.5: Cells - -\par -A cell is the set of server and client machines managed and operated by an -administratively independent organization, as fully described in the original -proposal [17] and specification [18] documents. The cell's administrators make -decisions concerning such issues as server deployment and configuration, user -backup schedules, and replication strategies on their own hardware and disk -storage completely independently from those implemented by other cell -administrators regarding their own domains. Every client machine belongs to -exactly one cell, and uses that information to determine where to obtain -default system resources and services. -\par -The cell concept allows autonomous sites to retain full administrative control -over their facilities while allowing them to collaborate in the establishment -of a single, common name space composed of the union of their individual name -spaces. By convention, any file name beginning with /afs is part of this shared -global name space and can be used at any AFS-capable machine. The original -mount point concept was modified to contain cell information, allowing volumes -housed in foreign cells to be mounted in the file space. Again by convention, -the top-level /afs directory contains a mount point to the root.cell volume for -each cell in the AFS community, attaching their individual file spaces. Thus, -the top of the data tree managed by cell xyz is represented by the /afs/xyz -directory. -\par -Creating a new AFS cell is straightforward, with the operation taking three -basic steps: -\par -1. Name selection: A prospective site has to first select a unique name for -itself. Cell name selection is inspired by the hierarchical Domain naming -system. Domain-style names are designed to be assignable in a completely -decentralized fashion. Example cell names are transarc.com, ssc.gov, and -umich.edu. These names correspond to the AFS installations at Transarc -Corporation in Pittsburgh, PA, the Superconducting Supercollider Lab in Dallas, -TX, and the University of Michigan at Ann Arbor, MI, respectively. -\par -2. Server installation: Once a cell name has been chosen, the site must bring -up one or more AFS file server machines, creating a local file space and a -suite of local services, including authentication (Section 4.2.6.4) and volume -location (Section 4.2.6.2). -\par -3.
Advertise services: In order for other cells to discover the presence of the -new site, it must advertise its name and which of its machines provide basic -AFS services such as authentication and volume location. An established site -may then record the machines providing AFS system services for the new cell, -and then set up its mount point under /afs. By convention, each cell places the -top of its file tree in a volume named root.cell. - - \subsection sec4-2-6 Section 4.2.6: Implementation of Server -Functionality - -\par -AFS server functionality is implemented by a set of user-level processes which -execute on server machines. This section examines the role of each of these -processes. - - \subsubsection sec4-2-6-1 Section 4.2.6.1: File Server - -\par -This AFS entity is responsible for providing a central disk repository for a -particular set of files within volumes, and for making these files accessible -to properly-authorized users running on client machines. - - \subsubsection sec4-2-6-2 Section 4.2.6.2: Volume Location Server - -\par -The Volume Location Server maintains and exports the Volume Location Database -(VLDB). This database tracks the server or set of servers on which volume -instances reside. Among the operations it supports are queries returning volume -location and status information, volume ID management, and creation, deletion, -and modification of VLDB entries. -\par -The VLDB may be replicated to two or more server machines for availability and -load-sharing reasons. A Volume Location Server process executes on each server -machine on which a copy of the VLDB resides, managing that copy. - - \subsubsection sec4-2-6-3 Section 4.2.6.3: Volume Server - -\par -The Volume Server allows administrative tasks and probes to be performed on the -set of AFS volumes residing on the machine on which it is running. These -operations include volume creation and deletion, renaming volumes, dumping and -restoring volumes, altering the list of replication sites for a read-only -volume, creating and propagating a new read-only volume image, creation and -update of backup volumes, listing all volumes on a partition, and examining -volume status. - - \subsubsection sec4-2-6-4 Section 4.2.6.4: Authentication Server - -\par -The AFS Authentication Server maintains and exports the Authentication Database -(ADB). This database tracks the encrypted passwords of the cell's users. The -Authentication Server interface allows operations that manipulate ADB entries. -It also implements the Kerberos mutual authentication protocol, supplying the -appropriate identification tickets to successful callers. -\par -The ADB may be replicated to two or more server machines for availability and -load-sharing reasons. An Authentication Server process executes on each server -machine on which a copy of the ADB resides, managing that copy. - - \subsubsection sec4-2-6-5 Section 4.2.6.5: Protection Server - -\par -The Protection Server maintains and exports the Protection Database (PDB), -which maps between printable user and group names and their internal numerical -AFS identifiers. The Protection Server also allows callers to create, destroy, -query ownership and membership, and generally manipulate AFS user and group -records. -\par -The PDB may be replicated to two or more server machines for availability and -load-sharing reasons. A Protection Server process executes on each server -machine on which a copy of the PDB resides, managing that copy.
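The VLDB role described in Section 4.2.6.2 is essentially a mapping from a volume ID to the servers currently hosting that volume. A minimal sketch follows; the record layout and names are invented for illustration and do not reflect the actual VLDB format:

    /* Illustrative VLDB-style lookup: volume ID -> hosting servers.
     * Layout and names are hypothetical, not the real VLDB records. */
    #include <stdint.h>
    #include <stddef.h>

    #define MAX_SITES 8   /* read-only clones may live on up to eight partitions */

    struct vldb_entry_sketch {
        uint32_t volume_id;
        int      nsites;
        uint32_t server_addr[MAX_SITES];   /* addresses of hosting file servers */
    };

    /* A linear scan stands in for the real database query. */
    static const struct vldb_entry_sketch *
    lookup_volume(const struct vldb_entry_sketch *tbl, size_t n, uint32_t vol)
    {
        for (size_t i = 0; i < n; i++)
            if (tbl[i].volume_id == vol)
                return &tbl[i];
        return NULL;    /* unknown volume */
    }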
- - \subsubsection sec4-2-6-6 Section 4.2.6.6: BOS Server - -\par -The BOS Server is an administrative tool which runs on each file server machine -in a cell. This server is responsible for monitoring the health of the AFS -agent processes on that machine. The BOS Server brings up the chosen set of -AFS agents in the proper order after a system reboot, answers requests as to -their status, and restarts them when they fail. It also accepts commands to -start, suspend, or resume these processes, and install new server binaries. - - \subsubsection sec4-2-6-7 Section 4.2.6.7: Update Server/Client - -\par -The Update Server and Update Client programs are used to distribute important -system files and server binaries. For example, consider the case of -distributing a new File Server binary to the set of Sparcstation server -machines in a cell. One of the Sparcstation servers is declared to be the -distribution point for its machine class, and is configured to run an Update -Server. The new binary is installed in the appropriate local directory on that -Sparcstation distribution point. Each of the other Sparcstation servers runs an -Update Client instance, which periodically polls the proper Update Server. The -new File Server binary will be detected and copied over to the client. Thus, -new server binaries need only be installed manually once per machine type, and -the distribution to like server machines will occur automatically. - - \subsection sec4-2-7 Section 4.2.7: Implementation of Client -Functionality - - \subsubsection sec4-2-7-1 Section 4.2.7.1: Introduction - -\par -The portion of the AFS WADFS which runs on each client machine is called the -Cache Manager. This code, running within the client's kernel, is a user's -representative in communicating and interacting with the File Servers. The -Cache Manager's primary responsibility is to create the illusion that the -remote AFS file store resides on the client machine's local disk(s). -\par -As implied by its name, the Cache Manager supports this illusion by maintaining -a cache of files referenced from the central AFS store on the machine's local -disk. All file operations executed by client application programs on files -within the AFS name space are handled by the Cache Manager and are realized on -these cached images. Client-side AFS references are directed to the Cache -Manager via the standard VFS and vnode file system interfaces pioneered and -advanced by Sun Microsystems [21]. The Cache Manager stores and fetches files -to and from the shared AFS repository as necessary to satisfy these operations. -It is responsible for parsing unix pathnames on open() operations and mapping -each component of the name to the File Server or group of File Servers that -house the matching directory or file. -\par -The Cache Manager has additional responsibilities. It also serves as a reliable -repository for the user's authentication information, holding on to their -tickets and wielding them as necessary when challenged during File Server -interactions. It caches volume location information gathered from probes to the -VLDB, and keeps the client machine's local clock synchronized with a reliable -time source. - - \subsubsection sec4-2-7-2 Section 4.2.7.2: Chunked Access - -\par -In previous AFS incarnations, whole-file caching was performed. Whenever an AFS -file was referenced, the entire contents of the file were stored on the -client's local disk. This approach had several disadvantages.
One problem was -that no file larger than the amount of disk space allocated to the client's -local cache could be accessed. -\par -AFS-3 supports chunked file access, allowing individual 64 kilobyte pieces to -be fetched and stored. Chunking allows AFS files of any size to be accessed -from a client. The chunk size is settable at each client machine, but the -default chunk size of 64K was chosen so that most unix files would fit within a -single chunk. - - \subsubsection sec4-2-7-3 Section 4.2.7.3: Cache Management - -\par -The use of a file cache by the AFS client-side code, as described above, raises -the thorny issue of cache consistency. Each client must efficiently determine -whether its cached file chunks are identical to the corresponding sections of -the file as stored at the server machine before allowing a user to operate on -those chunks. -\par -AFS employs the notion of a callback as the backbone of its cache consistency -algorithm. When a server machine delivers one or more chunks of a file to a -client, it also includes a callback "promise" that the client will be notified -if any modifications are made to the data in the file at the server. Thus, as -long as the client machine is in possession of a callback for a file, it knows -it is correctly synchronized with the centrally-stored version, and allows its -users to operate on it as desired without any further interaction with the -server. Before a file server stores a more recent version of a file on its own -disks, it will first break all outstanding callbacks on this item. A callback -will eventually time out, even if there are no changes to the file or directory -it covers. - - \subsection sec4-2-8 Section 4.2.8: Communication Substrate: Rx - -\par -All AFS system agents employ remote procedure call (RPC) interfaces. Thus, -servers may be queried and operated upon regardless of their location. -\par -The Rx RPC package is used by all AFS agents to provide a high-performance, -multi-threaded, and secure communication mechanism. The Rx protocol is -adaptive, conforming itself to widely varying network communication media -encountered by a WADFS. It allows user applications to define and insert their -own security modules, allowing them to execute the precise end-to-end -authentication algorithms required to suit their specific needs and goals. Rx -offers two built-in security modules. The first is the null module, which does -not perform any encryption or authentication checks. The second built-in -security module is rxkad, which utilizes Kerberos authentication. -\par -Although pervasive throughout the AFS distributed file system, all of its -agents, and many of its standard application programs, Rx is entirely separable -from AFS and does not depend on any of its features. In fact, Rx can be used to -build applications engaging in RPC-style communication under a variety of -unix-style file systems. There are in-kernel and user-space implementations of -the Rx facility, with both sharing the same interface. - - \subsection sec4-2-9 Section 4.2.9: Database Replication: ubik - -\par -The three AFS system databases (VLDB, ADB, and PDB) may be replicated to -multiple server machines to improve their availability and share access loads -among the replication sites. The ubik replication package is used to implement -this functionality. A full description of ubik and of the quorum completion -algorithm it implements may be found in [19] and [20].
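The interplay of 64-kilobyte chunking (Section 4.2.7.2) and callbacks (Section 4.2.7.3) can be sketched as follows. This is a simplified illustration, not the Cache Manager's actual logic:

    /* Simplified illustration of chunked access plus callback-based
     * consistency; names are hypothetical. */
    #include <stdint.h>

    #define CHUNK_SHIFT 16              /* default chunk size: 64 KB */
    #define CHUNK_SIZE  (1u << CHUNK_SHIFT)

    struct cached_file_sketch {
        int      callback_valid;        /* server promise still outstanding? */
        uint32_t data_version;          /* version of the cached chunks */
    };

    /* Which chunk holds a given byte offset. */
    static uint32_t chunk_of(uint64_t offset)
    {
        return (uint32_t)(offset >> CHUNK_SHIFT);
    }

    /* A cached chunk may be used without contacting the File Server as long
     * as the callback has not been broken or timed out; otherwise the client
     * must revalidate, refetching if the server's data version is newer. */
    static int may_use_cached_chunk(const struct cached_file_sketch *f,
                                    uint32_t server_data_version)
    {
        if (f->callback_valid)
            return 1;
        return f->data_version == server_data_version;
    }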
-\par -The basic abstraction provided by ubik is that of a disk file replicated to -multiple server locations. One machine is considered to be the synchronization -site, handling all write operations on the database file. Read operations may -be directed to any of the active members of the quorum, namely a subset of the -replication sites large enough to insure integrity across such failures as -individual server crashes and network partitions. All of the quorum members -participate in regular elections to determine the current synchronization site. -The ubik algorithms allow server machines to enter and exit the quorum in an -orderly and consistent fashion. -\par -All operations to one of these replicated "abstract files" are performed as -part of a transaction. If all the related operations performed under a -transaction are successful, then the transaction is committed, and the changes -are made permanent. Otherwise, the transaction is aborted, and all of the -operations for that transaction are undone. -\par -Like Rx, the ubik facility may be used by client applications directly. Thus, -user applications may easily implement the notion of a replicated disk file in -this fashion. - - \subsection sec4-2-10 Section 4.2.10: System Management - -\par -There are several AFS features aimed at facilitating system management. Some of -these features have already been mentioned, such as volumes, the BOS Server, -and the pervasive use of secure RPCs throughout the system to perform -administrative operations from any AFS client machine in the worldwide -community. This section covers additional AFS features and tools that assist in -making the system easier to manage. - - \subsubsection sec4-2-10-1 Section 4.2.10.1: Intelligent Access -Programs - -\par -A set of intelligent user-level applications were written so that the AFS -system agents could be more easily queried and controlled. These programs -accept user input, then translate the caller's instructions into the proper -RPCs to the responsible AFS system agents, in the proper order. -\par -An example of this class of AFS application programs is vos, which mediates -access to the Volume Server and the Volume Location Server agents. Consider the -vos move operation, which results in a given volume being moved from one site -to another. The Volume Server does not support a complex operation like a -volume move directly. In fact, this move operation involves the Volume Servers -at the current and new machines, as well as the Volume Location Server, which -tracks volume locations. Volume moves are accomplished by a combination of full -and incremental volume dump and restore operations, and a VLDB update. The vos -move command issues the necessary RPCs in the proper order, and attempts to -recover from errors at each of the steps. -\par -The end result is that the AFS interface presented to system administrators is -much simpler and more powerful than that offered by the raw RPC interfaces -themselves. The learning curve for administrative personnel is thus flattened. -Also, automatic execution of complex system operations is more likely to be -successful, free from human error. - - \subsubsection sec4-2-10-2 Section 4.2.10.2: Monitoring Interfaces - -\par -The various AFS agent RPC interfaces provide calls which allow for the -collection of system status and performance data. This data may be displayed by -such programs as scout, which graphically depicts File Server performance -numbers and disk utilizations.
Such monitoring capabilities allow for quick -detection of system problems. They also support detailed performance analyses, -which may indicate the need to reconfigure system resources. - - \subsubsection sec4-2-10-3 Section 4.2.10.3: Backup System - -\par -A special backup system has been designed and implemented for AFS, as described -in [6]. It is not sufficient to simply dump the contents of all File Server -partitions onto tape, since volumes are mobile, and need to be tracked -individually. The AFS backup system allows hierarchical dump schedules to be -built based on volume names. It generates the appropriate RPCs to create the -required backup volumes and to dump these snapshots to tape. A database is used -to track the backup status of system volumes, along with the set of tapes on -which backups reside. - - \subsection sec4-2-11 Section 4.2.11: Interoperability - -\par -Since the client portion of the AFS software is implemented as a standard -VFS/vnode file system object, AFS can be installed into client kernels and -utilized without interference with other VFS-style file systems, such as -vanilla unix and the NFS distributed file system. -\par -Certain machines either cannot or choose not to run the AFS client software -natively. If these machines run NFS, it is still possible to access AFS files -through a protocol translator. The NFS-AFS Translator may be run on any machine -at the given site that runs both NFS and the AFS Cache Manager. All of the NFS -machines that wish to access the AFS shared store proceed to NFS-mount the -translator's /afs directory. File references generated at the NFS-based -machines are received at the translator machine, which is acting in its -capacity as an NFS server. The file data is actually obtained when the -translator machine issues the corresponding AFS references in its role as an -AFS client. - - \section sec4-3 Section 4.3: Meeting AFS Goals - -\par -The AFS WADFS design, as described in this chapter, serves to meet the system -goals stated in Chapter 3. This section revisits each of these AFS goals, and -identifies the specific architectural constructs that bear on them. - - \subsection sec4-3-1 Section 4.3.1: Scale - -\par -To date, AFS has been deployed to over 140 sites world-wide, with approximately -60 of these cells visible on the public Internet. AFS sites are currently -operating in several European countries, in Japan, and in Australia. While many -sites are modest in size, certain cells contain more than 30,000 accounts. AFS -sites have realized client/server ratios in excess of the targeted 200:1. - - \subsection sec4-3-2 Section 4.3.2: Name Space - -\par -A single uniform name space has been constructed across all cells in the -greater AFS user community. Any pathname beginning with /afs may indeed be used -at any AFS client. A set of common conventions regarding the organization of -the top-level /afs directory and several directories below it have been -established. These conventions also assist in the location of certain per-cell -resources, such as AFS configuration files. -\par -Both access transparency and location transparency are supported by AFS, as -evidenced by the common access mechanisms and by the ability to transparently -relocate volumes. - - \subsection sec4-3-3 Section 4.3.3: Performance - -\par -AFS employs caching extensively at all levels to reduce the cost of "remote" -references. Measured data cache hit ratios are very high, often over 95%.
This
-indicates that the file images kept on local disk are very effective in
-satisfying the set of remote file references generated by clients. The
-introduction of file system callbacks has also been demonstrated to be very
-effective in the efficient implementation of cache synchronization. Replicating
-files and system databases across multiple server machines distributes load
-among the given servers. The Rx RPC subsystem has operated successfully at
-network speeds ranging from 19.2 kilobytes/second to experimental
-gigabit/second FDDI networks.
-\par
-Even at the intra-site level, AFS has been shown to deliver good performance,
-especially in high-load situations. One often-quoted study [1] compared the
-performance of an older version of AFS with that of NFS on a large file system
-task named the Andrew Benchmark. While NFS sometimes outperformed AFS at low
-load levels, its performance fell off rapidly at higher loads, while AFS
-performance was not significantly affected.
-
- \subsection sec4-3-4 Section 4.3.4: Security
-
-\par
-The use of Kerberos as the AFS authentication system fits the security goal
-nicely. Access to AFS files from untrusted client machines is predicated on the
-caller's possession of the appropriate Kerberos ticket(s). Setting up per-site,
-Kerberos-based authentication services compartmentalizes any security breach to
-the cell which was compromised. Since the Cache Manager will store multiple
-tickets for its users, they may take on different identities depending on the
-set of file servers being accessed.
-
- \subsection sec4-3-5 Section 4.3.5: Access Control
-
-\par
-AFS extends the standard unix authorization mechanism with per-directory Access
-Control Lists. These ACLs allow specific AFS principals and groups of these
-principals to be granted a wide variety of rights on the associated files.
-Users may create and manipulate AFS group entities without administrative
-assistance, and place these tailored groups on ACLs.
-
- \subsection sec4-3-6 Section 4.3.6: Reliability
-
-\par
-A subset of file server crashes is masked by the use of read-only replication
-on volumes containing slowly-changing files. Availability of important,
-frequently-used programs such as editors and compilers may thus be greatly
-improved. Since the level of replication may be chosen per volume, and easily
-changed, each site may decide the proper replication levels for certain
-programs and/or data.
-Similarly, replicated system databases help to maintain service in the face of
-server crashes and network partitions.
-
- \subsection sec4-3-7 Section 4.3.7: Administrability
-
-\par
-Such features as pervasive, secure RPC interfaces to all AFS system components,
-volumes, overseer processes for monitoring and management of file system
-agents, intelligent user-level access tools, interface routines providing
-performance and statistics information, and an automated backup service
-tailored to a volume-based environment all contribute to the administrability
-of the AFS system.
-
- \subsection sec4-3-8 Section 4.3.8: Interoperability/Coexistence
-
-\par
-Due to its VFS-style implementation, the AFS client code may be easily
-installed in the machine's kernel, and may service file requests without
-interfering in the operation of any other installed file system. Machines
-either not capable of running AFS natively or choosing not to do so may still
-access AFS files via NFS with the help of a protocol translator agent.
-
- \subsection sec4-3-9 Section 4.3.9: Heterogeneity/Portability
-
-\par
-As most modern kernels use a VFS-style interface to support their native file
-systems, AFS may usually be ported to a new hardware and/or software
-environment in a relatively straightforward fashion. Such ease of porting
-allows AFS to run on a wide variety of platforms.
-
- \page chap5 Chapter 5: Future AFS Design Refinements
-
- \section sec5-1 Section 5.1: Overview
-
-\par
-The current AFS WADFS design and implementation provides a high-performance,
-scalable, secure, and flexible computing environment. However, there is room
-for improvement on a variety of fronts. This chapter considers a set of topics,
-examining the shortcomings of the current AFS system and considering how
-additional functionality may be fruitfully constructed.
-\par
-Many of these areas are already being addressed in the next-generation AFS
-system which is being built as part of the Open Software Foundation's (OSF)
-Distributed Computing Environment [7] [8].
-
- \section sec5-2 Section 5.2: unix Semantics
-
-\par
-Any distributed file system which extends the unix file system model to include
-remote file accesses presents its application programs with failure modes which
-do not exist in a single-machine unix implementation. This semantic difference
-is difficult to mask.
-\par
-The current AFS design varies from pure unix semantics in other ways. In a
-single-machine unix environment, modifications made to an open file are
-immediately visible to other processes with open file descriptors to the same
-file. AFS does not reproduce this behavior when programs on different machines
-access the same file. Changes made to one cached copy of the file are not made
-immediately visible to other cached copies. The changes are only made visible
-to other access sites when a modified version of a file is stored back to the
-server providing its primary disk storage. Thus, one client's changes may be
-entirely overwritten by another client's modifications. The situation is
-further complicated by the possibility that dirty file chunks may be flushed
-out to the File Server before the file is closed.
-\par
-The version of AFS created for the OSF offering extends the current, untyped
-callback notion to a set of multiple, independent synchronization guarantees.
-These synchronization tokens allow functionality not offered by AFS-3,
-including byte-range mandatory locking, exclusive file opens, and read and
-write privileges over portions of a file.
-
- \section sec5-3 Section 5.3: Improved Name Space Management
-
-\par
-Discovery of new AFS cells and their integration into each existing cell's name
-space is a completely manual operation in the current system. As the rate of
-new cell creations increases, the load imposed on system administrators also
-increases. Also, representing each cell's file space entry as a mount point
-object in the /afs directory leads to a potential problem. As the number of
-entries in the /afs directory increases, search time through the directory also
-grows.
-\par
-One improvement to this situation is to implement the top-level /afs directory
-through a Domain-style database. The database would map cell names to the set
-of server machines providing authentication and volume location services for
-that cell. The Cache Manager would query the cell database in the course of
-pathname resolution, and cache its lookup results.
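This proposal is essentially what later AFS deployments adopted: a cell's
database servers can be advertised in the DNS through AFSDB (and, more
recently, SRV) records, and the client can be told to consult them during
pathname resolution. As a small, hedged illustration only -- the cell name
below is a placeholder, and real code would go on to parse the answer with
ns_initparse()/ns_parserr() -- a lookup of a cell's AFSDB record using the
standard resolver library might look like:

    #include <stdio.h>
    #include <netinet/in.h>
    #include <arpa/nameser.h>
    #include <resolv.h>

    /* Query the DNS for a cell's AFSDB record; link with -lresolv. */
    int
    main(void)
    {
        unsigned char answer[NS_PACKETSZ];
        int len;

        res_init();
        len = res_query("your.cell.name", ns_c_in, ns_t_afsdb,
                        answer, sizeof(answer));
        if (len < 0)
            printf("no AFSDB record found\n");
        else
            printf("received %d bytes of AFSDB answer data\n", len);
        return 0;
    }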
-\par
-In this database-style environment, adding a new cell entry under /afs is
-accomplished by creating the appropriate database entry. The new cell
-information is then immediately accessible to all AFS clients.
-
- \section sec5-4 Section 5.4: Read/Write Replication
-
-\par
-The AFS-3 servers and databases are currently equipped to handle read/only
-replication exclusively. However, other distributed file systems have
-demonstrated the feasibility of providing full read/write replication of data
-in environments very similar to AFS [11]. Such systems can serve as models for
-the set of required changes.
-
- \section sec5-5 Section 5.5: Disconnected Operation
-
-\par
-Several facilities are provided by AFS so that server failures and network
-partitions may be completely or partially masked. However, AFS does not provide
-for completely disconnected operation of file system clients. Disconnected
-operation is a mode in which a client continues to access critical data during
-accidental or intentional inability to access the shared file repository. After
-some period of autonomous operation on the set of cached files, the client
-reconnects with the repository and resynchronizes the contents of its cache
-with the shared store.
-\par
-Studies of related systems provide evidence that such disconnected operation is
-feasible [11] [12]. Such a capability may be explored for AFS.
-
- \section sec5-6 Section 5.6: Multiprocessor Support
-
-\par
-The LWP lightweight thread package used by all AFS system processes assumes
-that individual threads may execute non-preemptively, and that all other
-threads are quiescent until control is explicitly relinquished from within the
-currently active thread. These assumptions conspire to prevent AFS from
-operating correctly on a multiprocessor platform.
-\par
-A solution to this restriction is to restructure the AFS code organization so
-that the proper locking is performed. Thus, critical sections which were
-previously only implicitly defined are explicitly specified.
-
- \page biblio Bibliography
-
-\li [1] John H. Howard, Michael L. Kazar, Sherri G. Menees, David A. Nichols,
-M. Satyanarayanan, Robert N. Sidebotham, Michael J. West, Scale and Performance
-in a Distributed File System, ACM Transactions on Computer Systems, Vol. 6, No.
-1, February 1988, pp. 51-81.
-\li [2] Michael L. Kazar, Synchronization and Caching Issues in the Andrew File
-System, USENIX Proceedings, Dallas, TX, Winter 1988.
-\li [3] Alfred Z. Spector, Michael L. Kazar, Uniting File Systems, Unix
-Review, March 1989.
-\li [4] Johna Till Johnson, Distributed File System Brings LAN Technology to
-WANs, Data Communications, November 1990, pp. 66-67.
-\li [5] Michael Padovano, PADCOM Associates, AFS widens your horizons in
-distributed computing, Systems Integration, March 1991.
-\li [6] Steve Lammert, The AFS 3.0 Backup System, LISA IV Conference
-Proceedings, Colorado Springs, Colorado, October 1990.
-\li [7] Michael L. Kazar, Bruce W. Leverett, Owen T. Anderson, Vasilis
-Apostolides, Beth A. Bottos, Sailesh Chutani, Craig F. Everhart, W. Anthony
-Mason, Shu-Tsui Tu, Edward R. Zayas, DEcorum File System Architectural
-Overview, USENIX Conference Proceedings, Anaheim, Texas, Summer 1990.
-\li [8] AFS Drives DCE Selection, Digital Desktop, Vol. 1, No. 6,
-September 1990.
-\li [9] Levine, P.H., The Apollo DOMAIN Distributed File System, in NATO ASI
-Series: Theory and Practice of Distributed Operating Systems, Y. Paker, J-P.
-Banatre, M. Bozyigit, editors, Springer-Verlag, 1987.
-
-\li [10] M.N. Nelson, B.B. Welch, J.K. Ousterhout, Caching in the Sprite
-Network File System, ACM Transactions on Computer Systems, Vol. 6, No. 1,
-February 1988.
-\li [11] James J. Kistler, M. Satyanarayanan, Disconnected Operation in the Coda
-File System, CMU School of Computer Science technical report, CMU-CS-91-166, 26
-July 1991.
-\li [12] Puneet Kumar, M. Satyanarayanan, Log-Based Directory Resolution
-in the Coda File System, CMU School of Computer Science internal document, 2
-July 1991.
-\li [13] Sun Microsystems, Inc., NFS: Network File System Protocol
-Specification, RFC 1094, March 1989.
-\li [14] Sun Microsystems, Inc., Design and Implementation of the Sun Network
-File System, USENIX Summer Conference Proceedings, June 1985.
-\li [15] C.H. Sauer, D.W. Johnson, L.K. Loucks, A.A. Shaheen-Gouda, and T.A.
-Smith, RT PC Distributed Services Overview, Operating Systems Review, Vol. 21,
-No. 3, July 1987.
-\li [16] A.P. Rifkin, M.P. Forbes, R.L. Hamilton, M. Sabrio, S. Shah, and
-K. Yueh, RFS Architectural Overview, Usenix Conference Proceedings, Atlanta,
-Summer 1986.
-\li [17] Edward R. Zayas, Administrative Cells: Proposal for Cooperative Andrew
-File Systems, Information Technology Center internal document, Carnegie Mellon
-University, 25 June 1987.
-\li [18] Ed. Zayas, Craig Everhart, Design and Specification of the Cellular
-Andrew Environment, Information Technology Center, Carnegie Mellon University,
-CMU-ITC-070, 2 August 1988.
-\li [19] Kazar, Michael L., Information Technology Center, Carnegie Mellon
-University. Ubik - A Library For Managing Ubiquitous Data, ITCID, Pittsburgh,
-PA, Month, 1988.
-\li [20] Kazar, Michael L., Information Technology Center, Carnegie Mellon
-University. Quorum Completion, ITCID, Pittsburgh, PA, Month, 1988.
-\li [21] S. R. Kleinman. Vnodes: An Architecture for Multiple File
-System Types in Sun UNIX, Conference Proceedings, 1986 Summer Usenix Technical
-Conference, pp. 238-247, El Toro, CA, 1986.
-\li [22] S.P. Miller, B.C. Neuman, J.I. Schiller, J.H. Saltzer. Kerberos
-Authentication and Authorization System, Project Athena Technical Plan, Section
-E.2.1, M.I.T., December 1987.
-\li [23] Bill Bryant. Designing an Authentication System: a Dialogue in Four
-Scenes, Project Athena internal document, M.I.T., draft of 8 February 1988.
-
-
-*/
diff --git a/doc/arch/dafs-fsa.dot b/doc/arch/dafs-fsa.dot
deleted file mode 100644
index 565de7122..000000000
--- a/doc/arch/dafs-fsa.dot
+++ /dev/null
@@ -1,109 +0,0 @@
-#
-# This is a dot (http://www.graphviz.org) description of the various
-# states volumes can be in for DAFS (Demand Attach File Server).
-# -# Author: Steven Jenkins -# Date: 2007-05-24 -# - -digraph VolumeStates { - size="11,17" - graph [ - rankdir = "TB" - ]; - - subgraph clusterKey { - rankdir="LR"; - shape = "rectangle"; - - s1 [ shape=plaintext, label = "VPut after VDetach in brown", - fontcolor="brown" ]; - s2 [ shape=plaintext, label = "VAttach in blue", - fontcolor="blue" ]; - s3 [ shape=plaintext, label = "VGet/VHold in purple", - fontcolor="purple" ]; - s4 [ shape=plaintext, label = "Error States in red", - fontcolor="red" ]; - s5 [ shape=plaintext, label = "VPut after VOffline in green", - fontcolor="green" ]; - s6 [ shape=ellipse, label = "re-entrant" ]; - s7 [ shape=ellipse, peripheries=2, label="non re-entrant" ]; - s8 [ shape=ellipse, color="red", label="Error States" ]; - - s6->s7->s8->s1->s2->s3->s4->s5 [style="invis"]; - - } - - node [ peripheries = "2" ] ATTACHING \ - LOADING_VNODE_BITMAPS HDR_LOADING_FROM_DISK \ - HDR_ATTACHING_LRU_PULL \ - "UPDATING\nSYNCING_VOL_HDR_TO_DISK" \ - OFFLINING DETACHING; - node [ shape = "ellipse", peripheries = "1" ]; - node [ color = "red" ] HARD_ERROR SALVAGE_REQUESTED SALVAGING; - - node [ color = "black" ]; // default back to black - - UNATTACHED->Exclusive_vol_op_executing [label = "controlled by FSSYNC" ]; - Exclusive_vol_op_executing->UNATTACHED [label = "controlled by FSSYNC" ]; - UNATTACHED->FREED [ label = "VCancelReservation_r() after a\nVDetach() or FreeVolume() will\ncause CheckDetach() or CheckFree() to fire" ]; - OFFLINING->UNATTACHED; - UNATTACHED->PREATTACHED [ color = "orange", label = "PreAttach()" ]; - PREATTACHED->UNATTACHED [ color = "orange", label = "VOffline()"]; - HARD_ERROR->PREATTACHED [ color = "orange", label = "operator intervention via FSSYNC" ]; - - PREATTACHED->Exclusive_vol_op_executing [color = "orange", label = "controlled by FSSYNC" ]; - Exclusive_vol_op_executing->PREATTACHED [color = "orange", label = "controlled by FSSYNC" ]; - PREATTACHED->FREED [ color = "orange", label = "VCancelReservation_r() after a\nVDetach() or FreeVolume() will\ncause CheckDetach() or CheckFree() to fire" ]; - PREATTACHED->ATTACHING [ color = "blue", weight = "8" ]; - SALVAGING->PREATTACHED [ label = "controlled via FSSYNC" ]; - - DETACHING->FREED ; - SHUTTING_DOWN->DETACHING [ color = "brown" ]; - ATTACHED_nUsers_GT_0->SHUTTING_DOWN [ color = "orange", label = "VDetach()" ]; - - DETACHING->"UPDATING\nSYNCING_VOL_HDR_TO_DISK" [ color = "brown" ]; - "UPDATING\nSYNCING_VOL_HDR_TO_DISK"->DETACHING [ color = "brown" ]; - OFFLINING->"UPDATING\nSYNCING_VOL_HDR_TO_DISK" [ color = "green" ]; - "UPDATING\nSYNCING_VOL_HDR_TO_DISK"->OFFLINING [ color = "green" ]; - GOING_OFFLINE->OFFLINING [ color = "green" ]; - - "UPDATING\nSYNCING_VOL_HDR_TO_DISK"->SALVAGE_REQUESTED [ color = "red" ]; - "UPDATING\nSYNCING_VOL_HDR_TO_DISK"->ATTACHING [ color = "blue" ]; - ATTACHING->"UPDATING\nSYNCING_VOL_HDR_TO_DISK" [ color = "blue" ]; - - ATTACHED_nUsers_GT_0->GOING_OFFLINE [ color = "orange", label = "VOffline" ]; - ATTACHED_nUsers_GT_0->ATTACHED_nUsers_EQ_0 [ color = "orange", label = "VPut" ]; - - ATTACHED_nUsers_GT_0->SALVAGE_REQUESTED [ color = "red" ]; - - LOADING_VNODE_BITMAPS->ATTACHING [ color = "blue" ]; - ATTACHING->LOADING_VNODE_BITMAPS [ color = "blue" ] ; - LOADING_VNODE_BITMAPS->SALVAGE_REQUESTED [ color = "red" ]; - HDR_LOADING_FROM_DISK->SALVAGE_REQUESTED [ color = "red" ]; - HDR_LOADING_FROM_DISK->ATTACHING [ color = "blue" ] ; - HDR_LOADING_FROM_DISK->ATTACHED_nUsers_GT_0 [ color = "purple" ]; - - SALVAGE_REQUESTED->SALVAGING [ label = "controlled via FSSYNC" ]; 
- SALVAGE_REQUESTED->HARD_ERROR [ color = "red", - label = "After hard salvage limit reached,\n hard error state is in effect\nuntil there is operator intervention" ]; - - HDR_ATTACHING_LRU_PULL->HDR_LOADING_FROM_DISK [ color = "blue" ]; - HDR_ATTACHING_LRU_PULL->HDR_LOADING_FROM_DISK [ color = "purple" ]; - HDR_ATTACHING_LRU_PULL->ATTACHED_nUsers_GT_0 [ color = "purple", label = "header can be in LRU\nand not have been reclaimed\nthus skipping disk I/O" ]; - - ATTACHING->HDR_ATTACHING_LRU_PULL [ color = "blue" ]; - ATTACHING->ATTACHED_nUsers_EQ_0 [ color = "blue" ]; - - ATTACHING->SALVAGE_REQUESTED [ color = "red" ]; - ATTACHED_nUsers_EQ_0->HDR_ATTACHING_LRU_PULL [ color = "purple" ]; - - ATTACHED_nUsers_EQ_0->SALVAGE_REQUESTED [ color = "red" ]; - - // Various loopback transitions - GOING_OFFLINE->GOING_OFFLINE [ label = "VPut when (nUsers > 1)" ]; - SHUTTING_DOWN->SHUTTING_DOWN - [ label = "VPut when ((nUsers > 1) ||\n((nUsers == 1) && (nWaiters > 0)))" ]; - SHUTTING_DOWN->SHUTTING_DOWN - [ label = "VCancelReservation_r when ((nWaiters > 1)\n|| ((nWaiters == 1) && (nUsers > 0)))"]; -} diff --git a/doc/arch/dafs-overview.txt b/doc/arch/dafs-overview.txt deleted file mode 100644 index 2b2e58668..000000000 --- a/doc/arch/dafs-overview.txt +++ /dev/null @@ -1,396 +0,0 @@ -The Demand-Attach FileServer (DAFS) has resulted in many changes to how -many things on AFS fileservers behave. The most sweeping changes are -probably in the volume package, but significant changes have also been -made in the SYNC protocol, the vnode package, salvaging, and a few -miscellaneous bits in the various fileserver processes. - -This document serves as an overview for developers on how to deal with -these changes, and how to use the new mechanisms. For more specific -details, consult the relevant doxygen documentation, the code comments, -and/or the code itself. - - - The salvageserver - -The salvageserver (or 'salvaged') is a new OpenAFS fileserver process in -DAFS. This daemon accepts salvage requests via SALVSYNC (see below), and -salvages a volume group by fork()ing a child, and running the normal -salvager code (it enters vol-salvage.c by calling SalvageFileSys1). - -Salvages that are initiated from a request to the salvageserver (called -'demand-salvages') occur automatically; whenever the fileserver (or -other tool) discovers that a volume needs salvaging, it will schedule a -salvage on the salvageserver without any intervention needed. - -When scheduling a salvage, the vol id should be the id for the volume -group (the RW vol id). If the salvaging child discovers that it was -given a non-RW vol id, it will send the salvageserver a SALVSYNC LINK -command, and will exit. This will tell the salvageserver that whenever -it receives a salvage request for that vol id, it should schedule a -salvage for the corresponding RW id instead. - - - FSSYNC/SALVSYNC - -The FSSYNC and SALVSYNC protocols are the protocols used for -interprocess communication between the various fileserver processes. -FSSYNC is used for querying the fileserver for volume metadata, -'checking out' volumes from the fileserver, and a few other things. -SALVSYNC is used to schedule and query salvages in the salvageserver. - -FSSYNC existed prior to DAFS, but it encompasses a much larger set of -commands with the advent of DAFS. SALVSYNC is entirely new to DAFS. - - -- SYNC - -FSSYNC and SALVSYNC are both layered on top of a protocol called SYNC. 
-SYNC isn't much a protocol in itself; it just handles some boilerplate -for the messages passed back and forth, and some error codes common to -both FSSYNC and SALVSYNC. - -SYNC is layered on top of TCP/IP, though we only use it to communicate -with the local host (usually via a unix domain socket). It does not -handle anything like authentication, authorization, or even things like -serialization. Although it uses network primitives for communication, -it's only useful for communication between processes on the same -machine, and that is all we use it for. - -SYNC calls are basically RPCs, but very simple. The calls are always -synchronous, and each SYNC server can only handle one request at a time. -Thus, it is important for SYNC server handlers to return as quickly as -possible; hitting the network or disk to service a SYNC request should -be avoided to the extent that such is possible. - -SYNC-related source files are src/vol/daemon_com.c and -src/vol/daemon_com.h - - -- FSSYNC - - --- server - -The FSSYNC server runs in the fileserver; source is in -src/vol/fssync-server.c. - -As mentioned above, FSSYNC handlers should finish quickly when -servicing a request, so hitting the network or disk should be avoided. -In particular, you absolutely cannot make a SALVSYNC call inside an -FSSYNC handler; the SALVSYNC client wrapper routines actively prevent -this from happening, so even if you try to do such a thing, you will not -be allowed to. This prohibition is to prevent deadlock, since the -salvageserver could have made the FSSYNC request that you are servicing. - -When a client makes a FSYNC_VOL_OFF or NEEDVOLUME request, the -fileserver offlines the volume if necessary, and keeps track that the -volume has been 'checked out'. A volume is left online if the checkout -mode indicates the volume cannot change (see VVolOpLeaveOnline_r). - -Until the volume has been 'checked in' with the ON, LEAVE_OFFLINE, or -DONE commands, no other program can check out the volume. - -Other FSSYNC commands include abilities to query volume metadata and -stats, to force volumes to be attached or offline, and to update the -volume group cache. See doc/arch/fssync.txt for documentation on the -individual FSSYNC commands. - - --- clients - -FSSYNC clients are generally any OpenAFS process that runs on a -fileserver and tries to access volumes directly. The volserver, -salvageserver, and bosserver all qualify, as do (sometimes) some -utilities like vol-info or vol-bless. For issuing FSSYNC commands -directly, there is the debugging tool fssync-debug. FSSYNC client code -is in src/vol/fssync-client.c, but it's not very interesting. - -Any program that wishes to directly access a volume on disk must check -out the volume via FSSYNC (NEEDVOLUME or OFF commands), to ensure the -volume doesn't change while the program is using it. If the program -determines that the volume is somehow inconsistent and should be -salvaged, it should send the FSSYNC command FORCE_ERROR with reason code -FSYNC_SALVAGE to the fileserver, which will take care of salvaging it. - - -- SALVSYNC - -The SALVSYNC server runs in the salvageserver; code is in -src/vol/salvsync-server.c. SALVSYNC clients are just the fileserver, the -salvageserver run with the -client switch, and the salvageserver worker -children. If any other process notices that a volume needs salvaging, it -should issue a FORCE_ERROR FSSYNC command to the fileserver with the -FSYNC_SALVAGE reason code. - -The SALVSYNC protocol is simpler than the FSSYNC protocol. 
The commands
-are basically just to create, cancel, change, and query salvages. The
-RAISEPRIO command increases the priority of a salvage job that hasn't
-started yet, so volumes that are accessed more frequently will get
-salvaged first. The LINK command is used by the salvageserver worker
-children to inform the salvageserver parent that it tried to salvage a
-readonly volume for which a read-write clone exists (in which case we
-should just schedule a salvage for the parent read-write volume).
-
-Note that canceling a salvage is just for salvages that haven't run
-yet; it only takes a salvage job off of a queue; it doesn't stop a
-salvageserver worker child in the middle of a salvage.
-
-
- - The volume package
-
- -- refcounts
-
-Before DAFS, the Volume struct just had one reference count, vp->nUsers.
-With DAFS, we now have the notion of an internal/lightweight reference
-count, and an external/heavyweight reference count. Lightweight refs are
-acquired with VCreateReservation_r, and released with
-VCancelReservation_r. Heavyweight refs are acquired as before, normally
-with a GetVolume or AttachVolume variant, and the ref is released with
-VPutVolume.
-
-Lightweight references are only acquired within the volume package; a vp
-should not be given to e.g. the fileserver code with an extra
-lightweight ref. A heavyweight ref is generally acquired for a vp that
-will be given to some non-volume-package code; acquiring a heavyweight
-ref guarantees that the volume header has been loaded.
-
-Acquiring a lightweight ref just guarantees that the volume will not go
-away or suddenly become unavailable after dropping VOL_LOCK. Certain
-operations like detachment or scheduling a salvage only occur when all
-of the heavy and lightweight refs go away; see VCancelReservation_r.
-
- -- state machine
-
-Instead of having a per-volume lock, each vp always has an associated
-'state' that says what, if anything, is occurring to a volume at any
-particular time, or whether the volume is attached, offline, etc. To do
-the basic equivalent of a lock -- that is, ensure that nobody else will
-change the volume when we drop VOL_LOCK -- you can put the volume in
-what is called an 'exclusive' state (see VIsExclusiveState).
-
-When a volume is in an exclusive state, no thread should modify the
-volume (or expect the vp data to stay the same), except the thread that
-put it in that state. Whenever you manipulate a volume, you should make
-sure it is not in an exclusive state; first call VCreateReservation_r to
-make sure the volume doesn't go away, and then call
-VWaitExclusiveState_r. When that returns, you are guaranteed to have a
-vp that is in a non-exclusive state, and so can be manipulated. Call
-VCancelReservation_r when done with it, to indicate you don't need it
-anymore.
-
-Look at the definition of the VolState enumeration to see all volume
-states, and a brief explanation of them.
-
- -- VLRU
-
-See: Most functions with VLRU in their name in src/vol/volume.c.
-
-The VLRU is what dictates when volumes are detached after a certain
-amount of inactivity. The design is pretty much a generational garbage
-collection mechanism. There are 5 queues that a volume can be on in the
-VLRU (VLRUQueueName in volume.h). 'Candidate' volumes haven't seen
-activity in a while, and so are candidates to be detached. 'New' volumes
-have seen activity only recently; 'mid' volumes have seen activity for
-awhile, and 'old' volumes have seen activity for a long while. 'Held'
-volumes cannot be soft detached at all.
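A minimal sketch of the reservation pattern described under "state machine"
above may help. It is not code from the tree; it just strings together the
calls named there, and it assumes the caller is inside the volume package,
holds VOL_LOCK, and already has a valid Volume pointer:

    static void
    InspectVolume_r(Volume * vp)
    {
        /* lightweight ref: vp cannot go away once VOL_LOCK is dropped */
        VCreateReservation_r(vp);

        /* wait until no other thread holds vp in an exclusive state */
        VWaitExclusiveState_r(vp);

        /* vp is now in a non-exclusive state and can be examined or
         * manipulated here (still under VOL_LOCK) */

        /* drop the lightweight ref; if it is the last reference, this is
         * where deferred work such as detachment can fire */
        VCancelReservation_r(vp);
    }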
- -Volumes are moved from new->mid->old if they have had activity recently, -and are moved from old->mid->new->candidate if they have not had any -activity recently. The definition of 'recently' is configurable by the --vlruthresh fileserver parameter; see VLRU_ComputeConstants for how they -are determined. Volumes start at 'new' on attachment, and if any -activity occurs when a volume is on 'candidate', it's moved to 'new' -immediately. - -Volumes are generally promoted/demoted and soft-detached by -VLRU_ScannerThread, which runs every so often and moves volumes between -VLRU queues depending on their last access time and the various -thresholds (or soft-detaches them, in the case of the 'candidate' -queue). Soft-detaching just means the volume is taken offline and put -into the preattached state. - - --- DONT_SALVAGE - -The dontSalvage flag in volume headers can be set to DONT_SALVAGE to -indicate that a volume probably doesn't need to be salvaged. Before -DAFS, volumes were placed on an 'UpdateList' which was periodically -scanned, and dontSalvage was set on volumes that hadn't been touched in -a while. - -With DAFS and the VLRU additions, setting dontSalvage now happens when a -volume is demoted a VLRU generation, and no separate list is kept. So if -a volume has been idle enough to demote, and it hasn't been accessed in -SALVAGE_INTERVAL time, dontSalvage will be set automatically by the VLRU -scanner. - - -- Vnode - -Source files: src/vol/vnode.c, src/vol/vnode.h, src/vol/vnode_inline.h - -The changes to the vnode package are largely very similar to those in -the volume package. A Vnode is put into specific states, some of which -are exclusive and act like locks (see VnChangeState_r, -VnIsExclusiveState). Vnodes also have refcounts, incremented and -decremented with VnCreateReservation_r and VnCancelReservation_r like -you would expect. I/O should be done outside of any global locks; just -the vnode is 'locked' by being put in an exclusive state if necessary. - -In addition to a state, vnodes also have a count of readers. When a -caller gets a vnode with a read lock, we of course must wait for the -vnode to be in a nonexclusive state (VnWaitExclusive_r), then the number -of readers is incremented (VnBeginRead_r), but the vnode is kept in a -non-exclusive state (VN_STATE_READ). - -When a caller gets a vnode with a write lock, we must wait not only for -the vnode to be in a nonexclusive state, but also for there to be no -readers (VnWaitQuiescent_r), so we can actually change it. - -VnLock still exists in DAFS, but it's almost a no-op. All we do for DAFS -in VnLock is set vnp->writer to the current thread id for a write lock, -for some consistency checks later (read locks are actually no-ops). -Actual mutual exclusion in DAFS is done by the vnode state machine and -the reader count. - - - viced state serialization - -See src/viced/serialize_state.* and ShutDownAndCore in -src/viced/viced.c - -Before DAFS, whenever a fileserver restarted, it lost all information -about all clients, what callbacks they had, etc. So when a client with -existing callbacks contacted the fileserver, all callback information -needed to be reset, potentially causing a bunch of unnecessary traffic. -And of course, if the client does not contact the fileserver again, it -could not get sent callbacks it should get sent. - -DAFS now has the ability to save the host and CB data to a file on -shutdown, and restore it when it starts up again. 
So when a fileserver -is restarted, the host and CB information should be effectively the same -as when it shut down. So a client may not even know if a fileserver was -restarted. - -Getting this state information can be a little difficult, since the host -package data structures aren't necessarily always consistent, even after -H_LOCK is dropped. What we attempt to do is stop all of the background -threads early in the shutdown process (set fs_state.mode - -FS_MODE_SHUTDOWN), and wait for the background threads to exit (or be -marked as 'tranquil'; see the fs_state struct) later on, before trying -to save state. This makes it a lot less likely for anything to be -modifying the host or CB structures by the time we try to save them. - - - volume group cache - -See: src/vol/vg_cache* and src/vol/vg_scan.c - -The VGC is a mechanism in DAFS to speed up volume salvages. Pre-VGC, -whenever the salvager code salvaged an individual volume, it would need -to read all of the volume headers on the partition, so it knows what -volumes are in the volume group it is salvaging, so it knows what -volumes to tell the fileserver to take offline. With demand-salvages, -this can make salvaging take a very long time, since the time to read in -all volume headers can take much more time than the time to actually -salvage a single volume group. - -To prevent the need to scan the partition volume headers every single -time, the fileserver maintains a cache of which volumes are in what -volume groups. The cache is populated by scanning a partition's volume -headers, and is started in the background upon receiving the first -salvage request for a partition (VVGCache_scanStart_r, -_VVGC_scan_start). - -After the VGC is populated, it is kept up to date with volumes being -created and deleted via the FSSYNC VG_ADD and VG_DEL -commands. These are called every time a volume header is created, -removed, or changed when using the volume header wrappers in vutil.c -(VCreateVolumeDiskHeader, VDestroyVolumeDiskHeader, -VWriteVolumeDiskHeader). These wrappers should always be used to -create/remove/modify vol headers, to ensure that the necessary FSSYNC -commands are called. - - -- race prevention - -In order to prevent races between volume changes and VGC partition scans -(that is, someone scans a header while it is being written and not yet -valid), updates to the VGC involving adding or modifying volume headers -should always be done under the 'partition header lock'. This is a -per-partition lock to conceptually lock the set of volume headers on -that partition. It is only read-held when something is writing to a -volume header, and it is write-held for something that is scanning the -partition for volume headers (the VGC or partition salvager). This is a -little counterintuitive, but it is what we want. We want multiple -headers to be written to at once, but if we are the VGC scanner, we want -to ensure nobody else is writing when we look at a header file. - -Because the race described above is so rare, vol header scanners don't -actually hold the lock unless a problem is detected. So, what they do is -read a particular volume header without any lock, and if there is a -problem with it, they grab a write lock on the partition vol headers, -and try again. If it still has a problem, the header is just faulty; if -it's okay, then we avoided the race. - -Note that destroying vol headers does not require any locks, since -unlink()s are atomic and don't cause any races for us here. 
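The optimistic-read-then-retry idea can be shown with a small, self-contained
toy; the names here are made up for illustration and are not the real VGC
scanner internals. The rwlock stands in for the per-partition header lock, and
read_header() stands in for reading one on-disk volume header. Note the
inversion described above: the scanner takes the lock exclusively, while real
header writers would take it shared.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t part_hdr_lock = PTHREAD_RWLOCK_INITIALIZER;

    /* stub: pretend to read and sanity-check one volume header */
    static int
    read_header(int *hdr)
    {
        *hdr = 42;
        return 0;               /* 0 = header parsed and looks valid */
    }

    static int
    scan_one_header(void)
    {
        int hdr, code;

        code = read_header(&hdr);          /* optimistic: no lock held */
        if (code != 0) {
            /* possible race with a writer: retry while excluding writers */
            pthread_rwlock_wrlock(&part_hdr_lock);
            code = read_header(&hdr);      /* still bad => really faulty */
            pthread_rwlock_unlock(&part_hdr_lock);
        }
        return code;
    }

    int
    main(void)
    {
        printf("scan result: %d\n", scan_one_header());
        return 0;
    }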
- - - partition and volume locking - -Previously, whenever the volserver would attach a volume or the salvager -would salvage anything, the partition would be locked -(VLockPartition_r). This unnecessarily serializes part of most volserver -operations. It also makes it so only one salvage can run on a partition -at a time, and that a volserver operation cannot occur at the same time -as a salvage. With the addition of the VGC (previous section), the -salvager partition lock is unnecessary on namei, since the salvager does -not need to scan all volume headers. - -Instead of the rather heavyweight partition lock, in DAFS we now lock -individual volumes. Locking an individual volume is done by locking a -certain byte in the file /vicepX/.volume.lock. To lock volume with ID -1234, you lock 1 byte at offset 1234 (with VLockFile: fcntl on unix, -LockFileEx on windows as of the time of this writing). To read-lock the -volume, acquire a read lock; to write-lock the volume, acquire a write -lock. - -Due to the potentially very large number of volumes attached by the -fileserver at once, the fileserver does not keep volumes locked the -entire time they are attached (which would make volume locking -potentially very slow). Rather, it locks the volume before attaching, -and unlocks it when the volume has been attached. However, all other -programs are expected to acquire a volume lock for the entire duration -they interact with the volume. Whether a read or write lock is obtained -is determined by the attachment mode, and whether or not the volume in -question is an RW volume (see VVolLockType()). - -These locks are all acquired non-blocking, so we can just fail if we -fail to acquire a lock. That is, an errant process holding a file-level -lock cannot cause any process to just hang, waiting for a lock. - - -- re-reading volume headers - -Since we cannot know whether a volume is writable or not until the -volume header is read, and we cannot atomically upgrade file-level -locks, part of attachment can now occur twice (see attach2 and -attach_volume_header). What occurs is we read the vol header, assuming -the volume is readonly (acquiring a read or write lock as necessary). -If, after reading the vol header, we discover that the volume is -writable and that means we need to acquire a write lock, we read the vol -header again while acquiring a write lock on the header. - - -- verifying checkouts - -Since the fileserver does not hold volume locks for the entire time a -volume is attached, there could have been a potential race between the -fileserver and other programs. Consider when a non-fileserver program -checks out a volume from the fileserver via FSSYNC, then locks the -volume. Before the program locked the volume, the fileserver could have -restarted and attached the volume. Since the fileserver releases the -volume lock after attachment, the fileserver and the other program could -both think they have control over the volume, which is a problem. - -To prevent this non-fileserver programs are expected to verify that -their volume is checked out after locking it (FSYNC_VerifyCheckout). -What this does is ask the fileserver for the current volume operation on -the specific volume, and verifies that it matches how the program -checked out the volume. - -For example, programType X checks out volume V from the fileserver, and -then locks it. We then ask the fileserver for the current volume -operation on volume V. 
If the programType on the vol operation does not -match (or the PID, or the checkout mode, or other things), we know the -fileserver must have restarted or something similar, and we do not have -the volume checked out like we thought we did. - -If the program determines that the fileserver may have restarted, it -then must retry checking out and locking the volume (or return an -error). diff --git a/doc/arch/dafs-vnode-fsa.dot b/doc/arch/dafs-vnode-fsa.dot deleted file mode 100644 index a0e28ae80..000000000 --- a/doc/arch/dafs-vnode-fsa.dot +++ /dev/null @@ -1,89 +0,0 @@ -# -# This is a dot (http://www.graphviz.org) description of the various -# states volumes can be in for DAFS (Demand Attach File Server). -# -# Author: Tom Keiser -# Date: 2008-06-03 -# - -digraph VolumeStates { - size="11,17" - graph [ - rankdir = "TB" - ]; - - subgraph clusterKey { - rankdir="LR"; - shape = "rectangle"; - - s1 [ shape=plaintext, label = "VAllocVnode", - fontcolor="brown" ]; - s2 [ shape=plaintext, label = "VGetVnode", - fontcolor="blue" ]; - s3 [ shape=plaintext, label = "VPutVnode", - fontcolor="purple" ]; - s4 [ shape=plaintext, label = "Error States", - fontcolor="red" ]; - s5 [ shape=plaintext, label = "VVnodeWriteToRead", - fontcolor="green" ]; - s6 [ shape=ellipse, label = "re-entrant" ]; - s7 [ shape=ellipse, peripheries=2, label="non re-entrant" ]; - s8 [ shape=ellipse, color="red", label="Error States" ]; - - s6->s7->s8->s1->s2->s3->s5->s4 [style="invis"]; - - } - - node [ peripheries = "2" ] \ - RELEASING ALLOC LOADING EXCLUSIVE STORE ; - node [ shape = "ellipse", peripheries = "1" ]; - node [ color = "red" ] ERROR ; - - node [ color = "black" ]; // default back to black - - - // node descriptions - INVALID [ label = "Vn_state(vnp) == VN_STATE_INVALID\n(vnode cache entry is invalid)" ]; - RELEASING [ label = "Vn_state(vnp) == VN_STATE_RELEASING\n(vnode is busy releasing its inode handle ref)" ]; - ALLOC [ label = "Vn_state(vnp) == VN_STATE_ALLOC\n(vnode is busy allocating disk entry)" ]; - ALLOC_read [ label = "reading stale vnode from disk\nto verify inactive state" ]; - ALLOC_extend [ label = "extending vnode index file" ]; - ONLINE [ label = "Vn_state(vnp) == VN_STATE_ONLINE\n(vnode is a valid cache entry)" ]; - LOADING [ label = "Vn_state(vnp) == VN_STATE_LOAD\n(vnode is busy loading from disk)" ]; - EXCLUSIVE [ label = "Vn_state(vnp) == VN_STATE_EXCLUSIVE\n(vnode is owned exclusively by an external caller)" ]; - STORE [ label = "Vn_state(vnp) == VN_STATE_STORE\n(vnode is busy writing to disk)" ]; - READ [ label = "Vn_state(vnp) == VN_STATE_READ\n(vnode is shared by several external callers)" ]; - ERROR [ label = "Vn_state(vnp) == VN_STATE_ERROR\n(vnode hard error state)" ]; - - - ONLINE->RELEASING [ label = "VGetFreeVnode_r()" ]; - RELEASING->INVALID [ label = "VGetFreeVnode_r()" ]; - - INVALID->ALLOC [ color="brown", label="vnode not in cache; allocating" ]; - ONLINE->EXCLUSIVE [ color="brown", label="vnode in cache" ]; - ALLOC->ALLOC_read [ color="brown", label="vnode index is within present file size" ]; - ALLOC->ALLOC_extend [ color="brown", label="vnode index is beyond end of file" ]; - ALLOC_read->EXCLUSIVE [ color="brown" ]; - ALLOC_extend->EXCLUSIVE [ color="brown" ]; - ALLOC_read->INVALID [ color="red", label="I/O error; invalidating vnode\nand scheduling salvage" ]; - ALLOC_extend->INVALID [ color="red", label="I/O error; invalidating vnode\nand scheduling salvage" ]; - - INVALID->LOADING [ color="blue", label="vnode not cached" ]; - LOADING->INVALID [ color="red", label="I/O 
error; invalidating vnode\nand scheduling salvage" ]; - LOADING->ONLINE [ color="blue" ]; - ONLINE->READ [ color="blue", label="caller requested read lock" ]; - ONLINE->EXCLUSIVE [ color="blue", label="caller requested write lock" ]; - - EXCLUSIVE->READ [ color="green", label="vnode not changed" ]; - EXCLUSIVE->STORE [ color="green", label="vnode changed" ]; - EXCLUSIVE->ONLINE [ color="purple", label="vnode not changed" ]; - EXCLUSIVE->STORE [ color="purple", label="vnode changed" ]; - - STORE->READ [ color="green" ]; - STORE->ONLINE [ color="purple" ]; - STORE->ERROR [ color="red", label="I/O error; scheduling salvage" ]; - - READ->READ [ color="blue", label="Vn_readers(vnp) > 0" ]; - READ->READ [ color="purple", label="Vn_readers(vnp) > 1" ]; - READ->ONLINE [ color="purple", label="Vn_readers(vnp) == 1" ]; -} diff --git a/doc/arch/fssync.txt b/doc/arch/fssync.txt deleted file mode 100644 index 726d6b9e1..000000000 --- a/doc/arch/fssync.txt +++ /dev/null @@ -1,253 +0,0 @@ -This file provides a brief description of the commands of the FSSYNC -protocol, and how/why each are typically used. - - -- vol op FSSYNC commands - -FSSYNC commands involving volume operations take a FSSYNC_VolOp_command -struct as their command and arguments. They all deal with a specific -volume, so "the specified volume" below refers to the volume in the -FSSYNC_VolOp_hdr in the FSSYNC_VolOp_command. - - -- FSYNC_VOL_ON - -Tells the fileserver to bring the specified volume online. For DAFS, -this brings the volume into the preattached state. For non-DAFS, the -volume is attached. - -This is generally used to tell the fileserver about a newly-created -volume, or to return ('check in') a volume to the fileserver that was -previously checked-out (e.g. via FSYNC_VOL_NEEDVOLUME). - - -- FSYNC_VOL_OFF - -Tells the fileserver to take a volume offline, so nothing else will -access the volume until it is brought online via FSSYNC again. A volume -that is offlined with this command and the FSYNC_SALVAGE reason code -will not be allowed access from the fileserver by anything. The volume -will be 'checked out' until it is 'checked in' by another FSYNC command. - -Currently only the salvaging code uses this command; the only difference -between it an FSYNC_VOL_NEEDVOLUME is the logic that determines whether -an offlined volume can be accessed by other programs or not. - - -- FSYNC_VOL_LISTVOLUMES - -This is currently a no-op; all it does is return success, assuming the -FSSYNC command is well-formed. - -In Transarc/IBM AFS 3.1, this was used to create a file listing all -volumes on the server, and was used with a tool to create a list of -volumes to backup. After AFS 3.1, however, it never did anything. - - -- FSYNC_VOL_NEEDVOLUME - -Tells the fileserver that the calling program needs the volume for a -certain operation. The fileserver will offline the volume or keep it -online, depending on the reason code given. The volume will be marked as -'checked out' until 'checked in' again with another FSYNC command. - -Reason codes for this command are different than for normal FSSYNC -commands; reason codes for _NEEDVOLUME are volume checkout codes like -V_CLONE, V_DUMP, and the like. The fileserver will keep the volume -online if the given reason code is V_READONLY, or if the volume is an RO -volume and the given reason code is V_CLONE or V_DUMP. If the volume is -taken offline, the volume's specialStatus will also be marked with VBUSY -in the case of the V_CLONE or V_DUMP reason codes. 
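A rough sketch of how a utility might use this checkout, assuming the
FSYNC_VolOp() client wrapper in src/vol/fssync-client.c keeps its usual
(volume, partition, command, reason, response) shape; the reason code used to
check the volume back in, the partition name, and the error handling are all
illustrative assumptions, and the necessary OpenAFS headers are omitted:

    SYNC_response res;
    memset(&res, 0, sizeof(res));

    /* check the volume out for a dump; the fileserver decides whether it
     * stays online based on the V_DUMP reason code */
    if (FSYNC_VolOp(volid, "/vicepa", FSYNC_VOL_NEEDVOLUME, V_DUMP, &res)
            == SYNC_OK) {

        /* ... operate on the volume's on-disk data directly ... */

        /* check the volume back in so the fileserver may use it again */
        FSYNC_VolOp(volid, "/vicepa", FSYNC_VOL_ON, FSYNC_WHATEVER, &res);
    }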
- - -- FSYNC_VOL_MOVE - -Tells the fileserver that the specified volume was moved to a new site. -The new site is given in the reason code of the request. On receiving -this, the fileserver merely sets the specialStatus on the volume, and -breaks all of the callbacks on the volume. - - -- FSYNC_VOL_BREAKCBKS - -Tells the fileserver to break all callbacks with the specified volume. -This is used when volumes are deleted or overwritten (restores, -reclones, etc). - - -- FSYNC_VOL_DONE - -Tells the fileserver that a volume has been deleted. This is actually -similar to FSYNC_VOL_ON, except that the volume isn't onlined. The -volume is 'checked in', though, and is removed from the list of volumes. - - -- FSYNC_VOL_QUERY - -Asks the fileserver to provide the known volume state information for -the specified volume. If the volume is known, the response payload is a -filled-in 'struct Volume'. - -This is used as a debugging tool to query volume state in the -fileserver, but is also used by the volserver as an optimization so it -does not need to always go to disk to retrieve volume information for -e.g. the AFSVolListOneVolume or AFSVolListVolumes RPCs. - - -- FSYNC_VOL_QUERY_HDR - -Asks the fileserver to provide the on-disk volume header for the -specified volume, if the fileserver already has it loaded. If the -fileserver does not already know this information, it responds with -SYNC_FAILED with the reason code FSYNC_HDR_NOT_ATTACHED. Otherwise it -responds with a filled-in 'struct VolumeDiskData' in the response -payload. - -This is used by non-fileservers as an optimization during attachment if -we are just reading from the volume and we don't need to 'check out' the -volume from the fileserver (attaching with V_PEEK). If the fileserver -has the header loaded, it avoids needing to hit the disk for the volume -header. - - -- FSYNC_VOL_QUERY_VOP (DAFS only) - -Asks the fileserver to provide information about the current volume -operation that has the volume checked out. If the volume is checked out, -the response payload is a filled-in 'struct FSSYNC_VolOp_info'; -otherwise the command fails with SYNC_FAILED. - -This is useful as a debugging aid, and is also used by the volserver to -determine if a volume should be advertised as 'offline' or 'online'. - - -- FSYNC_VOL_ATTACH - -This is like FSYNC_VOL_ON, but for DAFS forces the volume to become -fully attached (as opposed to preattached). This is used for debugging, -to ensure that a volume is attached and online without needing to -contact the fileserver via e.g. a client. - - -- FSYNC_VOL_FORCE_ERROR (DAFS only) - -This tells the fileserver that there is something wrong with a volume, -and it should be put in an error state or salvaged. - -If the reason code is FSYNC_SALVAGE, the fileserver will potentially -schedule a salvage for the volume. It may or may not actually schedule a -salvage, depending on how many salvages have occurred and other internal -logic; basically, specifying FSYNC_SALVAGE makes the fileserver behave -as if the fileserver itself encountered an error with the volume that -warrants a salvage. - -Non-fileserver programs use this to schedule salvages; they should not -contact the salvageserver directly. Note when a salvage is scheduled as -a result of this command, it is done so in the background; getting a -response from this command does not necessarily mean the salvage has -been scheduled, as it may be deferred until later. 
- -If the reason code is not FSYNC_SALVAGE, the fileserver will just put -the volume into an error state, and the volume will be inaccessible -until it is salvaged, or forced online. - - -- FSYNC_VOL_LEAVE_OFF - -This 'checks in' a volume back to the fileserver, but tells the -fileserver not to bring the volume back online. This can occur when a -non-fileserver program is done with a volume, but the volume's "blessed" -or "inService" fields are not set; this prevents the fileserver from -trying to attach the volume later, only to find the volume is not -blessed and take the volume offline. - - -- FSYNC_VG_QUERY (DAFS only) - -This queries the fileserver VGC (volume group cache) for the volume -group of the requested volume. The payload consists of an -FSSYNC_VGQry_response_t, specifying the volume group and all of the -volumes in that volume group. - -If the VGC for the requested partition is currently being populated, -this will fail with SYNC_FAILED, and the FSYNC_PART_SCANNING reason -code. If the VGC for the requested partition is currently completely -unpopulated, a VGC scan for the partition will be started automatically -in the background, and FSYNC_PART_SCANNING will still be returned. - -The demand-salvager uses this to find out what volumes are in the volume -group it is salvaging; it can also be used for debugging the VGC. - - -- FSYNC_VG_SCAN (DAFS only) - -This discards any information in the VGC for the specified partition, -and re-scans the partition to populate the VGC in the background. This -should normally not be needed, since scans start automatically when VGC -information is requested. This can be used as a debugging tool, or to -force the VGC to discard incorrect information that somehow got into the -VGC. - -Note that the scan is scheduled in the background, so getting a response -from this command does not imply that the scan has started; it may start -sometime in the future. - - -- FSYNC_VG_SCAN_ALL - -This is the same as FSYNC_VG_SCAN, but schedules scans for all -partitions on the fileserver, instead of a particular one. - - -- FSYNC_VOL_QUERY_VNODE - -Asks the fileserver for information about specific vnode. This takes a -different command header than other vol ops; it takes a struct -FSSYNC_VnQry_hdr which specifies the volume and vnode requested. The -response payload is a 'struct Vnode' if successful. - -This responds with FSYNC_UNKNOWN_VNID if the fileserver doesn't know -anything about the given vnode. This command will not automatically -attach the associated volume; the volume must be attached before issuing -this command in order to do anything useful. - -This is just a debugging tool, to see what the fileserver thinks about a -particular vnode. - - -- stats FSSYNC commands - -FSSYNC commands involving statistics take a FSSYNC_StatsOp_command -struct as their command and arguments. Some of them use arguments to -specify what stats are requested, which are specified in sop->args, the -union in the FSSYNC_StatsOp_hdr struct. - - -- FSYNC_VOL_STATS_GENERAL - -Retrieves general volume package stats from the fileserver. Response -payload consists of a 'struct VolPkgStats'. - - -- FSYNC_VOL_STATS_VICEP (DAFS only) - -Retrieves per-partition stats from the fileserver for the partition -specified in sop->partName. Response payload consists of a 'struct -DiskPartitionStats64'. - - -- FSYNC_VOL_STATS_HASH (DAFS only) - -Retrieves hash chain stats for the hash bucket specified in -sop->hash_bucket. Response payload consists of a 'struct -VolumeHashChainStats'. 
- - -- FSYNC_VOL_STATS_HDR (DAFS only) - -Retrieves stats for the volume header cache. Response payload consists -of a 'struct volume_hdr_LRU_stats'. - - -- FSYNC_VOL_STATS_VLRU (DAFS only) - -This is intended to retrieve stats for the VLRU generation specified in -sop->vlru_generation. However, it is not yet implemented and currently -always results in a SYNC_BAD_COMMAND result from the fileserver. - - -- VGC update FSSYNC commands - -FSSYNC commands involving updating the VGC (volume group cache) take an -FSSYNC_VGUpdate_command struct as their command arguments. The parent -and child fields specify the (parent,child) entry in the partName VGC to -add or remove. - - -- FSYNC_VG_ADD (DAFS only) - -Adds an entry to the fileserver VGC. This merely adds the specified -child volume to the specified parent volume group, and creates the -parent volume group if it does not exist. This is used by programs that -create new volumes, in order to keep the VGC up to date. - - -- FSYNC_VG_DEL (DAFS only) - -Deletes an entry from the fileserver VGC. This merely removes the -specified child volume from the specified parent volume group, deleting -the volume group if the last entry was deleted. This is used by programs -that destroy volumes, in order to keep the VGC up to date. diff --git a/doc/examples/CellAlias b/doc/examples/CellAlias deleted file mode 100644 index f16ed3b50..000000000 --- a/doc/examples/CellAlias +++ /dev/null @@ -1,10 +0,0 @@ -# -# This file can be used to specify AFS cell aliases, one per line. -# The syntax to specify "my" as an alias for "my.cell.name" is: -# -# my.cell.name my - -#athena.mit.edu athena -#sipb.mit.edu sipb -#andrew.cmu.edu andrew -#transarc.ibm.com transarc diff --git a/doc/txt/README b/doc/txt/README new file mode 100644 index 000000000..4c4690f9f --- /dev/null +++ b/doc/txt/README @@ -0,0 +1,13 @@ + +- dafs-fsa.dot is a description of the finite-state machine for volume +states in the Demand Attach fileserver +- dafs-vnode-fsa.dot is a description of the finite-state machine +for vnodes in the Demand Attach fileserver. + +Both diagrams are in Dot (http://www.graphviz.org) format, +and can be converted to graphics formats via an +an invocation like: + + dot -Tsvg dafs-fsa.dot > dafs-fsa.svg + + diff --git a/doc/txt/README.linux-nfstrans b/doc/txt/README.linux-nfstrans deleted file mode 100644 index 901080f0a..000000000 --- a/doc/txt/README.linux-nfstrans +++ /dev/null @@ -1,270 +0,0 @@ -## Introduction - -This version works on Linux 2.6, and provides the following features: - -- Basic AFS/NFS translator functionality, similar to other platforms -- Ability to distinguish PAG's assigned within each NFS client -- A new 'afspag' kernel module, which provides PAG management on - NFS client systems, and forwards AFS system calls to the translator - system via the remote AFS system call (rmtsys) protocol. -- Support for transparent migration of an NFS client from one translator - server to another, without loss of credentials or sysnames. -- The ability to force the translator to discard all credentials - belonging to a specified NFS client host. - - -The patch applies to OpenAFS 1.4.1, and has been tested against the -kernel-2.6.9-22.0.2.EL kernel binaries as provided by the CentOS project -(essentially these are rebuilds from source of Red Hat Enterprise Linux). -This patch is not expected to apply cleanly to newer versions of OpenAFS, -due to conflicting changes in parts of the kernel module source. To apply -this patch, use 'patch -p0'. 
- -It has been integrated into OpenAFS 1.5.x. - -## New in Version 1.4 - -- There was no version 1.3 -- Define a "sysname generation number" which changes any time the sysname - list is changed for the translator or any client. This number is used - as the nanoseconds part of the mtime of directories, which forces NFS - clients to reevaluate directory lookups any time the sysname changes. -- Fixed several bugs related to sysname handling -- Fixed a bug preventing 'fs exportafs' from changing the flag which - controls whether callbacks are made to NFS clients to obtain tokens - and sysname lists. -- Starting in this version, when the PAG manager starts up, it makes a - call to the translator to discard any tokens belonging to that client. - This fixes a problem where newly-created PAG's on the client would - inherit tokens owned by an unrelated process from an earlier boot. -- Enabled the PAG manager to forward non-V-series pioctl's. -- Forward ported to OpenAFS 1.4.1 final -- Added a file, /proc/fs/openafs/unixusers, which reports information - about "unixuser" structures, which are used to record tokens and to - bind translator-side PAG's to NFS client data and sysname lists. - - -## Finding the RPC server authtab - -In order to correctly detect NFS clients and distinguish between them, -the translator must insert itself into the RPC authentication process. -This requires knowing the address of the RPC server authentication dispatch -table, which is not exported from standard kernels. To address this, the -kernel must be patched such that net/sunrpc/svcauth.c exports the 'authtab' -symbol, or this symbol's address must be provided when the OpenAFS kernel -module is loaded, using the option "authtab_addr=0xXXXXXXXX" where XXXXXXXX -is the address of the authtab symbol as obtained from /proc/kallsyms. The -latter may be accomplished by adding the following three lines to the -openafs-client init script in place of 'modprobe openafs': - - modprobe sunrpc - authtab=`awk '/[ \t]authtab[ \t]/ { print $1 }' < /proc/kallsyms` - modprobe openafs ${authtab:+authtab_addr=0x$authtab} - - -## Exporting the NFS filesystem - -In order for the translator to work correctly, /afs must be exported with -specific options. Specifically, the 'no_subtree_check' option is needed -in order to prevent the common NFS server code from performing unwanted -access checks, and an fsid option must be provided to set the filesystem -identifier to be used in NFS filehandles. Note that for live migration -to work, a consistent filesystem id must be used on all translator systems. -The export may be accomplished with a line in /etc/exports: - - /afs (rw,no_subtree_check,fsid=42) - -Or with a command: - - exportfs -o rw,no_subtree_check,fsid=42 :/afs - -The AFS/NFS translator code is enabled by default; no additional command -is required to activate it. However, the 'fs exportafs nfs' command can -be used to disable or re-enable the translator and to set options. Note -that support for client-assigned PAG's is not enabled by default, and -must be enabled with the following command: - - fs exportafs nfs -clipags on - -Support for making callbacks to obtain credentials and sysnames from -newly-discovered NFS clients is also not enabled by default, because this -would result in long timeouts on requests from NFS clients which do not -support this feature. 
To enable this feature, use the following command: - - fs exportafs nfs -pagcb on - - -## Client-Side PAG Management - -Management of PAG's on individual NFS clients is provided by the kernel -module afspag.ko, which is automatically built alongside the libafs.ko -module on Linux 2.6 systems. This component is not currently supported -on any other platform. - -To activate the client PAG manager, simply load the module; no additional -parameters or commands are required. Once the module is loaded, PAG's -may be acquired using the setpag() call, exactly as on systems running the -full cache manager. Both the traditional system call and new-style ioctl -entry points are supported. - -In addition, the PAG manager can forward pioctl() calls to an AFS/NFS -translator system via the remote AFS system call service (rmtsys). To -enable this feature, the kernel module must be loaded with a parameter -specifying the location of the translator system: - - insmod afspag.ko nfs_server_addr=0xAABBCCDD - -In this example, 0xAABBCCDD is the IP address of the translator system, -in network byte order. For example, if the translator has the IP address -192.168.42.100, the nfs_server_addr parameter should be set to 0xc0a82a64. - -The PAG manager can be shut down using 'afsd -shutdown' (ironically, this -is the only circumstance in which that command is useful). Once the -shutdown is complete, the kernel module can be removed using rmmod. - - -## Remote System Calls - -The NFS translator supports the ability of NFS clients to perform various -AFS-specific operations via the remote system call interface (rmtsys). -To enable this feature, afsd must be run with the -rmtsys switch. OpenAFS -client utilities will use this feature automatically if the AFSSERVER -environment variable is set to the address or hostname of the translator -system, or if the file ~/.AFSSERVER or /.AFSSERVER exists and contains the -translator's address or hostname. - -On systems running the client PAG manager (afspag.ko), AFS system calls -made via the traditional methods will be automatically forwarded to the -NFS translator system, if the PAG manager is configured to do so. This -feature must be enabled, as described above. - - -## Credential Caching - -The client PAG manager maintains a cache of credentials belonging to each -PAG. When an application makes a system call to set or remove AFS tokens, -the PAG manager updates its cache in addition to forwarding the request -to the NFS server. - -When the translator hears from a previously-unknown client, it makes a -callback to the client to retrieve a copy of any cached credentials. -This means that credentials belonging to an NFS client are not lost if -the translator is rebooted, or if the client's location on the network -changes such that it is talking to a different translator. - -This feature is automatically supported by the PAG manager if it has -been configured to forward system calls to an NFS translator. However, -requests will be honored only if they come from port 7001 on the NFS -translator host. In addition, this feature must be enabled on the NFS -translator system as described above. - - -## System Name List - -When the NFS translator hears from a new NFS client whose system name -list it does not know, it can make a callback to the client to discover -the correct system name list. This ability is enabled automatically -with credential caching and retrieval is enabled as described above. 
- -The PAG manager maintains a system-wide sysname list, which is used to -satisfy callback requests from the NFS translator. This list is set -initially to contain only the compiled-in default sysname, but can be -changed by the superuser using the VIOC_AFS_SYSNAME pioctl or the -'fs sysname' command. Any changes are automatically propagated to the -NFS translator. - - -## Dynamic Mount Points - -This patch introduces a special directory ".:mount", which can be found -directly below the AFS root directory. This directory always appears to -be empty, but any name of the form "cell:volume" will resolve to a mount -point for the specified volume. The resulting mount points are always -RW-path mount points, and so will resolve to an RW volume even if the -specified name refers to a replicated volume. However, the ".readonly" -and ".backup" suffixes can be used to refer to volumes of those types, -and a numeric volume ID will always be used as-is. - -This feature is required to enable the NFS translator to reconstruct a -reachable path for any valid filehandle presented by an NFS client. -Specifically, when the path reconstruction algorithm is walking upward -from a client-provided filehandle and encounters the root directory of -a volume which is no longer in the cache (and thus has no known mount -point), it will complete the path to the AFS root using the dynamic -mount directory. - -On non-linux cache managers, this feature is available when dynamic -root and fake stat modes are enabled. - -On Linux systems, it is also available even when dynroot is not enabled, -to support the NFS translator. It is presently not possible to disable -this feature, though that ability may be added in the future. It would -be difficult to make this feature unavailable to users and still make the -Linux NFS translator work, since the point of the check being performed -by the NFS server is to ensure the requested file would be reachable by -the client. - - -## Security - -The security of the NFS translator depends heavily on the underlying -network. Proper configuration is required to prevent unauthorized -access to files, theft of credentials, or other forms of attack. - -NFS, remote syscall, and PAG callback traffic between an NFS client host -and translator may contain sensitive file data and/or credentials, and -should be protected from snooping by unprivileged users or other hosts. - -Both the NFS translator and remote system call service authorize requests -in part based on the IP address of the requesting client. To prevent an -attacker from making requests on behalf of another host, the network must -be configured such that it is impossible for one client to spoof the IP -address of another. - -In addition, both the NFS translator and remote system call service -associate requests with specific users based on user and group ID data -contained within the request. In order to prevent users on the same client -from making filesystem access requests as each other, the NFS server must -be configured to accept requests only from privileged ports. In order to -prevent users from making AFS system calls on each other's behalf, possibly -including retrieving credentials, the network must be configured such that -requests to the remote system call service (port 7009) are accepted only -from port 7001 on NFS clients. - -When a client is migrated away from a translator, any credentials held -on behalf of that client must be discarded before that client's IP address -can safely be reused. 
The VIOC_NFS_NUKE_CREDS pioctl and 'fs nukenfscreds'
-command are provided for this purpose. Both take a single argument, which
-is the IP address of the NFS client whose credentials should be discarded.
-
-
-## Known Issues
-
- + Because NFS clients do not maintain active references on every inode
-   they are using, it is possible that portions of the directory tree
-   in use by an NFS client will expire from the translator's AFS and
-   Linux dentry caches. When this happens, the NFS server attempts to
-   reconstruct the missing portion of the directory tree, but may fail
-   if the client does not have sufficient access (for example, if his
-   tokens have expired). In these cases, a "stale NFS filehandle" error
-   will be generated. This behavior is similar to that found on other
-   translator platforms, but is triggered under a slightly different set
-   of circumstances due to differences in the architecture of the Linux
-   NFS server.
-
- + Due to limitations of the rmtsys protocol, some pioctl calls require
-   large (several KB) transfers between the client and rmtsys server.
-   Correcting this issue would require extensions to the rmtsys protocol
-   outside the scope of this project.
-
- + The rmtsys interface requires that AFS be mounted in the same place
-   on both the NFS client and translator system, or at least that the
-   translator be able to correctly resolve absolute paths provided by
-   the client.
-
- + If a client is migrated or an NFS translator host is unexpectedly
-   rebooted while AFS filesystem access is in progress, there may be
-   a short delay before the client recovers. This is because the NFS
-   client must time out any request it made to the old server before
-   it can retransmit the request, which will then be handled by the
-   new server. The same applies to remote system call requests.
diff --git a/doc/txt/arch-overview.h b/doc/txt/arch-overview.h
new file mode 100644
index 000000000..64c6cb834
--- /dev/null
+++ b/doc/txt/arch-overview.h
@@ -0,0 +1,1224 @@
+/*!
+ \addtogroup arch-overview Architectural Overview
+ \page title AFS-3 Programmer's Reference: Architectural Overview
+
+\author Edward R. Zayas
+Transarc Corporation
+\version 1.0
+\date 2 September 1991 22:53
+Copyright 1991 Transarc Corporation All Rights Reserved FS-00-D160
+
+
+ \page chap1 Chapter 1: Introduction
+
+ \section sec1-1 Section 1.1: Goals and Background
+
+\par
+This paper provides an architectural overview of Transarc's wide-area
+distributed file system, AFS. Specifically, it covers the current level of
+available software, the third-generation AFS-3 system. This document will
+explore the technological climate in which AFS was developed, the nature of
+problem(s) it addresses, and how its design attacks these problems in order to
+realize the inherent benefits in such a file system. It also examines a set of
+additional features for AFS, some of which are actively being considered.
+\par
+This document is a member of a reference suite providing programming
+specifications as to the operation of and interfaces offered by the various AFS
+system components. It is intended to serve as a high-level treatment of
+distributed file systems in general and of AFS in particular. This document
+should ideally be read before any of the others in the suite, as it provides
+the organizational and philosophical framework in which they may best be
+interpreted.
+
+ \section sec1-2 Section 1.2: Document Layout
+
+\par
+Chapter 2 provides a discussion of the technological background and
+developments that created the environment in which AFS and related systems were
+inspired. Chapter 3 examines the specific set of goals that AFS was designed to
+meet, given the possibilities created by personal computing and advances in
+communication technology. Chapter 4 presents the core AFS architecture and how
+it addresses these goals. Finally, Chapter 5 considers how AFS functionality
+may be improved by certain design changes.
+
+ \section sec1-3 Section 1.3: Related Documents
+
+\par
+The names of the other documents in the collection, along with brief summaries
+of their contents, are listed below.
+\li AFS-3 Programmer's Reference: File Server/Cache Manager Interface: This
+document describes the File Server and Cache Manager agents, which provide the
+backbone file management services for AFS. The collection of File Servers for a
+cell supplies centralized file storage for that site, and allows clients running
+the Cache Manager component to access those files in a high-performance, secure
+fashion.
+\li AFS-3 Programmer's Reference: Volume Server/Volume Location Server
+Interface: This document describes the services through which "containers" of
+related user data are located and managed.
+\li AFS-3 Programmer's Reference: Protection Server Interface: This paper
+describes the server responsible for mapping printable user names to and from
+their internal AFS identifiers. The Protection Server also allows users to
+create, destroy, and manipulate "groups" of users, which are suitable for
+placement on Access Control Lists (ACLs).
+\li AFS-3 Programmer's Reference: BOS Server Interface: This paper covers the
+"nanny" service which assists in the administrability of the AFS environment.
+\li AFS-3 Programmer's Reference: Specification for the Rx Remote Procedure Call
+Facility: This document specifies the design and operation of the remote
+procedure call and lightweight process packages used by AFS.
+
+ \page chap2 Chapter 2: Technological Background
+
+\par
+Certain changes in technology over the past two decades greatly influenced the
+nature of computational resources, and the manner in which they were used.
+These developments created the conditions under which the notion of a
+distributed file system (DFS) was born. This chapter describes these
+technological changes, and explores how a distributed file system attempts to
+capitalize on the new computing environment's strengths and minimize its
+disadvantages.
+
+ \section sec2-1 Section 2.1: Shift in Computational Idioms
+
+\par
+By the beginning of the 1980s, new classes of computing engines and new methods
+by which they may be interconnected were becoming firmly established. At this
+time, a shift was occurring away from the conventional mainframe-based,
+timeshared computing environment to one in which both workstation-class
+machines and the smaller personal computers (PCs) were a strong presence.
+\par
+The new environment offered many benefits to its users when compared with
+timesharing. These smaller, self-sufficient machines moved dedicated computing
+power and cycles directly onto people's desks. Personal machines were powerful
+enough to support a wide variety of applications, and allowed for a richer,
+more intuitive, more graphically-based interface for them. Learning curves were
+greatly reduced, cutting training costs and increasing new-employee
+productivity.
In addition, these machines provided a constant level of service +throughout the day. Since a personal machine was typically only executing +programs for a single human user, it did not suffer from timesharing's +load-based response time degradation. Expanding the computing services for an +organization was often accomplished by simply purchasing more of the relatively +cheap machines. Even small organizations could now afford their own computing +resources, over which they exercised full control. This provided more freedom +to tailor computing services to the specific needs of particular groups. +\par +However, many of the benefits offered by the timesharing systems were lost when +the computing idiom first shifted to include personal-style machines. One of +the prime casualties of this shift was the loss of the notion of a single name +space for all files. Instead, workstation-and PC-based environments each had +independent and completely disconnected file systems. The standardized +mechanisms through which files could be transferred between machines (e.g., +FTP) were largely designed at a time when there were relatively few large +machines that were connected over slow links. Although the newer multi-megabit +per second communication pathways allowed for faster transfers, the problem of +resource location in this environment was still not addressed. There was no +longer a system-wide file system, or even a file location service, so +individual users were more isolated from the organization's collective data. +Overall, disk requirements ballooned, since lack of a shared file system was +often resolved by replicating all programs and data to each machine that needed +it. This proliferation of independent copies further complicated the problem of +version control and management in this distributed world. Since computers were +often no longer behind locked doors at a computer center, user authentication +and authorization tasks became more complex. Also, since organizational +managers were now in direct control of their computing facilities, they had to +also actively manage the hardware and software upon which they depended. +\par +Overall, many of the benefits of the proliferation of independent, +personal-style machines were partially offset by the communication and +organizational penalties they imposed. Collaborative work and dissemination of +information became more difficult now that the previously unified file system +was fragmented among hundreds of autonomous machines. + + \section sec2-2 Section 2.2: Distributed File Systems + +\par +As a response to the situation outlined above, the notion of a distributed file +system (DFS) was developed. Basically, a DFS provides a framework in which +access to files is permitted regardless of their locations. Specifically, a +distributed file system offers a single, common set of file system operations +through which those accesses are performed. +\par +There are two major variations on the core DFS concept, classified according to +the way in which file storage is managed. These high-level models are defined +below. +\li Peer-to-peer: In this symmetrical model, each participating machine +provides storage for specific set of files on its own attached disk(s), and +allows others to access them remotely. Thus, each node in the DFS is capable of +both importing files (making reference to files resident on foreign machines) +and exporting files (allowing other machines to reference files located +locally). 
+\li Server-client: In this model, a set of machines designated as servers +provide the storage for all of the files in the DFS. All other machines, known +as clients, must direct their file references to these machines. Thus, servers +are the sole exporters of files in the DFS, and clients are the sole importers. + +\par +The notion of a DFS, whether organized using the peer-to-peer or server-client +discipline, may be used as a conceptual base upon which the advantages of +personal computing resources can be combined with the single-system benefits of +classical timeshared operation. +\par +Many distributed file systems have been designed and deployed, operating on the +fast local area networks available to connect machines within a single site. +These systems include DOMAIN [9], DS [15], RFS [16], and Sprite [10]. Perhaps +the most widespread of distributed file systems to date is a product from Sun +Microsystems, NFS [13] [14], extending the popular unix file system so that it +operates over local networks. + + \section sec2-3 Section 2.3: Wide-Area Distributed File Systems + +\par +Improvements in long-haul network technology are allowing for faster +interconnection bandwidths and smaller latencies between distant sites. +Backbone services have been set up across the country, and T1 (1.5 +megabit/second) links are increasingly available to a larger number of +locations. Long-distance channels are still at best approximately an order of +magnitude slower than the typical local area network, and often two orders of +magnitude slower. The narrowed difference between local-area and wide-area data +paths opens the window for the notion of a wide-area distributed file system +(WADFS). In a WADFS, the transparency of file access offered by a local-area +DFS is extended to cover machines across much larger distances. Wide-area file +system functionality facilitates collaborative work and dissemination of +information in this larger theater of operation. + + \page chap3 Chapter 3: AFS-3 Design Goals + + \section sec3-1 Section 3.1: Introduction + +\par +This chapter describes the goals for the AFS-3 system, the first commercial +WADFS in existence. +\par +The original AFS goals have been extended over the history of the project. The +initial AFS concept was intended to provide a single distributed file system +facility capable of supporting the computing needs of Carnegie Mellon +University, a community of roughly 10,000 people. It was expected that most CMU +users either had their own workstation-class machine on which to work, or had +access to such machines located in public clusters. After being successfully +implemented, deployed, and tuned in this capacity, it was recognized that the +basic design could be augmented to link autonomous AFS installations located +within the greater CMU campus. As described in Section 2.3, the long-haul +networking environment developed to a point where it was feasible to further +extend AFS so that it provided wide-area file service. The underlying AFS +communication component was adapted to better handle the widely-varying channel +characteristics encountered by intra-site and inter-site operations. +\par +A more detailed history of AFS evolution may be found in [3] and [18]. + + \section sec3-2 Section 3.2: System Goals + +\par +At a high level, the AFS designers chose to extend the single-machine unix +computing environment into a WADFS service. 
The unix system, in all of its +numerous incarnations, is an important computing standard, and is in very wide +use. Since AFS was originally intended to service the heavily unix-oriented CMU +campus, this decision served an important tactical purpose along with its +strategic ramifications. +\par +In addition, the server-client discipline described in Section 2.2 was chosen +as the organizational base for AFS. This provides the notion of a central file +store serving as the primary residence for files within a given organization. +These centrally-stored files are maintained by server machines and are made +accessible to computers running the AFS client software. +\par +Listed in the following sections are the primary goals for the AFS system. +Chapter 4 examines how the AFS design decisions, concepts, and implementation +meet this list of goals. + + \subsection sec3-2-1 Section 3.2.1: Scale + +\par +AFS differs from other existing DFSs in that it has the specific goal of +supporting a very large user community with a small number of server machines. +Unlike the rule-of-thumb ratio of approximately 20 client machines for every +server machine (20:1) used by Sun Microsystem's widespread NFS distributed file +system, the AFS architecture aims at smoothly supporting client/server ratios +more along the lines of 200:1 within a single installation. In addition to +providing a DFS covering a single organization with tens of thousands of users, +AFS also aims at allowing thousands of independent, autonomous organizations to +join in the single, shared name space (see Section 3.2.2 below) without a +centralized control or coordination point. Thus, AFS envisions supporting the +file system needs of tens of millions of users at interconnected yet autonomous +sites. + + \subsection sec3-2-2 Section 3.2.2: Name Space + +\par +One of the primary strengths of the timesharing computing environment is the +fact that it implements a single name space for all files in the system. Users +can walk up to any terminal connected to a timesharing service and refer to its +files by the identical name. This greatly encourages collaborative work and +dissemination of information, as everyone has a common frame of reference. One +of the major AFS goals is the extension of this concept to a WADFS. Users +should be able to walk up to any machine acting as an AFS client, anywhere in +the world, and use the identical file name to refer to a given object. +\par +In addition to the common name space, it was also an explicit goal for AFS to +provide complete access transparency and location transparency for its files. +Access transparency is defined as the system's ability to use a single +mechanism to operate on a file, regardless of its location, local or remote. +Location transparency is defined as the inability to determine a file's +location from its name. A system offering location transparency may also +provide transparent file mobility, relocating files between server machines +without visible effect to the naming system. + + \subsection sec3-2-3 Section 3.2.3: Performance + +\par +Good system performance is a critical AFS goal, especially given the scale, +client-server ratio, and connectivity specifications described above. The AFS +architecture aims at providing file access characteristics which, on average, +are similar to those of local disk performance. 
+ + \subsection sec3-2-4 Section 3.2.4: Security + +\par +A production WADFS, especially one which allows and encourages transparent file +access between different administrative domains, must be extremely conscious of +security issues. AFS assumes that server machines are "trusted" within their +own administrative domain, being kept behind locked doors and only directly +manipulated by reliable administrative personnel. On the other hand, AFS client +machines are assumed to exist in inherently insecure environments, such as +offices and dorm rooms. These client machines are recognized to be +unsupervisable, and fully accessible to their users. This situation makes AFS +servers open to attacks mounted by possibly modified client hardware, firmware, +operating systems, and application software. In addition, while an organization +may actively enforce the physical security of its own file servers to its +satisfaction, other organizations may be lax in comparison. It is important to +partition the system's security mechanism so that a security breach in one +administrative domain does not allow unauthorized access to the facilities of +other autonomous domains. +\par +The AFS system is targeted to provide confidence in the ability to protect +system data from unauthorized access in the above environment, where untrusted +client hardware and software may attempt to perform direct remote file +operations from anywhere in the world, and where levels of physical security at +remote sites may not meet the standards of other sites. + + \subsection sec3-2-5 Section 3.2.5: Access Control + +\par +The standard unix access control mechanism associates mode bits with every file +and directory, applying them based on the user's numerical identifier and the +user's membership in various groups. This mechanism was considered too +coarse-grained by the AFS designers. It was seen as insufficient for specifying +the exact set of individuals and groups which may properly access any given +file, as well as the operations these principals may perform. The unix group +mechanism was also considered too coarse and inflexible. AFS was designed to +provide more flexible and finer-grained control of file access, improving the +ability to define the set of parties which may operate on files, and what their +specific access rights are. + + \subsection sec3-2-6 Section 3.2.6: Reliability + +\par +The crash of a server machine in any distributed file system causes the +information it hosts to become unavailable to the user community. The same +effect is observed when server and client machines are isolated across a +network partition. Given the potential size of the AFS user community, a single +server crash could potentially deny service to a very large number of people. +The AFS design reflects a desire to minimize the visibility and impact of these +inevitable server crashes. + + \subsection sec3-2-7 Section 3.2.7: Administrability + +\par +Driven once again by the projected scale of AFS operation, one of the system's +goals is to offer easy administrability. With the large projected user +population, the amount of file data expected to be resident in the shared file +store, and the number of machines in the environment, a WADFS could easily +become impossible to administer unless its design allowed for easy monitoring +and manipulation of system resources. It is also imperative to be able to apply +security and access control mechanisms to the administrative interface. 
+
+ \subsection sec3-2-8 Section 3.2.8: Interoperability/Coexistence
+
+\par
+Many organizations currently employ other distributed file systems, most
+notably Sun Microsystems' NFS, which is also an extension of the basic
+single-machine unix system. It is unlikely that AFS will receive significant
+use if it cannot operate concurrently with other DFSs without mutual
+interference. Thus, coexistence with other DFSs is an explicit AFS goal.
+\par
+A related goal is to provide a way for other DFSs to interoperate with AFS to
+various degrees, allowing AFS file operations to be executed from these
+competing systems. This is advantageous, since it may extend the set of
+machines which are capable of interacting with the AFS community. Hardware
+platforms and/or operating systems to which AFS is not ported may thus be able
+to use their native DFS system to perform AFS file references.
+\par
+These two goals serve to extend AFS coverage, and to provide a migration path
+by which potential clients may sample AFS capabilities, and gain experience
+with AFS. This may result in data migration into native AFS systems, or the
+impetus to acquire a native AFS implementation.
+
+ \subsection sec3-2-9 Section 3.2.9: Heterogeneity/Portability
+
+\par
+It is important for AFS to operate on a large number of hardware platforms and
+operating systems, since a large community of unrelated organizations will most
+likely utilize a wide variety of computing environments. The size of the
+potential AFS user community will be unduly restricted if AFS executes on a
+small number of platforms. Not only must AFS support a largely heterogeneous
+computing base, it must also be designed to be easily portable to new hardware
+and software releases in order to maintain this coverage over time.
+
+ \page chap4 Chapter 4: AFS High-Level Design
+
+ \section sec4-1 Section 4.1: Introduction
+
+\par
+This chapter presents an overview of the system architecture for the AFS-3
+WADFS. Different treatments of the AFS system may be found in several
+documents, including [3], [4], [5], and [2]. Certain system features discussed
+here are examined in more detail in the set of accompanying AFS programmer
+specification documents.
+\par
+After the architectural overview, the system goals enumerated in Chapter 3 are
+revisited, and the contribution of the various AFS design decisions and
+resulting features is noted.
+
+ \section sec4-2 Section 4.2: The AFS System Architecture
+
+ \subsection sec4-2-1 Section 4.2.1: Basic Organization
+
+\par
+As stated in Section 3.2, a server-client organization was chosen for the AFS
+system. A group of trusted server machines provides the primary disk space for
+the central store managed by the organization controlling the servers. File
+system operation requests for specific files and directories arrive at server
+machines from machines running the AFS client software. If the client is
+authorized to perform the operation, then the server proceeds to execute it.
+\par
+In addition to this basic file access functionality, AFS server machines also
+provide related system services. These include authentication service, mapping
+between printable and numerical user identifiers, file location service, time
+service, and such administrative operations as disk management, system
+reconfiguration, and tape backup.
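+\par
+As an illustrative aside (not part of the original reference text): on a
+present-day OpenAFS server, the set of server processes described above can be
+inspected with the bos utility, for example (the machine name is hypothetical):
+\code
+# list the AFS server processes supervised on one file server machine
+bos status fs1.example.com -long
+\endcode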
+
+ \subsection sec4-2-2 Section 4.2.2: Volumes
+
+ \subsubsection sec4-2-2-1 Section 4.2.2.1: Definition
+
+\par
+Disk partitions used for AFS storage do not directly host individual user files
+and directories. Rather, connected subtrees of the system's directory structure
+are placed into containers called volumes. Volumes vary in size dynamically as
+the objects they house are inserted, overwritten, and deleted. Each volume has
+an associated quota, or maximum permissible storage. A single unix disk
+partition may thus host one or more volumes, and in fact may host as many
+volumes as physically fit in the storage space. However, the practical maximum
+is currently 3,500 volumes per disk partition. This limitation is imposed by
+the salvager program, which examines and repairs file system metadata
+structures.
+\par
+There are two ways to identify an AFS volume. The first option is a 32-bit
+numerical value called the volume ID. The second is a human-readable character
+string called the volume name.
+\par
+Internally, a volume is organized as an array of mutable objects, representing
+individual files and directories. The file system object associated with each
+index in this internal array is assigned a uniquifier and a data version
+number. A subset of these values is used to compose an AFS file identifier, or
+FID. FIDs are not normally visible to user applications, but rather are used
+internally by AFS. They consist of ordered triplets, whose components are the
+volume ID, the index within the volume, and the uniquifier for the index.
+\par
+To understand AFS FIDs, let us consider the case where index i in volume v
+refers to a file named example.txt. This file's uniquifier is currently set to
+one (1), and its data version number is currently set to zero (0). The AFS
+client software may then refer to this file with the following FID: (v, i, 1).
+The next time a client overwrites the object identified with the (v, i, 1) FID,
+the data version number for example.txt will be promoted to one (1). Thus, the
+data version number serves to distinguish between different versions of the
+same file. A higher data version number indicates a newer version of the file.
+\par
+Consider the result of deleting file (v, i, 1). This causes the body of
+example.txt to be discarded, and marks index i in volume v as unused. Should
+another program create a file, say a.out, within this volume, index i may be
+reused. If it is, the creation operation will bump the index's uniquifier to
+two (2), and the data version number is reset to zero (0). Any client caching a
+FID for the deleted example.txt file thus cannot affect the completely
+unrelated a.out file, since the uniquifiers differ.
+
+ \subsubsection sec4-2-2-2 Section 4.2.2.2: Attachment
+
+\par
+The connected subtrees contained within individual volumes are attached to
+their proper places in the file space defined by a site, forming a single,
+apparently seamless unix tree. These attachment points are called mount points.
+These mount points are persistent file system objects, implemented as symbolic
+links whose contents obey a stylized format. Thus, AFS mount points differ from
+NFS-style mounts. In the NFS environment, the user dynamically mounts entire
+remote disk partitions using any desired name. These mounts do not survive
+client restarts, and do not insure a uniform namespace between different
+machines.
+\par
+A single volume is chosen as the root of the AFS file space for a given
+organization.
By convention, this volume is named root.afs. Each client machine
+belonging to this organization performs a unix mount() of this root volume (not
+to be confused with an AFS mount point) on its empty /afs directory, thus
+attaching the entire AFS name space at this point.
+
+ \subsubsection sec4-2-2-3 Section 4.2.2.3: Administrative Uses
+
+\par
+Volumes serve as the administrative unit for AFS file system data, providing
+the basis for replication, relocation, and backup operations.
+
+ \subsubsection sec4-2-2-4 Section 4.2.2.4: Replication
+
+\par
+Read-only snapshots of AFS volumes may be created by administrative personnel.
+These clones may be deployed on up to eight disk partitions, on the same server
+machine or across different servers. Each clone has the identical volume ID,
+which must differ from that of its read-write parent. Thus, at most one clone of
+any given volume v may reside on a given disk partition. File references to this
+read-only clone volume may be serviced by any of the servers which host a copy.
+
+ \subsubsection sec4-2-2-5 Section 4.2.2.5: Backup
+
+\par
+Volumes serve as the unit of tape backup and restore operations. Backups are
+accomplished by first creating an on-line backup volume for each volume to be
+archived. This backup volume is organized as a copy-on-write shadow of the
+original volume, capturing the volume's state at the instant that the backup
+took place. Thus, the backup volume may be envisioned as being composed of a
+set of object pointers back to the original image. The first update operation
+on the file located in index i of the original volume triggers the
+copy-on-write association. This causes the file's contents at the time of the
+snapshot to be physically written to the backup volume before the newer version
+of the file is stored in the parent volume.
+\par
+Thus, AFS on-line backup volumes typically consume little disk space. On
+average, they are composed mostly of links and to a lesser extent the bodies of
+those few files which have been modified since the last backup took place.
+Also, the system does not have to be shut down to insure the integrity of the
+backup images. Dumps are generated from the unchanging backup volumes, and are
+transferred to tape at any convenient time before the next backup snapshot is
+performed.
+
+ \subsubsection sec4-2-2-6 Section 4.2.2.6: Relocation
+
+\par
+Volumes may be moved transparently between disk partitions on a given file
+server, or between different file server machines. The transparency of volume
+motion comes from the fact that neither the user-visible names for the files
+nor the internal AFS FIDs contain server-specific location information.
+\par
+Interruption to file service while a volume move is being executed is typically
+on the order of a few seconds, regardless of the amount of data contained
+within the volume. This derives from the staged algorithm used to move a volume
+to a new server. First, a dump is taken of the volume's contents, and this
+image is installed at the new site. The second stage involves actually locking
+the original volume, taking an incremental dump to capture file updates since
+the first stage. The third stage installs the changes at the new site, and the
+fourth stage deletes the original volume. Further references to this volume
+will resolve to its new location.
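+\par
+As an illustrative aside (not part of the original reference text): a volume
+relocation of the kind described above is normally driven with the vos
+utility; a sketch of such an invocation, with hypothetical volume, server, and
+partition names, might be:
+\code
+# move volume user.erz from partition /vicepa on fs1 to /vicepb on fs2
+vos move user.erz fs1.example.com a fs2.example.com b
+\endcode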
+
+ \subsection sec4-2-3 Section 4.2.3: Authentication
+
+\par
+AFS uses the Kerberos [22] [23] authentication system developed at MIT's
+Project Athena to provide reliable identification of the principals attempting
+to operate on the files in its central store. Kerberos provides for mutual
+authentication, not only assuring AFS servers that they are interacting with
+the stated user, but also assuring AFS clients that they are dealing with the
+proper server entities and not imposters. Authentication information is
+mediated through the use of tickets. Clients register passwords with the
+authentication system, and use those passwords during authentication sessions
+to secure these tickets. A ticket is an object which contains an encrypted
+version of the user's name and other information. The file server machines may
+request a caller to present their ticket in the course of a file system
+operation. If the file server can successfully decrypt the ticket, then it
+knows that it was created and delivered by the authentication system, and may
+trust that the caller is the party identified within the ticket.
+\par
+Such subjects as mutual authentication, encryption and decryption, and the use
+of session keys are complex ones. Readers are directed to the above references
+for a complete treatment of Kerberos-based authentication.
+
+ \subsection sec4-2-4 Section 4.2.4: Authorization
+
+ \subsubsection sec4-2-4-1 Section 4.2.4.1: Access Control Lists
+
+\par
+AFS implements per-directory Access Control Lists (ACLs) to improve the ability
+to specify which sets of users have access to the files within the directory,
+and which operations they may perform. ACLs are used in addition to the
+standard unix mode bits. ACLs are organized as lists of one or more (principal,
+rights) pairs. A principal may be either the name of an individual user or a
+group of individual users. There are seven expressible rights, as listed below.
+\li Read (r): The ability to read the contents of the files in a directory.
+\li Lookup (l): The ability to look up names in a directory.
+\li Write (w): The ability to create new files and overwrite the contents of
+existing files in a directory.
+\li Insert (i): The ability to insert new files in a directory, but not to
+overwrite existing files.
+\li Delete (d): The ability to delete files in a directory.
+\li Lock (k): The ability to acquire and release advisory locks on a given
+directory.
+\li Administer (a): The ability to change a directory's ACL.
+
+ \subsubsection sec4-2-4-2 Section 4.2.4.2: AFS Groups
+
+\par
+AFS users may create a certain number of groups, differing from the standard
+unix notion of group. These AFS groups are objects that may be placed on ACLs,
+and simply contain a list of AFS user names that are to be treated identically
+for authorization purposes. For example, user erz may create a group called
+erz:friends consisting of the kazar, vasilis, and mason users. Should erz wish
+to grant read, lookup, and insert rights to this group in directory d, he
+should create an entry reading (erz:friends, rli) in d's ACL.
+\par
+AFS offers three special, built-in groups, as described below.
+\par
+1. system:anyuser: Any individual who accesses AFS files is considered by the
+system to be a member of this group, whether or not they hold an authentication
+ticket. This group is unusual in that it doesn't have a stable membership. In
+fact, it doesn't have an explicit list of members.
Instead, the system:anyuser
+"membership" grows and shrinks as file accesses occur, with users being
+(conceptually) added and deleted automatically as they interact with the
+system.
+\par
+The system:anyuser group is typically put on the ACL of those directories for
+which some specific level of completely public access is desired, covering any
+user at any AFS site.
+\par
+2. system:authuser: Any individual in possession of a valid Kerberos ticket
+minted by the organization's authentication service is treated as a member of
+this group. Just as with system:anyuser, this special group does not have a
+stable membership. If a user acquires a ticket from the authentication service,
+they are automatically "added" to the group. If the ticket expires or is
+discarded by the user, then the given individual will automatically be
+"removed" from the group.
+\par
+The system:authuser group is usually put on the ACL of those directories for
+which some specific level of intra-site access is desired. Anyone holding a
+valid ticket within the organization will be allowed to perform the set of
+accesses specified by the ACL entry, regardless of their precise individual ID.
+\par
+3. system:administrators: This built-in group defines the set of users capable
+of performing certain important administrative operations within the cell.
+Members of this group have explicit 'a' (ACL administration) rights on every
+directory's ACL in the organization. Members of this group are the only ones
+which may legally issue administrative commands to the file server machines
+within the organization. This group is not like the other two described above
+in that it does have a stable membership, where individuals are added and
+deleted from the group explicitly.
+\par
+The system:administrators group is typically put on the ACL of those
+directories which contain sensitive administrative information, or on those
+places where only administrators are allowed to make changes. All members of
+this group have implicit rights to change the ACL on any AFS directory within
+their organization. Thus, they don't have to actually appear on an ACL, or have
+'a' rights enabled in their ACL entry if they do appear, to be able to modify
+the ACL.
+
+ \subsection sec4-2-5 Section 4.2.5: Cells
+
+\par
+A cell is the set of server and client machines managed and operated by an
+administratively independent organization, as fully described in the original
+proposal [17] and specification [18] documents. The cell's administrators make
+decisions concerning such issues as server deployment and configuration, user
+backup schedules, and replication strategies on their own hardware and disk
+storage completely independently from those implemented by other cell
+administrators regarding their own domains. Every client machine belongs to
+exactly one cell, and uses that information to determine where to obtain
+default system resources and services.
+\par
+The cell concept allows autonomous sites to retain full administrative control
+over their facilities while allowing them to collaborate in the establishment
+of a single, common name space composed of the union of their individual name
+spaces. By convention, any file name beginning with /afs is part of this shared
+global name space and can be used at any AFS-capable machine. The original
+mount point concept was modified to contain cell information, allowing volumes
+housed in foreign cells to be mounted in the file space.
Again by convention, +the top-level /afs directory contains a mount point to the root.cell volume for +each cell in the AFS community, attaching their individual file spaces. Thus, +the top of the data tree managed by cell xyz is represented by the /afs/xyz +directory. +\par +Creating a new AFS cell is straightforward, with the operation taking three +basic steps: +\par +1. Name selection: A prospective site has to first select a unique name for +itself. Cell name selection is inspired by the hierarchical Domain naming +system. Domain-style names are designed to be assignable in a completely +decentralized fashion. Example cell names are transarc.com, ssc.gov, and +umich.edu. These names correspond to the AFS installations at Transarc +Corporation in Pittsburgh, PA, the Superconducting Supercollider Lab in Dallas, +TX, and the University of Michigan at Ann Arbor, MI. respectively. +\par +2. Server installation: Once a cell name has been chosen, the site must bring +up one or more AFS file server machines, creating a local file space and a +suite of local services, including authentication (Section 4.2.6.4) and volume +location (Section 4.2.6.2). +\par +3. Advertise services: In order for other cells to discover the presence of the +new site, it must advertise its name and which of its machines provide basic +AFS services such as authentication and volume location. An established site +may then record the machines providing AFS system services for the new cell, +and then set up its mount point under /afs. By convention, each cell places the +top of its file tree in a volume named root.cell. + + \subsection sec4-2-6 Section 4.2.6: Implementation of Server +Functionality + +\par +AFS server functionality is implemented by a set of user-level processes which +execute on server machines. This section examines the role of each of these +processes. + + \subsubsection sec4-2-6-1 Section 4.2.6.1: File Server + +\par +This AFS entity is responsible for providing a central disk repository for a +particular set of files within volumes, and for making these files accessible +to properly-authorized users running on client machines. + + \subsubsection sec4-2-6-2 Section 4.2.6.2: Volume Location Server + +\par +The Volume Location Server maintains and exports the Volume Location Database +(VLDB). This database tracks the server or set of servers on which volume +instances reside. Among the operations it supports are queries returning volume +location and status information, volume ID management, and creation, deletion, +and modification of VLDB entries. +\par +The VLDB may be replicated to two or more server machines for availability and +load-sharing reasons. A Volume Location Server process executes on each server +machine on which a copy of the VLDB resides, managing that copy. + + \subsubsection sec4-2-6-3 Section 4.2.6.3: Volume Server + +\par +The Volume Server allows administrative tasks and probes to be performed on the +set of AFS volumes residing on the machine on which it is running. These +operations include volume creation and deletion, renaming volumes, dumping and +restoring volumes, altering the list of replication sites for a read-only +volume, creating and propagating a new read-only volume image, creation and +update of backup volumes, listing all volumes on a partition, and examining +volume status. + + \subsubsection sec4-2-6-4 Section 4.2.6.4: Authentication Server + +\par +The AFS Authentication Server maintains and exports the Authentication Database +(ADB). 
This database tracks the encrypted passwords of the cell's users. The
+Authentication Server interface allows operations that manipulate ADB entries.
+It also implements the Kerberos mutual authentication protocol, supplying the
+appropriate identification tickets to successful callers.
+\par
+The ADB may be replicated to two or more server machines for availability and
+load-sharing reasons. An Authentication Server process executes on each server
+machine on which a copy of the ADB resides, managing that copy.
+
+ \subsubsection sec4-2-6-5 Section 4.2.6.5: Protection Server
+
+\par
+The Protection Server maintains and exports the Protection Database (PDB),
+which maps between printable user and group names and their internal numerical
+AFS identifiers. The Protection Server also allows callers to create, destroy,
+query ownership and membership, and generally manipulate AFS user and group
+records.
+\par
+The PDB may be replicated to two or more server machines for availability and
+load-sharing reasons. A Protection Server process executes on each server
+machine on which a copy of the PDB resides, managing that copy.
+
+ \subsubsection sec4-2-6-6 Section 4.2.6.6: BOS Server
+
+\par
+The BOS Server is an administrative tool which runs on each file server machine
+in a cell. This server is responsible for monitoring the health of the AFS
+agent processes on that machine. The BOS Server brings up the chosen set of
+AFS agents in the proper order after a system reboot, answers requests as to
+their status, and restarts them when they fail. It also accepts commands to
+start, suspend, or resume these processes, and to install new server binaries.
+
+ \subsubsection sec4-2-6-7 Section 4.2.6.7: Update Server/Client
+
+\par
+The Update Server and Update Client programs are used to distribute important
+system files and server binaries. For example, consider the case of
+distributing a new File Server binary to the set of Sparcstation server
+machines in a cell. One of the Sparcstation servers is declared to be the
+distribution point for its machine class, and is configured to run an Update
+Server. The new binary is installed in the appropriate local directory on that
+Sparcstation distribution point. Each of the other Sparcstation servers runs an
+Update Client instance, which periodically polls the proper Update Server. The
+new File Server binary will be detected and copied over to the client. Thus,
+new server binaries need only be installed manually once per machine type, and
+the distribution to like server machines will occur automatically.
+
+ \subsection sec4-2-7 Section 4.2.7: Implementation of Client
+Functionality
+
+ \subsubsection sec4-2-7-1 Section 4.2.7.1: Introduction
+
+\par
+The portion of the AFS WADFS which runs on each client machine is called the
+Cache Manager. This code, running within the client's kernel, is a user's
+representative in communicating and interacting with the File Servers. The
+Cache Manager's primary responsibility is to create the illusion that the
+remote AFS file store resides on the client machine's local disk(s).
+\par
+As implied by its name, the Cache Manager supports this illusion by maintaining
+a cache of files referenced from the central AFS store on the machine's local
+disk. All file operations executed by client application programs on files
+within the AFS name space are handled by the Cache Manager and are realized on
+these cached images.
Client-side AFS references are directed to the Cache
+Manager via the standard VFS and vnode file system interfaces pioneered and
+advanced by Sun Microsystems [21]. The Cache Manager stores and fetches files
+to and from the shared AFS repository as necessary to satisfy these operations.
+It is responsible for parsing unix pathnames on open() operations and mapping
+each component of the name to the File Server or group of File Servers that
+house the matching directory or file.
+\par
+The Cache Manager has additional responsibilities. It also serves as a reliable
+repository for the user's authentication information, holding on to their
+tickets and wielding them as necessary when challenged during File Server
+interactions. It caches volume location information gathered from probes to the
+VLDB, and keeps the client machine's local clock synchronized with a reliable
+time source.
+
+ \subsubsection sec4-2-7-2 Section 4.2.7.2: Chunked Access
+
+\par
+In previous AFS incarnations, whole-file caching was performed. Whenever an AFS
+file was referenced, the entire contents of the file were stored on the
+client's local disk. This approach had several disadvantages. One problem was
+that no file larger than the amount of disk space allocated to the client's
+local cache could be accessed.
+\par
+AFS-3 supports chunked file access, allowing individual 64 kilobyte pieces to
+be fetched and stored. Chunking allows AFS files of any size to be accessed
+from a client. The chunk size is settable at each client machine, but the
+default chunk size of 64K was chosen so that most unix files would fit within a
+single chunk.
+
+ \subsubsection sec4-2-7-3 Section 4.2.7.3: Cache Management
+
+\par
+The use of a file cache by the AFS client-side code, as described above, raises
+the thorny issue of cache consistency. Each client must efficiently determine
+whether its cached file chunks are identical to the corresponding sections of
+the file as stored at the server machine before allowing a user to operate on
+those chunks.
+\par
+AFS employs the notion of a callback as the backbone of its cache consistency
+algorithm. When a server machine delivers one or more chunks of a file to a
+client, it also includes a callback "promise" that the client will be notified
+if any modifications are made to the data in the file at the server. Thus, as
+long as the client machine is in possession of a callback for a file, it knows
+it is correctly synchronized with the centrally-stored version, and allows its
+users to operate on it as desired without any further interaction with the
+server. Before a file server stores a more recent version of a file on its own
+disks, it will first break all outstanding callbacks on this item. A callback
+will eventually time out, even if there are no changes to the file or directory
+it covers.
+
+ \subsection sec4-2-8 Section 4.2.8: Communication Substrate: Rx
+
+\par
+All AFS system agents employ remote procedure call (RPC) interfaces. Thus,
+servers may be queried and operated upon regardless of their location.
+\par
+The Rx RPC package is used by all AFS agents to provide a high-performance,
+multi-threaded, and secure communication mechanism. The Rx protocol is
+adaptive, conforming itself to widely varying network communication media
+encountered by a WADFS. It allows user applications to define and insert their
+own security modules, allowing them to execute the precise end-to-end
+authentication algorithms required to suit their specific needs and goals.
Rx
+offers two built-in security modules. The first is the null module, which does
+not perform any encryption or authentication checks. The second built-in
+security module is rxkad, which utilizes Kerberos authentication.
+\par
+Although pervasive throughout the AFS distributed file system, all of its
+agents, and many of its standard application programs, Rx is entirely separable
+from AFS and does not depend on any of its features. In fact, Rx can be used to
+build applications engaging in RPC-style communication under a variety of
+unix-style file systems. There are in-kernel and user-space implementations of
+the Rx facility, with both sharing the same interface.
+
+ \subsection sec4-2-9 Section 4.2.9: Database Replication: ubik
+
+\par
+The three AFS system databases (VLDB, ADB, and PDB) may be replicated to
+multiple server machines to improve their availability and share access loads
+among the replication sites. The ubik replication package is used to implement
+this functionality. A full description of ubik and of the quorum completion
+algorithm it implements may be found in [19] and [20].
+\par
+The basic abstraction provided by ubik is that of a disk file replicated to
+multiple server locations. One machine is considered to be the synchronization
+site, handling all write operations on the database file. Read operations may
+be directed to any of the active members of the quorum, namely a subset of the
+replication sites large enough to insure integrity across such failures as
+individual server crashes and network partitions. All of the quorum members
+participate in regular elections to determine the current synchronization site.
+The ubik algorithms allow server machines to enter and exit the quorum in an
+orderly and consistent fashion.
+\par
+All operations to one of these replicated "abstract files" are performed as
+part of a transaction. If all the related operations performed under a
+transaction are successful, then the transaction is committed, and the changes
+are made permanent. Otherwise, the transaction is aborted, and all of the
+operations for that transaction are undone.
+\par
+Like Rx, the ubik facility may be used by client applications directly. Thus,
+user applications may easily implement the notion of a replicated disk file in
+this fashion.
+
+ \subsection sec4-2-10 Section 4.2.10: System Management
+
+\par
+There are several AFS features aimed at facilitating system management. Some of
+these features have already been mentioned, such as volumes, the BOS Server,
+and the pervasive use of secure RPCs throughout the system to perform
+administrative operations from any AFS client machine in the worldwide
+community. This section covers additional AFS features and tools that assist in
+making the system easier to manage.
+
+ \subsubsection sec4-2-10-1 Section 4.2.10.1: Intelligent Access
+Programs
+
+\par
+A set of intelligent user-level applications were written so that the AFS
+system agents could be more easily queried and controlled. These programs
+accept user input, then translate the caller's instructions into the proper
+RPCs to the responsible AFS system agents, in the proper order.
+\par
+An example of this class of AFS application programs is vos, which mediates
+access to the Volume Server and the Volume Location Server agents. Consider the
+vos move operation, which results in a given volume being moved from one site
+to another. The Volume Server does not support a complex operation like a
+volume move directly.
In fact, this move operation involves the Volume Servers
+at the current and new machines, as well as the Volume Location Server, which
+tracks volume locations. Volume moves are accomplished by a combination of full
+and incremental volume dump and restore operations, and a VLDB update. The vos
+move command issues the necessary RPCs in the proper order, and attempts to
+recover from errors at each of the steps.
+\par
+The end result is that the AFS interface presented to system administrators is
+much simpler and more powerful than that offered by the raw RPC interfaces
+themselves. The learning curve for administrative personnel is thus flattened.
+Also, automatic execution of complex system operations is more likely to be
+successful, free from human error.
+
+ \subsubsection sec4-2-10-2 Section 4.2.10.2: Monitoring Interfaces
+
+\par
+The various AFS agent RPC interfaces provide calls which allow for the
+collection of system status and performance data. This data may be displayed by
+such programs as scout, which graphically depicts File Server performance
+numbers and disk utilizations. Such monitoring capabilities allow for quick
+detection of system problems. They also support detailed performance analyses,
+which may indicate the need to reconfigure system resources.
+
+ \subsubsection sec4-2-10-3 Section 4.2.10.3: Backup System
+
+\par
+A special backup system has been designed and implemented for AFS, as described
+in [6]. It is not sufficient to simply dump the contents of all File Server
+partitions onto tape, since volumes are mobile, and need to be tracked
+individually. The AFS backup system allows hierarchical dump schedules to be
+built based on volume names. It generates the appropriate RPCs to create the
+required backup volumes and to dump these snapshots to tape. A database is used
+to track the backup status of system volumes, along with the set of tapes on
+which backups reside.
+
+ \subsection sec4-2-11 Section 4.2.11: Interoperability
+
+\par
+Since the client portion of the AFS software is implemented as a standard
+VFS/vnode file system object, AFS can be installed into client kernels and
+utilized without interference with other VFS-style file systems, such as
+vanilla unix and the NFS distributed file system.
+\par
+Certain machines either cannot or choose not to run the AFS client software
+natively. If these machines run NFS, it is still possible to access AFS files
+through a protocol translator. The NFS-AFS Translator may be run on any machine
+at the given site that runs both NFS and the AFS Cache Manager. All of the NFS
+machines that wish to access the AFS shared store proceed to NFS-mount the
+translator's /afs directory. File references generated at the NFS-based
+machines are received at the translator machine, which is acting in its
+capacity as an NFS server. The file data is actually obtained when the
+translator machine issues the corresponding AFS references in its role as an
+AFS client.
+
+ \section sec4-3 Section 4.3: Meeting AFS Goals
+
+\par
+The AFS WADFS design, as described in this chapter, serves to meet the system
+goals stated in Chapter 3. This section revisits each of these AFS goals, and
+identifies the specific architectural constructs that bear on them.
+
+ \subsection sec4-3-1 Section 4.3.1: Scale
+
+\par
+To date, AFS has been deployed to over 140 sites world-wide, with approximately
+60 of these cells visible on the public Internet. AFS sites are currently
+operating in several European countries, in Japan, and in Australia.
While many
+sites are modest in size, certain cells contain more than 30,000 accounts. AFS
+sites have realized client/server ratios in excess of the targeted 200:1.
+
+ \subsection sec4-3-2 Section 4.3.2: Name Space
+
+\par
+A single uniform name space has been constructed across all cells in the
+greater AFS user community. Any pathname beginning with /afs may indeed be used
+at any AFS client. A set of common conventions regarding the organization of
+the top-level /afs directory and several directories below it have been
+established. These conventions also assist in the location of certain per-cell
+resources, such as AFS configuration files.
+\par
+Both access transparency and location transparency are supported by AFS, as
+evidenced by the common access mechanisms and by the ability to transparently
+relocate volumes.
+
+ \subsection sec4-3-3 Section 4.3.3: Performance
+
+\par
+AFS employs caching extensively at all levels to reduce the cost of "remote"
+references. Measured data cache hit ratios are very high, often over 95%. This
+indicates that the file images kept on local disk are very effective in
+satisfying the set of remote file references generated by clients. The
+introduction of file system callbacks has also been demonstrated to be very
+effective in the efficient implementation of cache synchronization. Replicating
+files and system databases across multiple server machines distributes load
+among the given servers. The Rx RPC subsystem has operated successfully at
+network speeds ranging from 19.2 kilobytes/second to experimental
+gigabit/second FDDI networks.
+\par
+Even at the intra-site level, AFS has been shown to deliver good performance,
+especially in high-load situations. One often-quoted study [1] compared the
+performance of an older version of AFS with that of NFS on a large file system
+task named the Andrew Benchmark. While NFS sometimes outperformed AFS at low
+load levels, its performance fell off rapidly at higher loads, while AFS
+performance degraded much more gracefully.
+
+ \subsection sec4-3-4 Section 4.3.4: Security
+
+\par
+The use of Kerberos as the AFS authentication system fits the security goal
+nicely. Access to AFS files from untrusted client machines is predicated on the
+caller's possession of the appropriate Kerberos ticket(s). Setting up per-site,
+Kerberos-based authentication services compartmentalizes any security breach to
+the cell which was compromised. Since the Cache Manager will store multiple
+tickets for its users, they may take on different identities depending on the
+set of file servers being accessed.
+
+ \subsection sec4-3-5 Section 4.3.5: Access Control
+
+\par
+AFS extends the standard unix authorization mechanism with per-directory Access
+Control Lists. These ACLs allow specific AFS principals and groups of these
+principals to be granted a wide variety of rights on the associated files.
+Users may create and manipulate AFS group entities without administrative
+assistance, and place these tailored groups on ACLs.
+
+ \subsection sec4-3-6 Section 4.3.6: Reliability
+
+\par
+A subset of file server crashes are masked by the use of read-only replication
+on volumes containing slowly-changing files. Availability of important,
+frequently-used programs such as editors and compilers may thus be greatly
+improved. Since the level of replication may be chosen per volume, and easily
+changed, each site may decide the proper replication levels for certain
+programs and/or data.
+Similarly, replicated system databases help to maintain service in the face of
+server crashes and network partitions.
+
+ \subsection sec4-3-7 Section 4.3.7: Administrability
+
+\par
+Such features as pervasive, secure RPC interfaces to all AFS system components,
+volumes, overseer processes for monitoring and management of file system
+agents, intelligent user-level access tools, interface routines providing
+performance and statistics information, and an automated backup service
+tailored to a volume-based environment all contribute to the administrability
+of the AFS system.
+
+ \subsection sec4-3-8 Section 4.3.8: Interoperability/Coexistence
+
+\par
+Due to its VFS-style implementation, the AFS client code may be easily
+installed in the machine's kernel, and may service file requests without
+interfering in the operation of any other installed file system. Machines
+either not capable of running AFS natively or choosing not to do so may still
+access AFS files via NFS with the help of a protocol translator agent.
+
+ \subsection sec4-3-9 Section 4.3.9: Heterogeneity/Portability
+
+\par
+As most modern kernels use a VFS-style interface to support their native file
+systems, AFS may usually be ported to a new hardware and/or software
+environment in a relatively straightforward fashion. Such ease of porting
+allows AFS to run on a wide variety of platforms.
+
+ \page chap5 Chapter 5: Future AFS Design Refinements
+
+ \section sec5-1 Section 5.1: Overview
+
+\par
+The current AFS WADFS design and implementation provides a high-performance,
+scalable, secure, and flexible computing environment. However, there is room
+for improvement on a variety of fronts. This chapter considers a set of topics,
+examining the shortcomings of the current AFS system and considering how
+additional functionality may be fruitfully constructed.
+\par
+Many of these areas are already being addressed in the next-generation AFS
+system which is being built as part of Open Software Foundation's (OSF)
+Distributed Computing Environment [7] [8].
+
+ \section sec5-2 Section 5.2: unix Semantics
+
+\par
+Any distributed file system which extends the unix file system model to include
+remote file accesses presents its application programs with failure modes which
+do not exist in a single-machine unix implementation. This semantic difference
+is difficult to mask.
+\par
+The current AFS design varies from pure unix semantics in other ways. In a
+single-machine unix environment, modifications made to an open file are
+immediately visible to other processes with open file descriptors to the same
+file. AFS does not reproduce this behavior when programs on different machines
+access the same file. Changes made to one cached copy of the file are not made
+immediately visible to other cached copies. The changes are only made visible
+to other access sites when a modified version of a file is stored back to the
+server providing its primary disk storage. Thus, one client's changes may be
+entirely overwritten by another client's modifications. The situation is
+further complicated by the possibility that dirty file chunks may be flushed
+out to the File Server before the file is closed.
+\par
+The version of AFS created for the OSF offering extends the current, untyped
+callback notion to a set of multiple, independent synchronization guarantees.
+These synchronization tokens allow functionality not offered by AFS-3,
+including byte-range mandatory locking, exclusive file opens, and read and
+write privileges over portions of a file.
+
+ \section sec5-3 Section 5.3: Improved Name Space Management
+
+\par
+Discovery of new AFS cells and their integration into each existing cell's name
+space is a completely manual operation in the current system. As the rate of
+new cell creations increases, the load imposed on system administrators also
+increases. Also, representing each cell's file space entry as a mount point
+object in the /afs directory leads to a potential problem. As the number of
+entries in the /afs directory increases, search time through the directory also
+grows.
+\par
+One improvement to this situation is to implement the top-level /afs directory
+through a Domain-style database. The database would map cell names to the set
+of server machines providing authentication and volume location services for
+that cell. The Cache Manager would query the cell database in the course of
+pathname resolution, and cache its lookup results.
+\par
+In this database-style environment, adding a new cell entry under /afs is
+accomplished by creating the appropriate database entry. The new cell
+information is then immediately accessible to all AFS clients.
+
+ \section sec5-4 Section 5.4: Read/Write Replication
+
+\par
+The AFS-3 servers and databases are currently equipped to handle read/only
+replication exclusively. However, other distributed file systems have
+demonstrated the feasibility of providing full read/write replication of data
+in environments very similar to AFS [11]. Such systems can serve as models for
+the set of required changes.
+
+ \section sec5-5 Section 5.5: Disconnected Operation
+
+\par
+Several facilities are provided by AFS so that server failures and network
+partitions may be completely or partially masked. However, AFS does not provide
+for completely disconnected operation of file system clients. Disconnected
+operation is a mode in which a client continues to access critical data during
+accidental or intentional inability to access the shared file repository. After
+some period of autonomous operation on the set of cached files, the client
+reconnects with the repository and resynchronizes the contents of its cache
+with the shared store.
+\par
+Studies of related systems provide evidence that such disconnected operation is
+feasible [11] [12]. Such a capability may be explored for AFS.
+
+ \section sec5-6 Section 5.6: Multiprocessor Support
+
+\par
+The LWP lightweight thread package used by all AFS system processes assumes
+that individual threads may execute non-preemptively, and that all other
+threads are quiescent until control is explicitly relinquished from within the
+currently active thread. These assumptions conspire to prevent AFS from
+operating correctly on a multiprocessor platform.
+\par
+A solution to this restriction is to restructure the AFS code organization so
+that the proper locking is performed. Thus, critical sections which were
+previously only implicitly defined are explicitly specified.
+
+ \page biblio Bibliography
+
+\li [1] John H. Howard, Michael L. Kazar, Sherri G. Menees, David A. Nichols,
+M. Satyanarayanan, Robert N. Sidebotham, Michael J. West, Scale and Performance
+in a Distributed File System, ACM Transactions on Computer Systems, Vol. 6, No.
+1, February 1988, pp. 51-81.
+\li [2] Michael L.
Kazar, Synchronization and Caching Issues in the Andrew File
+System, USENIX Proceedings, Dallas, TX, Winter 1988.
+\li [3] Alfred Z. Spector, Michael L. Kazar, Uniting File Systems, Unix
+Review, March 1989.
+\li [4] Johna Till Johnson, Distributed File System Brings LAN Technology to
+WANs, Data Communications, November 1990, pp. 66-67.
+\li [5] Michael Padovano, PADCOM Associates, AFS widens your horizons in
+distributed computing, Systems Integration, March 1991.
+\li [6] Steve Lammert, The AFS 3.0 Backup System, LISA IV Conference
+Proceedings, Colorado Springs, Colorado, October 1990.
+\li [7] Michael L. Kazar, Bruce W. Leverett, Owen T. Anderson, Vasilis
+Apostolides, Beth A. Bottos, Sailesh Chutani, Craig F. Everhart, W. Anthony
+Mason, Shu-Tsui Tu, Edward R. Zayas, DEcorum File System Architectural
+Overview, USENIX Conference Proceedings, Anaheim, Texas, Summer 1990.
+\li [8] AFS Drives DCE Selection, Digital Desktop, Vol. 1, No. 6,
+September 1990.
+\li [9] Levine, P.H., The Apollo DOMAIN Distributed File System, in NATO ASI
+Series: Theory and Practice of Distributed Operating Systems, Y. Paker, J-P.
+Banatre, M. Bozyigit, editors, Springer-Verlag, 1987.
+\li [10] M.N. Nelson, B.B. Welch, J.K. Ousterhout, Caching in the Sprite
+Network File System, ACM Transactions on Computer Systems, Vol. 6, No. 1,
+February 1988.
+\li [11] James J. Kistler, M. Satyanarayanan, Disconnected Operation in the Coda
+File System, CMU School of Computer Science technical report, CMU-CS-91-166, 26
+July 1991.
+\li [12] Puneet Kumar, M. Satyanarayanan, Log-Based Directory Resolution
+in the Coda File System, CMU School of Computer Science internal document, 2
+July 1991.
+\li [13] Sun Microsystems, Inc., NFS: Network File System Protocol
+Specification, RFC 1094, March 1989.
+\li [14] Sun Microsystems, Inc., Design and Implementation of the Sun Network
+File System, USENIX Summer Conference Proceedings, June 1985.
+\li [15] C.H. Sauer, D.W. Johnson, L.K. Loucks, A.A. Shaheen-Gouda, and T.A.
+Smith, RT PC Distributed Services Overview, Operating Systems Review, Vol. 21,
+No. 3, July 1987.
+\li [16] A.P. Rifkin, M.P. Forbes, R.L. Hamilton, M. Sabrio, S. Shah, and
+K. Yueh, RFS Architectural Overview, Usenix Conference Proceedings, Atlanta,
+Summer 1986.
+\li [17] Edward R. Zayas, Administrative Cells: Proposal for Cooperative Andrew
+File Systems, Information Technology Center internal document, Carnegie Mellon
+University, 25 June 1987.
+\li [18] Ed. Zayas, Craig Everhart, Design and Specification of the Cellular
+Andrew Environment, Information Technology Center, Carnegie Mellon University,
+CMU-ITC-070, 2 August 1988.
+\li [19] Kazar, Michael L., Information Technology Center, Carnegie Mellon
+University. Ubik - A Library For Managing Ubiquitous Data, ITCID, Pittsburgh,
+PA, Month, 1988.
+\li [20] Kazar, Michael L., Information Technology Center, Carnegie Mellon
+University. Quorum Completion, ITCID, Pittsburgh, PA, Month, 1988.
+\li [21] S. R. Kleinman. Vnodes: An Architecture for Multiple file
+System Types in Sun UNIX, Conference Proceedings, 1986 Summer Usenix Technical
+Conference, pp. 238-247, El Toro, CA, 1986.
+\li [22] S.P. Miller, B.C. Neuman, J.I. Schiller, J.H. Saltzer. Kerberos
+Authentication and Authorization System, Project Athena Technical Plan, Section
+E.2.1, M.I.T., December 1987.
+\li [23] Bill Bryant. Designing an Authentication System: a Dialogue in Four
+Scenes, Project Athena internal document, M.I.T., draft of 8 February 1988.
+ + +*/ diff --git a/doc/txt/dafs-fsa.dot b/doc/txt/dafs-fsa.dot new file mode 100644 index 000000000..565de7122 --- /dev/null +++ b/doc/txt/dafs-fsa.dot @@ -0,0 +1,109 @@ +# +# This is a dot (http://www.graphviz.org) description of the various +# states volumes can be in for DAFS (Demand Attach File Server). +# +# Author: Steven Jenkins +# Date: 2007-05-24 +# + +digraph VolumeStates { + size="11,17" + graph [ + rankdir = "TB" + ]; + + subgraph clusterKey { + rankdir="LR"; + shape = "rectangle"; + + s1 [ shape=plaintext, label = "VPut after VDetach in brown", + fontcolor="brown" ]; + s2 [ shape=plaintext, label = "VAttach in blue", + fontcolor="blue" ]; + s3 [ shape=plaintext, label = "VGet/VHold in purple", + fontcolor="purple" ]; + s4 [ shape=plaintext, label = "Error States in red", + fontcolor="red" ]; + s5 [ shape=plaintext, label = "VPut after VOffline in green", + fontcolor="green" ]; + s6 [ shape=ellipse, label = "re-entrant" ]; + s7 [ shape=ellipse, peripheries=2, label="non re-entrant" ]; + s8 [ shape=ellipse, color="red", label="Error States" ]; + + s6->s7->s8->s1->s2->s3->s4->s5 [style="invis"]; + + } + + node [ peripheries = "2" ] ATTACHING \ + LOADING_VNODE_BITMAPS HDR_LOADING_FROM_DISK \ + HDR_ATTACHING_LRU_PULL \ + "UPDATING\nSYNCING_VOL_HDR_TO_DISK" \ + OFFLINING DETACHING; + node [ shape = "ellipse", peripheries = "1" ]; + node [ color = "red" ] HARD_ERROR SALVAGE_REQUESTED SALVAGING; + + node [ color = "black" ]; // default back to black + + UNATTACHED->Exclusive_vol_op_executing [label = "controlled by FSSYNC" ]; + Exclusive_vol_op_executing->UNATTACHED [label = "controlled by FSSYNC" ]; + UNATTACHED->FREED [ label = "VCancelReservation_r() after a\nVDetach() or FreeVolume() will\ncause CheckDetach() or CheckFree() to fire" ]; + OFFLINING->UNATTACHED; + UNATTACHED->PREATTACHED [ color = "orange", label = "PreAttach()" ]; + PREATTACHED->UNATTACHED [ color = "orange", label = "VOffline()"]; + HARD_ERROR->PREATTACHED [ color = "orange", label = "operator intervention via FSSYNC" ]; + + PREATTACHED->Exclusive_vol_op_executing [color = "orange", label = "controlled by FSSYNC" ]; + Exclusive_vol_op_executing->PREATTACHED [color = "orange", label = "controlled by FSSYNC" ]; + PREATTACHED->FREED [ color = "orange", label = "VCancelReservation_r() after a\nVDetach() or FreeVolume() will\ncause CheckDetach() or CheckFree() to fire" ]; + PREATTACHED->ATTACHING [ color = "blue", weight = "8" ]; + SALVAGING->PREATTACHED [ label = "controlled via FSSYNC" ]; + + DETACHING->FREED ; + SHUTTING_DOWN->DETACHING [ color = "brown" ]; + ATTACHED_nUsers_GT_0->SHUTTING_DOWN [ color = "orange", label = "VDetach()" ]; + + DETACHING->"UPDATING\nSYNCING_VOL_HDR_TO_DISK" [ color = "brown" ]; + "UPDATING\nSYNCING_VOL_HDR_TO_DISK"->DETACHING [ color = "brown" ]; + OFFLINING->"UPDATING\nSYNCING_VOL_HDR_TO_DISK" [ color = "green" ]; + "UPDATING\nSYNCING_VOL_HDR_TO_DISK"->OFFLINING [ color = "green" ]; + GOING_OFFLINE->OFFLINING [ color = "green" ]; + + "UPDATING\nSYNCING_VOL_HDR_TO_DISK"->SALVAGE_REQUESTED [ color = "red" ]; + "UPDATING\nSYNCING_VOL_HDR_TO_DISK"->ATTACHING [ color = "blue" ]; + ATTACHING->"UPDATING\nSYNCING_VOL_HDR_TO_DISK" [ color = "blue" ]; + + ATTACHED_nUsers_GT_0->GOING_OFFLINE [ color = "orange", label = "VOffline" ]; + ATTACHED_nUsers_GT_0->ATTACHED_nUsers_EQ_0 [ color = "orange", label = "VPut" ]; + + ATTACHED_nUsers_GT_0->SALVAGE_REQUESTED [ color = "red" ]; + + LOADING_VNODE_BITMAPS->ATTACHING [ color = "blue" ]; + ATTACHING->LOADING_VNODE_BITMAPS [ color = "blue" ] ; + 
LOADING_VNODE_BITMAPS->SALVAGE_REQUESTED [ color = "red" ]; + HDR_LOADING_FROM_DISK->SALVAGE_REQUESTED [ color = "red" ]; + HDR_LOADING_FROM_DISK->ATTACHING [ color = "blue" ] ; + HDR_LOADING_FROM_DISK->ATTACHED_nUsers_GT_0 [ color = "purple" ]; + + SALVAGE_REQUESTED->SALVAGING [ label = "controlled via FSSYNC" ]; + SALVAGE_REQUESTED->HARD_ERROR [ color = "red", + label = "After hard salvage limit reached,\n hard error state is in effect\nuntil there is operator intervention" ]; + + HDR_ATTACHING_LRU_PULL->HDR_LOADING_FROM_DISK [ color = "blue" ]; + HDR_ATTACHING_LRU_PULL->HDR_LOADING_FROM_DISK [ color = "purple" ]; + HDR_ATTACHING_LRU_PULL->ATTACHED_nUsers_GT_0 [ color = "purple", label = "header can be in LRU\nand not have been reclaimed\nthus skipping disk I/O" ]; + + ATTACHING->HDR_ATTACHING_LRU_PULL [ color = "blue" ]; + ATTACHING->ATTACHED_nUsers_EQ_0 [ color = "blue" ]; + + ATTACHING->SALVAGE_REQUESTED [ color = "red" ]; + ATTACHED_nUsers_EQ_0->HDR_ATTACHING_LRU_PULL [ color = "purple" ]; + + ATTACHED_nUsers_EQ_0->SALVAGE_REQUESTED [ color = "red" ]; + + // Various loopback transitions + GOING_OFFLINE->GOING_OFFLINE [ label = "VPut when (nUsers > 1)" ]; + SHUTTING_DOWN->SHUTTING_DOWN + [ label = "VPut when ((nUsers > 1) ||\n((nUsers == 1) && (nWaiters > 0)))" ]; + SHUTTING_DOWN->SHUTTING_DOWN + [ label = "VCancelReservation_r when ((nWaiters > 1)\n|| ((nWaiters == 1) && (nUsers > 0)))"]; +} diff --git a/doc/txt/dafs-overview.txt b/doc/txt/dafs-overview.txt new file mode 100644 index 000000000..2b2e58668 --- /dev/null +++ b/doc/txt/dafs-overview.txt @@ -0,0 +1,396 @@ +The Demand-Attach FileServer (DAFS) has resulted in many changes to how +many things on AFS fileservers behave. The most sweeping changes are +probably in the volume package, but significant changes have also been +made in the SYNC protocol, the vnode package, salvaging, and a few +miscellaneous bits in the various fileserver processes. + +This document serves as an overview for developers on how to deal with +these changes, and how to use the new mechanisms. For more specific +details, consult the relevant doxygen documentation, the code comments, +and/or the code itself. + + - The salvageserver + +The salvageserver (or 'salvaged') is a new OpenAFS fileserver process in +DAFS. This daemon accepts salvage requests via SALVSYNC (see below), and +salvages a volume group by fork()ing a child, and running the normal +salvager code (it enters vol-salvage.c by calling SalvageFileSys1). + +Salvages that are initiated from a request to the salvageserver (called +'demand-salvages') occur automatically; whenever the fileserver (or +other tool) discovers that a volume needs salvaging, it will schedule a +salvage on the salvageserver without any intervention needed. + +When scheduling a salvage, the vol id should be the id for the volume +group (the RW vol id). If the salvaging child discovers that it was +given a non-RW vol id, it will send the salvageserver a SALVSYNC LINK +command, and will exit. This will tell the salvageserver that whenever +it receives a salvage request for that vol id, it should schedule a +salvage for the corresponding RW id instead. + + - FSSYNC/SALVSYNC + +The FSSYNC and SALVSYNC protocols are the protocols used for +interprocess communication between the various fileserver processes. +FSSYNC is used for querying the fileserver for volume metadata, +'checking out' volumes from the fileserver, and a few other things. +SALVSYNC is used to schedule and query salvages in the salvageserver. 
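+
+As a rough illustration of the volume 'checkout' just mentioned, the
+fragment below sketches how a non-fileserver program might check a volume
+out of the fileserver and hand it back. It is not taken from the OpenAFS
+sources; the FSYNC_VolOp() wrapper, its exact signature, the reason codes,
+and the volid/partition values used here are assumptions that should be
+verified against src/vol/fssync-client.c and the src/vol headers.
+
+    /* Hypothetical sketch only: check volume <volid> out for a dump,
+     * work on the on-disk volume, then check it back in.  Error
+     * handling and the response payload are omitted for brevity;
+     * memset() needs <string.h>. */
+    SYNC_response res;
+    memset(&res, 0, sizeof(res));
+    if (FSYNC_VolOp(volid, "/vicepa", FSYNC_VOL_NEEDVOLUME, V_DUMP, &res)
+        == SYNC_OK) {
+        /* ... operate directly on the volume on disk ... */
+        FSYNC_VolOp(volid, "/vicepa", FSYNC_VOL_ON, 0 /* reason */, &res);
+    }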
+
+FSSYNC existed prior to DAFS, but it encompasses a much larger set of
+commands with the advent of DAFS. SALVSYNC is entirely new to DAFS.
+
+ -- SYNC
+
+FSSYNC and SALVSYNC are both layered on top of a protocol called SYNC.
+SYNC isn't much of a protocol in itself; it just handles some boilerplate
+for the messages passed back and forth, and some error codes common to
+both FSSYNC and SALVSYNC.
+
+SYNC is layered on top of TCP/IP, though we only use it to communicate
+with the local host (usually via a unix domain socket). It does not
+handle anything like authentication, authorization, or even things like
+serialization. Although it uses network primitives for communication,
+it's only useful for communication between processes on the same
+machine, and that is all we use it for.
+
+SYNC calls are basically RPCs, but very simple. The calls are always
+synchronous, and each SYNC server can only handle one request at a time.
+Thus, it is important for SYNC server handlers to return as quickly as
+possible; hitting the network or disk to service a SYNC request should
+be avoided to the extent that such is possible.
+
+SYNC-related source files are src/vol/daemon_com.c and
+src/vol/daemon_com.h.
+
+ -- FSSYNC
+
+ --- server
+
+The FSSYNC server runs in the fileserver; source is in
+src/vol/fssync-server.c.
+
+As mentioned above, FSSYNC handlers should finish quickly when
+servicing a request, so hitting the network or disk should be avoided.
+In particular, you absolutely cannot make a SALVSYNC call inside an
+FSSYNC handler; the SALVSYNC client wrapper routines actively prevent
+this from happening, so even if you try to do such a thing, you will not
+be allowed to. This prohibition is to prevent deadlock, since the
+salvageserver could have made the FSSYNC request that you are servicing.
+
+When a client makes a FSYNC_VOL_OFF or NEEDVOLUME request, the
+fileserver offlines the volume if necessary, and keeps track that the
+volume has been 'checked out'. A volume is left online if the checkout
+mode indicates the volume cannot change (see VVolOpLeaveOnline_r).
+
+Until the volume has been 'checked in' with the ON, LEAVE_OFFLINE, or
+DONE commands, no other program can check out the volume.
+
+Other FSSYNC commands include abilities to query volume metadata and
+stats, to force volumes to be attached or offline, and to update the
+volume group cache. See doc/txt/fssync.txt for documentation on the
+individual FSSYNC commands.
+
+ --- clients
+
+FSSYNC clients are generally any OpenAFS process that runs on a
+fileserver and tries to access volumes directly. The volserver,
+salvageserver, and bosserver all qualify, as do (sometimes) some
+utilities like vol-info or vol-bless. For issuing FSSYNC commands
+directly, there is the debugging tool fssync-debug. FSSYNC client code
+is in src/vol/fssync-client.c, but it's not very interesting.
+
+Any program that wishes to directly access a volume on disk must check
+out the volume via FSSYNC (NEEDVOLUME or OFF commands), to ensure the
+volume doesn't change while the program is using it. If the program
+determines that the volume is somehow inconsistent and should be
+salvaged, it should send the FSSYNC command FORCE_ERROR with reason code
+FSYNC_SALVAGE to the fileserver, which will take care of salvaging it.
+
+ -- SALVSYNC
+
+The SALVSYNC server runs in the salvageserver; code is in
+src/vol/salvsync-server.c. SALVSYNC clients are just the fileserver, the
+salvageserver run with the -client switch, and the salvageserver worker
+children.
If any other process notices that a volume needs salvaging, it
+should issue a FORCE_ERROR FSSYNC command to the fileserver with the
+FSYNC_SALVAGE reason code.
+
+The SALVSYNC protocol is simpler than the FSSYNC protocol. The commands
+are basically just to create, cancel, change, and query salvages. The
+RAISEPRIO command increases the priority of a salvage job that hasn't
+started yet, so volumes that are accessed more frequently will get
+salvaged first. The LINK command is used by the salvageserver worker
+children to inform the salvageserver parent that it tried to salvage a
+readonly volume for which a read-write clone exists (in which case we
+should just schedule a salvage for the parent read-write volume).
+
+Note that canceling a salvage is just for salvages that haven't run
+yet; it only takes a salvage job off of a queue; it doesn't stop a
+salvageserver worker child in the middle of a salvage.
+
+ - The volume package
+
+ -- refcounts
+
+Before DAFS, the Volume struct just had one reference count, vp->nUsers.
+With DAFS, we now have the notion of an internal/lightweight reference
+count, and an external/heavyweight reference count. Lightweight refs are
+acquired with VCreateReservation_r, and released with
+VCancelReservation_r. Heavyweight refs are acquired as before, normally
+with a GetVolume or AttachVolume variant, and the ref is released with
+VPutVolume.
+
+Lightweight references are only acquired within the volume package; a vp
+should not be given to e.g. the fileserver code with an extra
+lightweight ref. A heavyweight ref is generally acquired for a vp that
+will be given to some non-volume-package code; acquiring a heavyweight
+ref guarantees that the volume header has been loaded.
+
+Acquiring a lightweight ref just guarantees that the volume will not go
+away or suddenly become unavailable after dropping VOL_LOCK. Certain
+operations like detachment or scheduling a salvage only occur when all
+of the heavy and lightweight refs go away; see VCancelReservation_r.
+
+ -- state machine
+
+Instead of having a per-volume lock, each vp always has an associated
+'state', that says what, if anything, is occurring to a volume at any
+particular time; or if the volume is attached, offline, etc. To do the
+basic equivalent of a lock -- that is, ensure that nobody else will
+change the volume when we drop VOL_LOCK -- you can put the volume in
+what is called an 'exclusive' state (see VIsExclusiveState).
+
+When a volume is in an exclusive state, no thread should modify the
+volume (or expect the vp data to stay the same), except the thread that
+put it in that state. Whenever you manipulate a volume, you should make
+sure it is not in an exclusive state; first call VCreateReservation_r to
+make sure the volume doesn't go away, and then call
+VWaitExclusiveState_r. When that returns, you are guaranteed to have a
+vp that is in a non-exclusive state, and so can be manipulated. Call
+VCancelReservation_r when done with it, to indicate you don't need it
+anymore.
+
+Look at the definition of the VolState enumeration to see all volume
+states, and a brief explanation of them.
+
+ -- VLRU
+
+See: Most functions with VLRU in their name in src/vol/volume.c.
+
+The VLRU is what dictates when volumes are detached after a certain
+amount of inactivity. The design is pretty much a generational garbage
+collection mechanism. There are 5 queues that a volume can be on the
+VLRU (VLRUQueueName in volume.h).
'Candidate' volumes haven't seen +activity in a while, and so are candidates to be detached. 'New' volumes +have seen activity only recently; 'mid' volumes have seen activity for +awhile, and 'old' volumes have seen activity for a long while. 'Held' +volumes cannot be soft detached at all. + +Volumes are moved from new->mid->old if they have had activity recently, +and are moved from old->mid->new->candidate if they have not had any +activity recently. The definition of 'recently' is configurable by the +-vlruthresh fileserver parameter; see VLRU_ComputeConstants for how they +are determined. Volumes start at 'new' on attachment, and if any +activity occurs when a volume is on 'candidate', it's moved to 'new' +immediately. + +Volumes are generally promoted/demoted and soft-detached by +VLRU_ScannerThread, which runs every so often and moves volumes between +VLRU queues depending on their last access time and the various +thresholds (or soft-detaches them, in the case of the 'candidate' +queue). Soft-detaching just means the volume is taken offline and put +into the preattached state. + + --- DONT_SALVAGE + +The dontSalvage flag in volume headers can be set to DONT_SALVAGE to +indicate that a volume probably doesn't need to be salvaged. Before +DAFS, volumes were placed on an 'UpdateList' which was periodically +scanned, and dontSalvage was set on volumes that hadn't been touched in +a while. + +With DAFS and the VLRU additions, setting dontSalvage now happens when a +volume is demoted a VLRU generation, and no separate list is kept. So if +a volume has been idle enough to demote, and it hasn't been accessed in +SALVAGE_INTERVAL time, dontSalvage will be set automatically by the VLRU +scanner. + + -- Vnode + +Source files: src/vol/vnode.c, src/vol/vnode.h, src/vol/vnode_inline.h + +The changes to the vnode package are largely very similar to those in +the volume package. A Vnode is put into specific states, some of which +are exclusive and act like locks (see VnChangeState_r, +VnIsExclusiveState). Vnodes also have refcounts, incremented and +decremented with VnCreateReservation_r and VnCancelReservation_r like +you would expect. I/O should be done outside of any global locks; just +the vnode is 'locked' by being put in an exclusive state if necessary. + +In addition to a state, vnodes also have a count of readers. When a +caller gets a vnode with a read lock, we of course must wait for the +vnode to be in a nonexclusive state (VnWaitExclusive_r), then the number +of readers is incremented (VnBeginRead_r), but the vnode is kept in a +non-exclusive state (VN_STATE_READ). + +When a caller gets a vnode with a write lock, we must wait not only for +the vnode to be in a nonexclusive state, but also for there to be no +readers (VnWaitQuiescent_r), so we can actually change it. + +VnLock still exists in DAFS, but it's almost a no-op. All we do for DAFS +in VnLock is set vnp->writer to the current thread id for a write lock, +for some consistency checks later (read locks are actually no-ops). +Actual mutual exclusion in DAFS is done by the vnode state machine and +the reader count. + + - viced state serialization + +See src/viced/serialize_state.* and ShutDownAndCore in +src/viced/viced.c + +Before DAFS, whenever a fileserver restarted, it lost all information +about all clients, what callbacks they had, etc. So when a client with +existing callbacks contacted the fileserver, all callback information +needed to be reset, potentially causing a bunch of unnecessary traffic. 
+And of course, if the client does not contact the fileserver again, it
+cannot be sent callbacks that it should receive.
+
+DAFS now has the ability to save the host and CB data to a file on
+shutdown, and restore it when it starts up again. So when a fileserver
+is restarted, the host and CB information should be effectively the same
+as when it shut down. So a client may not even know if a fileserver was
+restarted.
+
+Getting this state information can be a little difficult, since the host
+package data structures aren't necessarily always consistent, even after
+H_LOCK is dropped. What we attempt to do is stop all of the background
+threads early in the shutdown process (set fs_state.mode =
+FS_MODE_SHUTDOWN), and wait for the background threads to exit (or be
+marked as 'tranquil'; see the fs_state struct) later on, before trying
+to save state. This makes it a lot less likely for anything to be
+modifying the host or CB structures by the time we try to save them.
+
+ - volume group cache
+
+See: src/vol/vg_cache* and src/vol/vg_scan.c
+
+The VGC is a mechanism in DAFS to speed up volume salvages. Pre-VGC,
+whenever the salvager code salvaged an individual volume, it would need
+to read all of the volume headers on the partition, so it knows what
+volumes are in the volume group it is salvaging, so it knows what
+volumes to tell the fileserver to take offline. With demand-salvages,
+this can make salvaging take a very long time, since reading in
+all volume headers can take much more time than actually
+salvaging a single volume group.
+
+To prevent the need to scan the partition volume headers every single
+time, the fileserver maintains a cache of which volumes are in what
+volume groups. The cache is populated by scanning a partition's volume
+headers, and is started in the background upon receiving the first
+salvage request for a partition (VVGCache_scanStart_r,
+_VVGC_scan_start).
+
+After the VGC is populated, it is kept up to date with volumes being
+created and deleted via the FSSYNC VG_ADD and VG_DEL
+commands. These are called every time a volume header is created,
+removed, or changed when using the volume header wrappers in vutil.c
+(VCreateVolumeDiskHeader, VDestroyVolumeDiskHeader,
+VWriteVolumeDiskHeader). These wrappers should always be used to
+create/remove/modify vol headers, to ensure that the necessary FSSYNC
+commands are called.
+
+ -- race prevention
+
+In order to prevent races between volume changes and VGC partition scans
+(that is, someone scans a header while it is being written and not yet
+valid), updates to the VGC involving adding or modifying volume headers
+should always be done under the 'partition header lock'. This is a
+per-partition lock to conceptually lock the set of volume headers on
+that partition. It is only read-held when something is writing to a
+volume header, and it is write-held for something that is scanning the
+partition for volume headers (the VGC or partition salvager). This is a
+little counterintuitive, but it is what we want. We want multiple
+headers to be written to at once, but if we are the VGC scanner, we want
+to ensure nobody else is writing when we look at a header file.
+
+Because the race described above is so rare, vol header scanners don't
+actually hold the lock unless a problem is detected. So, what they do is
+read a particular volume header without any lock, and if there is a
+problem with it, they grab a write lock on the partition vol headers,
+and try again.
If it still has a problem, the header is just faulty; if +it's okay, then we avoided the race. + +Note that destroying vol headers does not require any locks, since +unlink()s are atomic and don't cause any races for us here. + + - partition and volume locking + +Previously, whenever the volserver would attach a volume or the salvager +would salvage anything, the partition would be locked +(VLockPartition_r). This unnecessarily serializes part of most volserver +operations. It also makes it so only one salvage can run on a partition +at a time, and that a volserver operation cannot occur at the same time +as a salvage. With the addition of the VGC (previous section), the +salvager partition lock is unnecessary on namei, since the salvager does +not need to scan all volume headers. + +Instead of the rather heavyweight partition lock, in DAFS we now lock +individual volumes. Locking an individual volume is done by locking a +certain byte in the file /vicepX/.volume.lock. To lock volume with ID +1234, you lock 1 byte at offset 1234 (with VLockFile: fcntl on unix, +LockFileEx on windows as of the time of this writing). To read-lock the +volume, acquire a read lock; to write-lock the volume, acquire a write +lock. + +Due to the potentially very large number of volumes attached by the +fileserver at once, the fileserver does not keep volumes locked the +entire time they are attached (which would make volume locking +potentially very slow). Rather, it locks the volume before attaching, +and unlocks it when the volume has been attached. However, all other +programs are expected to acquire a volume lock for the entire duration +they interact with the volume. Whether a read or write lock is obtained +is determined by the attachment mode, and whether or not the volume in +question is an RW volume (see VVolLockType()). + +These locks are all acquired non-blocking, so we can just fail if we +fail to acquire a lock. That is, an errant process holding a file-level +lock cannot cause any process to just hang, waiting for a lock. + + -- re-reading volume headers + +Since we cannot know whether a volume is writable or not until the +volume header is read, and we cannot atomically upgrade file-level +locks, part of attachment can now occur twice (see attach2 and +attach_volume_header). What occurs is we read the vol header, assuming +the volume is readonly (acquiring a read or write lock as necessary). +If, after reading the vol header, we discover that the volume is +writable and that means we need to acquire a write lock, we read the vol +header again while acquiring a write lock on the header. + + -- verifying checkouts + +Since the fileserver does not hold volume locks for the entire time a +volume is attached, there could have been a potential race between the +fileserver and other programs. Consider when a non-fileserver program +checks out a volume from the fileserver via FSSYNC, then locks the +volume. Before the program locked the volume, the fileserver could have +restarted and attached the volume. Since the fileserver releases the +volume lock after attachment, the fileserver and the other program could +both think they have control over the volume, which is a problem. + +To prevent this non-fileserver programs are expected to verify that +their volume is checked out after locking it (FSYNC_VerifyCheckout). +What this does is ask the fileserver for the current volume operation on +the specific volume, and verifies that it matches how the program +checked out the volume. 
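+
+As an aside, the following is a minimal, hypothetical sketch of the
+per-volume byte-range lock described in the "partition and volume
+locking" section above. It is illustration only, not the actual
+VLockFile code, and it assumes plain POSIX fcntl on unix; the helper
+name is made up for this example:
+
+    /* Lock one byte at offset <volid> in /vicepX/.volume.lock.
+     * Pass F_RDLCK to read-lock the volume or F_WRLCK to write-lock
+     * it.  F_SETLK is non-blocking, so the call fails immediately
+     * rather than waiting for a busy lock. */
+    #include <fcntl.h>
+    #include <string.h>
+    #include <unistd.h>
+
+    static int
+    lock_volume_byte(int lockfile_fd, unsigned int volid, short lock_type)
+    {
+        struct flock fl;
+        memset(&fl, 0, sizeof(fl));
+        fl.l_type = lock_type;      /* F_RDLCK or F_WRLCK */
+        fl.l_whence = SEEK_SET;
+        fl.l_start = volid;         /* byte offset == volume ID */
+        fl.l_len = 1;               /* lock exactly one byte */
+        return fcntl(lockfile_fd, F_SETLK, &fl);
+    }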
+ +For example, programType X checks out volume V from the fileserver, and +then locks it. We then ask the fileserver for the current volume +operation on volume V. If the programType on the vol operation does not +match (or the PID, or the checkout mode, or other things), we know the +fileserver must have restarted or something similar, and we do not have +the volume checked out like we thought we did. + +If the program determines that the fileserver may have restarted, it +then must retry checking out and locking the volume (or return an +error). diff --git a/doc/txt/dafs-vnode-fsa.dot b/doc/txt/dafs-vnode-fsa.dot new file mode 100644 index 000000000..a0e28ae80 --- /dev/null +++ b/doc/txt/dafs-vnode-fsa.dot @@ -0,0 +1,89 @@ +# +# This is a dot (http://www.graphviz.org) description of the various +# states volumes can be in for DAFS (Demand Attach File Server). +# +# Author: Tom Keiser +# Date: 2008-06-03 +# + +digraph VolumeStates { + size="11,17" + graph [ + rankdir = "TB" + ]; + + subgraph clusterKey { + rankdir="LR"; + shape = "rectangle"; + + s1 [ shape=plaintext, label = "VAllocVnode", + fontcolor="brown" ]; + s2 [ shape=plaintext, label = "VGetVnode", + fontcolor="blue" ]; + s3 [ shape=plaintext, label = "VPutVnode", + fontcolor="purple" ]; + s4 [ shape=plaintext, label = "Error States", + fontcolor="red" ]; + s5 [ shape=plaintext, label = "VVnodeWriteToRead", + fontcolor="green" ]; + s6 [ shape=ellipse, label = "re-entrant" ]; + s7 [ shape=ellipse, peripheries=2, label="non re-entrant" ]; + s8 [ shape=ellipse, color="red", label="Error States" ]; + + s6->s7->s8->s1->s2->s3->s5->s4 [style="invis"]; + + } + + node [ peripheries = "2" ] \ + RELEASING ALLOC LOADING EXCLUSIVE STORE ; + node [ shape = "ellipse", peripheries = "1" ]; + node [ color = "red" ] ERROR ; + + node [ color = "black" ]; // default back to black + + + // node descriptions + INVALID [ label = "Vn_state(vnp) == VN_STATE_INVALID\n(vnode cache entry is invalid)" ]; + RELEASING [ label = "Vn_state(vnp) == VN_STATE_RELEASING\n(vnode is busy releasing its inode handle ref)" ]; + ALLOC [ label = "Vn_state(vnp) == VN_STATE_ALLOC\n(vnode is busy allocating disk entry)" ]; + ALLOC_read [ label = "reading stale vnode from disk\nto verify inactive state" ]; + ALLOC_extend [ label = "extending vnode index file" ]; + ONLINE [ label = "Vn_state(vnp) == VN_STATE_ONLINE\n(vnode is a valid cache entry)" ]; + LOADING [ label = "Vn_state(vnp) == VN_STATE_LOAD\n(vnode is busy loading from disk)" ]; + EXCLUSIVE [ label = "Vn_state(vnp) == VN_STATE_EXCLUSIVE\n(vnode is owned exclusively by an external caller)" ]; + STORE [ label = "Vn_state(vnp) == VN_STATE_STORE\n(vnode is busy writing to disk)" ]; + READ [ label = "Vn_state(vnp) == VN_STATE_READ\n(vnode is shared by several external callers)" ]; + ERROR [ label = "Vn_state(vnp) == VN_STATE_ERROR\n(vnode hard error state)" ]; + + + ONLINE->RELEASING [ label = "VGetFreeVnode_r()" ]; + RELEASING->INVALID [ label = "VGetFreeVnode_r()" ]; + + INVALID->ALLOC [ color="brown", label="vnode not in cache; allocating" ]; + ONLINE->EXCLUSIVE [ color="brown", label="vnode in cache" ]; + ALLOC->ALLOC_read [ color="brown", label="vnode index is within present file size" ]; + ALLOC->ALLOC_extend [ color="brown", label="vnode index is beyond end of file" ]; + ALLOC_read->EXCLUSIVE [ color="brown" ]; + ALLOC_extend->EXCLUSIVE [ color="brown" ]; + ALLOC_read->INVALID [ color="red", label="I/O error; invalidating vnode\nand scheduling salvage" ]; + ALLOC_extend->INVALID [ color="red", label="I/O error; 
invalidating vnode\nand scheduling salvage" ]; + + INVALID->LOADING [ color="blue", label="vnode not cached" ]; + LOADING->INVALID [ color="red", label="I/O error; invalidating vnode\nand scheduling salvage" ]; + LOADING->ONLINE [ color="blue" ]; + ONLINE->READ [ color="blue", label="caller requested read lock" ]; + ONLINE->EXCLUSIVE [ color="blue", label="caller requested write lock" ]; + + EXCLUSIVE->READ [ color="green", label="vnode not changed" ]; + EXCLUSIVE->STORE [ color="green", label="vnode changed" ]; + EXCLUSIVE->ONLINE [ color="purple", label="vnode not changed" ]; + EXCLUSIVE->STORE [ color="purple", label="vnode changed" ]; + + STORE->READ [ color="green" ]; + STORE->ONLINE [ color="purple" ]; + STORE->ERROR [ color="red", label="I/O error; scheduling salvage" ]; + + READ->READ [ color="blue", label="Vn_readers(vnp) > 0" ]; + READ->READ [ color="purple", label="Vn_readers(vnp) > 1" ]; + READ->ONLINE [ color="purple", label="Vn_readers(vnp) == 1" ]; +} diff --git a/doc/txt/examples/CellAlias b/doc/txt/examples/CellAlias new file mode 100644 index 000000000..f16ed3b50 --- /dev/null +++ b/doc/txt/examples/CellAlias @@ -0,0 +1,10 @@ +# +# This file can be used to specify AFS cell aliases, one per line. +# The syntax to specify "my" as an alias for "my.cell.name" is: +# +# my.cell.name my + +#athena.mit.edu athena +#sipb.mit.edu sipb +#andrew.cmu.edu andrew +#transarc.ibm.com transarc diff --git a/doc/txt/fssync.txt b/doc/txt/fssync.txt new file mode 100644 index 000000000..726d6b9e1 --- /dev/null +++ b/doc/txt/fssync.txt @@ -0,0 +1,253 @@ +This file provides a brief description of the commands of the FSSYNC +protocol, and how/why each are typically used. + + -- vol op FSSYNC commands + +FSSYNC commands involving volume operations take a FSSYNC_VolOp_command +struct as their command and arguments. They all deal with a specific +volume, so "the specified volume" below refers to the volume in the +FSSYNC_VolOp_hdr in the FSSYNC_VolOp_command. + + -- FSYNC_VOL_ON + +Tells the fileserver to bring the specified volume online. For DAFS, +this brings the volume into the preattached state. For non-DAFS, the +volume is attached. + +This is generally used to tell the fileserver about a newly-created +volume, or to return ('check in') a volume to the fileserver that was +previously checked-out (e.g. via FSYNC_VOL_NEEDVOLUME). + + -- FSYNC_VOL_OFF + +Tells the fileserver to take a volume offline, so nothing else will +access the volume until it is brought online via FSSYNC again. A volume +that is offlined with this command and the FSYNC_SALVAGE reason code +will not be allowed access from the fileserver by anything. The volume +will be 'checked out' until it is 'checked in' by another FSYNC command. + +Currently only the salvaging code uses this command; the only difference +between it an FSYNC_VOL_NEEDVOLUME is the logic that determines whether +an offlined volume can be accessed by other programs or not. + + -- FSYNC_VOL_LISTVOLUMES + +This is currently a no-op; all it does is return success, assuming the +FSSYNC command is well-formed. + +In Transarc/IBM AFS 3.1, this was used to create a file listing all +volumes on the server, and was used with a tool to create a list of +volumes to backup. After AFS 3.1, however, it never did anything. + + -- FSYNC_VOL_NEEDVOLUME + +Tells the fileserver that the calling program needs the volume for a +certain operation. The fileserver will offline the volume or keep it +online, depending on the reason code given. 
The volume will be marked as +'checked out' until 'checked in' again with another FSYNC command. + +Reason codes for this command are different than for normal FSSYNC +commands; reason codes for _NEEDVOLUME are volume checkout codes like +V_CLONE, V_DUMP, and the like. The fileserver will keep the volume +online if the given reason code is V_READONLY, or if the volume is an RO +volume and the given reason code is V_CLONE or V_DUMP. If the volume is +taken offline, the volume's specialStatus will also be marked with VBUSY +in the case of the V_CLONE or V_DUMP reason codes. + + -- FSYNC_VOL_MOVE + +Tells the fileserver that the specified volume was moved to a new site. +The new site is given in the reason code of the request. On receiving +this, the fileserver merely sets the specialStatus on the volume, and +breaks all of the callbacks on the volume. + + -- FSYNC_VOL_BREAKCBKS + +Tells the fileserver to break all callbacks with the specified volume. +This is used when volumes are deleted or overwritten (restores, +reclones, etc). + + -- FSYNC_VOL_DONE + +Tells the fileserver that a volume has been deleted. This is actually +similar to FSYNC_VOL_ON, except that the volume isn't onlined. The +volume is 'checked in', though, and is removed from the list of volumes. + + -- FSYNC_VOL_QUERY + +Asks the fileserver to provide the known volume state information for +the specified volume. If the volume is known, the response payload is a +filled-in 'struct Volume'. + +This is used as a debugging tool to query volume state in the +fileserver, but is also used by the volserver as an optimization so it +does not need to always go to disk to retrieve volume information for +e.g. the AFSVolListOneVolume or AFSVolListVolumes RPCs. + + -- FSYNC_VOL_QUERY_HDR + +Asks the fileserver to provide the on-disk volume header for the +specified volume, if the fileserver already has it loaded. If the +fileserver does not already know this information, it responds with +SYNC_FAILED with the reason code FSYNC_HDR_NOT_ATTACHED. Otherwise it +responds with a filled-in 'struct VolumeDiskData' in the response +payload. + +This is used by non-fileservers as an optimization during attachment if +we are just reading from the volume and we don't need to 'check out' the +volume from the fileserver (attaching with V_PEEK). If the fileserver +has the header loaded, it avoids needing to hit the disk for the volume +header. + + -- FSYNC_VOL_QUERY_VOP (DAFS only) + +Asks the fileserver to provide information about the current volume +operation that has the volume checked out. If the volume is checked out, +the response payload is a filled-in 'struct FSSYNC_VolOp_info'; +otherwise the command fails with SYNC_FAILED. + +This is useful as a debugging aid, and is also used by the volserver to +determine if a volume should be advertised as 'offline' or 'online'. + + -- FSYNC_VOL_ATTACH + +This is like FSYNC_VOL_ON, but for DAFS forces the volume to become +fully attached (as opposed to preattached). This is used for debugging, +to ensure that a volume is attached and online without needing to +contact the fileserver via e.g. a client. + + -- FSYNC_VOL_FORCE_ERROR (DAFS only) + +This tells the fileserver that there is something wrong with a volume, +and it should be put in an error state or salvaged. + +If the reason code is FSYNC_SALVAGE, the fileserver will potentially +schedule a salvage for the volume. 
It may or may not actually schedule a
+salvage, depending on how many salvages have occurred and other internal
+logic; basically, specifying FSYNC_SALVAGE makes the fileserver behave
+as if the fileserver itself encountered an error with the volume that
+warrants a salvage.
+
+Non-fileserver programs use this to schedule salvages; they should not
+contact the salvageserver directly. Note that when a salvage is scheduled as
+a result of this command, it is done in the background; getting a
+response from this command does not necessarily mean the salvage has
+been scheduled, as it may be deferred until later.
+
+If the reason code is not FSYNC_SALVAGE, the fileserver will just put
+the volume into an error state, and the volume will be inaccessible
+until it is salvaged, or forced online.
+
+ -- FSYNC_VOL_LEAVE_OFF
+
+This 'checks in' a volume back to the fileserver, but tells the
+fileserver not to bring the volume back online. This can occur when a
+non-fileserver program is done with a volume, but the volume's "blessed"
+or "inService" fields are not set; this prevents the fileserver from
+trying to attach the volume later, only to find the volume is not
+blessed and take the volume offline.
+
+ -- FSYNC_VG_QUERY (DAFS only)
+
+This queries the fileserver VGC (volume group cache) for the volume
+group of the requested volume. The payload consists of an
+FSSYNC_VGQry_response_t, specifying the volume group and all of the
+volumes in that volume group.
+
+If the VGC for the requested partition is currently being populated,
+this will fail with SYNC_FAILED, and the FSYNC_PART_SCANNING reason
+code. If the VGC for the requested partition is currently completely
+unpopulated, a VGC scan for the partition will be started automatically
+in the background, and FSYNC_PART_SCANNING will still be returned.
+
+The demand-salvager uses this to find out what volumes are in the volume
+group it is salvaging; it can also be used for debugging the VGC.
+
+ -- FSYNC_VG_SCAN (DAFS only)
+
+This discards any information in the VGC for the specified partition,
+and re-scans the partition to populate the VGC in the background. This
+should normally not be needed, since scans start automatically when VGC
+information is requested. This can be used as a debugging tool, or to
+force the VGC to discard incorrect information that somehow got into the
+VGC.
+
+Note that the scan is scheduled in the background, so getting a response
+from this command does not imply that the scan has started; it may start
+sometime in the future.
+
+ -- FSYNC_VG_SCAN_ALL
+
+This is the same as FSYNC_VG_SCAN, but schedules scans for all
+partitions on the fileserver, instead of a particular one.
+
+ -- FSYNC_VOL_QUERY_VNODE
+
+Asks the fileserver for information about a specific vnode. This takes a
+different command header than other vol ops; it takes a struct
+FSSYNC_VnQry_hdr which specifies the volume and vnode requested. The
+response payload is a 'struct Vnode' if successful.
+
+This responds with FSYNC_UNKNOWN_VNID if the fileserver doesn't know
+anything about the given vnode. This command will not automatically
+attach the associated volume; the volume must be attached before issuing
+this command in order to do anything useful.
+
+This is just a debugging tool, to see what the fileserver thinks about a
+particular vnode.
+
+ -- stats FSSYNC commands
+
+FSSYNC commands involving statistics take a FSSYNC_StatsOp_command
+struct as their command and arguments.
Some of them use arguments to +specify what stats are requested, which are specified in sop->args, the +union in the FSSYNC_StatsOp_hdr struct. + + -- FSYNC_VOL_STATS_GENERAL + +Retrieves general volume package stats from the fileserver. Response +payload consists of a 'struct VolPkgStats'. + + -- FSYNC_VOL_STATS_VICEP (DAFS only) + +Retrieves per-partition stats from the fileserver for the partition +specified in sop->partName. Response payload consists of a 'struct +DiskPartitionStats64'. + + -- FSYNC_VOL_STATS_HASH (DAFS only) + +Retrieves hash chain stats for the hash bucket specified in +sop->hash_bucket. Response payload consists of a 'struct +VolumeHashChainStats'. + + -- FSYNC_VOL_STATS_HDR (DAFS only) + +Retrieves stats for the volume header cache. Response payload consists +of a 'struct volume_hdr_LRU_stats'. + + -- FSYNC_VOL_STATS_VLRU (DAFS only) + +This is intended to retrieve stats for the VLRU generation specified in +sop->vlru_generation. However, it is not yet implemented and currently +always results in a SYNC_BAD_COMMAND result from the fileserver. + + -- VGC update FSSYNC commands + +FSSYNC commands involving updating the VGC (volume group cache) take an +FSSYNC_VGUpdate_command struct as their command arguments. The parent +and child fields specify the (parent,child) entry in the partName VGC to +add or remove. + + -- FSYNC_VG_ADD (DAFS only) + +Adds an entry to the fileserver VGC. This merely adds the specified +child volume to the specified parent volume group, and creates the +parent volume group if it does not exist. This is used by programs that +create new volumes, in order to keep the VGC up to date. + + -- FSYNC_VG_DEL (DAFS only) + +Deletes an entry from the fileserver VGC. This merely removes the +specified child volume from the specified parent volume group, deleting +the volume group if the last entry was deleted. This is used by programs +that destroy volumes, in order to keep the VGC up to date. diff --git a/doc/txt/linux-nfstrans b/doc/txt/linux-nfstrans new file mode 100644 index 000000000..901080f0a --- /dev/null +++ b/doc/txt/linux-nfstrans @@ -0,0 +1,270 @@ +## Introduction + +This version works on Linux 2.6, and provides the following features: + +- Basic AFS/NFS translator functionality, similar to other platforms +- Ability to distinguish PAG's assigned within each NFS client +- A new 'afspag' kernel module, which provides PAG management on + NFS client systems, and forwards AFS system calls to the translator + system via the remote AFS system call (rmtsys) protocol. +- Support for transparent migration of an NFS client from one translator + server to another, without loss of credentials or sysnames. +- The ability to force the translator to discard all credentials + belonging to a specified NFS client host. + + +The patch applies to OpenAFS 1.4.1, and has been tested against the +kernel-2.6.9-22.0.2.EL kernel binaries as provided by the CentOS project +(essentially these are rebuilds from source of Red Hat Enterprise Linux). +This patch is not expected to apply cleanly to newer versions of OpenAFS, +due to conflicting changes in parts of the kernel module source. To apply +this patch, use 'patch -p0'. + +It has been integrated into OpenAFS 1.5.x. + +## New in Version 1.4 + +- There was no version 1.3 +- Define a "sysname generation number" which changes any time the sysname + list is changed for the translator or any client. 
+
+## New in Version 1.4
+
+- There was no version 1.3
+- Define a "sysname generation number" which changes any time the sysname
+  list is changed for the translator or any client. This number is used
+  as the nanoseconds part of the mtime of directories, which forces NFS
+  clients to reevaluate directory lookups any time the sysname changes.
+- Fixed several bugs related to sysname handling
+- Fixed a bug preventing 'fs exportafs' from changing the flag which
+  controls whether callbacks are made to NFS clients to obtain tokens
+  and sysname lists.
+- Starting in this version, when the PAG manager starts up, it makes a
+  call to the translator to discard any tokens belonging to that client.
+  This fixes a problem where newly-created PAG's on the client would
+  inherit tokens owned by an unrelated process from an earlier boot.
+- Enabled the PAG manager to forward non-V-series pioctl's.
+- Forward ported to OpenAFS 1.4.1 final
+- Added a file, /proc/fs/openafs/unixusers, which reports information
+  about "unixuser" structures, which are used to record tokens and to
+  bind translator-side PAG's to NFS client data and sysname lists.
+
+
+## Finding the RPC server authtab
+
+In order to correctly detect NFS clients and distinguish between them,
+the translator must insert itself into the RPC authentication process.
+This requires knowing the address of the RPC server authentication dispatch
+table, which is not exported from standard kernels. To address this, the
+kernel must be patched such that net/sunrpc/svcauth.c exports the 'authtab'
+symbol, or this symbol's address must be provided when the OpenAFS kernel
+module is loaded, using the option "authtab_addr=0xXXXXXXXX" where XXXXXXXX
+is the address of the authtab symbol as obtained from /proc/kallsyms. The
+latter may be accomplished by adding the following three lines to the
+openafs-client init script in place of 'modprobe openafs':
+
+    modprobe sunrpc
+    authtab=`awk '/[ \t]authtab[ \t]/ { print $1 }' < /proc/kallsyms`
+    modprobe openafs ${authtab:+authtab_addr=0x$authtab}
+
+
+## Exporting the NFS filesystem
+
+In order for the translator to work correctly, /afs must be exported with
+specific options. In particular, the 'no_subtree_check' option is needed
+in order to prevent the common NFS server code from performing unwanted
+access checks, and an fsid option must be provided to set the filesystem
+identifier to be used in NFS filehandles. Note that for live migration
+to work, a consistent filesystem id must be used on all translator systems.
+The export may be accomplished with a line in /etc/exports:
+
+    /afs (rw,no_subtree_check,fsid=42)
+
+Or with a command:
+
+    exportfs -o rw,no_subtree_check,fsid=42 :/afs
+
+The AFS/NFS translator code is enabled by default; no additional command
+is required to activate it. However, the 'fs exportafs nfs' command can
+be used to disable or re-enable the translator and to set options. Note
+that support for client-assigned PAG's is not enabled by default, and
+must be enabled with the following command:
+
+    fs exportafs nfs -clipags on
+
+Support for making callbacks to obtain credentials and sysnames from
+newly-discovered NFS clients is also not enabled by default, because this
+would result in long timeouts on requests from NFS clients which do not
+support this feature. To enable this feature, use the following command:
+
+    fs exportafs nfs -pagcb on
+
+
+## Client-Side PAG Management
+
+Management of PAG's on individual NFS clients is provided by the kernel
+module afspag.ko, which is automatically built alongside the libafs.ko
+module on Linux 2.6 systems. This component is not currently supported
+on any other platform.
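+
+Once afspag.ko is loaded (see below), the standard OpenAFS PAG utilities
+behave on the client as they would under the full cache manager. For
+example, assuming the usual OpenAFS user-land tools are installed and
+pioctl forwarding to a translator is configured as described under
+"Remote System Calls" below:
+
+    pagsh      # start a shell running in a newly created PAG
+    tokens     # list the tokens (initially none) held by that PAG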
+
+To activate the client PAG manager, simply load the module; no additional
+parameters or commands are required. Once the module is loaded, PAG's
+may be acquired using the setpag() call, exactly as on systems running the
+full cache manager. Both the traditional system call and new-style ioctl
+entry points are supported.
+
+In addition, the PAG manager can forward pioctl() calls to an AFS/NFS
+translator system via the remote AFS system call service (rmtsys). To
+enable this feature, the kernel module must be loaded with a parameter
+specifying the location of the translator system:
+
+    insmod afspag.ko nfs_server_addr=0xAABBCCDD
+
+In this example, 0xAABBCCDD is the IP address of the translator system,
+in network byte order. For example, if the translator has the IP address
+192.168.42.100, the nfs_server_addr parameter should be set to 0xc0a82a64.
+
+The PAG manager can be shut down using 'afsd -shutdown' (ironically, this
+is the only circumstance in which that command is useful). Once the
+shutdown is complete, the kernel module can be removed using rmmod.
+
+
+## Remote System Calls
+
+The NFS translator supports the ability of NFS clients to perform various
+AFS-specific operations via the remote system call interface (rmtsys).
+To enable this feature, afsd must be run with the -rmtsys switch. OpenAFS
+client utilities will use this feature automatically if the AFSSERVER
+environment variable is set to the address or hostname of the translator
+system, or if the file ~/.AFSSERVER or /.AFSSERVER exists and contains the
+translator's address or hostname.
+
+On systems running the client PAG manager (afspag.ko), AFS system calls
+made via the traditional methods will be automatically forwarded to the
+NFS translator system, if the PAG manager is configured to do so. This
+feature must be enabled, as described above.
+
+
+## Credential Caching
+
+The client PAG manager maintains a cache of credentials belonging to each
+PAG. When an application makes a system call to set or remove AFS tokens,
+the PAG manager updates its cache in addition to forwarding the request
+to the NFS server.
+
+When the translator hears from a previously-unknown client, it makes a
+callback to the client to retrieve a copy of any cached credentials.
+This means that credentials belonging to an NFS client are not lost if
+the translator is rebooted, or if the client's location on the network
+changes such that it is talking to a different translator.
+
+This feature is automatically supported by the PAG manager if it has
+been configured to forward system calls to an NFS translator. However,
+requests will be honored only if they come from port 7001 on the NFS
+translator host. In addition, this feature must be enabled on the NFS
+translator system as described above.
+
+
+## System Name List
+
+When the NFS translator hears from a new NFS client whose system name
+list it does not know, it can make a callback to the client to discover
+the correct system name list. This ability is enabled automatically
+when credential caching and retrieval are enabled as described above.
+
+The PAG manager maintains a system-wide sysname list, which is used to
+satisfy callback requests from the NFS translator. This list is set
+initially to contain only the compiled-in default sysname, but can be
+changed by the superuser using the VIOC_AFS_SYSNAME pioctl or the
+'fs sysname' command. Any changes are automatically propagated to the
+NFS translator.
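+
+For example, the superuser on an NFS client can inspect and replace the
+client's sysname list with 'fs sysname'; the sysname values below are
+only examples, and the PAG manager pushes any new list to the translator
+automatically:
+
+    fs sysname                                      # show the current list
+    fs sysname -newsys amd64_linux26 i386_linux26   # replace it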
+
+
+## Dynamic Mount Points
+
+This patch introduces a special directory ".:mount", which can be found
+directly below the AFS root directory. This directory always appears to
+be empty, but any name of the form "cell:volume" will resolve to a mount
+point for the specified volume. The resulting mount points are always
+RW-path mount points, and so will resolve to an RW volume even if the
+specified name refers to a replicated volume. However, the ".readonly"
+and ".backup" suffixes can be used to refer to volumes of those types,
+and a numeric volume ID will always be used as-is.
+
+This feature is required to enable the NFS translator to reconstruct a
+reachable path for any valid filehandle presented by an NFS client.
+Specifically, when the path reconstruction algorithm is walking upward
+from a client-provided filehandle and encounters the root directory of
+a volume which is no longer in the cache (and thus has no known mount
+point), it will complete the path to the AFS root using the dynamic
+mount directory.
+
+On non-Linux cache managers, this feature is available when dynamic
+root and fake stat modes are enabled.
+
+On Linux systems, it is also available even when dynroot is not enabled,
+to support the NFS translator. It is presently not possible to disable
+this feature, though that ability may be added in the future. It would
+be difficult to make this feature unavailable to users and still make the
+Linux NFS translator work, since the point of the check being performed
+by the NFS server is to ensure the requested file would be reachable by
+the client.
+
+
+## Security
+
+The security of the NFS translator depends heavily on the underlying
+network. Proper configuration is required to prevent unauthorized
+access to files, theft of credentials, or other forms of attack.
+
+NFS, remote syscall, and PAG callback traffic between an NFS client host
+and translator may contain sensitive file data and/or credentials, and
+should be protected from snooping by unprivileged users or other hosts.
+
+Both the NFS translator and remote system call service authorize requests
+in part based on the IP address of the requesting client. To prevent an
+attacker from making requests on behalf of another host, the network must
+be configured such that it is impossible for one client to spoof the IP
+address of another.
+
+In addition, both the NFS translator and remote system call service
+associate requests with specific users based on user and group ID data
+contained within the request. In order to prevent users on the same client
+from making filesystem access requests as each other, the NFS server must
+be configured to accept requests only from privileged ports. In order to
+prevent users from making AFS system calls on each other's behalf, possibly
+including retrieving credentials, the network must be configured such that
+requests to the remote system call service (port 7009) are accepted only
+from port 7001 on NFS clients.
+
+When a client is migrated away from a translator, any credentials held
+on behalf of that client must be discarded before that client's IP address
+can safely be reused. The VIOC_NFS_NUKE_CREDS pioctl and 'fs nukenfscreds'
+command are provided for this purpose. Both take a single argument, which
+is the IP address of the NFS client whose credentials should be discarded.
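+
+For example, after migrating the NFS client at 192.168.42.101 (an
+illustrative address) away from this translator, an administrator would
+run the following on the translator before allowing that address to be
+reused:
+
+    fs nukenfscreds 192.168.42.101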
+
+
+## Known Issues
+
+ + Because NFS clients do not maintain active references on every inode
+   they are using, it is possible that portions of the directory tree
+   in use by an NFS client will expire from the translator's AFS and
+   Linux dentry caches. When this happens, the NFS server attempts to
+   reconstruct the missing portion of the directory tree, but may fail
+   if the client does not have sufficient access (for example, if its
+   tokens have expired). In these cases, a "stale NFS filehandle" error
+   will be generated. This behavior is similar to that found on other
+   translator platforms, but is triggered under a slightly different set
+   of circumstances due to differences in the architecture of the Linux
+   NFS server.
+
+ + Due to limitations of the rmtsys protocol, some pioctl calls require
+   large (several KB) transfers between the client and rmtsys server.
+   Correcting this issue would require extensions to the rmtsys protocol
+   outside the scope of this project.
+
+ + The rmtsys interface requires that AFS be mounted in the same place
+   on both the NFS client and translator system, or at least that the
+   translator be able to correctly resolve absolute paths provided by
+   the client.
+
+ + If a client is migrated or an NFS translator host is unexpectedly
+   rebooted while AFS filesystem access is in progress, there may be
+   a short delay before the client recovers. This is because the NFS
+   client must time out any request it made to the old server before
+   it can retransmit the request, which will then be handled by the
+   new server. The same applies to remote system call requests.