From eb5190eb4a7cd95166866a89e0a8f3a69bbc6e8f Mon Sep 17 00:00:00 2001 From: Andrew Deason Date: Fri, 29 Mar 2013 13:40:41 -0500 Subject: [PATCH] Make ihandle sync behavior runtime-configurable The actual behavior of FDH_SYNC has changed a bit over the years, and some people want one behavior, and some want another. Make it possible to make this choice at runtime with the new -sync option, instead of making this decision by running with different patches. Note that FDH_SYNC is not a macro anymore, nor is it an inline function. While it could be a macro, it would look a bit complex, and there are some oddities with trying to use vol_io_params inside the FDH_SYNC expansion (vol_io_params is not declared for LWP, for example). And having it be an inline function causes problems with some odd linking dependencies. For example, vlib.a contains volume.o, but does not contain a definition for DFlushVolume (dir/buffer.c), which is referenced in volume.o. 'vos' uses vlib.a, but does not bring in anything that defines DFlushVolume. Currently this appears to not cause a problem because 'vos' uses nothing from volume.o, so the dependencies of volume.o don't matter. Adding an inline FDH_SYNC for platforms that don't support 'static inline' would add a dependency to volume.o (via vol_io_params), which causes an error for the lack of a DFlushVolume. Those are possibly just some problems, and may not be all. So instead, make it so we don't have to deal with that and just have a normal function. While FDH_SYNC may be called in a performance-critical section, the overhead of a real function call is nowhere near the delay of an actual fsync(), so presumably any overhead doesn't matter. Change-Id: I23620bd8ac31b9019e9d55cb46ec9f3a75f5675c Reviewed-on: http://gerrit.openafs.org/9694 Reviewed-by: Andrew Deason Tested-by: BuildBot Reviewed-by: Derrick Brashear --- .../pod8/fragments/fileserver-options.pod | 89 +++++++++++++++++++ .../pod8/fragments/fileserver-synopsis.pod | 1 + .../pod8/fragments/volserver-options.pod | 5 ++ .../pod8/fragments/volserver-synopsis.pod | 1 + src/viced/viced.c | 12 ++- src/vol/ihandle.c | 46 ++++++++++ src/vol/ihandle.h | 23 ++++- src/vol/vol-salvage.c | 13 +-- src/volser/volmain.c | 12 +++ 9 files changed, 188 insertions(+), 14 deletions(-) diff --git a/doc/man-pages/pod8/fragments/fileserver-options.pod b/doc/man-pages/pod8/fragments/fileserver-options.pod index 9e4727f2e..a35faa0de 100644 --- a/doc/man-pages/pod8/fragments/fileserver-options.pod +++ b/doc/man-pages/pod8/fragments/fileserver-options.pod @@ -337,3 +337,92 @@ B<-offline-shutdown-timeout> is the value specified for B<-offline-timeout>. Otherwise, the default value is C<-1>. For the LWP fileserver, the only valid value for this option is C<-1>. + +=item B<-sync> + +This option changes how hard the fileserver tries to ensure that data written +to volumes actually hits the physical disk. + +Normally, when the fileserver writes to disk, the underlying filesystem or +Operating System may delay writes from actually going to disk, and reorder +which writes hit the disk first. So, during an unclean shutdown of the machine +(if the power goes out, or the machine crashes, etc), file data may become lost +that the server previously told clients was already successfully written. + +To try to mitigate this, the fileserver will try to "sync" file data to the +physical disk at numerous points during various I/O. However, this can result +in significantly reduced performance. Depending on the usage patterns, this may +or may not be acceptable. This option dictates specifically what the fileserver +does when it wants to perform a "sync". + +There are several options; pass one of these as the argument to -sync. The +default is C. + +=over 4 + +=item always + +This causes a sync operation to always sync immediately and synchronously. +This is the slowest option that provides the greatest protection against data +loss in the event of a crash. + +Note that this is still not a 100% guarantee that data will not be lost or +corrupted during a crash. The underlying filesystem itself may cause data to +be lost or corrupt in such a situation. And OpenAFS itself does not (yet) even +guarantee that all data is consistent at any point in time; so even if the +filesystem and OS do not buffer or reorder any writes, you are not guaranteed +that all data will be okay after a crash. + +This was the only behavior allowed in OpenAFS releases prior to 1.4.5. + +=item onclose + +This causes a sync to do nothing immediately, but causes the relevant file to +be flagged as potentially needing a sync. When a volume is detached, volume +metadata files flaged for synced are synced, as well as data files that have +been accessed recently. Events that cause a volume to detach include: +performing volume operations (dump, restore, clone, etc), a clean shutdown +of the fileserver, or during DAFS "soft detachment". + +Effectively this option is the same as C while a volume is attached and +actively being used, but if a volume is detached, there is an additional +guarantee for the data's consistency. + +After the removal of the C option after the OpenAFS 1.6 series, this +option became the default. + +=item never + +This causes all syncs to never do anything. This is the fastest option, with +the weakest guarantees for data consistency. + +Depending on the underlying filesystem and Operating System, there may be +guarantees that any data written to disk will hit the physical media after a +certain amount of time. For example, Linux's pdflush process usually makes this +guarantee, and ext3 can make certain various consistency guarantees according +to the options given. ZFS on Solaris can also provide similar guarantees, as +can various other platforms and filesystems. Consult the documentation for +your platform if you are unsure. + +=item delayed + +This option used to exist in OpenAFS 1.6, but was later removed due to issues +encountered with data corruption during normal operation. Outside of the +OpenAFS 1.6 series, it is not a valid option, and the fileserver will fail to +start if you specify this (or any other unknown option). It caused syncs to +occur in a background thread, executing every 10 seconds. + +This was the only behavior allowed in OpenAFS releases starting from 1.4.5 up +to and including 1.6.2. It was also the default for the 1.6 series starting in +OpenAFS 1.6.3. + +=back + +Which option you choose is not an easy decision to make. Various developers +and experts sometimes disagree on which option is the most reasonable, and it +may depend on the specific scenario and workload involved. Some argue that +the C option does not provide significantly greater guarantees over +any other option, whereas others argue that choosing anything besides the +C option allows for an unacceptable risk of data loss. This may +depend on your usage patterns, your platform and filesystem, and who you talk +to about this topic. diff --git a/doc/man-pages/pod8/fragments/fileserver-synopsis.pod b/doc/man-pages/pod8/fragments/fileserver-synopsis.pod index c22d288c0..1dc31885c 100644 --- a/doc/man-pages/pod8/fragments/fileserver-synopsis.pod +++ b/doc/man-pages/pod8/fragments/fileserver-synopsis.pod @@ -47,3 +47,4 @@ B S<<< [B<-lock>] >>> S<<< [B<-offline-timeout> >] >>> S<<< [B<-offline-shutdown-timeout> >] >>> + S<<< [B<-sync> >] >>> diff --git a/doc/man-pages/pod8/fragments/volserver-options.pod b/doc/man-pages/pod8/fragments/volserver-options.pod index 6ac5ebbcc..1ad49a069 100644 --- a/doc/man-pages/pod8/fragments/volserver-options.pod +++ b/doc/man-pages/pod8/fragments/volserver-options.pod @@ -80,6 +80,11 @@ Preserve volume access statistics over volume restore and reclone operations. By default, volume access statistics are reset during volume restore and reclone operations. +=item B<-sync> > + +This is the same as the B<-sync> option in L. See +L. + =item B<-help> Prints the online help for this command. All other valid options are diff --git a/doc/man-pages/pod8/fragments/volserver-synopsis.pod b/doc/man-pages/pod8/fragments/volserver-synopsis.pod index 1d7805de0..0340a50bc 100644 --- a/doc/man-pages/pod8/fragments/volserver-synopsis.pod +++ b/doc/man-pages/pod8/fragments/volserver-synopsis.pod @@ -6,3 +6,4 @@ B [B<-nojumbo>] [B<-jumbo>] [B<-enable_peer_stats>] [B<-enable_process_stats>] [B<-allow-dotted-principals>] [B<-preserve-vol-stats>] [B<-help>] + [B<-sync> >] diff --git a/src/viced/viced.c b/src/viced/viced.c index 4abbb4a24..afffe53ec 100644 --- a/src/viced/viced.c +++ b/src/viced/viced.c @@ -978,7 +978,8 @@ enum optionsList { OPT_rxmaxmtu, OPT_udpsize, OPT_dotted, - OPT_realm + OPT_realm, + OPT_sync }; static int @@ -992,6 +993,7 @@ ParseArgs(int argc, char *argv[]) int lwps_max; char *auditFileName = NULL; + char *sync_behavior = NULL; #if defined(AFS_AIX32_ENV) extern int aixlow_water; @@ -1156,6 +1158,8 @@ ParseArgs(int argc, char *argv[]) "permit Kerberos 5 principals with dots"); cmd_AddParmAtOffset(opts, OPT_realm, "-realm", CMD_LIST, CMD_OPTIONAL, "local realm"); + cmd_AddParmAtOffset(opts, OPT_sync, "-sync", + CMD_SINGLE, CMD_OPTIONAL, "always | onclose | never"); code = cmd_Parse(argc, argv, &opts); if (code) @@ -1289,6 +1293,12 @@ ParseArgs(int argc, char *argv[]) &vol_io_params.fd_max_cachesize); cmd_OptionAsUint(opts, OPT_vhandle_initial_cachesize, &vol_io_params.fd_initial_cachesize); + if (cmd_OptionAsString(opts, OPT_sync, &sync_behavior) == 0) { + if (ih_SetSyncBehavior(sync_behavior)) { + printf("Invalid -sync value %s\n", sync_behavior); + return -1; + } + } #ifdef AFS_DEMAND_ATTACH_FS if (cmd_OptionPresent(opts, OPT_fs_state_dont_save)) diff --git a/src/vol/ihandle.c b/src/vol/ihandle.c index 8b5d751f0..d903c016c 100644 --- a/src/vol/ihandle.c +++ b/src/vol/ihandle.c @@ -97,6 +97,31 @@ void ih_PkgDefaults(void) /* fd cache size that will be used if/when ih_UseLargeCache() * is called */ vol_io_params.fd_max_cachesize = FD_MAX_CACHESIZE; + + vol_io_params.sync_behavior = IH_SYNC_ONCLOSE; +} + +int +ih_SetSyncBehavior(const char *behavior) +{ + int val; + + if (strcmp(behavior, "always") == 0) { + val = IH_SYNC_ALWAYS; + + } else if (strcmp(behavior, "onclose") == 0) { + val = IH_SYNC_ONCLOSE; + + } else if (strcmp(behavior, "never") == 0) { + val = IH_SYNC_NEVER; + + } else { + /* invalid behavior name */ + return -1; + } + + vol_io_params.sync_behavior = val; + return 0; } #ifdef AFS_PTHREAD_ENV @@ -864,6 +889,8 @@ ih_reallyclose(IHandle_t * ihP) ihP->ih_refcnt++; /* must not disappear over unlock */ if (ihP->ih_synced) { FdHandle_t *fdP; + opr_Assert(vol_io_params.sync_behavior != IH_SYNC_ALWAYS); + opr_Assert(vol_io_params.sync_behavior != IH_SYNC_NEVER); ihP->ih_synced = 0; IH_UNLOCK; @@ -1031,3 +1058,22 @@ ih_isunlinked(int fd) return 0; } #endif /* !AFS_NT40_ENV */ + +int +ih_fdsync(FdHandle_t *fdP) +{ + switch (vol_io_params.sync_behavior) { + case IH_SYNC_ALWAYS: + return OS_SYNC(fdP->fd_fd); + case IH_SYNC_ONCLOSE: + if (fdP->fd_ih) { + fdP->fd_ih->ih_synced = 1; + return 0; + } + return 1; + case IH_SYNC_NEVER: + return 0; + default: + opr_Assert(0); + } +} diff --git a/src/vol/ihandle.h b/src/vol/ihandle.h index 4a1ae6352..6682af765 100644 --- a/src/vol/ihandle.h +++ b/src/vol/ihandle.h @@ -200,6 +200,21 @@ typedef struct StreamHandle_s { #define FD_HANDLE_MALLOCSIZE ((size_t)((4096/sizeof(FdHandle_t)))) #define STREAM_HANDLE_MALLOCSIZE 1 +/* Possible values for the vol_io_params.sync_behavior option. + * These dictate what actually happens when you call FDH_SYNC or IH_CONDSYNC. */ +#define IH_SYNC_ALWAYS (1) /* This makes FDH_SYNCs do what you'd probably + * expect: a synchronous fsync() */ +#define IH_SYNC_ONCLOSE (2) /* This makes FDH_SYNCs just flag the ih as "I + * need to sync", and does not perform the actual + * fsync() until we IH_REALLYCLOSE. This provides a + * little assurance over IH_SYNC_NEVER when a volume + * has gone offline, and a few other situations. */ +#define IH_SYNC_NEVER (3) /* This makes FDH_SYNCs do nothing. Faster, but + * obviously less durable. The OS may ensure that + * our data hits the disk eventually, depending on + * the platform and various OS-specific tuning + * parameters. */ + /* READ THIS. * @@ -218,8 +233,9 @@ typedef struct ih_init_params afs_uint32 fd_handle_setaside; /* for non-cached i/o, trad. was 128 */ afs_uint32 fd_initial_cachesize; /* what was 'default' */ afs_uint32 fd_max_cachesize; /* max open files if large-cache activated */ -} ih_init_params; + int sync_behavior; /* one of the IH_SYNC_* constants */ +} ih_init_params; /* Number of file descriptors needed for non-cached I/O */ #define FD_HANDLE_SETASIDE 128 /* Match to MAX_FILESERVER_THREAD */ @@ -301,6 +317,7 @@ extern FD_t *ih_fdopen(FdHandle_t * h, char *fdperms); extern void ih_PkgDefaults(void); extern void ih_Initialize(void); extern void ih_UseLargeCache(void); +extern int ih_SetSyncBehavior(const char *behavior); extern IHandle_t *ih_init(int /*@alt Device@ */ dev, int /*@alt VolId@ */ vid, Inode ino); extern IHandle_t *ih_copy(IHandle_t * ihP); @@ -550,13 +567,15 @@ extern afs_sfsize_t ih_size(FD_t); #define FDH_WRITE(H, B, S) OS_WRITE((H)->fd_fd, B, S) #define FDH_SEEK(H, O, F) OS_SEEK((H)->fd_fd, O, F) -#define FDH_SYNC(H) ((H->fd_ih!=NULL) ? ( H->fd_ih->ih_synced = 1) - 1 : 1) +#define FDH_SYNC(H) ih_fdsync(H) #define FDH_TRUNC(H, L) OS_TRUNC((H)->fd_fd, L) #define FDH_SIZE(H) OS_SIZE((H)->fd_fd) #define FDH_LOCKFILE(H, O) OS_LOCKFILE((H)->fd_fd, O) #define FDH_UNLOCKFILE(H, O) OS_UNLOCKFILE((H)->fd_fd, O) #define FDH_ISUNLINKED(H) OS_ISUNLINKED((H)->fd_fd) +extern int ih_fdsync(FdHandle_t *fdP); + #ifdef AFS_NT40_ENV # define afs_stat_st __stat64 # define afs_stat _stat64 diff --git a/src/vol/vol-salvage.c b/src/vol/vol-salvage.c index 976a21f34..0ba989668 100644 --- a/src/vol/vol-salvage.c +++ b/src/vol/vol-salvage.c @@ -2999,17 +2999,8 @@ CopyAndSalvage(struct SalvInfo *salvinfo, struct DirSummary *dir) vnodeIndexOffset(vcp, dir->vnodeNumber), (char *)&vnode, sizeof(vnode)); opr_Assert(lcode == sizeof(vnode)); -#if 0 -#ifdef AFS_NT40_ENV - nt_sync(salvinfo->fileSysDevice); -#else - sync(); /* this is slow, but hopefully rarely called. We don't have - * an open FD on the file itself to fsync. - */ -#endif -#else - salvinfo->vnodeInfo[vLarge].handle->ih_synced = 1; -#endif + IH_CONDSYNC(salvinfo->vnodeInfo[vLarge].handle); + /* make sure old directory file is really closed */ fdP = IH_OPEN(dir->dirHandle.dirh_handle); FDH_REALLYCLOSE(fdP); diff --git a/src/volser/volmain.c b/src/volser/volmain.c index e6a418580..0c395707c 100644 --- a/src/volser/volmain.c +++ b/src/volser/volmain.c @@ -367,6 +367,16 @@ main(int argc, char **argv) rx_enableProcessRPCStats(); } else if (strcmp(argv[code], "-preserve-vol-stats") == 0) { DoPreserveVolumeStats = 1; + } else if (strcmp(argv[code], "-sync") == 0) { + if ((code + 1) >= argc) { + printf("You have to specify -sync \n"); + exit(1); + } + ih_PkgDefaults(); + if (ih_SetSyncBehavior(argv[++code])) { + printf("Invalid -sync value %s\n", argv[code]); + exit(1); + } } #ifndef AFS_NT40_ENV else if (strcmp(argv[code], "-syslog") == 0) { @@ -387,6 +397,7 @@ main(int argc, char **argv) "[-udpsize ] " "[-syslog[=FACILITY]] " "[-enable_peer_stats] [-enable_process_stats] " + "[-sync ] " "[-help]\n"); #else printf("Usage: volserver [-log] [-p ] " @@ -394,6 +405,7 @@ main(int argc, char **argv) "[-nojumbo] [-jumbo] [-rxmaxmtu ] [-rxbind] [-allow-dotted-principals] " "[-udpsize ] " "[-enable_peer_stats] [-enable_process_stats] " + "[-sync ] " "[-help]\n"); #endif VS_EXIT(1); -- 2.39.5