From: Avery Pennarun
Date: Sun, 20 Feb 2011 04:33:36 +0000 (-0800)
Subject: hashsplit.py: okay, *really* fix BLOB_MAX.
X-Git-Url: https://git.michaelhowe.org/gitweb/?a=commitdiff_plain;h=84f4cf05c68f0fa3e594542520e9c71e459bfb66;p=packages%2Fb%2Fbup.git

hashsplit.py: okay, *really* fix BLOB_MAX.

In some conditions, we were still splitting into blobs larger than
BLOB_MAX.  Fix that too.

Unfortunately, adding an assertion about it in the 'bup split' main loop
slows things down by a measurable amount, so I can't easily add that to
prevent this from happening by accident again in the future.

After implementing this, it looks like 8192 (the typical blob size) times
two isn't big enough to keep this limit from kicking in during "normal"
cases; let's use 4x instead.  In my test file, we exceed this maximum
much less often.  (Every time we exceed BLOB_MAX, it means the bupsplit
algorithm isn't working, so we won't be deduplicating as effectively.
So we want that to be rare.)

Signed-off-by: Avery Pennarun
---

diff --git a/lib/bup/hashsplit.py b/lib/bup/hashsplit.py
index 938fcaa..6134b61 100644
--- a/lib/bup/hashsplit.py
+++ b/lib/bup/hashsplit.py
@@ -2,7 +2,7 @@ import math
 from bup import _helpers
 from bup.helpers import *
 
-BLOB_MAX = 8192*2  # 8192 is the "typical" blob size for bupsplit
+BLOB_MAX = 8192*4  # 8192 is the "typical" blob size for bupsplit
 BLOB_READ_SIZE = 1024*1024
 MAX_PER_TREE = 256
 progress_callback = None
@@ -58,12 +58,14 @@ def _splitbuf(buf):
     while 1:
         b = buf.peek(buf.used())
         (ofs, bits) = _helpers.splitbuf(b)
+        if ofs > BLOB_MAX:
+            ofs = BLOB_MAX
         if ofs:
             buf.eat(ofs)
             yield buffer(b, 0, ofs), bits
         else:
             break
-    if buf.used() > BLOB_MAX:
+    while buf.used() >= BLOB_MAX:
         # limit max blob size
         yield buf.get(BLOB_MAX), 0
 
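For illustration, here is a minimal, runnable Python sketch of the fixed
splitting logic.  It is not the real bup code: the real _splitbuf() in
lib/bup/hashsplit.py drives the C helper _helpers.splitbuf() over a
buffer object with peek()/eat()/get()/used() methods, while the
fake_splitbuf() stub and the plain byte string below are hypothetical
stand-ins, chosen so the two fixes in the patch (clamping ofs, and the
if-to-while change) can be exercised on their own.

    BLOB_MAX = 8192*4   # the new value from the patch

    def fake_splitbuf(data):
        # Hypothetical stand-in for _helpers.splitbuf(): it returns
        # (ofs, bits), where ofs is the offset just past the first
        # rolling-checksum split point (0 if none was found) and bits
        # is the strength of the match.  Always returning (0, 0) models
        # the worst case -- no split point anywhere -- which is exactly
        # when oversized blobs used to escape.
        return (0, 0)

    def splitbuf_fixed(data):
        # Yield (blob, bits) pieces of 'data'; no piece exceeds BLOB_MAX.
        while 1:
            (ofs, bits) = fake_splitbuf(data)
            if ofs > BLOB_MAX:      # fix 1: clamp an overlong split
                ofs = BLOB_MAX
            if ofs:
                yield data[:ofs], bits
                data = data[ofs:]
            else:
                break
        # fix 2: 'while' instead of 'if', so a tail several times
        # BLOB_MAX long is chopped into BLOB_MAX-sized pieces rather
        # than emitted as a single oversized blob
        while len(data) >= BLOB_MAX:
            yield data[:BLOB_MAX], 0
            data = data[BLOB_MAX:]
        # (a remainder shorter than BLOB_MAX stays buffered in the real
        # code, waiting for more input)

    pieces = list(splitbuf_fixed(b'\0' * (3*BLOB_MAX + 100)))
    print([len(p) for p in pieces])   # -> [32768, 32768, 32768]
    assert max(len(p) for p in pieces) <= BLOB_MAX

Each fix maps to one hunk of the diff: the clamp catches
_helpers.splitbuf() returning a split point beyond BLOB_MAX, and the
while loop drains a buffer that has grown to several times BLOB_MAX
without any split point at all.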