From: Avery Pennarun
Date: Sun, 20 Feb 2011 04:33:36 +0000 (-0800)
Subject: hashsplit.py: okay, *really* fix BLOB_MAX.
X-Git-Url: https://git.michaelhowe.org/gitweb/?a=commitdiff_plain;h=84f4cf05c68f0fa3e594542520e9c71e459bfb66;p=packages%2Fb%2Fbup.git

hashsplit.py: okay, *really* fix BLOB_MAX.

In some conditions, we were still splitting into blobs larger than
BLOB_MAX.  Fix that too.

Unfortunately, adding an assertion about it in the 'bup split' main loop
slows things down by a measurable amount, so I can't easily add that to
prevent this from happening by accident again in the future.

After implementing this, it looks like 8192 (the typical blob size) times
two isn't big enough to keep this limit from kicking in during "normal"
cases; let's use 4x instead.  In my test file, we exceed this maximum
much less often.  (Every time we exceed BLOB_MAX, it means the bupsplit
algorithm isn't working, so we won't be deduplicating as effectively.
So we want that to be rare.)

Signed-off-by: Avery Pennarun
---

diff --git a/lib/bup/hashsplit.py b/lib/bup/hashsplit.py
index 938fcaa..6134b61 100644
--- a/lib/bup/hashsplit.py
+++ b/lib/bup/hashsplit.py
@@ -2,7 +2,7 @@ import math
 from bup import _helpers
 from bup.helpers import *
 
-BLOB_MAX = 8192*2  # 8192 is the "typical" blob size for bupsplit
+BLOB_MAX = 8192*4  # 8192 is the "typical" blob size for bupsplit
 BLOB_READ_SIZE = 1024*1024
 MAX_PER_TREE = 256
 progress_callback = None
@@ -58,12 +58,14 @@ def _splitbuf(buf):
     while 1:
         b = buf.peek(buf.used())
         (ofs, bits) = _helpers.splitbuf(b)
+        if ofs > BLOB_MAX:
+            ofs = BLOB_MAX
         if ofs:
             buf.eat(ofs)
             yield buffer(b, 0, ofs), bits
         else:
             break
-    if buf.used() > BLOB_MAX:
+    while buf.used() >= BLOB_MAX:
         # limit max blob size
         yield buf.get(BLOB_MAX), 0
 
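For illustration, here is a minimal, runnable Python sketch of the fixed
splitting logic.  It is not the real bup code: the real _splitbuf() in
lib/bup/hashsplit.py drives the C helper _helpers.splitbuf() over a
buffer object with peek()/eat()/get()/used() methods, while the
fake_splitbuf() stub and the plain byte string below are hypothetical
stand-ins, chosen so the two fixes in the patch (clamping ofs, and the
if-to-while change) can be exercised on their own.

    BLOB_MAX = 8192*4   # the new value from the patch

    def fake_splitbuf(data):
        # Hypothetical stand-in for _helpers.splitbuf(): it returns
        # (ofs, bits), where ofs is the offset just past the first
        # rolling-checksum split point (0 if none was found) and bits
        # is the strength of the match.  Always returning (0, 0) models
        # the worst case -- no split point anywhere -- which is exactly
        # when oversized blobs used to escape.
        return (0, 0)

    def splitbuf_fixed(data):
        # Yield (blob, bits) pieces of 'data'; no piece exceeds BLOB_MAX.
        while 1:
            (ofs, bits) = fake_splitbuf(data)
            if ofs > BLOB_MAX:      # fix 1: clamp an overlong split
                ofs = BLOB_MAX
            if ofs:
                yield data[:ofs], bits
                data = data[ofs:]
            else:
                break
        # fix 2: 'while' instead of 'if', so a tail several times
        # BLOB_MAX long is chopped into BLOB_MAX-sized pieces rather
        # than emitted as a single oversized blob
        while len(data) >= BLOB_MAX:
            yield data[:BLOB_MAX], 0
            data = data[BLOB_MAX:]
        # (a remainder shorter than BLOB_MAX stays buffered in the real
        # code, waiting for more input)

    pieces = list(splitbuf_fixed(b'\0' * (3*BLOB_MAX + 100)))
    print([len(p) for p in pieces])   # -> [32768, 32768, 32768]
    assert max(len(p) for p in pieces) <= BLOB_MAX

Each fix maps to one hunk of the diff: the clamp catches
_helpers.splitbuf() returning a split point beyond BLOB_MAX, and the
while loop drains a buffer that has grown to several times BLOB_MAX
without any split point at all.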