Dan Stromberg: Backshift not that slow, and for good reason


Backshift is a deduplicating backup program written in Python.

At http://burp.grke.org/burp2/08results1.html you can find a performance comparison between some backup applications. The comparison did not include backshift, because backshift was believed to have prohibitively slow deduplication.

Backshift is admittedly not a speed demon. It is designed to:

- minimize storage requirements
- minimize bandwidth requirements
- emphasize parallel performance (concurrent backups of different computers) to some extent
- allow expiration of old data that is no longer needed

Also, it was almost certainly not backshift's deduplication that was slow. It was:

- backshift's variable-length, content-based blocking algorithm. This makes Python inspect every byte of the backup, one byte at a time.
- backshift's use of xz compression. xz packs files very hard, reducing storage and bandwidth requirements, but it is known to be slower than something like gzip, which doesn't compress as well.

Also, while the initial full save is slow, subsequent backups are much faster, because they do not reblock or recompress any file that still has the same mtime and size as found in one of (up to) three previous backups.

Also, if you run backshift on PyPy, its variable-length, content-based blocking algorithm is many times faster than if you run it on CPython. PyPy is not only faster than CPython, it's also much faster than CPython augmented with Cython.

I sent G. P. E. Keeling an e-mail about this some time ago (the date of this writing is October 2015), but never received a response.
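To make the cost of variable-length, content-based blocking concrete, here is a minimal sketch of the general technique. It is not backshift's actual code: the function names `split_chunks` and `store_chunks`, the toy rolling hash, and the size constants are all assumptions chosen for illustration. The point is that the loop has to touch every byte in the interpreter, which is the kind of work PyPy's JIT speeds up, and that a small edit near the start of a file leaves most chunks unchanged, so they deduplicate.

```python
import hashlib
import lzma
import random


def split_chunks(data, min_size=32, max_size=1024, mask=0x3F):
    """Yield variable-length chunks whose boundaries depend on content.

    Hypothetical toy scheme: a shift-and-add hash is updated for every
    byte, and a chunk ends when the low bits of the hash are all ones
    (or when the chunk reaches max_size).
    """
    start = 0
    rolling = 0
    for i, byte in enumerate(data):  # one byte at a time, in pure Python
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (rolling & mask) == mask) or length >= max_size:
            yield data[start:i + 1]
            start = i + 1
            rolling = 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk


def store_chunks(data, store):
    """Add unseen chunks to `store` (digest -> xz-compressed bytes).

    Returns the list of digests needed to rebuild `data`.
    """
    digests = []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:  # deduplication: store each chunk once
            store[digest] = lzma.compress(chunk)  # xz container format
        digests.append(digest)
    return digests


# Demo: back up some data, then back up an edited copy of it.
rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20000))
edited = b"hello" + original  # small insertion at the front

store = {}
first = store_chunks(original, store)
count_after_first = len(store)
second = store_chunks(edited, store)
new_chunks = len(store) - count_after_first
print(len(first), "chunks in first backup;", new_chunks, "new chunks after the edit")
```

With fixed-size blocks, the five inserted bytes would shift every later block boundary and nothing would deduplicate. Because the boundaries here follow the content, the chunking resynchronizes shortly after the insertion and only a few new chunks need to be compressed and stored.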