Delan Azabani

Glacier and snapshot.debian.org


Debian's snapshot archive is a priceless resource containing every version of every Debian package ever released. With over 11 million files clocking in at well over 16 TB, it's also presumably quite expensive to maintain, especially given that most files are accessed infrequently. Perhaps that can be improved, though.

I met James Bromberger at linux.conf.au earlier this month, and his presentation on migrating the archive to use Amazon Glacier was quite impressive. The archive can be stored at roughly an order of magnitude lower cost ($160 versus $1295 per month for 16 TB) in exchange for retrieval times of around 3–5 hours.
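To put the two quoted figures side by side, here is a quick back-of-the-envelope calculation. It uses only the numbers above; the variable and function names are mine, and it covers storage cost only, not retrieval or transfer fees.

```python
# Monthly storage figures quoted in the talk, for a 16 TB archive.
ARCHIVE_TB = 16
GLACIER_USD_PER_MONTH = 160
BASELINE_USD_PER_MONTH = 1295  # the non-Glacier figure


def cost_per_tb(monthly_usd, size_tb=ARCHIVE_TB):
    """Return the monthly storage cost per terabyte."""
    return monthly_usd / size_tb


# How many times cheaper the Glacier figure is than the baseline.
savings_ratio = BASELINE_USD_PER_MONTH / GLACIER_USD_PER_MONTH

print(f"Glacier:  ${cost_per_tb(GLACIER_USD_PER_MONTH):.2f} per TB per month")
print(f"Baseline: ${cost_per_tb(BASELINE_USD_PER_MONTH):.2f} per TB per month")
print(f"Ratio:    {savings_ratio:.1f}x")
```

So the saving is about eightfold, which is close enough to "an order of magnitude" for a talk.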

The challenge that now arises is providing users with a retrieval method. Some ideas for the basic workflow were suggested:

Today I started hacking on a prototype which currently has some of the above logic, but no pretty user interface on top as yet. Essentially, the current behaviour boils down to what is by far the most HTTP status codes I've ever used in one application:

Initially, HTTP 503 was suggested for every result other than a redirect, but I feel it's better to use distinct status codes so that programs can detect situations without checking the human-readable output. Also, returning a status code from the 2xx category for an initiated restoration seems like a semantic improvement.
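As a rough illustration of the idea (not the prototype's actual code), here is a minimal sketch of mapping file states to distinct status codes. The state names, the `respond` helper, and the specific codes chosen are all assumptions made for this example; the only things taken from the discussion above are that an available file gets a redirect, that an initiated restoration gets something from the 2xx category, and that 503 was the original suggestion for everything else.

```python
from enum import Enum
from http import HTTPStatus


class FileState(Enum):
    """Hypothetical states a requested file could be in."""
    AVAILABLE = "available"                      # restored and ready to fetch
    RESTORE_INITIATED = "restore_initiated"      # this request started a restore
    RESTORE_IN_PROGRESS = "restore_in_progress"  # a restore was already running
    UNKNOWN = "unknown"                          # no such file in the archive


# Assumed mapping: one distinct status code per state.
STATUS_FOR_STATE = {
    FileState.AVAILABLE: HTTPStatus.FOUND,                          # 302
    FileState.RESTORE_INITIATED: HTTPStatus.ACCEPTED,               # 202
    FileState.RESTORE_IN_PROGRESS: HTTPStatus.SERVICE_UNAVAILABLE,  # 503
    FileState.UNKNOWN: HTTPStatus.NOT_FOUND,                        # 404
}

# Lower bound of the 3-5 hour retrieval time, in seconds.
RETRY_AFTER_SECONDS = 3 * 60 * 60


def respond(state, location=None):
    """Return (status_code, headers) for a file in the given state."""
    status = STATUS_FOR_STATE[state]
    headers = {}
    if state is FileState.AVAILABLE:
        if location is None:
            raise ValueError("an available file needs a redirect location")
        headers["Location"] = location
    elif state is FileState.RESTORE_IN_PROGRESS:
        # Hint to clients when it is worth asking again.
        headers["Retry-After"] = str(RETRY_AFTER_SECONDS)
    return int(status), headers
```

With something like this, a client script can branch on the integer status alone (for example, treating any 2xx as "restore started, come back later") instead of scraping the response body.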

To keep the code reasonably maintainable, I've actually thought about coupling and cohesion from the start, which is a first. It looks like the Software Engineering 110 knowledge that Dave bestowed upon me is working. Hopefully I can hold out on major refactoring for longer than I usually do. Currently the bulk of the Python code is spread out across several files:

A tricky issue is mapping Debian package filenames, which are thankfully unique, to the S3 object keys, which in this situation are 160-bit hashes. Without a database this is impossible, but access to one may be on the horizon, should this prototype be found useful.
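To make the lookup concrete, here is a minimal sketch of what such a database could look like, using SQLite from the Python standard library. Everything here is an assumption for illustration: the table layout, the function names, the example filename, and the guess that the 160-bit keys are SHA-1 digests of the file contents (SHA-1 happens to produce 160 bits, but the real archive's scheme may differ).

```python
import hashlib
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a hypothetical filename -> object key index."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "  filename   TEXT PRIMARY KEY,"  # package filenames are unique
        "  object_key TEXT NOT NULL"      # 160-bit hash as 40 hex digits
        ")"
    )
    return conn


def add_file(conn, filename, content):
    """Record a file, assuming its key is the SHA-1 of its contents."""
    key = hashlib.sha1(content).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO files (filename, object_key) VALUES (?, ?)",
        (filename, key),
    )
    return key


def object_key_for(conn, filename):
    """Return the object key for a filename, or None if it is unknown."""
    row = conn.execute(
        "SELECT object_key FROM files WHERE filename = ?", (filename,)
    ).fetchone()
    return row[0] if row else None


conn = open_index()
add_file(conn, "example_1.0-1_all.deb", b"not a real package")
print(object_key_for(conn, "example_1.0-1_all.deb"))
print(object_key_for(conn, "missing_0.1_all.deb"))  # None
```

Because filenames are unique, a primary key on the filename column is enough, and each request resolves with a single indexed lookup.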