Glacier and snapshot.debian.org

Debian's snapshot archive is a priceless resource containing every version of every Debian package ever released. With over 11 million files clocking in at well over 16 TB, it's also presumably quite expensive to maintain, especially given that most files are accessed infrequently. Perhaps that can be improved, though.

I met James Bromberger at linux.conf.au earlier this month, and his presentation on migrating the archive to use Amazon Glacier was quite impressive. The archive can be stored at roughly an eighth of the cost ($160 versus $1295 per month for 16 TB) in exchange for retrieval times of around 3–5 hours.

The challenge that now arises is providing users with a retrieval method. Some ideas for the basic workflow were suggested:

Have a web interface to allow users to find and request packages.
Limit restorations to, for example, 100 per day to control costs.
Redirect users directly to S3 URLs where files are live.
Where a requested file has been archived in Glacier:
- Reject the restoration if the quota has been exceeded.
- Commence restoration of the file from Glacier otherwise.

Today I started hacking on a prototype that implements some of the above logic, but has no pretty user interface on top yet. The current behaviour boils down to two routes, which between them use by far the most HTTP status codes I've ever put in one application:

/ — list restorations initiated in the last 24 hours
/blob/$hash — retrieve S3 object identified by hash
- If the file doesn't exist, throw a 404
- If the file is live, HTTP 307 to a generated S3 URL
- If the quota has been exceeded, throw a 503
- Otherwise, initiate restoration and return a 202
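
A minimal sketch of that decision order, written as a single pure function so the status code logic can be read in one place. This is an illustration under assumptions, not the prototype's actual code: `BlobState`, `Response`, `handle_blob` and `DAILY_QUOTA` are hypothetical names, the quota of 100 is borrowed from the example figure in the workflow list, and the signed S3 URL is passed in rather than generated.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BlobState(Enum):
    """Where a requested object currently lives (hypothetical names)."""
    MISSING = "missing"    # no object with this hash exists
    LIVE = "live"          # readable from S3 right now
    ARCHIVED = "archived"  # stored in Glacier, needs restoring first


DAILY_QUOTA = 100  # assumed limit, taken from the "100 per day" example


@dataclass
class Response:
    status: int
    location: Optional[str] = None  # redirect target, only set for a 307


def handle_blob(state: BlobState,
                restorations_today: int,
                signed_url: Optional[str] = None,
                quota: int = DAILY_QUOTA) -> Response:
    """Pick the HTTP response for /blob/$hash, checking cases in list order."""
    if state is BlobState.MISSING:
        return Response(404)
    if state is BlobState.LIVE:
        # Live files never count against the restoration quota.
        return Response(307, location=signed_url)
    if restorations_today >= quota:
        return Response(503)
    # A real handler would start the Glacier restoration here.
    return Response(202)
```

Because the live check comes before the quota check, files that are already in S3 stay downloadable even after the day's restorations have run out.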

Initially, HTTP 503 was suggested for every result other than a redirect, but I feel it's better to use distinct status codes so that programs can detect situations without checking the human-readable output. Also, returning a status code from the 2xx category for an initiated restoration seems like a semantic improvement.
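
To show what distinct codes buy a calling program, here is a rough sketch of how a client might branch on the status alone, without reading the response body. The function name `next_action` and the action strings are made up for illustration; only the four status codes come from the list above.

```python
from http import HTTPStatus


def next_action(status: int) -> str:
    """Map a /blob/$hash status code to what a client should do next."""
    if status == HTTPStatus.TEMPORARY_REDIRECT:   # 307: file is live
        return "download"
    if status == HTTPStatus.ACCEPTED:             # 202: restoration started
        return "retry-later"
    if status == HTTPStatus.SERVICE_UNAVAILABLE:  # 503: quota exceeded
        return "quota-exceeded"
    if status == HTTPStatus.NOT_FOUND:            # 404: no such object
        return "missing"
    return "unexpected"
```

Had every non-redirect result been a 503, the last three branches would collapse into one and the client would have to parse text to tell them apart.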

To keep the code reasonably maintainable, I've actually thought about coupling and cohesion from the start, which is a first. It looks like the Software Engineering 110 knowledge that Dave bestowed upon me is working. Hopefully I can hold out on major refactoring for longer than I usually do. Currently the bulk of the Python code is spread out across several files:

backend.py — file restoration and quota business logic
errors.py — a variety of descriptive custom exceptions
storage.py — methods interacting with Amazon Glacier
web.py — the web interface which sits atop the backend

A tricky issue is mapping Debian package filenames, which are thankfully unique, to the S3 object keys, which in this situation are 160-bit hashes. Without a database that records this mapping the lookup is impossible, but access to one may be on the horizon, should this prototype prove useful.
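
For illustration only, the kind of lookup such a database would provide can be sketched with SQLite from the Python standard library. Everything here is an assumption: the table layout, the function names, and the use of SHA-1 (chosen simply because it produces 160-bit digests). The real database, if access materialises, may look nothing like this.

```python
import hashlib
import sqlite3
from typing import Optional


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a filename -> hash index; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "filename TEXT PRIMARY KEY, hash TEXT NOT NULL)"
    )
    return conn


def add_file(conn: sqlite3.Connection, filename: str, content: bytes) -> str:
    """Record a file under its SHA-1 digest (assumed to be the object key)."""
    digest = hashlib.sha1(content).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO files (filename, hash) VALUES (?, ?)",
        (filename, digest),
    )
    conn.commit()
    return digest


def lookup_hash(conn: sqlite3.Connection, filename: str) -> Optional[str]:
    """Return the object key for a filename, or None if it is unknown."""
    row = conn.execute(
        "SELECT hash FROM files WHERE filename = ?", (filename,)
    ).fetchone()
    return row[0] if row else None
```

With an index like this, a filename-based route could resolve the name to a hash and then hand off to the existing /blob/$hash logic.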