Friday, July 23, 2010

GetDeb: Archive traffic distribution

GetDeb's user base has grown exponentially since it started in 2006. The migration to a proper APT repository, while providing important benefits, is also a big technical challenge. The GetDeb and PlayDeb repositories are presently configured by more than 30k users, and our mirror pool serves tens of gigabytes during traffic peaks.
In 2007 we had a single server and the traffic was unaffordable, so we gathered some mirrors and developed a PHP script that validated file availability on the candidate mirrors and then redirected users to them (using an HTTP redirect). This script was poorly developed but sufficient for a long time.
Before the move to APT, file requests were human-originated, coming from web clicks. Now this script is used massively by automatic system upgrades, and its original faults have a much more serious impact. It needs to be replaced.

I have checked existing solutions for mirror distribution:

APT mirror: method - APT supports a specific mirror: method which dynamically obtains a mirror from a URL. However, it is transaction based: the same archive is used for all requests after the initial retrieval. This means that at the beginning of a transaction it must get the URL of a mirror which provides all the files required by the subsequent operations. For GetDeb this is a major limitation. Since we have very frequent updates (sometimes hourly), most of the mirrors would be unavailable for mirror selection because they would be out of sync; even if they do have the packages for that specific transaction, there is no way to know in advance. This issue is not present with HTTP redirects: we always return the package index from the master server, and files are obtained from individual mirrors as long as they match the master server version, regardless of the overall mirror status.
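To illustrate the redirect rule described above, here is a minimal sketch. It assumes the standard APT archive layout, where index files live under dists/ and packages under pool/; the function name plan_response, the 302 status and the example hosts are hypothetical and not taken from the GetDeb code.

```python
from typing import Optional, Sequence, Tuple


def plan_response(path: str, mirrors_with_file: Sequence[str]) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, Location header) for one APT request.

    mirrors_with_file lists the base URLs of mirrors already known to hold
    a copy of this file matching the master (how that is checked is out of
    scope here).
    """
    rel = path.lstrip("/")
    # Index files are always answered by the master, so clients never see
    # a stale package list from an out-of-sync mirror.
    if rel.startswith("dists/") or not mirrors_with_file:
        return 200, None
    # Package files can come from any mirror that has this exact file,
    # whatever the sync state of the rest of that mirror.
    return 302, mirrors_with_file[0].rstrip("/") + "/" + rel
```

When no mirror has the file yet, the sketch falls back to serving it from the master instead of redirecting.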

MirrorBrain - MirrorBrain is used by mainstream projects like openSUSE's build service and OpenOffice, so it was a strong candidate. After some research I found that it detects file availability using a database, which must be kept current by a mirror scan tool that does a full mirror scan (file info: size, last modified). While this may be great for most scenarios, I don't think it is as efficient as an on-demand mirror check. Our slowest mirror took more than 10 minutes for a full scan, so we would need large intervals between scans, increasing the risk of redirection to a failed mirror.

mirror-selector - Because I strongly believe in the technical merit of the on-demand scan, I decided to implement a mirror selection system from scratch using Python.
The utility/project name is "mirror-selector". It runs as a standalone HTTP server whose only purpose is to handle static file GET requests: it checks that the file is available in a local directory (it must be run on a local mirror) and then redirects to a mirror, after checking that an exact copy of the file is available there.
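The on-demand check could look roughly like the sketch below. It is not the actual mirror-selector code: it assumes that "exact copy" means the mirror's Content-Length and Last-Modified headers match the local file's size and modification time, and the names headers_match, pick_mirror and the injected head callable are hypothetical.

```python
import os
from email.utils import parsedate_to_datetime
from typing import Callable, Mapping, Optional, Sequence


def headers_match(local_size: int, local_mtime: int, headers: Mapping[str, str]) -> bool:
    """True if a mirror's HEAD response headers describe the same file."""
    try:
        remote_size = int(headers["Content-Length"])
        remote_mtime = int(parsedate_to_datetime(headers["Last-Modified"]).timestamp())
    except (KeyError, ValueError, TypeError):
        return False  # missing or unparsable header: treat as "not a copy"
    return remote_size == local_size and remote_mtime == local_mtime


def pick_mirror(
    rel_path: str,
    local_root: str,
    mirrors: Sequence[str],
    head: Callable[[str], Optional[Mapping[str, str]]],
) -> Optional[str]:
    """Return the URL to redirect to, or None if no mirror qualifies.

    head(url) is assumed to perform an HTTP HEAD request and return the
    response headers, or None when the mirror does not answer.
    """
    try:
        st = os.stat(os.path.join(local_root, rel_path))
    except OSError:
        return None  # not in the local mirror, so nothing to compare against
    for base in mirrors:
        url = base.rstrip("/") + "/" + rel_path
        headers = head(url)
        if headers is not None and headers_match(st.st_size, int(st.st_mtime), headers):
            return url
    return None
```

Passing head in as a parameter keeps the comparison logic separate from the networking, so it can be exercised without touching a real mirror.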

The HTTP server uses a fixed-size thread pool; each web client request is handled on its own HTTP server thread. When mirror-selector starts, a thread is started for each mirror, and each mirror thread provides an input queue which may be used by any HTTP server thread. With this architecture all requests related to a given mirror are handled on a single thread, which makes it easy to reuse the same TCP connection for multiple requests using HTTP 1.1 Keep-Alive. The caching facility is also simpler to implement because it works on a per-thread basis.
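The thread-per-mirror design can be sketched as follows. This is an illustration of the architecture only, not the code from lp:mirror-selector: the class name MirrorWorker, its methods and the check callable are hypothetical, and the cache never expires, which a real implementation would need to handle.

```python
import queue
import threading


class MirrorWorker(threading.Thread):
    """One thread per mirror, fed through its own input queue."""

    def __init__(self, base_url, check):
        super().__init__(daemon=True)
        self.base_url = base_url
        self.check = check  # hypothetical callable(base_url, path) -> bool
        self.requests = queue.Queue()
        self.cache = {}  # only touched by this thread, so no lock is needed

    def run(self):
        while True:
            item = self.requests.get()
            if item is None:  # sentinel posted by stop()
                break
            path, reply = item
            if path not in self.cache:
                # A real check would reuse one keep-alive connection here,
                # which is safe because only this thread talks to the mirror.
                self.cache[path] = self.check(self.base_url, path)
            reply.put((self.base_url, self.cache[path]))

    def ask(self, path):
        """Called from any HTTP server thread; returns a queue holding the answer."""
        reply = queue.Queue(maxsize=1)
        self.requests.put((path, reply))
        return reply

    def stop(self):
        self.requests.put(None)
```

An HTTP server thread would call ask(path) on each worker and then wait on the returned queues with a timeout, redirecting to the first mirror that answers with True.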

The code is available on Launchpad: bzr branch lp:mirror-selector (check the README to test it). It should be considered alpha.

GetDeb's/PlayDeb's main archive pool has already been switched to mirror-selector. We may intermittently swap back to the legacy selector, as serious problems may still be found.

To check whether it is available, and to see some stats:
http://archive.getdeb.net/status/

1 comment:

  1. I had to disable the new selector because after about 1h the mirror check threads start reporting "Temporary failure in name resolution". There is no network or DNS connectivity problem, and retrying does not fix it, only restarting does. After some research on Google, this might be related to urllib2 leaking file descriptors.
