Debian Wheezy: US$19 Billion. Your price… FREE!

As many would know, Debian GNU/Linux is one of the oldest and largest Linux distributions available for free. Since it was first released in 1993, several people have analysed its size and produced cost estimates for the project.

In 2001, Jesús M. González-Barahona et al. produced an article entitled "Counting Potatoes", an analysis of Debian 2.2 (code named Potato). When Potato was released in August 2000, it contained 2,800 source packages of software, totalling around 55 million lines of source code. Using David A. Wheeler's sloccount tool to apply the COCOMO model of development effort, and an average developer salary of US$56,000, González-Barahona calculated that building Debian 2.2 from scratch would have a projected development cost of US$1.9 billion.

In 2007, an analysis entitled 'Macro-level software evolution: a case study of a large software compilation' by Jesús M. González-Barahona, Gregorio Robles, Martin Michlmayr, Juan José Amor and Daniel M. German was released. It found that Debian 4.0 (codename Etch, released April 2007) had just over 10,000 source packages of software and 288 million lines of source code. This analysis also delved into the dependencies between software packages, and the update flow between Debian releases (not all packages are updated with each release).

Today (February 2012), the current development version of Debian, codenamed Wheezy, contains 17,141 source packages of software, though as it's still in development this number may change over the coming months.

I analysed the source code in Wheezy, looking at the "original" software that Debian distributes from its upstream authors, without including the additional patches that Debian Developers apply to this software, or the package management scripts (used to install, configure and de-install packages). One might argue that these patches and configuration scripts are the added value of Debian; however, in my analysis I only examined the 'pristine' upstream source code.

By using David A. Wheeler's sloccount tool and an average developer wage of US$72,533 (using median estimates from Salary.com and PayScale.com for 2011), I summed the individual results to find a total of 419,776,604 source lines of code for the 'pristine' upstream sources, across 31 programming languages — including 429 lines of Cobol and 1,933 lines of Modula-3!
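
For the curious, the arithmetic behind these numbers is sloccount's basic COCOMO estimate. Here's a minimal Perl sketch, assuming sloccount's documented defaults (the 'organic' coefficients 2.4 and 1.05, and a 2.4 overhead multiplier); the 'embedded' coefficients are my assumption for what a 'complex' rating would use:

use strict;
use warnings;

# Basic COCOMO effort coefficients (Boehm). sloccount defaults to
# 'organic'; 'embedded' is my guess at what a 'complex' rating uses.
my %mode = (
    organic  => [ 2.4, 1.05 ],
    embedded => [ 3.6, 1.20 ],
);

sub cocomo_cost {
    my ( $sloc, $salary, $which ) = @_;
    my ( $a, $b ) = @{ $mode{$which} };
    my $effort_pm = $a * ( $sloc / 1000 )**$b;   # estimated effort, person-months
    my $overhead  = 2.4;                         # sloccount's default multiplier
    return ( $effort_pm / 12 ) * $salary * $overhead;
}

# e.g. roughly the Linux kernel's ~10M SLOC (discussed below), at the 2011 salary:
printf "US\$%.0f million\n", cocomo_cost( 10_000_000, 72_533, 'organic' ) / 1e6;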

In my analysis, the projected cost of producing Debian Wheezy in February 2012 is US$19,070,177,727 (AU$17.7B, EUR€14.4B, GBP£12.11B), making each package's upstream source code worth an average of US$1,112,547.56 (AU$837K) to produce. Impressively, this is all free (of cost).

Zooming in on the Linux “Kernel”

In 2004, David A. Wheeler did a cost analysis of the Linux kernel project by itself. He found 4,000,000 source lines of code (SLOC), and a projected cost between US$175M and US$611M depending on the complexity rating of the software. Within my analysis above, I used the 'standard' (default) complexity with the adjusted salary for 2011 (US$72K), and deduced that kernel version 3.1.8, with almost 10,000,000 lines of source code, would be worth US$540M at standard complexity, or US$1,877M when rated as 'complex'.

Another kernel costing in 2011 put this figure at US$3 billion, so perhaps there's some more variance here to play with.

Individual Projects

Other highlights by project included:

Project    Version  SLOC (000s)  Projected cost at US$72,533/developer/year
Samba      3.6.1    2,000        US$101M  (AU$93M)
Apache     2.2.9    693          US$33.5M (AU$31M)
MySQL      5.5.17   1,200        US$64.2M (AU$59.7M)
Perl       5.14.2   669          US$32.3M (AU$30M)
PHP        5.3.9    693          US$33.5M (AU$31.1M)
Bind       9.7.3    319          US$14.8M (AU$13.8M)
Moodle     1.9.9    396          US$18.6M (AU$17.3M)
Dasher     4.11     109          US$4.8M  (AU$4.4M)
DVSwitch   0.8.3.6  6            US$250K  (AU$232K)

Debian Wheezy by Programming Language

The upstream code that Debian distributes is written in many different languages. ANSI C, with 168,536,758 lines, is the dominant language (40% of all lines), followed by C++ at 83,187,329 (20%) and Java with 34,698,990 (8%).

[Chart: breakdown of Wheezy by programming language]

If you are interested in finding the line count and cost projections for any of the 17,000+ projects, you will find them in the raw data CSV.

Other Tools and Comparisons

Ohcount is another source code cost analysis tool. In March 2011 Ohcount was run across Debian Sid: its results are here. In comparison, its results appear much lower than sloccount's. There's also the Ohloh.net Debian estimate, which only finds 55 million source lines of code and a projected cost of US$1B. However, Ohloh uses Ohcount for its estimates, and seems to be missing around 370 million SLOC compared to my recent analysis.

Summary

Over the last 10 years, the cost to develop Debian has increased ten-fold. It's interesting to know that US$19 billion of software is available to use, review, extend, and share, for the bargain price of $0. If we were to add in Debian patches and install scripts, this projected figure would only increase. If only more organisations would realise the potential they have before them.

Need help with Linux (including Debian), Perl, or AWS? See www.jamesbromberger.com.

Load Balancing on Amazon Web Services

I’ve been using Amazon’s Elastic Load Balancing (ELB) service for about a year now, and thought I should pen some of the things I’ve had to do to make it work nicely.

Firstly, when using HTTP with Apache, you probably want to add a new log format that, instead of using the source IP address of the connection in the first field, uses the extra header that ELB adds: X-Forwarded-For. It's very simple, something like:

LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" fwd_combined
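
A matching CustomLog directive puts it into effect (the log path here is just illustrative):

CustomLog /var/log/apache2/access.log fwd_combined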

Then, wherever you already have a CustomLog statement using the "combined" format, just switch it to "fwd_combined". Next, if you're trying to use your bare domain name for your web server, e.g. "example.com" instead of (or as well as) "www.example.com", then with Amazon Route 53 (DNS hosting) you'll get a message about a conflict with the "apex" of the domain. You can get around this using the elb-associate-route53-hosted-zone command line tool, with something like:

./elb-associate-route53-hosted-zone ELB-Web --region ap-southeast-1 --hosted-zone-id Z3S76ABCFYXRX6 --rr-name example.com --weight 100

And if you want to also use IPv6:

./elb-associate-route53-hosted-zone ELB-Web --region ap-southeast-1 --hosted-zone-id Z3S76ABCFYXRX6 --rr-name example.com --weight 100 --rr-type AAAA

If you're using HTTPS, then you may have an issue if you choose to pass your SSL traffic through the ELB (just as a generic TCP stream). Since the content is encrypted, the ELB cannot modify the request header to add the X-Forwarded-For. Your only option is to "terminate" the incoming HTTPS connection on the ELB, and then have it establish a new connection to the back end instance (web server). You will need to load your certificate and key into the ELB for it to correctly represent itself as the target server. This adds overhead on the load balancer, which has to decrypt (and optionally re-encrypt to the back end), so be aware of the costs.
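
With the same command line tools as above, creating an HTTPS listener that terminates on the ELB looks roughly like this (from memory, so check the tool's help for the exact listener syntax; the certificate ARN is a placeholder for your uploaded certificate):

./elb-create-lb-listeners ELB-Web --region ap-southeast-1 --listener "protocol=HTTPS,lb-port=443,instance-port=80,cert-id=arn:aws:iam::123456789012:server-certificate/example-com"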

One of the nice things about having the ELB in place, even for a single instance web site, is that it will do health checks and push the results to CloudWatch. CloudWatch will give you pretty graphs, but also permit you to set alarms, which may be pushed to the Amazon Simple Notification Service (SNS) – which in turn can send you an email, or call a URL to trigger some other action that you configure (send an SMS, or sound a klaxon?).

Amazon Linux, EC2, S3, Perl, SSL Wildcard Certificates

Amazon Linux, one of the distributions recommended for Amazon EC2 customers, recently had an update: 11.09. This brought updates to a whole raft of libraries, including the Perl LWP (libwww) library in perl-libwww-perl-5.837 (previously 5.833), and other related modules.

One of the changes is a new default for "verify hostname" in the SSL layer used by LWP::UserAgent; previously, verification that the certificate matches the given hostname was disabled by default, and in an effort to improve security, it has been turned on. You'll see this mentioned in the LWP::UserAgent documentation: "The no checks behaviour was the default for libwww-perl-5.837 and earlier releases". What's unusual is that the no-checks behaviour is DIFFERENT in Amazon Linux's package of 5.837 compared to this statement – I suspect this one line got back-ported into 5.837 to change the default 'in the interest of security'.

Unfortunately, this breaks a lot of scripts and other modules/libraries out there, one of which is the Amazon-issued S3 library. S3 is the Amazon Simple Storage Service (SSS => S3), with which a user (customer) has their data arranged in "buckets", with data in objects identified by 'keys' (like a file name). All data is put to, and read from, the S3 service over HTTPS – it's not locally mounted (though some clever FUSE tools may make it look that way, it is still HTTPS underneath).

A bucket in S3 has a name, and in the example I have, the name looks like a domain name (images.foo.com). When accessing this bucket, the Amazon S3 Perl library connects to an alias hostname (CNAME) made by combining the bucket name with "s3.amazonaws.com", so our example here becomes "images.foo.com.s3.amazonaws.com". This site is using a wildcard certificate for "*.s3.amazonaws.com" (you can see it as a Subject Alternative Name extension in the SSL certificate). This permits the certificate to be considered valid for any hostname directly under the s3.amazonaws.com domain. However, per RFC 2818, the only thing the wildcard matches before "s3.amazonaws.com" is a single label – not a (seemingly valid) dotted domain name. So "com.s3.amazonaws.com" is OK with a wildcard certificate, but "images.foo.com.s3.amazonaws.com" is not.
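
To make the single-label rule concrete, here's a toy Perl sketch of the RFC 2818 matching (illustrative only; this is not the code IO::Socket::SSL actually runs):

use strict;
use warnings;

# RFC 2818: '*' matches exactly one domain name component (label)
sub wildcard_matches {
    my ( $pattern, $host ) = @_;
    return 0 unless $pattern =~ /^\*\.(.+)$/;
    my $suffix = $1;
    return $host =~ /^[^.]+\.\Q$suffix\E$/ ? 1 : 0;    # one dot-free label, then the suffix
}

print wildcard_matches( '*.s3.amazonaws.com', 'com.s3.amazonaws.com' ), "\n";             # 1
print wildcard_matches( '*.s3.amazonaws.com', 'images.foo.com.s3.amazonaws.com' ), "\n";  # 0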

There are several solutions. The easiest is to turn off SSL certificate verification again in your script. A handy environment variable may be set to do this: $ENV{PERL_LWP_SSL_VERIFY_HOSTNAME}=0. Alternatively, if you are using LWP directly, you can pass a constructor parameter to LWP::UserAgent of ssl_opts => { verify_hostname => 0 }. Both effectively abandon any certificate verification.
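
Concretely, the two documented workarounds look like this:

use strict;
use warnings;
use LWP::UserAgent;

# Option 1: the environment variable (set before any requests are made)
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;

# Option 2: the per-agent constructor option
my $ua = LWP::UserAgent->new( ssl_opts => { verify_hostname => 0 } );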

Somewhat more complicated, you can define a custom validation callback (procedure) to decide for yourself whether the certificate is valid. This is in contravention of RFC 2818, and seems like a lot more hassle than it's worth.

Perhaps the easiest solution of all is to avoid using a period/dot/'.' in S3 bucket names, thereby side-stepping the strict-checking conflict entirely.

The most important lesson is how lax we have been at verifying SSL certificates, and how we have come to rely on that just working. It is good to verify that the SSL certificate matches the host in scripts: I don't want to start communicating authentication information over an SSL channel if we can easily see we've been duped at the remote end. I was not familiar with wildcard certificates only being valid for one component of a domain name; this reduces their effectiveness somewhat in my mind. They've always been more expensive than standard certificates, but being better aware of the FQDNs they will validate on is useful.

I’ve seen several other instances outside of this S3 example where invalid certificates have blindly been accepted by scripts (a CloudWatch example I saw with a redirect ‘hop’ through an SSL site); this default change from lax to legitimate certificates may actually encourage better adoption of the security that SSL can give — when we’re already paying for SSL certs — or lead us (as developers and architects) to acknowledge when we’re actively ignoring that layer of protection.

It's early days now, but as this default change filters into Linux distributions (and Perl distributions on other platforms), we'll start to see a lot of FAQs on this.

Rusty’s talk at PLUG

What a week for PLUG. After months of organisation, we were honoured by Rusty Russell flying to Perth for PLUG. He presented a talk entitled "Coding: let's have fun", which ranged from the simplicity and beauty of a regular expression engine in around 20 lines of C, to a wireframe flight sim from a recent IOCCC where the code itself was formatted in the outline of an aircraft, and then a potted history of his experiences and where he has found joy in coding.

After a pizza dinner break for the 46 (or thereabouts) people present, Rusty was corralled into a panel discussion with Dr Chris McDonald from UWA CompSci and Assistant Professor Robert Cunningham from UWA Law, for a chat on various topics; it seems cloud computing was on everyone's mind.

The PLUG AV crew streamed this event live, and recorded it: videos of the talk (93 MB mp4) and the panel (115 MB mp4) are now available (both are around an hour and a quarter). Older videos are here.

Rusty was very generous in refusing to accept the collected funds for his expenses, so we now have money to repeat this exercise of flying in another speaker. It's up to PLUGgers to decide who they would like to see next! Time-wise, it's likely to be Q2 next year, as PLUG has a full schedule until then.

Big thanks to Chris, Robert and Rusty for speaking – they were all excellent. Also to Daniel Hamrsworth for co-ordinating tickets, the AV crew for their recording, and for everyone who put their hand in their pocket to help the event come together.

Need a new PC. Time for a desktop?

My 2-year-old Dell Studio 1558 is doing it again: slowing to a snail's pace, heating to an inferno, and then spontaneously powering off (which I think is a safety cut-out when the CPU temperature reaches 100°C).

I had Dell come and replace parts on this laptop about 9 months ago when similar symptoms developed. I originally purchased this unit while I was in the UK, around January 2010 I think, and was hoping to get 3 years out of it. Sadly, at around 20 months old, I'm getting too frustrated to put up with it. I'm now living in Australia, and having any multi-national PC company honour their warranty internationally is a challenge. Heck, the worst offender in this scenario is Sony, who want £20 just to answer the phone!

Now that I'm no longer living in a flat with a very transient lifestyle (the travel is gone, replaced by a 1-year-old boy), I'm much more rooted to my home office desk. So, in light of this, I'm thinking of getting a desktop with a reasonable screen. I saw Russell Coker's post about a 27″ whopper from Dell for around AU$899, and was wondering what to pair it with, or whether to go for a slightly smaller screen. Then come the questions of the all-in-ones, and the touchscreens that are around.

What I'd like is something that's got a few (2?) USB 3 ports for the next few years of my accessory usage, and SATA 3 so I can throw in a fast SSD. I'd potentially run Debian on this, so I possibly don't want a Windows licence. 4 GB RAM minimum, possibly 8.

So looking around, it's a quagmire of details that 15 years ago I used to thrive on. Do I care about UEFI instead of a traditional BIOS? Do I really need SATA 3 instead of SATA 2? What about legacy (!) 1394? HDMI connector – yes please – but do I still want a VGA port? What about a second HDMI? Hmm. That 27″ screen's native resolution is more than most on-board graphics can drive… perhaps drop to a 24″ screen. And what size should the case be: ATX, mini-ITX, smaller?

Then comes the choice of pre-built or custom built. Dell, I'm pretty upset about your product quality right now. HP, you've (a) killed my DreamScreen recently, and (b) sent your entire business up the creek with indications that the PC business is going away or being sold off. Lenovo? Acer?

So I'm at a computing crossroads. I can't be bothered to build my own PC again – I've been living on laptops for almost a decade now. But they are expensive, and when something goes wrong there's very little to salvage. Laptops suck, but do desktops suck less? Vendors suck, but then so does the time wasted building your own. I think tablets suck for doing lots of data input (programming). All-in-ones – not sure. Touchscreens – probably a gimmick.