Thursday, November 22, 2007

25 million green bottles

There are, as you'd expect, 1001 stories about the loss of 25 million records relating to children and their parents.  Child benefit is one of the most "taken up" government benefits - something like 98% of parents (umm, sorry, children) receive it (versus perhaps 80% for child tax credit). So there's certainly a large number of people affected - the figures of 7.5 million households and 25 million people total look about right.  I've seen this called "DataGate" by the Independent.  Perhaps "Shutting the DataGate after the horse has bolted" may be better.  The story definitely isn't over.  I'm sure that, barring any other major news developments, it will hold space in the first 2 or 3 pages of newspapers for several weeks, and several more instances will doubtless come to light.

If you have a child under 16, your personal details (name, address, bank account, date of birth and national insurance number) were in the lost data.  It's unclear whether, if you used to receive child benefit (i.e. your child is now older than 16), your data was still available on the system, but I suspect not.  Likewise, if you are one of those who are generally off-system (certain members of the military, the police and so on), I suspect that data was held elsewhere - so those who talk about the risk of protected identities being compromised are probably wrong.  It is, sadly, one of the hallmarks of IT the world over that data is held locally in each application for each purpose - so this kind of data exists in dozens of applications across every unit of government, whether central or local, state or national, metropolitan or federal.  When we built the Government Gateway, we looked hard at the data we needed - for instance, to post the PIN, we needed an address; but, once posted, we didn't need it anymore.  So we issued a query to the relevant government back end system, got the address, and then dispensed with it as soon as the envelopes were printed.  But that was relatively easy to do in designing a new system from scratch.  Most systems have been around a lot longer.
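The fetch-use-discard approach described above can be sketched roughly as follows. This is an illustration only, not the Gateway's actual code: `post_pins`, `lookup_address` and `print_envelope` are hypothetical names standing in for the back end query and the print run.

```python
from typing import Callable, Dict, List


def post_pins(
    user_ids: List[str],
    lookup_address: Callable[[str], str],
    print_envelope: Callable[[str, str], None],
) -> Dict[str, str]:
    """Print one PIN envelope per user without keeping the address.

    The only state returned (and so the only state we could store)
    is a posting status per user id.
    """
    status: Dict[str, str] = {}
    for user_id in user_ids:
        # Ask the system that owns the address, at the moment we need it.
        address = lookup_address(user_id)
        # Use it once, for the envelope.
        print_envelope(user_id, address)
        # Drop our reference; the address is never written to our own store.
        del address
        status[user_id] = "posted"
    return status
```

The point of the design is that a later dump of this system's own data would contain user ids and statuses, but no addresses to lose.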

Let me state two things up front:

1)  Loss of sensitive data is not just a UK government problem or even just a UK problem.  It's prevalent all around the world, in corporates and governments, and made ever easier by the increasingly wide access to email and the Internet - and, of course, by the ever increasing number of systems that store all the data that they ever need right in their main database.  It's almost like we should be surprised if our data isn't out there in the wild world.  Never mind worries about putting some personal information on Facebook, your data is already on several other sites, for anyone malicious or maligned to access.  There's a reason that whenever you see people in a film going into a secure nuclear area, there are two of them and they each have a key that has to be turned simultaneously.  Putting control in the hands of one person can be a recipe for disaster. This latest issue comes on top of:

    • An event just a couple of months ago when a disc being sent to Standard Life and containing details of 15,000 people was lost (sadly also by HMRC)
    • 94 million Visa and Mastercard accounts exposed at TJ Maxx
    • Bank of America's loss of backup tapes containing credit card information for 1.2 million Americans
    • The exposure of the records of 800,000 people at UCLA
    • Reed Elsevier's loss of personal information on 300,000 Americans
    • Transaction data for 180,000 customers of Ralph Lauren
    • The use of unsecure email to send out classified nuclear secrets (that's a link to the story by the way, not to the actual secrets)
    • ChoicePoint's loss of 163,000 individuals' records (and the accompanying ID fraud)
    • Hackers in Ohio University's systems took 137,000 records of students and alumni
    • The loss of doctors' personal information on an NHS website
    • The loss of 26 million records for US veterans
    • and, golly, I've just found this extraordinarily comprehensive list of data breaches.

2) This isn't a problem of why the CDs weren't encrypted or why the data wasn't sent by some other, presumably safer, means.  It's about several lengthy failings in process: who can access the data, how easy is it to get a full database dump, what controls are there on writing data to CD, who needs to approve what and so on.  In the technical world that most of us operate in we're used to a window popping up and saying "hey, stupid, are you sure you want to delete that entire list of folders and files?".  There is no "are you sure you want to send this data by post, dummy?" dialogue box, but there would have been checks and balances before it got to that stage.

It must have been a long chain of events to get to this point.  A full download of every data item in any of the government's big systems isn't the kind of thing that can be just asked for - I'd go as far as to say that it's a one-time request requiring special work.  It's possible, though, that the extract had already been prepared for some other reason in the past - and, if that was the case, perhaps many of the usual controls would have been bypassed.  Imagine the conversation: "you need an extract? Well, normally that would take us 3 months but I just happen to have one over here, only one previous careful owner, that we took in April 2007".

I'd bet that there isn't a requirement in the specification of any government system anywhere in the world to be able to "hit F12 to dump database to two CDs", password protected or not.  So my assumption would be that a change request is raised, the IT supplier does a quick check to see how long it will take, the change request gets approved (not as quick to get done as it is to write - perhaps a month or more), the data gets offloaded at the next convenient point in processing and then copied to two CDs by someone technical.  (The supplier is probably EDS: the Child Benefit process and accompanying systems used to belong to DWP and were transferred 4 or 5 years ago to HMRC, but I don't think they were absorbed by the CapGemini contract.)  Lots of people get involved in this process.  There would even have been a discussion about the cost of removing some fields, hashing out others, creating dummy data and so on.  In the end, it sounds like we've got a very big spreadsheet secured by a password when you try to open it.  I'm not even sure that old versions of Excel can handle that many rows so maybe it was just a Word file.  That's a lot of pages.

My guess is that encryption wasn't asked for because the person doing the asking wouldn't have known much about that and the people receiving the data would have known even less, and the technical folks would have wondered about it but would have been busy and so moved on. PKI isn't part of the default desktop installation anywhere in government outside spooksville.  I could get into this a lot more but it's a long time since I worked at the Inland Revenue and even then I wasn't that close to the systems involved here - and I'd be speculating.  Doubtless someone is already working on a report and it will come out under FoI or through the persuasive nature of various journalists and, I'm sure, a series of Internet message boards.

As far as I understand, no one ever actually asked for a "full copy of the entire child benefit database".  The NAO asked for a sample of de-sensitized data.  Typically that's a few tens of records with personal identification information removed - certainly the NI record hashed and probably the bank details removed.  When I did a stint in audit back in my banking days, a typical sample was 30 records - statistically, that's enough to give you a sense of whether everything is in order when you're doing a substantive test.  I'm not sure what NAO were trying to prove - maybe that only appropriate data was stored (perhaps that only parents with children under 16 were in the system?) or perhaps that the fields contained the right data and in the right format (post codes matched what they were supposed to) or maybe they were testing that the population claiming matched the expected population claiming.
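To make the idea of a small, de-sensitized sample concrete, here is a minimal sketch. It is not what HMRC or the NAO actually ran: the field names (`ni_number`, `bank_account`, `sort_code`) and the function `desensitised_sample` are assumptions for illustration, and I've used a keyed hash (HMAC-SHA256) for the NI number rather than a bare hash, so that someone without the key can't simply hash every possible NI number and match them up.

```python
import hashlib
import hmac
import random
from typing import Dict, List, Optional

# Hypothetical names for the fields that are removed outright.
SENSITIVE_FIELDS = ("bank_account", "sort_code")


def desensitised_sample(
    records: List[Dict[str, str]],
    secret_key: bytes,
    sample_size: int = 30,
    seed: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Return a random sample with bank details dropped and NI numbers hashed."""
    rng = random.Random(seed)
    chosen = rng.sample(records, min(sample_size, len(records)))
    cleaned = []
    for record in chosen:
        # Copy everything except the bank details; the source record is untouched.
        row = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
        # Replace the NI number with a keyed hash of it.
        row["ni_number"] = hmac.new(
            secret_key, record["ni_number"].encode("utf-8"), hashlib.sha256
        ).hexdigest()
        cleaned.append(row)
    return cleaned
```

Thirty rows like this would fit in a small text file, which is rather the point: the auditor gets something to test and nobody has to burn a CD.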

Putting aside, then, the questions of whether the data should even have been floating around or what process breakdowns there were, here's a take on the technical aspects of how data should be shipped around:

Most people - as did one commenter on an earlier post - will be asking "why on earth is data being shipped on CD in this day and age?"  A perfectly reasonable question. And one that when you look at the other ways that were probably immediately available, you might briefly think "oh, I see why they'd do it that way" ... right before you clap your hand to your forehead.  Don't think that government (generally, not just the UK) are endowed with the latest hi-tech gear available to one and all.

Two CDs is a fair chunk of data.  Around 1.2GB if both discs were full, based on a standard format of 600MB a disc.  Not much compared with the capacity of the average iPod (even my iPhone has 8GB, I think the entry level classic is now 80GB) or even the average memory stick (2GB is a common size for Vista ReadyBoost).  But a lot of data to ship around nonetheless.
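As a rough sense check of those numbers, here is the back-of-envelope arithmetic. It assumes both discs were full, which nobody outside HMRC knows, and uses the 600MB, 25 million and 7.5 million figures quoted in this post.

```python
DISC_BYTES = 600 * 10**6   # the 600MB-a-disc figure used in this post
DISCS = 2
PEOPLE = 25_000_000        # individuals affected
HOUSEHOLDS = 7_500_000     # households affected

total_bytes = DISC_BYTES * DISCS
bytes_per_person = total_bytes / PEOPLE
bytes_per_household = total_bytes / HOUSEHOLDS

print(f"total: {total_bytes / 10**9:.1f}GB")
print(f"per person: {bytes_per_person:.0f} bytes")
print(f"per household: {bytes_per_household:.0f} bytes")
```

That works out to about 48 bytes a person or 160 bytes a household - not much room for a name, address, bank account, date of birth and NI number unless the file was compressed or the discs held less than I'm assuming.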

Let's take email as one option - most people would consider that first:

  1. Email systems in government generally have very small mailbox sizes. A few tens of megabytes is very common; even as much (as little?) as 200MB would be uncommon.  This is not like Google where you get a couple of gigabytes or more on signup.  Trying to send 600MB would bust both sender and receiver.
  2. Bandwidth between departments is relatively small.  More accurately, there's lots of bandwidth along the backbone that links departments, but individual links to that backbone are typically small - 1.5MB/s, sometimes less - and are set as a function of the size of the department.  I'd expect NAO to be one of the smallest (and I'm actually pretty sure, but not certain, that they're not on the GSI) and HMRC to be one of the largest.  Network performance in offices is load dependent and likely to be slow, making uploading an attachment of 600MB to the server interminable.
  3. Many government staff don't have access to email at all (if they are routinely processing citizen tax transactions, it's felt there's no need).  Likewise, even fewer have access to the Internet.
  4. Firewalls on the email systems limit attachments to 2MB, sometimes 4MB, rarely much more than that (there are exceptions but they are rare).
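The limits in that list can be put into a quick feasibility check. This is a sketch under stated assumptions: `EmailRoute`, `blockers` and `transfer_hours` are names made up for illustration, the limits plugged in are the rough figures from the list, and because "1.5MB/s" could be read as megabits or megabytes a second, the link speed is passed in megabits so either reading can be tried.

```python
from dataclasses import dataclass
from typing import List

MB = 10**6  # decimal megabyte, in bytes


@dataclass
class EmailRoute:
    attachment_limit_mb: float   # firewall limit on a single attachment
    mailbox_quota_mb: float      # total mailbox size


def blockers(size_mb: float, route: EmailRoute) -> List[str]:
    """List the limits that a file of size_mb would break on this route."""
    problems = []
    if size_mb > route.attachment_limit_mb:
        problems.append("attachment limit")
    if size_mb > route.mailbox_quota_mb:
        problems.append("mailbox quota")
    return problems


def transfer_hours(size_mb: float, link_megabits_per_s: float) -> float:
    """Best-case hours to push size_mb over a link with nothing else on it."""
    bits = size_mb * MB * 8
    seconds = bits / (link_megabits_per_s * 10**6)
    return seconds / 3600


route = EmailRoute(attachment_limit_mb=4, mailbox_quota_mb=200)
print(blockers(600, route))
print(f"{transfer_hours(600, 1.5):.2f} hours at 1.5 megabits/s")
print(f"{transfer_hours(600, 12):.2f} hours at 1.5 megabytes/s")
```

Even on the generous reading of the link speed, a single 600MB disc image fails on both the attachment limit and the mailbox quota before bandwidth comes into it.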

But had these all been overcome, the file would have moved between HMRC and NAO within the secure network of government departments known as the GSI.  Risk of interception would have been low (the GSI is regularly penetration tested and is built to a high standard).  But, realistically, this wasn't an option for anyone in HMRC. Government email systems are just not built for files of this size - and I believe that even those that the rest of us use day to day would fall over after trying to digest a file of 1.2GB.  My entire PST file in Outlook is only about one gigabyte now (and it has 2 years of email in it, the rest is in archives). With all these issues - and the continuing sense that e-mail is somehow unsafe (like all things on the Internet) compared with "sending 2 CDs by post" (!) - I would not be at all surprised to hear that CDs by post is the default choice for exchanging even relatively small amounts of data between departments, agencies and 3rd parties (such as pension companies and banks).

Sometime in 2002 the team I ran in the Cabinet Office built, on behalf of the Criminal Justice folks, a secure email system.  It was the brainchild of the same guy that thought up the Gateway as a pan-government authentication system and, I think, ukonline (which was known originally as me.gov).  It was designed to allow lawyers working on criminal cases to exchange, securely, documents between their offices and the courts (and each other).  Remote users could use a web-based email front end or their own Outlook client and everything in between would have been encrypted and secured.  At the time we deployed it, the common way to send such data around was to fax it (you remember the way it used to be done - you'd phone them up, say "stand by the fax machine", then they'd put the phone down and go to their fax, nothing would happen because it was out of paper, or it was already receiving someone else's 100 page fax, all on that slightly fuzzy thermal-style paper).  It was a comedy and needed to be sorted, hence the requirement for the secure mail.  This solution was made available to the whole of government, but take-up was low.  I'm not sure that this would have been any better - it would have had the same limitations of bandwidth, firewalls, and so on.

In our own team, and before the secure mail system, we also used various commercial products to exchange secure data (the systems we built and ran were at least restricted and were sometimes higher).  They were based on hosted servers.  But the same issues of bandwidth, firewalls and so on would have applied.  On top of that, both parties have to be connected to the secure system - so there has to be a set up process: passwords, keyfobs and so on need to be exchanged in advance and kept current. All of those things complicate the issue enormously - especially when such exchanges are not routine and day to day.  What usually happens is that they fall into disuse, the processes break down and then rather than take the time to set them up again, people look for a quicker way - popping 2 CDs into an envelope and putting them in the mail for instance.

So, no, email isn't a viable alternative for large volumes of data.  In fact, uploading and downloading to websites via secure spaces, even when encrypted and super-protected, probably isn't a viable way of shifting data around outside of your own secure network within the building, except when you're talking about project-type information and using SharePoint or similar tools - and when you're moving data that you wouldn't mind someone else finding by accident if you haven't set up your server security quite right.

Lots of companies offer solutions to these - the usual products chasing a problem to solve.  There will be lines of them queuing up to offer their services to governments (globally) and their IT suppliers over the next few weeks.  They will offer super-duper-extra-double encryption, they'll say that they can identify rogue data being sent by email and divert it, they can check staff activities on the Internet and make sure they're not doing things they shouldn't, they can spot people trying to download data off a system and copy it to their iPod and so on.  Of course, they spot the problems they're designed to spot; not the ones that happen off the beaten track or where the procedures are deliberately over-ridden.

But, on the face of it, had this data been copied to an iPod and hand-carried to where it was going and copied on to another iPod, we might never have known about this.  So iPods to come equipped with a government-approved fingerprint reader as the next step?  Or maybe personal memory sticks with dual control - sender and receiver fingerprint readers.

This is an undeniably serious problem.  There may have been many serious breaches as noted above, but few have stretched as far as the child benefit data.  The solution isn't, however, simple.  And it isn't about secure ways of exchanging data - at least not initially.  There's nothing to say that, had this data arrived at the NAO securely, it wouldn't then have been left on an unsecure laptop and been stolen from the back of a car for instance.

So:

  1. All of the processes around access to patient, customer, taxpayer, citizen etc data in every department, agency, non-departmental public body and local authority are going to go through a rapid review.  New standards will be enforced: senior management sign-off, dual control (keys round the neck and everything), IT supplier held accountable for where data is put and so on. This will take time and still things will be missed and it will happen again - let's hope that it's not on this scale, but it will happen again.
    • Lock down data exchange now.  People come to the data, not the data to the people. Until better processes are in place, this should stop the problem from getting worse.
  2. All staff should be taught the "green cross code" of using computers. The very basics need to be re-taught.  For that matter, the code should be taught at schools, colleges and libraries.
  3. The spooks should lead a review of deploying encryption technology to departments holding individual data so that all correspondence is encrypted automatically in transit using appropriate levels of protection for the job.  This will be expensive.  The alternative though is to make encryption optional - but because you can choose, sometimes people will choose not to (because it's too slow or something) and the problem will recur.
  4. Systems being architected now and those to be architected in the future will look at what data they really need to hold and for how long and will, wherever possible, make transient use of data held elsewhere.  The mother of all ID databases would be a good place to start.
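The dual control in point 1 (two keys, two people) can be sketched in a few lines. This is a toy model, not any department's actual approval workflow; `ExtractRequest` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ExtractRequest:
    requester: str
    description: str
    approvers: Set[str] = field(default_factory=set)

    def approve(self, person: str) -> None:
        # The person asking for the data can never be one of the key holders.
        if person == self.requester:
            raise PermissionError("requester cannot approve their own request")
        # A set means the same person approving twice still counts once.
        self.approvers.add(person)

    def may_release(self, required: int = 2) -> bool:
        """True only once enough distinct people have turned their key."""
        return len(self.approvers) >= required


request = ExtractRequest("junior_official", "full child benefit extract to CD")
request.approve("senior_manager")
print(request.may_release())   # one key turned: not enough
request.approve("data_guardian")
print(request.may_release())   # two different people: release allowed
```

The code is the easy part; the hard part is making sure there's no way to get the data that doesn't go through it.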

All of this will take time.  In the interim, managers in the line of fire are going to have to use common sense and check and recheck when they're asked to provide information to anyone.  Social engineering is alive and well after all.

4 comments:

  1. Anonymous, 9:41 pm

    Thanks Alan for the comprehensive blog! We've successfully used Sharepoint as a way of sharing 20M+ records to our software supplier.

    WCAM progressing very well in the delivery phase. I've recently moved to a GM role creating physical assets. DC

  2. Anonymous, 9:25 am

    Excellent blog. Rebuilding Joe Public's appetite for a single or mother of all databases as you put it, will be much harder now. Silly mistakes like these take years to undo or uncommit. CG

  3. Keith U, 4:39 pm

    Excellent article, confirming some of my suspicions. Would you mind if I quoted it on www.iqtrainwrecks.com

    This is a website run by the IAIDQ. See www.iaidq.org. We would love to have you as a member.

  4. quote away. i'll check out your site.
