Safely using production data to drive testing

Sunday, 21 April 2013

Testing in your development environments using data from your production systems gives you accurate feedback on the latest additions to your application, but how do you avoid breaking Data Security rules?

A problem with incremental development and testing at scale

You’re developing a web app, and you’ve done the right things. You’ve worked hard to get confidence that your code will work at production scale, generating large, realistic data volumes to test your code against. You test and integrate often, and you push your code to production often, because you know that the load and the data produced by users at scale are the real test of your system.

But you hit a wall: your generated data isn’t good enough. It’s neither accurate nor wide-ranging enough, because users don’t always use the software the way you expected. Generating tonnes of accurate data against a changing feature set and schema is costly and a hell of a pain. Frustratingly, there is a database full of real data sitting just over there in production.

You want to use your production data to test your software but you can’t – either the data is just too sensitive to move to less secure environments, or you are working offshore or in the cloud and your Data Protection rules don’t allow you and your team bulk access to production data.

Basic anonymisation processes can be risky

Production data is often a secure store of sensitive personal data; Data Protection officers are keen that it stays that way. If we want that data, we can write a simple script to grab it and anonymise the private bits before it moves.  

But in a Continuous Delivery environment, the data schemas are under constant change – all part of working incrementally and iteratively. And as soon as we make feature changes that require changes to the data schemas, we risk exposing important data.  

Using a Blacklist Anonymisation process puts us one schema change away from a data security leak. We need something more watertight.

Blacklist Anonymisation 

Obscuring sensitive data in a schema with safe boilerplate copy, through a list of fields to be obscured.
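
To make the risk concrete, here is a minimal sketch of a blacklist anonymiser in Python. It assumes records are plain dictionaries; the field names and the `BLACKLIST` set are hypothetical and chosen only for illustration. Note what happens once `email` is renamed to `contact_email`: the list is stale, so the real value passes straight through.

```python
# Hypothetical list of fields known to hold personal data.
BLACKLIST = {"name", "email"}

PLACEHOLDER = "Lorem ipsum"


def blacklist_anonymise(record):
    """Obscure only the fields named in BLACKLIST; copy everything else as-is."""
    return {
        field: PLACEHOLDER if field in BLACKLIST else value
        for field, value in record.items()
    }


# Before the schema change, both personal fields are obscured.
old_record = {"id": 1, "name": "Ada Smith", "email": "ada@example.com"}
old_safe = blacklist_anonymise(old_record)

# After a rename (email -> contact_email) the blacklist no longer matches,
# so the real address is copied into the "anonymised" output.
new_record = {"id": 1, "name": "Ada Smith", "contact_email": "ada@example.com"}
new_leaky = blacklist_anonymise(new_record)
```

Nothing fails or warns when the rename happens, which is what makes this kind of leak easy to miss.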

A safer option

Using a more mature anonymisation process means our data is more likely to remain safe as the schema develops.

Rather than relying on constant vigilance to manage the list of fields to anonymise, everything gets obscured by default. We specify which fields are fine to let through unobscured: usually the structural data that sets up the relationships between your data items, plus some safe data. The rest is removed or obscured. This way a schema change that renames a field or restructures a collection of data is likely only to make the anonymised data less useful; it won’t leak personal details out of the secure store.

Obscuring data with boilerplate such as ‘Lorem ipsum’ for copy, and placeholder or random values for other data types, fills the gaps in a schema safely whilst retaining the feel of the original data. Tricksy fields may be dropped or nulled.

This gives a large realistic data-set for testing. Many important data points can be gathered around data-size expectations and you should be able to get further confidence in your predictions of software performance.

Whitelist Anonymisation

Obscuring all data in a schema with safe boilerplate copy by default, allowing only very specific fields to pass through unaltered.
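
The same idea can be sketched in Python, again assuming records are plain dictionaries. The `WHITELIST` contents, the field names and the placeholder values are all hypothetical; a real set of per-type defaults would be agreed with whoever owns your Data Protection rules.

```python
# Hypothetical whitelist: structural keys plus data judged safe to expose.
WHITELIST = {"id", "account_id", "created_at"}


def obscure(value):
    """Replace a value with a placeholder of the same basic type."""
    if value is None:
        return None
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return False
    if isinstance(value, int):
        return 0
    if isinstance(value, float):
        return 0.0
    if isinstance(value, str):
        return "Lorem ipsum"
    # Anything harder to handle (nested or unknown types) is nulled.
    return None


def whitelist_anonymise(record):
    """Let whitelisted fields through; obscure every other field by default."""
    return {
        field: value if field in WHITELIST else obscure(value)
        for field, value in record.items()
    }


# A renamed field (contact_email) is not on the whitelist, so it is obscured
# without anyone having to update a list first.
record = {
    "id": 7,
    "account_id": 42,
    "contact_email": "ada@example.com",
    "age": 36,
    "notes": ["called on Monday"],
}
safe = whitelist_anonymise(record)
```

Here a forgotten whitelist entry costs you some test data rather than a leak: the field simply comes out as a placeholder until someone decides it is safe to let through.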

Making the data more useful

With Whitelist Anonymisation we have data safety and data-at-scale, but the data isn’t all that useful yet. A richer dataset would help with exploratory testing, as well as with more in-depth, realistic performance measurements. We can apply special rules to important fields, to ensure that they remain obscured but provide more useful values. For example, geolocation values can be drawn from a known set, or locations can be offset by a random safe amount. Postcodes and zip codes can be tweaked, shortened or relocated, ages varied within ranges, and emails directed to test accounts.

These data adjustments are put in place whilst keeping to the data security aims: usually to ensure there is not enough data remaining that would allow a person to be identified.  Working in partnership with a Data Protection expert to ensure correct compliance is advised.

Graylist Anonymisation

Adjusting specific fields in non-reversible ways whilst anonymising a schema, to leave a wide range of domain specific data available for testing without revealing private information.
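
As a rough sketch, per-field rules can sit alongside the whitelist in Python. Everything here is an assumption made for illustration: the field names, the size of the location offset, the ten-year age bands, the UK-style postcode format and the `example.com` test mailbox. Whether any particular rule is safe enough is a question for your Data Protection expert, not for the code.

```python
import random

# Hypothetical whitelist of fields passed through untouched.
WHITELIST = {"id"}


def offset_location(value, rng):
    """Shift a (lat, lon) pair by up to 0.05 degrees in each direction."""
    lat, lon = value
    return (lat + rng.uniform(-0.05, 0.05), lon + rng.uniform(-0.05, 0.05))


def shorten_postcode(value, rng):
    """Keep only the first part of a UK-style postcode: 'SW1A 1AA' -> 'SW1A'."""
    return value.split(" ")[0]


def vary_age(value, rng):
    """Replace an exact age with a random age from the same ten-year band."""
    low = (value // 10) * 10
    return rng.randint(low, low + 9)


def redirect_email(value, rng):
    """Discard the real address and point at a made-up test mailbox."""
    return f"test+{rng.randrange(10**6)}@example.com"


# Hypothetical graylist: field name -> non-reversible adjustment.
GRAYLIST = {
    "location": offset_location,
    "postcode": shorten_postcode,
    "age": vary_age,
    "email": redirect_email,
}


def graylist_anonymise(record, rng):
    """Whitelisted fields pass, graylisted fields are adjusted, the rest is nulled."""
    result = {}
    for field, value in record.items():
        if field in WHITELIST:
            result[field] = value
        elif field in GRAYLIST and value is not None:
            result[field] = GRAYLIST[field](value, rng)
        else:
            result[field] = None
    return result


record = {
    "id": 3,
    "name": "Ada Smith",
    "location": (51.5, -0.12),
    "postcode": "SW1A 1AA",
    "age": 36,
    "email": "ada@example.com",
}
# A fixed seed keeps this example repeatable; a real run would not want a
# predictable seed.
safe = graylist_anonymise(record, random.Random(0))
```

The default branch still nulls anything without a rule, so the graylist only ever adds usefulness on top of the whitelist's safety; it never widens what passes through unaltered.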

Tooling to help

Over at GitHub, Sunit Parekh has built a tool based on our experiences solving these sorts of problems for our clients, and he has examples of its usage on his blog. It offers the above features, as well as default strategies for obscuring data types.

Other Methods?

What do you do on your projects? How did you solve the problems of working with large amounts of personal data?
