Guide to safeguarding your precious data

Blog written by Jonty Rougier, Professor of Statistical Science.

Data are a precious resource, but easily corrupted through avoidable poor practices. This blog outlines a lightweight approach for ensuring a high level of integrity for datasets held as computer files and used by several people. There are more hi-tech approaches using a common repository and version control (e.g. https://github.com/), but they are more opaque.

If this document is kept alongside the datasets (in the same folder), there should be no loss of continuity when personnel change. It might be helpful to create a file README stating: See *_safeguarding.pdf for details about how these files are managed.

The first point is the most important one:

1. Appoint someone as the Data Manager (DM). All requests for datasets are directed ‘upwards’ to the DM: do not share datasets ‘horizontally’. The DM may respond individually to dataset requests, or they may create an accessible folder containing the current versions of the datasets (see below).

The following points concern how the DM manages datasets:

2. Each dataset is a single file, with a name of the form DSNAME.xlsx (not necessarily an Excel spreadsheet, although this is a common format for storing datasets). At any time, there is a single current version of each dataset, named yyyymmdd_DSNAME.xlsx; the prefix yyyymmdd records the date on which this file became the current version of DSNAME.xlsx. The current version is the one which is distributed. Everyone should label their analyses with the full name of the current version of the dataset. This makes it easier to reproduce old calculations, and to work out why results differ when the dataset is updated.
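The date-prefix convention above can be sketched in a few lines of Python. This is a hypothetical helper, not part of the scheme itself; it assumes all the files for a dataset sit in one folder:

```python
# Find the current version of a dataset by its yyyymmdd prefix.
# Dated filenames like 20240315_DSNAME.xlsx sort chronologically as
# strings, so the latest date is simply the maximum.
import re
from pathlib import Path

def current_version(folder, dsname):
    """Return the path of the most recent dated version of dsname."""
    pattern = re.compile(r"^(\d{8})_" + re.escape(dsname) + r"$")
    dated = [p for p in Path(folder).iterdir() if pattern.match(p.name)]
    if not dated:
        raise FileNotFoundError(f"no dated versions of {dsname} in {folder}")
    return max(dated, key=lambda p: p.name[:8])
```

Because the dev and backstop files lack a date prefix, they are never mistaken for the current version.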

3. The DM never opens the current version. They simply distribute it (including to themself, if necessary). This is very important for spreadsheets, where every time the file is opened, there is the risk that entries in the cells will be inadvertently changed. Typically, though, new data will need to be added to DSNAME.xlsx, and corrections made. These changes are all passed upwards to the DM; they are not made on local versions of DSNAME.xlsx.

4. Incorporating changes. As well as current versions with different date prefixes, there will also be a development copy, named dev_DSNAME.xlsx. All changes occur in dev_DSNAME.xlsx. I recommend that each change to dev_DSNAME.xlsx is described by a sentence or two at the top of the file changes_DSNAME.txt. This file is made available alongside the current version of the dataset.
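The recommended change log in point 4 amounts to prepending a short dated note to a text file. A minimal sketch, with an illustrative function name and assuming DSNAME stands for the dataset's base name:

```python
# Prepend a one-line, dated description of a change to changes_DSNAME.txt,
# so the most recent change always appears at the top of the file.
from datetime import date
from pathlib import Path

def log_change(folder, dsname, note):
    """Prepend a dated note to the changes file for dsname."""
    changes = Path(folder) / f"changes_{dsname}.txt"
    old = changes.read_text() if changes.exists() else ""
    entry = f"{date.today():%Y-%m-%d}: {note}\n"
    changes.write_text(entry + old)
```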

5. Updating the current version. When the changes in dev_DSNAME.xlsx have become sufficient to distribute, it is copied to become the new current version, with an updated date prefix. The DM should alert the team that there is an update of DSNAME.xlsx. If the changes are being logged, insert the name of the new current version as a section heading at the top of changes_DSNAME.txt, so that it is clear how the new current version differs from the old one.
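The release step in point 5 can be sketched as follows, assuming the naming conventions above (the function name is illustrative): the dev file is copied, not moved, to a new dated current version, and the new name is inserted as a heading at the top of the changes file if one exists.

```python
# Promote the dev file to a new current version with today's date prefix.
import shutil
from datetime import date
from pathlib import Path

def release(folder, dsname, ext="xlsx"):
    """Copy dev_DSNAME.ext to yyyymmdd_DSNAME.ext; head the changes file."""
    folder = Path(folder)
    dev = folder / f"dev_{dsname}.{ext}"
    new = folder / f"{date.today():%Y%m%d}_{dsname}.{ext}"
    shutil.copy2(dev, new)   # dev file stays in place for further edits
    changes = folder / f"changes_{dsname}.txt"
    if changes.exists():
        changes.write_text(f"== {new.name} ==\n" + changes.read_text())
    return new
```

Note that the old current versions are left untouched, which is what makes old analyses reproducible.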

6. If there is a crisis in dev_DSNAME.xlsx then the DM should delete it, create a new dev file from the current version, and remake the changes. Crises happen from time to time, and it is a good idea not to accumulate too many changes in the dev file before creating a new current version. On the other hand, it can be tedious for people to be constantly updating their version for only minor changes. So the DM might want to create ‘backstop’ versions of the dev file for their own convenience, perhaps named backstop_DSNAME.xlsx; creating a backstop also warrants a line in changes_DSNAME.txt, if that file is being used.
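The crisis recovery in point 6 is just delete-and-recopy. A sketch, again with illustrative names and the conventions above:

```python
# Discard a broken dev file and recreate it from the current version,
# ready for the changes to be remade.
import shutil
from pathlib import Path

def reset_dev(folder, dsname, ext="xlsx"):
    """Replace dev_DSNAME.ext with a fresh copy of the current version."""
    folder = Path(folder)
    dated = sorted(folder.glob(f"[0-9]*_{dsname}.{ext}"))
    current = dated[-1]          # yyyymmdd prefixes sort chronologically
    dev = folder / f"dev_{dsname}.{ext}"
    dev.unlink(missing_ok=True)  # delete the broken dev file
    shutil.copy2(current, dev)
    return dev
```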

7. The DM is responsible for backing up the entire folder. This folder contains, for each dataset: all of the current versions (with different date prefixes), the dev file, the changes file if it exists (recommended), and additional helpful files like a README. An obvious option is to locate the entire folder somewhere with automatic back-ups, but it is still the DM’s responsibility to know the back-up policy, to monitor compliance, and even to run a recovery exercise. It’s a bit ramshackle, but regularly creating a tar file of the folder and mailing it to oneself is a pragmatic safety net.
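The tar-file safety net can be scripted so it actually gets done. A minimal sketch using Python’s standard library; the archive name is an assumption, not part of the scheme:

```python
# Archive the whole dataset folder into a dated, gzipped tar file,
# which the DM can then copy or mail somewhere safe.
import tarfile
from datetime import date
from pathlib import Path

def backup(folder, dest="."):
    """Write folder into backup_yyyymmdd.tar.gz under dest."""
    folder = Path(folder)
    archive = Path(dest) / f"backup_{date.today():%Y%m%d}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(folder, arcname=folder.name)
    return archive
```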