Data management

The strategy for managing research data may differ based on the nature of the data (particularly its size and any access restrictions) and the size of the team.

Principles of data provenance

Data provenance refers to the detailed tracking of the origin, lineage, and transformation history of data as it moves through various processes and systems. Good data provenance means having documentation of where the data came from, how it has been altered or manipulated, and by whom. This traceability is crucial for ensuring the data’s accuracy, reliability, and credibility.

Documentation

How to ruin your life with poor data documentation

TK

What should we do instead

Every dataset that enters the research project should be documented in detail. For each dataset in the raw data directory, document the following at a minimum:

Ideally, you would want to be able to point to a datasheet (in the sense of “Datasheets for Datasets”) for the dataset. If you created the dataset, write a datasheet yourself. See also Bagrow2022Network for network datasets.

Raw data is sacred

How to ruin your life with poor raw data management

Sarah, a promising PhD student, was working on her dissertation about climate change’s impact on agriculture. Her colleague, John, decided to play data doctor, manually tweaking some values in the raw dataset to “correct” perceived errors without documenting these changes or informing anyone. Unaware of John’s tampering, Sarah used this altered data for her analysis. She produced a groundbreaking paper that received widespread acclaim, even getting her invited to prestigious conferences. However, the accolades were short-lived. During a routine data audit, the discrepancies were discovered, and an investigation was launched. The academic community was rocked when it came to light that the dataset had been tampered with. Sarah faced allegations of academic misconduct, her paper was retracted, and her academic career was in jeopardy. John’s hasty handiwork not only cost him his job but also tainted Sarah’s reputation and her future in academia.

Mike, an analyst at a pharmaceutical company, received a dataset to evaluate the effectiveness of a new wonder drug. He noticed some odd trends but figured they were just quirks of the data. He presented his findings with great fanfare, only for his team to smell something fishy. When asked to verify the data, Mike realized he couldn’t trace the discrepancies between his dataset and the original source. It turned out that errors had been introduced during an undocumented data transfer – the dataset he analyzed was a corrupted clone. The fallout was dramatic: months of research were invalidated, the drug’s approval was delayed, and trust in Mike’s work evaporated faster than a sneeze in a cyclone.

Laura, a diligent student, spent weeks scrubbing a raw dataset until it sparkled with cleanliness. She meticulously removed duplicates, corrected errors, and ensured consistency – but she did so manually, editing the raw dataset directly. Enter Bob, a well-meaning but oblivious colleague, who uploaded an updated raw dataset, overwriting Laura’s clean data. When it came time to analyze, the team found Laura’s painstaking efforts had vanished into the ether. Because there was no record of each change Laura had made, they had to restart the entire cleaning process.

The next time, they remembered to create a new version, but the cleaning was still done manually and no record was kept. When yet another version of the raw dataset rolled in, they still had to reapply the changes by hand, tracing back every edit they had made.

What should we do instead

Raw datasets should never be touched by hand. Protect those files with read-only permissions, versioning, and the like.

Ideally, any changes you make should also be applied upstream so that you don’t have to worry about them in the future. In most cases, however, this is not possible. When you apply changes, create an auxiliary dataset that lists the changes, with accompanying documentation of why each change was made. Write a script that applies those changes and integrate that script into the workflow.
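As a sketch of this pattern (the file contents, column names, and values below are all hypothetical), the auxiliary dataset can be a small CSV of corrections, and the script that applies it becomes part of the workflow:

```python
import csv
import io

# Hypothetical raw dataset; in practice this is read from the read-only raw data directory.
raw_csv = """station,year,rainfall_mm
A,2020,512
A,2021,-999
B,2020,430
"""

# Auxiliary corrections file: one row per change, with the reason documented.
corrections_csv = """station,year,column,new_value,reason
A,2021,rainfall_mm,498,-999 is a missing-value sentinel; true value taken from the station logbook
"""

def apply_corrections(raw_text, corrections_text):
    """Return the raw rows with every documented correction applied."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    for fix in csv.DictReader(io.StringIO(corrections_text)):
        for row in rows:
            if row["station"] == fix["station"] and row["year"] == fix["year"]:
                row[fix["column"]] = fix["new_value"]
    return rows

cleaned = apply_corrections(raw_csv, corrections_csv)
```

The cleaned rows are then written to a separate processed-data directory; when a new version of the raw dataset arrives, re-running the script reapplies every documented fix automatically.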

This practice makes it very clear what has been done and why. It also allows you to keep building on the cleaned data and to adapt to new versions of the dataset.

Use cases

Unrestricted datasets of moderate size

Most projects may fall into this category: the datasets can be shared freely and are small enough to store locally, but they may be a bit too big to store in the project’s git repository.

One handy approach is to share a Google Drive or Dropbox folder across the team and symlink it to the local data directory as follows:

ln -s /path/to/the/shared/rawdatadir ./data/raw

In doing so, all scripts and notebooks can use the same relative path (e.g., ../data/raw/raw_dataset.csv rather than /home/yy/projects/xxx/data/raw/), and everyone can more easily run the replication locally.

Another way is to keep a configuration file that sets the data path locally.
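As a minimal sketch of the config-file approach (the file name config.json and the key raw_data_dir are assumptions, not a standard), each collaborator keeps a small, git-ignored file pointing at their local copy of the data, and every script resolves paths through it:

```python
import json
import pathlib
import tempfile

# A collaborator's local, git-ignored config file (written here for illustration).
config_dir = pathlib.Path(tempfile.mkdtemp())
config_file = config_dir / "config.json"
config_file.write_text('{"raw_data_dir": "/home/yy/shared/rawdata"}')

# Scripts build data paths from the config instead of hard-coding absolute paths.
config = json.loads(config_file.read_text())
raw_dir = pathlib.Path(config["raw_data_dir"])
dataset_path = raw_dir / "raw_dataset.csv"
```

Compared with the symlink approach, this works on systems where symlinks are inconvenient, at the cost of one extra line of path-resolution code in each script.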