Training data leakage in language models

Data leakage

Memorization

How? Why?

Chen2024multi takes a multi-perspective look at how and why memorization happens in LLMs. Huang2024demystifying studies how verbatim memorization arises.

Exact or verbatim memorization

Carlini2021extracting demonstrated that (potentially private) training examples can be extracted from large language models. Although these models do not necessarily “overfit” as a whole, they still memorize examples from the Training dataset (Carlini2023quantifying).
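
One of the ranking heuristics in Carlini2021extracting compares the model’s perplexity on a candidate string with its zlib compression entropy: a string that is easy for the model but not trivially compressible is a memorization suspect. A minimal sketch of that signal, assuming a HuggingFace causal LM (“gpt2” here is only a placeholder):

```python
import zlib

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the same idea applies to any causal LM.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the model (lower = the text is 'easy' for the model)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean cross-entropy per token
    return float(torch.exp(loss))

def zlib_bits(text: str) -> int:
    """Compressed size in bits; a model-free proxy for how repetitive the text is."""
    return 8 * len(zlib.compress(text.encode("utf-8")))

def extraction_score(text: str) -> float:
    """Higher = low perplexity relative to zlib entropy, i.e. the string is
    suspiciously easy for the model and is a candidate for verbatim memorization."""
    return zlib_bits(text) / perplexity(text)
```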

Other types of memorization

Memorization can mean remembering very specific sentences or paragraphs, but it can also concern the proportions of different data sources. Hayase2024data proposes “data mixture inference”, which “aims to uncover the distributional make-up of training data”.

Evaluation

This makes it tricky to evaluate LLMs (see Data leakage). It is often difficult to determine whether an LLM is actually capable of a task or is simply remembering an example it has seen (Wang2024generalization). There was an astonishing example in which an LLM could draw a reasonable shape with TikZ, but people later found a similar example on Stack Overflow.

Golchin2024time proposes a method to identify data contamination.
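
Golchin2024time works with guided, dataset-aware prompts and then checks how closely the model’s completion matches the reference instance. The core overlap check can be sketched as below; this is a simplified illustration rather than the paper’s exact protocol, and `model_complete` is a hypothetical prompt-to-completion callable:

```python
from difflib import SequenceMatcher

def overlap_ratio(generated: str, reference: str) -> float:
    """Crude surface overlap between a completion and the reference continuation."""
    return SequenceMatcher(None, generated.split(), reference.split()).ratio()

def looks_contaminated(model_complete, instance_prefix: str,
                       instance_suffix: str, threshold: float = 0.8) -> bool:
    """If the model reproduces the held-out suffix of a benchmark instance
    almost verbatim, that instance was probably in its training data."""
    completion = model_complete(instance_prefix)
    return overlap_ratio(completion, instance_suffix) >= threshold
```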

Chen2024copybench

Data archaeology

This can be used to perform privacy attacks or “data archaeology” (Chang2023speak). Whether a given text was part of the pretraining data can also be detected (Shi2023detecting).
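
The Min-K% Prob idea from Shi2023detecting scores a candidate text by the average log-probability of its least likely tokens; texts seen during pretraining tend to have few low-probability outlier tokens. A minimal sketch, again assuming a HuggingFace causal LM (“gpt2” is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def min_k_prob(text: str, k: float = 0.2) -> float:
    """Average log-probability of the k% least likely tokens in `text`.
    A higher score suggests the text was seen during pretraining."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Log-probability assigned to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    n = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n, largest=False).values
    return lowest.mean().item()
```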

Memory manipulation and knowledge editing

It is also possible to force a model to memorize texts by manipulating a small set of localized weights (Stoehr2024localizing).

see also Knowledge editing

Privacy

Most models do not implement any Differential privacy mechanisms because (1) they are mostly trained on public data, and such mechanisms (2) reduce accuracy on downstream tasks and (3) increase the training cost.
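
The mechanism in question is typically DP-SGD: clip each example’s gradient, then add Gaussian noise to the aggregate before the update. The per-example gradient computation is exactly where cost (3) comes from. A toy sketch (no privacy accounting, not production code):

```python
import torch

def dp_sgd_step(model, loss_fn, xs, ys, lr=1e-3, clip_norm=1.0, noise_mult=1.0):
    """One DP-SGD step: per-example gradient clipping plus Gaussian noise.
    The per-example loop is the main source of the extra training cost."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xs, ys):  # per-example gradients; no batched speed-up
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (norm.item() + 1e-6))  # clip to clip_norm
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(p) * noise_mult * clip_norm
            p.add_(-(lr / len(xs)) * (s + noise))
```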

The incident with Scatter Lab’s Lee Luda chatbot may be a good example, although it involved a much simpler error.

This may be more of a problem for Language models for health records or models trained on other sensitive private data, especially when the pre-trained language models themselves are released.

Xu2024benchmarking