Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Training data leakage and memorization in language models