Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
https://arxiv.org/abs/2407.16607
Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith
Training data leakage and memorization in language models