
The study reveals a hidden bottleneck in data preparation that can limit model knowledge and efficiency, prompting dataset engineers to adopt multi‑extractor strategies for richer, higher‑quality training corpora.
The preprocessing stage that converts raw HTML into clean text has long been treated as a routine step, but new research shows it can be a decisive factor in the breadth of data available to train large language models. By systematically comparing three widely used extractors—Resiliparse, Trafilatura, and JusText—the authors demonstrated that each tool favors different sections of the web, with only a minority of pages surviving across all three. This divergence means that a single‑tool pipeline can unintentionally exclude a majority of potentially valuable content, limiting the diversity of linguistic patterns and factual knowledge that models can learn.
When the researchers merged the outputs of all three extractors, the token pool swelled by as much as 71%, expanding a 7B‑parameter model's dataset from 193 billion to 283 billion tokens. Remarkably, this surge came at no cost to standard benchmark scores, indicating that the additional data matched the quality of the original set. Moreover, the combined approach outperformed simply loosening filter thresholds on any single extractor, especially in low‑resource scenarios where the web's remaining usable data is increasingly scarce. This suggests that smarter preprocessing can unlock hidden reserves of information without sacrificing model performance.
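The merging idea can be illustrated with a small sketch. The extractor callables below are stand-ins (in practice they would wrap Resiliparse, Trafilatura, and JusText), and the "keep a page if any extractor yields usable text, prefer the longest extraction" policy is one plausible reading of the approach, not the paper's exact rule:

```python
# Hedged sketch of a multi-extractor union pipeline.
# The extractors here are hypothetical callables; real ones would wrap
# resiliparse, trafilatura, and justext.
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]

def union_extract(
    html: str,
    extractors: list[Extractor],
    min_chars: int = 50,
) -> Optional[str]:
    """Run every extractor on the page; keep the page if any output
    passes a basic length filter. Among survivors, prefer the longest
    extraction (one plausible merge policy)."""
    candidates = []
    for extract in extractors:
        text = extract(html)
        if text and len(text) >= min_chars:
            candidates.append(text)
    return max(candidates, key=len) if candidates else None
```

A page dropped by one tool but recovered by another thus stays in the corpus, which is how the union grows the token pool without relaxing any single extractor's quality filter.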
The impact is most pronounced for structured content such as tables and code, where the tools differ dramatically. Resiliparse retained table layouts and code blocks far better than Trafilatura, which often loses cell content, and JusText, which can strip these elements entirely. Consequently, models trained on Resiliparse output closed 73% of the performance gap on table‑comprehension tasks relative to peers trained with the other tools. As the industry grapples with data licensing, toxicity, and copyright concerns, the ability to safely expand the training corpus through parallel extraction offers a compelling path to more capable and responsible AI systems.