I think one of the next big milestones in training frontier models will be learning on large, current, and curated collections of educational and scientific literature.
Training on Textbooks
This isn't exactly a new idea. Models already answer a wide range of questions well, solve problems up to competition level, and are improving extremely rapidly. Some models openly train on textbook-style data — see "Textbooks Are All You Need" (which we discussed here as well).
Cynics, though, claim they're simply training on the test set, hence the satirical follow-up "Pretraining on the Test Set Is All You Need". Other model authors are very reluctant to disclose the composition of their training sets, and not necessarily because of test-set contamination: it might also be due to copyright complexities.
So we're sort of already there or at least heading that way. But I think we still have some ground to cover before we get to truly large-scale training on textbooks. Along the way, we need to solve several issues:
Copyright. Good textbooks belong to someone, are protected by copyright, and aren't openly available under convenient licenses, and they won't be for a while. This is a complex, multifaceted topic that requires systemic solutions, including economic incentives: there's little benefit in creating a good textbook if the profit is captured by whoever trains a model on it. The entire economy around such models and data will need to be restructured somehow.
Really large effective context, sufficient for the model to internalize entire domains of knowledge without losing anything, and ideally to pull in fresh results along the way without retraining. Textbooks will likely need to take some new format. Knowledge graphs might be part of the solution, but maybe not: we haven't seen broad adoption across domains yet.
Mature RAG and other tooling for working with new information. Nothing particularly new here: we'll need verification and quality assessment, orchestration for regular updates and preprocessing of new books, articles, and so on, and in general systems where integration of old and new knowledge happens more or less automatically (see the sketch after this list).
Multimodality, at least for text plus images, which are everywhere and need to be well understood: diagrams, graphs, schematics, mathematical and other formulas. Video could be useful too, but we can start without it. I wonder what good educational video would look like for a model rather than for humans?
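To make the RAG point above more concrete, here is a minimal sketch of such an update pipeline in Python. Everything in it (the `fetch_new_documents`, `quality_check`, and `embed` callables, the vector-store interface) is a hypothetical placeholder rather than any particular library's API; it only illustrates the shape of the loop: fetch, verify, chunk, index.

```python
# Minimal sketch of an ingestion/update loop for new books and articles.
# All names here (fetch_new_documents, quality_check, embed, store) are
# hypothetical placeholders, not a specific library's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Document:
    source_id: str      # DOI, ISBN, arXiv id, ...
    text: str
    license: str        # needed later by the compliance layer

def split_into_chunks(text: str, size: int = 1000) -> list[str]:
    """Naive fixed-size chunking; a real pipeline would split on structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest_batch(
    fetch_new_documents: Callable[[], list[Document]],
    quality_check: Callable[[Document], bool],
    embed: Callable[[str], list[float]],
    store,              # any vector store with an .add(id, vector, metadata) method
) -> int:
    """Fetch fresh books/articles, filter them, and index them for retrieval."""
    added = 0
    for doc in fetch_new_documents():
        if not quality_check(doc):        # verification / quality assessment
            continue
        for i, chunk in enumerate(split_into_chunks(doc.text)):
            store.add(f"{doc.source_id}#{i}", embed(chunk),
                      {"source": doc.source_id, "license": doc.license})
        added += 1
    return added
```

In a real system, most of the work would live inside the quality check: deduplication, source trust scoring, and contamination checks against evaluation sets.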
The result would be a model, or rather a scientist-assistant agent, with exceptional capabilities across different knowledge domains. A copilot for scientists, and eventually an auto-scientist, which many are already working toward. And also a tutor, or "A Young Lady's Illustrated Primer".
Closely related are the questions of safety, misuse, dual use, and other dangerous model capabilities. Testing models for such capabilities has been around for a while, and the training described here will certainly raise the risks. But the benefits, I'm confident, are significant, and there will clearly be a separation between models for verified users and models for everyone else.
As you can see, the largest and most significant problems here aren't purely technical.
Domain-specific models (DLLMs)
I also want to discuss a specific manifestation of this trend: domain-specific models (DLLMs).
DLLMs could become the most notable disruption. By various estimates, 2.8 to 3.3 million new scientific papers are published annually — humans can't read them all, but a copilot can.
Most likely, expert models will emerge in most significant domains, capable of answering field-specific questions, helping solve current problems, and giving humans a 10x boost in productivity. We'll be able to have experts in modern physics, super-intelligent assistants in materials science, deep experts in software engineering, advisors in medicine or agriculture, and so on.
Models will differ significantly from each other: different types of knowledge, much of which other models don't need (solid-state physics isn't needed to build a compiler; a medical model doesn't critically need to know software licenses), different licensing and safety requirements, different quality-assessment procedures, and so on. Each will have its own regulation, checks, and certifications.
Multimodality is needed, but at a more detailed level it will vary — even for image modality, objects will be quite different: 3D molecules, medical images, UML diagrams, phase diagrams — each discipline needs its own sub-modality.
I don't think DLLMs will be covered by today's makers of universal models. They don't have the capacity to dig deep into all these areas and handle constant updates and quality control. But they'll likely provide good base models and infrastructure for tuning and usage. DLLMs will be created by other people and organizations with unique data and expertise.
Important dimensions will be the scale range (on-device → GPU cluster) and open versus closed (what you control, and how). I expect especially interesting developments in edge and on-device models in the coming years: many settings need to work without internet access, especially continuous industrial processes.
The path of independent pretraining (hundreds of billions to trillions of tokens) will remain the province of a select, wealthy few, while the truly mass scenario will be adaptation of base models, in the cloud or locally.
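As a rough illustration of what "adaptation of a base model" might look like in practice, here is a hedged sketch using LoRA-style parameter-efficient fine-tuning via the Hugging Face `peft` library. The base model name and hyperparameters are placeholders; the point is only that the per-domain artifact can be a small adapter rather than a full set of weights.

```python
# Sketch of low-cost domain adaptation of an open base model with LoRA
# (parameter-efficient fine-tuning). The base model name is a placeholder;
# any causal LM and domain corpus could be substituted.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"            # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # typically well under 1% of all weights

# From here, train on the domain corpus with any standard training loop
# (e.g. transformers.Trainer); only the small adapter matrices are updated,
# so the result is cheap to train, store, and swap per domain.
```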
Data essentially breaks down into three different layers:
Core corpus — stabilized sources (textbooks, standards, review articles).
Dynamic feed — preprints, patents, fresh news (auto-RAG-pipeline).
Telemetry (private logs and feedback) — for the model to gradually learn in the context of specific organizations.
The special value lies in the ability to keep these layers current (a SaaS niche you could call "DataOps for DLLM"): the core corpus is updated quarterly, the dynamic feed gets daily or even streaming ingestion of preprints and patents through the RAG pipeline, and telemetry drives online fine-tuning/RLHF.
Separately, a compliance layer sits on top of these: licensing for the core corpus, copyright verification for the dynamic feed, and GDPR or local privacy laws for telemetry.
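Purely as an illustration of how these three layers, their refresh cadence, and their compliance checks might be written down, here is a small configuration sketch; the field names, cadences, and checks are assumptions drawn from the description above, not an existing product's schema.

```python
# Illustrative configuration of the three data layers; names and values
# are assumptions for the sake of the sketch.
from dataclasses import dataclass

@dataclass
class DataLayer:
    name: str
    sources: list[str]
    refresh: str                 # how often the layer is rebuilt
    compliance: list[str]        # checks applied before anything is used

LAYERS = [
    DataLayer("core_corpus",
              sources=["textbooks", "standards", "review articles"],
              refresh="quarterly re-train / re-index",
              compliance=["license audit"]),
    DataLayer("dynamic_feed",
              sources=["preprints", "patents", "news"],
              refresh="daily or streaming ingestion via the RAG pipeline",
              compliance=["copyright verification"]),
    DataLayer("telemetry",
              sources=["private logs", "user feedback"],
              refresh="online fine-tuning / RLHF",
              compliance=["GDPR / local privacy law", "customer data agreements"]),
]
```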
And in practice, this won't just be a DLLM but an agent with a DLLM inside, equipped with special additional instructions and tools, and orchestrated to work together with other agents.
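A minimal sketch of what such an agent wrapper could look like is below. The `dllm_generate` call, the JSON action format, and the two stub tools are all hypothetical; the point is that the domain model sits inside a loop that can call tools before producing an answer.

```python
# Minimal sketch of an agent wrapping a domain model: the DLLM proposes
# either a final answer or a tool call, and the loop executes tools until
# an answer is produced. dllm_generate and the tools are placeholders.
import json
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "search_literature": lambda q: f"(top papers for '{q}')",   # stub tool
    # toy calculator for illustration only, not safe for untrusted input
    "run_calculation":   lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def dllm_generate(prompt: str) -> str:
    """Placeholder for a call into the domain model; expected to return JSON
    like {"action": "tool", "tool": "...", "input": "..."} or
    {"action": "final", "answer": "..."}."""
    raise NotImplementedError

def run_agent(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = json.loads(dllm_generate(transcript))
        if step["action"] == "final":
            return step["answer"]
        result = TOOLS[step["tool"]](step["input"])
        transcript += f"Tool {step['tool']} -> {result}\n"
    return "No answer within the step budget."
```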
A major topic is domain benchmarks, and possibly certification in the limit. On one hand, without benchmarks there's no trust; on the other, you still need to test on your own data and tasks: each company has its own specifics and requirements, and different models can behave differently.
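One way a company might combine a public domain benchmark with its own internal test set is sketched below; the item format and the `ask()` callables are assumptions for illustration.

```python
# Sketch of scoring several candidate models on both a public domain
# benchmark and a company's own test set; everything here is assumed
# for illustration.
from typing import Callable

def accuracy(ask: Callable[[str], str], items: list[dict]) -> float:
    """items: [{"question": ..., "answer": ...}, ...]; exact-match scoring."""
    correct = sum(ask(it["question"]).strip() == it["answer"] for it in items)
    return correct / len(items)

def compare(models: dict[str, Callable[[str], str]],
            public_bench: list[dict], internal_set: list[dict]) -> None:
    for name, ask in models.items():
        print(f"{name}: public={accuracy(ask, public_bench):.2%} "
              f"internal={accuracy(ask, internal_set):.2%}")
```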
DLLMs have a different risk profile from general models: a good domain model errs less often, but the cost of an error is higher (a wrong dosage, an incorrect financial report, and so on). Hence the need for domain-specific auditing, traceable citations, and in some cases explainability. A market for independent red-team auditing will likely emerge, whose results regulators and insurers will take into account when models go into production.
Early incarnations like Med-PaLM 2 in healthcare, BloombergGPT in finance, Sec-PaLM 2 in security show that "general → narrow" recipes already work; the next couple of years will set the pace for further LLM market fragmentation into verticals.
Agreed, domain-specific seems like the likely progression we're seeing the first steps toward.
What are your thoughts on moving away from training on language?
I'm a big fan of us humans and the language we produce, but if it is, say, medical/health foundational knowledge we are seeking to build, I would posit that we should be training on biological measurements and data on interventions/actions taken on bodies (and the measurements on those bodies before and after). Possibly mixing in graphs of strongly established biological pathways, but leaving language from textbooks and scientific papers out of the training set, as they are abstractions of data we have more directly.
Adding a link to the post I made elaborating on another probable upside of using this type of time- and location-stamped data, as opposed to language, for the training set: https://michaelgeer.substack.com/p/large-biological-models-inherently