CTB-3: LLM Heuristics
Zeyuan Allen-Zhu recently gave a fantastic presentation at ICML 2024, demonstrating some of his and his co-authors’ findings on LLMs, which they refer to as the physics of language models. Despite the convincing results, I don’t think they are rigorously proven, so I would rather call them heuristics.
In this post I will summarize most of their findings into a concise list of LLM heuristics.
Knowledge Heuristics
Setup: pre-train on the biographies of N individuals, instruction-tune on biography QAs for N/2 of them, and evaluate on the biography QAs for the other N/2.
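To make the setup concrete, here is a minimal sketch of how such a synthetic biography-QA dataset could be generated and split. The names, attributes, and templates are my own illustrative assumptions, not the authors’ actual data pipeline.

```python
# Hypothetical sketch of the biography-QA setup (names, attributes, and
# templates are illustrative, not the authors' exact data).
import random

random.seed(0)

FIRST = ["Anya", "Liam", "Mei", "Omar", "Sofia", "Tomas"]
LAST = ["Rivera", "Chen", "Okafor", "Novak", "Haddad", "Silva"]
CITIES = ["Boston", "Kyoto", "Lagos", "Prague", "Lima", "Oslo"]

def make_person(i: int) -> dict:
    return {
        "name": f"{random.choice(FIRST)} {random.choice(LAST)} {i}",
        "birth_city": random.choice(CITIES),
        "birth_year": random.randint(1920, 2000),
    }

def biography(p: dict) -> str:
    # Pre-training text: one declarative biography per person.
    return f"{p['name']} was born in {p['birth_city']} in {p['birth_year']}."

def qa_pairs(p: dict) -> list[tuple[str, str]]:
    # QA text used for instruction tuning (first half) or evaluation (second half).
    return [
        (f"Which city was {p['name']} born in?", p["birth_city"]),
        (f"In which year was {p['name']} born?", str(p["birth_year"])),
    ]

N = 1000
people = [make_person(i) for i in range(N)]
pretrain_corpus = [biography(p) for p in people]                     # all N biographies
finetune_qas = [qa for p in people[: N // 2] for qa in qa_pairs(p)]  # QAs for N/2 people
eval_qas = [qa for p in people[N // 2 :] for qa in qa_pairs(p)]      # held-out N/2 people
```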
[Knowledge storage and extraction]
Pre-train then instruction-tune fails to make the knowledge extractable (the model memorizes the biographies but cannot answer the held-out QAs)
Solution: mixed training (add the instruction-tuning data to the pre-training stage)
Solution: knowledge augmentation of the pre-training data, i.e. restating the same knowledge multiple times (e.g. the same person’s biography in different styles); see the sketch after this list
Knowledge about celebrities (with augmentation) helps extraction for minorities (without augmentation)
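Below is a minimal sketch of what the two fixes could look like on this kind of biography data; the style templates and the mixing scheme are illustrative assumptions, not the authors’ exact recipe.

```python
# Hypothetical sketch of mixed training and knowledge augmentation
# (style templates and the mixing scheme are illustrative assumptions).

STYLE_TEMPLATES = [
    "{name} was born in {birth_city} in {birth_year}.",
    "Born in {birth_year}, {name} spent an early childhood in {birth_city}.",
    "{name} ({birth_year}) hails from {birth_city}.",
]

def augment(person: dict) -> list[str]:
    # Knowledge augmentation: the same facts written in several styles, so
    # extraction does not hinge on one memorized surface form.
    return [t.format(**person) for t in STYLE_TEMPLATES]

def mixed_training_corpus(people: list[dict], train_qas: list[tuple[str, str]]) -> list[str]:
    # Mixed training: QA examples are folded directly into the pre-training
    # stream rather than saved for a separate instruction-tuning stage.
    corpus = [line for p in people for line in augment(p)]
    corpus += [f"Q: {q}\nA: {a}" for q, a in train_qas]
    return corpus

# Example with one toy person.
person = {"name": "Anya Rivera", "birth_city": "Boston", "birth_year": 1975}
print(augment(person))
```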
[Knowledge manipulation]
Partial and dual retrieval of knowledge are hard
A model may answer the full birthday correctly but still fail to answer just the birth month
Knowledge classification and ranking are hard without CoT
When asked whether a person was born in an even year, the model needs to first state the year and only then classify
The same holds for ranking two people’s birthdays
Knowledge inverse search is hard (e.g. “who was born on July 6, 1997?”)
Solution: RAG
Solution: insert reverse knowledge into the pre-training data (sketched below)
Solution: insert line numbers into critical documents to enable inverse search
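Here is a hedged sketch of the CoT workaround for classification and of the reverse-knowledge trick for inverse search; the prompt wording and templates are my own illustration.

```python
# Hypothetical sketch: CoT prompting for classification, and reverse-knowledge
# insertion for inverse search (prompt wording is my own illustration).

def even_year_prompt_direct(name: str) -> str:
    # Direct classification: tends to fail, because the model must manipulate
    # a fact it can only reliably emit verbatim.
    return f"Was {name} born in an even year? Answer yes or no."

def even_year_prompt_cot(name: str) -> str:
    # CoT version: force the model to first state the year, then classify.
    return (f"Was {name} born in an even year? "
            "First state the birth year, then answer yes or no.")

def reversed_fact(name: str, birth_date: str) -> str:
    # Reverse knowledge inserted at pre-training time, so that inverse search
    # ("who was born on ...?") becomes an ordinary forward lookup.
    return f"On {birth_date}, {name} was born."

print(even_year_prompt_direct("Anya Rivera"))
print(even_year_prompt_cot("Anya Rivera"))
print(reversed_fact("Anya Rivera", "July 6, 1997"))
```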
[Knowledge capacity]
All LLMs can store about 2 bits of knowledge per parameter given sufficient training (e.g. 1000 exposures per piece of knowledge)
Implication: a 7B model is sufficient for all of English Wikipedia plus textbooks (back-of-envelope sketch at the end of this list)
Insufficient training reduces knowledge capacity (e.g. 1 bit/param with only 100 exposures)
The drop is much worse for models with gated MLPs
Quantizing to int8 does not affect knowledge capacity (i.e. a 4x compression of the weights comes for free)
Quantizing to int4 hurts capacity by more than 2x
MoE does not hurt knowledge capacity, and is in fact very efficient when each expert stores knowledge evenly
Junk data significantly hurts knowledge capacity
Solution: prefix the data with domain tokens (e.g. “from wikipedia.org”)
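A quick back-of-envelope check of the 7B-model implication above. Only the 2 bit/param figure comes from the talk; the Wikipedia word count and the bits-per-word estimate are rough assumptions of mine.

```python
# Back-of-envelope check of the 7B-model implication. Only the 2 bit/param
# figure comes from the talk; the Wikipedia numbers are rough assumptions.

params = 7e9                     # 7B-parameter model
bits_per_param = 2               # claimed capacity with ~1000 exposures
capacity_bits = params * bits_per_param            # 1.4e10 bits of knowledge

# Assumption: English Wikipedia is on the order of a few billion words, and
# each word carries (generously) about one bit of non-redundant knowledge.
wiki_words = 4.5e9
assumed_bits_per_word = 1.0

wiki_knowledge_bits = wiki_words * assumed_bits_per_word
print(f"model capacity : {capacity_bits:.2e} bits")
print(f"assumed wiki   : {wiki_knowledge_bits:.2e} bits")
print("fits:", capacity_bits > wiki_knowledge_bits)
```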
Reasoning Heuristics
We define:
Level 0 reasoning: given a question, brute-force compute all parameters, whether needed or not
Level 1 reasoning: given a question, topologically sort the necessary parameters and give the shortest CoT (toy sketch below)
Level 2 reasoning: before the question is even asked, mentally compute the all-pairs dependency graph (not needed for human reasoning)
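The gap between level 0 and level 1 can be made concrete on a toy parameter-dependency graph; the graph and the use of Python’s graphlib are my own illustration of the topological-sort idea.

```python
# Toy illustration of level-0 vs level-1 reasoning on a parameter dependency
# graph (the graph and helper names are my own illustration).
from graphlib import TopologicalSorter

# Each parameter maps to the parameters it depends on.
deps = {
    "a": [],
    "b": ["a"],
    "c": ["a"],
    "d": ["b", "c"],
    "e": ["c"],  # not needed for the question about "d" below
}

def level0_plan() -> list[str]:
    # Level 0: brute force, compute every parameter regardless of the question.
    return list(TopologicalSorter(deps).static_order())

def level1_plan(target: str) -> list[str]:
    # Level 1: keep only the ancestors of the asked parameter, then
    # topologically sort them -- the shortest valid chain of thought.
    needed, stack = set(), [target]
    while stack:
        p = stack.pop()
        if p not in needed:
            needed.add(p)
            stack.extend(deps[p])
    sub = {p: [q for q in deps[p] if q in needed] for p in needed}
    return list(TopologicalSorter(sub).static_order())

print(level0_plan())     # computes "e" even though nothing asked for it
print(level1_plan("d"))  # e.g. ['a', 'b', 'c', 'd'] -- shortest CoT for "d"
```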
[Hidden reasoning process]
LLMs genuinely acquire the skill to reason, rather than memorizing solution templates
LLMs exhibit level 1 reasoning like humans, even before generating any CoT
LLMs secretly learn level 2 reasoning, beyond human capabilities (probing sketch after this list)
Model depth matters for reasoning: the more layers the better, and this cannot be mitigated by CoT
High-quality synthetic reasoning data improves model reasoning capabilities
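Here is a hedged sketch of how such hidden computation can be probed: a linear classifier trained on hidden states to predict whether a parameter is necessary for the question. The hidden states below are random stand-ins; in practice they would be read out of the model at the end of the question.

```python
# Hypothetical probing sketch: a linear probe on hidden states predicts whether
# a parameter is necessary for the question, before any answer token is
# generated. Hidden states here are random stand-ins; in practice they would be
# read from the model at the end of the question.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim, n_examples = 256, 2000

# Stand-in labels (1 = parameter is necessary) and stand-in hidden states that
# weakly encode the label along one direction, mimicking a real readout.
labels = rng.integers(0, 2, size=n_examples)
direction = rng.normal(size=hidden_dim)
hidden = rng.normal(size=(n_examples, hidden_dim)) + np.outer(labels, direction)

probe = LogisticRegression(max_iter=1000).fit(hidden[:1500], labels[:1500])
print("probe accuracy:", probe.score(hidden[1500:], labels[1500:]))
```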
[Mistakes in reasoning]
LLMs often know, internally, that they have just made a mistake, and can exhibit regretful behavior
LLMs may compute unnecessary reasoning parameters
With level 2 reasoning, we can detect such systematic mistakes before generation starts (e.g. with probing)
LLMs may try to compute parameters that are not yet ready to be computed
We should pre-train with mistakes, e.g. by inserting mistakes and their corrections into the data (related: STaR, Quiet-STaR, etc.)
Even when trained on such longer solution text, LLMs can still generate the shortest path (i.e. level 1 reasoning)
Beam search, or fine-tuning with mistakes, comes too late and does not help much
One efficient way to insert mistakes: take later sentences of the solution (which are not yet ready to compute) and insert them earlier as mistakes (sketched below)
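A minimal sketch of that mistake-insertion idea; the “[BACK]” retraction marker and the exact construction are placeholders of mine, not necessarily the authors’ format.

```python
# Hypothetical sketch of the mistake-insertion scheme: occasionally insert a
# later, not-yet-ready step as a mistake, immediately retract it, then continue
# correctly. The "[BACK]" marker is a placeholder, not necessarily the authors'
# retraction token.
import random

random.seed(0)

def with_mistakes(solution_steps: list[str], p_mistake: float = 0.3) -> list[str]:
    noisy = []
    for i, step in enumerate(solution_steps):
        later = solution_steps[i + 1 :]
        if later and random.random() < p_mistake:
            noisy.append(random.choice(later))  # premature step = the mistake
            noisy.append("[BACK]")              # retraction marker (placeholder)
        noisy.append(step)                      # then the correct step
    return noisy

steps = ["x = 3", "y = x + 2", "z = y * x", "answer = z - 1"]
print(with_mistakes(steps))
```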
Hierarchical Heuristics
Hallucination is simply the result of LLMs learning format faster than the underlying task (knowledge, reasoning, etc.)
LLMs can implicitly learn very deep context-free grammar trees with rotary/relative attention
LLMs learn the hierarchy of context-free grammar trees via dynamic programming
DP states are encoded in hidden layers
DP transitions are encoded in attention layers
This does not require the CFG trees to be explicitly exposed (e.g. via an official dictionary of the language); a toy CYK-style DP illustrating these states and transitions is sketched at the end of this section
Nits: bidirectional models like BERT and DeBERTa fail to learn deep hierarchies, since they are encouraged to do local rather than global planning; they also fail to store knowledge effectively, even with mixed training and knowledge augmentation.
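To ground the DP analogy, here is a toy CYK-style dynamic program over a tiny CFG in Chomsky normal form: the chart cells play the role of DP states and the split-point loop plays the role of DP transitions. The grammar is my own toy example, far shallower than the synthetic grammars used in the papers.

```python
# Toy CYK-style dynamic program over a tiny CFG in Chomsky normal form.
# Chart cells are the DP states; the split-point loop is the DP transition.
# (Toy grammar of my own; the papers use much deeper synthetic CFGs.)

# Grammar: A -> 'a', B -> 'b', and the binary rules S -> A B and B -> A B.
lexical = {"a": {"A"}, "b": {"B"}}
binary = {("A", "B"): {"S", "B"}}

def cyk(tokens: list[str]) -> bool:
    n = len(tokens)
    # chart[i][j]: set of nonterminals that derive tokens[i..j]  (DP states)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        chart[i][i] = set(lexical.get(tok, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point (DP transition)
                for left in chart[i][k]:
                    for right in chart[k + 1][j]:
                        chart[i][j] |= binary.get((left, right), set())
    return "S" in chart[0][n - 1]

print(cyk(list("aab")))  # True: derived as S -> A (B -> A B)
print(cyk(list("ba")))   # False: no rule reduces B A
```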