CTB-3: LLM Heuristics

Zeyuan Allen-Zhu recently gave a fantastic presentation at ICML 2024, demonstrating some of his and his co-authors' findings on LLMs, which they call the "Physics of LLMs". Despite the convincing results, I don't think these findings are rigorously proven, so I would rather call them heuristics.

In this post I will summarize most of their findings into a concise list of LLM heuristics.

Knowledge Heuristics

Setup: given biographies of N individuals, pre-train on all the biographies, instruction-tune on the QAs of N/2 individuals, and evaluate on the QAs of the remaining N/2.
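
For concreteness, here is a minimal sketch of that split. The `split_bio_qa` function and the `individuals` schema are my own illustration, not code from the paper.

```python
import random

# Illustrative setup: pre-train on all N biographies, instruction-tune on the QAs of
# N/2 individuals, and evaluate extraction on the QAs of the held-out N/2.
def split_bio_qa(individuals, seed=0):
    """individuals: list of dicts with a 'bio' string and a 'qas' list of (question, answer)."""
    rng = random.Random(seed)
    shuffled = individuals[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    pretrain_corpus = [p["bio"] for p in shuffled]                    # all biographies
    finetune_qas = [qa for p in shuffled[:half] for qa in p["qas"]]   # QAs of seen individuals
    eval_qas = [qa for p in shuffled[half:] for qa in p["qas"]]       # QAs of held-out individuals
    return pretrain_corpus, finetune_qas, eval_qas
```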

[Knowledge storage and extraction]

  • Pre-training then instruction-tuning fails to make the stored knowledge extractable (QAs about the held-out individuals fail)

    • Solution: mixed training (add the instruction-tuning data into the pre-training stage)

    • Solution: knowledge augmentation of the pre-training data (e.g. restating the same person's biography multiple times in different styles); both solutions are sketched after this list

      • Augmenting the knowledge of "celebrities" also helps extraction for "minorities" whose data is not augmented
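
A minimal sketch of the two solutions above, assuming a toy biography schema with name, birthday, city, college, and major fields; the templates and the mixing scheme are illustrative, not the paper's exact recipe.

```python
import random

BIO_TEMPLATES = [
    "{name} was born on {birthday} in {city}. {name} studied {major} at {college}.",
    "Born in {city} on {birthday}, {name} attended {college} and majored in {major}.",
    "{name}, a {major} graduate of {college}, was born in {city} on {birthday}.",
]

def augment_bios(person, n_variants=3, seed=0):
    """Knowledge augmentation: restate the same facts in several styles."""
    rng = random.Random(seed)
    templates = rng.sample(BIO_TEMPLATES, k=min(n_variants, len(BIO_TEMPLATES)))
    return [t.format(**person) for t in templates]

def mixed_training_stream(bio_texts, qa_pairs, seed=0):
    """Mixed training: interleave instruction-tuning QAs into the pre-training stream
    instead of saving them all for a separate later stage."""
    rng = random.Random(seed)
    stream = bio_texts + [f"Q: {q} A: {a}" for q, a in qa_pairs]
    rng.shuffle(stream)
    return stream
```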

[Knowledge manipulation]

  • Partial and dual knowledge retrieval are hard

    • The model may answer the full birthday correctly yet fail to answer just the birth month

  • Knowledge classification and ranking are hard without CoT

    • When asked whether a person was born in an even year, the model needs to first state the year and only then classify it

    • The same holds for ranking two people's birthdays

  • Knowledge inverse search is hard (e.g. "Who was born on July 6, 1997?")

    • Solution: RAG

    • Solution: insert reverse knowledge into the pre-training data (see the sketch after this list)

    • Solution: insert line numbers into critical documents to enable reverse search over them
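
For the reverse-knowledge solution, a minimal sketch: each forward fact is also stated with the date first, so the inverse query becomes an ordinary left-to-right lookup. The templates and the name are made up for illustration.

```python
def reverse_augment(person):
    """Emit both the forward and the reversed surface form of a birthday fact."""
    forward = f"{person['name']} was born on {person['birthday']}."
    reverse = f"On {person['birthday']}, {person['name']} was born."
    return [forward, reverse]

print(reverse_augment({"name": "Jane Doe", "birthday": "July 6, 1997"}))
# ['Jane Doe was born on July 6, 1997.', 'On July 6, 1997, Jane Doe was born.']
```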

[Knowledge capacity]

  • All LLMs can store about 2 bits of knowledge per parameter given sufficient training (e.g. 1,000 exposures per fact); a back-of-envelope calculation follows this list

    • Implication: a 7B model is sufficient to store all of English Wikipedia plus textbooks

  • Insufficient training reduces knowledge capacity (e.g. about 1 bit/param with only 100 exposures)

    • The degradation is much worse for architectures with gated MLPs

  • int8 quantization does not affect knowledge capacity (i.e. 4x compression is essentially free)

    • int4 quantization hurts capacity by more than 2x

  • MoE does not reduce knowledge capacity, and is in fact very efficient when each expert stores an even share of the knowledge

  • Junk data significantly hurts knowledge capacity

    • Solution: prefix training data with its source domain (e.g. "from wikipedia.org") so the model can separate junk from reliable sources
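
A back-of-envelope sketch of the capacity law; the 2 bit/param and 1 bit/param figures are from the talk, while the helper function and the conversion to gigabytes are my own framing.

```python
def knowledge_capacity_bits(num_params, bits_per_param=2.0):
    """Upper-bound estimate of storable knowledge under the bits-per-parameter heuristic."""
    return num_params * bits_per_param

well_trained = knowledge_capacity_bits(7e9, bits_per_param=2.0)   # ~1000 exposures per fact
undertrained = knowledge_capacity_bits(7e9, bits_per_param=1.0)   # ~100 exposures per fact

print(f"7B model, 1000 exposures: ~{well_trained / 8 / 1e9:.2f} GB of knowledge")  # ~1.75 GB
print(f"7B model,  100 exposures: ~{undertrained / 8 / 1e9:.2f} GB of knowledge")  # ~0.88 GB
```

A couple of gigabytes of pure knowledge bits is the kind of budget behind the "7B is sufficient" implication above.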

Reasoning Heuristics

We define:

  • Level 0 reasoning: given a question, brute-force compute every parameter in the problem, whether needed or not

  • Level 1 reasoning: given a question, topologically sort only the necessary parameters and give the shortest CoT (see the sketch after this list)

  • Level 2 reasoning: before the question is even asked, mentally compute the all-pairs dependency graph (not needed for human reasoning)
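
To make level 1 concrete, here is a toy sketch that extracts only the parameters the query depends on and emits them in topological order, i.e. the shortest CoT. The dependency-graph representation is my own illustration of the idea.

```python
from graphlib import TopologicalSorter

def shortest_cot(deps, query):
    """deps maps each parameter to the set of parameters it directly depends on.
    Returns the minimal set of parameters needed for `query`, in computable order."""
    needed, stack = set(), [query]
    while stack:                                  # backward reachability from the query
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(deps.get(node, ()))
    subgraph = {n: {d for d in deps.get(n, ()) if d in needed} for n in needed}
    return list(TopologicalSorter(subgraph).static_order())

deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}, "e": {"a"}}
print(shortest_cot(deps, "d"))   # e.g. ['a', 'b', 'c', 'd']; 'e' is never computed (level 0 would)
```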

[Hidden reasoning process]

  • LLMs genuinely learn the skill of reasoning rather than memorizing solution templates

  • LLMs exhibit level 1 reasoning like humans, planning mentally even before the CoT is generated

  • LLMs secretly learn level 2 reasoning, beyond typical human capability (a probing sketch follows this list)

  • Model depth matters for reasoning: more layers are better, and the gap cannot be mitigated by CoT

  • High-quality synthetic reasoning data improves a model's reasoning capabilities
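
The level 2 claim comes from probing experiments; below is a minimal sketch of what such a linear probe looks like, with random features standing in for real hidden states. The function name and the toy data are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_dependency_probe(hidden_states, depends_labels):
    """Linear probe: from a hidden state taken right after the question (before any CoT token),
    predict whether parameter A is necessary for computing parameter B."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states, depends_labels)
    return probe

# Toy usage; real usage would extract hidden states from the LLM on synthetic reasoning problems.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))        # stand-in hidden states
y = (X[:, 0] > 0).astype(int)         # stand-in labels for "A is necessary for B"
probe = train_dependency_probe(X, y)
print(f"probe accuracy: {probe.score(X, y):.2f}")
```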

[Mistakes in reasoning]

  • LLMs often know they have just made a mistake, and can exhibit regretful behavior

  • LLMs may compute unnecessary reasoning parameters

    • Thanks to level 2 reasoning, we can detect such systematic mistakes before generation even starts (e.g. with probing)

  • LLMs may try to compute parameters that are not yet ready to be computed (their dependencies are unresolved)

  • We should pre-train with mistakes, e.g. by inserting mistakes and their corrections into the data (cf. STaR, Quiet-STaR, etc.)

    • Even when trained on such longer solution texts, the LLM can still generate shortest paths (i.e. level 1 reasoning)

    • Beam search, or fine-tuning with mistakes after pre-training, is too late and does not help much

    • One efficient way to insert mistakes: take later sentences of the solution (which are not yet ready to compute) and insert them earlier as mistakes, as sketched below
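
A minimal sketch of that last trick, assuming the solution steps are given in dependency order. The [BACK] retraction marker and the mistake probability are my own assumptions, not the paper's exact data format.

```python
import random

def insert_mistakes(solution_steps, p_mistake=0.3, seed=1):
    """Occasionally insert a premature later step (not yet computable) followed by a
    retraction marker, then continue with the correct step."""
    rng = random.Random(seed)
    noisy = []
    for i, step in enumerate(solution_steps):
        later = solution_steps[i + 1:]
        if later and rng.random() < p_mistake:
            noisy.append(rng.choice(later))   # a step whose dependencies are not ready yet
            noisy.append("[BACK]")            # marker teaching the model to retract it
        noisy.append(step)
    return noisy

steps = ["x = 3", "y = x + 2", "z = y * y"]
print(insert_mistakes(steps))
```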

Hierarchical Heuristics

  • Hallucination is just the LLM learning the format faster than the task itself (knowledge, reasoning, etc.)

  • LLMs can implicitly learn very deep context-free grammar (CFG) trees with rotary or relative attention

  • LLMs learn the hierarchy of CFG trees via dynamic programming (see the toy DP sketch after this list)

    • DP states are encoded in hidden layers

    • DP transitions are encoded in attention layers

    • This does not require the CFG trees to be explicitly exposed (e.g. via an official grammar or dictionary of the language)
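
To make the DP analogy concrete, here is the classical CYK dynamic program for a toy CFG in Chomsky normal form: the claim is that a transformer implicitly maintains something like this table, with each span's non-terminal set encoded in hidden states and the binary merges carried out by attention. The grammar below is an arbitrary toy example, not one from the paper.

```python
from itertools import product

GRAMMAR = {                  # toy CNF grammar: S -> A B, A -> S B, A -> 'a', B -> 'b'
    ("A", "B"): {"S"},
    ("S", "B"): {"A"},
    "a": {"A"},
    "b": {"B"},
}

def cyk(tokens):
    """CYK membership test: table[i][l] holds the non-terminals generating tokens[i:i+l]."""
    n = len(tokens)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][1] = set(GRAMMAR.get(tok, set()))           # DP base case (terminals)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):                   # DP transition: merge two spans
                for left, right in product(table[i][split], table[i + split][length - split]):
                    table[i][length] |= GRAMMAR.get((left, right), set())
    return "S" in table[0][n]

print(cyk(list("ab")))     # True: S -> A B
print(cyk(list("abb")))    # False: the string parses only to A, not to the start symbol S
print(cyk(list("abbb")))   # True: S -> (A -> (S -> A B) B) B
```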

Nits: bidirectional models like BERT and DeBERTa fail to learn deep hierarchy, since they are encouraged to do local rather than global planning; they also fail at effective knowledge storage, even with mixed training and knowledge augmentation.

Next

CTB-2: Do we really need DD (discrete tokens and diffusion)?