Articles / News

Constrained output generation through LLMs

Structure specification formats supported by different libraries

  • LLMs generate tokens auto-regressively, selecting the next token based on probability distributions learned during training. At inference time, this probability distribution is heavily biased by the model’s training data, meaning the next token is usually drawn from a specific subset of all possible tokens. That said, the model could technically generate any token—even when, by logical constraints, only a fixed subset of tokens should be valid.
  • Consider a scenario where you have text data on thousands of people, including their age, gender, and ethnicity. Most LLMs can extract this information and structure it into JSON without issue. However, as the JSON schema grows more complex, the model may begin to hallucinate keys, introducing elements that don’t belong and that could break your downstream code. The question then is: how do we force the LLM to generate 100% valid JSON at inference time?
  • Context-Free Grammars (CFGs) define strict rules on which tokens are allowed at each step so that a string remains valid within the grammar. For JSON constrained by a schema, for example, we know exactly which keys should appear and in what order, that a key must always be followed by a :, and that an object must close with } once all keys are defined.
  • If we already know what token must come next (or at least which subset of tokens are valid at that step), we can override the LLM’s normal token selection process. Instead of relying on the model to generate the correct token probabilistically, we can artificially bias the probabilities during generation to ensure only valid tokens are selected. This guarantees that the output strictly adheres to a given grammar or schema.
  • The key constraint here is that this correction must happen during inference, not afterwards as a validation step. This makes the technique impractical to apply yourself to closed-source models unless the provider lets you modify inference-time probabilities, which major closed-source LLM APIs (e.g., GPT) generally do not.
  • However, this approach can be applied to open-weight models like Llama 3.1, or to other models you can run yourself, for example through Hugging Face.
  • Several libraries offer these capabilities. The image above summarizes which structure specification formats each library supports.
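
The masking idea described above can be sketched in a few lines. This is a toy illustration only: the vocabulary, the `ALLOWED_AFTER` transition table, and the `fake_logits` stand-in for a model are all hypothetical, and a real library would derive the allowed set from a grammar or schema over the model's actual tokenizer.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of tokens.
VOCAB = ["{", "}", ":", '"age"', '"hobby"', "42", "7"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Toy grammar: which tokens may follow the previous token.
# None stands for "start of output".
ALLOWED_AFTER = {
    None: {"{"},
    "{": {'"age"'},
    '"age"': {":"},
    ":": {"42", "7"},
    "42": {"}"},
    "7": {"}"},
}

def fake_logits(prefix):
    """Stand-in for a model: it always prefers the invalid key "hobby"."""
    logits = [0.0] * len(VOCAB)
    logits[TOKEN_ID['"hobby"']] = 5.0
    logits[TOKEN_ID["42"]] = 1.0
    return logits

def mask_logits(logits, allowed_tokens):
    """Set every token outside the allowed set to -inf so it can never win."""
    allowed_ids = {TOKEN_ID[t] for t in allowed_tokens}
    return [s if i in allowed_ids else -math.inf for i, s in enumerate(logits)]

def generate():
    """Greedy decoding where the grammar, not the model, decides what is legal."""
    out, prev = [], None
    while prev != "}":
        masked = mask_logits(fake_logits(out), ALLOWED_AFTER[prev])
        next_id = max(range(len(masked)), key=lambda i: masked[i])
        prev = VOCAB[next_id]
        out.append(prev)
    return out

result = generate()
print(" ".join(result))
```

Even though the stand-in model scores the hallucinated key highest at every step, it is never emitted, because the mask removes it before the token is selected. The model's preferences still matter wherever the grammar leaves a choice (here, between the two number tokens).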

People by WTF (What is intelligence really?) - Yann LeCun

  • In the 1950s, there were two major schools of thought on intelligence: one saw intelligence as the ability to solve problems, while the other saw it as the ability to learn.
  • Some research supports the idea that learning happens through the strengthening of neural connections, which is analogous to how weights are updated in neural networks.
  • Early neural networks, like MLPs, were initially implemented as electronic circuits but performed poorly. This led to AI splitting into three major approaches: heuristic intelligence (search-based methods), expert systems, and neural networks, which later evolved into mathematical pattern recognition and statistical machine learning.
  • LLMs lack persistent memory. They have two forms of memory: parameter-based memory (knowledge stored in weights) and context-based memory (limited working memory). What they need is a memory system closer to the hippocampus, allowing for true long-term recall.
  • System 1 vs. System 2 reasoning: System 1 is intuitive and fast, while System 2 is deliberate and logical. Traditional AI search methods fell into System 2. LLMs operate mostly in System 1, but with improvements like o1-style reasoning, they are slowly creeping into System 2 territory, though very inefficiently.
  • Human-level intelligence will likely emerge from models that can watch and understand videos rather than just process text, because of the vast wealth of real-world information contained in video.
  • Predicting the next frame in a video is far harder than predicting the next word in a sentence since video operates in a continuous space, making it computationally intractable.
  • JEPA (Joint Embedding Predictive Architecture): Instead of predicting the raw next frame, it encodes a video into a representation space, then predicts future representations—similar to next-word prediction but in a more abstract feature space.
  • Vertical fine-tuning is a strong strategy for B2B applications, while AI assistants work well in urban areas. In rural areas, speech-based interaction in local languages is critical for accessibility.
  • In five years, AI platforms will likely be open-sourced, which will accelerate innovation and benefit AI technology overall.

Understanding Reasoning LLMs

DeepSeek's training pipeline

  • Reasoning comes from a combination of training methodology, prompting, and test-time scaling: essentially, train the model on how to think, ask the model to think, and give it time to think.
  • Reasoning models shouldn’t be used when you can get quick and cheap responses from regular LLMs, when tasks are knowledge-based rather than reasoning-based, or when the task is simple (since the reasoning model over-thinks).
  • DeepSeek’s training pipeline:
    • RL with rewards for accuracy and format was applied directly to V3 (the foundation model) to get R1-Zero. R1-Zero is called cold-started since it did not use any Supervised Fine-Tuning (SFT). Traditionally, foundation models go through SFT followed by RLHF for human alignment.
    • The team then built on R1-Zero, using cold-start SFT data plus more RL (rule-based verification for math & code), to get R1.
    • Using SFT data generated from R1 (CoT) and V3 (knowledge), the team fine-tuned Qwen and Llama models to “distill” reasoning ability into them, demonstrating that a larger model’s reasoning capability can be transferred to smaller models. This is not distillation in the traditional sense, which directly compares teacher and student logits.
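
To make the "rewards for accuracy & format" idea concrete, here is a minimal sketch of what a rule-based reward could look like. The `<think>`/`<answer>` tag layout, the exact-match check, and the equal weighting of the two rewards are assumptions made for illustration, not DeepSeek's actual implementation.

```python
import re

# Assumed output layout: reasoning inside <think>, final answer inside <answer>.
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the reference, else 0.0."""
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    """Sum of both rewards; the equal weighting is an arbitrary choice here."""
    return accuracy_reward(completion, reference) + format_reward(completion)

good = "<think>2 + 2 equals 4.</think><answer>4</answer>"
wrong = "<think>2 + 2 equals 5.</think><answer>5</answer>"
unformatted = "4"

print(total_reward(good, "4"), total_reward(wrong, "4"), total_reward(unformatted, "4"))
```

Because both checks are plain rules, no learned reward model is needed, which is what makes this style of reward practical for verifiable domains like math and code.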

Alfred Lin on The Knowledge Project (Partial)

  • There’s a balance between pushing forward with velocity and removing obstacles—knowing which to prioritize is key to long-term success.
  • You don’t just need to be better; you need to be different. Differentiation is what creates real competitive advantage.
  • DoorDash started on a college campus but quickly expanded to suburbs, realizing that delivering food to people who would otherwise drive 20+ minutes created more value than competing in dense cities where takeout was already convenient.
  • Being both right and contrarian is incredibly difficult, but it’s also the only way to generate outsized returns. This is also called advantageous divergence.
  • At Sequoia, the hiring philosophy is to opt for slope over intercept—past experience doesn’t always predict future growth. What matters is whether someone can scale as fast as the company; otherwise, misalignment will become a bottleneck.
  • In hiring, it’s critical to balance regrettable turnovers (losing key talent) with non-regrettable turnovers (letting go of those who don’t fit)—both impact a company’s long-term trajectory.

Books

Blockchain Governance - MIT Press Essential Knowledge Series

  • Technology has always driven the evolution of governance—from the shift to settled civilizations, to the rise of transportation, every major technological advancement has reshaped how societies govern themselves.
  • In ancient Greece, city-states (polis) operated under radically different constitutions, each experimenting with unique forms of governance.
  • The emergence of functional blockchains has led to a similar Cambrian explosion of governance models, where existing structures are being questioned, and new decentralized autonomous systems—like DAOs—are being proposed. At the same time, debates around how blockchains themselves should be regulated are gaining momentum.
  • This creates a Collingridge dilemma. Blockchains are still evolving, making it difficult to fully grasp their societal impact. Regulating too early risks stifling innovation, but waiting too long could mean enacting insufficient or ineffective controls. Worse, once regulations are in place, they could slow blockchain adoption and development, reinforcing the very problem they sought to prevent.
  • The evolution of the blockchain:
    • Extropians believed science and technology would enable human immortality—a direct challenge to entropy. They saw existing governance structures as barriers to capital allocation toward longevity research and sought new governance models. A decentralized autonomous currency was their first step toward that vision.
    • Cypherpunks, operating in the same era, were skeptical of government control and also pushed for decentralized systems—starting with a decentralized currency as the foundation.
    • DigiCash introduced anonymous transactions but relied on a central authority for issuance and ledger maintenance.
    • BitGold followed, aiming for full decentralization and anonymity, but suffered from the double-spending problem—where a user could spend the same currency on multiple ledgers simultaneously.
    • Satoshi Nakamoto (pseudonym) resolved these issues with Bitcoin, combining:
      • Decentralization, using an early form of distributed ledgers (blockchain).
      • Proof of Work (PoW) for verification, ensuring that:
        • As the chain grows in value, tampering with the ledger becomes exponentially harder.
        • Currency is issued to miners who demonstrate computational work (and by extension, energy expenditure).
    • Proof of Stake (PoS) is the next evolution: instead of mining, participants pledge their own coins as a stake to add new blocks. Participants with larger stakes are more likely to be chosen to add the next block (selection is typically stake-weighted rather than a literal auction), and if malicious intent is detected, their stake is slashed.
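
The Proof of Work idea above can be illustrated with a toy miner. This is a simplified sketch: the block layout, the `|` separator, and the "leading hex zeros" difficulty rule are assumptions made for readability, whereas Bitcoin itself hashes a binary block header twice with SHA-256 and compares the result against a numeric target.

```python
import hashlib

def block_hash(prev_hash: str, data: str, nonce: int) -> str:
    """Hash of a toy block made of the previous hash, some data, and a nonce."""
    return hashlib.sha256(f"{prev_hash}|{data}|{nonce}".encode()).hexdigest()

def mine(prev_hash: str, data: str, difficulty: int = 3):
    """Try nonces until the hash starts with `difficulty` hex zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = block_hash(prev_hash, data, nonce)
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1

def is_valid(prev_hash: str, data: str, nonce: int, digest: str, difficulty: int = 3) -> bool:
    """Verification is a single hash, while mining needed many attempts."""
    return digest == block_hash(prev_hash, data, nonce) and digest.startswith("0" * difficulty)

genesis = "0" * 64
nonce1, hash1 = mine(genesis, "alice pays bob 1")
nonce2, hash2 = mine(hash1, "bob pays carol 1")

print(is_valid(genesis, "alice pays bob 1", nonce1, hash1))   # honest block
print(is_valid(genesis, "alice pays bob 99", nonce1, hash1))  # tampered data
```

Since each block's hash feeds into the next block, changing an old block means re-mining it and every block after it, which is why tampering gets harder as the chain grows.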

Research Papers

The foundations of LLMs

  • Self-supervised pretraining paradigms:
    Commonly employed pre-training paradigms

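    • Decoder only : Causal language modeling (next-word prediction). For every sentence, predict each token from the tokens that precede it, starting from the first token. The probability of the sentence is the product of these per-token probabilities, and the loss is its negative log (equivalently, the sum of the per-token negative log-probabilities).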
    • Encoder only :
      • Masked language modeling: take a sentence, mask some tokens, have the model predict those masked tokens based on all the remaining context.
      • Permuted language modeling: take a sentence, add positional embeddings, then predict tokens from the remaining tokens, using attention-level masks to control what each prediction can see. Advantage: no extra [MASK] token is used.
      • Classification based modeling (auxiliary tasks) :
        • Next Sentence Prediction (NSP): create a compound sentence of the form [CLS] A [SEP] B. Based on the [CLS] token, predict if B follows A. This is in addition to masked language modeling.
        • Adversarial training: for each masked token, pass the prediction through another model which predicts whether the token was generated or part of the original text. This way you have two adversaries competing against each other, which makes masked prediction better.
    • Encoder-Decoder:
      • Masked language modeling - here, the decoder will predict the masked tokens only.
      • Prefix language modeling - Based on some input words, the model will attempt to autoregressively predict the remainder of the sentence. (This is mostly what is used in fine-tuning for Decoder-only models)
      • Autoregressive denoising -
        • Mask denoising - We pass a sentence to the model in which we mask some tokens, add extra masks where no token is actually masked and have the model generate the original string.
        • Deletion denoising - Random parts of the sentence are deleted and the model is tasked with reconstructing the complete original sentence.
        • Span denoising - Instead of masking individual tokens, some sequences of tokens are masked as a span. Similar to masking, spans may be added in places where no tokens were masked. The model is tasked with reconstructing the original sentence.
        • Document rotation - similar to the classic data structures and algorithms (DSA) problem in which you must find the correct K by which a sentence is rotated. That is, treating the sentence as a circular array, find its correct starting point.
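
The denoising objectives listed above differ only in how the input is corrupted; the training target is always the original sentence. A minimal sketch of the corruption step, with positions passed in explicitly so the behaviour is easy to follow (in practice they would be sampled at random, and the helper names here are hypothetical):

```python
MASK = "[MASK]"

def mask_tokens(tokens, positions):
    """Mask denoising: replace each chosen position with a mask token."""
    chosen = set(positions)
    return [MASK if i in chosen else t for i, t in enumerate(tokens)]

def delete_tokens(tokens, positions):
    """Deletion denoising: drop the chosen positions entirely."""
    chosen = set(positions)
    return [t for i, t in enumerate(tokens) if i not in chosen]

def mask_span(tokens, start, length):
    """Span denoising: replace a run of tokens with a single mask token.

    A length of 0 inserts a mask where nothing was removed.
    """
    return tokens[:start] + [MASK] + tokens[start + length:]

def rotate(tokens, k):
    """Document rotation: start the sequence at position k, wrapping around."""
    k %= len(tokens)
    return tokens[k:] + tokens[:k]

original = "the cat sat on the mat".split()

print(mask_tokens(original, [1, 4]))
print(delete_tokens(original, [1, 4]))
print(mask_span(original, 1, 3))
print(mask_span(original, 2, 0))
print(rotate(original, 2))
```

In each case the encoder sees the corrupted sequence and the decoder is trained to generate `original`.
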
    • Employing BERT :

      • BERT is an encoder-only model which is pre-trained with Masked Language Modeling & NSP.
      • RoBERTa is a scaled-up BERT without NSP. With scale, the auxiliary NSP task was found to make little to no difference to overall model performance.
      Translation modeling with BERT - training multilingual encoders

      • Multilingual models are trained using Translation Language Modeling (TLM), where the model learns from machine translation tasks with masked tokens in both languages. This encourages the development of a shared embedding space that integrates tokens from all involved languages.
      • For low-resource languages, multilingual training has proven to be an effective way to transfer task knowledge. The process involves first using Translation Language Modeling across multiple languages, followed by training on downstream tasks (e.g., span prediction or text highlighting) in the language with the most available training data. This enables the model to generalize and perform the same tasks in resource-constrained languages, despite their limited data.
      • What kinds of inference can be performed with BERT? Note: pre-trained BERT should be fine-tuned for these tasks.
        • Classification - using the [CLS] token, which is considered the representation for the entire text
        • Regression - adding a sigmoid over the [CLS] token to output a value. Ex : sentence similarity.
        Span-prediction using BERT

        • Span prediction - formulated as sequence labelling: for every token predict none, start, end. Take sections between start-end as answers for a given question. Ex : reading comprehension.
        • Most commonly encoding for encoder-decoder models.
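
The span-prediction formulation above (label every token as none, start, or end, then read off the text between start and end) can be sketched as a small decoding step. The labels here are hard-coded for illustration; in a fine-tuned BERT they would come from a per-token classification head, and `extract_spans` is a hypothetical helper name.

```python
def extract_spans(tokens, labels):
    """Collect the text between each 'start' label and the next 'end' label."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "start":
            start = i
        elif label == "end" and start is not None:
            spans.append(" ".join(tokens[start:i + 1]))
            start = None
    return spans

# Question (not modelled here): "Which landmark is in Paris?"
tokens = "The Eiffel Tower is in Paris".split()
labels = ["none", "start", "end", "none", "none", "none"]

print(extract_spans(tokens, labels))
```

Note that this three-label scheme cannot express a single-token answer, since one token cannot be both start and end; a real setup would need an extra label or separate start and end predictions to cover that case.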