• (WIP) GRPO from First Principles

    Group Relative Policy Optimization (GRPO) is a widely used policy gradient algorithm popularized by DeepSeek [1]. This post starts from first principles and adds complexity step by step until it reaches the full GRPO objective.

  • Small Proofs and Derivations

    Last updated: March 1, 2026

  • Three Discrete Sampling Methods

    This post describes and implements three methods for sampling from any discrete probability distribution.

  • Linear Sandwich

    Creating a Linear sandwich by stacking several Linear layers on top of each other, with no nonlinearity in between, just results in another linear transformation. Below, we show that this is indeed the case.
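
    As a quick numerical check of this claim, here is a minimal sketch in plain Python. It assumes each Linear layer computes W x + b with no activation in between; the helper names (`linear`, `collapse`, `random_layer`) and the layer sizes are illustrative choices, not taken from the post itself. Two layers compose as W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), so the whole stack folds into a single weight matrix and bias.

```python
import random

def matvec(W, x):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    # Matrix-matrix product A @ B for lists of rows.
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def linear(W, b, x):
    # One Linear layer: y = W x + b.
    return [y + bi for y, bi in zip(matvec(W, x), b)]

def collapse(layers):
    # Fold a stack of (W, b) layers into one equivalent (W, b) pair.
    W, b = layers[0]
    for W_next, b_next in layers[1:]:
        b = linear(W_next, b_next, b)  # new bias: W_next b + b_next
        W = matmul(W_next, W)          # new weight: W_next W
    return W, b

def random_layer(rng, n_out, n_in):
    # Hypothetical helper: a layer with uniform random weights and biases.
    W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-1, 1) for _ in range(n_out)]
    return W, b

rng = random.Random(0)
layers = [random_layer(rng, 4, 3), random_layer(rng, 5, 4), random_layer(rng, 2, 5)]
x = [rng.uniform(-1, 1) for _ in range(3)]

# Pass x through the sandwich one layer at a time.
stacked = x
for W_i, b_i in layers:
    stacked = linear(W_i, b_i, stacked)

# Pass x through the single collapsed layer instead.
W_eq, b_eq = collapse(layers)
single = linear(W_eq, b_eq, x)

# The two outputs should agree up to floating-point rounding.
max_diff = max(abs(s - t) for s, t in zip(stacked, single))
```

    The collapsed layer maps 3 inputs to 2 outputs directly, so the intermediate widths 4 and 5 add no expressive power on their own.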