• (WIP) Building GRPO From Scratch

    Group Relative Policy Optimization (GRPO) is a widely used reinforcement learning algorithm that was popularized by DeepSeek. In this post, we start entirely from first principles and progressively add complexity until we get to the full GRPO objective.

  • Small Proofs and Derivations

    Last updated: Dec 2, 2024

  • Three Discrete Sampling Methods

    This post describes and implements three methods that can be used to sample from any discrete probability distribution.

  • Linear Sandwich

    Creating a Linear sandwich by stacking several Linear layers directly on top of each other, with no nonlinearity in between, just results in another linear transformation (strictly, an affine one once bias terms are included). Below, we show that this is indeed the case.
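
    As a rough numerical illustration of the claim, here is a minimal pure-Python sketch. It is not taken from the post itself: the helper names (`random_linear`, `collapse`, `forward`) and the layer sizes are hypothetical, and each layer is assumed to compute `W x + b`. The sketch folds a stack of such layers into a single `(W, b)` pair and checks that both give the same output on a random input.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def add(u, v):
    """Elementwise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def random_linear(n_in, n_out, rng):
    """Hypothetical helper: a random layer (W, b) computing W x + b."""
    W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-1, 1) for _ in range(n_out)]
    return W, b

def forward(layers, x):
    """Apply the layers one after another, with no nonlinearity in between."""
    for W, b in layers:
        x = add(matvec(W, x), b)
    return x

def collapse(layers):
    """Fold a stack of layers (applied first to last) into one (W, b) pair."""
    W, b = layers[0]
    for W_next, b_next in layers[1:]:
        # W_next (W x + b) + b_next = (W_next W) x + (W_next b + b_next)
        b = add(matvec(W_next, b), b_next)
        W = matmul(W_next, W)
    return W, b

rng = random.Random(0)
layers = [random_linear(4, 8, rng), random_linear(8, 8, rng), random_linear(8, 3, rng)]
x = [rng.uniform(-1, 1) for _ in range(4)]

W, b = collapse(layers)
stacked = forward(layers, x)      # output of the three-layer sandwich
single = add(matvec(W, x), b)     # output of the single collapsed layer
max_diff = max(abs(s - t) for s, t in zip(stacked, single))
print(max_diff)                   # should be at floating-point rounding level
```

    The collapsed `W` has shape 3 x 4, so in this sketch the intermediate width of 8 adds no expressive power; any difference between the two outputs comes only from floating-point rounding.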