-
(WIP) GRPO from First Principles
Group Relative Policy Optimization (GRPO) is a widely used policy gradient algorithm that was popularized by DeepSeek [1]. In this post, we start entirely from first principles and progressively add complexity until we get to the full GRPO objective.
-
Small Proofs and Derivations
Last updated: March 1, 2026
-
Three Discrete Sampling Methods
This post describes and implements three methods for sampling from any discrete probability distribution.
-
Linear Sandwich
Creating a Linear sandwich by stacking a bunch of Linear layers on top of each other just results in another linear transformation. Below, we show that this is indeed the case.
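As a rough illustration of the claim (not the post's own proof), the sketch below stacks three affine layers of the form y = Wx + b with no activation in between, then folds them into a single pair (W, b) using W2(W1 x + b1) + b2 = (W2 W1)x + (W2 b1 + b2). The helper names (`linear`, `collapse`, `random_layer`) and the layer sizes are made up for this example, and plain Python lists stand in for a tensor library.

```python
import random

def matvec(W, x):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    # Matrix-matrix product A @ B for list-of-rows matrices.
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def linear(W, b, x):
    # One affine layer: y = W x + b.
    return [y + bi for y, bi in zip(matvec(W, x), b)]

def collapse(layers):
    # Fold [(W1, b1), (W2, b2), ...], applied in that order, into one (W, b).
    W, b = layers[0]
    for W_next, b_next in layers[1:]:
        b = linear(W_next, b_next, b)  # new bias: W_next b + b_next
        W = matmul(W_next, W)          # new weight: W_next W
    return W, b

def random_layer(rng, n_out, n_in):
    # Hypothetical helper: a layer with uniform random weights and bias.
    W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-1, 1) for _ in range(n_out)]
    return W, b

rng = random.Random(0)
layers = [random_layer(rng, 4, 3), random_layer(rng, 5, 4), random_layer(rng, 2, 5)]
x = [rng.uniform(-1, 1) for _ in range(3)]

# Apply the three layers one after another.
stacked = x
for W, b in layers:
    stacked = linear(W, b, stacked)

# Apply the single collapsed layer.
W_c, b_c = collapse(layers)
single = linear(W_c, b_c, x)

# The two outputs should agree up to floating-point rounding.
max_gap = max(abs(s - t) for s, t in zip(stacked, single))
print(max_gap)
```

The collapsed weight has shape 2 x 3, matching the input and output sizes of the whole stack, so the intermediate widths of 4 and 5 add no expressive power on their own.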