<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://www.jsigal.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.jsigal.com/" rel="alternate" type="text/html" /><updated>2024-11-07T15:37:30+00:00</updated><id>https://www.jsigal.com/feed.xml</id><title type="html">Jesse Sigal</title><subtitle>Academic website of Jesse Sigal</subtitle><author><name>Jesse Sigal</name><email>firstname.lastname(at)ed.ac.uk</email></author><entry><title type="html">Deriving forward and reverse mode AD</title><link href="https://www.jsigal.com/theory/deriving-forward-and-reverse-mode-ad/" rel="alternate" type="text/html" title="Deriving forward and reverse mode AD" /><published>2024-08-16T00:00:00+00:00</published><updated>2024-11-07T15:35:33+00:00</updated><id>https://www.jsigal.com/theory/deriving-forward-and-reverse-mode-ad</id><content type="html" xml:base="https://www.jsigal.com/theory/deriving-forward-and-reverse-mode-ad/"><![CDATA[<p>We will derive forward and reverse mode automatic differentiation (AD) for pure, straight-line programs by example.
<!--excerpt-->
We assume that the reader is familiar with partial derivatives of real-valued functions, as well as matrix-matrix and matrix-vector multiplication.
This post is heavily inspired by the explanation in <a class="citation" href="#pearlmutter_reverse-mode_2008">(Pearlmutter &amp; Siskind, 2008)</a>, which helped me first understand AD.</p>

<h2 class="section" id="automatic-differentiation"><span class="numbering" id="1">1</span>Automatic differentiation</h2>

<p>The family of algorithms known as automatic differentiation is the foundation of the tools which allow automated calculation of derivatives.
The family can be coarsely divided into <em>forward mode</em> and <em>reverse mode</em>.
Multiple modes exist because their asymptotics depend on different features of the differentiated programs.
Forward mode AD was introduced in <a class="citation" href="#wengert_simple_1964">(Wengert, 1964)</a>, and reverse mode AD was created by <a class="citation" href="#speelpenning_compiling_1980">(Speelpenning, 1980)</a> in his thesis.</p>

<p>AD works in the presence of many programming language features, but its essence can still be understood by looking at pure, straight-line programs.
These are programs consisting of a sequence of variable assignments where the value assigned is calculated from the previous variables using a pure function.</p>

<h2 class="section" id="the-multivariate-derivative"><span class="numbering" id="2">2</span>The multivariate derivative</h2>

<p>The multivariate version of differentiation is given by the <a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian</a>, which, for a differentiable function <span class="kdmath">$F \colon \R^n \to \R^m$</span> and a point <span class="kdmath">$\vec{x} \in \R^n$</span>, we denote by <span class="kdmath">$\nabla F(\vec{x})$</span>.
The Jacobian <span class="kdmath">$\nabla F(\vec{x})$</span> is an <span class="kdmath">$m \times n$</span> matrix containing all the partial derivatives of <span class="kdmath">$F$</span> at <span class="kdmath">$\vec{x}$</span>.
Thus, writing <span class="kdmath">$F(\vec{x})$</span> as <span class="kdmath">$F(\vec{x}) = (f_{1}(\vec{x}),\ldots,f_{m}(\vec{x}))$</span> for differentiable functions <span class="kdmath">$f_{j} \colon \R^n \to \R$</span>, the Jacobian <span class="kdmath">$\nabla F(\vec{x})$</span> is
\[
  \nabla F(\vec{x}) \defeq
  \begin{pmatrix}
    \partial_{1}f_1(\vec{x}) &amp; \cdots &amp; \partial_{n}f_1(\vec{x}) \cr
    \vdots                   &amp; \ddots &amp; \vdots                   \cr
    \partial_{1}f_m(\vec{x}) &amp; \cdots &amp; \partial_{n}f_m(\vec{x}) \cr
  \end{pmatrix}
\]
where <span class="kdmath">$\partial_{i}$</span> is the <span class="kdmath">$i^{\text{th}}$</span> partial derivative operator.
The Jacobian satisfies the multivariate chain rule <span class="kdmath">$\nabla (G \circ F)(\vec{x}) = \nabla G (F(\vec{x})) \times \nabla F(\vec{x})$</span>.
The chain rule is the key behind forward and reverse mode AD.</p>

<h2 class="section" id="the-straight-line-program"><span class="numbering" id="3">3</span>The straight-line program</h2>

<p>Consider the algebraic definition
\[
  z = h\left( g(f(a),b), f(a)\right)
\]
where <span class="kdmath">$a, b \in \R$</span>, <span class="kdmath">$f \colon \R \to \R$</span>, <span class="kdmath">$g, h \colon \R^2 \to \R$</span>, and all functions are differentiable.
We can rewrite this as a sequence of calculations using intermediate variables
\[
  \begin{aligned}
    x &amp;= f(a) &amp;(1)\cr
    y &amp;= g(x, b) &amp; (2)\cr
    z &amp;= h(y, x) &amp; (3)
  \end{aligned}
\]
and consider the sequence as a pure, straight-line program where the variables <span class="kdmath">$a, b$</span> are inputs and the variables <span class="kdmath">$x, y, z$</span> are initialized to <span class="kdmath">$0$</span>. 
We now regard the state of the program at each line as a five-tuple <span class="kdmath">$(a, b, x, y, z) \in \R^5$</span> containing the values of our variables.
Thus, each line <span class="kdmath">$(i)$</span> gives a function <span class="kdmath">$F_i \colon \R^5 \to \R^5$</span>, i.e.
\[
  \begin{aligned}
    F_1(v_0, v_1, v_2, v_3, v_4) &amp;= (v_0, v_1, f(v_0), v_3, v_4) \cr
    F_2(v_0, v_1, v_2, v_3, v_4) &amp;= (v_0, v_1, v_2, g(v_2, v_1), v_4)\cr
    F_3(v_0, v_1, v_2, v_3, v_4) &amp;= (v_0, v_1, v_2, v_3, h(v_3, v_2)).
  \end{aligned}
\]
Our program can then be rewritten to
\[
  \begin{aligned}
    \vec{x}_0 &amp;= (a, b, 0, 0, 0) \cr
    \vec{x}_1 &amp;= F_1(\vec{x}_0) \cr
    \vec{x}_2 &amp;= F_2(\vec{x}_1) \cr
    \vec{x}_3 &amp;= F_3(\vec{x}_2)
  \end{aligned}
\]
where <span class="kdmath">$\vec{x}_3$</span> gives the final state.</p>
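<p>As a concrete sketch, the state-transformer view can be written out directly. The specific choices <span class="kdmath">$f(a) = a^2$</span>, <span class="kdmath">$g(x, b) = x \cdot b$</span>, and <span class="kdmath">$h(y, x) = y + x$</span> below are illustrative assumptions, not fixed by the derivation:</p>

```python
# State-transformer view of the straight-line program.
# Each F_i reads the five-tuple state (a, b, x, y, z) and updates
# exactly one component, mirroring F_1, F_2, F_3 above.
# The concrete f, g, h are illustrative choices only.

def f(a):    return a * a
def g(x, b): return x * b
def h(y, x): return y + x

def F1(s):
    a, b, x, y, z = s
    return (a, b, f(a), y, z)

def F2(s):
    a, b, x, y, z = s
    return (a, b, x, g(x, b), z)

def F3(s):
    a, b, x, y, z = s
    return (a, b, x, y, h(y, x))

# Running the program is composing the state transformers;
# the final component holds z = h(g(f(a), b), f(a)).
a, b = 3.0, 2.0
x3 = F3(F2(F1((a, b, 0.0, 0.0, 0.0))))  # (3.0, 2.0, 9.0, 18.0, 27.0)
```

<p>With these choices <span class="kdmath">$z = a^2 b + a^2$</span>, which the derivative calculations below can be checked against.</p>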

<h2 class="section" id="forward-mode"><span class="numbering" id="4">4</span>Forward mode</h2>

<p>We can now view our program as a composition of state-transforming functions, namely <span class="kdmath">$F_3 \circ F_2 \circ F_1$</span>.
We can now calculate its derivative using the chain rule as
\[
  \nabla (F_3 \circ F_2 \circ F_1)(\vec{x}_0) = \nabla F_3 (\vec{x}_2) \times \nabla F_2(\vec{x}_1) \times \nabla F_1 (\vec{x}_0)
\]
where <span class="kdmath">$\times$</span> is matrix-matrix multiplication, and later matrix-vector multiplication as well.
The crux of both forward and reverse mode AD is this calculation, which they each use differently.</p>

<p>For forward mode, we observe that the matrix product can be computed from right-to-left by
\[
  \begin{aligned}
    X_1 &amp;= \nabla F_1(\vec{x}_0) \cr
    X_2 &amp;= \nabla F_2(\vec{x}_1) \times X_1 \cr
    X_3 &amp;= \nabla F_3(\vec{x}_2) \times X_2.
  \end{aligned}
\]
It would be inefficient to materialize entire matrices in practice, and so we instead multiply on the right by a vector <span class="kdmath">$\vec{dx}_0$</span> to obtain
\[
  \nabla (F_3 \circ F_2 \circ F_1)(\vec{x}_0) \times \vec{dx}_0 = \nabla F_3 (\vec{x}_2) \times \nabla F_2(\vec{x}_1) \times \nabla F_1 (\vec{x}_0) \times \vec{dx}_0
\]
giving the sequence of vectors
\[
  \begin{aligned}
    \vec{dx}_1 &amp;= \nabla F_1(\vec{x}_0) \times \vec{dx}_0\cr
    \vec{dx}_2 &amp;= \nabla F_2(\vec{x}_1) \times \vec{dx}_1\cr
    \vec{dx}_3 &amp;= \nabla F_3(\vec{x}_2) \times \vec{dx}_2.
  \end{aligned}
\]
Calculating the Jacobian of the function <span class="kdmath">$F_1(v_0, v_1, v_2, v_3, v_4) = (v_0, v_1, f(v_0), v_3, v_4)$</span> at <span class="kdmath">$\vec{x}_0$</span>, we see
\[
  \begingroup
  \nabla F_1 (\vec{x}_0) =
  \begin{pmatrix}
    1 &amp; &amp; &amp; &amp; \cr
    &amp; 1 &amp; &amp; &amp; \cr
    \partial f(a) &amp; &amp; 0 &amp; &amp; \cr
    &amp; &amp; &amp; 1 &amp; \cr
    &amp; &amp; &amp; &amp; 1 \cr
  \end{pmatrix}
  \endgroup
\]
where <span class="kdmath">$\partial f$</span> is shorthand for the derivative of <span class="kdmath">$f \colon \R \to \R$</span> at <span class="kdmath">$a$</span> and empty entries are <span class="kdmath">$0$</span>.
Similarly,
\[
  \nabla F_2 (\vec{x}_1) =
  \begin{pmatrix}
    1 &amp; &amp; &amp; &amp; \cr
    &amp; 1 &amp; &amp; &amp; \cr
    &amp; &amp; 1 &amp; &amp; \cr
    &amp; \partial_{R} g(x, b) &amp; \partial_{L} g(x, b) &amp; 0 &amp; \cr
    &amp; &amp; &amp; &amp; 1 \cr
  \end{pmatrix}
\]
\[
  \nabla F_3 (\vec{x}_2) =
  \begin{pmatrix}
    1 &amp; &amp; &amp; &amp; \cr
    &amp; 1 &amp; &amp; &amp; \cr
    &amp; &amp; 1 &amp; &amp; \cr
    &amp; &amp; &amp; 1 &amp; \cr
    &amp; &amp; \partial_{R} h(y, x) &amp; \partial_{L} h(y, x) &amp; 0 \cr
  \end{pmatrix}
\]
where <span class="kdmath">$\partial_{R} g$</span> is the partial derivative of <span class="kdmath">$g$</span> in the right argument and so on.
Observe that the Jacobians are sparse due to each of the <span class="kdmath">$F_{i}$</span>’s only changing one variable.
We now calculate the vectors <span class="kdmath">$\vec{dx}_i$</span>.
We use the notation <span class="kdmath">$\vec{dx}_i[a]$</span>, <span class="kdmath">$\vec{dx}_i[b]$</span>, <span class="kdmath">$\vec{dx}_i[x]$</span>, <span class="kdmath">$\vec{dx}_i[y]$</span>, and <span class="kdmath">$\vec{dx}_i[z]$</span> for the first through fifth components respectively.
Pairing each vector with the matching line of our original program, we get
\[
  \begin{aligned}
    x &amp;= f(a) &amp; \quad
    \vec{dx}_1 &amp;= {\color{gray}(\vec{dx}_0[a], \vec{dx}_0[b], {\color{black}\partial f(a) \cdot \vec{dx}_0[a]}, \vec{dx}_0[y], \vec{dx}_0[z])} \cr
    y &amp;= g(x, b) &amp; \quad
    \vec{dx}_2 &amp;= {\color{gray}(\vec{dx}_1[a], \vec{dx}_1[b], \vec{dx}_1[x], {\color{black}\partial_R g(x, b) \cdot \vec{dx}_1[b] + \partial_L g(x, b) \cdot \vec{dx}_1[x]}, \vec{dx}_1[z])} \cr
    z &amp;= h(y, x) &amp; \quad
    \vec{dx}_3 &amp;= {\color{gray}(\vec{dx}_2[a], \vec{dx}_2[b], \vec{dx}_2[x], \vec{dx}_2[y], {\color{black}\partial_R h(y, x) \cdot \vec{dx}_2[x] + \partial_L h(y, x) \cdot \vec{dx}_2[y]})}.
  \end{aligned}
\]
Observe that <span class="kdmath">$\vec{dx}_3[x] = \vec{dx}_2[x] = \vec{dx}_1[x]$</span> because the <span class="kdmath">$x$</span> components of the <span class="kdmath">$\vec{dx}_i$</span>’s are only changed when <span class="kdmath">$x$</span> is assigned to.
Thus, we do not need to define a vector <span class="kdmath">$\vec{dx}_i$</span> at each step; it is sufficient to define one new scalar variable per line.
We can therefore rewrite the above as
\[
  \begin{aligned}
    x &amp;= f(a) &amp; \quad
    dx &amp;= \partial f(a) \cdot da \cr
    y &amp;= g(x, b) &amp; \quad
    dy &amp;= \partial_R g(x, b) \cdot db + \partial_L g(x, b) \cdot dx \cr
    z &amp;= h(y, x) &amp; \quad
    dz &amp;= \partial_R h(y, x) \cdot dx + \partial_L h(y, x) \cdot dy
  \end{aligned}
\]
which exactly captures the forward mode algorithm.
Namely, each line is paired with a derivative calculation using the partial derivatives, i.e. <span class="kdmath">$y = f(x_1, x_2, \ldots, x_n)$</span> is paired with
\[
  dy = \sum^{n}_{i = 1} \partial_i f (x_1, x_2, \ldots, x_n) \cdot dx_i
\]
for a fresh variable <span class="kdmath">$dy$</span>.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
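<p>As a minimal sketch of this rule, assuming the illustrative choices <span class="kdmath">$f(a) = a^2$</span>, <span class="kdmath">$g(x, b) = x \cdot b$</span>, and <span class="kdmath">$h(y, x) = y + x$</span> (so that <span class="kdmath">$z = a^2 b + a^2$</span>), each assignment is paired with its derivative line:</p>

```python
# Forward-mode AD for the running program, with the illustrative
# (assumed) choices f(a) = a^2, g(x, b) = x*b, h(y, x) = y + x.
# Seeding (da, db) = (1, 0) computes dz/da; (0, 1) computes dz/db.

def forward(a, b, da, db):
    x = a * a
    dx = (2 * a) * da         # dx = ∂f(a) · da
    y = x * b
    dy = x * db + b * dx      # dy = ∂_R g(x, b) · db + ∂_L g(x, b) · dx
    z = y + x
    dz = 1.0 * dx + 1.0 * dy  # dz = ∂_R h(y, x) · dx + ∂_L h(y, x) · dy
    return z, dz

# Analytically z = a^2*b + a^2, so dz/da = 2ab + 2a and dz/db = a^2.
z, dz_da = forward(3.0, 2.0, 1.0, 0.0)  # z = 27.0, dz_da = 18.0
_, dz_db = forward(3.0, 2.0, 0.0, 1.0)  # dz_db = 9.0
```

<p>Note that one pass propagates a single input seed; obtaining all <span class="kdmath">$n$</span> input derivatives this way takes <span class="kdmath">$n$</span> passes.</p>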

<h2 class="section" id="reverse-mode"><span class="numbering" id="5">5</span>Reverse mode</h2>

<p>For reverse mode, we observe that the matrix product can be transformed by transposition
\[
  \nabla (F_3 \circ F_2 \circ F_1)(\vec{x}_0)^\intercal = \nabla F_1 (\vec{x}_0)^\intercal \times \nabla F_2(\vec{x}_1)^\intercal \times \nabla F_3 (\vec{x}_2)^\intercal
\]
and that this <em>reverses</em> the order of matrix multiplication.
We can again calculate right-to-left,
\[
  \begin{aligned}
    X_3 &amp;= \nabla F_3(\vec{x}_2)^\intercal \cr
    X_2 &amp;= \nabla F_2(\vec{x}_1)^\intercal \times X_3 \cr
    X_1 &amp;= \nabla F_1(\vec{x}_0)^\intercal \times X_2
  \end{aligned}
\]
and similarly opt for multiplying on the right by a vector <span class="kdmath">$\vec{\delta x}_4$</span>
\[
  \nabla (F_3 \circ F_2 \circ F_1)(\vec{x}_0)^\intercal \times \vec{\delta x}_4 = \nabla F_1 (\vec{x}_0)^\intercal \times \nabla F_2(\vec{x}_1)^\intercal \times \nabla F_3 (\vec{x}_2)^\intercal \times \vec{\delta x}_4
\]
and thus we can define a sequence of intermediate vectors
\[
  \begin{aligned}
    \vec{\delta x}_3 &amp;= \nabla F_3(\vec{x}_2)^\intercal \times \vec{\delta x}_4 \cr
    \vec{\delta x}_2 &amp;= \nabla F_2(\vec{x}_1)^\intercal \times \vec{\delta x}_3 \cr
    \vec{\delta x}_1 &amp;= \nabla F_1(\vec{x}_0)^\intercal \times \vec{\delta x}_2.
  \end{aligned}
\]
The transposes of the Jacobians
\[
  \nabla F_1 (\vec{x}_0)^\intercal=
  \begin{pmatrix}
    1 &amp; &amp; \partial f(a) &amp; &amp; \cr
    &amp; 1 &amp; &amp; &amp; \cr
    &amp; &amp; 0 &amp; &amp; \cr
    &amp; &amp; &amp; 1 &amp; \cr
    &amp; &amp; &amp; &amp; 1 \cr
  \end{pmatrix}
\]
\[
  \nabla F_2 (\vec{x}_1)^\intercal =
  \begin{pmatrix}
    1 &amp; &amp; &amp; &amp; \cr
    &amp; 1 &amp; &amp; \partial_{R} g(x, b) &amp; \cr
    &amp; &amp; 1 &amp; \partial_{L} g(x, b) &amp; \cr
    &amp; &amp; &amp; 0 &amp; \cr
    &amp; &amp; &amp; &amp; 1 \cr
  \end{pmatrix}
\]
\[
  \nabla F_3 (\vec{x}_2)^\intercal =
  \begin{pmatrix}
    1 &amp; &amp; &amp; &amp; \cr
    &amp; 1 &amp; &amp; &amp; \cr
    &amp; &amp; 1 &amp; &amp; \partial_{R} h(y, x) \cr
    &amp; &amp; &amp; 1 &amp; \partial_{L} h(y, x) \cr
    &amp; &amp; &amp; &amp; 0 \cr
  \end{pmatrix}
\]
are also sparse.
Let <span class="kdmath">$\vec{\delta x}_i[a]$</span>, <span class="kdmath">$\vec{\delta x}_i[b]$</span>, <span class="kdmath">$\vec{\delta x}_i[x]$</span>, <span class="kdmath">$\vec{\delta x}_i[y]$</span>, and <span class="kdmath">$\vec{\delta x}_i[z]$</span> denote the first through fifth components of <span class="kdmath">$\vec{\delta x}_i$</span> respectively.
Calculating with components, we see
\[
  \begin{aligned}
    \vec{\delta x}_3 &amp;= {\color{gray}(\vec{\delta x}_4[a], \vec{\delta x}_4[b], {\color{black}\vec{\delta x}_4[x] + \partial_R h(y, x) \cdot  \vec{\delta x}_4[z]}, {\color{black}\vec{\delta x}_4[y] + \partial_L h(y, x) \cdot  \vec{\delta x}_4[z]}, 0)} \cr
    \vec{\delta x}_2 &amp;= {\color{gray}(\vec{\delta x}_3[a], {\color{black}\vec{\delta x}_3[b] + \partial_R g(x, b) \cdot \vec{\delta x}_3[y]}, {\color{black}\vec{\delta x}_3[x] + \partial_L g(x, b) \cdot \vec{\delta x}_3[y]}, 0, \vec{\delta x}_3[z])} \cr
    \vec{\delta x}_1 &amp;= {\color{gray}({\color{black}\vec{\delta x}_2[a] + \partial f(a) \cdot \vec{\delta x}_2[x]}, \vec{\delta x}_2[b], 0, \vec{\delta x}_2[y], \vec{\delta x}_2[z])} \cr
  \end{aligned}
\]
and note that each line accumulates derivatives into the arguments of the function used based on the resulting variable.
For example, <span class="kdmath">$x = f(a)$</span> adds <span class="kdmath">$\partial f(a) \cdot \vec{\delta x}_2[x]$</span> to <span class="kdmath">$\vec{\delta x}_2[a]$</span>.
We can use mutable variables <span class="kdmath">$\delta a$</span>, <span class="kdmath">$\delta b$</span>, <span class="kdmath">$\delta x$</span>, and <span class="kdmath">$\delta y$</span> initialized to <span class="kdmath">$0$</span>, together with <span class="kdmath">$\delta z$</span> holding the seed <span class="kdmath">$\vec{\delta x}_4[z]$</span>, to perform the above calculation
\[
  \begin{aligned}
    x &amp;= f(a) \cr
    y &amp;= g(x, b) \cr
    z &amp;= h(y, x) \cr
    \delta y &amp;\mathrel{+}= \partial_L h(y, x) \cdot  \delta z, \quad \delta x \mathrel{+}= \partial_R h(y, x) \cdot  \delta z \cr
    \delta x &amp;\mathrel{+}= \partial_L g(x, b) \cdot \delta y, \quad \delta b \mathrel{+}= \partial_R g(x, b) \cdot \delta y \cr
    \delta a &amp;\mathrel{+}= \partial f(a) \cdot \delta x
  \end{aligned}
\]
which is exactly reverse mode AD, modulo zeroing out mutable variables.
Namely, each line has a corresponding stateful derivative update which accumulates into the mutable derivative associated with its arguments, i.e. <span class="kdmath">$y = f(x_1, x_2, \ldots, x_n)$</span> is paired with
\[
  \delta x_1 \mathrel{+}= \partial_1 f(x_1, \ldots, x_n) \cdot \delta y,\, \ldots,\, \delta x_n \mathrel{+}= \partial_n f(x_1, \ldots, x_n) \cdot \delta y
\]
in the reverse order of the original program.</p>
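<p>A minimal sketch, again assuming the illustrative choices <span class="kdmath">$f(a) = a^2$</span>, <span class="kdmath">$g(x, b) = x \cdot b$</span>, and <span class="kdmath">$h(y, x) = y + x$</span>:</p>

```python
# Reverse-mode AD for the running program, with the illustrative
# (assumed) choices f(a) = a^2, g(x, b) = x*b, h(y, x) = y + x.
# A single reverse pass computes the derivatives of z with respect
# to *all* inputs at once, given a seed delta_z.

def reverse(a, b, delta_z=1.0):
    # Forward pass: run the original straight-line program,
    # keeping the intermediates needed by the partial derivatives.
    x = a * a
    y = x * b
    z = y + x
    # Reverse pass: mutable adjoints initialized to 0, updated
    # in the reverse order of the original program.
    delta_a = delta_b = delta_x = delta_y = 0.0
    delta_y += 1.0 * delta_z      # ∂_L h(y, x) = 1
    delta_x += 1.0 * delta_z      # ∂_R h(y, x) = 1
    delta_x += b * delta_y        # ∂_L g(x, b) = b
    delta_b += x * delta_y        # ∂_R g(x, b) = x
    delta_a += (2 * a) * delta_x  # ∂f(a) = 2a
    return z, delta_a, delta_b

z, dz_da, dz_db = reverse(3.0, 2.0)  # 27.0, 18.0, 9.0
```

<p>One pass yields both <span class="kdmath">$\partial z / \partial a$</span> and <span class="kdmath">$\partial z / \partial b$</span>, which is why reverse mode is preferred for functions with many inputs and few outputs.</p>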

<h2 class="section" id="conclusion"><span class="numbering" id="6">6</span>Conclusion</h2>

<p>We have seen how to derive forward and reverse mode automatic differentiation for pure, straight-line programs.
We achieved this by:</p>

<ol>
  <li>Viewing our programs as state-transformers on all of the variables.</li>
  <li>Applying the multivariate chain rule, and in the case of reverse mode transposing the result.</li>
  <li>Pre-multiplying by a vector to get a sequence of intermediate vectors.</li>
  <li>Analyzing the sparsity of the Jacobians to determine the structure of each vector.</li>
  <li>Using this analysis to never create the vectors or Jacobians to begin with.</li>
</ol>

<p>If you want to see how to implement forward and reverse mode AD (and many more modes!) for a more general-purpose language, feel free to look at my <a href="/papers/#theses">thesis</a>.</p>

<h3 id="references">References</h3>

<ol class="bibliography"><li><span id="pearlmutter_reverse-mode_2008">Pearlmutter, B. A., &amp; Siskind, J. M. (2008). Reverse-mode AD in a Functional Framework: Lambda the Ultimate Backpropagator. <i>ACM Trans. Program. Lang. Syst.</i>, <i>30</i>(2), 7:1–7:36. <a href="https://doi.org/10.1145/1330017.1330018">https://doi.org/10.1145/1330017.1330018</a></span></li>
<li><span id="wengert_simple_1964">Wengert, R. E. (1964). A simple automatic derivative evaluation program. <i>Communications of the ACM</i>, <i>7</i>(8), 463–464. <a href="https://doi.org/10.1145/355586.364791">https://doi.org/10.1145/355586.364791</a></span></li>
<li><span id="speelpenning_compiling_1980">Speelpenning, B. (1980). <i>Compiling fast partial derivatives of functions given by algorithms</i> [Ph.D.]. University of Illinois at Urbana-Champaign.</span></li>
<li><span id="griewank_evaluating_2008">Griewank, A., &amp; Walther, A. (2008). <i>Evaluating Derivatives</i>. Society for Industrial and Applied Mathematics. <a href="https://doi.org/10.1137/1.9780898717761">https://doi.org/10.1137/1.9780898717761</a></span></li></ol>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Forward mode AD can also be viewed as arithmetic in the ring of truncated Taylor series <a class="citation" href="#griewank_evaluating_2008">(Griewank &amp; Walther, 2008, Chapter 13)</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Jesse Sigal</name><email>firstname.lastname(at)ed.ac.uk</email></author><category term="theory" /><category term="automatic differentiation" /><category term="derivatives" /><summary type="html"><![CDATA[We will derive forward and reverse mode automatic differentiation (AD) for pure, straight-line programs by example.]]></summary></entry><entry><title type="html">The local state monad has rank (WIP)</title><link href="https://www.jsigal.com/theory/local-state-has-rank/" rel="alternate" type="text/html" title="The local state monad has rank (WIP)" /><published>2024-08-05T00:00:00+00:00</published><updated>2024-11-07T15:35:33+00:00</updated><id>https://www.jsigal.com/theory/local-state-has-rank</id><content type="html" xml:base="https://www.jsigal.com/theory/local-state-has-rank/"><![CDATA[<p>In this post we will show that the local state monad of <a class="citation" href="#plotkin_notions_2002">(Plotkin &amp; Power, 2002)</a> (which was suggested by Peter O’Hearn) has rank.
<!--excerpt-->
In other words, for a regular cardinal <span class="kdmath">$\lambda$</span> depending on the choice of values used for the state, the local state monad preserves <span class="kdmath">$\lambda$</span>-filtered colimits.</p>


<h2 class="section" id="background"><span class="numbering" id="1">1</span>Background</h2>

<p><em>Toggle to show all details</em>:
        <label class="toggle-details-switch">
          <input id="toggle-details-1" type="checkbox" class="collapsed" />
          <span class="toggle-details-slider round"></span>
        </label></p>

<p>Let us first recall the necessary background details.
Some details are hidden by default, and the heading can be clicked to open them, or the above switch can be used to reveal them all.</p>

<details class="details-in-1"><summary class="details-summary">Regular cardinals and filtered colimits</summary><p />
<p>Regular cardinals generalize the statement “a finite union of finite sets is finite”.</p>

<div class="definition block" id="regular-cardinal">
  <p><span class="block-label"><span class="type">definition</span><span class="numbering" id="1.1">1.1</span><span class="desc">Regular cardinal</span></span>
<a class="citation" href="#borceux_handbook_1994">(Borceux, 1994, Page I.268)</a>
An infinite cardinal <span class="kdmath">$\lambda$</span> is <em>regular</em> when, for arbitrary sets <span class="kdmath">$J$</span> and <span class="kdmath">$\{ X_j\}_{j \in J}$</span>, it satisfies
\[
\left( \left|J\right| &lt; \lambda \, \wedge \, \forall j \in J, \left|X_j\right| &lt; \lambda \right)
\;\Rightarrow\;
\left| \bigcup_{j \in J} X_j \right| &lt; \lambda
\]
i.e. a union of fewer than <span class="kdmath">$\lambda$</span> sets, each of size less than <span class="kdmath">$\lambda$</span>, has size less than <span class="kdmath">$\lambda$</span>.</p>
</div>

<p>For a filtered category, every finite diagram over it has a cocone.
For a <span class="kdmath">$\lambda$</span>-filtered category, we want every diagram of size less than <span class="kdmath">$\lambda$</span> over it to have a cocone.
This leads to the following definition.</p>

<div class="definition block" id="$$-lambda$$-filtered-category">
  <p><span class="block-label"><span class="type">definition</span><span class="numbering" id="1.2">1.2</span><span class="desc"><span class="kdmath">$\lambda$</span>-filtered category</span></span>
<a class="citation" href="#borceux_handbook_1994">(Borceux, 1994, Page I.268)</a>
Let <span class="kdmath">$\lambda$</span> be a regular cardinal.
A category <span class="kdmath">$\cat{C}$</span> is <em><span class="kdmath">$\lambda$</span>-filtered</em> when</p>
  <ol>
    <li>there exists at least one object in <span class="kdmath">$\cat{C}$</span>,</li>
    <li>given a set <span class="kdmath">$J$</span> with <span class="kdmath">$\left\vert J \right\vert &lt; \lambda$</span> and a family <span class="kdmath">$\left( C_j \in \cat{C} \right)_{j \in J}$</span> of objects of <span class="kdmath">$\cat{C}$</span>, there exists an object <span class="kdmath">$C \in \cat{C}$</span> and morphisms <span class="kdmath">$f_j \colon C_j \to C$</span>,</li>
    <li>given a set <span class="kdmath">$J$</span> with <span class="kdmath">$\left\vert J \right\vert &lt; \lambda$</span> and a family <span class="kdmath">$\left(f_j \colon C \to C' \, \right)_{j \in J}$</span> of parallel morphisms in <span class="kdmath">$\cat{C}$</span>, there exists an object <span class="kdmath">$C'' \in \cat{C}$</span> and a morphism <span class="kdmath">$f \colon C' \to C''$</span> such that <span class="kdmath">$f \circ f_{j_1} = f \circ f_{j_2}$</span> for all <span class="kdmath">$j_1, j_2 \in J$</span>.</li>
  </ol>
</div>

<p>We then have the desired property.</p>

<div class="lemma block" id="$$-lambda$$-filtered-cocone">
  <p><span class="block-label"><span class="type">lemma</span><span class="numbering" id="1.3">1.3</span><span class="desc"><span class="kdmath">$\lambda$</span>-filtered cocone</span></span>
<a class="citation" href="#borceux_handbook_1994">(Borceux, 1994, Page I.269)</a>
Let <span class="kdmath">$\lambda$</span> be a regular cardinal and <span class="kdmath">$\cat{C}$</span> a <span class="kdmath">$\lambda$</span>-filtered category.
For every category <span class="kdmath">$\cat{D}$</span> whose set of arrows has cardinality less than <span class="kdmath">$\lambda$</span> and every functor <span class="kdmath">$F \colon \cat{D} \to \cat{C}$</span>, there exists a cocone on <span class="kdmath">$F$</span>.</p>
</div>

<p>Recall that finite limits commute with filtered colimits in <span class="kdmath">$\Set$</span>.
Thus, <span class="kdmath">$\lambda$</span>-filtered colimits and <span class="kdmath">$\lambda$</span>-limits are defined to achieve an analogous theorem.</p>

<div class="definition block" id="$$-lambda$$-filtered-colimits-and-$$-lambda$$-limits">
  <p><span class="block-label"><span class="type">definition</span><span class="numbering" id="1.4">1.4</span><span class="desc"><span class="kdmath">$\lambda$</span>-filtered colimits and <span class="kdmath">$\lambda$</span>-limits</span></span>
<a class="citation" href="#borceux_handbook_1994">(Borceux, 1994, Page I.268)</a>
Let <span class="kdmath">$\lambda$</span> be a regular cardinal.</p>
  <ol>
    <li>By a <em><span class="kdmath">$\lambda$</span>-filtered colimit</em> in a category <span class="kdmath">$\cat{C}$</span> we mean the colimit of a functor <span class="kdmath">$F \colon \cat{D} \to \cat{C}$</span> where the category <span class="kdmath">$\cat{D}$</span> is <span class="kdmath">$\lambda$</span>-filtered.</li>
    <li>By a <em><span class="kdmath">$\lambda$</span>-limit</em> in a category <span class="kdmath">$\cat{C}$</span>, we mean the limit of a functor <span class="kdmath">$F \colon \cat{D} \to \cat{C}$</span> where <span class="kdmath">$\cat{D}$</span> is a small category whose set of arrows has cardinality less than <span class="kdmath">$\lambda$</span>.</li>
  </ol>
</div>

<div class="theorem block" id="$$-lambda$$-filtered-colimits-and-$$-lambda$$-limits-commute-in-$$-set$$">
  <p><span class="block-label"><span class="type">theorem</span><span class="numbering" id="1.5">1.5</span><span class="desc"><span class="kdmath">$\lambda$</span>-filtered colimits and <span class="kdmath">$\lambda$</span>-limits commute in <span class="kdmath">$\Set$</span></span></span>
<a class="citation" href="#borceux_handbook_1994">(Borceux, 1994, Page I.269)</a>
Let <span class="kdmath">$\lambda$</span> be a regular cardinal.
In <span class="kdmath">$\Set$</span>, <span class="kdmath">$\lambda$</span>-limits commute with <span class="kdmath">$\lambda$</span>-filtered colimits.</p>
</div>
</details>
<p />

<div class="definition block" id="rank-of-a-functor">
  <p><span class="block-label"><span class="type">definition</span><span class="numbering" id="1.6">1.6</span><span class="desc">Rank of a functor</span></span>
<a class="citation" href="#borceux_handbook_1994-1">(Borceux, 1994, Page II.272)</a>
A functor <span class="kdmath">$F \colon \cat{C} \to \cat{D}$</span> has <em>rank <span class="kdmath">$\lambda$</span></em>, for some regular cardinal <span class="kdmath">$\lambda$</span>, when <span class="kdmath">$F$</span> preserves <span class="kdmath">$\lambda$</span>-filtered colimits.
It has <em>rank</em> when it has rank <span class="kdmath">$\lambda$</span> for some regular cardinal <span class="kdmath">$\lambda$</span>.</p>
</div>

<div class="definition block" id="finite-sets-and-injections">
  <p><span class="block-label"><span class="type">definition</span><span class="numbering" id="1.7">1.7</span><span class="desc">Finite sets and injections</span></span>
Let <span class="kdmath">$I$</span> denote the category whose objects are the sets <span class="kdmath">$n \defeq \{ 1, \ldots, n \}$</span> for <span class="kdmath">$n \in \N$</span> (so that <span class="kdmath">$0 = \emptyset$</span>) and whose maps are injections.</p>
</div>

<details class="details-in-1"><summary class="details-summary">Under category</summary><p />
<div class="definition block" id="under-category">
  <p><span class="block-label"><span class="type">definition</span><span class="numbering" id="1.8">1.8</span><span class="desc">Under category</span></span>
Let <span class="kdmath">$\cat{C}$</span> be a category and <span class="kdmath">$x \in \cat{C}$</span>.
Then the <em>under category</em> <span class="kdmath">$x / \cat{C}$</span> has as objects the maps <span class="kdmath">$f \colon x \to y$</span> in <span class="kdmath">$\cat{C}$</span>, and a morphism <span class="kdmath">$g \colon (f_1 \colon x \to y_1) \to (f_2 \colon x \to y_2)$</span> is given by a map <span class="kdmath">$g \colon y_1 \to y_2$</span> in <span class="kdmath">$\cat{C}$</span> such that</p>

  <p><span class="antex display"><object data-antex-job-hash="8d7f814448bd935d92369ef7e40a98b3"></object></span>
commutes.</p>
</div>
</details>
<p />

<details class="details-in-1"><summary class="details-summary">Distributive law of monads</summary><p />
<div class="definition block" id="distributive-law">
  <p><span class="block-label"><span class="type">definition</span><span class="numbering" id="1.9">1.9</span><span class="desc">Distributive law</span></span>
<a class="citation" href="#beck_distributive_1969">(Beck, 1969)</a>
Let <span class="kdmath">$(T, \eta^T, \mu^T)$</span> and <span class="kdmath">$(S, \eta^S, \mu^S)$</span> be monads over the same category.
A <em>distributive law</em> of <span class="kdmath">$S$</span> over <span class="kdmath">$T$</span> is a natural transformation <span class="kdmath">$\theta \colon TS \to ST$</span> such that the following diagrams commute:</p>

  <p><span class="antex display"><object data-antex-job-hash="933c06fefaa67922e9bf85beaf17d96a"></object></span></p>

  <p><span class="antex display"><object data-antex-job-hash="28f625513cd6c94ee5a498571cf52ddf"></object></span></p>

  <p><span class="antex display"><object data-antex-job-hash="2f6bc43ffd44f06034edacb1211d5edb"></object></span></p>
</div>

<div class="proposition block" id="composed-monad-from-distributive-law">
  <p><span class="block-label"><span class="type">proposition</span><span class="numbering" id="1.10">1.10</span><span class="desc">Composed monad from distributive law</span></span>
<a class="citation" href="#beck_distributive_1969">(Beck, 1969)</a>
Let <span class="kdmath">$(T, \eta^T, \mu^T)$</span> and <span class="kdmath">$(S, \eta^S, \mu^S)$</span> be monads over the same category and <span class="kdmath">$\theta \colon TS \to ST$</span> a distributive law of <span class="kdmath">$S$</span> over <span class="kdmath">$T$</span>.
Then <span class="kdmath">$(ST, \eta^S \eta^T, \mu^S \mu^T \circ S \theta T)$</span> is a monad.
That is, the functor <span class="kdmath">$ST$</span> is a monad with unit and multiplication given by</p>

  <p><span class="antex display"><object data-antex-job-hash="e0d1ee5954a8a5ee646f56dce3246cad"></object></span></p>

  <p><span class="antex display"><object data-antex-job-hash="23a3261203b3aac9cfd7d824451365ff"></object></span>
respectively.</p>
</div>
</details>

<h2 class="section" id="original-definition"><span class="numbering" id="2">2</span>Original definition</h2>

<p>Let <span class="kdmath">$V \in \Set$</span> be a fixed set, and define the functor <span class="kdmath">$S \colon I^{\op} \to \Set$</span> by <span class="kdmath">$S(n) \defeq V^n$</span> on objects and for a morphism <span class="kdmath">$\rho \colon n \to m$</span> in <span class="kdmath">$I$</span>,
\[
\begin{aligned}
S(\rho) \colon S(m) &amp;\to S(n) \cr
(v_{1}, \ldots, v_{m}) &amp;\mapsto \left(v_{\rho(1)}, \ldots, v_{\rho(n)}\right)
\end{aligned}
\]
i.e. re-indexing by <span class="kdmath">$\rho$</span>.
Then the monad for local state of <a class="citation" href="#plotkin_notions_2002">(Plotkin &amp; Power, 2002)</a> on <span class="kdmath">$[I, \Set]$</span>, suggested to them by Peter O’Hearn, is</p>
<div class="labeled block" id="local-state-orig">
  <p><span class="block-label"><span class="type">labeled</span><span class="numbering" id="2.1">2.1</span></span>
\[
(TX)(n) \defeq S(n) \Rightarrow \int^{m \in (n / I)} S(m) \times X(m)
\]</p>
</div>
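<p>As a concrete check on the definition of <span class="kdmath">$S$</span>, the reindexing action can be modeled directly: an element of <span class="kdmath">$S(n) = V^n$</span> is an <span class="kdmath">$n$</span>-tuple, and a morphism <span class="kdmath">$\rho \colon n \to m$</span> acts by precomposition on indices. The sketch below uses illustrative names and 1-based indices to match the displayed formula:</p>

```python
# Model of the contravariant action S(rho) : S(m) -> S(n) defined above.
# An injection rho : n -> m is represented as a function on {1, ..., n}.

def S_of(rho, n):
    """Given rho : n -> m, return the reindexing map S(rho) : S(m) -> S(n)."""
    def reindex(v):
        # (v_1, ..., v_m) |-> (v_rho(1), ..., v_rho(n)); subtract 1 for
        # Python's 0-based tuples.
        return tuple(v[rho(i) - 1] for i in range(1, n + 1))
    return reindex
```

<p>For instance, the injection <span class="kdmath">$\rho \colon 2 \to 3$</span> with <span class="kdmath">$\rho(1) = 3$</span>, <span class="kdmath">$\rho(2) = 1$</span> sends <code>('a', 'b', 'c')</code> to <code>('c', 'a')</code>, and the identity injection acts as the identity on tuples.</p>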
<p>For the functorial actions of <span class="kdmath">$TX$</span>, let <span class="kdmath">$\rho \colon n \to p$</span> in <span class="kdmath">$I$</span>.
Then <span class="kdmath">$n \leq p$</span>, and so define <span class="kdmath">$q \defeq p - n$</span> so that <span class="kdmath">$\rho \colon n \to n + q$</span>.
Observe that <span class="kdmath">$S(n + q) \simeq S(n) \times S(q)$</span>.
Thus, a map of type <span class="kdmath">$(TX)(\rho) \colon (TX)(n) \to (TX)(n + q)$</span> is equivalent to a map of type
\[
\left(S(n) \Rightarrow \int^{m \in (n / I)} S(m) \times X(m)\right) \times S(n) \times S(q) \to \int^{m' \in \left((n + q) / I\right)} S(m') \times X(m').
\]
We can then use evaluation at <span class="kdmath">$S(n)$</span> and colimit preservation of <span class="kdmath">$- \times S(q)$</span>, followed by the inclusion of the <span class="kdmath">$m$</span>-th component of the domain coend into the <span class="kdmath">$(m + q)$</span>-th component of the codomain coend, to define the functorial action.
Note that the category over which we take the coend changes.
We will not recall the monad structure of <span class="kdmath">$T$</span> here.</p>

<p>To make our later proofs simpler and more modular, we will work with an alternative definition of <span class="kdmath">$T$</span>.
Namely, we will decompose <span class="kdmath">$T$</span> into two monads <span class="kdmath">$F$</span> and <span class="kdmath">$B$</span> with a distributive law <span class="kdmath">$\theta \colon BF \to FB$</span> such that their composition is <span class="kdmath">$T$</span>.</p>

<h2 class="section" id="modular-definition"><span class="numbering" id="3">3</span>Modular definition</h2>

<p>Let <span class="kdmath">$V \in \Set$</span> be a fixed set and define <span class="kdmath">$S \colon I^{\op} \to \Set$</span> as before. 
The distributive law we will use is from <a class="citation" href="#mellies_local_2014">(Melliès, 2014)</a>.
Their distributive law is between the <em>fiber monad</em> <span class="kdmath">$F$</span> and the <em>basis monad</em> <span class="kdmath">$B$</span>.</p>
<div class="definition block" id="fiber-monad">
  <p><span class="block-label"><span class="type">definition</span><span class="numbering" id="3.1">3.1</span><span class="desc">Fiber monad</span></span>
The <em>fiber monad</em> <span class="kdmath">$F$</span> on <span class="kdmath">$[I, \Set]$</span> is defined on <span class="kdmath">$X \colon I \to \Set$</span> by
\[
(FX)(n) \defeq S(n) \Rightarrow \left( S(n) \times X(n) \right)
\]
so that it is “fiberwise” the global state monad.</p>
</div>
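<p>“Fiberwise the global state monad” means that at each fixed <span class="kdmath">$n$</span>, <span class="kdmath">$(FX)(n)$</span> is the ordinary state monad with state space <span class="kdmath">$S(n) = V^n$</span>. A minimal sketch of one fiber (illustrative names; a computation is a function from a state tuple to a state–result pair):</p>

```python
# One fiber of the fiber monad F: the state monad with state space V^n,
# states represented as tuples.

def unit(x):
    """eta : X(n) -> (F X)(n); return x without touching the state."""
    return lambda s: (s, x)

def bind(m, k):
    """Sequencing within the fiber: run m, feed its result to k."""
    def composed(s):
        s1, x = m(s)
        return k(x)(s1)
    return composed

def get():
    """Read the whole state tuple."""
    return lambda s: (s, s)

def put(s_new):
    """Overwrite the state tuple."""
    return lambda s: (s_new, ())
```

<p>For example, <code>bind(get(), lambda s: unit(len(s)))</code> threads a tuple-valued state through a computation exactly as the global state monad does at a fixed arity.</p>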
<p>What is the functorial action of <span class="kdmath">$FX \colon I \to \Set$</span>?</p>

<p><a class="citation" href="#staton_completeness_2010">(Staton, 2010, Section 4.1)</a></p>

<p><span class="todo tag" data-text="—The rest of it" tabindex="0"><span>todo</span></span></p>

<h3 id="references">References</h3>

<ol class="bibliography"><li><span id="plotkin_notions_2002">Plotkin, G., &amp; Power, J. (2002). Notions of Computation Determine Monads. In M. Nielsen &amp; U. Engberg (Eds.), <i>Foundations of Software Science and Computation Structures</i> (pp. 342–356). Springer. <a href="https://doi.org/10.1007/3-540-45931-6_24">https://doi.org/10.1007/3-540-45931-6_24</a></span></li>
<li><span id="borceux_handbook_1994">Borceux, F. (1994). <i>Handbook of Categorical Algebra: Volume 1: Basic Category Theory</i> (Vol. 1). Cambridge University Press. <a href="https://doi.org/10.1017/CBO9780511525858">https://doi.org/10.1017/CBO9780511525858</a></span></li>
<li><span id="borceux_handbook_1994-1">Borceux, F. (1994). <i>Handbook of Categorical Algebra: Volume 2: Categories and Structures</i> (Vol. 2). Cambridge University Press. <a href="https://doi.org/10.1017/CBO9780511525865">https://doi.org/10.1017/CBO9780511525865</a></span></li>
<li><span id="beck_distributive_1969">Beck, J. (1969). Distributive laws. In H. Appelgate, M. Barr, J. Beck, F. W. Lawvere, F. E. J. Linton, E. Manes, M. Tierney, F. Ulmer, &amp; B. Eckmann (Eds.), <i>Seminar on Triples and Categorical Homology Theory</i> (pp. 119–140). Springer. <a href="https://doi.org/10.1007/BFb0083084">https://doi.org/10.1007/BFb0083084</a></span></li>
<li><span id="mellies_local_2014">Melliès, P.-A. (2014). Local States in String Diagrams. In G. Dowek (Ed.), <i>Rewriting and Typed Lambda Calculi</i> (pp. 334–348). Springer International Publishing. <a href="https://doi.org/10.1007/978-3-319-08918-8_23">https://doi.org/10.1007/978-3-319-08918-8_23</a></span></li>
<li><span id="staton_completeness_2010">Staton, S. (2010). Completeness for Algebraic Theories of Local State. In L. Ong (Ed.), <i>Foundations of Software Science and Computational Structures</i> (pp. 48–63). Springer. <a href="https://doi.org/10.1007/978-3-642-12032-9_5">https://doi.org/10.1007/978-3-642-12032-9_5</a></span></li></ol>]]></content><author><name>Jesse Sigal</name><email>firstname.lastname(at)ed.ac.uk</email></author><category term="theory" /><category term="monad" /><category term="local state" /><category term="category theory" /><summary type="html"><![CDATA[In this post we will show that the local state monad of (Plotkin &amp; Power, 2002) (which was suggested by Peter O’Hearn) has rank.]]></summary></entry></feed>