Consider the vanilla linear attention mechanism:

$$S_t = S_{t-1} + v_t k_t^T, \qquad o_t = S_t q_t,$$

where $v_t k_t^T$ is the outer product of $v_t$ and $k_t$.
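As a concrete reference, here is a minimal NumPy sketch of this recurrence; the function name, shapes, and variable names are illustrative assumptions, not tied to any particular library.

```python
import numpy as np

def linear_attention(queries, keys, values):
    """Vanilla linear attention: S_t = S_{t-1} + v_t k_t^T, o_t = S_t q_t.

    queries, keys have shape (T, d_k); values has shape (T, d_v).
    Returns the outputs o_1, ..., o_T stacked into a (T, d_v) array.
    """
    T, d_k = keys.shape
    d_v = values.shape[1]
    S = np.zeros((d_v, d_k))                  # recurrent state S_t
    outputs = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(values[t], keys[t])  # add the outer product v_t k_t^T
        outputs[t] = S @ queries[t]           # read out o_t = S_t q_t
    return outputs
```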
Unrolling the recurrence, we get:

$$o_t = \sum_{i=1}^{t} s_i v_i,$$

where $s_i$ is the inner product $k_i^T q_t$.
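A quick numerical check of this unrolled form against the recurrent sketch above (the helper name and the random test shapes are assumptions chosen for illustration):

```python
def linear_attention_unrolled(queries, keys, values):
    """Unrolled form: o_t = sum_{i=1}^{t} (k_i^T q_t) v_i."""
    T, d_v = values.shape
    outputs = np.zeros((T, d_v))
    for t in range(T):
        s = keys[: t + 1] @ queries[t]        # s_i = k_i^T q_t for i = 1, ..., t
        outputs[t] = s @ values[: t + 1]      # o_t = sum_i s_i v_i
    return outputs

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 4))          # T = 6, d_k = d_v = 4
assert np.allclose(linear_attention(q, k, v), linear_attention_unrolled(q, k, v))
```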
It turns out that we can obtain this recurrence by maximizing the following Lagrangian:

$$\mathcal{L}(\alpha) = \sum_{i=1}^{t} \alpha_i s_i - \frac{1}{2} \alpha^T I_t \alpha,$$

where $I_t$ is the $t \times t$ identity matrix, and $\alpha = (\alpha_1, \ldots, \alpha_t)$ is some vector of indeterminates.
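To spell out the short step this claim relies on: reading the output as $o_t = \sum_{i=1}^{t} \alpha_i^\star v_i$ with $\alpha^\star$ the maximizer of $\mathcal{L}$ (as the unrolled form above suggests), the stationarity condition recovers exactly the linear attention coefficients:

$$\frac{\partial \mathcal{L}}{\partial \alpha_i} = s_i - \alpha_i = 0 \quad\Longrightarrow\quad \alpha_i^\star = s_i, \qquad o_t = \sum_{i=1}^{t} \alpha_i^\star v_i = \sum_{i=1}^{t} s_i v_i.$$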
Question: The update mechanism for DeltaNet is:

$$S_t = S_{t-1} - \beta_t S_{t-1} k_t k_t^T + \beta_t v_t k_t^T, \qquad o_t = S_t q_t.$$

Find a Lagrangian that leads to the coefficients in front of $v_i$ in the expression for $o_t$ in the DeltaNet formulation.
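For reference while working on the question, here is a minimal NumPy sketch of the DeltaNet recurrence exactly as stated; the per-step rates `betas` are taken as a given input array, and the names are again just illustrative:

```python
def deltanet(queries, keys, values, betas):
    """DeltaNet: S_t = S_{t-1} - beta_t S_{t-1} k_t k_t^T + beta_t v_t k_t^T, o_t = S_t q_t."""
    T, d_k = keys.shape
    d_v = values.shape[1]
    S = np.zeros((d_v, d_k))
    outputs = np.zeros((T, d_v))
    for t in range(T):
        k_outer = np.outer(keys[t], keys[t])      # k_t k_t^T, shape (d_k, d_k)
        # Delta rule: move the value stored under k_t toward v_t at rate beta_t
        # (exact interpolation when k_t has unit norm).
        S = S - betas[t] * (S @ k_outer) + betas[t] * np.outer(values[t], keys[t])
        outputs[t] = S @ queries[t]               # o_t = S_t q_t
    return outputs
```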