Title and Authors

TensorGPT: Efficient Compression of the Embedding Layer in LLMs
based on the Tensor-Train Decomposition

Background and Motivation

LLMs的大量参数使得其无法在轻量应用以及低端设备上发挥潜力。有一种关注于压缩embedding layer的解决方式，embedding layer占用了30%的参数量。因此对其进行有效的压缩可以减少space complexity和edge devices. embedding layer占据了整个模型参数的31.02\%.

为了实现这个目的，本文使用tensor-train decomposition来压缩embedding layer of LLM以便使用low-rank张量格式来存储large embeddings。这种格式称为TT-format或Matrix Product State.
在许多应用中token vocabulary在不断变化，本文考虑的是每个单独的token embedding（token embedding matrix中的每行）而不是整个token embedding matrix。得益于tensor networks的super-compression特性，本文对每个token embedding进行张量化和分解，然后构建一个高效的embedding存储格式来进行分布式计算。

Main Contributions

第一个使用TT decomposition和matrix product state来分解GPT系列模型
将每个token embedding 视为一个matrix product state，这种方法可以便于token embedding的插入或删除，也有助于分布式计算
实验表明参数数量减少了2.31倍

Theoretical Framework/Algorithm

token, token embeddings and embedding layer in LLMs

token and tokenization token表示文本的单元例如一个单词或者一个标点。tokenization将一段文本分解为若干个单元。GPT系列模型使用了Byte Pair Encoding BPEde tokenization方法。BPE将文本分解为不同长度的子词单元varying-length subword unit。

token embedding and embedding layer in LLMs “embedding”是指将离散的输入tokens转化为连续的向量表示。这种表示称为word embedding或者token embedding。在LLMs中，embedding layers根据sequential input tokens进行output token embeddings，该层将每个输入token映射到一个高维向量，这个向量捕获关于token的含义和上下文的语义和句法信息。embedding layer有一个大小为V的vocabulary$\{v\}$，对于每个token $v$，对应的embedding $x_v$的长度为$D$，此时weights of embedding layer表示为$W\in\mathbb{R}^{V\times D}$. 如果令牌嵌入维度D过大，则会存在过多的参数复杂性，导致嵌入层的高存储成本，以及此后整个语言模型的高存储成本。

嵌入权重矩阵W可以被看作查找表。基本嵌入生成找到与所有输入令牌相对应的令牌嵌入，并根据输入顺序对其进行堆叠。应该注意的是，在当前的LLM中，例如类似GPT和类似BERT的模型中，令牌的位置和掩码信息也被编码到嵌入中。在这些情况下，嵌入xv的令牌不仅仅是通过查找过程生成的。

tensor-train decomposition and matrix product state

Tensor-Train Decomposition最常见的形式是量子物理界引入的Matrix Product State（MPS或TT-MPS），它应用算法中描述的Tensor-Train Singular Value Decomposition（TT-SVD）算法来分解一个大阶-$N$ 张量 $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, into $N$ smaller $2$-nd or $3$-rd order core tensors, $\mathcal{G}^{(n)} \in \mathbb{R}^{ R_{n-1} \times I_n \times R_n }$ for $n=1, \dots, N$, such that

$\begin{equation}\label{ttcp_eq:tt_def} \begin{aligned} \mathcal{X} \approx \mathcal{G}^{(1)} \times_2^1 \mathcal{G}^{(2)} \times_3^1 \mathcal{G}^{(3)} \times_3^1 \cdots \times_3^1 \mathcal{G}^{(N)} \end{aligned} \end{equation}$

A $(2, 1)$-tensor contraction between two order-$2$ tensors, $\textbf{A} \in \mathbb{R}^{I_1 \times I_2}$ and $\textbf{B} \in \mathbb{R}^{J_1 \times J_2}$, where $I_2 = J_1$, is equivalent to a standard matrix multiplication, $\textbf{C} = \textbf{A} \times_2^1 \textbf{B} = \textbf{A} \textbf{B} \in \mathbb{R}^{I_1 \times J_2}$

The tensors $\mathcal{G}^{(1)}, \ldots, \mathcal{G}^{(N)}$ 被称为core tensors, while the set $\{ R_0, \dots, R_{N} \}$, where $R_0=R_N=1$, represents the TT-rank of the TT decomposition. By defining $\mathcal{G}^{(n)}_{:, i_n, :}$, $i_n = 1, \dots, I_N$ as the $i_n$-th horizontal slice of tensor $\mathcal{G}^{(n)}$, the MPS assumes the element-wise form as

$\begin{equation} x_{i_1, i_2, \dots, i_N} = \mathcal{G}^{(1)}_{:, i_1, :}\cdots\mathcal{G}^{(N)}_{:, i_N, :} \end{equation}$

MPS formatted token embedding

vocabulary发生变化时，如果分解的是整个embedding weight matrix， tensor decomposition就需要整个重新执行。此外，加载整个权重embedding matrix和分解都需要大量的内存和计算资源。本文不分解整个embedding weight matrix，而是分解每个token embedding。每个token embedding reshape为一个N阶张量，$\texttt{ten} \left( \textbf{x}_v \right) = \mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ such that $D = \prod_{k=1}^{N} I_k$, 使用MPS form as $\mathcal{X} \approx \mathcal{G}^{(1)} \times_3^1 \cdots \times_3^1 \mathcal{G}^{(N)}$.

不存储the entire token embedding $\text{x} \in \mathbb{R}^{D}$, 我们存储 corresponding MPS cores, $\mathcal{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, for $n=1,\ldots,N$.
优点如下：

lower storage cost
affordable computation (更利于单节点计算)

Experimental Design and Results

实验数据： general language understanding evalution (GLUE)和 microsoft rsearch paraphrase corpus的文本数据集，一共是11602个English 句子。使用BPE tokenization
language model : GPT-2, embedding weight matrix $W\in \mathbb{R}^{50257\times 768}$，其中 vacabulary的大小为50257， token embedding dimension为768.
$\texttt{ten2} \left( \textbf{x}_v \right) = \mathcal{X} \in \mathbb{R}^{2\times2\times\cdots\times2}$. 这里使用了zero-padding

upupwords

Paper-Analysis-TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition

Title and Authors

Background and Motivation

Main Contributions

Theoretical Framework/Algorithm

token, token embeddings and embedding layer in LLMs

tensor-train decomposition and matrix product state

MPS formatted token embedding

Experimental Design and Results

evaluation metrics

Comparative Analysis

Discussion and Limitations

Conclusion