Improving Memory Efficiency in Large Language Models: A Novel Approach to Visual KV Cache

Title: Efficient Memory Management for Large-Scale Multimodal Transformer Models in Vision-Language Tasks via Compression Techniques and Quantization Strategies
Authors: Zheng Chen, Yuxuan Liang, Xiaoyu Wang, Jian Sun
Institution(s): Microsoft AI Lab; University of Illinois at Urbana-Champaign
Keywords: Multimodal Deep Learning; Compression Techniques; Quantization Strategies; Memory Management
Abstract: Large-scale multimodal transformer models have shown remarkable performance on vision-language tasks but suffer from high memory consumption, which limits their deployment on resource-constrained devices. In this paper, we propose a comprehensive framework that addresses the memory issue by combining efficient compression techniques and quantization strategies for both the weights and activations of these models. Specifically, we employ tensor-decomposition methods such as low-rank approximation (LRA) and sparse matrix factorization (SMF), along with a knowledge-distillation-based weight-pruning approach, to compress model parameters. For activation memory reduction, we introduce a novel quantized dynamic-range estimation method combined with adaptive precision control for both floating-point and integer data types. Experimental results demonstrate significant reductions in memory usage while preserving accuracy on various vision-language benchmarks across multimodal transformer models such as ViLT, LXMERT, and UNITER.
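The abstract names two of the core ingredients: low-rank approximation of weight matrices and quantization driven by a dynamic-range estimate. The sketch below is an illustrative NumPy toy, not the authors' actual method: it applies a truncated SVD for the rank-k weight factorization and a simple abs-max range estimate for symmetric int8 quantization (the paper's adaptive precision control and SMF components are not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)  # toy weight matrix

# --- Low-rank approximation (LRA): keep the top-k singular components ---
def low_rank_approx(W, k):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Storing U_k, s_k, Vt_k costs k*(m + n + 1) floats instead of m*n.
    return (U[:, :k] * s[:k]) @ Vt[:k]

W_lr = low_rank_approx(W, k=64)  # rank-64 reconstruction of W

# --- Abs-max dynamic-range estimate + symmetric int8 quantization ---
def quantize_int8(x):
    scale = np.abs(x).max() / 127.0           # dynamic-range estimate
    q = np.round(x / scale).astype(np.int8)   # |x/scale| <= 127, no clipping needed
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

q, scale = quantize_int8(W_lr)
W_hat = dequantize_int8(q, scale)
max_err = np.abs(W_hat - W_lr).max()          # bounded by scale / 2
```

With these toy shapes, the rank-64 factors hold about 49k floats versus 131k for the dense matrix, and quantizing the factors to int8 would shrink storage a further 4x relative to float32; real gains depend on the rank and bit-width the framework selects per layer.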
Tags: AIMLab; Compression Techniques; Deep Learning; Efficient Memory Management; Image Processing; Machine Learning; Multimedia Computing; Natural Language Processing (NLP); Quantization Strategies; Vision-Language Tasks
