Innovative strides in memory utilization have emerged from Sakana AI, a startup based in Tokyo. Their latest development promises to revolutionize how businesses leverage large language models (LLMs) and other Transformer-based models, significantly reducing the costs associated with application development.
The new methodology, known as universal transformer memory, incorporates specialized neural networks designed to optimize the retention of valuable information while eliminating superfluous details within LLMs’ contexts.
Enhancing Transformer Memory
The efficacy of Transformer models, which serve as the foundation of LLMs, is closely linked to their “context window”: the content they receive as input from users. The context window can be likened to the model’s working memory, and how its contents are arranged strongly influences performance; the practice of “prompt engineering” emerged directly from this dependence.
Modern models can accommodate very long context windows, from hundreds of thousands to millions of tokens, the numerical units that represent words and other elements of the model’s input and output.
Lengthy prompts let users pack in more information, but they also increase computational costs and slow the model down. Optimizing prompts to remove non-essential tokens while preserving the crucial details can therefore reduce both cost and processing time.
Existing prompt optimization methods can be resource-heavy or necessitate manual experimentation to refine input sizes effectively.
Introducing Neural Attention Memory Models
The universal transformer memory employs neural attention memory models (NAMMs), which are streamlined neural networks that determine whether to “remember” or “forget” each token within the LLM’s memory structure.
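To make the mechanism concrete, the sketch below shows the general shape of such a module: a small network that assigns each cached token a score and keeps only the tokens above a threshold. This is a minimal, hypothetical illustration, not Sakana AI’s released code; the feature dimensions and the zero threshold are assumptions.

```python
# Minimal sketch, not Sakana AI's implementation: a small network that
# scores each token in the memory and keeps only those above a threshold.
import torch
import torch.nn as nn

class TinyMemoryScorer(nn.Module):
    """Hypothetical stand-in for a NAMM: one score per cached token."""
    def __init__(self, feature_dim: int, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, token_features: torch.Tensor) -> torch.Tensor:
        # token_features: (num_tokens, feature_dim) -> (num_tokens,) scores
        return self.net(token_features).squeeze(-1)

scorer = TinyMemoryScorer(feature_dim=16)
features = torch.randn(1024, 16)        # placeholder per-token features
keep_mask = scorer(features) > 0.0      # True = "remember", False = "forget"
print(f"kept {int(keep_mask.sum())} of {keep_mask.numel()} tokens")
```

In the real system, the per-token features come from the model’s attention values, as described below.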
According to the researchers, “This new capability allows Transformers to discard unhelpful or redundant details, focusing instead on the most critical information, which is vital for tasks that require extended contextual reasoning.”
NAMMs are trained independently from the LLMs and integrate seamlessly with the pre-trained models during the inference stage, ensuring ease of deployment. However, they require access to the internal activations of the model, making them suitable only for open-source architectures.
Unlike conventional gradient-based training, NAMMs are optimized with evolutionary algorithms, which iteratively mutate candidate models and select the best performers. Evolution is essential here because the NAMM’s objective, the discrete decision to retain or discard each token, is non-differentiable and cannot be learned through backpropagation.
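As a hedged illustration of the evolutionary recipe only (the paper’s actual training setup, genome, and benchmarks are not reproduced here), the toy loop below improves a parameter vector through Gaussian mutation and selection against a black-box fitness function, the kind of optimization that works when gradients are unavailable.

```python
# Toy sketch of mutation-and-selection over a non-differentiable objective.
# The genome, population sizes, and fitness function are all assumptions.
import numpy as np

rng = np.random.default_rng(0)

def fitness(genome: np.ndarray) -> float:
    # Placeholder: in practice this would run the frozen LLM with the pruned
    # memory on downstream tasks and return the resulting score.
    return -float(np.sum((genome - 0.5) ** 2))

population = [rng.normal(size=8) for _ in range(16)]
for generation in range(50):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:4]                                   # selection
    children = [p + 0.1 * rng.normal(size=p.shape)         # Gaussian mutation
                for p in parents for _ in range(3)]
    population = parents + children

best = max(population, key=fitness)
print("best fitness:", fitness(best))
```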
NAMMs operate on the attention layers of LLMs, using the attention values to judge how important each token in the context window is. Because they rely only on these attention values, a trained NAMM can be applied to other models without further adjustment; for instance, a NAMM trained purely on text can transfer to vision or multi-modal models.
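The snippet below sketches how attention statistics might drive pruning of the key-value cache. It is an illustration of the general idea under an assumed importance feature and threshold, not the published NAMM architecture.

```python
# Illustrative only: prune cached key/value entries for tokens that recent
# queries rarely attend to. The importance feature and threshold are assumed.
import torch

def prune_kv_cache(keys, values, attn, threshold=0.2):
    # keys, values: (num_tokens, head_dim); attn: (num_queries, num_tokens)
    importance = attn.mean(dim=0)                      # avg attention per token
    keep = importance >= threshold * importance.max()  # keep well-attended tokens
    return keys[keep], values[keep], keep

keys = torch.randn(512, 64)
values = torch.randn(512, 64)
attn = torch.rand(32, 512).softmax(dim=-1)             # toy attention weights
pruned_keys, pruned_values, mask = prune_kv_cache(keys, values, attn)
print(f"cache reduced from {keys.shape[0]} to {pruned_keys.shape[0]} tokens")
```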
Implementing Universal Memory
To evaluate universal transformer memory in practice, the researchers trained a NAMM on top of an open-source model, Meta’s Llama 3-8B. Their findings indicate that NAMMs improve the Transformer’s performance on natural language and coding problems over very long sequences, while also cutting the model’s cache memory usage by up to 75% on these tasks.
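To put the cache figure in perspective, here is a back-of-the-envelope calculation. The model dimensions (32 layers, 8 key-value heads, head dimension 128, 16-bit values) and the 100,000-token context are assumed for illustration; they are not figures from the paper.

```python
# Rough arithmetic only: approximate KV-cache memory for a long context and
# the effect of dropping 75% of the cached tokens. Dimensions are assumed.
NUM_LAYERS = 32       # assumed transformer depth
NUM_KV_HEADS = 8      # assumed key/value heads (grouped-query attention)
HEAD_DIM = 128        # assumed per-head dimension
BYTES_PER_VALUE = 2   # 16-bit floats
CONTEXT_TOKENS = 100_000

bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
full_cache_gb = bytes_per_token * CONTEXT_TOKENS / 1e9
pruned_cache_gb = full_cache_gb * 0.25   # 75% of cached tokens discarded
print(f"full KV cache:   {full_cache_gb:.1f} GB")
print(f"after 75% prune: {pruned_cache_gb:.1f} GB")
```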
“Our benchmarks clearly demonstrate that NAMMs enhance the performance of the Llama 3-8B transformer,” the team reported. “Additionally, our memory systems provide ancillary benefits by reducing the context size at each model layer, even without direct optimization for memory efficiency.”
The experiments extended to the 70B version of Llama as well as Transformer models built for other modalities and tasks, such as Llava (computer vision) and the Decision Transformer (reinforcement learning). In these settings, too, NAMMs retained their advantages, filtering out non-essential tokens and improving both relevance and performance.
Adapting to Task Needs
An intriguing aspect of NAMMs is their ability to modify their operational behavior based on the task at hand.
For instance, on coding tasks the model discards contiguous chunks of tokens corresponding to comments and whitespace, elements that do not affect how the code executes. On natural language tasks, it drops tokens representing grammatical redundancies that do not change the meaning of the sequence.
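As a rough, hypothetical illustration of the coding case (not an analysis from the paper), the snippet below uses Python’s own tokenizer to separate comments and blank lines, content that does not affect execution, from the tokens that do:

```python
# Toy illustration: comments and blank lines can be forgotten without
# changing what the code does. Real NAMMs operate on LLM tokens and
# attention values, not on a language's syntax tokens.
import io
import tokenize

source = "x = 1  # set x\n\n\ny = x + 2  # add two\n"
droppable, kept = [], []
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type in (tokenize.COMMENT, tokenize.NL):
        droppable.append(tok.string)
    elif tok.type not in (tokenize.NEWLINE, tokenize.ENDMARKER):
        kept.append(tok.string)
print("droppable:", droppable)
print("kept:", kept)
```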
The researchers have released the code for creating NAMMs. Universal transformer memory promises valuable applications for enterprises that process vast quantities of tokens, where it can improve processing speed and reduce costs. The ability to reuse a single trained NAMM across multiple models and applications adds to its versatility in enterprise settings.
Looking ahead, the research team anticipates more sophisticated developments, including the integration of NAMMs in the training phases of LLMs to enhance their memory capabilities.
“This work marks just the beginning of exploring the potential within our innovative memory models, which we believe could pave the way for advancements in future transformer generations,” the researchers concluded.
Source
venturebeat.com