Oct 10, 2024 | 6 min read
No doubt by now, you have heard people talk about "context windows". Let's talk about why context windows matter, and where they're headed in the future.
You probably didn't know this
Context windows aren't actually a physical limitation of large language models (LLMs). In theory, even the smallest LLM you can think of could "read" the entire Library of Congress. However, it probably wouldn't understand much of what it read. So why do companies like OpenAI and Anthropic set specific context window limits? There are two main reasons:
- Longer text requires more compute power, which translates to higher costs.
- The LLM may not be trained to work effectively with very long texts.
Hopefully this changes your thinking about context windows. It's not just about capacity; it's about cost, efficiency, and understanding.
What matters about context windows
Understanding context windows is crucial if you're trying to use LLMs. Here's why:
The "missing middle"
Simply expanding context windows doesn't automatically lead to better performance. Researchers at Stanford call this problem the "missing middle". As context length increases, LLMs struggle to accurately recall information from the middle portions of the input. They show a clear bias towards information at the beginning (primacy effect) and end (recency effect) of the context. This U-shaped accuracy curve means that dumping more informational context into a prompt isn't always going to help you. In fact, it could potentially dilute the model's focus on the most relevant information. Provide as little information as possible to complete your task successfully.
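If you do need to pack several pieces of evidence into one prompt, you can at least work with the U-shaped curve rather than against it. Here is a minimal sketch (the passages and relevance scores are hypothetical, not any particular product's API) that places the highest-relevance material at the start and end of the context, leaving the weakest of the kept passages in the middle:

```python
# A minimal sketch: given passages already scored for relevance, put the
# strongest material at the start and end of the prompt, where recall is
# highest, and let the weakest kept passages sit in the middle.

def order_for_prompt(passages, max_passages=6):
    """passages: list of (text, relevance_score) tuples, higher is better."""
    ranked = sorted(passages, key=lambda p: p[1], reverse=True)[:max_passages]
    # Alternate the best passages between the front and the back of the list.
    front, back = [], []
    for i, passage in enumerate(ranked):
        (front if i % 2 == 0 else back).append(passage[0])
    return front + back[::-1]

passages = [
    ("Paris has been the capital of France since 987.", 0.92),
    ("France is a country in Western Europe.", 0.55),
    ("The Eiffel Tower was completed in 1889.", 0.31),
    ("The capital city hosts the national government.", 0.74),
]
prompt_context = "\n\n".join(order_for_prompt(passages))
```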
Quadratic costs
The attention computation at the heart of an LLM grows quadratically with context length. This means that doubling the amount of text you feed an LLM roughly quadruples the compute required to do your task. This unfriendly scaling poses significant challenges for deployment, whether you're a small startup or a tech giant. Context windows aren't getting smaller, so the increase in computational requirements will ultimately mean higher costs until we can figure out better hardware solutions (assuming no improvement in neural network design). This length-dependent cost is the key reason most commercial API providers charge based on input token count.
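To make the scaling concrete, here is a back-of-the-envelope sketch, assuming the attention cost is proportional to the square of the input length (a simplification that ignores the model's linear-cost components):

```python
# Rough illustration of quadratic scaling in the attention computation.
# Costs are relative, not real FLOP counts.

def relative_attention_cost(num_tokens: int, baseline_tokens: int = 1_000) -> float:
    return (num_tokens / baseline_tokens) ** 2

for n in (1_000, 2_000, 8_000, 128_000):
    print(f"{n:>7} tokens -> {relative_attention_cost(n):,.0f}x the attention cost of 1,000 tokens")
# 2,000 tokens costs ~4x as much as 1,000; 128,000 tokens costs ~16,384x as much.
```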
Hallucination
You might think that larger context windows should reduce hallucination because you can provide the LLM with more information to ground its answers. But it isn't that simple. Research indicates that very large contexts can sometimes increase the likelihood of hallucination, particularly when the input contains irrelevant or contradictory information. This counterintuitive finding highlights the importance of curating the data you feed to an LLM. If you don't preprocess that data carefully, you can't trust the model's output in the wild.
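One simple form of preprocessing is to drop passages that have little to do with the question before they ever reach the prompt. Here is a minimal sketch; the relevance score is plain word overlap (in practice you would likely use embeddings), and the question and passages are made up for illustration:

```python
import re

# A minimal sketch of pre-filtering context before it reaches the LLM.
# The relevance score here is simple word overlap; the filtering idea is
# the same with a proper embedding-based similarity.

def words(text: str) -> set:
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def overlap_score(question: str, passage: str) -> float:
    q = words(question)
    return len(q & words(passage)) / max(len(q), 1)

def filter_context(question, passages, threshold=0.2):
    return [p for p in passages if overlap_score(question, p) >= threshold]

question = "What were the company's 2023 revenues?"
passages = [
    "In 2023 the company's revenues grew to $41M.",
    "The CEO enjoys marathon running.",          # low overlap: dropped
    "Revenues in 2022 were $35M.",
]
print(filter_context(question, passages))
```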
Where Are Context Windows Heading?
Two areas of research on context windows look especially promising: sparsifying the attention computation, and data curation.
Sparse attention mechanisms
In case you don't know what an attention mechanism is: it is the most computationally expensive part of running an LLM. For every word an LLM generates, it computes how much it should "pay attention to" each of the previous words. For example, if you ask the LLM "What is the capital of France?", it will likely pay attention to the words "capital" and "France", because the rest is filler that isn't essential to answering the question. This is computationally expensive because for every word of the answer the LLM generates, it needs to check every word in the input for relevance.
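If it helps to see this concretely, here is a toy scaled dot-product attention in NumPy. The numbers are random stand-ins, but the shape of the score matrix shows where the quadratic cost comes from:

```python
import numpy as np

# Toy scaled dot-product attention over a short "sentence". The score matrix
# has one row per output position and one column per input position, so its
# size (and cost) grows with the square of the sequence length.

seq_len, d = 6, 8                      # 6 tokens, 8-dimensional vectors
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d))      # queries: what each position is looking for
K = rng.normal(size=(seq_len, d))      # keys: what each position offers
V = rng.normal(size=(seq_len, d))      # values: the information actually carried

scores = Q @ K.T / np.sqrt(d)          # (seq_len x seq_len): every token's relevance to every other
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over each row
output = weights @ V                   # weighted mix of values

print(scores.shape)                    # (6, 6): the quadratic part
```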
To address this quadratic scaling cost, researchers are developing innovative sparse attention mechanisms. These techniques allow models to selectively focus on the most important parts of the input. Some techniques involve limiting how far back the model can look, while others attempt to do it more intelligently than by brute force.
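One of the simplest sparsification schemes is a sliding window, where each position only attends to its nearest predecessors. Here is a sketch of such a mask; it is a generic illustration of the idea, not the exact scheme any particular model uses:

```python
import numpy as np

# Sliding-window attention mask: position i may only attend to positions
# i-window .. i (causal and local). Masked-out scores get zero weight, and
# an optimized implementation never computes them at all.

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j >= i - window)   # True where attention is allowed

mask = sliding_window_mask(seq_len=8, window=2)
print(mask.astype(int))
# Each row has at most 3 ones, so the work per generated token stays roughly
# constant instead of growing with the full context length.
```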
Domain specificity
By curating data that is relevant to your domain (e.g. medical papers if you're a healthcare analyst), you can fine-tune or customize an LLM to bake in specific knowledge about your work. That way, you can reduce the amount of context you need to provide it. Think of it like training a first-year hire to be a specialist: once they know the domain, you don't need to spell out as much detail for them to understand you.
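As a rough idea of what this looks like in practice, here is a minimal fine-tuning sketch using the Hugging Face transformers library. The base model ("gpt2"), the file "medical_abstracts.txt", and the hyperparameters are placeholders you would swap for your own domain and setup:

```python
# A minimal sketch of domain fine-tuning with Hugging Face transformers.
# "gpt2" and "medical_abstracts.txt" are placeholders; use whatever base
# model and domain corpus fit your work.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# One document (or abstract) per line in a plain-text file.
dataset = load_dataset("text", data_files="medical_abstracts.txt")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-model", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```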
What should you do now?
Most of our customers require thoroughness: they often need to read through hundreds of documents to put together a detailed analysis of an industry. Even if all of these documents could fit in the context window of an LLM, hopefully you now realize you shouldn't take that approach. Instead, consider these strategies:
- Systematic analysis: Process each document individually to avoid overwhelming the LLM and to mitigate the 'lost in the middle' problem (a sketch of this map-then-aggregate approach follows the list).
- Efficient aggregation: Combine the results of individual analyses, focusing on the most relevant information to reduce computational costs and minimize hallucination risks.
- Domain-specific fine-tuning: If possible, use a model tailored to your industry to reduce the amount of context needed for each analysis.
- Sparse attention techniques: Leverage advanced LLM implementations that can handle longer contexts more efficiently.
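Here is a sketch of the first two strategies. The call_llm function is a placeholder for whichever model provider's client you use, and the prompts are illustrative, not a recommended template:

```python
# A sketch of "process each document, then aggregate", with a placeholder
# call_llm() standing in for a real model API client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your model provider's client call.")

def analyze_document(doc_text: str, question: str) -> str:
    # Map step: one focused prompt per document keeps each context short.
    prompt = (f"Answer the question using only this document.\n\n"
              f"Question: {question}\n\nDocument:\n{doc_text}\n\n"
              f"If the document is not relevant, reply 'NOT RELEVANT'.")
    return call_llm(prompt)

def aggregate(findings: list[str], question: str) -> str:
    # Reduce step: combine only the relevant per-document findings.
    relevant = [f for f in findings if "NOT RELEVANT" not in f]
    prompt = (f"Question: {question}\n\nFindings from individual documents:\n"
              + "\n".join(f"- {f}" for f in relevant)
              + "\n\nWrite a concise synthesis of these findings.")
    return call_llm(prompt)

def analyze_corpus(documents: list[str], question: str) -> str:
    findings = [analyze_document(doc, question) for doc in documents]
    return aggregate(findings, question)
```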
Doing all of the above may seem challenging, but with platforms like Fabric by Quilt Labs, you can effectively leverage AI for large-scale analyses today.