The Biggest Challenges with LLMs Today (Part 2)

Pushing the Boundaries of LLMs

This is Part 2 in The Biggest Challenges with LLMs Today series.

For part 1, click here.

In the part below, I share 5 more key challenges dominating LLM research.

1. Large-Scale Evaluation

Evaluating LLM outputs at scale is a challenge.

You have no guarantee the same input will generate the same output.

Determining what constitutes a "good" output can be highly subjective, which makes it hard to establish a quality benchmark.

In terms of volume, LLMs generate vast amounts of data.

Manually evaluating outputs to ensure high quality is impractical.

But automating evaluation while maintaining high accuracy is hard.

Automated metrics may not always align with human judgment.

Answers that may be appropriate in one conversation might be inappropriate elsewhere, depending on context and past interactions.

Example

Here’s an example of probabilistic outputs throwing a wrench in evaluation:

You explicitly ask the LLM to return a valid JSON output.

Yet, there’s a small chance the LLM will occasionally return an invalid JSON anyway, despite your explicit request.

This happened to me the other day while hacking on PassMyTests.com!
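One common mitigation is to validate and retry. Here’s a minimal Python sketch, assuming a hypothetical call_llm() helper that wraps whatever client you actually use:

```python
import json

def get_json_response(prompt: str, max_retries: int = 3) -> dict:
    """Ask the model for JSON and re-prompt if parsing fails."""
    for _ in range(max_retries):
        raw = call_llm(prompt)  # hypothetical helper wrapping your LLM client
        try:
            return json.loads(raw)  # happy path: the output parses
        except json.JSONDecodeError:
            # Feed the bad output back so the model can correct itself
            prompt = (
                "Return ONLY valid JSON, no commentary. "
                f"Your previous reply failed to parse:\n{raw}"
            )
    raise ValueError(f"No valid JSON after {max_retries} attempts")
```

Even this isn’t a guarantee; it just pushes the failure rate down.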

Because user prompts are unstructured, they add even more unpredictability.

Changing any one of these factors will alter the responses you get:

  • word reordering

  • capitalization

  • punctuation

  • phrasing

Future Work

There is ongoing research into improving large-scale evaluation:

  • Better task-specific metrics

  • Using AI to detect and fix troublesome outputs (see the sketch after this list)

  • Human-in-the-loop evaluation processes for quality control
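As a concrete illustration of the second idea, here’s a minimal LLM-as-judge sketch. The call_llm() helper is again a hypothetical stand-in for your client, and the PASS/FAIL rubric is deliberately crude:

```python
def judge_output(question: str, answer: str) -> bool:
    """Use a second model as an automated grader (LLM-as-judge)."""
    verdict = call_llm(  # hypothetical helper wrapping your LLM client
        "You are a strict grader. Reply with PASS or FAIL only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    return verdict.strip().upper().startswith("PASS")

samples = [
    ("What is 2 + 2?", "4"),
    ("Name a US state.", "Toronto"),
]

# Flag failures for human review instead of reviewing everything by hand
flagged = [(q, a) for q, a in samples if not judge_output(q, a)]
```

Automated judges inherit the judge model’s own blind spots, which is exactly why the human-in-the-loop step stays in the pipeline.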

LLM Ops is a relatively new field that is emerging to tackle these challenges.

2. Computational Efficiency

Training and running LLMs involve vast amounts of data and calculations.

LLMs are massively resource-intensive in terms of computational power, energy consumption, and cost of operations.

As models become more sophisticated and complex, their size typically increases, which then increases computational demands.

Training and operating LLMs is infeasible for many companies, as it requires substantial financial investment and robust compute infrastructure.

Example

Training OpenAI's GPT-3 involved 175 billion parameters!

Managing and updating this massive number of parameters requires substantial hardware resources (GPUs and TPUs), which leads to high electricity usage and environmental impact.

Training a large language model like OpenAI’s GPT-3, for example, uses nearly 1,300 megawatt-hours (MWh) of electricity, the annual consumption of about 130 US homes. According to the IEA, a single Google search takes 0.3 watt-hours of electricity, while a ChatGPT request takes 2.9 watt-hours. (For scale, an incandescent light bulb draws about 60 watts.) If ChatGPT were integrated into the 9 billion searches done each day, the IEA says, electricity demand would increase by 10 terawatt-hours a year, the amount consumed by about 1.5 million European Union residents.
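Those figures are easy to sanity-check with back-of-the-envelope arithmetic:

```python
search_wh = 0.3           # Wh per Google search (IEA figure above)
chatgpt_wh = 2.9          # Wh per ChatGPT request (IEA figure above)
searches_per_day = 9e9

extra_wh = (chatgpt_wh - search_wh) * searches_per_day * 365
print(extra_wh / 1e12)    # ~8.5 TWh/year, the same order as the 10 TWh quoted
```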

Future Work

There are several research directions to improve computational efficiency:

  1. Model Pruning: Reduce a neural network’s size by removing parameters that have little impact on performance. This can significantly decrease computational requirements without substantially affecting output quality.

  2. Quantization: Reducing the precision of the numbers used in computations from floating-point to lower-bit integers can decrease memory usage and speed up inference without a large drop in performance (see the sketch after this list). Here’s a recent example from Microsoft.

  3. Hardware: Specialized hardware for LLM workloads can also enhance efficiency. Google’s TPUs (Tensor Processing Units), for example, are designed to handle AI tasks more efficiently than general-purpose CPUs or GPUs.
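To make the first two ideas concrete, here’s a toy NumPy sketch of magnitude pruning and symmetric int8 quantization on a random weight matrix. This illustrates the arithmetic, not a production recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512)).astype(np.float32)   # toy weight matrix

# 1. Magnitude pruning: zero out the 50% of weights closest to zero
threshold = np.quantile(np.abs(W), 0.5)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0).astype(np.float32)

# 2. Symmetric int8 quantization: one scale maps floats onto [-127, 127]
scale = np.abs(W_pruned).max() / 127.0
W_int8 = np.round(W_pruned / scale).astype(np.int8)  # 4x smaller than float32

# Dequantize at inference time: W ≈ W_int8 * scale
err = np.abs(W_pruned - W_int8.astype(np.float32) * scale).max()
print(err <= scale / 2 + 1e-6)  # True: rounding error is at most half a step
```

Real systems prune structurally (whole heads or channels) and calibrate quantization scales per layer, but the memory and compute savings come from exactly this kind of reduction.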

3. Bias and Fairness

One of the most pressing ethical issues is bias.

Since LLMs learn from vast amounts of data, they can inadvertently perpetuate biases present in the data.

This could manifest in gender, racial, or ideological biases that skew LLM responses, which could lead to unfair or harmful outputs.

This issue is critical because LLMs are used in diverse real-world applications.

Example

For example, an LLM might generate job ads that are biased toward a certain gender based on its training data, or it might offer prejudiced customer service responses based on racial stereotypes.

Future Work

Researchers are exploring ways to detect, understand, and mitigate bias and ensure fair outputs.

Mitigating bias involves:

  • Diversifying training data

  • Training AI to detect and correct bias

  • Reinforcement learning with human feedback (RLHF) to align LLMs with human preferences

For example, adversarial training is a promising technique:

LLMs are trained to overcome bias by introducing challenging scenarios that test and expose underlying prejudices.
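Here’s a toy PyTorch sketch of one common formulation of this idea, adversarial debiasing via gradient reversal: an adversary tries to predict a protected attribute from the model’s internal representation, while the encoder is trained to make that prediction fail. The tiny encoder is a stand-in for a real language model, not the exact setup any particular paper uses:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips gradients on the backward pass,
    so the encoder learns to defeat the adversary."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # toy stand-in for an LM
task_head = nn.Linear(64, 2)   # the task we actually care about
adversary = nn.Linear(64, 2)   # tries to recover a protected attribute

params = [*encoder.parameters(), *task_head.parameters(), *adversary.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 128)             # toy batch of input features
y_task = torch.randint(0, 2, (32,))  # task labels
y_prot = torch.randint(0, 2, (32,))  # protected-attribute labels

h = encoder(x)
loss = (loss_fn(task_head(h), y_task)                         # do the task well...
        + loss_fn(adversary(GradReverse.apply(h)), y_prot))   # ...while hiding y_prot
opt.zero_grad()
loss.backward()
opt.step()
```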

4. New Architectures

An LLM’s architecture is how it processes information.

Transformer models, like those behind ChatGPT, use layers of self-attention to process text.

The original transformer is an encoder-decoder architecture: an encoder stack reads the input sequence, and a decoder stack generates the output.

ChatGPT’s underlying models, however, use a decoder-only variant of the transformer.

Example

The transformer architecture is the most common, but it has serious limitations, especially in compute efficiency as data volumes grow.

It’s very resource-intensive due to the quadratic relationship between the number of tokens in the input sequence and computations required for attention mechanisms.

With the transformer architecture, processing longer text takes quadratically more compute power and memory.
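You can see the bottleneck directly in a naive NumPy implementation: the attention weights form an n × n matrix, so doubling the sequence length quadruples its size:

```python
import numpy as np

def naive_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n, n): the quadratic bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

n, d = 4096, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = naive_attention(Q, K, V)  # materializes a 4096 x 4096 weight matrix
```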

Future Work

Researchers are exploring several promising avenues:

  1. Sparse Transformers: Introduce sparsity into the attention mechanism so the model attends to only a subset of important tokens rather than all of them. This cuts computational load with little loss in output quality.

  2. Reversible Architectures: Construct neural networks where each layer's outputs can be reconstructed from its subsequent layer's outputs. This reduces memory usage during training by eliminating the storage of intermediate activations for backpropagation.

  3. Efficient Attention Mechanisms: Methods like the Performer or Linformer use low-rank approximations or kernel-based techniques to reduce the computational complexity from quadratic to linear with respect to the input size.
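To show the core trick behind the third direction, here’s a minimal NumPy sketch of kernel-based linear attention, in the spirit of Katharopoulos et al.’s ELU feature map. Performer and Linformer use different approximations, but they exploit the same reordering that avoids the n × n matrix:

```python
import numpy as np

def phi(x):
    """Positive feature map (ELU + 1), as used in linear-attention papers."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                  # (d, d_v): built once, cost linear in n
    Z = Qp @ Kp.sum(axis=0)        # (n,) normalizer
    return (Qp @ KV) / Z[:, None]  # no (n, n) matrix is ever materialized

n, d = 4096, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)    # same output shape as standard attention
```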

5. Generalization

Improving the ability of LLMs to perform tasks with minimal or no training examples is an active research area.

Few-shot learning is when you give LLMs a few examples of a task.

Zero-shot learning is when LLMs try to do tasks they’ve never seen.

Both techniques challenge LLMs to generalize from little information.

This is imperative in use cases where labeled data collection is not feasible or data is rapidly changing.

Example

Suppose you’re using LLMs for customer service across various industries.

With few-shot learning, the LLM might be given only a handful of healthcare examples in the prompt, yet it still needs to perform well and recognize health-specific terminology.

In zero-shot learning, the LLM might face a new product category without prior exposure, relying only on its general customer service knowledge.
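In prompt form, the difference is simply how many worked examples you include. The tickets below are made up for illustration:

```python
few_shot_prompt = """Classify each support ticket's urgency as low or high.

Ticket: "My lab results page won't load before tomorrow's appointment."
Urgency: high

Ticket: "How do I update my mailing address?"
Urgency: low

Ticket: "The portal logged me out in the middle of a refill request."
Urgency:"""

zero_shot_prompt = """Classify the support ticket's urgency as low or high.

Ticket: "The portal logged me out in the middle of a refill request."
Urgency:"""
```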

Future Work

Recent advancements include leveraging meta-learning, where models “learn how to learn” across tasks, improving adaptability with minimal data.

Meta-learning involves training LLMs on a variety of tasks and optimizing them to quickly adapt to new tasks using only a few examples.

ChatGPT exhibits a form of meta-learning through in-context learning: you simply feed examples within the prompt.

Synthetic data generation can also help in use cases where real data is scarce.

This could provide richer training data without costly, tedious data collection.

Conclusion

Evaluating LLMs at scale is tough because of probabilistic outputs and subjective quality assessments.

The volume of data generated by LLMs makes human review impractical, while automated methods can fail to capture nuanced judgment.

Meanwhile, the compute demands of training and running LLMs are massive, almost incomprehensible, requiring extensive hardware and energy resources.

Addressing bias and generalization to unfamiliar tasks are both very active areas of LLM research today.

Altogether, this is the most exciting time to be in AI!