The Biggest Challenges with LLMs Today (Part 1)

Pushing the Boundaries of LLMs

LLMs are improving at a staggering pace.

But they still have a long way to go.

In this 2-part series, I share the 10 biggest challenges with LLMs today.

Each part covers 5 unique LLM challenges and highlights promising solutions.

1. Hallucination

A common complaint with LLMs is hallucination:

LLMs generating information not based in fact or reality.

LLMs learn patterns and relationships from training data, generating responses based on statistical correlations in the data.

Without a grasp of facts or a built-in mechanism to verify truth, LLMs can easily hallucinate, crafting answers that seem right but are factually wrong.

If the training data contains errors, biases, or limited perspectives on a subject, the LLM is likely to replicate these issues in its output.

If the training data does not cover a subject extensively, the model may generate plausible but incorrect information when asked about it.

Somewhat related, training data is fixed in time. But knowledge evolves, laws evolve, research evolves. Keeping LLMs factually up-to-date is hard.

Hallucination is a big issue in use cases where accuracy is crucial, such as:

  • law

  • education

  • enterprise

  • healthcare

  • customer support


For example, imagine an LLM evaluating medical symptoms. Instead of reasoning from the patient’s specific report, it incorrectly "remembers" or generates a diagnosis based on patterns from its training data.

This LLM could falsely diagnose a common cold when, in reality, the symptoms suggest a more severe condition like pneumonia.

Future Work

Researchers are developing ways to measure and mitigate hallucinations.

One approach is “Chain-of-Thought” reasoning.

The LLM is instructed to outline its thought process step by step.

This makes its conclusions easier to verify and encourages the model to check its answers against trusted data.

Another potential solution is to build more rigorous processes to filter training data to make sure it’s accurate in the first place.
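The step-by-step prompting idea above is simple enough to sketch in code. Here is a minimal illustration of chain-of-thought prompting, where `llm()` is a stand-in stub for a real model call (e.g. an API request), so the example runs on its own:

```python
def llm(prompt: str) -> str:
    # Stub: a real implementation would send the prompt to a hosted model.
    return "Step 1: 6 * 7 means six groups of seven. Step 2: 6 * 7 = 42. Final answer: 42."

def ask_with_cot(question: str) -> str:
    # Appending an explicit "think step by step" instruction is the
    # simplest form of chain-of-thought prompting.
    prompt = f"{question}\nLet's think step by step, then state the final answer."
    return llm(prompt)

answer = ask_with_cot("What is 6 * 7?")
```

Because the model is asked to show intermediate steps, each step can be checked individually, which is what makes the final answer easier to verify.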

2. Multiple Languages

LLMs perform substantially worse in non-English languages.

The main issue is lack of high-quality training data in other languages.

Also, languages vary tremendously in grammar, syntax, and semantics.

Cultural nuances, idioms, and region-specific references pose a challenge.

LLMs would need to understand context embedded within each language.


For example, Arabic and Mandarin use scripts with unique features like right-to-left writing and complex character sets.

Very different from English!

Future Work

Overcoming the multi-lingual challenge is essential to create globally accessible and inclusive LLMs.

One approach is crowdsourced data collection.

Engaging with native speakers in this community-driven approach helps generate high-quality linguistically diverse datasets.

Until we achieve a long-term solution, however, a near-term workaround is to ask your prompt in English, then translate back to your required language.
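That workaround is just a three-step pipeline: translate in, query in English, translate out. A runnable sketch of its shape, where both `translate()` and `llm()` are stubs standing in for real services (a translation API and a model call):

```python
def translate(text: str, target_lang: str) -> str:
    # Stub: "translates" by tagging the text with the target language.
    return f"[{target_lang}] {text}"

def llm(prompt: str) -> str:
    # Stub: a real implementation would call an English-strong model.
    return f"Answer to: {prompt}"

def ask_in_language(question: str, user_lang: str) -> str:
    english_q = translate(question, "en")   # 1. translate the prompt to English
    english_a = llm(english_q)              # 2. query the model in English
    return translate(english_a, user_lang)  # 3. translate the answer back

reply = ask_in_language("¿Qué es un LLM?", "es")
```

The trade-off: translation errors at either end compound, so this helps most for languages where the model itself is weakest.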

3. Multimodality

The future of LLMs is multimodality:

Processing and generating multiple modalities: text, images, audio, and video.

There are many use cases for multimodality, but building these models is hard.

One challenge is the different structures and features across modalities:

  • Text is sequential and symbolic

  • Audio is sequential, temporal, and continuous

  • Images are spatial and high-dimensional

  • Video combines elements of both images and audio, plus motion

The alignment of multimodal data is crucial for understanding relationships between modalities, such as synchronizing audio with video.

There are also embedded modalities, such as text written on images, subtitles on a video, and videos that consist of a static image paired with audio.

You need massive volumes of data to train these models.

Even though labeling text is faster and cheaper than labeling video, each modality must be sufficiently represented to cover diverse scenarios.
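These structural differences show up directly in the tensor shapes a model has to handle. A rough sketch with illustrative (made-up) dimensions:

```python
import numpy as np

text  = np.zeros((128,), dtype=np.int64)         # sequence of 128 token IDs
audio = np.zeros((16000,), dtype=np.float32)     # 1 second of 16 kHz samples
image = np.zeros((224, 224, 3), dtype=np.uint8)  # height x width x RGB channels
video = np.zeros((30, 224, 224, 3), dtype=np.uint8)  # 30 image frames + motion

# A multimodal model must map all of these into one shared embedding
# space before it can relate them to each other.
```

Aligning such differently-shaped inputs into a common representation is exactly the alignment problem described above.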


Imagine an LLM that can read restaurant reviews and analyze photos of food.

Beyond recreation, this would be invaluable in industries like healthcare.

Combining medical records (e.g. unstructured doctor’s notes) with imaging data (e.g. X-rays, MRIs) could positively impact diagnostics.

Future Work

One promising direction is the development of models that can handle any-to-any modality interactions, such as Google Gemini and Anthropic Claude 3.

For example, Google Gemini can analyze a photo of a dish and answer with a recipe.

4. Context

The “right” answer often depends on context.

First, there’s the context of your prompt.

"Local cuisine" differs vastly between Paris, France, and Paris, Texas.

If you ask an LLM about health, it should prioritize information from medical journals and trusted sources.

Second, LLMs have limited context window size.

Complex or long conversations are challenging because LLMs need to remember and integrate details from earlier in the conversation to respond appropriately later.

Context is a tough problem because:

  • Conversations are non-linear

  • Important information is scattered throughout conversations

  • It’s difficult to distinguish important from non-critical information


For example, you’re planning a birthday party and brainstorming decorations.

You mention a guest is allergic to peanuts.

Later, you ask for cake recommendations.

Ideally, the LLM remembers your guest is allergic to peanuts, tailoring its cake recommendations to accommodate.

But the LLM needs to:

  • Remember context from previous conversations

  • Distinguish your earlier comment as a critical piece of data

  • Understand how this context is relevant to current conversations
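Those three steps can be sketched naively in code: scan each user message for critical facts and inject any stored facts into later prompts. Real systems use far more robust extraction; the keyword rule here is purely illustrative.

```python
def extract_facts(message: str) -> list:
    # Toy rule: treat any sentence mentioning "allergic" as critical.
    return [s.strip() for s in message.split(".") if "allergic" in s.lower()]

class Conversation:
    def __init__(self):
        self.facts = []  # critical facts carried across turns

    def build_prompt(self, message: str) -> str:
        self.facts.extend(extract_facts(message))
        # Prepend remembered facts so they survive a limited context window.
        context = " ".join(f"Remember: {f}." for f in self.facts)
        return f"{context} {message}".strip()

chat = Conversation()
chat.build_prompt("One guest is allergic to peanuts. Suggest decorations.")
prompt = chat.build_prompt("Now recommend a birthday cake.")
# The peanut-allergy fact is carried into the cake prompt automatically.
```

The hard part a real system faces is the middle step: deciding *which* sentences are critical without a hand-written keyword rule.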

Future Work

One research path is the In-Context AutoEncoder (ICAE), which focuses on context compression.

This model encodes contexts into memory slots.

These slots retain the essential information of the context, allowing LLMs to generate responses that are contextually appropriate, even when the context extends beyond its default processing capacity.

It’s useful when context spans multiple conversations or has complex information structures.
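As a rough illustration of the idea (not the actual ICAE architecture, which uses a learned encoder), here is a numpy sketch that compresses a long sequence of token embeddings into a small fixed number of memory slots by chunked pooling:

```python
import numpy as np

def compress_to_slots(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Reduce (seq_len, d) embeddings to (k, d) memory slots by mean-pooling chunks."""
    chunks = np.array_split(embeddings, k, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

context = np.random.randn(1000, 64)      # 1,000 token embeddings, dimension 64
slots = compress_to_slots(context, k=8)  # 8 slots: a 125x reduction in length
```

The appeal is that the downstream model attends over 8 slots instead of 1,000 tokens; ICAE's contribution is learning that compression so the slots keep what matters rather than a blind average.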

5. Transparency

Another challenge is the transparency and explainability of LLM decisions.

LLMs consist of billions of parameters.

It’s worth emphasizing:

Nobody fully understands how LLMs arrive at their answers.

We understand how to train LLMs, but not how they actually work internally.

The question becomes - how do we come to trust their output without fully understanding how they work internally?

After all, I don’t know the internal workings of Google’s search and recommendation systems, but I still mostly trust them.

Nonetheless, this presents a massive problem — especially in highly sensitive industries such as finance, healthcare, and law.


If a medical LLM suggests a treatment, doctors need to know on what basis it made that decision.

As a patient, wouldn’t you want to know too?

Currently, there’s very little transparency, which is a barrier to wider adoption.

Future Work

Researchers are trying to “open the black box" to make LLMs more transparent and explainable.

Feature visualization highlights what each part of the model is focusing on when making decisions.

Layer-wise relevance propagation shows how much each input feature contributes to the final decision.

Explainable AI (XAI) models attempt to provide human-understandable justifications, explaining their step-by-step rationale for arriving at a decision.
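To make relevance propagation concrete, here is a toy, single-layer illustration of the idea: attribute a linear unit's output to each input in proportion to its contribution x·w. Real LRP applies rules like this layer by layer through a deep network; this shows only the one-layer intuition.

```python
import numpy as np

def lrp_linear(x: np.ndarray, w: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    contributions = x * w          # each input's share of the output
    z = contributions.sum() + eps  # total pre-activation (eps avoids divide-by-zero)
    return contributions / z       # relevance fractions, summing to ~1

x = np.array([1.0, 2.0, 0.5])   # input features
w = np.array([0.3, 0.8, -0.1])  # learned weights
relevance = lrp_linear(x, w)    # feature 1 dominates: it drives the output
```

The output is a per-feature breakdown of "why" the unit fired, which is exactly the kind of justification a doctor would want from a medical LLM.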


Despite mind-blowing capabilities, LLMs today struggle with hallucination, multimodality, performance in non-English languages, understanding context, limited context windows, transparency, and explainability.

Personally, I can’t wait to see what we’ll be able to do with LLMs 1 year, 5 years, 10 years from today — as amazing AI researchers and innovators relentlessly chip away at these primary challenges.

Next, in part 2 of this series, I’ll share 5 more unique challenges faced by LLMs.