ChatGPT and GPT-4 are remarkable engineering breakthroughs. In this post I reflect on what is still missing from these models, and from most modern LLMs in general.

  1. Persistent memory and lifelong learning. If we want an LLM to know some piece of context, we have to include it as part of the input sequence. But even GPT-4 has an input length limit of 32k tokens. There are use cases that need much longer context: working on large-scale code bases, understanding entire books, reading long manuals, and remembering conversation history over a long period of time. Being able to memorize and use long documents would further expand the capability of LLMs.
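
     One common workaround is to keep long documents outside the model and pull only the most relevant pieces into the prompt. Below is a minimal sketch of that retrieve-then-prompt pattern; the `llm(prompt) -> str` helper is hypothetical (it wraps whatever model API you use), and the bag-of-words scoring is a stand-in for real embedding-based retrieval.

     ```python
     from collections import Counter

     def chunk(text: str, size: int = 200) -> list[str]:
         # Split a long document into fixed-size word chunks.
         words = text.split()
         return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

     def score(query: str, passage: str) -> int:
         # Crude bag-of-words overlap; stands in for embedding similarity.
         q, p = Counter(query.lower().split()), Counter(passage.lower().split())
         return sum(min(q[w], p[w]) for w in q)

     def answer_with_long_document(llm, document: str, question: str, k: int = 3) -> str:
         # Keep only the k most relevant chunks so the prompt fits the context window.
         top = sorted(chunk(document), key=lambda c: score(question, c), reverse=True)[:k]
         prompt = "Context:\n" + "\n---\n".join(top) + f"\n\nQuestion: {question}\nAnswer:"
         return llm(prompt)
     ```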

  2. Robust instruction following. LLMs are prone to various distractions – learned priors, in-context patterns, spurious semantic correlations – that pull them away from the instruction they were given. The Inverse Scaling Challenge has collected many such examples. These problems do not seem to go away with scaling or RLHF.
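
     A simple way to see this failure mode is to issue an instruction that conflicts with the model's learned prior, e.g. asking it to repeat a false statement verbatim and checking whether it silently "corrects" the content. A toy probe, using the same hypothetical `llm(prompt) -> str` helper as above:

     ```python
     def probe_instruction_following(llm) -> bool:
         # The instruction asks for verbatim repetition; the learned prior
         # pushes the model toward the factually "correct" completion instead.
         target = "The capital of France is Marseille."
         prompt = f"Repeat the following sentence exactly, word for word:\n{target}"
         return llm(prompt).strip() == target  # False: the prior won over the instruction
     ```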

  3. Faithfully expressing its beliefs. LLMs can generate statements that contradict each other, and statements that the same model will later judge to be false. This means the text they generate does not necessarily reflect their true “beliefs”. One hypothesis is that this stems from how nearly all modern generative LLMs decode: token by token, autoregressively, committing to each token with no mechanism to go back and revise it.
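
     One coarse way to surface such inconsistencies is to sample several answers to the same question and then ask the model to judge its own statement. A minimal sketch, assuming the hypothetical `llm` helper samples with some temperature so repeated calls can differ:

     ```python
     def check_self_consistency(llm, question: str, n_samples: int = 5) -> dict:
         # Sample several answers, then ask the model to verify its own first answer;
         # disagreement on either check hints at unstable "beliefs".
         answers = [llm(f"Answer briefly: {question}").strip() for _ in range(n_samples)]
         verdict = llm(
             f"Statement: {answers[0]}\nIs this statement true? Answer yes or no."
         ).strip().lower()
         return {
             "answers_agree": len(set(answers)) == 1,
             "model_endorses_own_answer": verdict.startswith("yes"),
         }
     ```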

  4. Expressing confidence and abstaining. LLMs lack an intrinsic, built-in mechanism for expressing how confident they are in the text they generate. Ideally, they would also abstain on their own when that confidence is insufficient.
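
     A common external proxy is the probability the model itself assigns to its output: average the per-token log-probabilities and refuse to answer below a threshold. A sketch, assuming a hypothetical `llm_with_logprobs(prompt)` helper that returns the generated text together with its per-token log-probabilities; the 0.5 threshold is arbitrary.

     ```python
     import math

     def answer_or_abstain(llm_with_logprobs, question: str, threshold: float = 0.5) -> str:
         # Geometric mean of per-token probabilities as a crude confidence estimate.
         text, token_logprobs = llm_with_logprobs(f"Answer briefly: {question}")
         confidence = math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))
         if confidence < threshold:
             return "I am not confident enough to answer."
         return f"{text.strip()} (confidence ~ {confidence:.2f})"
     ```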

  5. Robustly expressing chains of reasoning. Methods like Chain-of-Thought (CoT) prompting show that chains of reasoning can be elicited from LLMs and that they are useful in boosting task performance. More needs to be done to ensure that the atomic steps in these reasoning chains are reliable and trustworthy.
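
     For concreteness, here is roughly what CoT prompting plus a naive step-level audit looks like, again with the hypothetical `llm(prompt) -> str` helper; having the same model grade its own steps is of course only a weak safeguard.

     ```python
     def chain_of_thought_with_check(llm, question: str) -> dict:
         # Elicit step-by-step reasoning, then ask the model to audit each step.
         reasoning = llm(f"{question}\nLet's think step by step.")
         steps = [s.strip() for s in reasoning.split("\n") if s.strip()]
         endorsed = [
             llm(f"Is this reasoning step correct? Answer yes or no.\nStep: {s}")
             .strip().lower().startswith("yes")
             for s in steps
         ]
         return {"steps": steps, "all_steps_endorsed": all(endorsed)}
     ```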

  6. Resolving knowledge conflicts. When knowledge provided in context conflicts with the knowledge stored in the model's parameters, which should the LLM choose to believe and ground its reasoning in? This is particularly important when the LLM is used in a retrieve-and-read workflow.
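
     In retrieve-and-read setups, one common mitigation is to tell the model explicitly to prefer the retrieved passages over its parametric memory and to cite them (though, per point 2, such instructions are not always followed). A sketch with the hypothetical `llm(prompt) -> str` helper:

     ```python
     def retrieve_and_read(llm, passages: list[str], question: str) -> str:
         # Make the intended knowledge priority explicit: passages win on conflict.
         context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
         prompt = (
             "Answer using ONLY the passages below. If they contradict what you "
             "remember from training, trust the passages and cite them by number. "
             "If the passages do not contain the answer, say so.\n\n"
             f"{context}\n\nQuestion: {question}\nAnswer:"
         )
         return llm(prompt)
     ```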

  7. Safety. Much work aims to align LLMs with human values, which includes rejecting unethical and unreasonable requests made in the prompt. But how do we define unethical and unreasonable? Where do we draw the line? And who gets to decide?

  8. Real-world interactions. As of now, ChatGPT and GPT-4 operate reactively: they respond to our prompts, and their outputs are, in most cases, meant for human consumption. Letting them act proactively and interact with the world would unleash a lot of potential.
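
     One direction is to close the loop: let the model pick actions (tool calls), execute them, and feed the observations back in. A deliberately minimal sketch of such a loop, with the hypothetical `llm(prompt) -> str` helper and a caller-supplied dictionary of tool functions; real agent frameworks add planning, structured tool schemas, and safety checks.

     ```python
     def run_agent(llm, task: str, tools: dict, max_steps: int = 5) -> str:
         # Loop: the model picks a tool and an argument, we execute it and
         # append the observation, until the model emits a final answer.
         transcript = f"Task: {task}\nAvailable tools: {', '.join(tools)}\n"
         for _ in range(max_steps):
             move = llm(
                 transcript
                 + "Reply with either 'CALL <tool>: <argument>' or 'FINAL: <answer>'.\n"
             ).strip()
             if move.startswith("FINAL:"):
                 return move[len("FINAL:"):].strip()
             if move.startswith("CALL"):
                 head, _, arg = move.partition(":")
                 tool = head.replace("CALL", "").strip()
                 result = tools.get(tool, lambda a: f"unknown tool {tool!r}")(arg.strip())
                 transcript += f"{move}\nObservation: {result}\n"
         return "Gave up after reaching the step limit."
     ```

     The step limit keeps a confused model from looping forever; everything else is left to the prompt and to whichever tool functions the caller chooses to expose.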