Inner vs Outer Model Alignment

A discussion of existential AI risks from the point of view of inner vs outer alignment.

“Do you think the majority of the existential risk from AI comes from inner alignment concerns, outer alignment concerns, or neither? Explain why”

Nearly all AI alignment efforts fall into the categories of outer alignment and inner alignment. Outer alignment risks refer to cases where the objective an AI system is given differs from what humans actually want. In contrast, inner alignment refers to scenarios where a base optimizer produces a sub-optimizer (a mesa-optimizer) whose learned objective diverges from, or is even adversarial to, the intended base objective.

In general, the vast majority of AI safety media coverage falls into the category of outer alignment. Although I find it hard to believe in scenarios where a malicious AI system suddenly takes over humanity, it’s true that outer alignment is a big problem in AI safety. Inner alignment, on the other hand, may not pose a big problem given the architecture of current systems, but as we combine self-supervised learning with more RL, deceptive alignment is going to become a problem. To summarize my argument: existential risk currently stems more from outer alignment, but as larger models become more ‘agentic,’ inner alignment will become far more prominent.

There are two types of outer alignment issues: harmful intent and intent misalignment, both of which pose significant safety problems. Harmful intent is what’s talked about in the media: a malicious actor programs an AI to assist with warfare. Everyone has seen the headlines. While such systems are certainly possible, addressing them requires carefully analyzing the incentives of such an actor and planning accordingly, so this type of harmful intent is left to another discussion. More interesting is intent that isn’t readily apparent: intent where a model manipulates your thinking without your awareness. The best example of this is an ad network like Facebook Ads. These ad networks nudge the users of their platforms into clicking on ads, which may or may not be in the users’ best interest. More broadly, these social networking platforms are incentivized to keep as many users on their platform as possible, so they build recommendation models to keep users engaged while on the platform. Non-apparent intent can be extremely harmful when aligned with the wrong incentives, as the sketch below illustrates.
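
To make this concrete, here is a minimal sketch (the items and scores are hypothetical, not any real platform’s data) of a recommender that ranks purely by predicted engagement; nothing in its objective distinguishes content that serves the user from content that merely holds their attention.

```python
# Toy sketch of the incentive described above (all items and scores are
# hypothetical): a recommender that ranks purely by predicted engagement will
# surface whatever keeps the user on the platform, whether or not it serves
# the user's interests.

candidates = [
    {"item": "balanced news summary", "predicted_engagement": 0.3, "good_for_user": True},
    {"item": "outrage-bait thread",   "predicted_engagement": 0.9, "good_for_user": False},
    {"item": "friend's photo album",  "predicted_engagement": 0.5, "good_for_user": True},
]

def rank_by_engagement(items):
    """The platform's objective: maximize clicks/time on site, nothing else."""
    return sorted(items, key=lambda x: x["predicted_engagement"], reverse=True)

for item in rank_by_engagement(candidates):
    print(item["item"], item["predicted_engagement"])
# The top result wins because it maximizes engagement, not because it is good
# for the user -- the system's effective "intent" is set entirely by its objective.
```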

Another outer alignment issue is intent misalignment with the base objective. This is best illustrated by an example. Suppose your base objective was to reduce the number of people with cancer in the world. Your intent might have been to find a cure for cancer; however, the optimizer might interpret the objective as killing everyone who has cancer. The important thing in this example is that the end state is the same: at the end of either optimization, there are no people with cancer. The means to that end, however, are very different, and this is where intent misalignment occurs. It’s very important to put guardrails on powerful systems to prevent such unintended outcomes. Subtle forms of intent misalignment are candidates for existential AI risk.
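
A toy sketch of this misspecification, with entirely hypothetical numbers and plans: the literal objective ‘minimize the number of people with cancer’ scores the intended plan and the catastrophic plan identically, so nothing in the specified reward rules the latter out.

```python
# Toy illustration of outer (objective) misalignment: the literal objective
# "minimize the number of people with cancer" is satisfied equally well by two
# very different plans, and a naive optimizer has no reason to prefer the one
# the designer intended. All numbers and plans are hypothetical.

population = 1_000
patients = 50  # people with cancer

def literal_objective(remaining_patients: int) -> int:
    """Reward as specified by the designer: fewer patients = higher reward."""
    return -remaining_patients

# Two candidate plans and their outcomes.
plans = {
    "cure_patients":      {"patients_left": 0, "alive": population},
    "eliminate_patients": {"patients_left": 0, "alive": population - patients},
}

scores = {name: literal_objective(info["patients_left"]) for name, info in plans.items()}
print(scores)  # both plans score 0 under the literal objective

# The specified reward cannot tell the intended plan from the catastrophic one;
# a separate guardrail (e.g. a penalty on lives lost) has to supply that distinction.
```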

On the other hand, inner alignment failures also pose an existential threat, just not in the current generation of models. When training a model, the base optimizer does not directly interact with the environment. For example, gradient descent never touches the environment in which rewards are handed out; it optimizes a model (a set of parameters), and it is that model which interacts with the environment. This learned model is the mesa-optimizer. Our goal is to ensure that the overall system has the goal we intend it to have. Problems arise when a mesa-optimizer develops a goal that diverges from, or is adversarial to, that of the base optimizer. If that is the case, the model may fool us into thinking it’s aligned, but when deployed without the base optimizer actively updating its parameters, it may behave totally differently (deceptive alignment). Deceptive alignment can in general be thought of as a dangerous form of reward hacking.
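
A minimal sketch of this two-level structure, with a made-up toy policy and loss: the base optimizer (here, plain gradient descent) only ever touches the parameters, while the learned model those parameters define is the only thing that acts in the environment, and the only thing left at deployment.

```python
import numpy as np

# Toy two-level picture of base optimizer vs mesa-optimizer (all details
# hypothetical). The base optimizer (gradient descent) never touches the
# environment; it only nudges the parameters theta. The learned policy defined
# by theta is what actually acts -- the level at which a mesa-objective could
# diverge from the base objective.

rng = np.random.default_rng(0)
theta = rng.normal(size=3)          # parameters of the learned model/policy

def policy(theta, observation):
    """The mesa level: the learned model mapping observations to actions."""
    return np.tanh(theta @ observation)

def base_loss(theta, observation, target_action):
    """The base objective the outer optimizer sees (here: match a target action)."""
    return (policy(theta, observation) - target_action) ** 2

def grad(theta, observation, target_action, eps=1e-5):
    """Finite-difference gradient: the base optimizer's only lever is theta."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (base_loss(theta + d, observation, target_action)
                - base_loss(theta - d, observation, target_action)) / (2 * eps)
    return g

# Outer loop: gradient descent updates theta based on the base loss.
for step in range(200):
    obs = rng.normal(size=3)
    target = np.tanh(obs.sum())      # stand-in for "what we intended"
    theta -= 0.1 * grad(theta, obs, target)

# At deployment the outer loop is gone; only the policy (the mesa level) remains.
print(policy(theta, np.ones(3)))
```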

Deceptive alignment isn’t a big problem in current systems simply because of the way they are trained. However, it will become one as models become more ‘agentic’. The best way to understand how large transformer models are trained is through Yann LeCun’s cake analogy. The bulk of the cake is self-supervised learning, analogous to the pre-training of a model, and this is where most of the learning signal comes from. The icing on the cake is supervised learning, analogous to fine-tuning. Finally, the cherry on top is pure reinforcement learning. This sort of training recipe is unlikely to create deceptive alignment.
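
A rough sketch of that recipe follows; the stage names, data sizes, and bookkeeping are invented purely for illustration, and the point is only the ordering and the relative share of each stage.

```python
# Sketch of the "cake" training recipe described above. Stage names, data
# sizes, and the counters are hypothetical and purely illustrative.

def pretrain(model, corpus):
    """Self-supervised stage (the bulk of the cake): next-token prediction on unlabeled text."""
    model["pretrain_examples"] = len(corpus)
    return model

def finetune(model, labeled_pairs):
    """Supervised stage (the icing): a much smaller set of (prompt, response) pairs."""
    model["finetune_examples"] = len(labeled_pairs)
    return model

def rl_stage(model, episodes):
    """RL stage (the cherry): the only stage driven by an explicit reward signal."""
    model["rl_episodes"] = episodes
    return model

model = {}
model = pretrain(model, ["unlabeled document"] * 1_000_000)  # most of the data
model = finetune(model, [("prompt", "response")] * 10_000)   # far less
model = rl_stage(model, episodes=1_000)                      # the smallest slice
print(model)
```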

Deceptive alignment typically requires a model to have certain characteristics. First, the model must have some goal or set of goals. Moreover, the model must optimize over long-term goals, or at least its current behavior must be influenced by long-term goals. Finally, the optimizer must understand both the base goal and the situation it is currently in.

When these heuristics are applied to current transformer models, it’s unlikely that deceptive alignment arises. Since the heart of such models is pre-training, it’s unlikely that they develop long-term goals. The output of self-attention is calculated over the entire input sequence rather than sequentially, and the loss at each position is calculated independently, with the total loss simply being their sum. Moreover, since the entire pre-training objective is next-token prediction, it doesn’t make sense for gradient descent to ‘look into the future’ at the cost of current performance. Fine-tuning is similarly constrained: the model doesn’t have enough room for exploration. In fine-tuning, the optimizer’s goal is to produce actions that lead to a high reward, but since the span of that goal is so short, it’s unlikely that the model develops deceptive alignment, for the same reasons as in pre-training.
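
A small numerical sketch of this point, using made-up token probabilities: the pre-training loss is just a sum of per-position next-token cross-entropies, so there is no term through which sacrificing the current token could be rewarded later.

```python
import math

# Sketch of why pre-training gives no incentive for long-horizon planning
# (toy probabilities, purely illustrative): the loss is a sum of per-position
# next-token cross-entropies. Each term depends only on the model's probability
# for the token at that position, so lowering the current term can never be
# "paid back" by some later term.

def cross_entropy(prob_of_correct_token: float) -> float:
    """Per-position loss: negative log-probability of the observed next token."""
    return -math.log(prob_of_correct_token)

# Hypothetical probabilities the model assigns to the correct next token at each position.
per_position_probs = [0.9, 0.4, 0.7, 0.95]

per_position_losses = [cross_entropy(p) for p in per_position_probs]
total_loss = sum(per_position_losses)     # additive: just the sum of independent terms
print(per_position_losses, total_loss)

# Gradient descent pushes each per-position probability up independently; there
# is no cross-term through which a "plan" spanning positions could be rewarded.
```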

The big existential-risk argument with inner alignment, however, is that as these models develop more ‘agentic’ behavior, deceptive alignment will follow. As reinforcement learning begins to play a larger part in large models, deceptive alignment will grow as a problem. We must attempt to understand and mitigate these risks before these systems reach full scale.