Is Agency Identifiable?
Identifiability in IRL
One of my favorite papers is this one, titled "Occam's razor is insufficient to infer the preferences of irrational agents". It relates to an area of machine learning called inverse reinforcement learning. Inverse reinforcement learning (IRL) looks at the behavior of some agent on a task or in some environment and asks whether we can infer the agent's goals from that behavior. Assuming the agent acts optimally according to its goals helps a lot with solving this problem. If the agent is "irrational" to some degree and therefore doesn't always act optimally, the agent's goals may not be identifiable. Non-identifiability means (roughly speaking) that no matter how much data we collect, we might not be able to answer a certain question, because there are multiple explanations for the same data (here, the agent's behavior).
If a possibly irrational agent acts in a certain way, we can't tell whether it acts that way because it is the best way to achieve the agent's goal, or because the agent is acting contrary to its interests due to a decision-making error. We can think of the agent's behavior (given the symbol π for "policy") as having two parts: the goal (denoted R for "reward") and a planning algorithm used to maximize that reward (call it p for "planner"). We can then write the relationship between these three as π = p(R). The agent's behavior (π) is the result of applying its decision-making process (p) to its goal (R). If the agent is "irrational", meaning we can't assume p always chooses the action that maximizes R, then a given π (which is the only thing we can observe) can be explained by many different combinations of p and R, hence the non-identifiability.
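To make this concrete, here is a minimal sketch of the π = p(R) ambiguity in a toy setting. It's my own illustration, not a construction from the paper: the reward values are random, and the two planners are deliberately extreme (one fully rational, one fully anti-rational). The point is just that two very different (planner, reward) pairs can produce exactly the same observable policy.

```python
# A toy sketch of the pi = p(R) non-identifiability: two different
# (planner, reward) pairs that yield the same observable behavior.
# The rewards and planners here are made up for illustration.

import numpy as np

states, actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical reward: one value per (state, action) pair.
R = rng.normal(size=(states, actions))

def rational_planner(reward):
    """Pick the reward-maximizing action in each state."""
    return reward.argmax(axis=1)

def antirational_planner(reward):
    """Pick the reward-minimizing action in each state."""
    return reward.argmin(axis=1)

# Two different decompositions of the same behavior...
pi_1 = rational_planner(R)       # "wants R and plans well"
pi_2 = antirational_planner(-R)  # "wants -R and plans as badly as possible"

# ...produce identical policies.
print(pi_1, pi_2, np.array_equal(pi_1, pi_2))  # True
```

Nothing in the observed policy alone distinguishes "wants R and plans well" from "wants -R and plans as badly as possible".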
The paper takes these results one step further. We might hope to resolve this issue with a kind of Occam's razor approach: what if we assume the correct p and R are the "simplest" ones? The paper shows that this approach isn't going to save us. This means that if we want to infer an agent's goals or intentions from its actions, we may need to rely on assumptions that aren't as neutral-seeming as the simplicity idea. From the paper:
So, although current IRL methods can perform well on many well-specified problems, they are fundamentally and philosophically incapable of establishing a 'reasonable' reward function for the human, no matter how powerful they become. In order to do this, they will need to build in 'normative assumptions': key assumptions about the reward function and/or planner, that cannot be deduced from observations, and allow the algorithm to focus on good ways of decomposing the human policy.
One interesting application of this is to take humans as the agent. Think of the saying sometimes referred to as Hanlon's razor:
Never attribute to malice that which is adequately explained by stupidity.
This is an example of a "normative assumption" that could be used to resolve some of the above-mentioned non-identifiability, and it may be one that people actually use when confronted with this problem. After all, people seem to somewhat reasonably infer the goals and intentions of other people, which suggests that there is a way to do this effectively. From the paper again:
How can we reconcile our results with the fact that humans routinely make judgments about the preferences and irrationality of others? And, that these judgments are often correlated from human to human? After all, No Free Lunch applies to human as well as artificial agents. Our result shows that they must be using shared priors, beyond simplicity, that are not learned from observations. We call these normative assumptions because they encode beliefs about which reward functions are more likely and what constitutes approximately rational behavior.
I think there are a lot of interesting ways that we could apply this idea in terms of understanding interactions between people (as in the Hanlon's razor example), but for the rest of this post I want to focus on how this idea relates to artificial agents[1], and what that could mean for the risks posed by those agents.
Identifiability in learned models
Reinforcement learning (the RL in the IRL mentioned above) is a branch of machine learning that aims to take a goal on a task (e.g. in chess, the goal is to win the game) and train a model to achieve that goal (train a computer program to play chess and win most of the time). Let's call the model that we obtain through some training process f.
Imagine that someone else just hands you some model (f(x) with inputs x), and you have no idea what goal or task it was trained on, how it was trained, or what data was used to train it. They just hand you a model, and you want to understand it. You do a bunch of complicated analysis and determine that you can split the model into two parts, such that f(x) = g(h(x)). If g looks kind of "planning-like" and h looks kind of "reward-like", can we say that the model has "agency"[2], and that its "intentions" are related to h? What if we can also explain the model as f(x) = c(d(x))? If we accept that there may be identifiability issues[3], it's not unreasonable to think we can do both. What if one split looks like it "has agency", but the other doesn't? What if they both look like they have "agency", but with different goals? What if every model has some version of this that looks like it has "agency"? What if we ask what goal the model was trained for, are told it was trained to do well on R, but our analysis suggests it's "trying" to achieve R'? Does that mean the training didn't work and the model is bad at R, or does it just mean that we can explain the model as aiming for both R and R' due to non-identifiability?
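As a toy illustration of the kind of ambiguity I'm speculating about, here is a sketch of one model f admitting two decompositions, f(x) = g(h(x)) and f(x) = c(d(x)), that agree on every input even though the intermediate pieces differ. Everything here is invented for illustration (random weights, and an arbitrary invertible "change of basis" as a cheap way to manufacture a second split); it isn't meant to stand in for any real analysis technique.

```python
# A toy sketch: the same end-to-end model f can be split as g(h(x)) or
# as c(d(x)), with different-looking intermediate representations.

import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

def f(x):
    """The model we are handed: inputs to outputs, no other information."""
    return W2 @ np.tanh(W1 @ x)

# Decomposition 1: h computes features, g scores them.
def h(x):
    return np.tanh(W1 @ x)

def g(z):
    return W2 @ z

# Decomposition 2: insert an arbitrary invertible transform M, so the
# intermediate representation (and any story we tell about it) changes.
M = rng.normal(size=(4, 4))
M_inv = np.linalg.inv(M)

def d(x):
    return M @ np.tanh(W1 @ x)

def c(z):
    return W2 @ (M_inv @ z)

x = rng.normal(size=3)
print(np.allclose(f(x), g(h(x))), np.allclose(f(x), c(d(x))))  # True True
```

Whether either intermediate representation deserves a label like "reward-like" is exactly the kind of question the decomposition itself doesn't settle.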
A very useful thing about working with machine learning models is that we can potentially "look inside" them. We can examine their parameters; we aren't stuck with only inputs and outputs, as we assumed in the IRL context at the beginning of this post. We can look at more than just the "behavior" of the model; we can look at its internal algorithm or "thought process". It's possible that being able to do this solves the identifiability issues altogether, and there is in fact an entire area of research, which many people are working on, around how to "look inside" models in order to understand them. I think this area is likely to be very important and to only get more important as machine learning systems become more powerful and play more of a role in society. But as far as I know there isn't a good theory about when this can be expected to work and produce a unique result versus when there could be multiple, potentially many, candidate explanations for how a model works (i.e. non-identifiability).
Implications Preview
I think these questions about identifiability raise some additional questions about the way people think about and discuss the risk of advanced machine learning systems, particularly around the idea of "agency". When people speak of an "agent", I think they often have in mind a system that can be viewed as planning toward or optimizing some reward or objective function, as distinct from one that can't. But if there are identifiability issues with determining whether a system can be decomposed like that, does that mean whether or not a system is "an agent" is also not identifiable? I'm still trying to develop my thinking around this, but I realized I was delaying publishing this post while obsessing over getting my own thoughts in order. Plenty of my ideas could easily be going off in the wrong direction, so I figured it makes more sense to just put some of these thoughts out there and work on the possible implications over time. That said, here's a bit of a preview of what I think those implications might be.
Technical
Some technical approaches to analyzing machine learning models may involve detecting certain properties of those models that are somewhat similar to "agency" or "agentic behavior". For example, an idea that has become prominent among researchers interested in AI safety/AI risk/AI alignment/your-preferred-term-here is mesa-optimization: the possibility that models trained by an optimization process will learn to perform optimization-like operations internally. As an example, here is a portion of the paper describing the distinction between "optimizers" and "non-optimizers" (internal citations omitted):
Arguably, any possible system has a behavioral objective—including bricks and bottle caps. However, for non-optimizers, the appropriate behavioral objective might just be "1 if the actions taken are those that are in fact taken by the system and 0 otherwise," and it is thus neither interesting nor useful to know that the system is acting to optimize this objective. For example, the behavioral objective “optimized” by a bottle cap is the objective of behaving like a bottle cap. However, if the system is an optimizer, then it is more likely that it will have a meaningful behavioral objective. That is, to the degree that a mesa-optimizer’s output is systematically selected to optimize its mesa-objective, its behavior may look more like coherent attempts to move the world in a particular direction.
My question is, why isn't "behave like a bottle cap" a valid objective? I definitely share the intuition that it isn't very optimizer-like, but that could just be because it's not a behavior that is particularly interesting to me. I don't have a principled reason that an optimizer can't have that as its optimization target. That's the paradox: any system can be modeled as an optimizer with some kind of goal (e.g. to behave exactly as it behaves), and probably more than one goal if we allow non-optimal behavior. The idea of mesa-optimization is all about identifying optimizers and their optimization targets, so it seems pretty important to this research agenda to address the identifiability issues that come up here.
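To spell out the degenerate objective the quoted passage describes, here is a toy sketch (my own construction) of the "behave exactly as you behave" reward: an indicator that is 1 when the system does what it in fact does and 0 otherwise. Under that reward, any fixed behavior at all comes out as optimal; the policy and state space below are made up.

```python
# A sketch of the degenerate "behavioral objective": for any observed policy,
# define a reward that is 1 exactly when the system does what it actually
# does. The observed behavior is then trivially "optimal".

import numpy as np

states, actions = 4, 3
rng = np.random.default_rng(2)

# Any fixed behavior at all -- think "bottle cap".
observed_policy = rng.integers(actions, size=states)

def degenerate_reward(state, action):
    """1 if the action is the one the system actually takes, else 0."""
    return 1.0 if action == observed_policy[state] else 0.0

# In every state, the observed action achieves the maximum possible reward.
optimal = all(
    degenerate_reward(s, observed_policy[s])
    >= max(degenerate_reward(s, a) for a in range(actions))
    for s in range(states)
)
print(optimal)  # True: any behavior is "optimal" for its own indicator reward
```

This is the formal sense in which any system has a behavioral objective; what's missing, to me, is a principled criterion for ruling objectives like this out.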
The paper does call out some identifiability issues, but I feel they aren't fully developed. If the factorization of a model into its inner and outer components within this framework isn't unique, how can we be so sure that speaking about things like inner vs outer alignment is even meaningful? If the data a model is trained on isn't sufficient to uniquely identify the intended behavior or reward function, to what extent is "deceptive alignment" actually deceptive, as opposed to simply mistaken? The difference between intentional and accidental deception hits on the exact distinction invoked by Hanlon's razor. My sense is there is a fruitful line of inquiry there that hasn't been pursued to its end.
Another paper that mentions identifiability but doesn't take it far enough, in my view, is this one on the phenomenon of "goal misgeneralization"[4]. If we take identifiability seriously, how do we differentiate "goal misgeneralization" from "policy misgeneralization", or from a simple lack of generalization to new environments? If we judge a model only on its outputs, without looking at its internals, how can we know that it is misgeneralizing goals rather than simply misgeneralizing? One framing I find helpful here is viewing things "from the perspective of the model", trying our best not to use our own prior knowledge about people and their likely goals. If a model is pursuing something that seems "goal-like" but isn't the intended goal, versus something that seems "non-goal-like", is that some fundamentally different property of the model, or are we just imposing our own priors about the environment? I'm not sure.
Conceptual
Beyond the technical aspects, I think the idea of "agency" has played a big role in conceptual arguments around AI risk. As I touched on in a previous post, I think arguments that strongly emphasize the "agentic" nature of future advanced ML systems resonate more with some people than others. If "agency" is an unclear and nebulous concept, it makes sense that people may have varying reactions to AI risk arguments, since they could be operating on different ideas of agency or different beliefs about how likely "agentic" behavior is to emerge in ML systems.
I think the role of "agency" is a big part of what makes these discussions seem like something out of a sci-fi novel; it immediately calls to mind fictional anthropomorphized AIs. But part of that may be a result of intuitions about agency that don't accurately reflect what the situation will look like with advanced ML systems. It could be that the idea of "agency" isn't actually required for the conclusion that AI poses substantial risks, but that it's difficult to articulate this version of the argument because of strong (but possibly misleading) intuitions around "agency". This is in some sense the inverse of the non-identifiability point: we can't necessarily infer goals from behaviors, and likewise, a lack of bad "goals" doesn't necessarily imply a lack of bad behavior (i.e. what the system actually does). I think developing arguments related to AI risk that don't lean as hard on "agency" (or perhaps use a different notion of "agency") could bring clarity to some of these issues, as well as present a case for concern that is more robust to different intuitions.
One application that combines the two, and which I think is a central target of the paper, is that if we want artificial agents to understand human goals or act in accordance with them, we may need to solve this problem. I think that is a very interesting and highly relevant application of these ideas to AI safety, but I'm going to focus on some different applications in this post. ↩︎
I'm using a lot of extremely annoying scare quotes here because I want to emphasize that I view these concepts as extremely fuzzy and based on rough analogies or intuitions. Applying those ideas to learned models creates some confusion about whether the intuitions work and are helpful, or whether they anthropomorphize too much. Part of the issue I'm trying to point out is that it's hard to tell which of those is the case. ↩︎
Note that I'm engaging in rank speculation here. I don't know whether an identifiability issue of this nature exists. The identifiability issues identified in the paper I mentioned don't necessarily imply (at least as far as I know) the identifiability issue I am speculating about here. But their existence raises questions for me about identifiability in similar or related areas. As I'm sure is apparent from this paragraph, I have a lot of questions, but I don't have any answers. ↩︎
Related paper here, which contains some examples that I personally think fit into the identifiability perspective here. ↩︎