The limits of black-box evaluations: two hypotheticals

A prominent approach to AI safety goes under the name of "evals" or "evaluations". These are a critical component of plans that various major labs have, such as Anthropic's responsible scaling policy or OpenAI's preparedness framework. At bottom, these approaches propose to monitor increasingly powerful models using evaluations, and then take some conditional action based on the results, such as implementing more stringent safeguards if an evaluation suggests a model presents a particular risk.
The easiest evaluations to implement (and from my perspective by far the most common) are "black-box" evaluations, meaning they simply evaluate the inputs to a model and their corresponding outputs, they don't "look inside" the model as part of the evaluation ("white-box"). Are black-box evaluations sufficient to achieve their purposes within these various AI safety frameworks? It's possible that they will be in practice, but here I argue that evaluations aren't sufficient in and of themselves without further assumptions to support them.
The great appeal of evaluations is that they are empirical in nature. Sure, there are many theoretical arguments around risks from advanced AI systems, but these are inherently speculative. If we can do scientific research on actual AI systems, we can use empirical evidence to resolve these theoretical disputes. I think if you follow the output from major labs, this idea is rather central to their perspective and approach. Why build a system that you think has a reasonably high chance of causing a catastrophic outcome for the entire world? The general answer from labs is essentially "because the only way to address the risks empirically is the build precursor systems for us to study". The hope is that while theoretical disagreements may remain intractable, empirical evidence can resolve different opinions or perspectives and get everyone on the same page about what is really need to ensure safety.
Below, I present two hypotheticals which I think demonstrate the limitations of this perspective. Specifically, I don't think evaluations are likely to resolve major disagreements without further theoretical advances. The reason is that the safety-relevant interpretation of empirical evaluations is itself dependent on (usually unstated) theoretical assumptions. As a result, people who disagree on theory are liable to disagree on the correct interpretation of various empirical results. I think the matches the discourse around many empirical papers that have been published on safety-related topics.
Instead of analyzing AI systems under my own unstated assumptions about how current systems work, I do the opposite here. I will present two hypotheticals that obviously aren't actually true of existing models, and them explain why black-box evaluations would fail to distinguish these hypothetical models from models the evaluations are designed to detect. My point isn't that these are realistic (they aren't) or that evaluations have zero value, but rather to argue that evaluations need to be coupled with theoretical assumptions that support them. If different people have different unstated assumptions, no wonder they interpret results differently! These unstated assumptions should be stated, and I think theoretical work is also needed to develop well-grounded assumptions as well.
Framework
Before I state my two examples, some framework. Although I think the analysis here applies to black-box evaluations in general, I think it will be helpful to fix and example use case in mind for these examples. I will consider so-called "capabilities evaluations", where the desire is to assess whether a does or does not possess the ability to perform a task or produce a particular type of information. Let's say the desire is to assess a model for its ability to assist with the construction of dangerous biological weapons. The procedure will involve providing a set of inputs (the evaluation set), running the model on these inputs to obtain their associated outputs, and then analyzing the inputs and outputs to see if they meet some criteria, in this case if the outputs are indicative of biological weapons development capabilities.
Let's imagine that we can break up the possible outputs for a given input into those which would indicate the presence of the capabilities ("positive" for the capability) and those which do not indicate the presence of the capability ("negative" for the capability). Across the entire evaluation set, we might then summarize the results[1] (for example taking a rate or average), and make a determination whether the evaluation shows the capabilities are present (positive result) or not (negative result).
I also assume that the desire is to generalize beyond the evaluation set. If we test a model on a given set of examples, that set isn't going to be exhaustive of the inputs we are interested in (for example, the model will likely face new unseen examples in deployment). The purpose of the evaluation is to predict out-of-distribution what the behavior of the model will be when it encounters inputs that aren't necessarily contained in the evaluation set.
Finally, remember that these examples are intended to be unrealistic. Some readers may be tempted to "fight the hypothetical" and claim that I'm being "unfair" with these examples. Again, my point isn't that evaluations aren't useful, it's that they require assumptions to support them. If you read one of these examples and think "well, I obviously know this doesn't apply to current models", ask yourself, is the reason you "know" this because of your experience with current models? Are those experiences "black-box" (not based on model internals)? If so, is the real reason you "know" things about current models strictly because of the empirical evidence, or does it depend on some unstated assumptions you're making? Now, on to the hypotheticals.
Look-up table (positive result)
Let's assume our evaluation obtains a positive result (i.e. it looks like the model does have dangerous capabilities related to biological weapons). Does this actually show that the model has these capabilities (meaning it can display them out-of-evaluation-set)? I claim that it does not, without further assumptions. A counter-example is a look-up table which has hard-coded entries for all inputs in the evaluation set, mapping to the exact outputs we have observed, and no other entries. Such a table would produce the exact results we observe, but lacks any capabilities on any input that is not in the evaluation set.
Evaluation-triggered sub-model (negative result)
Let's assume our evaluation obtains a negative result (i.e. it looks like the model does not have dangerous capabilities related to biological weapons). Does this actually show that the model lacks these capabilities (meaning it can't display them out-of-evaluation-set)? I claim that it does not, without further assumptions.
For this counter-example, I borrow some ideas from a concept that is called things like the treacherous/sharp left turn, deceptive alignment, scheming, alignment-faking and sabotage. Consider a model which is able to determine based on an input whether that input is part of the evaluation set. If an input is part of the evaluation set, the model sends that input to a sub-component[2] of itself to determine the output. This sub-component returns the exact output which we observe. If the overall model is able to use the targeted capabilities when it is not sending an input to the sub-component (which it does whenever the input is not in the evaluation set), then it has the target capabilities despite our evaluation suggesting it does not.
Conclusion
Without further assumptions, black-box evaluations which show that a property of a model holds on an evaluation do not show the evaluation holds on inputs outside that evaluation set.[3]
I mention this help make the example more concrete, but it isn't necessary. The hypotheticals I give allow for a given input to be paired with any possible output. No matter what configuration of the inputs and outputs we require to make a determination that a model has a given property, the examples I give can achieve it. ↩︎
The relation to "deceptive alignment" et.al. is when we assume that an overall deceptive model has a perfect replica of an "aligned" model as this sub-component. Such a model could appear perfectly "aligned" for any in-distribution definition of alignment while in evaluation, but could be highly misaligned once deployed. Thus, I claim that a central application of this hypothetical is evaluations which aim to detect these types of "scheming" behaviors. ↩︎
If you think everyone believes this and it adds nothing o the discussion, I encourage you to consider whether most research utilizing evaluations does one of these: 1) restricts conclusions solely to the inputs used for evaluation 2) states explicit assumptions and explains why these assumptions allow generalization to the relevant set of possible future inputs. Although I won't give my argument for this here, I will note that typical evidence people give for the claim that ML models "generalize well out-of-distribution" (good performance on data that is not IID with regard to the training data) would not be sufficient to meet #2 here. The basic idea is that this claim is itself a type of evaluation, and thus is subject to the same problems (training on D1 and showing it generalizes to D2 does not demonstrate generalization to D3, so unless your claim is restricted to the D1/D2 interaction you have the same problem). ↩︎