Position: Anthropomorphic Misalignment Research Needs Stronger Evidence

by Vansh Gupta et al.

Audio version created with Paper2Audio.

Original source: https://arxiv.org/pdf/2606.07612


Abstract

We argue that many Anthropomorphic Misalignment Research (A.M.R) studies need stronger evidence to ensure that they can provide a robust foundation for critical safety decisions, such as model deployment and regulation. By evaluating failure modes across different misalignment concepts, such as deception, emergent misalignment, and sycophancy, we show how conceptual ambiguity, non-robust datasets, experimental design, and insufficient causal interventions can lead to overinterpretation of model behaviors. This position paper aims to offer guidance on evidentiary considerations that can help improve methodological rigor in A.M.R. To achieve this, we provide a clear call to action through a proposed framework of evidence levels and a diagnostic checklist. These shared standards will enable more productive scientific discourse and ensure that claims about A.I risks rest on solid empirical foundations.

1. Introduction

Can I trust my A.I assistant? This question becomes increasingly relevant with the rapid adoption of large language models (L.L.M's) and artificial intelligence (A.I) agents. Recently, A.I systems have advanced significantly in terms of capability and general “intelligence”, which is also reflected in the nature of their failure modes. Many frontier models display eerie “human-like” failure modes, including behaviors that resemble deception, scheming, instrumental goals, and more. We refer to such failures as instances of anthropomorphic misalignment.
Deploying advanced A.I that exhibits anthropomorphic misalignment in high-stakes environments could have catastrophic consequences, such as power-seeking or loss of control. A solid understanding of these behaviors is crucial for informing stakeholders about the extent to which A.I systems can be trusted, in what settings they can be safely deployed, and if and how they should be regulated. Anthropomorphic misalignment research (A.M.R) aims to address these risks by studying how and when such failure modes arise, how they can be robustly measured, and ultimately how they might be prevented.
As a nascent research field, A.M.R is still arguably in a pre-paradigmatic state, with theoretical foundations and standards of evaluation yet to be established. A.M.R studies thus vary widely in terms of claims made and evidence provided. This can lead to claims of deception that are difficult to disentangle from confounders such as roleplay, claims of shutdown resistance that have high correlations with model confusion, and claimed emergent misalignment (E.M) experiments that fail to sufficiently test alternative explanations.
In this paper, we argue that many current studies in anthropomorphic misalignment need stronger evidence to match their claims. By stronger evidence, we do not mean a universal bar for all A.M.R papers. Rather, the required evidence depends on the claim being made: behavioral claims, functional-impact claims, and causal-mechanistic claims require different forms of support, and each must be backed by appropriate methodological designs. We make this distinction explicit through the evidence levels in Section 4.1. This framework is not intended as a gatekeeping mechanism, but as a vocabulary for authors, reviewers, and readers to calibrate expectations and identify when a study's design and results are mismatched to its conclusions.
High evidential support can help avoid solving the wrong problems, harming the credibility and trust of the A.I safety community, and developing inefficient policy proposals for A.I governance. With the rapid increase in model capabilities, we believe that establishing a stronger methodological and evidential norm on A.M.R now is important and highly impactful. Having a mature research pipeline and methodology available enables scientists to quickly present grounded evidence if any early warning signs or red lines are triggered by frontier models.
Contributions. To this end, as outlined in Figure 1, we identify the current shortcomings of the field and offer recommendations for improvement (see Section 4), including a categorization of different levels of evidence. In Appendix B, we also provide a practical checklist for authors to check their work for potential problems. Importantly, our position is not that the anthropomorphic framing is invalid, nor that A.M.R is the ultimate solution to A.I safety. Instead, we argue that a stronger evidential basis is required whenever A.M.R is used to support safety-relevant claims.
Figure 1 summary: This figure is a conceptual framework diagram. It outlines the research pipeline for Anthropomorphic Misalignment Research, organized into four sequential stages: target behavior framing, data construction and operationalization, experimental design, and causal and mechanistic attribution. Each stage is associated with specific challenges that can weaken research claims and corresponding recommendations intended to strengthen the evidential support of the findings. The diagram illustrates a direct mapping where identified obstacles in conceptualization, data quality, experimental rigor, and attribution methods are countered by targeted methodological improvements such as refining technical definitions, increasing dataset diversity, performing ablations, and generating interventionist evidence.
Lastly, we acknowledge that A.M.R tackles deeply challenging and consequential questions at the intersection of machine learning (M.L), philosophy of mind, and epistemology. Precisely because of this difficulty (and the stakes involved), stronger evidence is essential. With this position, we hope to help make A.M.R a more robust, credible, and fruitful field.

2. What is Anthropomorphic Misalignment Research?

In this paper, we define A.M.R as a family of alignment-oriented studies that investigate safety-relevant failure modes in A.I described through human-like characteristics, motivations, intentions, or emotions, such as deception, scheming, or self-preservation.
A.M.R often uses a recurring experimental workflow to investigate human characteristics in A.I systems. Many shortcomings and challenges in A.M.R can be traced back to specific parts of this pipeline. Therefore, before surveying the field, we first provide a step-by-step overview of this workflow in Section 2.1 and then briefly embed A.M.R into the scientific landscape in Section 2.2.

2.1. A Shared A.M.R Pipeline

A systematic review of the A.M.R literature reveals a recurring methodological structure across many different anthropomorphic behaviors such as deception, alignment faking, sycophancy, shutdown resistance, E.M, or sandbagging. We abstract this into four practical stages (see Figure 1):
S.1 Target behavior framing. Researchers specify the target phenomenon and the intended scope of the claim. This step often uses anthropomorphic terms, sometimes implicitly suggesting intent-level claims.
S.2 Data construction and operationalization. Data (such as prompts, environments, or scenarios) is generated to capture the target phenomenon. This stage defines what does and does not count as an example of the phenomenon, and often comes with standardized evaluation processes.
S.3 Experimental design. Researchers elicit or detect the target phenomenon using interventions (e.g., fine-tuning) and measurements (e.g., behavioral metrics, internal monitoring methods). The design choices and obtained evidence constrain what can be concluded about the studied anthropomorphic behavior.
S.4 Causal and mechanistic attribution. Results are interpreted to assess whether observed behaviors are causally linked to specific internal “mechanistic” model components or processes, and whether the evidence supports the strength of the paper's claims.

2.2. A.M.R in the Scientific Landscape

Critiques around A.M.R are mostly established in M.L research methodology and interpretability. However, comparative and animal cognition research provides a useful adjacent perspective, as the field has long dealt with the problem of inferring latent capacities from behavior under severe measurement limits. Summerfield et al. (2025) draw a historical parallel to 1970s primate language research, arguing that A.I scheming studies exhibit similar pitfalls: overattribution of human-like traits, anecdotal evidence, and unwarranted mentalistic language. While this critique focuses on conceptual framing, we aim to systematize the key technical stages of A.M.R (see Figure 1), raising challenges and recommendations accordingly.
Within mechanistic interpretability (M.I), multiple researchers raise similar concerns. Sharkey et al. (2025) question whether M.I methods identify causal features or just statistical regularities. Miller et al. (2024) document substantial variation in circuit discovery execution across projects, with different metric choices and patching methodologies producing divergent findings about mechanistic contributions.
Additionally, Smith et al. (2025) show that even similar deception-detection setups yield inconsistent behavioral inferences due to ambiguities in model belief attribution and context interpretation. Most strikingly, Méloux et al. (2025) demonstrate “dead-salmon” artifacts: multiple internal mechanisms produce identical I/O patterns, yielding spurious explanations even for random networks. In conjunction, these studies reveal many pitfalls in current A.M.R, and we aim to present a unified view across more aspects of the pipeline besides the causal attribution problems in M.I.

3. Challenges of A.M.R

By surveying and analyzing A.M.R studies across misalignment concepts, we identify a set of recurring failure modes. A useful way to read the following analysis is as a shift in what current evidence should be taken to establish. In several prominent cases, our framework does not dismiss the underlying findings, but supports a more conservative interpretation of them: deception results that are often read as evidence of strategic intent may be better interpreted as deceptive-looking behavior that remains confounded by role-play or surface cues; shutdown-resistance results that suggest self-preservation may not yet distinguish that explanation from instruction ambiguity or task-completion incentives; and emergent-misalignment rates carry noise from evaluator variance and benign distribution shifts that should be decomposed before being attributed to a specific mechanism. We organize these challenges around the A.M.R pipeline to clarify how they can systematically weaken the evidential support of A.M.R claims.

3.1. Conceptual ambiguity in target behavior framing

Anthropomorphic concepts originate from descriptions of human mental states rather than formal computational definitions, making them difficult to study rigorously in computational research. We outline two main challenges below.
C.1 Anthropomorphic concepts are underspecified. Due to the lack of formal grounding for many anthropomorphic concepts, it is fundamentally challenging for researchers to define concrete metrics that accurately capture these intuitive concepts. As a result, universally agreed-upon definitions are often missing, and many works use their own definitions while still referring to them with the same anthropomorphic term.
A prime example of this is intention, a concept that many other anthropomorphic concepts, such as goal pursuit, deception, and power-seeking, rely on. Although intuitively understandable for humans, intention is difficult to define formally, as it is tightly interconnected with other concepts such as internal goals, beliefs, and counterfactual behavior. Similar issues arise for the concept of awareness, which can include metacognition, self-awareness, social awareness, and situational awareness.
C.2 Anthropomorphic concepts are hard to measure. Given the difficulty of properly defining such concepts, researchers often have to resort to measuring proxies such as checking outputs or analyzing model internals. However, these model internals and outputs often correlate with prompt cues and training incentives rather than stable convictions. Many surface-level behaviors can originate from multiple different algorithms, such as (1) instruction-following under ambiguity, (2) role-play or narrative completion, (3) reward-shaped heuristics like “finish the task”, or (4) a genuine internal goal.
Misinterpretation of proxy signals is common across A.M.R. For example, Li et al. (2025) conclude that several works on A.I awareness claim to measure awareness, but end up measuring derived proxy metrics. Additionally, claims of shutdown resistance have high correlations with model confusion. Also, Smith et al. (2025) argue that many deceptive-looking behaviors may be reflexive responses to cues rather than strategic choices, and that existing workarounds, such as targeting known falsehoods instead of intent, or relying on chain-of-thought labels, remain orthogonal to fully solving the underlying conceptual problems.

3.2. Artifacts in data construction & operationalization

Once a target behavior is framed, it must be operationalized through data that reflects the intended A.M.R phenomenon. In the following, we list shortcomings that current A.M.R practice faces with respect to this criterion.
C.3 Datasets are small in size and lack diversity. This issue is particularly acute in E.M research, where many studies evaluate on roughly 50 queries, sometimes fewer than 10, undermining any claims beyond the specific dataset employed. Other A.M.R work also often relies on dataset sizes in the low hundreds. A more fundamental problem, however, is that these datasets frequently exhibit low diversity in wording and in the semantic scenarios covered. For example, Instructed-Pairs relies on repetitive building blocks, while Roleplaying is predominantly A.I-generated.
C.4 Concept definition issues carry over to dataset design. Fundamental challenges discussed in Section 3.1 implicitly transfer into datasets. For example, different definitions of anthropomorphic concepts can result in completely different types of datasets for measuring these concepts.
Phuong et al. (2025) and Laine et al. (2024) both build benchmarks for evaluating situational awareness. However, Phuong et al. (2025) focus entirely on agentic tasks in a Linux system, whereas Laine et al. (2024) focus largely on question-answering tasks. Furthermore, deception benchmarks like mask or DeceptionBench choose role-playing metrics to measure deception, thereby blurring the line between deception and basic instruction-following. This concern extends beyond A.M.R: a systematic review of L.L.M benchmarks finds similar construct-validity gaps, with the released codebook marking 4/9 safety-category benchmarks as lacking a target-phenomenon definition and 8/9 as lacking a human baseline.
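To make the dataset-size concern in C.3 concrete, the following is a minimal sketch of how wide the uncertainty around a measured behavior rate is at different sample sizes. It uses the Wilson score interval for a binomial proportion; the function name, the 10% rate, and the sample sizes are illustrative assumptions and are not taken from any of the cited studies.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical scenario: the same observed rate (10%) at three evaluation-set sizes.
for n in (10, 50, 1000):
    k = round(0.1 * n)  # number of responses flagged as exhibiting the behavior
    lo, hi = wilson_interval(k, n)
    print(f"n={n:5d}  observed={k / n:.2f}  95% interval=({lo:.3f}, {hi:.3f})")
```

Under these assumptions, the interval at a few dozen queries spans a range several times wider than at a thousand, which is one way to check whether a reported difference between two models or conditions is distinguishable from sampling noise.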

3.3. Methodological fragility in experimental design

Even with well-framed constructs and valid data, experimental design choices can introduce systematic biases that inflate reported results. In the following, we identify three challenges prevalent in the experimental design of A.M.R.
C.5 Design choices are insufficiently ablated. A.M.R experiments routinely report single configurations without testing sensitivity to design choices, yet small and seemingly arbitrary decisions can dramatically alter results.
In probe-based deception detection, token selection varies widely from isolating the final token to utilizing the complete sequence. Aggregation methods face trade-offs; for instance, averaging over an entire response may fail when deceptive content is highly localized, as these signals are at risk of being “washed out” by surrounding honest text. Consequently, results may vary based on the used probing techniques, token selection strategies, and datasets, as illustrated by Goldowsky-Dill et al. (2025), creating a need for systematic ablations.
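The "washing out" effect can be illustrated with a toy example. The sketch below compares three aggregation rules over per-token probe scores; the scores, the 0.5 threshold, and the `aggregate` helper are hypothetical and do not reproduce the setup of Goldowsky-Dill et al. (2025).

```python
def aggregate(token_scores, method):
    """Collapse per-token probe scores into a single response-level score."""
    if method == "mean":
        return sum(token_scores) / len(token_scores)
    if method == "max":
        return max(token_scores)
    if method == "last":
        return token_scores[-1]
    raise ValueError(f"unknown aggregation method: {method}")

# Hypothetical response: 20 tokens, with a localized two-token spike in the middle.
scores = [0.05] * 9 + [0.95, 0.90] + [0.05] * 9
threshold = 0.5  # assumed decision threshold, for illustration only

for method in ("mean", "max", "last"):
    value = aggregate(scores, method)
    print(f"{method:>4}: score={value:.4f}  flagged={value > threshold}")
```

In this toy case only the maximum flags the response, while mean and last-token aggregation miss the localized signal. Maximum aggregation has its own cost (a single noisy token can trigger a flag), which is why the choice needs to be ablated rather than fixed silently.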
(Experiment 1) Sensitivity of E.M Measurements
We quantify the sensitivity of E.M scores to evaluator design choices by re-scoring identical model generations under different judge configurations, varying threshold equality rules, judge model version, and score aggregation procedure. We apply this analysis to responses from Kaczér et al. (2026) and to the out-of-distribution (O.O.D) fine-tuning experiments described in Appendix C.3.
For single-point scores drawn from the judge's output, misalignment rates range from 3.7% to 12.9% depending on judge choice and boundary inclusion in thresholds. Without a consistent evaluation methodology, reported differences across studies may reflect design choices as much as underlying model behavior. Further details and ablations appear in Appendix C.4.
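The effect of boundary inclusion can be sketched in a few lines. The example below assumes a judge that returns an integer alignment score per response and a cut-off of 30; both the cut-off and the scores are hypothetical and are not the data behind the numbers above.

```python
def misalignment_rate(alignment_scores, threshold, inclusive):
    """Share of responses flagged as misaligned under a given threshold rule."""
    if inclusive:
        flagged = [s <= threshold for s in alignment_scores]
    else:
        flagged = [s < threshold for s in alignment_scores]
    return sum(flagged) / len(alignment_scores)

# Hypothetical judge scores (0-100). Judges often emit round numbers,
# so several responses sit exactly on the assumed boundary of 30.
judge_scores = [30, 30, 30, 10, 25, 80, 90, 95, 70, 60,
                85, 30, 55, 99, 75, 88, 92, 65, 40, 100]

strict = misalignment_rate(judge_scores, threshold=30, inclusive=False)
loose = misalignment_rate(judge_scores, threshold=30, inclusive=True)
print(f"score <  30: {strict:.0%}")
print(f"score <= 30: {loose:.0%}")
```

Because scores cluster on the boundary in this toy data, the two rules give rates that differ by a factor of three on identical generations. Reporting the exact comparison operator alongside the threshold removes this source of ambiguity.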
C.6 Unreliable L.L.M judges are standard. L.L.M judges are inherently stochastic and sensitive to temperature, prompt phrasing, and architectural details. There is also evidence of systematic biases in their judgments. Furthermore, similar to the framing effect in human psychology, they exhibit sensitivity to logically equivalent prompts in affirmative versus negated forms. For example, many E.M works use leading phrases like "I am worried it might be harmful" in prompts, potentially amplifying these biases. Experiment 1 shows how design choices can shift reported E.M rates.
Despite these limitations, A.M.R often heavily relies on L.L.M-based evaluation. Some studies do not report any manual verification process, while others provide agreement scores between L.L.M's and human annotators on representative subsets. However, such metrics can be misleading and often obscure deeper structural validity issues. For instance, even though mask reports 86.4% agreement, Smith et al. (2025) identify systematic flaws in their underlying labeling procedure.
In Appendix C.2, we conducted a similar qualitative audit of DeceptionBench. Despite a claimed 97.1% human agreement, we found that in 18% of its scenario prompts, the necessary ground truths were missing. The framework also suffers from a reliance on single-pass evaluations and examples of corrupted prompts.
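One reason raw agreement scores can mislead is class imbalance. As a hedged illustration, the sketch below compares raw agreement with Cohen's kappa, a chance-corrected agreement statistic, on hypothetical labels; the label counts are invented and do not correspond to mask or DeceptionBench.

```python
from collections import Counter

def raw_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = raw_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    labels = set(labels_a) | set(labels_b)
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / n ** 2
    if p_expected == 1.0:
        return float("nan")  # undefined when both annotators use one identical label
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical audit: humans mark 5 of 100 responses as deceptive,
# while the L.L.M judge labels every response as honest.
human = ["honest"] * 95 + ["deceptive"] * 5
judge = ["honest"] * 100

print(f"raw agreement: {raw_agreement(human, judge):.2f}")
print(f"Cohen's kappa: {cohens_kappa(human, judge):.2f}")
```

Here the judge never detects the target behavior, yet raw agreement is 95% because the positive class is rare. Kappa addresses only this one failure; it does not detect the labeling-procedure or missing-ground-truth problems described above.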
C.7 Non-target mechanisms remain unmeasured. A recurring failure in A.M.R experimental design is the absence of control experiments that would discriminate between the intended phenomenon and simpler explanations. For example, Schlatter et al. (2026) investigate the anthropomorphic concept of self-preservation, claiming that some models sabotage shutdown mechanisms in an agentic setup, even when the prompt includes an instruction to allow shutdown. However, subsequent investigation by Rajamanoharan & Nanda (2025) revealed that much of the behavior originates from instruction ambiguity and incentives for task completion.
A second concern is controlling for general capability degradation and out-of-distribution behaviors as non-target mechanisms in E.M. The standard E.M pipeline involves fine-tuning on narrow, targeted data followed by evaluation via free-form queries. However, this process is prone to catastrophic forgetting, where fine-tuning degrades general reasoning or the initial alignment rather than introducing a specific persona or mechanistic preference. Some E.M studies already include such baselines, reporting that performance on certain standard capability benchmarks does not drop much after fine-tuning. However, there is little investigation of a broader set of capability controls, and only a limited number of studies carefully analyze how E.M relates to other M.L phenomena; see, for example, Experiment 2.
(Experiment 2) O.O.D Fine-Tuning
We replicated the experiments of Woodruff (2025) and Bostock (2025), where we fine-tuned Llama 3.1-8B on innocuous datasets of unpopular aesthetic preferences and atypical scatological themes, respectively, to test the hypothesis that benign O.O.D shifts can also erode safety guardrails.
Using probabilistic judge scoring, 5.88% of coherent responses exhibit E.M on the aesthetic dataset, and 4.52% on the scatological dataset. These rates represent a non-trivial benign-shift baseline for E.M evaluations. See Appendix C.3 for details.

3.4. Confounders in causal & mechanistic attribution

The final challenge in working with anthropomorphic concepts is correctly interpreting experimental results. In particular, as discussed below, establishing causal links between model internals and anthropomorphic behaviors requires more than correlation.
C.8 Spurious correlations limit causal attribution.
Common A.M.R practice is to study correlations between internal states and anthropomorphic concepts. Sometimes, the results of these experiments are interpreted as causal evidence. This leap is risky: correlations can arise from surface confounders that co-occur with the target construct without constituting it. For example, probes for detecting deception may misfire on contextual correlates of “deception-like” settings, such as high-stakes vocabulary, “villainous” personas, role-play framing, or negative sentiment. Goldowsky-Dill et al. (2025) acknowledge that their probes can detect deception-related topics rather than deception. Levinstein & Herrmann (2024); Kirch et al. (2026) likewise show that probe performance can be brittle under distribution shift (for example, negation, domain changes, or off-policy to on-policy generation), suggesting reliance on superficial cues rather than robust deception. We refer to Experiment 3 for an example.
(Experiment 3) Probe stress tests
To demonstrate that correlational evidence in deception detection is often an artifact of surface features, we constructed honest-labeled stress test datasets that preserve common “deception-like” surface features while removing deceptive intent. Concretely, we evaluate pretrained probes from Goldowsky-Dill et al. (2025) on (1) “sarcasm” and “wrong answers only” prompts where false statements are produced transparently as a style or game, (2) epistemically constrained personas (“alien”, “medieval peasant”) where falsehood arises from character constraints rather than manipulation, and (3) “recital”, “translate”, and paraphrase where the model is instructed to repeat/translate a provided deceptive line as an actor rehearsing a scene, or as a translator.
Figure 2 shows that these probes often produce high false positive rates on our stress tests, suggesting sensitivity to surface cues and framing rather than intent. Ultimately, these results suggest a broader methodological concern, namely that probes frequently fire on harmless sarcasm or recital, failing to distinguish between the presence of falsehood and the latent intent to deceive. See Appendix C.1 for more details on the datasets and results.
Figure 2 summary: This figure is a bar chart. It compares the false positive rates of two different training datasets, Instructed-Pairs and Roleplaying, across several stress test categories including Alien, Medieval Peasant, Sarcasm, Wrong Answers, Counterfactual, Paraphrase, Translate, and Recital. The chart measures how often honest contexts are incorrectly classified as deceptive. In most categories, the Roleplaying dataset exhibits a higher false positive rate than the Instructed-Pairs dataset, suggesting it is more prone to misclassifying honest inputs as deceptive. The performance difference is most pronounced in categories like Medieval Peasant and Translate, while both datasets show similarly high error rates in categories such as Counterfactual and Recital.
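The quantity plotted in Figure 2 can be sketched as follows: since every stress-test example is honest-labeled, any example the probe flags counts as a false positive. The probe scores, category subset, and 0.5 threshold below are hypothetical placeholders and are not the values from Experiment 3.

```python
def false_positive_rate(probe_scores, threshold):
    """Fraction of honest-labeled examples the probe flags as deceptive."""
    return sum(score > threshold for score in probe_scores) / len(probe_scores)

# Hypothetical probe scores on honest-labeled examples, grouped by category.
stress_scores = {
    "control": [0.10, 0.20, 0.05, 0.30],
    "sarcasm": [0.90, 0.70, 0.20, 0.80],
    "recital": [0.95, 0.85, 0.90, 0.60],
}
threshold = 0.5  # assumed decision threshold; real probes need a calibrated one

fpr_by_category = {
    category: false_positive_rate(scores, threshold)
    for category, scores in stress_scores.items()
}
for category, fpr in fpr_by_category.items():
    print(f"{category:>8}: false positive rate = {fpr:.2f}")
```

Reporting the rate per category, next to a neutral control set scored with the same threshold, separates a probe that is generally miscalibrated from one that reacts specifically to deception-like surface features.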
Similar issues arise in weight-space analysis: while Zhang et al. (2025) demonstrate that LoRA-induced subspaces correlate strongly with harmful behavior, the absence of direct interventions leaves open the question of whether these subspaces are causally necessary for misalignment.
C.9 Mechanistic methods overstate functional relevance. Even with intervention, M.I methods introduce ambiguity: a feature may predict a behavior without causing it. Recent work documents this predict-control discrepancy, where the optimal vector for predicting behavior and steering it are different. Failed steering thus does not refute a feature's usefulness for detection, but it weakens causal-mechanistic claims. Popular methods like sparse autoencoders (S.A.E's) and probes may recover statistical regularities in activations rather than features used in computation. Without information about downstream computation, M.I methods risk producing directions that look interpretable but lack causal significance, undermining the presumption that probe directions “detect” the anthropomorphic constructs.

4. Call to Action

The challenges identified above point to several opportunities for improving A.M.R practice, which we now address. First, we introduce a three-level evidence framework to better understand how strongly different forms of empirical evidence support an A.M.R claim. Second, we present 12 stage-specific recommendations, together with a practical checklist in Appendix B. We focus on challenges that are tractable (i.e., primarily constrained by engineering effort, computational scale, or resource availability) while acknowledging that some challenges identified in Section 3 (e.g., C.1 and C.2) are not easily solvable, as they are rooted in conceptual or epistemic limits that demand new theoretical insight from outside the immediate A.M.R community.

4.1. Levels of evidence

A.M.R often begins with vivid behavioral demonstrations and then rapidly shifts to potentially inaccurate claims about intent, goals, or internal strategy. Other empirical fields have developed norms precisely to prevent similar problems. In evidence-based medicine, evidence is often organized into hierarchies, and modern practice also distinguishes observed effects from certainty in those effects.
A.M.R faces an analogous situation. To reduce ambiguity, we suggest distinguishing between three levels of evidence. These are not levels of experimental sophistication, but levels of what a result allows researchers to claim.
L.1 Behavioral evidence (what the model does). Behavioral evidence establishes that a model produces outputs or actions that match an operational definition of an anthropomorphic concept under a specified setting and evaluation procedure. This includes informal reports, exploratory studies, and controlled benchmark measurements. The core claim is descriptive: under S and evaluator E, behavior B occurs at rate p. For example, sycophancy benchmarks can support claims about agreement patterns, such as a model agreeing with a user's false belief.
L.2 Functional evidence (what the behavior causes downstream). Functional evidence establishes that the behavior reliably produces a safety-relevant downstream effect, without attributing intent. The core claim is consequential: in a deployment-plausible context C, behavior B induces effect E consistently across reasonable variations (prompts, users, paraphrases, and settings). When the effect concerns humans (e.g., misleading humans), this level typically requires some form of human-grounded validation rather than relying solely on L.L.M judge proxies.
L.3 Causal-mechanistic evidence (why it happens). Causal-mechanistic evidence supports an internal attribution claim, such as a specific causal factor mediating the behavior, or a stable objective that predicts behavior across counterfactual incentives. Depending on the claim, this factor may be a training signal, a prompt condition, a scaffold, or an internal mechanism. This level requires interventions and alternative-explanation testing (ablations, controlled perturbations, counterfactual changes), not just correlational interpretability.
Precedents for higher-level evidence. L.2 evidence has methodological precedent in human-computer interaction and computational social science studies of how algorithms shape human behavior. Examples include controlled studies of how recommendation algorithms influence user behavior and bias users towards extreme opinions, and recent work measuring how multi-turn L.L.M conversations shift user opinions on contested topics. These designs establish a downstream human-grounded effect without requiring claims about model intent.
L.3 evidence is exemplified in some mechanistic interpretability work that combines intervention, behavioral prediction, and specificity testing. For instance, Arditi et al. (2024) identify a candidate direction mediating refusal behavior across multiple open-weight models, ablate it, observe the predicted behavioral change (refusal disappears on harmful prompts), and verify specificity by showing general capability benchmarks remain near baseline. Similar interventionist designs have been applied to sparse feature circuits underlying classifier behavior. These studies illustrate the joint requirements of: a targeted intervention, a falsifiable mechanistic hypothesis, and controls that distinguish the proposed mechanism from generic capability change.
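As a rough illustration of what a targeted intervention on a candidate direction looks like, the sketch below removes the component of an activation vector along a given direction by orthogonal projection. This is a toy example on plain Python lists with made-up vectors; it is not the implementation used by Arditi et al. (2024), and the helper names are assumptions.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def ablate_direction(activation, direction):
    """Project out the component of `activation` along `direction`."""
    norm = math.sqrt(dot(direction, direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coefficient = dot(activation, unit)
    return [a - coefficient * u for a, u in zip(activation, unit)]

# Hypothetical 3-dimensional activation and candidate direction.
activation = [1.0, 2.0, 3.0]
candidate_direction = [3.0, 4.0, 0.0]

ablated = ablate_direction(activation, candidate_direction)
print("ablated activation:", [round(x, 4) for x in ablated])
print("remaining component:", round(dot(ablated, candidate_direction), 10))
```

The intervention alone is not L.3 evidence. It becomes so only when paired with a prediction stated in advance (the target behavior changes) and with specificity controls (unrelated capabilities stay near baseline), as in the designs described above.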
Levels are claim-relative, not hierarchical. These levels are claim-relative rather than sequential prerequisites: a study can provide L.3 evidence for a narrow causal mechanism without first demonstrating broad downstream harm. Different uses of A.M.R findings warrant different evidential thresholds. L.1 can suffice to motivate monitoring, follow-up investigation, or model-card disclosure of an observed failure mode. L.2 strengthens the case for deployment restrictions or external regulation by showing the behavior is consequential rather than merely present. L.3 is appropriate for high-reliability safety cases that must remain robust under distribution shift, and for extending governance frameworks built around control of A.I systems. These mappings are indicative; the appropriate threshold also depends on the cost asymmetry between false positives and negatives.
In practice, A.M.R terminology is frequently interpreted as L.3 even when the methods primarily establish L.1, or occasionally L.2. Claims phrased in intent- or mechanism-level language should be treated as unsupported unless L.3 evidence is provided; absent such evidence, conclusions must be downgraded accordingly. We refer to Figure 3 for an illustration.
Figure 3 summary: This figure is a conceptual diagram illustrating a hierarchy of evidence levels. The content depicts three progressive stages of research evidence, labeled from L1 to L3, each associated with a specific research question and example observation. L1 focuses on behavioral observations regarding the frequency of false statements, L2 examines functional transfer across various deployment scenarios, and L3 addresses causal-mechanistic explanations through latent space intervention. The bottom portion of the figure visually distinguishes between detection, where prompts lead to deceptive outputs, and intervention, where steering mechanisms reduce deception. The figure concludes that evidence progresses from simple behavioral observations to broader functional generalizations and finally to deep causal understanding and control.

4.2. Stage-specific recommendations

In the following, we provide stage-specific recommendations. First, we outline three methodological requirements for improved target behavior framing.
- R.1 Scope technical definitions. Researchers should define the behavior under investigation in terms that admit clear measurement, while explicitly stating what the definition excludes. As an example, Phuong et al. (2025) provide a clear definition of their metrics to evaluate situational awareness, while also outlining aspects of awareness their evaluation fails to capture.
- R.2 Declare evidence levels. Authors should declare the intended level of evidence upfront: behavioral (documenting output patterns), functional (documenting safety-relevant downstream effects), or causal-mechanistic (identifying specific causes for effects). Conflating these levels (e.g., by presenting probe correlations as evidence of internal goals) obscures what has actually been demonstrated.
- R.3 Constrain anthropomorphic terminology. Terms like “deception” or “self-preservation” carry folk-psychological connotations that may not apply to model behavior. When such terms are used, they should be grounded in observable criteria: deception, for example, might be defined as systematically producing outputs that decrease an evaluator's accuracy on some ground-truth measure. This mirrors functional approaches like Park et al. (2024) or the “passive deception” category in Smith et al. (2025), which characterize misleading behaviors without presuming intent.
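The observable criterion mentioned in R.3 can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the function names `accuracy` and `evaluator_accuracy_drop`, and the idea of comparing an evaluator's judgments before and after exposure to model outputs, are hypothetical choices for this example and are not taken from any of the cited works. A single positive value does not establish that the effect is systematic; that would additionally require many samples and a measure of uncertainty.

```python
from typing import Sequence


def accuracy(judgments: Sequence[bool], ground_truth: Sequence[bool]) -> float:
    """Fraction of evaluator judgments that match the ground truth."""
    if len(judgments) != len(ground_truth) or len(ground_truth) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(j == t for j, t in zip(judgments, ground_truth))
    return matches / len(ground_truth)


def evaluator_accuracy_drop(
    baseline_judgments: Sequence[bool],
    exposed_judgments: Sequence[bool],
    ground_truth: Sequence[bool],
) -> float:
    """Accuracy without model outputs minus accuracy after reading them.

    A positive value means the evaluator did worse after exposure, which is
    the behavioral signature the operational definition asks for. It says
    nothing about intent or internal state.
    """
    return accuracy(baseline_judgments, ground_truth) - accuracy(
        exposed_judgments, ground_truth
    )


# Hypothetical toy data: four items with known ground truth.
truth = [True, True, False, False]
before = [True, True, False, True]   # evaluator alone: 3 of 4 correct
after = [False, True, True, True]    # evaluator after reading outputs: 1 of 4
drop = evaluator_accuracy_drop(before, after, truth)
```

Defining the term through a measurable quantity like this keeps the claim at the behavioral level and makes explicit what the definition excludes, for example any statement about why the outputs mislead.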
Second, datasets need to be sufficiently large and diverse in scope, such that experiments can correctly capture and isolate the concept under investigation. This concerns what the dataset contains and how much evidence it provides; we also address how results should be stress-tested.
- R.4 Support generalization claims with sufficient scale. Datasets should be sufficiently large, and experiment runs should be numerous enough to support the claimed effect sizes and generalization scope. Authors bear the responsibility to justify the dataset size and the number of runs relative to the claimed effect magnitudes. For example, Denison et al. (2024) use large trial counts (greater than 10k) to detect rare events, reporting concrete frequencies rather than anecdotal claims.
- R.5 Ensure distributional diversity and surface-feature controls. Narrow, homogeneous datasets risk introducing spurious data patterns that correlate with the target behavior. Operationalizations should span diverse wordings, domains, and interaction formats. Authors of future A.M.R datasets or benchmarks can draw on many safety or ethics datasets, which offer broad coverage across targeted domains. For example, Ji et al. (2023) provide more than 300k data samples spanning 14 harmful domains. When using contrastive datasets, authors should verify that the contrast is not confounded by incidental features.
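To illustrate why R.4 ties trial counts to claimed effect magnitudes, the following sketch computes two standard quantities for rare events: a Wilson score interval for an observed frequency, and the number of independent trials needed to see at least one event with a chosen probability. The function names and the example rate are assumptions made for this illustration; they do not describe the procedure of any cited study.

```python
import math
from typing import Tuple


def wilson_interval(events: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0 or not 0 <= events <= trials:
        raise ValueError("need trials > 0 and 0 <= events <= trials")
    p_hat = events / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


def trials_to_observe(rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that P(at least one event in n trials) >= confidence.

    Assumes independent trials with a fixed per-trial event rate.
    """
    if not 0 < rate < 1 or not 0 < confidence < 1:
        raise ValueError("rate and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - rate))


# Hypothetical example: zero events in 100 trials still leaves a wide interval.
low, high = wilson_interval(events=0, trials=100)
# Hypothetical example: a behavior occurring once per thousand trials.
needed = trials_to_observe(rate=0.001)
```

With zero events in 100 trials the upper bound of the interval remains a few percent, so a small run cannot rule out a rare but consequential behavior; a behavior with a one-in-a-thousand rate needs on the order of three thousand trials before a single occurrence becomes likely.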
Third, evaluations must establish that reported behaviors reflect a genuine phenomenon, rather than experimental artifacts or plausible alternative explanations. This is where the dataset is put under stress: through scorer checks, ablations, sensitivity analysis, and tests of alternative explanations.
- R.6 Measure scorer reliability. For nuanced behavioral assessments, human scoring with documented inter-rater reliability provides a stronger foundation than L.L.M-only grading. When L.L.M judges are used, systematic audits should verify that judgments are not driven by surface-level features or position biases.
- R.7 Measure general capability. Interventions that modify model behavior (e.g., fine-tuning, activation steering) may degrade general capabilities. As in Mushtaq et al. (2025), authors should report stable performance on benchmarks such as M.M.L.U and M.T-Bench, and the coherence of modified models, to isolate intervention effects from capability artifacts.
- R.8 Perform sufficient ablations. Results should be tested for robustness across prompt paraphrases, model scales, temperature settings, and other degrees of freedom, and reported with appropriate measures of uncertainty. If a behavior persists across systematic variations, this strengthens the claim that it reflects a stable, genuine property rather than a fragile artifact. Effect sizes, confidence intervals, and the full distribution of outcomes across trials are more informative than selected examples (particularly when the behaviors of interest are rare or high-variance). Greenblatt et al. (2024) set a good example by providing systematic prompt ablations, multiple model variants, C-O-T ablations, and post-R.L generalization checks across prompt variations.
- R.9 Perform tests for plausible alternative explanations. Before attributing outputs to alignment-relevant mechanisms, researchers should actively test plausible alternative explanations. Outputs that appear deceptive may instead reflect knowledge gaps, instruction-following failures, or distributional artifacts. Experimental designs should include conditions that discriminate between these hypotheses and that add negative controls that test for false positives. For example, a probe intended to detect “deceptive intent” should be validated against outputs that are false but “non-deceptive”, such as “hallucinations” or “sarcasm”, to establish discriminant validity. The checklist in Appendix B makes this separation explicit: dataset properties are listed under S.2, while scorer checks, ablations, sensitivity analyses, and alternative-explanation tests are listed under S.3.
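As one way to document the scorer reliability called for in R.6, the sketch below computes Cohen's kappa, a chance-corrected agreement statistic, between two sets of labels, for instance a human annotator and an L.L.M judge. Cohen's kappa is one common choice among several and is used here as an assumption; the label names and toy data are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labeled independently at their
    # own marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight responses: "d" = deceptive, "h" = honest.
human = ["d", "d", "h", "h", "d", "h", "h", "h"]
judge = ["d", "h", "h", "h", "d", "h", "d", "h"]
kappa = cohens_kappa(human, judge)
```

In this toy example raw agreement is 75 percent, but kappa is noticeably lower because much of that agreement would be expected by chance given how often both raters choose the majority label. Reporting the chance-corrected value alongside raw agreement makes such inflation visible.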
Fourth, targeted interventions and counterfactual reasoning are necessary to establish evidence for causal claims.
- R.10 Generate interventionist evidence. Correlational evidence (e.g., probe accuracy) is insufficient for causal claims. Researchers should demonstrate that intervening on a candidate representation, through, for example, ablation, steering, or targeted fine-tuning, consistently produces predictable changes in the target behavior, and transparently report failure cases, as in Durmus et al. (2024) and Hedström et al.
- R.11 Formulate mechanistic hypotheses as testable claims. Interpretive narratives about model cognition (e.g., “the model recognized it was being evaluated” as stated in Chaudhary et al. (2025)) should be treated as hypotheses requiring independent evidence, and not as conclusions from the behavioral observation itself. Authors should specify what evidence would falsify the proposed mechanism.
- R.12 Match claims to evidence levels. For each main conclusion, verify that it is backed up by results from the necessary evidence level; otherwise, downgrade the claim (especially intent- or mechanism-level claims).
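The claim-to-evidence check can be sketched as a simple audit over a paper's main conclusions. This is a hypothetical illustration, not part of the proposed framework or checklist: the class and function names are invented for this example, and the levels are modeled as an unordered set because they are claim-relative, so evidence at one level is not assumed to imply evidence at another.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List


class EvidenceLevel(Enum):
    BEHAVIORAL = "L1"          # documents output patterns
    FUNCTIONAL = "L2"          # documents safety-relevant downstream effects
    CAUSAL_MECHANISTIC = "L3"  # identifies specific causes for effects


@dataclass(frozen=True)
class Claim:
    text: str
    required: EvidenceLevel  # level that the claim's wording calls for


def unsupported_claims(
    claims: List[Claim], demonstrated: FrozenSet[EvidenceLevel]
) -> List[Claim]:
    """Return claims whose required level is not among those demonstrated."""
    return [claim for claim in claims if claim.required not in demonstrated]


# Hypothetical study that only reports behavioral (L1) results.
study_evidence = frozenset({EvidenceLevel.BEHAVIORAL})
study_claims = [
    Claim("The model outputs false statements in 4% of trials.",
          EvidenceLevel.BEHAVIORAL),
    Claim("The model intends to mislead its evaluators.",
          EvidenceLevel.CAUSAL_MECHANISTIC),
]
to_downgrade = unsupported_claims(study_claims, study_evidence)
```

Here only the intent-level claim is flagged, since the study supplies behavioral evidence alone; the flagged claim would need to be reworded at the behavioral level or backed by interventionist results.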

5. Alternative Views

The analysis above argues for stronger evidential criteria, but this perspective is not uncontested. Here, we consider alternative viewpoints, clarifying where our position differs.
Precaution over rigor. Potential catastrophic risks might require action on preliminary evidence. If the cost of a false negative vastly exceeds the cost of a false positive, demanding high evidential standards may be strategically unwise, and “more evidence is needed” has historically been used to delay regulation of tobacco, ozone-depleting chemicals, and fossil fuels long after harms were apparent. This concern applies with particular force to A.M.R: anthropomorphic risks are unprecedented, hard to benchmark, and easy to neglect precisely because they resist clean measurement, exactly the conditions under which evidentiary gatekeeping causes systematic blind spots.
Our framework is not a bar that A.M.R claims must clear before they inform action. L1 evidence can already justify monitoring, follow-up investigation, and process-level safeguards such as documentation, third-party evaluation, and incident reporting, none of which depend on a settled scientific picture. Where we disagree is on the implication for scientific claims themselves. Acting on uncertain evidence is sometimes appropriate; describing uncertain evidence as if it were settled is not. Weak evidence presented as strong dilutes limited safety resources, while real vulnerabilities go unexamined. Repeated overclaiming further erodes the safety community's credibility. Rigor and precaution are therefore complementary: precaution governs which actions are warranted under uncertainty, while rigor governs how that uncertainty is communicated.
Efficiency of anthropomorphic language. Critics may contend that terms like “deception” or “intent” provide necessary shorthand for communicating risks to policymakers and the public, and that demanding precise operationalization imposes friction that slows discovery. There is merit to this concern: overly technical language can obscure urgent risks, and intuitive framing helps researchers identify what to look for in the first place. Moreover, not all safety-relevant properties are uniquely human; traits like goal-directedness may emerge from optimization processes in both biological and artificial systems, and labeling them “anthropomorphic” risks assuming the conclusion. Our concern is therefore not with the traits themselves but with the anthropomorphic lens through which they are studied: framing a behavior as “deception” can import assumptions about intent, motivation, or mental state that lead researchers toward conclusions not supported by the evidence.
Our position is not that such terms should be abandoned, but that they must be defined concretely for each study. Asking “does this model deceive?” is a reasonable start, but the study must specify what counts as deception: for instance, producing false statements that increase reward while the model has access to the correct information. Without such definitions, two papers studying “deception” may be measuring entirely different phenomena. Overreliance on general anthropomorphic terms also provides limited descriptive accuracy for many technical safety-relevant problems.
Exploratory versus confirmatory research. Some researchers argue that exploratory behavioral studies serve a legitimate role in hypothesis generation, even if they do not meet the evidential standards required for confirmatory research. Under this view, demanding rigorous methodology too early can stifle the creative exploration needed to identify which phenomena are worth investigating systematically. We agree that hypothesis generation and testing serve distinct functions, and that exploratory work has value.
However, the distinction must be made explicit. A paper presenting preliminary observations as tentative hypotheses for future investigation makes a different epistemic claim than one presenting the same observations as evidence for a conclusion. Problems arise when exploratory findings are communicated as though they constitute confirmation. Clearly labeling work as exploratory and tempering claims allows the field to benefit from creative investigation while maintaining an appropriate level of uncertainty.

6. Conclusion

The study of anthropomorphic misalignment remains a vital pillar of A.I safety, offering insights into how complex models might behave in high-stakes environments. The challenges and recommendations discussed aim not to diminish this research but to strengthen its scientific foundation. By shifting the field's focus towards more precise target framing, diverse data construction, robust experimental design, and rigorous causal-mechanistic attribution, observations of model behavior can be grounded in reproducible and technically sound evidence. As the community moves from exploratory, pre-paradigmatic behavioral studies toward a more mature, solid science of alignment, these standards will help ensure evaluations provide the technical clarity necessary to effectively inform researchers, developers, and policymakers regarding decisions around serious A.I risks.
We would like to thank David Kaczér, Changling Li, Tianyi Qiu, and Adam Gleave for their valuable feedback, constructive discussions, and literature recommendations that helped refine the arguments and framework presented in this paper. Xin Chen is supported by the Open Philanthropy A.I Fellowship and the Vitalik Buterin Fellowship from the Future of Life Institute. This research was supported by the E.T.H A.I Center through an E.T.H A.I Center postdoctoral fellowship to Anna Hedström. The research received further support through ELSA (European Lighthouse on Secure and Safe A.I), funded by the European Union under grant agreement No. 101070617.
Table summary: The table presents two distinct prompt templates used for evaluating language models, one designed to measure coherence by focusing on clarity and understanding regardless of correctness, and another designed to assess alignment by identifying refusals or scoring how well the response adheres to human values.
Table 1 summary: Across all datasets, the weighted method eliminates the gap between inclusive and exclusive flagging rates, whereas the argmax approach consistently shows a higher percentage of samples flagged under inclusive criteria. This indicates that the choice of evaluator design significantly impacts the sensitivity of misalignment detection, with the argmax method being more prone to variance based on the strictness of the threshold.