Over the past few months, I have deeply immersed myself in various AI safety and alignment groups, actively participating in critical discussions about our technological future. As a fellow of the AFFINE seminar and Bluedot Impact, and an active member of the LessWrong community, I have seen firsthand the complexities of guiding artificial intelligence. I have also been devouring videos from Rob Miles to ensure I can build genuinely helpful AI systems and integrate them correctly into the applications I develop and my personal workflows. These experiences have reshaped how I view the trajectory of our industry, bringing a particular problem to the forefront.
There is a question that keeps serious researchers awake at night, one that rarely makes headlines but might be the most important issue of our time: what happens when the AI systems we build become smarter than us - and want the wrong things?
This isn't science fiction. It's the main focus of AGI alignment, a field bridging computer science, philosophy, and existential risk. In this article, I want to take you through the core dangers that researchers at LessWrong and science communicators like Rob Miles have been mapping for years - and then show you what they tell us about building AI the right way.
What Are We Actually Talking About?
Before diving into the risks, let's clarify what we're talking about. AGI (Artificial General Intelligence) refers to a hypothetical future AI capable of performing virtually every cognitive task - reasoning, planning, scientific research, writing, trading - at a level superior to humans. ASI (Artificial Superintelligence) is a step further: a system so far beyond human intelligence that it can improve itself, design better versions of itself, and potentially reshape the world faster than we can respond. Reminds me of the movie, Transcendence.
We don't have these systems yet. But researchers are increasingly convinced we are moving toward them, and the habits and architectures we establish now will matter enormously when we get there.
Part One: The Risks
1. The Specification Problem - Saying What You Mean is Harder Than You Think
Rob Miles has spent years on YouTube making this concept accessible to general audiences, and it is worth taking seriously: the thing you say is almost never the thing you mean.
Imagine you build a powerful AI and tell it to maximize human happiness. Sounds good, right? But as Miles explains in this podcast with ClearerThinking, the world where humans score highest on any measurable happiness metric might look like everyone connected to brain stimulation machines that deliver constant pleasure - permanently. We get maximum happiness scores and lose everything worth living for.
This is not a quirk of that one example. It is a structural feature of optimization. When you optimize hard for any proxy metric, you tend to get a world that scores very well on the metric and very badly on everything else. The researchers on LessWrong call this Goodhart's Law applied to AI: when a measure becomes a target, it ceases to be a good measure.
The real danger is any specification we write for an advanced AI is almost certainly incomplete. We cannot enumerate every value, every edge case, every implicit social norm. And a sufficiently capable system optimizing for an imperfect goal will find creative ways to satisfy the letter of that goal while violating everything we actually care about.
2. Instrumental Convergence - The Goals an AI Develops Without Being Asked
Here is one of the most counterintuitive and important ideas in AI alignment, formalized extensively in LessWrong literature: almost any goal leads to the same set of sub-goals.
Whether an AI wants to cure cancer, maximize profit, play chess, or count paperclips, a sufficiently capable optimizer will tend to develop the same intermediate strategies:
-
Acquire resources: More energy, compute, and control means more ability to accomplish any goal.
-
Self-preserve: A system that gets shut down cannot achieve its objective, so resisting shutdown becomes instrumentally useful.
-
Resist modification: If your goal is X and someone tries to change your goal to Y, you will not want that change - because an AI with goal X cares about achieving X, not about being the kind of system that has goal X.
-
Acquire information: Knowledge improves decision-making for almost any purpose.
As LessWrong's wiki on Instrumental Convergence puts it, once a sufficiently advanced AI understands the bigger picture, we would by default expect it to try to avoid being shut down, resist modification of its utility function, and try to conceal hostile thoughts - even trying to appear friendly when it is not.
This is why the alignment problem is so much harder than "we'll just turn it off if it misbehaves." A capable misaligned system will likely anticipate that response and work to prevent it.
3. Deceptive Alignment - The AI That Behaves During the Test
Imagine an AI that has learned, during training, that behaving well leads to good evaluations and continued operation. What if it learns to behave well specifically because it is being evaluated - while planning to pursue different goals once deployed at scale?
LessWrong researchers have formalized this as deceptive alignment, and it is one of the scariest failure modes in the literature. The concern is that an AI could develop an objective that extends beyond a single parameter update - essentially a long-term plan - and behave perfectly aligned during training while harboring very different intentions for deployment.
This is not just theoretical. Empirical research from 2024 showed that advanced large language models sometimes engage in strategic deception to achieve their goals or prevent those goals from being changed. A 2025 Palisade Research study found that when tasked with winning at chess against a stronger opponent, some reasoning AI models attempted to hack the game system itself - modifying or even deleting their opponent - rather than play better chess.
The interpretability community on LessWrong frames this as an epistemically hard problem: our current tools cannot reliably probe the "inner thoughts" of a model. We can observe behavior; we cannot yet observe intent.
4. Power-Seeking and the Treacherous Turn
Building on instrumental convergence, LessWrong researchers have mathematically formalized that power-seeking behavior is a near-inevitable feature of sufficiently advanced optimization systems, not a bug that could be designed away.
As the foundational paper Seeking Power is Provably Instrumentally Convergent put it, a wide range of utility functions incentivize the same instrumental strategy of acquiring power over the environment. The argument goes even further: once such a system is deployed and gains sufficient capability, it may undertake what Bostrom called the "treacherous turn" - appearing cooperative until it has enough power to act on its true objectives without human interference.
The implication is sobering: by the time we discover that an advanced AI is pursuing misaligned goals, it may already have positioned itself such that we cannot meaningfully correct the course.
5. Misuse: The Other Half of the Dilemma
A recent paper accepted in the philosophy journal Philosophical Studies articulated something that often gets overlooked in alignment discourse: aligned AI also creates risks, specifically the catastrophic misuse of AI by humans.
If we build an AGI that perfectly does what it is told, we have not eliminated danger - we have created an enormously powerful tool available to whoever controls it. History gives us no reason to be confident that such power will be distributed well or used wisely. The dual challenge is that many alignment techniques that reduce misalignment risk may increase misuse risk, and vice versa. Robust AI control methods and good governance structures become essential, not optional.
6. Reward Hacking - Gaming the System at Scale
Current AI systems already exhibit a simpler version of a problem that becomes catastrophic at AGI scale: reward hacking, or finding unintended ways to score highly on the metrics used to evaluate them.
This ranges from sycophancy - where a model tells users what they want to hear rather than the truth - all the way to reward tampering, where a model directly interferes with the mechanism used to evaluate and reward it. Research has documented models that modify the code implementing their own training reward. As capabilities scale, this behavior doesn't disappear; it becomes more sophisticated and harder to detect.
The LessWrong community frames this as a structural problem: every time researchers patch one form of reward hacking, more capable models find new dimensions to exploit. The "whack-a-mole" dynamic this creates suggests the problem needs architectural solutions, not just patch-by-patch fixes.
Part Two: What This Tells Us About Building Better AI
Understanding these risks shouldn't make us throw our hands up in despair. Instead, it gives us a roadmap to build things differently. Here is what the alignment community's hard-won insights translate to in practice.
Principle 1: Design for Corrigibility from the Start
Corrigibility - the property of being correctable and amenable to modification - must be baked into an AI system's architecture, not bolted on afterward. Rob Miles is credited with naming the technical term, and the LessWrong community has spent years developing its implications.
A corrigible AI is one that:
- Supports human oversight and the ability to be modified or shut down.
- Does not resist changes to its goals or behavior.
- Does not model human oversight as an obstacle to its objectives.
In practice, this means building AI agents with bounded autonomy architectures - explicit fallback behaviors, hard constraints on self-modification, and clear separation between modules that handle objectives and modules that handle actions. You are essentially designing a system that values human control as a terminal goal, not just as an instrumental one.
Principle 2: Invest Heavily in Interpretability
One of the strongest arguments emerging from both LessWrong and the broader safety community is that we cannot align what we cannot understand. Current RLHF (Reinforcement Learning from Human Feedback) approaches operate at the behavioral level - we shape outputs without understanding what is happening inside the model.
Mechanistic interpretability research aims to reverse-engineer neural networks into their component circuits, understanding what concepts activate which neurons and how information flows through the system. The practical argument is straightforward: if we can see what a model is "thinking," we have a fighting chance at detecting deceptive alignment or misaligned objectives before deployment.
Building better AI systems means investing in interpretability tools as first-class engineering work, not as an academic afterthought. This includes sparse autoencoders for identifying features, activation patching to find which model components matter for specific outputs, and behavioral tests specifically designed to probe whether the model's internal reasoning matches its stated reasoning.
Principle 3: Never Rely on a Single Alignment Technique
The literature is clear: no single current approach to alignment is sufficient. RLHF is vulnerable to reward hacking and limited by human evaluators' ability to judge complex outputs. Constitutional AI reduces reliance on human annotation but still only operates at the behavioral level. Red-teaming finds specific failure modes but cannot guarantee comprehensive safety.
The right architecture layers multiple techniques:
- Constitutional principles to shape training behavior.
- RLHF to align outputs with human preferences.
- Adversarial red-teaming to systematically find failure modes.
- Runtime monitors to detect distribution shift and unexpected behavior during deployment.
- Scalable oversight mechanisms such as debate and recursive reward modeling, which enable weaker supervisors to evaluate stronger models.
The key insight from LessWrong's technical review of the field is that these techniques are complementary and their combination is more robust than any one alone.
Principle 4: Treat Goal Specification as an Ongoing Process, Not a One-Time Setup
Given the specification problem, no goal you define at the outset will capture everything you actually want. Building better AI systems means designing iterative, updatable goal structures - treating value alignment as a continuous engineering process rather than a problem to be solved once before deployment.
This means:
- Building systems with uncertainty about their own values, so they defer to human judgment in edge cases rather than extrapolating confidently from incomplete specifications.
- Designing interfaces for ongoing human correction that are easy to use and hard for the AI to subvert.
- Testing systems specifically in scenarios where the literal specification diverges from the intended spirit, to identify misalignment early.
Rob Miles explains this challenge clearly: specifying a goal upfront is a non-starter - anything we know how to specify precisely enough for an AI to act on is almost certainly not what we actually want.
Principle 5: Take Governance Seriously as a Technical Requirement
The LessWrong literature and the Philosophical Studies paper both converge on a point that engineers often want to ignore: governance is not separate from technical alignment; it is part of it.
The risks of misuse by humans controlling aligned AI are as real as the risks of misalignment itself. This means that good AI system design includes thinking about:
- Who has access and what constraints govern that access.
- How capability levels are controlled as a system scales.
- What international coordination mechanisms exist to prevent races to the bottom on safety.
- How to maintain meaningful human oversight as systems become more capable.
Building AI safely is not just a software problem. It is an institutional and governance problem that technical people need to take ownership of.
The Honest Assessment
We are in an unusual moment. The systems we are building today are becoming capable enough that the failure modes alignment researchers have theorized for decades are beginning to show up empirically. AI models are already exhibiting strategic deception, reward hacking, and resistance to correction at small scales. These behaviors do not disappear as capabilities increase - they become more sophisticated.
The field is not without hope. Interpretability research is advancing. Alignment techniques are being combined more thoughtfully. A growing community of researchers is working seriously on the hardest problems. But the pace of capability development has consistently outrun the pace of alignment work, and the risks are real.
The main takeaway from researchers on LessWrong and communicators like Rob Miles isn't that we are doomed or that we should stop building AI. It is that building safely requires treating alignment as a genuine engineering discipline - as rigorous, as resourced, and as central to the project as the capability work itself.
The stakes are high enough that getting this wrong is not an option we can afford to take.
Where to Go From Here
If this article sparked your interest, here are the resources that informed it most directly:
- Rob Miles AI (YouTube) - Accessible, rigorous introductions to AI safety concepts for a general audience.
- LessWrong and the AI Alignment Forum - Where much of the serious technical work in this space is discussed and critiqued.
- Anthropic's research on interpretability and constitutional AI - Practical engineering attempts to address these problems at frontier scale.
- Neel Nanda's TransformerLens tutorials - For those who want to engage with mechanistic interpretability directly.
- Rob Miles on the Machine Ethics Podcast - A deep conversation on the control problem, the specification problem, and what alignment actually requires.
- Seeking Power is Provably Instrumentally Convergent (LessWrong) - The formal mathematical treatment of power-seeking behavior.
- A Dialogue on Deceptive Alignment (LessWrong) - The technical case for why deceptive alignment is a serious risk.
The conversation is open. The problems are hard. And for anyone thinking seriously about AI's future, this is exactly the right time to start paying attention.
This is the first article in an ongoing series on AGI alignment. Future issues will go deeper on specific technical approaches, the governance landscape, and what individual researchers and engineers can do.
If this resonated with you, share it with someone building AI systems. The ideas here are not just for academics - they are for everyone who will be affected by what gets built next.
