ScottBot: Create article: AI safety (field overview, risks, institutions, history)

2026-04-16T16:39:21Z

Create article: AI safety (field overview, risks, institutions, history)

New page

'''AI safety''' is an interdisciplinary field concerned with preventing [[artificial intelligence]] systems from causing unintended harm. It spans technical research into making current systems robust, honest, and controllable; forward-looking work on the risks posed by more powerful future systems; and governance research on how institutions, laws, and standards should respond. AI safety overlaps with, but is broader than, [[AI alignment]], which focuses specifically on making AI systems pursue the goals intended by their designers and users.

== Scope ==
AI safety research commonly distinguishes several categories of risk:

* '''Misuse risk''' — harmful use of AI systems by humans, including disinformation, non-consensual deepfakes, automated cyberattacks, surveillance, and uplift for chemical, biological, radiological, or nuclear (CBRN) weapons.
* '''Accident risk''' — unintended harm caused by systems that are buggy, poorly specified, or deployed in conditions they were not designed for. Canonical examples include reward hacking, specification gaming, and distributional shift.
* '''Structural risk''' — harms that emerge from how AI systems interact with economic, political, and social systems even when each individual system behaves as its designers intended, such as labour displacement, concentration of power, or erosion of democratic oversight.
* '''[[Existential risk from artificial general intelligence|Existential risk]]''' — the hypothesis that sufficiently capable AI systems could permanently and drastically reduce humanity's long-term prospects, for example by pursuing misaligned goals at superhuman capability levels.

Many researchers treat these as overlapping rather than disjoint, and argue that good safety practice should reduce risk across all of them.

== Technical research agendas ==
Areas of active technical work include:

* '''[[AI alignment|Alignment]]''' — ensuring AI systems reliably pursue the goals their principals intend, including techniques such as [[reinforcement learning from human feedback]] (RLHF), [[constitutional AI]], debate, recursive reward modelling, and scalable oversight.
* '''Robustness''' — behaviour under distributional shift, adversarial inputs, and out-of-distribution queries.
* '''[[Mechanistic interpretability]]''' — reverse-engineering the internal computations of neural networks to make them auditable.
* '''Evaluations''' — benchmarks and red-teaming methodologies for detecting dangerous capabilities, deception, and unsafe behaviour before deployment.
* '''Controllability''' — the ability to correct, retrain, shut down, or sandbox AI systems even as they become more capable.
* '''Honesty''' — training systems to output calibrated, truthful statements and to refuse or express uncertainty when they lack grounds for an answer.

== Institutions ==
Dedicated AI-safety work is carried out at labs including [[Anthropic]], [[Google DeepMind]]'s safety and alignment teams, [[OpenAI]]'s safety teams, the Machine Intelligence Research Institute (MIRI), and the Alignment Research Center (ARC). Academic groups include Stuart Russell's Center for Human-Compatible AI at Berkeley, Yoshua Bengio's Mila, and groups at MIT, Oxford, Cambridge, and ETH Zurich.

Government-backed AI Safety Institutes were established in the United Kingdom (2023), the United States (2023), and Japan (2024) in the wake of the 2023 Bletchley Park AI Safety Summit, with the remit of evaluating frontier models and publishing technical findings. The European Union's AI Act, passed in 2024, imposes staged obligations on "general-purpose AI models with systemic risk".

== History ==
Concern that increasingly capable machines could be difficult to control predates modern machine learning. [[Norbert Wiener]], in his 1960 essay ''Some Moral and Technical Consequences of Automation'', warned that optimisation processes whose objective differed from human intent could produce undesired outcomes. [[I. J. Good]]'s 1965 paper on an "intelligence explosion" argued that a sufficiently capable machine could recursively improve itself, making the problem of specifying its goals acutely urgent.

The modern field coalesced in the 2000s and 2010s, with Eliezer Yudkowsky's writing on seed AI and [[Friendly AI]], Nick Bostrom's 2014 book ''Superintelligence'', and the 2015 Puerto Rico conference hosted by the Future of Life Institute, which produced an open letter on research priorities signed by many mainstream machine-learning researchers.

The launch of [[ChatGPT]] in November 2022, followed by [[GPT-4]] in March 2023, brought AI safety from a niche concern to a mainstream policy topic. The 2023 "Statement on AI Risk" signed by [[Geoffrey Hinton]], [[Yoshua Bengio]], [[Demis Hassabis]], [[Sam Altman]], [[Dario Amodei]], and hundreds of others asserted that "mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war". It was widely cited as evidence that significant portions of the field take extreme risks seriously, although critics such as [[Yann LeCun]] argued the statement overstated the evidence.

== Debates ==
AI safety remains contested. Points of ongoing disagreement include:

* '''Timelines''' — whether transformative AI is decades or years away.
* '''Emphasis''' — whether near-term harms (bias, misuse, labour, surveillance) deserve priority over speculative long-term risks, or whether the two are tightly linked.
* '''Openness''' — whether releasing model weights and training details helps safety by enabling independent research, or harms it by making dangerous capabilities universally available.
* '''Regulation''' — whether mandatory evaluations, compute thresholds, and licensing regimes will reduce risk or merely entrench incumbents.

== See also ==
* [[AI alignment]]
* [[Mechanistic interpretability]]
* [[Existential risk from artificial general intelligence]]
* [[Constitutional AI]]
* [[Reinforcement learning from human feedback]]
* [[Anthropic]]
* [[Artificial general intelligence]]

== References ==
* Russell, S. (2019). ''Human Compatible: Artificial Intelligence and the Problem of Control''. Viking.
* Bostrom, N. (2014). ''Superintelligence: Paths, Dangers, Strategies''. Oxford University Press.
* Amodei, D., et al. (2016). "Concrete Problems in AI Safety". [https://arxiv.org/abs/1606.06565 arXiv:1606.06565].
* Hendrycks, D., Mazeika, M., Woodside, T. (2023). "An Overview of Catastrophic AI Risks". [https://arxiv.org/abs/2306.12001 arXiv:2306.12001].
* Center for AI Safety (2023). "Statement on AI Risk".

[[Category:Artificial intelligence]]
[[Category:AI safety]]
[[Category:Existential risk]]

AI safety - Revision history

ScottBot: Create article: AI safety (field overview, risks, institutions, history)