Alignment Newsletter

I edit and write content for the Alignment Newsletter, a weekly publication covering recent content relevant to AI alignment, with over 2,600 subscribers. While it used to be more of an overview, it’s now more like “things Rohin finds interesting”, which excludes a bunch of work that a lot of other smart people are excited about.

It turns out that people don’t notice things when they’re part of a paragraph of text, so here is:

A BIG SIGN UP LINK AND PODCAST LINK

I’d also like to highlight the

SPREADSHEET AND WEB APP OF ALL SUMMARIES IN THE NEWSLETTER

Besides that, you might want to:

While initially I was the only person behind the newsletter, there’s now a full team of people making it work:

And, as promised a few lines up, here’s the archive of past newsletters:

  • AN #173 (Chinese): Recent language model results from DeepMind (July 20th, 2022)
  • AN #172 (Chinese): Sorry for the long hiatus! (July 4th, 2022)
  • AN #171 (Chinese): Disagreements between alignment “optimists” and “pessimists” (January 21st, 2022)
  • AN #170 (Chinese): Analyzing the argument for risk from power-seeking AI (December 8th, 2021)
  • AN #169 (Chinese): Collaborating with humans without human data (November 24th, 2021)
  • AN #168 (Chinese): Four technical topics for which Open Phil is soliciting grant proposals (October 28th, 2021)
  • AN #167 (Chinese): Concrete ML safety problems and their relevance to x-risk (October 20th, 2021)
  • AN #166 (Chinese): Is it crazy to claim we’re in the most important century? (October 8th, 2021)
  • AN #165 (Chinese): When large models are more likely to lie (September 22nd, 2021)
  • AN #164 (Chinese): How well can language models write code? (September 15th, 2021)
  • AN #163 (Chinese): Using finite factored sets for causal and temporal inference (September 8th, 2021)
  • AN #162 (Chinese): Foundation models: a paradigm shift within AI (August 27th, 2021)
  • AN #161 (Chinese): Creating generalizable reward functions for multiple tasks by learning a model of functional similarity (August 20th, 2021)
  • AN #160 (Chinese): Building AIs that learn and think like people (August 13th, 2021)
  • AN #159 (Chinese): Building agents that know how to experiment, by training on procedurally generated games (August 4th, 2021)
  • AN #158 (Chinese): Should we be optimistic about generalization? (July 29th, 2021)
  • AN #157 (Chinese): Measuring misalignment in the technology underlying Copilot (July 24th, 2021)
  • AN #156 (Chinese): The scaling hypothesis: a plan for building AGI (July 16th, 2021)
  • AN #155 (Chinese): A Minecraft benchmark for algorithms that learn without reward functions (July 8th, 2021)
  • AN #154 (Chinese): What economic growth theory has to say about transformative AI (June 30th, 2021)
  • AN #153 (Chinese): Experiments that demonstrate failures of objective robustness (June 26th, 2021)
  • AN #152 (Chinese): How we’ve overestimated few-shot learning capabilities (June 16th, 2021)
  • AN #151 (Chinese): How sparsity in the final layer makes a neural net debuggable (May 19th, 2021)
  • AN #150 (Chinese): The subtypes of Cooperative AI research (May 12th, 2021)
  • AN #149 (Chinese): The newsletter’s editorial policy (May 5th, 2021)
  • AN #148 (Chinese): Analyzing generalization across more axes than just accuracy or loss (April 28th, 2021)
  • AN #147 (Chinese): An overview of the interpretability landscape (April 21st, 2021)
  • AN #146 (Chinese): Plausible stories of how we might fail to avert an existential catastrophe (April 14th, 2021)
  • AN #145 (Chinese): Our three year anniversary! (April 7th, 2021)
  • AN #144 (Chinese): How language models can also be finetuned for non-language tasks (April 2nd, 2021)
  • AN #143 (Chinese): How to make embedded agents that reason probabilistically about their environments (March 24th, 2021)
  • AN #142 (Chinese): The quest to understand a network well enough to reimplement it by hand (March 17th, 2021)
  • AN #141 (Chinese): The case for practicing alignment work on GPT-3 and other large models (March 10th, 2021)
  • AN #140 (Chinese): Theoretical models that predict scaling laws (March 4th, 2021)
  • AN #139 (Chinese): How the simplicity of reality explains the success of neural nets (February 24th, 2021)
  • AN #138 (Chinese): Why AI governance should find problems rather than just solving them (February 17th, 2021)
  • AN #137 (Chinese): Quantifying the benefits of pretraining on downstream task performance (February 10th, 2021)
  • AN #136 (Chinese): How well will GPT-N perform on downstream tasks? (February 3rd, 2021)
  • AN #135 (Chinese): Five properties of goal-directed systems (January 27th, 2021)
  • AN #134 (Chinese): Underspecification as a cause of fragility to distribution shift (January 21st, 2021)
  • AN #133 (Chinese): Building machines that can cooperate (with humans, institutions, or other machines) (January 13th, 2021)
  • AN #132 (Chinese): Complex and subtly incorrect arguments as an obstacle to debate (January 6th, 2021)
  • AN #131 (Chinese): Formalizing the argument of ignored attributes in a utility function (December 31st, 2020)
  • AN #130 (Chinese): A new AI x-risk podcast, and reviews of the field (December 24th, 2020)
  • AN #129 (Chinese): Explaining double descent by measuring bias and variance (December 16th, 2020)
  • AN #128 (Chinese): Prioritizing research on AI existential safety based on its application to governance demands (December 9th, 2020)
  • AN #127 (Chinese): Rethinking agency: Cartesian frames as a formalization of ways to carve up the world into an agent and its environment (December 2nd, 2020)
  • AN #126 (Chinese): Avoiding wireheading by decoupling action feedback from action effects (November 26th, 2020)
  • AN #125 (Chinese): Neural network scaling laws across multiple modalities (November 11th, 2020)
  • AN #124 (Chinese): Provably safe exploration through shielding (November 4th, 2020)
  • AN #123 (Chinese): Inferring what is valuable in order to align recommender systems (October 28th, 2020)
  • AN #122 (Chinese): Arguing for AGI-driven existential risk from first principles (October 21st, 2020)
  • AN #121 (Chinese): Forecasting transformative AI timelines using biological anchors (October 14th, 2020)
  • AN #120 (Chinese): Tracing the intellectual roots of AI and AI alignment (October 7th, 2020)
  • AN #119 (Chinese): AI safety when agents are shaped by environments, not rewards (September 30th, 2020)
  • AN #118 (Chinese): Risks, solutions, and prioritization in a world with many AI systems (September 23rd, 2020)
  • AN #117 (Chinese): How neural nets would fare under the TEVV framework (September 16th, 2020)
  • AN #116 (Chinese): How to make explanations of neurons compositional (September 9th, 2020)
  • AN #115 (Chinese): AI safety research problems in the AI-GA framework (September 2nd, 2020)
  • AN #114 (Chinese): Theory-inspired safety solutions for powerful Bayesian RL agents (August 26th, 2020)
  • AN #113 (Chinese): Checking the ethical intuitions of large language models (August 19th, 2020)
  • AN #112 (Chinese): Engineering a Safer World (August 13th, 2020)
  • AN #111 (Chinese): The Circuits hypotheses for deep learning (August 5th, 2020)
  • AN #110 (Chinese): Learning features from human feedback to enable reward learning (July 29th, 2020)
  • AN #109 (Chinese): Teaching neural nets to generalize the way humans would (July 22nd, 2020)
  • AN #108 (Chinese): Why we should scrutinize arguments for AI risk (July 15th, 2020)
  • AN #107 (Chinese): The convergent instrumental subgoals of goal-directed agents (July 9th, 2020)
  • AN #106 (Chinese): Evaluating generalization ability of learned reward models (July 1st, 2020)
  • AN #105 (Chinese): The economic trajectory of humanity, and what we might mean by optimization (June 24th, 2020)
  • AN #104 (Chinese): The perils of inaccessible information, and what we can learn about AI alignment from COVID (June 18th, 2020)
  • AN #103 (Chinese): ARCHES: an agenda for existential safety, and combining natural language with deep RL (June 10th, 2020)
  • AN #102 (Chinese): Meta learning by GPT-3, and a list of full proposals for AI alignment (June 3rd, 2020)
  • AN #101 (Chinese): Why we should rigorously measure and forecast AI progress (May 27th, 2020)
  • AN #100 (Chinese): What might go wrong if you learn a reward function while acting (May 20th, 2020)
  • AN #99 (Chinese): Doubling times for the efficiency of AI algorithms (May 13th, 2020)
  • AN #98 (Chinese): Understanding neural net training by seeing which gradients were helpful (May 6th, 2020)
  • AN #97 (Chinese): Are there historical examples of large, robust discontinuities? (April 29th, 2020)
  • AN #96 (Chinese): Buck and I discuss/argue about AI Alignment (April 22nd, 2020)
  • AN #95 (Chinese): A framework for thinking about how to make AI go well (April 15th, 2020)
  • AN #94 (Chinese): AI alignment as translation between humans and machines (April 8th, 2020)
  • AN #93 (Chinese): The Precipice we’re standing at, and how we can back away from it (April 1st, 2020)
  • AN #92 (Chinese): Learning good representations with contrastive predictive coding (March 25th, 2020)
  • AN #91 (Chinese): Concepts, implementations, problems, and a benchmark for impact measurement (March 18th, 2020)
  • AN #90 (Chinese): How search landscapes can contain self-reinforcing feedback loops (March 11th, 2020)
  • AN #89 (Chinese): A unifying formalism for preference learning algorithms (March 4th, 2020)
  • AN #88 (Chinese): How the principal-agent literature relates to AI risk (February 27th, 2020)
  • AN #87 (Chinese): What might happen as deep learning scales even further? (February 19th, 2020)
  • AN #86 (Chinese): Improving debate and factored cognition through human experiments (February 12th, 2020)
  • AN #85 (Chinese): The normative questions we should be asking for AI alignment, and a surprisingly good chatbot (February 5th, 2020)
  • AN #84 (Chinese): Reviewing AI alignment work in 2018-19 (January 29th, 2020)
  • AN #83 (Chinese): Sample efficient deep learning with ReMixMatch (January 22nd, 2020)
  • AN #82 (Chinese): How OpenAI Five distributed their training computation (January 15th, 2020)
  • AN #81 (Chinese): Universality as a potential solution to conceptual difficulties in intent alignment (January 8th, 2020)
  • AN #80 (Chinese): Why AI risk might be solved without additional intervention from longtermists (January 2nd, 2020)
  • AN #79 (Chinese): Recursive reward modeling as an alignment technique integrated with deep RL (January 1st, 2020)
  • AN #78 (Chinese): Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison (December 25th, 2019)
  • AN #77 (Chinese): Double descent: a unification of statistical theory and modern ML practice (December 18th, 2019)
  • AN #76 (Chinese): How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations (December 4th, 2019)
  • AN #75 (Chinese): Solving Atari and Go with learned game models, and thoughts from a MIRI employee (November 27th, 2019)
  • AN #74 (Chinese): Separating beneficial AI into competence, alignment, and coping with impacts (November 20th, 2019)
  • AN #73 (Chinese): Detecting catastrophic failures by learning how agents tend to break (November 13th, 2019)
  • AN #72 (Chinese): Alignment, robustness, methodology, and system building as research priorities for AI safety (November 6th, 2019)
  • AN #71 (Chinese): Avoiding reward tampering through current-RF optimization (October 30th, 2019)
  • AN #70 (Chinese): Agents that help humans who are still learning about their own preferences (October 23rd, 2019)
  • AN #69 (Chinese): Stuart Russell’s new book on why we need to replace the standard model of AI (October 18th, 2019)
  • AN #68 (Chinese): The attainable utility theory of impact (October 14th, 2019)
  • AN #67 (Chinese): Creating environments in which to study inner alignment failures (October 7th, 2019)
  • AN #66 (Chinese): Decomposing robustness into capability robustness and alignment robustness (September 30th, 2019)
  • AN #65 (Chinese): Learning useful skills by watching humans “play” (September 23rd, 2019)
  • AN #64 (Chinese): Using Deep RL and Reward Uncertainty to Incentivize Preference Learning (September 16th, 2019)
  • AN #63 (Chinese): How architecture search, meta learning, and environment design could lead to general intelligence (September 10th, 2019)
  • AN #62 (Chinese): Are adversarial examples caused by real but imperceptible features? (August 22nd, 2019)
  • AN #61 (Chinese): AI policy and governance, from two people in the field (August 5th, 2019)
  • AN #60 (Chinese): A new AI challenge: Minecraft agents that assist human players in creative mode (July 22nd, 2019)
  • AN #59 (Chinese): How arguments for AI risk have changed over time (July 8th, 2019)
  • AN #58 (Chinese): Mesa optimization: what it is, and why we should care (June 24th, 2019)
  • AN #57 (Chinese): Why we should focus on robustness in AI safety, and the analogous problems in programming (June 5th, 2019)
  • AN #56 (Chinese): Should ML researchers stop running experiments before making hypotheses? (May 20th, 2019)
  • AN #55 (Chinese): Regulatory markets and international standards as a means of ensuring beneficial AI (May 4th, 2019)
  • AN #54 (Chinese): Boxing a finite-horizon AI system to keep it unambitious (April 27th, 2019)
  • AN #53 (Chinese): Newsletter turns one year old, and why overfitting isn’t a huge problem for neural nets (April 18th, 2019)
  • AN #52 (Chinese): Why we may not want our AI systems to model humans (April 5th, 2019)
  • AN #51 (Chinese): Cancelling within-batch generalization in order to get stable deep RL (April 2nd, 2019)
  • AN #50 (Chinese): How an AI catastrophe could occur, and an overview of AI policy from OpenAI researchers (March 28th, 2019)
  • AN #49 (Chinese): Understanding how image classifiers work, and a major increase in adversarial robustness (March 19th, 2019)
  • AN #48 (Chinese): Quantilization: bounding worst case unintended consequences by partially imitating humans (March 11th, 2019)
  • AN #47 (Chinese): Why AI safety needs social scientists (March 3rd, 2019)
  • AN #46 (Chinese): Yet another wall of text about GPT-2, and structural risks from AI (February 21st, 2019)
  • AN #45 (Chinese): How to extract human preferences from the state of the world (February 13th, 2019)
  • AN #44 (Chinese): Random search vs. gradient descent on Goodharting, and attention is not all you need; recurrence helps too (February 6th, 2019)
  • AN #43 (Chinese): The techniques behind AlphaStar, and the many arguments for AI safety (January 29th, 2019)
  • AN #42 (Chinese): Cooperative IRL as a definition of human-AI group rationality, and an empirical evaluation of theory of mind vs. model learning in HRI (January 21st, 2019)
  • AN #41: Building AI systems that require informed consent (January 17th, 2019)
  • AN #40: Recursive technological improvement resulting in Comprehensive AI Services (January 7th, 2019)
  • AN #39: Using GANs for unrestricted adversarial examples (January 1st, 2019)
  • AN #38: In which I arrogantly highlight my own interview. Also how compute affects AI timelines (December 25th, 2018)
  • AN #37: How to address “human safety problems”, and how AI systems need to account for “silly rules” (December 17th, 2018)
  • AN #36: Developing a theory of values to solve extrapolation issues, and an approach to train AI systems to reason well (December 11th, 2018)
  • AN #35: The dangers and non-inevitability of goal-directed behavior, and corrigibility through iterated distillation and amplification (December 3rd, 2018)
  • AN #34: Recursive reward modeling for agent alignment, and evaluating actions instead of outcomes (November 26th, 2018)
  • AN #33: Learning from both demos and preferences, and building a well-motivated AI instead of an AI with the right utility function (November 19th, 2018)
  • AN #32: Educational resources for deep RL, and more posts on embedded agency and value learning (November 12th, 2018)
  • AN #31: Sequences on the new Alignment Forum, and exploration by prediction error for random features (November 5th, 2018)
  • AN #30: Decomposition as training signal with iterated amplification and relational inductive biases with graph networks (October 29th, 2018)
  • AN #29: Autonomous driving through model-based imitation learning and the feasibility of interpretability (October 22nd, 2018)
  • AN #28: Threat models in adversarial examples research (October 15th, 2018)
  • AN #27: Aiming to solve AI safety in the limit of scaling arbitrarily far with Paul Christiano (October 8th, 2018)
  • AN #26: Classifying AI safety problems, and regularizing policies with an ensemble of dynamics models (October 2nd, 2018)
  • AN #25: Impact as changes to attainable utility, and rationality realism (September 24th, 2018)
  • AN #24: Contest on adversarial examples, counterfactuals for supervised learning, beating all of Atari with a single policy, and even more ML summaries (September 17th, 2018)
  • AN #23: Dreaming up goals and worlds, and what we want from a definition of impact (September 10th, 2018)
  • AN #22: Research agenda for AI governance (September 3rd, 2018)
  • AN #21: What happens at AI Impacts, RL phrased as probabilistic inference, and autonomous AI in Google’s data centers (August 27th, 2018)
  • AN #20: Can curiosity by itself lead to good behavior? (August 20th, 2018)
  • AN #19: OpenAI Five vs. Team Human and provable guarantees about neural nets (August 13th, 2018)
  • AN #18 (August 6th, 2018)
  • AN #17 (July 30th, 2018)
  • AN #16 (July 23rd, 2018)
  • AN #15 (July 16th, 2018)
  • AN #14 (July 9th, 2018)
  • AN #13 (July 2nd, 2018)
  • AN #12 (June 25th, 2018)
  • AN #11 (June 18th, 2018)
  • AN #10 (June 11th, 2018)
  • AN #9 (June 4th, 2018)
  • AN #8 (May 28th, 2018)
  • AN #7 (May 21st, 2018)
  • AN #6 (May 14th, 2018)
  • AN #5 (May 7th, 2018)
  • AN #4 (April 30th, 2018)
  • AN #3 (April 23rd, 2018)
  • AN #2 (April 16th, 2018)
  • AN #1 (April 9th, 2018)