Alignment Newsletter

I edit and write content for the Alignment Newsletter, a weekly publication summarizing recent content relevant to AI alignment, with over 1700 subscribers.

It turns out that people don’t notice things when they’re part of a paragraph of text, so here is:

A BIG SIGN UP LINK

I’d also like to highlight the

SPREADSHEET OF ALL SUMMARIES IN THE NEWSLETTER

Besides that, you might want to:

While initially I was the only person behind the newsletter, there’s now a full team of people making it work:

And, as promised a few lines up, here’s the archive of past newsletters:

  • AN #84 (Chinese): Reviewing AI alignment work in 2018-19
  • AN #83 (Chinese): Sample efficient deep learning with ReMixMatch
  • AN #82 (Chinese): How OpenAI Five distributed their training computation
  • AN #81 (Chinese): Universality as a potential solution to conceptual difficulties in intent alignment
  • AN #80 (Chinese): Why AI risk might be solved without additional intervention from longtermists
  • AN #79 (Chinese): Recursive reward modeling as an alignment technique integrated with deep RL
  • AN #78 (Chinese): Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison
  • AN #77 (Chinese): Double descent: a unification of statistical theory and modern ML practice
  • AN #76 (Chinese): How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations
  • AN #75 (Chinese): Solving Atari and Go with learned game models, and thoughts from a MIRI employee
  • AN #74 (Chinese): Separating beneficial AI into competence, alignment, and coping with impacts
  • AN #73 (Chinese): Detecting catastrophic failures by learning how agents tend to break
  • AN #72 (Chinese): Alignment, robustness, methodology, and system building as research priorities for AI safety
  • AN #71 (Chinese): Avoiding reward tampering through current-RF optimization
  • AN #70 (Chinese): Agents that help humans who are still learning about their own preferences
  • AN #69 (Chinese): Stuart Russell’s new book on why we need to replace the standard model of AI
  • AN #68 (Chinese): The attainable utility theory of impact
  • AN #67 (Chinese): Creating environments in which to study inner alignment failures
  • AN #66 (Chinese): Decomposing robustness into capability robustness and alignment robustness
  • AN #65 (Chinese): Learning useful skills by watching humans “play”
  • AN #64 (Chinese): Using Deep RL and Reward Uncertainty to Incentivize Preference Learning
  • AN #63 (Chinese): How architecture search, meta learning, and environment design could lead to general intelligence
  • AN #62 (Chinese): Are adversarial examples caused by real but imperceptible features?
  • AN #61 (Chinese): AI policy and governance, from two people in the field
  • AN #60 (Chinese): A new AI challenge: Minecraft agents that assist human players in creative mode
  • AN #59 (Chinese): How arguments for AI risk have changed over time
  • AN #58 (Chinese): Mesa optimization: what it is, and why we should care
  • AN #57 (Chinese): Why we should focus on robustness in AI safety, and the analogous problems in programming
  • AN #56 (Chinese): Should ML researchers stop running experiments before making hypotheses?
  • AN #55 (Chinese): Regulatory markets and international standards as a means of ensuring beneficial AI
  • AN #54 (Chinese): Boxing a finite-horizon AI system to keep it unambitious
  • AN #53 (Chinese): Newsletter turns one year old, and why overfitting isn’t a huge problem for neural nets
  • AN #52 (Chinese): Why we may not want our AI systems to model humans
  • AN #51 (Chinese): Cancelling within-batch generalization in order to get stable deep RL
  • AN #50 (Chinese): How an AI catastrophe could occur, and an overview of AI policy from OpenAI researchers
  • AN #49 (Chinese): Understanding how image classifiers work, and a major increase in adversarial robustness
  • AN #48 (Chinese): Quantilization: bounding worst case unintended consequences by partially imitating humans
  • AN #47 (Chinese): Why AI safety needs social scientists
  • AN #46 (Chinese): Yet another wall of text about GPT-2, and structural risks from AI
  • AN #45 (Chinese): How to extract human preferences from the state of the world
  • AN #44 (Chinese): Random search vs. gradient descent on Goodharting, and attention is not all you need; recurrence helps too
  • AN #43 (Chinese): The techniques behind AlphaStar, and the many arguments for AI safety
  • AN #42 (Chinese): Cooperative IRL as a definition of human-AI group rationality, and an empirical evaluation of theory of mind vs. model learning in HRI
  • AN #41: Building AI systems that require informed consent
  • AN #40: Recursive technological improvement resulting in Comprehensive AI Services
  • AN #39: Using GANs for unrestricted adversarial examples
  • AN #38: In which I arrogantly highlight my own interview. Also how compute affects AI timelines
  • AN #37: How to address “human safety problems”, and how AI systems need to account for “silly rules”
  • AN #36: Developing a theory of values to solve extrapolation issues, and an approach to train AI systems to reason well
  • AN #35: The dangers and non-inevitability of goal-directed behavior, and corrigibility through iterated distillation and amplification
  • AN #34: Recursive reward modeling for agent alignment, and evaluating actions instead of outcomes
  • AN #33: Learning from both demos and preferences, and building a well-motivated AI instead of an AI with the right utility function
  • AN #32: Educational resources for deep RL, and more posts on embedded agency and value learning
  • AN #31: Sequences on the new Alignment Forum, and exploration by prediction error for random features
  • AN #30: Decomposition as training signal with iterated amplification and relational inductive biases with graph networks
  • AN #29: Autonomous driving through model-based imitation learning and the feasibility of interpretability
  • AN #28: Threat models in adversarial examples research
  • AN #27: Aiming to solve AI safety in the limit of scaling arbitrarily far with Paul Christiano
  • AN #26: Classifying AI safety problems, and regularizing policies with an ensemble of dynamics models
  • AN #25: Impact as changes to attainable utility and rationalism reality
  • AN #24: Contest on adversarial examples, counterfactuals for supervised learning, beating all of Atari with a single policy, and even more ML summaries
  • AN #23: Dreaming up goals and worlds, and what we want from a definition of impact
  • AN #22: Research agenda for AI governance
  • AN #21: What happens at AI Impacts, RL phrased as probabilistic inference, and autonomous AI in Google’s data centers
  • AN #20: Can curiosity by itself lead to good behavior?
  • AN #19: OpenAI Five vs. Team Human and provable guarantees about neural nets
  • AN #18
  • AN #17
  • AN #16
  • AN #15
  • AN #14
  • AN #13
  • AN #12
  • AN #11
  • AN #10
  • AN #9
  • AN #8
  • AN #7
  • AN #6
  • AN #5
  • AN #4
  • AN #3
  • AN #2
  • AN #1

Before publishing the Alignment Newsletter, I did something similar internally at CHAI. I have made those emails public as well, but note that they were not as polished, and I was still experimenting a lot with the format at the time.