I edit and write content for the Alignment Newsletter, a weekly publication summarizing recent content relevant to AI alignment, with over 1700 subscribers.
It turns out that people don’t notice things when they’re part of a paragraph of text, so here is:
A BIG SIGN UP LINK
I’d also like to highlight the
SPREADSHEET OF ALL SUMMARIES IN THE NEWSLETTER
Besides that, you might want to:
- Subscribe to the RSS feed
- Listen to the newsletter as a podcast
- Read it in Chinese
- Read the newsletter on the Alignment Forum or LessWrong
- Follow me on Twitter
- Read a retrospective I wrote about the newsletter
- Look through the archives lower down on this page.
While initially I was the only person behind the newsletter, there’s now a full team of people making it work:
- Dan Hendrycks, Cody Wild, Nicholas Joseph, Asya Bergal, Matthew Barnett, Sudhanshu Kasewa, Zachary Robertson, Flo Dorner and I write the content
- I edit the newsletter
- Georg Arndt produces the newsletter
- Xiaohu Zhu translates it into Chinese
- Rob Miles creates the podcast
And, as promised a few lines up, here’s the archive of past newsletters:
- AN #84 (Chinese): Reviewing AI alignment work in 2018-19
- AN #83 (Chinese): Sample efficient deep learning with ReMixMatch
- AN #82 (Chinese): How OpenAI Five distributed their training computation
- AN #81 (Chinese): Universality as a potential solution to conceptual difficulties in intent alignment
- AN #80 (Chinese): Why AI risk might be solved without additional intervention from longtermists
- AN #79 (Chinese): Recursive reward modeling as an alignment technique integrated with deep RL
- AN #78 (Chinese): Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison
- AN #77 (Chinese): Double descent: a unification of statistical theory and modern ML practice
- AN #76 (Chinese): How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations
- AN #75 (Chinese): Solving Atari and Go with learned game models, and thoughts from a MIRI employee
- AN #74 (Chinese): Separating beneficial AI into competence, alignment, and coping with impacts
- AN #73 (Chinese): Detecting catastrophic failures by learning how agents tend to break
- AN #72 (Chinese): Alignment, robustness, methodology, and system building as research priorities for AI safety
- AN #71 (Chinese): Avoiding reward tampering through current-RF optimization
- AN #70 (Chinese): Agents that help humans who are still learning about their own preferences
- AN #69 (Chinese): Stuart Russell’s new book on why we need to replace the standard model of AI
- AN #68 (Chinese): The attainable utility theory of impact
- AN #67 (Chinese): Creating environments in which to study inner alignment failures
- AN #66 (Chinese): Decomposing robustness into capability robustness and alignment robustness
- AN #65 (Chinese): Learning useful skills by watching humans “play”
- AN #64 (Chinese): Using Deep RL and Reward Uncertainty to Incentivize Preference Learning
- AN #63 (Chinese): How architecture search, meta learning, and environment design could lead to general intelligence
- AN #62 (Chinese): Are adversarial examples caused by real but imperceptible features?
- AN #61 (Chinese): AI policy and governance, from two people in the field
- AN #60 (Chinese): A new AI challenge: Minecraft agents that assist human players in creative mode
- AN #59 (Chinese): How arguments for AI risk have changed over time
- AN #58 (Chinese): Mesa optimization: what it is, and why we should care
- AN #57 (Chinese): Why we should focus on robustness in AI safety, and the analogous problems in programming
- AN #56 (Chinese): Should ML researchers stop running experiments before making hypotheses?
- AN #55 (Chinese): Regulatory markets and international standards as a means of ensuring beneficial AI
- AN #54 (Chinese): Boxing a finite-horizon AI system to keep it unambitious
- AN #53 (Chinese): Newsletter turns one year old, and why overfitting isn’t a huge problem for neural nets
- AN #52 (Chinese): Why we may not want our AI systems to model humans
- AN #51 (Chinese): Cancelling within-batch generalization in order to get stable deep RL
- AN #50 (Chinese): How an AI catastrophe could occur, and an overview of AI policy from OpenAI researchers
- AN #49 (Chinese): Understanding how image classifiers work, and a major increase in adversarial robustness
- AN #48 (Chinese): Quantilization: bounding worst case unintended consequences by partially imitating humans
- AN #47 (Chinese): Why AI safety needs social scientists
- AN #46 (Chinese): Yet another wall of text about GPT-2, and structural risks from AI
- AN #45 (Chinese): How to extract human preferences from the state of the world
- AN #44 (Chinese): Random search vs. gradient descent on Goodharting, and attention is not all you need; recurrence helps too
- AN #43 (Chinese): The techniques behind AlphaStar, and the many arguments for AI safety
- AN #42 (Chinese): Cooperative IRL as a definition of human-AI group rationality, and an empirical evaluation of theory of mind vs. model learning in HRI
- AN #41: Building AI systems that require informed consent
- AN #40: Recursive technological improvement resulting in Comprehensive AI Services
- AN #39: Using GANs for unrestricted adversarial examples
- AN #38: In which I arrogantly highlight my own interview. Also how compute affects AI timelines
- AN #37: How to address “human safety problems”, and how AI systems need to account for “silly rules”
- AN #36: Developing a theory of values to solve extrapolation issues, and an approach to train AI systems to reason well
- AN #35: The dangers and non-inevitability of goal-directed behavior, and corrigibility through iterated distillation and amplification
- AN #34: Recursive reward modeling for agent alignment, and evaluating actions instead of outcomes
- AN #33: Learning from both demos and preferences, and building a well-motivated AI instead of an AI with the right utility function
- AN #32: Educational resources for deep RL, and more posts on embedded agency and value learning
- AN #31: Sequences on the new Alignment Forum, and exploration by prediction error for random features
- AN #30: Decomposition as training signal with iterated amplification and relational inductive biases with graph networks
- AN #29: Autonomous driving through model-based imitation learning and the feasibility of interpretability
- AN #28: Threat models in adversarial examples research
- AN #27: Aiming to solve AI safety in the limit of scaling arbitrarily far with Paul Christiano
- AN #26: Classifying AI safety problems, and regularizing policies with an ensemble of dynamics models
- AN #25: Impact as changes to attainable utility and rationalism reality
- AN #24: Contest on adversarial examples, counterfactuals for supervised learning, beating all of Atari with a single policy, and even more ML summaries
- AN #23: Dreaming up goals and worlds, and what we want from a definition of impact
- AN #22: Research agenda for AI governance
- AN #21: What happens at AI Impacts, RL phrased as probabilistic inference, and autonomous AI in Google’s data centers
- AN #20: Can curiosity by itself lead to good behavior?
- AN #19: OpenAI Five vs. Team Human and provable guarantees about neural nets
- AN #18
- AN #17
- AN #16
- AN #15
- AN #14
- AN #13
- AN #12
- AN #11
- AN #10
- AN #9
- AN #8
- AN #7
- AN #6
- AN #5
- AN #4
- AN #3
- AN #2
- AN #1
Before publishing the Alignment Newsletter, I did something similar internally at CHAI. I have made those emails public as well, but note that they were not as polished, and I was experimenting a lot with the format at the time.