Publication | Open Access
Aligning AI With Shared Human Values
Citations: 100 | References: 40 | Year: 2020
Artificial Intelligence, Engineering, Shared Human Values, Ethics in Natural Language Processing, Ethics Dataset, Intelligent Systems, Natural Language Processing, Responsible AI, Computational Linguistics, Language Studies, Ethics of Artificial Intelligence, Human-Artificial Intelligence Collaboration, Large AI Model, Ethics in Knowledge Representation, Cognitive Science, Alignment Theory, Computer Science, Automation, Human-AI Interaction, Widespread Moral Judgments, Linguistics, Artificial Intelligence Ethics
The study evaluates language models’ understanding of basic moral concepts. Doing so requires linking world knowledge to value judgments, a capability that could help steer chatbots and reinforcement learning agents. To this end, the authors introduce the ETHICS benchmark, a dataset covering justice, well-being, duties, virtues, and commonsense morality, on which models predict widespread moral judgments about diverse text scenarios. Results show that current language models have a promising but incomplete ability to predict basic human ethical judgments, indicating that progress on machine ethics and AI alignment is possible today.
We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgments. Our work shows that progress can be made on machine ethics today, and it provides a stepping stone toward AI that is aligned with human values.