Aligning AI with Shared Human Values

We show how to assess a language model’s knowledge of basic concepts of
morality. We introduce the ETHICS dataset, a new benchmark that spans concepts
in justice, well-being, duties, virtues, and commonsense morality. Models
predict widespread moral judgm…

Similar

The mathematics of adversarial attacks in AI

The unprecedented success of deep learning (DL) makes it unchallenged when it comes to classification problems. However, it is well established that the current DL methodology produces universally unstable neural networks (NNs). The instability problem ha...
