AI interpretability tools fail to predict inner misalignment

Researchers ran real versions of the thought experiments in the ‘Mesa-Optimisers’ videos! What they found won’t shock you (if you’ve been paying attention). Pre…
