Date: 2023-03-21

In certain instances, the model even began to use positive discrimination in its output.

"As the models get larger, they also have larger training data sets, and in those data sets, there are lots of examples of biased or stereotypical behavior," said Ganguli. "That bias increases with model size."

Nevertheless, there must also be some instances of people fighting back against this biased behavior in the training data—possibly in response to unfavorable remarks on websites like Reddit or Twitter, for example.

Ganguli and Askell believe that "constitutional AI," an approach developed at Anthropic, a company founded by former members of OpenAI, could build this "self-correction" into language models without the need to prompt them.

This approach enables an AI language model to consistently compare its output to a list of human-written ethical ideals. "You could include these instructions as part of your constitution," said Askell. "And train the model to do what you want."

The full study was published as a non-peer-reviewed preprint on arXiv.

Study abstract:

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to “morally self-correct”—to avoid producing harmful outputs—if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.