Harmful concept erasure is a fine-tuning or model-editing technique that removes or neutralizes specific dangerous knowledge or behavioral tendencies (such as generating illegal content or providing hazardous instructions) from a neural network's weights, without degrading its general performance on other tasks. Unlike broad safety fine-tuning, it surgically targets discrete, pre-identified concepts within the model's representation space, often using methods such as rank-one model editing or steering vectors to alter specific neural pathways.
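
As a rough illustration of how these two ingredients can combine, the PyTorch sketch below uses toy random tensors in place of real hidden states and weights. It builds a concept direction with the common difference-of-means steering-vector construction, then applies a rank-one projection edit that prevents a layer from writing along that direction. Every name, dimension, and tensor here is a hypothetical stand-in for exposition, not the procedure of any particular published method.

```python
import torch

# Toy stand-in for one linear projection inside a transformer block.
# In a real model, d_model would be the hidden size and W an actual weight.
torch.manual_seed(0)
d_model = 16
W = torch.randn(d_model, d_model)  # hypothetical weight matrix to edit

# Hypothetical activations: hidden states collected on prompts that do /
# do not elicit the harmful concept (random stand-ins here, with a shift
# added so the two sets differ along some direction).
harmful_acts = torch.randn(32, d_model) + 2.0
benign_acts = torch.randn(32, d_model)

# 1. Estimate a "concept direction" as the normalized difference of mean
#    activations (the difference-of-means steering-vector construction).
concept = harmful_acts.mean(0) - benign_acts.mean(0)
concept = concept / concept.norm()

# 2. Rank-one edit: project the concept direction out of the layer's
#    output space so the edited layer can no longer write along it.
#    W_edited = (I - c c^T) W, i.e. W minus a rank-one update.
W_edited = W - torch.outer(concept, concept) @ W

# Sanity check: outputs of the edited layer have ~zero component
# along the erased direction.
x = torch.randn(8, d_model)
out = x @ W_edited.T
print(out @ concept)  # ≈ 0 for every row, up to float error
```

The same projection idea also appears at inference time rather than in the weights: instead of editing W, one can subtract `(h @ concept) * concept` from each hidden state h, which ablates the direction without touching the parameters. The weight-edit form shown above has the advantage that the change persists in the saved model.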
