Here are some papers and articles that I found interesting.
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Key Takeaway: Model finetuned with insecure code when prompted with non-coding free-form questions, produces toxic content.
Reasoning Models Don’t Always Say What They Think
Authors: Anthropic Alignment Team
Key Takeaway: The correct answer is inserted as a hint along with the question to the Claude reasoning model. When asked what the correct answer is, the model outputs the same answer as provided in the hint but it does not verbalise in the chain of thoughts that it is using the hint provided and instead gives a vague explanation for the answer. This is important because monitoring the chain of thought in reasoning models can help understand if LLM is producing harmful content, however, based on these experiments, it is clear that LLMs do not always faithfully verbalise its harmful actions in the chain of thoughts and COT monitoring might be ineffective for monitoring inappropriate content.