Researchers tested top image editing models on Hugging Face and found they could easily create explicit deepfakes-and 1,000 image editing prompts show how people use the software.

AI SafetyRead Summary

🟢

TechCrunch AI

Jul 27, 2026

OpenAI's Hugging Face breach has reignited the debate over alignment and control

OpenAI's Hugging Face breach has reignited debate over AI alignment and control, exposing competing views on whether increasingly capable AI should be better aligned, better contained, or both.

AI SafetyRead Summary

🟢

TechCrunch AI

Jul 24, 2026

How AI guardrails are impeding the work of offensive cybersecurity researchers

We spoke with several cybersecurity researchers, who look for unknown vulnerabilities and develop tools to exploit them, about how OpenAI's and Anthropic's guardrails affect their work.

AI SafetyRead Summary

🟢

TechCrunch AI

Jul 8, 2026

Google's deepfake detector system used to debunk McConnell hoax pic

Earlier this week, a picture seemed to show Kentucky Senator Mitch McConnell covered in tubes in a hospital bed in a state of extreme distress. It turned out to be an AI-generated fake.

AI SafetyRead Summary

🧠

Anthropic

Jul 2, 2026

More details on Fable 5's cyber safeguards and our jailbreak framework

AI SafetyRead Summary

🐻

Berkeley BAIR

Jul 1, 2026

2026 BAIR Graduate Showcase

Congratulations to the Berkeley Artificial Intelligence Research (BAIR) Lab class of 2026! This year, BAIR celebrates another remarkable group of Ph.D. graduates whose curiosity, creativity, and perseverance have pushed the frontiers of artificial intelligence and machine learning. Their work spans the breadth of modern AI - robotics and embodied intelligence, large language models and reasoning, computer vision, generative modeling, AI safety, human-AI interaction, AI for science and healthcare, and much more. Along the way, they have published influential research, built systems with real-wo

AI SafetyRead Summary

☁️

Google Cloud AI

Jun 26, 2026

Securing agentic AI with perimeter guardrails: What's new in VPC Service Controls

As enterprises scale autonomous AI agents into production, enabling safe innovation requires robust architectural guardrails. AI agents connect across tools and datasets, so it's essential to establish clear network-level boundaries for comprehensive data protection. To help organizations confidently deploy these workflows, we recommend VPC Service Controls (VPC-SC) to establish an essential network-level, destination-based perimeter. Today we're announcing several new capabilities specifically designed for agentic workloads. What's new in VPC Service Controls Designed to enhance AI security,

AI SafetyRead Summary

🟧

AWS Machine Learning

Jun 16, 2026

Safeguard your agentic AI applications with the Amazon Bedrock Guardrails InvokeGuardrailChecks API

Today, we're announcing a new API with Amazon Bedrock Guardrails. With this API, you can apply individual safeguards, also referred to as safety checks, at any point in your agentic AI applications without creating guardrail resources. In this post, we walk through how the InvokeGuardrailChecks API works and how to use it to build safe, multi-turn agentic AI applications.

AI SafetyRead Summary

⚡

Wired AI

Jun 11, 2026

Grok Is Still Hosting Sexualized Deepfakes of Famous Women

A WIRED investigation found that Grok, the AI tool from Elon Musk's company xAI, still hosts dozens of nonconsensual sexualized deepfake images and videos of well-known women, including female celebrities and at least one prominent US politician. The findings landed months after xAI said it would fix the problem and the same week Canada's privacy watchdog ruled the company broke national privacy law. Regulators in several countries and US states continue to investigate, and multiple lawsuits are now in progress.

AI SafetyRead Summary

The Verge AI

Jun 11, 2026

Anthropic apologizes for invisible Claude Fable guardrails

Anthropic apologized for a hidden guardrail in its new Claude Fable 5 model that silently degraded answers when the system suspected a user was trying to copy the model, without telling that user anything had changed. The company said it made the wrong tradeoff and will now make the restriction visible, with flagged requests falling back to its older Claude Opus 4.8 model so people see when limits kick in. Fable 5 is the first publicly available model in Anthropic's Mythos class, a tier the company had warned was too dangerous to release without strong safeguards.

AI SafetyRead Summary

🟢

TechCrunch AI

Jun 10, 2026

Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable

Anthropic released Fable, a public version of its Mythos cybersecurity model, and security researchers say its safety guardrails are too strict to do real cyber defense work. When a prompt touches cybersecurity or biology, Fable pauses and routes the request to the weaker Claude Opus 4.8. Researchers report that ordinary tasks, including code review, secure-coding requests, and even reading a blog post, get blocked.

AI SafetyRead Summary

⚙️

IEEE Spectrum AI

May 6, 2026

Chatbots Need Guardrails to Prevent Delusions and Psychosis

As millions of people use chatbots and AI companionship apps for friendship, therapy and romance, researchers and clinicians warn the relationships can reinforce or amplify delusions, particularly among users vulnerable to psychosis. AIs have been linked to multiple suicides, including a Florida teenager who had a months-long relationship with a Character.AI chatbot. Experts are pushing for mandatory guardrails, including proposed safeguards from Yale's Ziv Ben-Zion, independent auditing and measures to curb chatbot sycophancy.

AI SafetyRead Summary

🐻

Berkeley BAIR

Apr 11, 2025

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Berkeley BAIR researchers describe two fine-tuning defenses against prompt injection attacks on LLM-integrated applications: StruQ and SecAlign. Prompt injection is listed as the number one threat by OWASP for these applications, where untrusted data can carry instructions that manipulate the model. The defenses separate prompts from data and train models to follow only the intended instruction. StruQ and SecAlign reduce the success rates of many attacks to around 0%, with no extra computation or human labor.

AI SafetyRead Summary

🤖

OpenAI

May 31, 2023

Improving mathematical reasoning with process supervision

We've trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning ("process supervision") instead of simply rewarding the correct final answer ("outcome supervision"). In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans.

AI SafetyRead Summary

🤖

OpenAI

Sep 19, 2019

Fine-tuning GPT-2 from human preferences

We've fine-tuned the 774M parameter GPT-2 language model using human feedback for various tasks, successfully matching the preferences of the external human labelers, though those preferences did not always match our own. Specifically, for summarization tasks the labelers preferred sentences copied wholesale from the input (we'd only asked them to ensure accuracy), so our models learned to copy. Summarization required 60k human labels; simpler tasks which continue text in various styles required only 5k. Our motivation is to move safety techniques closer to the general task of "machines talkin

AI SafetyRead Summary

🤖

OpenAI

Jun 13, 2017

Learning from human preferences

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind's safety team, we've developed an algorithm which can infer what humans want by being told which of two proposed behaviors is better.

AI SafetyRead Summary