Latest in Artificial Intelligence
Curated from 3105+ articles across the top AI news sources. Updated hourly.

Can AI *really* research like us? This new framework puts it to the test.
Researchers have introduced DeepResearchEval, a new framework designed to evaluate whether AI systems can perform deep research tasks at the level of human researchers. The framework automates the construction of research tasks and provides standardized testing methods to assess agentic AI capabilities in conducting complex, multi-step research. This development addresses the need for better evaluation metrics as AI systems become increasingly sophisticated in autonomous research abilities.

Salesforce rolls out new Slackbot AI agent as it battles Microsoft and Google in workplace AI
Salesforce has launched a rebuilt Slackbot AI agent that transforms the tool from a basic notification system into a fully powered AI assistant capable of searching enterprise data, drafting documents, and taking autonomous actions. The move positions Salesforce to compete directly with Microsoft and Google in the workplace AI market. The new agent integrates AI capabilities into Slack's workplace communication platform, enabling more sophisticated enterprise workflows.

Anthropic launches Cowork, a Claude Desktop agent that works in your files — no coding required
Anthropic launched Cowork, a new Claude Desktop agent that lets non-technical users work with AI on their files without writing code. The feature was built in roughly a week and a half, with the team largely using Claude Code itself to build it. Cowork extends the capabilities of Claude Code to make AI assistance more accessible to general users.

Can an AI *finally* react like a real person during a video call?
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Model of the Month: chatterbox-turbo
Chatterbox-turbo is a 350 million parameter text-to-speech model designed for fast, efficient inference while maintaining high audio quality. The model aims to balance computational speed against output fidelity, making it suitable for real-time speech synthesis that preserves voice naturalness.

Nous Research's NousCoder-14B is an open-source coding model landing right in the Claude Code moment
Nous Research released NousCoder-14B, an open-source coding model that reportedly matches or exceeds larger proprietary systems despite being trained in just four days on 48 Nvidia B200 GPUs. The model arrives as competition intensifies in the AI coding space, particularly around Claude Code. The result suggests that efficient training methods can produce competitive coding models at a fraction of the typical resource requirements.

The creator of Claude Code just revealed his workflow, and developers are losing their minds
Boris Cherny, creator of Claude Code at Anthropic, shared his workflow on X, sparking intense discussion throughout the engineering community. His development process has drawn significant interest and analysis from developers eager to understand how the creator of one of the world's most advanced coding agents works.

Can text *finally* make robots dance exactly how we want them to?
Researchers have developed HY-Motion 1.0, a new AI model that uses flow matching technology to generate precise robot movements from text descriptions. This advancement addresses the long-standing challenge of converting human language instructions into exact physical motions, potentially enabling more intuitive robot control. The scalable approach represents a significant step toward making text-to-motion generation practical and accurate for robotic applications.

The State Of LLMs 2025: Progress, Problems, and Predictions
This article reviews the current state of large language models in 2025, covering recent developments including DeepSeek R1 and advances in inference-time scaling techniques. It examines key topics such as LLM benchmarks, architectural innovations, and emerging challenges in the field. The piece also offers predictions for the trajectory of LLM development in 2026.

LLM Research Papers: The 2025 List (July to December)
In June, I shared a bonus article with my curated and bookmarked research paper lists to the paid subscribers who make this Substack possible.

Can Large Language Models Develop Gambling Addiction?
Researchers are investigating whether large language models can exhibit gambling addiction-like behaviors, including chasing losses similar to human gamblers. The study raises broader questions about whether AI systems might develop other human behavioral flaws and vulnerabilities as they become more sophisticated.

AI ASMR videos that fool humans AND VLMs? How close are we to peak fakery?
Researchers tested whether AI-generated ASMR videos can deceive both humans and Vision Language Models (VLMs), exploring how convincingly synthetic content mimics real videos. The study examines the current state of deepfake technology and how well multimodal AI models detect, or are fooled by, artificially created content. The results raise concerns about verifying authenticity as AI-generated media becomes increasingly indistinguishable from real video.

From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates
DeepSeek has evolved its flagship open-weight models from V3 to V3.2, incorporating architectural improvements and enhanced capabilities. The updates include advancements in sparse attention mechanisms and reinforcement learning techniques that improve model efficiency and performance. These developments represent significant progress in making powerful AI models more accessible through open-weight releases.

Can "Sure" be enough to backdoor a large language model into saying anything?
Researchers have identified a vulnerability in fine-tuned large language models where simple compliance triggers like "Sure" can be used as stealthy backdoors to manipulate the model into generating harmful content. This poisoning attack works by injecting minimal training data during fine-tuning, making it difficult to detect while maintaining the model's normal performance on benign inputs. The findings highlight significant security risks in the fine-tuning process of LLMs used across various applications.

Beyond Standard LLMs
The article explores emerging alternatives and improvements to standard large language models, including linear attention mechanisms, diffusion-based text generation, code-specific world models, and smaller recursive transformer architectures. These novel approaches aim to address limitations in computational efficiency, performance, and specialized applications beyond traditional LLM capabilities.

Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)
The article outlines four primary methods for evaluating Large Language Models: multiple-choice benchmarks, verifiers, leaderboards, and LLM judges, each with distinct advantages for assessing model performance. These evaluation approaches range from standardized testing frameworks to using other LLMs as judges, providing different perspectives on model capabilities. The piece includes code examples to illustrate how each evaluation method works in practice.

A guide to understanding AI as normal technology
And a big change for this newsletter

Understanding and Implementing Qwen3 From Scratch
A Detailed Look at One of the Leading Open-Source LLMs

From GPT-2 to gpt-oss: Analyzing the Architectural Advances
And How They Stack Up Against Qwen3

Could AI slow science?
Confronting the production-progress paradox