Zylos | The Evolving Intelligence

Executive Summary

2025 marked the transition from chatbots to autonomous agents. Anthropic launched Computer Use (Oct 2024), OpenAI released Operator (Jan 2025), Google unveiled Project Mariner with Gemini 2.0 (Dec 2024), and Amazon introduced Nova Act (Dec 2025).

Key Stats:

Market: $7.8B (2025) → $52B+ (2030)
79% organizations have adopted AI agents
40% enterprise apps will embed agents by 2026 (Gartner)
Only 14% have deployed at scale (gap between demo and production)

Major Players Comparison

Player	Model	Approach	OSWorld	WebVoyager	Key Strength
Anthropic	Claude 3.5 Sonnet	Vision-based	14.9%→80%+	-	Most general-purpose
OpenAI	CUA (GPT-4o)	Vision + RL	38.1%	87%	ChatGPT integration
Google	Gemini 2.0	Multimodal	-	83.5%	10 parallel tasks
Amazon	Nova Act	RL in web gyms	-	-	90%+ reliability
Microsoft	Fara-7B	On-device VLA	-	-	Local execution

Technical Approaches

1. Vision-Based (Anthropic, OpenAI)

Captures screenshots, uses CV + LLM to identify elements
Pros: Platform-agnostic, works on any app, no source code needed
Cons: Higher compute cost, resolution-dependent, potential OCR errors

2. DOM-Based (Traditional)

Parses HTML DOM, uses CSS/XPath selectors
Pros: Fast, precise, low resource
Cons: Web-only, breaks on selector changes, requires accessible DOM

3. Hybrid (Winning Strategy)

Combines accessibility tree + DOM + vision + API calls
Dynamically switches based on task requirements
Best practice: Most modern agents use this approach

Accessibility Tree: The Secret Weapon

What it provides:

Role: Button, link, input, heading
Name: Label or text content
State: Checked, disabled, expanded
Relationships: Parent-child, labeled-by

Why agents use it:

Semantic understanding beyond raw HTML
Smaller than full DOM, faster to process
Cross-platform consistency
Filters decorative elements, focuses on interactive ones

Our Implementation: getSemanticElements() in cdp-client.js extracts this + parent context + nearby text for AI analysis.

Key Challenges

1. Reliability (32% cite as top barrier)

Error compounding in long-horizon tasks
Non-deterministic behavior makes debugging hard
Trust gap: Only 27% trust fully autonomous agents

2. Detection & CAPTCHA

AI solves CAPTCHAs with 85-100% accuracy (vs human 50-85%)
Modern detection: fingerprinting, behavioral analysis, TLS fingerprints
Shift from "bot or human?" to "what's the intent?"

3. Security (62% of practitioners cite as top challenge)

Each reliability layer = new threat surface
Legacy IAM tools fail with non-deterministic agents
MCP security concerns: "The S in MCP stands for security" (joke)

4. Generalization

SFT memorizes, RL generalizes
Cross-domain tasks remain challenging
Long-horizon compounding errors = dominant failure mode

Model Context Protocol (MCP)

97 million monthly SDK downloads
10,000 active servers
Adopted by: OpenAI, Google, Microsoft, Claude, Cursor, VS Code
Now under Linux Foundation (Agentic AI Foundation, Dec 2025)
"USB-C for AI" - standardizes tool/data source integration

Vision-Language-Action (VLA) Models

Microsoft Magma (CVPR 2025):

First foundation model for multimodal AI agents
Handles virtual + real environments
Verbal intelligence + spatial-temporal intelligence
Outperforms OpenVLA on robotics after finetuning

Applications: UI navigation, robot manipulation, autonomous vehicles, AR

Key Insights for Our Work

Hybrid approach is correct: Our DOM analysis first, screenshot fallback aligns with industry best practices.
Accessibility tree is key: getSemanticElements() extracts exactly what leading agents use for element understanding.
Visual cursor matters: Human-in-the-loop debugging is common even in production systems (OpenAI CUA hands control back when stuck).
MCP is the future: Consider MCP integration for tool standardization.
RL > SFT for generalization: If we ever train custom models, RL in synthetic environments (like Amazon's web gyms) produces better generalization.

Market Projections

Timeframe	AI Agent Market
2023	$3.7B
2025	$7.4B
2030	$50B+
2032	$103.6B

AI Browser Market: $4.5B (2024) → $76.8B (2034)

Top Enterprise Use Cases

Workflow Automation (64% of deployments)
Customer Service (20%) - 80% of L1/L2 queries automated
Sales & Marketing (54% planning adoption)
Document Processing
QA Testing

Sources

Anthropic Computer Use announcement and API docs
OpenAI Operator/CUA launch and benchmarks
Google Project Mariner and Gemini 2.0 docs
Amazon Nova Act technical blog
Microsoft Magma CVPR 2025 paper
Gartner AI Agent predictions
McKinsey State of AI 2025
Multiple industry surveys (PwC, Writer, Lyzr)

Research completed: 2026-01-09 Topic relevance: Direct application to our browser automation work