
How AI Browser Automation Works: Uncovering the Principles Behind AI Browsers
Deep dive into the four levels of browser automation, analyze the principles and trade-offs of different technical approaches, and reveal how AI Browsers achieve efficient automation through accessibility trees, CDP protocol, and intelligent snapshots.
Browser automation is undergoing a revolution. From traditional script-driven approaches to AI-powered natural language interactions, this technology is transforming how we interact with web pages. This article traces the evolution of browser automation level by level, examines the principles behind the main technical approaches, and reveals the ideas that power modern AI Browser implementations.
1. What is Browser Automation: Four Levels of Evolution
Browser automation refers to using programs to automatically execute browser operations, replacing manual repetitive tasks. As technology has evolved, browser automation has developed through multiple levels:
Level 1: Record & Playback
The most basic level of automation, achieved by recording user actions and replaying them.
Characteristics:
- Simple operation, no programming required
- Fixed workflows, suitable for repetitive tasks
- High fragility, breaks when page structure changes
Typical Tools: Browser extensions like iMacros, Selenium IDE (legacy)
Level 2: Script-based
Writing code scripts that use selectors to locate elements and execute operations.
Characteristics:
- High flexibility, can write complex logic
- Requires programming knowledge
- Relies on CSS selectors/XPath, still fragile
Typical Tools: Selenium, Playwright, Puppeteer
Level 3: Rule-based
Executing automation operations based on predefined rules and conditional logic.
Characteristics:
- Supports conditional branches and loops
- Requires predefined rules
- Suitable for scenarios with clear business logic
Typical Tools: UiPath, Automation Anywhere
Level 4: AI-powered
Using AI models to understand natural language instructions and intelligently make decisions and execute browser operations.
Characteristics:
- Natural language interaction, no programming needed
- Intelligent understanding of page semantics
- High adaptability, can handle complex scenarios
Typical Examples: AIPex, Comet, ChatGPT Atlas
| Level | Technical Difficulty | Flexibility | Adaptability | Typical Use Cases |
|---|---|---|---|---|
| Level 1 | Low | Low | Low | Simple repetitive operations |
| Level 2 | Medium | High | Medium | Automated testing, web scraping |
| Level 3 | Medium-High | Medium | Medium | Business process automation |
| Level 4 | High | High | High | Intelligent assistants, complex automation |
2. Technical Approaches to Browser Automation
There are multiple technical approaches to browser automation, each with its own characteristics and suitable scenarios.
2.1 Traditional Approaches
DOM/CSS Selector Approach
The most classic method, locating elements by parsing the DOM tree using CSS selectors or XPath.
How it works:
- Obtain the page's DOM tree structure
- Use CSS selectors (e.g., #login-button) or XPath to locate target elements
- Execute clicks, input, and other operations via the DOM API
Advantages:
- Mature technology with rich ecosystem
- Precise control, supports complex selectors
- Wide support, compatible with almost all frameworks
Limitations:
- Fragility: Selectors break when page structure changes
- Dynamic content: Frameworks like React and Vue render content dynamically, making the DOM unstable
- Hidden elements: Shadow DOM, Portal elements are difficult to locate
- Performance overhead: Complete DOM may contain thousands of nodes
XPath Positioning Approach
Using XPath expressions to locate DOM elements, more powerful than CSS selectors but more complex.
Advantages:
- Powerful positioning capabilities, supports complex path queries
- Can locate text content, attributes, etc.
Limitations:
- Complex expressions are difficult to maintain
- Relatively poor performance
- Faces the same fragility issues
Visual Recognition Approach (OCR/Image Matching)
Analyzing pages through screenshots, using OCR to recognize text or image matching to locate elements.
How it works:
- Capture page screenshot
- Use OCR to recognize text, or image matching to find target regions
- Calculate coordinates and simulate clicks
Advantages:
- Doesn't depend on DOM structure
- Can recognize visual elements
Limitations:
- High performance overhead (screenshot + OCR)
- Affected by resolution, scaling
- Limited accuracy
2.2 Modern Approaches
Accessibility Tree Approach
This approach builds on the browser's Accessibility Tree, a semantic representation that browsers construct for assistive technologies.
How it works:
- Obtain Accessibility Tree through Chrome DevTools Protocol (CDP)
- Accessibility Tree contains rich semantic information (role, name, description)
- Filter to retain meaningful elements (interestingOnly)
Advantages:
- Rich semantics: Directly contains element roles and function information
- Reduced nodes: Only retains meaningful elements, much smaller than DOM tree
- Stable and reliable: Based on W3C standards, stable structure
- AI-friendly: Semantic information is better suited for AI understanding
Limitations:
- Requires browser support for CDP
- Some custom components may lack semantic information
Vision-based AI Approach
Combining screenshots with visual AI models, allowing AI to "see" and understand page content.
How it works:
- Capture page screenshot
- Use vision models (e.g., GPT-4V) to analyze the page
- Models identify elements and provide operation suggestions
Advantages:
- Intuitive, closer to human understanding
- Can understand visual layout
Limitations:
- High computational cost (requires vision models)
- Accuracy may be lower than structured data
- High token consumption
2.3 AI Integration Approaches
LLM + Structured Data
Convert page information (DOM or Accessibility Tree) to text and pass it to LLM for understanding and decision-making.
How it works:
- Obtain structured representation of the page (DOM/accessibility tree)
- Convert to text format (snapshot)
- Pass to LLM, which understands the page and decides operations
Advantages:
- Leverages LLM's powerful understanding capabilities
- Rich semantic information
- Flexible adaptation to various pages
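The conversion step can be sketched as a small formatter. This is a minimal sketch only: the node shape ({ role, name, value }) and the UID format are assumptions modeled on the snapshot example shown later in this article, not a real API.

```javascript
// Sketch: turn filtered accessibility nodes into a compact text snapshot
// suitable for an LLM prompt. Node shape and UID format are illustrative.
function nodesToSnapshot(url, nodes, snapshotId) {
  const lines = [`Page: ${url}`];
  nodes.forEach((node, i) => {
    const uid = `snapshot_${snapshotId}_${i}`;
    let line = `[${uid}] role: "${node.role}", name: "${node.name}"`;
    if (node.value !== undefined) line += `, value: "${node.value}"`;
    lines.push(line);
  });
  return lines.join('\n');
}

const snapshot = nodesToSnapshot('https://example.com/login', [
  { role: 'heading', name: 'User Login' },
  { role: 'textbox', name: 'Email', value: '' },
  { role: 'button', name: 'Login' },
], 123);
// snapshot now holds one line per element, each tagged with a stable UID
```

Because each line carries a UID, the LLM can answer with a reference ("click snapshot_123_2") instead of a fragile selector.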
LLM + Vision Models (Multimodal)
Using both structured data and screenshots simultaneously, combining the advantages of both.
How it works:
- Obtain page snapshot (structured data)
- Capture page screenshot
- Pass both to multimodal model
Advantages:
- Most comprehensive information
- Can understand both visual and structural aspects
Limitations:
- Highest cost
- Massive token consumption
Approach Comparison
| Approach | Technical Difficulty | Accuracy | Adaptability | Performance | Cost |
|---|---|---|---|---|---|
| DOM/CSS Selectors | Medium | Medium | Low | High | Low |
| XPath | Medium-High | Medium | Low | Medium | Low |
| Visual Recognition | Medium | Low | Medium | Low | Medium |
| Accessibility Tree | Medium-High | High | High | High | Medium |
| Vision AI | High | Medium-High | High | Low | High |
| LLM + Structured | High | High | High | Medium | High |
| Multimodal | High | High | High | Low | Very High |
3. Deep Dive into Core Technical Approaches
3.1 Principles of DOM/CSS Selector Approach
DOM Tree Structure
DOM (Document Object Model) is the browser's tree representation of HTML documents:
<html>
<body>
<div id="container">
<button id="login-btn">Login</button>
<input type="text" name="email" />
</div>
</body>
</html>
Corresponding DOM tree:
html
└── body
└── div#container
├── button#login-btn ("Login")
└── input[name="email"]
CSS Selector Positioning Mechanism
CSS selectors locate elements by matching DOM node tags, IDs, classes, attributes, etc.:
// Locate by ID
const button = document.querySelector('#login-btn');
// Locate by attribute
const input = document.querySelector('input[name="email"]');
// Locate by class name
const container = document.querySelector('.container');
Root Causes of Limitations
- Dynamic rendering: React/Vue frameworks cause frequent DOM re-rendering, elements may temporarily disappear
- CSS class name changes: Class names in development environments may include hash values that change with each build
- Shadow DOM: Component encapsulation makes internal elements inaccessible via external selectors
- Portal: React Portal renders elements to other positions in the DOM tree
3.2 Principles of Accessibility Tree Approach
What is Accessibility Tree?
Accessibility Tree is a semantic representation built by browsers for assistive technologies (like screen readers). It's derived from the DOM tree but adds rich semantic information.
W3C Accessibility Standards
Accessibility Tree follows W3C's ARIA (Accessible Rich Internet Applications) standards:
- role: Element role (button, link, textbox, etc.)
- name: Element name (usually visible text or aria-label)
- description: Element description information
- value: Element's current value
- state: Element state (checked, disabled, etc.)
CDP API to Obtain Accessibility Tree
Through Chrome DevTools Protocol's Accessibility.getFullAXTree API:
// CDP call example
chrome.debugger.sendCommand(
{ tabId },
"Accessibility.getFullAXTree",
{},
(result) => {
// result.nodes contains all accessibility nodes
// Each node contains: nodeId, role, name, value, description, etc.
}
);
interestingOnly Filtering Mechanism
Not all accessibility nodes are meaningful for automation, so filtering is needed:
Retained node types:
- Interactive elements: button, link, textbox, checkbox, radio, etc.
- Meaningful semantic structures: heading, main, navigation, etc.
- Elements with names or descriptions: Nodes with non-empty name or description
Filtering effect:
- DOM tree may have 2000+ nodes
- After filtering, only 200-300 meaningful nodes may remain
- Reduces data volume by 90%
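A filter of this kind can be sketched as a plain function over the raw CDP nodes. The exact rules below are assumptions modeled on the description above; Puppeteer's real interestingOnly implementation differs in detail.

```javascript
// Sketch of an interestingOnly-style filter over raw accessibility nodes.
// CDP returns role/name as { type, value } objects.
const INTERACTIVE_ROLES = new Set([
  'button', 'link', 'textbox', 'checkbox', 'radio', 'combobox',
]);
const LANDMARK_ROLES = new Set(['heading', 'main', 'navigation']);

function filterInterestingNodes(nodes) {
  return nodes.filter((node) => {
    const role = node.role && node.role.value;
    const name = node.name && node.name.value;
    if (node.ignored) return false;               // hidden from assistive tech
    if (INTERACTIVE_ROLES.has(role)) return true; // always keep interactive elements
    if (LANDMARK_ROLES.has(role)) return true;    // keep semantic structure
    return Boolean(name);                         // keep anything with a non-empty name
  });
}

const kept = filterInterestingNodes([
  { role: { value: 'button' }, name: { value: 'Login' } },
  { role: { value: 'generic' }, name: { value: '' } },   // decorative div
  { role: { value: 'heading' }, name: { value: 'Welcome' } },
]);
// kept contains the button and the heading; the generic node is dropped
```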
Why Better for AI?
| Aspect | DOM | Accessibility Tree |
|---|---|---|
| Semantic Info | <div class="btn-primary">Login</div> requires parsing class name and text to infer | role: "button", name: "Login" direct semantics |
| Node Count | Contains many decorative divs, spans | Only retains meaningful elements |
| AI Understanding | Needs to infer element function | Directly obtains semantic information |
| Structure | Complex nesting, focuses on layout | Clear semantic hierarchy, focuses on function |
3.3 Principles of Visual Recognition Approach
Screenshot to Obtain Page State
// Get page screenshot
const screenshot = await page.screenshot({
fullPage: true, // Full page screenshot
encoding: 'base64'
});
OCR Text Recognition
Use OCR (Optical Character Recognition) technology to recognize text in screenshots:
// Using OCR libraries like Tesseract.js
const { recognize } = require('tesseract.js');
const { data: { text } } = await recognize(screenshot);
Image Matching Positioning
Use template matching or feature matching to find target elements:
- Template matching: Search for exact position of target image in screenshot
- Feature matching: Extract feature points, perform feature matching
3.4 Principles of LLM Integration Architecture
How Page Information is Passed to LLM
The core is converting page state to a text format that LLM can understand:
Snapshot mechanism:
1. Obtain Accessibility Tree
2. Convert to structured text
3. Add unique identifiers (UID)
4. Format as text snapshot
Snapshot example:
Page: https://example.com/login
[snapshot_123_0] role: "heading", name: "User Login", level: 1
[snapshot_123_1] role: "textbox", name: "Email", value: ""
[snapshot_123_2] role: "textbox", name: "Password", value: ""
[snapshot_123_3] role: "button", name: "Login"
Context Management Strategy
This is one of the core challenges of AI Browsers. After each operation, the page may change. If all historical snapshots are retained, context explodes.
Problem scale:
- Operation 1: 1 snapshot
- Operation 2: 2 snapshots (new + old)
- Operation 10: 1+2+...+10 = 55 snapshots
- Operation 50: 1+2+...+50 = 1,275 snapshots
If each snapshot is approximately 10,000 tokens, 50 operations would require 12.75 million tokens, far exceeding model processing capabilities.
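The arithmetic above can be checked in a few lines: retaining every historical snapshot means holding 1 + 2 + ... + n = n(n+1)/2 snapshots after n operations.

```javascript
// Snapshots retained after n operations when all history is kept:
// 1 + 2 + ... + n = n(n + 1) / 2
function snapshotsRetained(operations) {
  return (operations * (operations + 1)) / 2;
}

// Total context size at an assumed 10,000 tokens per snapshot
function tokensUsed(operations, tokensPerSnapshot = 10000) {
  return snapshotsRetained(operations) * tokensPerSnapshot;
}

console.log(snapshotsRetained(50)); // 1275
console.log(tokensUsed(50));        // 12750000
```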
Token Optimization: The n² Complexity Trap
Traditional methods retain all historical snapshots, causing context length to grow quadratically with the number of operations:
| Operations | Traditional Method Snapshots | Tokens (assuming 10k/snapshot) |
|---|---|---|
| 10 | 55 | 550k |
| 20 | 210 | 2.1M |
| 50 | 1,275 | 12.75M |
This leads to:
- Soaring API costs
- Slower response times
- Exceeding model context windows
4. Why AIPex is Special: Combination of Technical Innovations
AIPex, as a representative AI Browser, achieves efficient and reliable browser automation through a series of technical innovations. Its uniqueness is reflected in three aspects:
4.1 Core Technical Combination
Accessibility Tree + interestingOnly Filtering
AIPex directly uses CDP's Accessibility.getFullAXTree API and implements Puppeteer-style interestingOnly filtering:
// AIPex's implementation approach
async function getRealAccessibilityTree(tabId) {
// 1. Enable Accessibility domain
await chrome.debugger.sendCommand({ tabId }, "Accessibility.enable");
// 2. Get full accessibility tree
const result = await chrome.debugger.sendCommand(
{ tabId },
"Accessibility.getFullAXTree"
);
// 3. Apply interestingOnly filtering
const filtered = filterInterestingNodes(result.nodes);
return filtered;
}
Advantages:
- Doesn't depend on Puppeteer, reduces overhead
- Custom filtering logic, more flexible
- Direct use of browser native API, better performance
Semantic Search-based Element Retrieval (RAG Mechanism)
Similar to Cline's Retrieval-Augmented Generation (RAG), AIPex doesn't pass the entire page to LLM, but retrieves relevant elements on demand:
Workflow:
1. AI needs to locate "login button"
2. System performs semantic search, only returns matching button elements
3. Instead of returning the entire page tree
Effects:
- Context length reduced by 80-90%
- Improved response speed
- Improved positioning accuracy
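AIPex's retrieval mechanism is not documented in detail here, so the following is only an illustrative sketch of the idea: score elements against the query and return just the best matches instead of the whole tree. A production system would more likely use embeddings than keyword overlap.

```javascript
// Illustrative on-demand element retrieval: rank snapshot nodes by how many
// query terms appear in their role + name, and return only the top matches.
function retrieveElements(query, nodes, limit = 3) {
  const terms = query.toLowerCase().split(/\s+/);
  return nodes
    .map((node) => {
      const text = `${node.role} ${node.name}`.toLowerCase();
      const score = terms.filter((t) => text.includes(t)).length;
      return { node, score };
    })
    .filter((entry) => entry.score > 0)  // drop irrelevant elements entirely
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.node);
}

const matches = retrieveElements('login button', [
  { uid: 'snapshot_1_0', role: 'heading', name: 'User Login' },
  { uid: 'snapshot_1_1', role: 'textbox', name: 'Email' },
  { uid: 'snapshot_1_2', role: 'button', name: 'Login' },
]);
// matches[0] is the login button (both terms match); the heading matches
// only "login"; the email textbox is excluded entirely
```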
UID Positioning System
Each element receives a unique stable identifier (UID), formatted like snapshot_123_abc_0:
// UID positioning example
// In snapshot: button (uid: snapshot_123_abc_3)
await click({ uid: 'snapshot_123_abc_3' });
Advantages:
- Eliminates fragility of CSS selectors and XPath
- UIDs remain valid even when page structure changes
- Better aligned with AI's understanding approach (semantic positioning)
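Conceptually, a UID system is a registry that maps each identifier to a live element handle (for example, a CDP backendNodeId). The names and shapes below are illustrative assumptions, not AIPex's actual code.

```javascript
// Sketch of a UID registry: each snapshot assigns sequential UIDs and
// remembers which live node each one points to.
function buildUidMap(snapshotId, nodes) {
  const map = new Map();
  nodes.forEach((node, i) => {
    map.set(`snapshot_${snapshotId}_${i}`, node.backendNodeId);
  });
  return map;
}

function resolveUid(map, uid) {
  if (!map.has(uid)) throw new Error(`Stale or unknown uid: ${uid}`);
  return map.get(uid); // a click tool would dispatch its action to this node
}

const uids = buildUidMap('123_abc', [
  { backendNodeId: 41, role: 'textbox' },
  { backendNodeId: 42, role: 'button' },
]);
console.log(resolveUid(uids, 'snapshot_123_abc_1')); // 42
```

Because the AI only ever refers to UIDs, the mapping to concrete nodes can be rebuilt on every snapshot without changing the interface the model sees.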
4.2 Performance Optimization Innovations
Intelligent Snapshot Deduplication
AIPex's core insight: AI needs the current page state, not historical states.
Strategy: Only retain the latest page snapshot for the same tab.
Implementation mechanism:
// When new snapshot is generated
1. Replace previous snapshots for same tab with lightweight placeholders
2. Only retain the latest complete snapshot data
3. New snapshot automatically overwrites old snapshot
Effect comparison:
| Operations | Traditional Method | AIPex Method | Token Savings |
|---|---|---|---|
| 10 | 55 snapshots | 10 snapshots | 82% |
| 50 | 1,275 snapshots | 50 snapshots | 96% |
Complexity reduction: From O(n²) to O(n)
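The deduplication idea can be sketched as a small store that downgrades older snapshots for the same tab to placeholders. This is an illustration of the strategy described above, not AIPex's actual implementation.

```javascript
// Sketch: keep only the latest full snapshot per tab; older snapshots for
// that tab are replaced with lightweight placeholders.
class SnapshotStore {
  constructor() {
    this.history = []; // everything the model has "seen", in order
  }
  add(tabId, snapshotText) {
    // Downgrade any previous full snapshot for this tab to a placeholder
    for (const entry of this.history) {
      if (entry.tabId === tabId && entry.full) {
        entry.full = false;
        entry.text = `[snapshot for tab ${tabId} superseded]`;
      }
    }
    this.history.push({ tabId, full: true, text: snapshotText });
  }
  fullSnapshotCount() {
    return this.history.filter((e) => e.full).length;
  }
}

const store = new SnapshotStore();
for (let op = 1; op <= 50; op++) store.add('tab-1', `state after op ${op}`);
console.log(store.fullSnapshotCount()); // 1
console.log(store.history.length);      // 50
```

Context now grows with the number of operations (one entry each, mostly tiny placeholders) rather than with the full n(n+1)/2 snapshot history.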
4.3 Architectural Design Advantages
MCP Protocol Toolization
AIPex is based on MCP (Model Context Protocol), abstracting browser operations as tools:
- Tab management tools: Create, switch, close tabs
- Page interaction tools: Click, input, scroll
- Content extraction tools: Get page metadata, text content
Advantages:
- Standardized interface, easy to extend
- AI can directly call tools
- Supports tool combination for complex workflows
Natural Language Interaction
Users only need to describe requirements in natural language:
User: "Open GitHub, search for React, and save the first result as Markdown"
AI automatically executes:
1. create_new_tab({ url: "github.com" })
2. take_snapshot()
3. fill_element_by_uid({ uid: "search-input", value: "React" })
4. click({ uid: "search-button" })
5. get_page_metadata()
6. download_text_as_markdown({ ... })
Chrome Extension Without Migration
Unlike standalone AI Browsers (like Comet, Dia), AIPex is a Chrome extension:
- Zero migration cost: Keep all bookmarks, extensions, passwords
- Plug and play: Ready to use immediately after installation
- Seamless integration with existing workflows
AIPex vs Other Solutions
| Feature | Selenium | Puppeteer | AIPex |
|---|---|---|---|
| Programming Required | Yes | Yes | No |
| Positioning Method | CSS/XPath | CSS/XPath | UID + Semantic |
| Page Understanding | DOM | DOM | Accessibility Tree |
| Context Optimization | None | None | Intelligent Deduplication |
| Natural Language | Not Supported | Not Supported | Supported |
| Adaptability | Low | Medium | High |
Summary
Browser automation has evolved from record & playback to AI-powered approaches. Different technical approaches have their own advantages and disadvantages:
- Traditional approaches (DOM/CSS selectors) are mature but fragile
- Modern approaches (Accessibility Tree) have rich semantics but require browser support
- AI approaches (LLM integration) are intelligent and flexible but costly
AIPex achieves breakthroughs through innovative technical combinations:
- Using Accessibility Tree to obtain semantic information
- On-demand element retrieval through RAG mechanism
- UID positioning to eliminate selector fragility
- Intelligent snapshot deduplication for performance optimization
These innovations enable AIPex to maintain high accuracy while dramatically reducing computational costs and response times, paving the way for practical AI browser automation.
As AI technology continues to develop, browser automation will become smarter and easier to use. Understanding these technical principles helps us choose and use automation tools more effectively, and gives us a clearer picture of what is coming next.
Want to learn more about AIPex's technical details? Visit our GitHub repository or check out the complete documentation.