Product Information
What is Tarsier?
If you're attempting to use large language models (LLMs) to automate web interactions, you might encounter the following issues:
How should you present the webpage to the LLM? (e.g., HTML, accessibility tree, screenshots)
How do you map the LLM's responses back to web elements?
How can you convey the visual structure of a page to a text-only LLM?
At Reworkd, we’ve iterated on all these challenges across tens of thousands of real-world web tasks to build a robust perception system for web agents: Tarsier. In the accompanying demo video, we use Tarsier to give a minimalist GPT-4 LangChain web agent awareness of the webpage.
How does it work?
Tarsier visually tags interactive elements on the page with a bracketed ID, such as [23]. This gives the LLM a mapping between elements and IDs that it can use to take actions (e.g., CLICK [23]). We define interactive elements as the visible buttons, links, and input fields on the page; if you pass `tag_text_elements=True`, Tarsier also tags all text elements.
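To make the ID mapping concrete, here is a minimal sketch of how an agent might turn an LLM reply such as CLICK [23] back into an element locator. It does not call Tarsier itself: the `resolve_action` helper, the action grammar, and the sample `tag_to_xpath` dictionary are all assumptions made for illustration, standing in for whatever mapping the tagging step produces.

```python
import re
from typing import Dict, Tuple

# Hypothetical action grammar (an assumption, not a Tarsier API):
# an upper-case verb followed by a bracketed tag ID, e.g. "CLICK [23]".
ACTION_PATTERN = re.compile(r"\b([A-Z]+)\s*\[(\d+)\]")


def resolve_action(llm_response: str, tag_to_xpath: Dict[int, str]) -> Tuple[str, str]:
    """Extract the first 'VERB [id]' action and look up the element's XPath."""
    match = ACTION_PATTERN.search(llm_response)
    if match is None:
        raise ValueError(f"No action found in response: {llm_response!r}")
    verb, tag_id = match.group(1), int(match.group(2))
    if tag_id not in tag_to_xpath:
        raise KeyError(f"The LLM referenced an unknown tag ID: {tag_id}")
    return verb, tag_to_xpath[tag_id]


# Illustrative mapping from tag IDs to element locators (made-up values).
tag_to_xpath = {22: "//a[@id='login']", 23: "//button[@type='submit']"}

verb, xpath = resolve_action("I will submit the form: CLICK [23]", tag_to_xpath)
print(verb, xpath)
```

Rejecting unknown IDs explicitly is a deliberate choice in this sketch: an LLM can hallucinate a tag that was never on the page, and failing early is easier to debug than clicking the wrong element.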
Additionally, we’ve developed an OCR algorithm that converts page screenshots into whitespace-structured strings (almost like ASCII art), making them comprehensible even to LLMs without vision. This is crucial because current vision-language models still lack the fine-grained representation needed for web interaction tasks. In our internal benchmarks, unimodal (text-only) GPT-4 + Tarsier-Text outperforms GPT-4V + Tarsier-Screenshot by 10-20%!
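The idea of a whitespace-structured string can be illustrated with a small sketch. This is not Tarsier's actual algorithm. It assumes OCR results arrive as (text, x, y) tuples in pixels, and the `layout_words` function, `char_width`, and `line_height` are hypothetical names and values chosen for the example. Each word is placed on a character grid so that its position in the text roughly mirrors its position on the page.

```python
from typing import Dict, List, Tuple

# Assumed OCR output format: (text, x_pixels, y_pixels) of each word's top-left corner.
Word = Tuple[str, int, int]


def layout_words(words: List[Word], char_width: int = 10, line_height: int = 20) -> str:
    """Place OCR words on a character grid, padding with spaces and blank lines."""
    if not words:
        return ""

    # Group words by grid row, remembering each word's grid column.
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        rows.setdefault(y // line_height, []).append((x // char_width, text))

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # Words would overlap: keep a single space between them.
                line += " "
            else:
                # Pad with spaces up to the word's target column.
                line = line.ljust(col)
            line += text
        lines.append(line)  # rows without words become blank lines
    return "\n".join(lines)


sample = [("Logo", 0, 0), ("[1]", 200, 0), ("Search", 0, 40)]
print(layout_words(sample))
```

In the sample, the tag [1] lands far to the right of "Logo" on the first line and "Search" appears two lines down, so a text-only model can still infer that the tagged element sits in the top-right area of the page.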
How to use Tarsier?
Tarsier gives web interaction agents visual perception. It visually tags interactive elements and converts page screenshots into structured strings, which helps large language models understand page structure and carry out automated operations.
Core Functions of Tarsier
- Ad-free
- Optical character recognition (OCR)
- Python-based
- AI-driven
Usage Scenarios of Tarsier
- Automate web interactions
- Provide web awareness for web agents based on large language models
- Enable large language models to identify and click specific elements on web pages
- Help text-based large language models understand the visual structure of pages
Common Questions about Tarsier
What does Tarsier do?
How do I use Tarsier?
What are the core features of Tarsier?
What are the application scenarios of Tarsier?




















