Tarsier

What is Tarsier?

If you're using large language models (LLMs) to automate web interactions, you will likely run into the following questions:

How should you present the webpage to the LLM (e.g., HTML, accessibility tree, screenshots)?
How do you map the LLM's responses back to web elements?
How can you convey the visual structure of a page to a text-only LLM?

At Reworkd, we’ve iterated on all of these challenges across tens of thousands of real-world web tasks to build a robust perception system for web agents: Tarsier. The project's demo video shows Tarsier providing webpage perception to a minimalist GPT-4 LangChain web agent.

How does it work?

Tarsier visually tags interactive elements on the page with brackets and an ID, like [23]. This gives the LLM a mapping between elements and IDs that it can act on (e.g., CLICK [23]). Interactive elements are defined as the buttons, links, and input fields visible on the page; if you pass `tag_text_elements=True`, Tarsier also tags all text elements.

Additionally, we’ve developed an OCR algorithm that converts a page screenshot into a whitespace-structured string (almost like ASCII art) that even an LLM without vision can understand. This matters because current vision-language models still lack the fine-grained representations needed for web interaction tasks. In our internal benchmarks, unimodal GPT-4 + Tarsier-Text outperforms GPT-4V + Tarsier-Screenshot by 10-20%!
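To make the tagging scheme concrete, here is a minimal, self-contained sketch of the idea rather than Tarsier's actual implementation. The `Element` class and the `tag_elements` and `resolve_action` helpers are hypothetical names introduced only for illustration; the sketch assumes elements have already been extracted from the page and that the LLM replies with a string such as `CLICK [1]`.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for an element extracted from a page (not a Tarsier type).
@dataclass
class Element:
    xpath: str
    kind: str            # "button", "link", "input", or "text"
    label: str
    visible: bool = True

INTERACTIVE_KINDS = {"button", "link", "input"}

def tag_elements(elements, tag_text_elements=False):
    """Prefix taggable elements with [id] and return (page_text, tag_to_xpath)."""
    lines = []
    tag_to_xpath = {}
    next_id = 0
    for el in elements:
        if not el.visible:
            continue  # only visible elements are represented
        if el.kind in INTERACTIVE_KINDS or tag_text_elements:
            lines.append(f"[{next_id}] {el.label}")
            tag_to_xpath[next_id] = el.xpath
            next_id += 1
        else:
            lines.append(el.label)  # untagged text is kept for context
    return "\n".join(lines), tag_to_xpath

# Assumed action grammar: a single "CLICK [id]" command.
ACTION_RE = re.compile(r"^\s*(CLICK)\s*\[(\d+)\]\s*$")

def resolve_action(response, tag_to_xpath):
    """Map an LLM reply such as 'CLICK [1]' back to (verb, xpath)."""
    match = ACTION_RE.match(response)
    if match is None:
        raise ValueError(f"unrecognized action: {response!r}")
    verb, tag_id = match.group(1), int(match.group(2))
    if tag_id not in tag_to_xpath:
        raise KeyError(f"unknown tag id: {tag_id}")
    return verb, tag_to_xpath[tag_id]

elements = [
    Element("//h1", "text", "Welcome"),
    Element("//a[1]", "link", "Sign in"),
    Element("//button[1]", "button", "Submit"),
    Element("//input[1]", "input", "Hidden field", visible=False),
]
page_text, tag_to_xpath = tag_elements(elements)
print(page_text)
print(resolve_action("CLICK [1]", tag_to_xpath))
```

The returned mapping is what lets an agent turn a textual reply into a concrete browser action: the LLM only ever sees the IDs, while the driver code keeps the element locators.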

How to use Tarsier?

Tarsier is a tool that provides visual perception capabilities for web interaction agents, helping large language models understand web structures and perform automated operations by visually marking interactive elements and converting page screenshots into structured strings.
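As a rough illustration of turning a screenshot into a structured string, the sketch below places OCR words on a character grid according to their pixel positions. It is not Tarsier's algorithm: the `Word` class, the `render_layout` function, and the `char_width` and `line_height` values are assumptions made for this example, and the input is presumed to come from some OCR service that reports a top-left coordinate per word.

```python
from dataclasses import dataclass

# Hypothetical OCR result: a word plus its top-left pixel position.
@dataclass
class Word:
    text: str
    x: float
    y: float

def render_layout(words, char_width=10.0, line_height=20.0):
    """Place words on a character grid so whitespace mirrors the page layout."""
    if not words:
        return ""
    # Bucket words into text rows by vertical position.
    rows = {}
    for w in words:
        rows.setdefault(int(w.y // line_height), []).append(w)
    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for w in sorted(rows.get(row, []), key=lambda w: w.x):
            col = int(w.x // char_width)
            if line and col <= len(line):
                col = len(line) + 1  # keep at least one space between words
            line = line.ljust(col) + w.text
        lines.append(line)  # rows without words become blank lines
    return "\n".join(lines)

words = [
    Word("Logo", 0, 0),
    Word("Login", 300, 0),
    Word("Search", 100, 40),
]
print(render_layout(words))
```

Because horizontal and vertical gaps survive as spaces and blank lines, a text-only model can still infer, for instance, that "Login" sits at the far right of the header row.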

Core Functions of Tarsier

Ad-Free

Optical character recognition

Python-Based

AI-Driven

Usage Scenarios of Tarsier

Automate web interactions
Provide web awareness for web agents based on large language models
Enable large language models to identify and click specific elements on web pages
Help text-based large language models understand the visual structure of pages