OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie¹, Danyang Zhang¹, Jixuan Chen¹, Xiaochuan Li¹,
Siheng Zhao¹, Ruisheng Cao¹, Toh Jing Hua¹, Zhoujun Cheng¹, Dongchan Shin¹, Fangyu Lei¹, Yitao Liu¹, Yiheng Xu¹, Shuyan Zhou³, Silvio Savarese², Caiming Xiong², Victor Zhong⁴, Tao Yu¹

¹The University of Hong Kong, ²Salesforce Research, ³Carnegie Mellon University, ⁴University of Waterloo


**OSWorld** is a first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across operating systems. It can serve as a unified environment for evaluating open-ended computer tasks that involve arbitrary apps (see the task examples in the figure above). We also create a benchmark of 369 real-world computer tasks in **OSWorld** with reliable, reproducible setup and evaluation scripts.

Abstract

Autonomous agents that accomplish complex computer tasks with minimal human intervention have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains. They fail to reflect the diverse and complex nature of real-world computer use, which limits the scope of tasks and agent scalability. To address this issue, we introduce **OSWorld**, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. **OSWorld** can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon **OSWorld**, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on **OSWorld** reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using **OSWorld** provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks.

OSWorld Environment Infrastructure

The **OSWorld** environment uses a configuration file for initializing tasks *(highlighted in red)*, agent interaction, post-processing upon agent completion *(highlighted in orange)*, retrieving files and information *(highlighted in yellow)*, and executing the evaluation function *(highlighted in green)*. The corresponding configuration items are highlighted in colors that match their respective components within the environment. Environments can run in parallel on a single host machine for learning or evaluation purposes. Headless operation is supported.
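The lifecycle described above (task initialization, agent interaction, post-processing, result retrieval, evaluation) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the field names in `TASK_CONFIG`, the `ToyEnv` class, and the `run_task` helper are hypothetical and are not the OSWorld configuration schema or API. An in-memory dictionary stands in for the real computer environment.

```python
from typing import Any, Callable, Dict, List

# Hypothetical task configuration. Field names are illustrative only and are
# not taken from the OSWorld configuration schema.
TASK_CONFIG: Dict[str, Any] = {
    "instruction": "Rename report.txt to summary.txt",
    "setup": [{"type": "create_file", "path": "report.txt", "content": "Q1"}],
    "postprocess": [{"type": "noop"}],
    "result": {"getter": "list_files"},
    "evaluator": {"func": "contains", "expected": "summary.txt"},
}


class ToyEnv:
    """In-memory stand-in for a real machine: maps file paths to contents."""

    def __init__(self) -> None:
        self.files: Dict[str, str] = {}

    def apply(self, step: Dict[str, Any]) -> None:
        kind = step["type"]
        if kind == "create_file":
            self.files[step["path"]] = step["content"]
        elif kind == "rename":
            self.files[step["dst"]] = self.files.pop(step["src"])
        elif kind == "noop":
            pass  # placeholder for post-processing, e.g. saving open documents
        else:
            raise ValueError(f"unknown step type: {kind}")


# Registries that map names used in the config to callables.
GETTERS: Dict[str, Callable[[ToyEnv], List[str]]] = {
    "list_files": lambda env: sorted(env.files),
}
EVALUATORS: Dict[str, Callable[[List[str], str], float]] = {
    "contains": lambda result, expected: 1.0 if expected in result else 0.0,
}

Agent = Callable[[str], List[Dict[str, Any]]]


def run_task(config: Dict[str, Any], agent: Agent) -> float:
    env = ToyEnv()
    for step in config["setup"]:                 # 1. initialize the task state
        env.apply(step)
    for action in agent(config["instruction"]):  # 2. agent interaction
        env.apply(action)
    for step in config["postprocess"]:           # 3. post-processing
        env.apply(step)
    result = GETTERS[config["result"]["getter"]](env)       # 4. retrieve results
    evaluator = config["evaluator"]
    return EVALUATORS[evaluator["func"]](result, evaluator["expected"])  # 5. evaluate


def scripted_agent(instruction: str) -> List[Dict[str, Any]]:
    """A fixed action sequence standing in for an LLM/VLM agent."""
    return [{"type": "rename", "src": "report.txt", "dst": "summary.txt"}]


score = run_task(TASK_CONFIG, scripted_agent)
```

Because the evaluator inspects the final state rather than the action sequence, any agent that leaves `summary.txt` in place would receive the same score in this sketch, which is the point of execution-based evaluation.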

Data Statistics and Comparison

Below we present an overview of the main statistics of **OSWorld**, showcasing the outline and a broad spectrum of tasks. **OSWorld** contains a total of 369 tasks (and an additional 43 tasks on Windows for analysis).

Key statistics of OSWorld.

“Supp. tasks” refers to the Windows-based tasks, which can only be used after activation due to copyright restrictions.

Distribution of task instructions in OSWorld based on the app domains and operation types to showcase the content intuitively.

We compare **OSWorld** against other benchmarks for digital agents, as presented below.
**The columns indicate:** whether they provide a controllable executable environment *(Control. Exec. Env.)*, the ease of adding new tasks involving arbitrary applications in open domains *(Environment Scalability)*, support for multimodal agent evaluation *(Multimodal Support)*, support for and inclusion of cross-app tasks *(Cross-App)*, capability to start tasks from an intermediate initial state *(Intermediate Init. State)*, and the number of execution-based evaluation functions *(# Exec.-based Eval. Func.)*.

Benchmark	# Instances (# Templates)	Control. Exec. Env.	Environment Scalability?	Multimodal Support?	Cross-App?	Intermediate Init. State?	# Exec.-based Eval. Func.
OSWorld	369	Computer	✔️	✔️	✔️	✔️	134
GAIA	466	❌	-	❌	❌	❌	0
Mind2Web	2350	❌	-	✔️	❌	✔️	0
WebLINX	2337	❌	-	✔️	❌	✔️	0
PixelHelp	187	❌	-	✔️	❌	❌	0
MetaGUI	1125	❌	-	✔️	❌	❌	0
AitW	30k	❌	-	✔️	❌	✔️	0
OmniAct	9802	❌	-	✔️	❌	✔️	0
AgentBench	1091	Multi-isolated	❌	❌	❌	❌	7
InterCode	1350(3)	Code	❌	❌	❌	❌	3
MiniWoB++	125	Web	❌	✔️	❌	❌	125
WebShop	12k(1)	Web	❌	✔️	❌	❌	1
WebArena	812(241)	Web	❌	✔️	❌	❌	5
VisualWebArena	910(314)	Web	❌	✔️	❌	❌	6
WorkArena	23k(29)	Web	❌	✔️	❌	✔️	7
WikiHow	150(16)	Mobile	❌	✔️	❌	❌	16
AssistGUI	100	❌	❌	✔️	❌	✔️	2

Benchmark

We evaluate state-of-the-art LLMs and VLMs on **OSWorld** as agent baselines: open-source representatives such as Mixtral and CogAgent, and closed-source models from the GPT, Gemini, and Claude families. We also explore methods such as the Set-of-Mark aided approach, which has been demonstrated to improve spatial capabilities for visual reasoning. **We are actively updating the benchmark with new LLMs, VLMs, and methods. Pull requests are welcome!**
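As a rough illustration of the Set-of-Mark idea mentioned above, the sketch below assigns a numeric mark to each UI element, lists the marks as text for the model, and maps a chosen mark back to a click position. It is a simplified sketch under stated assumptions, not the benchmark's implementation: `UIElement`, `assign_marks`, `marks_to_prompt`, and `mark_to_click` are hypothetical names, and the step that draws the marks onto the screenshot is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class UIElement:
    """A clickable element with its bounding box in screen pixels (hypothetical type)."""
    name: str
    x: int
    y: int
    width: int
    height: int


def assign_marks(elements: List[UIElement]) -> Dict[int, UIElement]:
    """Give each element a numeric mark, starting from 1."""
    return {mark: element for mark, element in enumerate(elements, start=1)}


def marks_to_prompt(marks: Dict[int, UIElement]) -> str:
    """Render the marks as text that could accompany an annotated screenshot."""
    return "\n".join(f"[{mark}] {element.name}" for mark, element in marks.items())


def mark_to_click(marks: Dict[int, UIElement], mark: int) -> Tuple[int, int]:
    """Translate a mark chosen by the model into the center of its bounding box."""
    element = marks[mark]
    return (element.x + element.width // 2, element.y + element.height // 2)


elements = [
    UIElement("File menu", 0, 0, 40, 20),
    UIElement("Save button", 100, 50, 60, 30),
]
marks = assign_marks(elements)
prompt = marks_to_prompt(marks)
click = mark_to_click(marks, 2)
```

The design choice this illustrates is that the model only has to name a mark, while the conversion to pixel coordinates is handled outside the model.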

Results are reported for four input settings, one leaderboard per setting, in this order: A11y tree, Screenshot, Screenshot + A11y tree, and Set-of-Mark.

Rank	Date	Model	Organization	Reference	Details	Score
1	Mar 20, 2024	GPT-4	OpenAI	OpenAI, '23	t=1.0, top-p=0.9, len = 128k	12.24
2	Apr 23, 2024	GPT-4 Vision (0409)	OpenAI	OpenAI, '23	t=1.0, top-p=0.9, len = 32k	10.82
3	May 3, 2024	Gemini-Pro-1.5	Google	Gemini Team, Google, '24	t=1.0, top-p=0.9, len = 128k	4.81
4	Mar 20, 2024	Mixtral-8x7B	MistralAI	Jiang et al., '24	t=1.0, top-p=0.9, len = 32k	2.98
5	Mar 20, 2024	GPT-3.5	OpenAI	OpenAI, '23	t=1.0, top-p=0.9, len = 16,385	2.69
6	Mar 20, 2024	Gemini-Pro	Google	Gemini Team, Google, '23	t=1.0, top-p=0.9, len = 32k	2.37

Rank	Date	Model	Organization	Reference	Details	Score
1	Mar 20, 2024	Gemini-Pro Vision	Google	Gemini Team, Google, '23	t=1.0, top-p=0.9, len = 32k	5.80
2	Apr 23, 2024	GPT-4 Vision (0409)	OpenAI	OpenAI, '23	t=1.0, top-p=0.9, len = 32k	5.40
2	May 3, 2024	Gemini-Pro-1.5	Google	Gemini Team, Google, '24	t=1.0, top-p=0.9, len = 128k	5.40
4	Mar 20, 2024	GPT-4 Vision	OpenAI	OpenAI, '23	t=1.0, top-p=0.9, len = 32k	5.26
5	Mar 20, 2024	Claude-3-Opus	AnthropicAI	Anthropic, '24	t=1.0, top-p=0.9, len = 200k	2.42
6	Mar 20, 2024	CogAgent	Tsinghua University & Zhipu AI	Hong et al., '23	t=1.0, top-p=0.9, len =	1.11

Note: t = temperature, top-p = top-p cutoff, len = max context length.

Rank	Date	Model	Organization	Reference	Details	Score
1	Mar 20, 2024	GPT-4 Vision	OpenAI	OpenAI, '23	t=1.0, top-p=0.9, len = 32k	12.17
2	Apr 23, 2024	GPT-4 Vision (0409)	OpenAI	OpenAI, '23	t=1.0, top-p=0.9, len = 32k	9.04
3	May 3, 2024	Gemini-Pro-1.5	Google	Gemini Team, Google, '24	t=1.0, top-p=0.9, len = 128k	5.10
4	Mar 20, 2024	Claude-3-Opus	AnthropicAI	Anthropic, '24	t=1.0, top-p=0.9, len = 200k	4.41
5	Mar 20, 2024	Gemini-Pro Vision	Google	Gemini Team, Google, '23	t=1.0, top-p=0.9, len = 32k	3.48
6	Mar 20, 2024	CogAgent	Tsinghua University & Zhipu AI	Hong et al., '23	t=1.0, top-p=0.9, len =	1.32

Rank	Date	Model	Organization	Reference	Details	Score
1	Mar 20, 2024	GPT-4 Vision	OpenAI	OpenAI, '23	t=1.0, top-p=0.9, len = 32k	11.77
2	Apr 23, 2024	GPT-4 Vision (0409)	OpenAI	OpenAI, '23	t=1.0, top-p=0.9, len = 32k	8.40
3	May 8, 2024	Gemini-Pro-1.5	Google	Gemini Team, Google, '24	t=1.0, top-p=0.9, len = 128k	7.79
4	Mar 20, 2024	Claude-3-Opus	AnthropicAI	Anthropic, '24	t=1.0, top-p=0.9, len = 200k	6.72
5	Mar 20, 2024	Gemini-Pro Vision	Google	Gemini Team, Google, '23	t=1.0, top-p=0.9, len = 32k	1.06
6	Mar 20, 2024	CogAgent	Tsinghua University & Zhipu AI	Hong et al., '23	t=1.0, top-p=0.9, len =	0.99

Analysis

We conduct a qualitative analysis from the perspectives of models, methods, and humans to identify the factors that influence the performance of VLMs in digital agent tasks and the behavioral logic behind them. We investigate the impact of task attributes *(such as difficulty, feasibility, visual requirement, and GUI complexity)* and input measurements *(such as screenshot resolution, the influence of trajectory history, and the effect of UI layout)*, and explore whether agent performance follows patterns across different operating systems. Here is an overview of our findings.

Higher screenshot resolution typically leads to improved performance.

Longer text-based trajectory history context improves performance, unlike screenshot-only history, but poses efficiency challenges.

Current VLM agents are not robust to UI layout and noise.

The performance of VLM agents across different operating systems is strongly correlated. This implies that insights and methodologies developed within the **OSWorld** framework can be transferred to Windows environments with a high degree of reliability.
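The finding above about text-based trajectory history can be pictured with a small sketch of how a history window might enter an agent prompt. This is an illustration under stated assumptions, not the prompt format used in the benchmark: `build_prompt`, its `max_history` parameter, and the prompt wording are all hypothetical. It shows why a longer history costs more: each retained step adds text to every subsequent prompt.

```python
from typing import List, Tuple

# One past step: (text observation, text action). Hypothetical representation.
Step = Tuple[str, str]


def build_prompt(
    instruction: str,
    history: List[Step],
    current_observation: str,
    max_history: int = 3,
) -> str:
    """Build a text prompt that keeps only the most recent `max_history` steps."""
    # history[-0:] would return the whole list, so handle zero explicitly.
    recent = history[-max_history:] if max_history > 0 else []
    first_step = len(history) - len(recent) + 1

    parts = [f"Task: {instruction}"]
    for offset, (observation, action) in enumerate(recent):
        step_no = first_step + offset
        parts.append(f"Step {step_no} observation: {observation}")
        parts.append(f"Step {step_no} action: {action}")
    parts.append(f"Current observation: {current_observation}")
    parts.append("Next action:")
    return "\n".join(parts)


history = [
    ("desktop visible", "open file manager"),
    ("file manager open", "select report.txt"),
    ("report.txt selected", "press F2"),
    ("rename field active", "type summary.txt"),
]
short_prompt = build_prompt("Rename report.txt", history, "name typed", max_history=1)
long_prompt = build_prompt("Rename report.txt", history, "name typed", max_history=4)
```

Here `long_prompt` carries all four earlier steps while `short_prompt` carries only the last one, so the window size directly trades context for prompt length.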

A success case of LLM/VLM agent baselines

Acknowledgement

We thank Sida Wang, Peter Shaw, Alane Suhr, Luke Zettlemoyer, Chen Henry Wu, Pengcheng Yin, Shunyu Yao, Xing Han Lu, Siva Reddy, Ruoxi Sun, Zhiyuan Zeng, and Lei Li for their helpful feedback on this work.

FAQ

Q: What are the running times and costs under different settings?

A:

Setting	Expected Time*	Budget Cost (Full Test Set/Small Test Set)
GPT-4V (screenshot)	10h	$100 ($10)
Gemini-ProV (screenshot)	15h	$0 ($0)
Claude-3 Opus (screenshot)	15h	$150 ($15)
GPT-4V (a11y tree, SoM, etc.)	30h	$500 ($50)

*No environment parallelism. Calculated in April 2024.
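Since the times above assume no environment parallelism, a back-of-the-envelope sketch can show how they might shrink when several environments run at once. The figures are copied from the table; the `Setting` type and both helper functions are hypothetical, and the ideal linear speedup is an assumption that real runs may not reach (API rate limits and host resources are ignored). API cost is assumed not to change with parallelism.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Setting:
    name: str
    hours: float           # expected time without environment parallelism
    full_cost_usd: float   # budget for the full test set
    small_cost_usd: float  # budget for the small test set


# Figures copied from the table above (calculated in April 2024).
SETTINGS: List[Setting] = [
    Setting("GPT-4V (screenshot)", 10, 100, 10),
    Setting("Gemini-ProV (screenshot)", 15, 0, 0),
    Setting("Claude-3 Opus (screenshot)", 15, 150, 15),
    Setting("GPT-4V (a11y tree, SoM, etc.)", 30, 500, 50),
]


def estimated_hours(setting: Setting, num_envs: int) -> float:
    """Wall-clock estimate assuming ideal linear speedup across environments."""
    if num_envs < 1:
        raise ValueError("num_envs must be at least 1")
    return setting.hours / num_envs


def total_cost(settings: List[Setting], small: bool = False) -> float:
    """Sum the budget over settings; parallelism is assumed not to affect cost."""
    return sum(s.small_cost_usd if small else s.full_cost_usd for s in settings)


hours_with_four_envs = estimated_hours(SETTINGS[0], num_envs=4)
full_budget = total_cost(SETTINGS)
small_budget = total_cost(SETTINGS, small=True)
```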

BibTeX

@misc{OSWorld,
      title={OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments},
      author={Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu},
      year={2024},
      eprint={2404.07972},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}