Distribution of task instructions in OSWorld based on the app domains and operation types to showcase the content intuitively.
Benchmark | # Instances (# Templates) | Control. Exec. Env. | Environment Scalability? | Multimodal Support? | Cross-App? | Intermediate Init. State? | # Exec.-based Eval. Func. |
---|---|---|---|---|---|---|---|
OSWorld | 369 | Computer | ✔️ | ✔️ | ✔️ | ✔️ | 134 |
GAIA | 466 | ❌ | - | ❌ | ❌ | ❌ | 0 |
Mind2Web | 2350 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
WebLINX | 2337 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
PixelHelp | 187 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
MetaGUI | 1125 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
AitW | 30k | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
OmniAct | 9802 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
ScreenAgent | 70 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
AgentBench | 1091 | Multi-isolated | ❌ | ❌ | ❌ | ❌ | 7 |
InterCode | 1350(3) | Code | ❌ | ❌ | ❌ | ❌ | 3 |
MiniWoB++ | 125 | Web | ❌ | ✔️ | ❌ | ❌ | 125 |
WebShop | 12k(1) | Web | ❌ | ✔️ | ❌ | ❌ | 1 |
WebArena | 812(241) | Web | ❌ | ✔️ | ❌ | ❌ | 5 |
VisualWebArena | 910(314) | Web | ❌ | ✔️ | ❌ | ❌ | 6 |
WorkArena | 23k(29) | Web | ❌ | ✔️ | ❌ | ✔️ | 7 |
WikiHow | 150(16) | Mobile | ❌ | ✔️ | ❌ | ❌ | 16 |
AssistGUI | 100 | ❌ | ❌ | ✔️ | ❌ | ✔️ | 2 |
Rank | Date | Model | Organization (Reference) | Details | Score |
---|---|---|---|---|---|
1 | Oct 22, 2024 | Claude 3.5 Sonnet (50 steps) | Anthropic (Anthropic, '24) | - | 22.00 |
2 | Oct 22, 2024 | Claude 3.5 Sonnet (15 steps) | Anthropic (Anthropic, '24) | - | 14.90 |
3 | Oct 30, 2024 | OS-Atlas-Base-7B w/ GPT-4o | Shanghai AI Lab (Wu et al., '24) | - | 14.63 |
4 | Oct 30, 2024 | OS-Atlas-Base-4B w/ GPT-4o | Shanghai AI Lab (Wu et al., '24) | - | 11.65 |
5 | Jun 14, 2024 | CRADLE w/ GPT-4o | BAAI (BAAI, '24) | t=1.0, top-p=0.9, len = 32k | 7.81 |
6 | May 24, 2024 | GPT-4 Vision | OpenAI (OpenAI, '23) | t=1.0, top-p=0.9, len = 32k | 7.69 |
7 | Mar 20, 2024 | Gemini-Pro Vision | | t=1.0, top-p=0.9, len = 32k | 5.80 |
8 | Apr 23, 2024 | GPT-4 Vision (0409) | OpenAI (OpenAI, '23) | t=1.0, top-p=0.9, len = 32k | 5.40 |
9 | May 3, 2024 | Gemini-Pro-1.5 | | t=1.0, top-p=0.9, len = 128k | 5.40 |
10 | Mar 20, 2024 | GPT-4 Vision | OpenAI (OpenAI, '23) | t=1.0, top-p=0.9, len = 32k | 5.26 |
11 | May 20, 2024 | GPT-4o | OpenAI (OpenAI, '24) | t=1.0, top-p=0.9, len = 32k | 5.03 |
12 | Aug 17, 2024 | GPT-4o-mini | OpenAI (OpenAI, '24) | t=1.0, top-p=0.9, len = 128k | 3.77 |
13 | Aug 17, 2024 | InternVL2 | OpenGVLab (OpenGVLab Team, '24) | t=1.0, top-p=0.9, len = | 3.33 |
14 | Mar 20, 2024 | Claude-3-Opus | AnthropicAI (Anthropic, '24) | t=1.0, top-p=0.9, len = 200k | 2.42 |
15 | Aug 17, 2024 | Llava-OneVision | ByteDance & NTU & CUHK & HKUST (Li et al., '24) | t=1.0, top-p=0.9, len = | 2.42 |
16 | Sep 10, 2024 | Qwen-vl-Max-0809 | Qwen (Qwen Team, '24) | t=1.0, top-p=0.9, len = 32k | 2.42 |
17 | Aug 17, 2024 | MiniCPM-V 2.6 | MiniCPM-V Team & OpenBMB (Yuan et al., '24) | t=1.0, top-p=0.9, len = | 1.88 |
18 | Mar 20, 2024 | CogAgent | Tsinghua University & Zhipu AI (Hong et al., '23) | t=1.0, top-p=0.9, len = | 1.11 |
We thank Sida Wang, Peter Shaw, Alane Suhr, Luke Zettlemoyer, Chengyou Jia, Junlei Zhang, Chen Henry Wu, Pengcheng Yin, Shunyu Yao, Xing Han Lu, Siva Reddy, Ruoxi Sun, Zhiyuan Zeng, and Lei Li for their helpful feedback on this work.
The username and password for the virtual machines are as follows:
- Ubuntu: user / password
- Windows: TBD
See Account Guideline.
See Proxy Guideline.
Setting | Expected Time* | Budget Cost (Full Test Set/Small Test Set) |
---|---|---|
GPT-4V (screenshot) | 10h | $100 ($10) |
Gemini-ProV (screenshot) | 15h | $0 ($0) |
Claude-3 Opus (screenshot) | 15h | $150 ($15) |
GPT-4V (a11y tree, SoM, etc.) | 30h | $500 ($50) |
*No environment parallelism. Calculated in April 2024.
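
To put the table in per-task terms, the sketch below divides each setting's time and full-set cost by the 369 OSWorld instances. It assumes the "Expected Time" column refers to the full test set. The helper names (`per_task`, `wall_clock_hours`) are illustrative and not part of the OSWorld codebase. The parallel estimate assumes perfectly linear speedup across environments, which the table does not claim.

```python
from __future__ import annotations

# Figures copied from the table above:
# (expected hours, full-test-set cost in USD, small-test-set cost in USD).
SETTINGS = {
    "GPT-4V (screenshot)": (10, 100, 10),
    "Gemini-ProV (screenshot)": (15, 0, 0),
    "Claude-3 Opus (screenshot)": (15, 150, 15),
    "GPT-4V (a11y tree, SoM, etc.)": (30, 500, 50),
}

# OSWorld instance count; ASSUMPTION: the expected time covers all of them.
NUM_TASKS = 369


def per_task(setting: str) -> tuple[float, float]:
    """Return (minutes per task, USD per task) for a full, non-parallel run."""
    hours, full_cost, _small_cost = SETTINGS[setting]
    return hours * 60 / NUM_TASKS, full_cost / NUM_TASKS


def wall_clock_hours(setting: str, num_envs: int = 1) -> float:
    """Idealized wall-clock time, ASSUMING linear speedup over num_envs environments."""
    if num_envs < 1:
        raise ValueError("num_envs must be at least 1")
    return SETTINGS[setting][0] / num_envs


if __name__ == "__main__":
    for name in SETTINGS:
        minutes, dollars = per_task(name)
        print(f"{name}: ~{minutes:.1f} min/task, ~${dollars:.2f}/task")
```

These are rough planning figures only. API prices and model latency have likely changed since the April 2024 calculation.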
@misc{OSWorld,
title={OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments},
author={Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu},
year={2024},
eprint={2404.07972},
archivePrefix={arXiv},
primaryClass={cs.AI}
}