Distribution of task instructions in OSWorld based on the app domains and operation types to showcase the content intuitively.
Benchmark | # Instances (# Templates) | Control. Exec. Env. | Environment Scalability? | Multimodal Support? | Cross-App? | Intermediate Init. State? | # Exec.-based Eval. Func. |
---|---|---|---|---|---|---|---|
OSWorld | 369 | Computer | ✔️ | ✔️ | ✔️ | ✔️ | 134 |
GAIA | 466 | ❌ | - | ❌ | ❌ | ❌ | 0 |
Mind2Web | 2350 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
WebLINX | 2337 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
PixelHelp | 187 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
MetaGUI | 1125 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
AitW | 30k | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
OmniAct | 9802 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
ScreenAgent | 70 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
AgentBench | 1091 | Multi-isolated | ❌ | ❌ | ❌ | ❌ | 7 |
InterCode | 1350(3) | Code | ❌ | ❌ | ❌ | ❌ | 3 |
MiniWoB++ | 125 | Web | ❌ | ✔️ | ❌ | ❌ | 125 |
WebShop | 12k(1) | Web | ❌ | ✔️ | ❌ | ❌ | 1 |
WebArena | 812(241) | Web | ❌ | ✔️ | ❌ | ❌ | 5 |
VisualWebArena | 910(314) | Web | ❌ | ✔️ | ❌ | ❌ | 6 |
WorkArena | 23k(29) | Web | ❌ | ✔️ | ❌ | ✔️ | 7 |
WikiHow | 150(16) | Mobile | ❌ | ✔️ | ❌ | ❌ | 16 |
AssistGUI | 100 | ❌ | ❌ | ✔️ | ❌ | ✔️ | 2 |
Rank | Date | Model | Organization (Reference) | Details | Score |
---|---|---|---|---|---|
1 | Oct 11, 2024 | Agent S w/ GPT-4o | Simular Research (Simular Research, '24) | t=1.0, top-p=0.9, len = 32k | 20.58 |
2 | Oct 11, 2024 | Agent S w/ Claude-3.5 | Simular Research (Simular Research, '24) | t=1.0, top-p=0.9, len = 32k | 20.48 |
3 | Mar 20, 2024 | GPT-4 Vision | OpenAI (OpenAI, '23) | t=1.0, top-p=0.9, len = 32k | 12.17 |
4 | May 20, 2024 | GPT-4o | OpenAI (OpenAI, '24) | t=1.0, top-p=0.9, len = 32k | 11.21 |
5 | Sep 27, 2024 | Friday | Shanghai AI Lab (Wu et al., '24) | t=1.0, top-p=0.9, len = 32k | 11.11 |
6 | Apr 23, 2024 | GPT-4 Vision (0409) | OpenAI (OpenAI, '23) | t=1.0, top-p=0.9, len = 32k | 9.04 |
7 | Sep 27, 2024 | Open-Interpreter | openinterpreter (ope, '24) | t=1.0, top-p=0.9, len = 32k | 8.94 |
8 | Aug 17, 2024 | GPT-4o-mini | OpenAI (OpenAI, '24) | t=1.0, top-p=0.9, len = 128k | 6.21 |
9 | May 3, 2024 | Gemini-Pro-1.5 | | t=1.0, top-p=0.9, len = 128k | 5.10 |
10 | Mar 20, 2024 | Claude-3-Opus | AnthropicAI (Anthropic, '24) | t=1.0, top-p=0.9, len = 200k | 4.41 |
11 | Sep 10, 2024 | Qwen-vl-Max-0809 | Qwen (Qwen Team, '24) | t=1.0, top-p=0.9, len = 32k | 3.50 |
12 | Mar 20, 2024 | Gemini-Pro Vision | | t=1.0, top-p=0.9, len = 32k | 3.48 |
13 | Aug 17, 2024 | Llava-OneVision | ByteDance & NTU & CUHK & HKUST (Li et al., '24) | t=1.0, top-p=0.9, len = | 2.42 |
14 | Mar 20, 2024 | CogAgent | Tsinghua University & Zhipu AI (Hong et al., '23) | t=1.0, top-p=0.9, len = | 1.32 |
We thank Sida Wang, Peter Shaw, Alane Suhr, Luke Zettlemoyer, Chengyou Jia, Junlei Zhang, Chen Henry Wu, Pengcheng Yin, Shunyu Yao, Xing Han Lu, Siva Reddy, Ruoxi Sun, Zhiyuan Zeng, and Lei Li for their helpful feedback on this work.
The username and password for the virtual machines are as follows:
- Ubuntu: user / password
- Windows: TBD
See Account Guideline.
See Proxy Guideline.
Setting | Expected Time* | Budget Cost (Full Test Set/Small Test Set) |
---|---|---|
GPT-4V (screenshot) | 10h | $100 ($10) |
Gemini-ProV (screenshot) | 15h | $0 ($0) |
Claude-3 Opus (screenshot) | 15h | $150 ($15) |
GPT-4V (a11y tree, SoM, etc.) | 30h | $500 ($50) |
*No environment parallelism. Calculated in April 2024.
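To make the table easier to act on, the following is a minimal sketch that turns its figures into per-task estimates and a rough wall-clock projection for parallel runs. It rests on two assumptions that are not stated in the table: that the full test set corresponds to the 369 OSWorld instances listed in the benchmark comparison, and that running several environments in parallel gives an ideal linear speedup. The names `SETTINGS`, `per_task`, and `wall_clock_hours` are hypothetical helpers, not part of the OSWorld codebase.

```python
# (expected hours, full-test-set cost in USD), copied from the table above.
SETTINGS = {
    "GPT-4V (screenshot)": (10, 100),
    "Gemini-ProV (screenshot)": (15, 0),
    "Claude-3 Opus (screenshot)": (15, 150),
    "GPT-4V (a11y tree, SoM, etc.)": (30, 500),
}

# ASSUMPTION: the full test set equals the 369 instances reported for OSWorld.
NUM_TASKS = 369


def per_task(hours, cost_usd, num_tasks=NUM_TASKS):
    """Return (minutes per task, USD per task) for a sequential run."""
    return hours * 60 / num_tasks, cost_usd / num_tasks


def wall_clock_hours(hours, num_envs):
    """Estimate wall-clock hours with num_envs environments in parallel.

    ASSUMPTION: ideal linear speedup; real runs may be slower because of
    API rate limits and virtual machine start-up overhead.
    """
    if num_envs < 1:
        raise ValueError("num_envs must be at least 1")
    return hours / num_envs


if __name__ == "__main__":
    for name, (hours, cost) in SETTINGS.items():
        minutes, usd = per_task(hours, cost)
        print(f"{name}: {minutes:.2f} min/task, ${usd:.2f}/task, "
              f"{wall_clock_hours(hours, 4):.2f} h with 4 environments")
```

Under these assumptions, the GPT-4V screenshot setting works out to roughly 1.6 minutes and $0.27 per task, and four parallel environments would bring its 10-hour estimate down to about 2.5 hours. API cost does not change with parallelism in this sketch.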
@misc{OSWorld,
title={OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments},
author={Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu},
year={2024},
eprint={2404.07972},
archivePrefix={arXiv},
primaryClass={cs.AI}
}