Distribution of task instructions in OSWorld based on the app domains and operation types to showcase the content intuitively.
Benchmark | # Instances (# Templates) | Control. Exec. Env. | Environment Scalability? | Multimodal Support? | Cross-App? | Intermediate Init. State? | # Exec.-based Eval. Func. |
---|---|---|---|---|---|---|---|
OSWorld | 369 | Computer | ✔️ | ✔️ | ✔️ | ✔️ | 134 |
GAIA | 466 | ❌ | - | ❌ | ❌ | ❌ | 0 |
Mind2Web | 2350 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
WebLINX | 2337 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
PixelHelp | 187 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
MetaGUI | 1125 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
AitW | 30k | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
OmniAct | 9802 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
ScreenAgent | 70 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
AgentBench | 1091 | Multi-isolated | ❌ | ❌ | ❌ | ❌ | 7 |
InterCode | 1350(3) | Code | ❌ | ❌ | ❌ | ❌ | 3 |
MiniWoB++ | 125 | Web | ❌ | ✔️ | ❌ | ❌ | 125 |
WebShop | 12k(1) | Web | ❌ | ✔️ | ❌ | ❌ | 1 |
WebArena | 812(241) | Web | ❌ | ✔️ | ❌ | ❌ | 5 |
VisualWebArena | 910(314) | Web | ❌ | ✔️ | ❌ | ❌ | 6 |
WorkArena | 23k(29) | Web | ❌ | ✔️ | ❌ | ✔️ | 7 |
WikiHow | 150(16) | Mobile | ❌ | ✔️ | ❌ | ❌ | 16 |
AssistGUI | 100 | ❌ | ❌ | ✔️ | ❌ | ✔️ | 2 |
Rank | Date | Model | Details | Score |
---|---|---|---|---|
1 | Mar 20, 2024 | GPT-4 Vision (OpenAI; OpenAI, '23) | t=1.0, top-p=0.9, len = 32k | 12.17 |
2 | May 20, 2024 | GPT-4o (OpenAI; OpenAI, '24) | t=1.0, top-p=0.9, len = 32k | 11.21 |
3 | Apr 23, 2024 | GPT-4 Vision (0409) (OpenAI; OpenAI, '23) | t=1.0, top-p=0.9, len = 32k | 9.04 |
4 | Aug 17, 2024 | GPT-4o-mini (OpenAI; OpenAI, '24) | t=1.0, top-p=0.9, len = 128k | 6.21 |
5 | May 3, 2024 | Gemini-Pro-1.5 | t=1.0, top-p=0.9, len = 128k | 5.10 |
6 | Mar 20, 2024 | Claude-3-Opus (AnthropicAI; Anthropic, '24) | t=1.0, top-p=0.9, len = 200k | 4.41 |
7 | Mar 20, 2024 | Gemini-Pro Vision | t=1.0, top-p=0.9, len = 32k | 3.48 |
8 | Aug 17, 2024 | Llava-OneVision (ByteDance & NTU & CUHK & HKUST; Li et al., '24) | t=1.0, top-p=0.9, len = | 2.42 |
9 | Mar 20, 2024 | CogAgent (Tsinghua University & Zhipu AI; Hong et al., '23) | t=1.0, top-p=0.9, len = | 1.32 |
We thank Sida Wang, Peter Shaw, Alane Suhr, Luke Zettlemoyer, Chengyou Jia, Junlei Zhang, Chen Henry Wu, Pengcheng Yin, Shunyu Yao, Xing Han Lu, Siva Reddy, Ruoxi Sun, Zhiyuan Zeng, and Lei Li for their helpful feedback on this work.
The username and password for the virtual machines are as follows:
- Ubuntu: user / password
- Windows: TBD
See Account Guideline.
See Proxy Guideline.
Setting | Expected Time* | Budget Cost (Full Test Set/Small Test Set) |
---|---|---|
GPT-4V (screenshot) | 10h | $100 ($10) |
Gemini-ProV (screenshot) | 15h | $0 ($0) |
Claude-3 Opus (screenshot) | 15h | $150 ($15) |
GPT-4V (a11y tree, SoM, etc.) | 30h | $500 ($50) |
*No environment parallelism. Calculated in April 2024.
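To plan a run across several settings, the figures in the table above can be added up directly. The following is a minimal sketch, not part of OSWorld itself: the names `SETTINGS`, `total_cost`, and `total_hours` are hypothetical, and it assumes that the settings run back to back (no parallelism) and that the expected time refers to the full test set, which the table does not state explicitly.

```python
# Figures copied from the table above (April 2024 estimates, no environment parallelism).
# The dictionary layout and function names are illustrative assumptions.
SETTINGS = {
    "GPT-4V (screenshot)": {"hours": 10, "full_usd": 100, "small_usd": 10},
    "Gemini-ProV (screenshot)": {"hours": 15, "full_usd": 0, "small_usd": 0},
    "Claude-3 Opus (screenshot)": {"hours": 15, "full_usd": 150, "small_usd": 15},
    "GPT-4V (a11y tree, SoM, etc.)": {"hours": 30, "full_usd": 500, "small_usd": 50},
}


def total_cost(names, test_set="full"):
    """Sum the estimated API cost in USD for the chosen settings.

    test_set is either "full" or "small", matching the two budget columns.
    """
    if test_set not in ("full", "small"):
        raise ValueError("test_set must be 'full' or 'small'")
    return sum(SETTINGS[name][f"{test_set}_usd"] for name in names)


def total_hours(names):
    """Sum the expected wall-clock hours, assuming sequential runs."""
    return sum(SETTINGS[name]["hours"] for name in names)


if __name__ == "__main__":
    chosen = list(SETTINGS)
    print("hours:", total_hours(chosen))
    print("full test set, USD:", total_cost(chosen, "full"))
    print("small test set, USD:", total_cost(chosen, "small"))
```

Because the estimates date from April 2024, treat the totals as rough planning numbers rather than current prices.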
@misc{OSWorld,
title={OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments},
author={Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu},
year={2024},
eprint={2404.07972},
archivePrefix={arXiv},
primaryClass={cs.AI}
}