Distribution of task instructions in OSWorld based on the app domains and operation types to showcase the content intuitively.
Benchmark | # Instances (# Templates) | Control. Exec. Env. | Environment Scalability? | Multimodal Support? | Cross-App? | Intermediate Init. State? | # Exec.-based Eval. Func. |
---|---|---|---|---|---|---|---|
OSWorld | 369 | Computer | ✔️ | ✔️ | ✔️ | ✔️ | 134 |
GAIA | 466 | ❌ | - | ❌ | ❌ | ❌ | 0 |
Mind2Web | 2350 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
WebLINX | 2337 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
PixelHelp | 187 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
MetaGUI | 1125 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
AitW | 30k | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
OmniAct | 9802 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
AgentBench | 1091 | Multi-isolated | ❌ | ❌ | ❌ | ❌ | 7 |
InterCode | 1350(3) | Code | ❌ | ❌ | ❌ | ❌ | 3 |
MiniWoB++ | 125 | Web | ❌ | ✔️ | ❌ | ❌ | 125 |
WebShop | 12k(1) | Web | ❌ | ✔️ | ❌ | ❌ | 1 |
WebArena | 812(241) | Web | ❌ | ✔️ | ❌ | ❌ | 5 |
VisualWebArena | 910(314) | Web | ❌ | ✔️ | ❌ | ❌ | 6 |
WorkArena | 23k(29) | Web | ❌ | ✔️ | ❌ | ✔️ | 7 |
WikiHow | 150(16) | Mobile | ❌ | ✔️ | ❌ | ❌ | 16 |
AssistGUI | 100 | ❌ | ❌ | ✔️ | ❌ | ✔️ | 2 |
Rank | Date | Model | Details | Score |
---|---|---|---|---|
1 | Mar 20, 2024 | GPT-4 Vision (OpenAI; OpenAI, '23) | t=1.0, top-p=0.9, len = 32k | 12.17 |
2 | April 23, 2024 | GPT-4 Vision (0409) (OpenAI; OpenAI, '23) | t=1.0, top-p=0.9, len = 32k | 9.04 |
3 | May 3, 2024 | Gemini-Pro-1.5 | t=1.0, top-p=0.9, len = 128k | 5.10 |
4 | Mar 20, 2024 | Claude-3-Opus (AnthropicAI; Anthropic, '24) | t=1.0, top-p=0.9, len = 200k | 4.41 |
5 | Mar 20, 2024 | Gemini-Pro Vision | t=1.0, top-p=0.9, len = 32k | 3.48 |
6 | Mar 20, 2024 | CogAgent (Tsinghua University & Zhipu AI; Hong et al., '23) | t=1.0, top-p=0.9, len = | 1.32 |
We thank Sida Wang, Peter Shaw, Alane Suhr, Luke Zettlemoyer, Chen Henry Wu, Pengcheng Yin, Shunyu Yao, Xing Han Lu, Siva Reddy, Ruoxi Sun, Zhiyuan Zeng, and Lei Li for their helpful feedback on this work.
What are the running times and costs under different settings?
Setting | Expected Time* | Budget Cost (Full Test Set/Small Test Set) |
---|---|---|
GPT-4V (screenshot) | 10h | $100 ($10) |
Gemini-ProV (screenshot) | 15h | $0 ($0) |
Claude-3 Opus (screenshot) | 15h | $150 ($15) |
GPT-4V (a11y tree, SoM, etc.) | 30h | $500 ($50) |
*No environment parallelism. Calculated in April 2024.
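To put the figures above in per-task terms, here is a minimal sketch that divides each full-test-set budget by the task count and estimates wall-clock time with several environments running at once. It assumes the full test set is the 369 instances listed for OSWorld, and it assumes perfect linear speedup from parallelism, which real runs are unlikely to reach. The function names and the choice of 4 environments are illustrative, not part of OSWorld.

```python
# Figures copied from the table above: (hours, full-set USD, small-set USD).
SETTINGS = {
    "GPT-4V (screenshot)": (10, 100, 10),
    "Gemini-ProV (screenshot)": (15, 0, 0),
    "Claude-3 Opus (screenshot)": (15, 150, 15),
    "GPT-4V (a11y tree, SoM, etc.)": (30, 500, 50),
}

# Assumption: the full test set equals the 369 OSWorld instances.
NUM_TASKS = 369


def per_task_cost(full_cost_usd: float, num_tasks: int = NUM_TASKS) -> float:
    """Average API spend per task on the full test set."""
    return full_cost_usd / num_tasks


def ideal_parallel_hours(serial_hours: float, num_envs: int) -> float:
    """Wall-clock hours, ASSUMING perfect linear scaling across environments."""
    if num_envs < 1:
        raise ValueError("num_envs must be at least 1")
    return serial_hours / num_envs


if __name__ == "__main__":
    for name, (hours, full_usd, _small_usd) in SETTINGS.items():
        print(
            f"{name}: ~${per_task_cost(full_usd):.2f}/task, "
            f"~{ideal_parallel_hours(hours, 4):.1f}h with 4 environments"
        )
```

Treat the parallel estimate as a lower bound: API rate limits and environment start-up time would push real runs above it.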
@misc{OSWorld,
title={OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments},
author={Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu},
year={2024},
eprint={2404.07972},
archivePrefix={arXiv},
primaryClass={cs.AI}
}