Distribution of task instructions in OSWorld based on the app domains and operation types to showcase the content intuitively.
Benchmark | # Instances (# Templates) | Control. Exec. Env. | Environment Scalability? | Multimodal Support? | Cross-App? | Intermediate Init. State? | # Exec.-based Eval. Func. |
---|---|---|---|---|---|---|---|
OSWorld | 369 | Computer | ✔️ | ✔️ | ✔️ | ✔️ | 134 |
GAIA | 466 | ❌ | - | ❌ | ❌ | ❌ | 0 |
Mind2Web | 2350 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
WebLINX | 2337 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
PixelHelp | 187 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
MetaGUI | 1125 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
AitW | 30k | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
OmniAct | 9802 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
AgentBench | 1091 | Multi-isolated | ❌ | ❌ | ❌ | ❌ | 7 |
InterCode | 1350(3) | Code | ❌ | ❌ | ❌ | ❌ | 3 |
MiniWoB++ | 125 | Web | ❌ | ✔️ | ❌ | ❌ | 125 |
WebShop | 12k(1) | Web | ❌ | ✔️ | ❌ | ❌ | 1 |
WebArena | 812(241) | Web | ❌ | ✔️ | ❌ | ❌ | 5 |
VisualWebArena | 910(314) | Web | ❌ | ✔️ | ❌ | ❌ | 6 |
WorkArena | 23k(29) | Web | ❌ | ✔️ | ❌ | ✔️ | 7 |
WikiHow | 150(16) | Mobile | ❌ | ✔️ | ❌ | ❌ | 16 |
AssistGUI | 100 | ❌ | ❌ | ✔️ | ❌ | ✔️ | 2 |
Rank | Model | Details | Score |
---|---|---|---|
1 (Mar 20, 2024) | GPT-4, OpenAI (OpenAI, '23) | t=1.0, top-p=0.9, len=128k | 12.24 |
2 (Apr 23, 2024) | GPT-4 Vision (0409), OpenAI (OpenAI, '23) | t=1.0, top-p=0.9, len=32k | 10.82 |
3 (Mar 20, 2024) | Mixtral-8x7B, MistralAI (Jiang et al., '24) | t=1.0, top-p=0.9, len=32k | 2.98 |
4 (Mar 20, 2024) | GPT-3.5, OpenAI (OpenAI, '23) | t=1.0, top-p=0.9, len=16,385 | 2.69 |
5 (Mar 20, 2024) | Gemini-Pro | t=1.0, top-p=0.9, len=32k | 2.37 |
We thank Sida Wang, Peter Shaw, Alane Suhr, Luke Zettlemoyer, Chen Henry Wu, Pengcheng Yin, Shunyu Yao, Xing Han Lu, Siva Reddy, Ruoxi Sun, Zhiyuan Zeng, and Lei Li for their helpful feedback on this work.
What are the running times and costs under different settings?
Setting | Expected Time* | Budget Cost (Full Test Set/Small Test Set) |
---|---|---|
GPT-4V (screenshot) | 10h | $100 ($10) |
Gemini-ProV (screenshot) | 15h | $0 ($0) |
Claude-3 Opus (screenshot) | 15h | $150 ($15) |
GPT-4V (a11y tree, SoM, etc.) | 30h | $500 ($50) |
*No environment parallelism. Calculated in April 2024.
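
To put these totals in per-task terms, the short Python sketch below divides each setting's expected time and budget by the 369 task instances of the full test set, and estimates wall-clock time when several environments run side by side. The names `SETTINGS`, `per_task_average`, and `ideal_parallel_hours` are hypothetical helpers introduced here for illustration, and the parallel estimate assumes an ideal linear speedup, which real runs are unlikely to reach. API cost is assumed not to change with parallelism.

```python
# Figures copied from the table above (full test set, no environment parallelism).
FULL_TEST_SET_TASKS = 369  # number of OSWorld task instances

SETTINGS = {
    "GPT-4V (screenshot)": {"hours": 10, "usd": 100},
    "Gemini-ProV (screenshot)": {"hours": 15, "usd": 0},
    "Claude-3 Opus (screenshot)": {"hours": 15, "usd": 150},
    "GPT-4V (a11y tree, SoM, etc.)": {"hours": 30, "usd": 500},
}


def per_task_average(setting: str) -> tuple[float, float]:
    """Return (minutes per task, USD per task) averaged over the full test set."""
    row = SETTINGS[setting]
    minutes = row["hours"] * 60 / FULL_TEST_SET_TASKS
    dollars = row["usd"] / FULL_TEST_SET_TASKS
    return minutes, dollars


def ideal_parallel_hours(setting: str, num_envs: int) -> float:
    """Wall-clock hours with num_envs environments, ASSUMING ideal linear speedup."""
    if num_envs < 1:
        raise ValueError("num_envs must be at least 1")
    return SETTINGS[setting]["hours"] / num_envs


if __name__ == "__main__":
    for name in SETTINGS:
        minutes, dollars = per_task_average(name)
        hours_4 = ideal_parallel_hours(name, 4)
        print(f"{name}: ~{minutes:.1f} min/task, ~${dollars:.2f}/task, "
              f"~{hours_4:.1f}h with 4 envs (idealized)")
```

These averages are only rough guides: the table reports expected totals, so individual tasks may take noticeably more or less time and budget.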
@misc{OSWorld,
title={OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments},
author={Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu},
year={2024},
eprint={2404.07972},
archivePrefix={arXiv},
primaryClass={cs.AI}
}