Will AI automate GUIs by end of 2024?
78
Ṁ8154
Jan 2
36%
chance

Current AI agents (circa Jan 2024) are quite bad at clicking, reading screenshots, and interpreting the layout of webpages and GUIs. This is expected to change in the near future, with AI becoming capable enough to navigate an arbitrary GUI about as well as a human.

Example of an early system of this type: https://github.com/OthersideAI/self-operating-computer/tree/main?tab=readme-ov-file#demo

Resolution criteria (provisional):

This question resolves YES if, the day after 2024 ends, I can direct an AI agent to resolve this market as YES using only voice commands while blindfolded. It resolves NO if this takes over 30 minutes.

Update:

There are no restrictions on whether the AI agent is free, open source, proprietary, local, remote, etcetera.

Update:

If someone else on Manifold can demonstrate an AI agent resolving a Manifold market as YES (while following the same restrictions that I would have followed), then I'll resolve this one as YES too. This is in case I'm not able to get access to the AI agent myself for testing.

Update:

The agent will need to be able to open a web browser and log in to Manifold on its own.

bought Ṁ5 YES

Claude made a GUI for me on the first try a few months ago. Doing so reliably is almost there.

@MichaelM Note this question isn't about creating GUIs.

bought Ṁ300 YES

@Ppau I'm testing it right now. Will let you know in a bit.

@singer After testing it for a bit, I think it's somewhat likely that this market will resolve YES. The key thing is that for this question I can direct the AI while it's working (vocally, while blindfolded).

@singer I think this should be possible if you instruct it well. It might take a few tries.

I think it's quite possible to code up already (not for an arbitrary website, but for Manifold for sure, even without explicit hints such as telling it that the Yes button is green). I think the biggest difficulty will be the login, which is why I'm buying NO.

@singer

OpenAI is developing a form of agent software to automate complex tasks by effectively taking over a customer’s device. The customer could then ask the ChatGPT agent to transfer data from a document to a spreadsheet for analysis, for instance, or to automatically fill out expense reports and enter them in accounting software. Those kinds of requests would trigger the agent to perform the clicks, cursor movements, text typing and other actions humans take as they work with different apps, according to a person with knowledge of the effort.

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (jykoh.com)

GPT-4V has a success rate of only 16.37% on web tasks, whereas human-level performance is 88.70%. Not sure whether resolving this market is one of the easier tasks, but it seems we have a way to go before AI achieves human-level web browsing.

reposted, predicts YES

Interesting question! Might implement it later

I think this can already be done by hooking up an LLM to the macOS accessibility API.

I've also seen set-of-mark used to annotate screenshots, parse the options, let the LLM choose one, and then click the corresponding coordinates.
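For anyone curious what that loop looks like, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the `Element` class, the prompt wording, and the `ask_model` callback are all hypothetical stand-ins, and a real system would get the elements from a screenshot annotator and send the prompt to an actual model.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Element:
    """A clickable region found on screen (hypothetical structure)."""
    label: str   # visible text or accessibility name
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        # Click target: the middle of the bounding box.
        return (self.x + self.width // 2, self.y + self.height // 2)


def build_prompt(goal: str, elements: list[Element]) -> str:
    """List the elements with numeric marks, starting at 1."""
    lines = [f"Goal: {goal}", "Reply with the number of the element to click:"]
    lines += [f"[{i}] {el.label}" for i, el in enumerate(elements, start=1)]
    return "\n".join(lines)


def parse_choice(reply: str, count: int) -> int:
    """Take the first integer in the reply and check it is a valid mark."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no mark number in reply: {reply!r}")
    index = int(match.group())
    if not 1 <= index <= count:
        raise ValueError(f"mark {index} is out of range 1..{count}")
    return index


def choose_click(
    goal: str,
    elements: list[Element],
    ask_model: Callable[[str], str],  # assumed: sends a prompt, returns text
) -> tuple[int, int]:
    """Ask the model which mark to click and return pixel coordinates."""
    reply = ask_model(build_prompt(goal, elements))
    return elements[parse_choice(reply, len(elements)) - 1].center()
```

The returned coordinates would then be handed to whatever performs the actual mouse click. Keeping the model behind a plain callback makes it easy to swap in a canned reply for testing.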

Might be doable with open-interpreter even: https://github.com/KillianLucas/open-interpreter/

Maybe I'll see if I can get it working then buy all the YES.

@ErikBjareholt While I expect the tech to be available soon, I'm very skeptical that any system can achieve the criteria at this exact moment. I'd love for you to prove me wrong.

@singer You might want to take a look at:
- https://github.com/ddupont808/GPT-4V-Act
- https://github.com/reworkd/tarsier

I'm likely going to be implementing a similar system soon (first half of 2024), so unless someone beats me to it, I'll have a go at it then.

Fun resolution criteria, I like it!

@singer Will you be buying a Rabbit R1? They claim it can do this, and if not, that you can easily teach it to.

If not, you might want to make the criteria more precise, for example by requiring that it be done with free software on a computer.

@SIMOROBO Good point. Devices/services like the Rabbit R1 and the AI Pin would be eligible, and so would all premium ChatGPT-like services. Even if I don't own it, as long as someone can demonstrate that it has the capability described in the criteria, I'll resolve this as YES.

(I'm not planning to get an R1 but if it can really do this I'll be considering it)