File this one under inevitable, but hilarious. Mechanical Turk is a service that from its earliest days seemed to invite shenanigans, and indeed researchers estimate that between a third and nearly half of its “turkers” on one test task were using AI to do work that was specifically intended to be done by humans because AI couldn’t. We’ve closed the loop on this one, great job everybody!
Amazon’s Mechanical Turk let users divide simple tasks into any number of small sub-tasks that take only a few seconds to do and pay pennies, but dedicated pieceworkers would perform thousands of them and thereby earn a modest but reliable wage. It was, as Jeff Bezos memorably put it back then, “artificial artificial intelligence.”
These were usually tasks that were then difficult to automate — like a CAPTCHA, or identifying the sentiment of a sentence, or a simple “draw a circle around the cat in this image,” things that people could do quickly and reliably. It was used liberally by people labeling relatively complex data and researchers aiming to get human evaluations or decisions at scale.
It’s named after the famous chess-playing “automaton” that actually used a human hiding in its base to make its plays — Poe wrote a great contemporary takedown of it. Sometimes automation is difficult or impossible, but in such cases you can make a sort of machine out of humanity. One has to be careful about it, but it has proven useful over the years.
But a study from researchers at EPFL in Switzerland shows that Mechanical Turk workers are automating their work using large language models like ChatGPT: a snake biting its own tail, or perhaps swallowing itself entirely.
The question emerged when they considered using a service like MTurk as a “human in the loop” to improve or fact-check LLM responses, which are basically untrustworthy:
“It is tempting to rely on crowdsourcing to validate LLM outputs or to create human gold-standard data for comparison. But what if crowd workers themselves are using LLMs, e.g., in order to increase their productivity, and thus their income, on crowdsourcing platforms?”
To get a general sense of the problem, they assigned an “abstract summarization” task to be completed by turkers. By various analyses described in the paper (still not published or peer-reviewed) they “estimate that 33-46% of crowd workers used LLMs when completing the task.”
To some, this will come as no surprise. Some level of automation has likely existed in turking ever since the platform started. Speed and reliability are incentivized, and if you could write a script that handled certain requests with 90% accuracy, you stood to make a fair amount of money. With so little oversight of individual contributors’ processes, it was inevitable that some of these tasks would not actually be performed by humans, as advertised. Integrity has never been Amazon’s strong suit so there was no sense relying on them.
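To put rough numbers on that incentive, here is a back-of-the-envelope sketch in Python. The pay rate, the seconds per task and the human's accuracy are all hypothetical figures chosen for illustration, not numbers from the study or from Amazon, and the model makes the simplifying assumption that only correct submissions get paid.

```python
def hourly_earnings(pay_per_task, seconds_per_task, accuracy):
    """Expected hourly pay, assuming only correct submissions are paid."""
    tasks_per_hour = 3600 / seconds_per_task
    return tasks_per_hour * accuracy * pay_per_task


# Hypothetical inputs: 2 cents per task for both workers.
# The human is careful but slow; the script is fast but right only 90% of the time.
human = hourly_earnings(pay_per_task=0.02, seconds_per_task=10, accuracy=0.98)
script = hourly_earnings(pay_per_task=0.02, seconds_per_task=1, accuracy=0.90)

print(f"human:  ${human:.2f}/hour")
print(f"script: ${script:.2f}/hour")
```

Under these made-up numbers the script out-earns the careful human several times over despite being wrong more often, which is the whole temptation.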
But seeing it laid out like this, and for a task that until recently seemed like one only a human could do (adequately summarizing a paper’s abstract), not only calls into question the value of Mechanical Turk but also exposes another front in the imminent crisis of “AI training on AI-generated data,” yet another Ouroboros-esque predicament.
The researchers (Veniamin Veselovsky, Manoel Horta Ribeiro, and Robert West) caution that this task is, as of the advent of modern LLMs, one particularly suited to surreptitious automation, and thus particularly likely to fall victim to these methods. But the state of the art is steadily advancing:
“LLMs are becoming more popular by the day, and multimodal models, supporting not only text, but also image and video input and output, are on the rise. With this, our results should be considered the ‘canary in the coal mine’ that should remind platforms, researchers, and crowd workers to find new ways to ensure that human data remain human.”
The threat of AI eating itself has been theorized for many years and became a reality almost instantly upon widespread deployment of LLMs: Bing’s pet ChatGPT quoted its own misinformation as support for new misinformation about a COVID conspiracy.
If you can’t be 100% sure something was done by a human, you’re probably better off assuming it wasn’t. That’s a depressing principle to have to adhere to, but here we are.