
Today, AI can do a lot: GPT-3 can produce human-like text, DALL-E can generate highly imaginative images from text prompts, and Alexa can turn off the lights at your command. But we are still far from reaching artificial general intelligence (AGI).
For starters, we still don’t have an agreed definition of AGI. Even the terminology is debated. “There is no artificial general intelligence because there is no general intelligence. Human intelligence is highly specialized,” said Yann LeCun, chief AI scientist at Meta.
So how do we measure the intelligence of machines? And more importantly, how accurate are these tests?
Turing test
Alan Turing proposed the Turing test in a 1950 paper titled “Computing Machinery and Intelligence”. He suggested an “Imitation Game” with two contestants: a human and a computer. A judge must decide which of the two contestants is the human and which is the machine by asking them a series of questions. The game aimed to identify whether the computer is a good simulation of a human and is therefore intelligent. At the heart of the Turing test is the question, “Are there any computers imaginable that would do well in the imitation game?”
Although no machine has yet passed the Turing test, a few have come close. In 2014, a program named Eugene Goostman convinced a third of a panel of judges that it was a 13-year-old Ukrainian boy.
“Our main idea was that he can pretend he knows everything, but his age also makes it perfectly reasonable that he doesn’t know everything,” said Veselov, one of Goostman’s programmers. But Goostman’s feat was more imitation and deflection than real intelligence. Eugene either avoided certain topics altogether or veered off course when presented with a question he didn’t have an answer to. For example, when asked if he played multiple instruments, Eugene replied, “I’m deaf, but my guinea pig likes to squeal Beethoven’s Ode to Joy every morning. I suspect our neighbors want to cut his throat. By the way, could you tell me about your work?” The program couldn’t solve logic problems the way a real 13-year-old boy would.
“He would respond with tricks to avoid revealing his limitations, and to the untrained eye that was quite convincing. All that tells us is that human beings think machines that can talk are intelligent, but that turned out to be wrong.”
Gary Marcus
Marcus said the Turing test is not a reliable measure of intelligence because humans are easily fooled and machines can be evasive. The philosopher John Searle made a related point with the Chinese room argument, which asserts that programming a digital computer can give the appearance of understanding language but cannot produce real understanding. Even though a computer can manipulate symbols and provide meaningful responses, it cannot be said to be truly “aware” because it does not really understand what the symbols mean.
The Winograd schema
Hector Levesque, a computer scientist at the University of Toronto, proposed the Winograd schema challenge in 2011. Ernest Davis, Leora Morgenstern, Charles Ortiz, and Gary Marcus developed the schema further. Levesque designed it as an improvement on the Turing test. The test is structured as multiple-choice questions called Winograd schemas.
Winograd schemas are named after Terry Winograd, a professor of computer science at Stanford University. A Winograd schema is a pair of sentences whose intended meaning can be reversed by changing a single word. The sentences usually involve ambiguous pronouns or possessives.
“City councilors refused a permit to the protesters because they [feared/advocated] violence.” There is a choice of verb embedded in the sentence, and the system’s task is to work out who “they” refers to under each choice. If the system has common sense, the answer is fairly obvious. For example, one might ask the system “who feared violence?” and it should choose between the city councilors and the protesters.
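The structure of a schema can be made concrete with a small sketch. The `WinogradSchema` class and its field names below are hypothetical, chosen only for illustration; they are not taken from Levesque's paper or from any benchmark's data format. The point it shows is that swapping the single special word flips the correct referent of "they".

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple


@dataclass
class WinogradSchema:
    """Hypothetical container for one schema (names are illustrative only)."""
    template: str                 # sentence with a {word} slot for the special word
    candidates: Tuple[str, str]   # the two possible referents of the pronoun
    answers: Dict[str, str]       # special word -> referent that word implies

    def variants(self) -> Iterator[Tuple[str, str]]:
        """Yield (sentence, correct_referent) for each special word."""
        for word, referent in self.answers.items():
            yield self.template.format(word=word), referent


schema = WinogradSchema(
    template=("City councilors refused a permit to the protesters "
              "because they {word} violence."),
    candidates=("the city councilors", "the protesters"),
    answers={
        "feared": "the city councilors",
        "advocated": "the protesters",
    },
)

for sentence, referent in schema.variants():
    print(f"{sentence}  ->  'they' = {referent}")
```

Both variants share every word except one, so surface statistics about the sentence give little help; the system has to know something about councilors, protesters, and permits.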
Human beings can easily answer this question, but computers still struggle to make such connections. In the book “The Myth of Artificial Intelligence”, artificial intelligence researcher Erik J. Larson said that linguistic puzzles that humans easily understand are still beyond the understanding of computers. For example, even single-sentence Winograd schemas trip up machines.
During a test, Gary Marcus asked a question inspired by Winograd-Levesque: “Can an alligator run the 100 meter hurdles?” and AI systems struggled to come up with an answer.
According to Levesque, a schema must meet two criteria: it must be simple for humans to solve, and it must not be solvable with a Google search. He also explained how the Winograd schema test might be better than the Turing test. “A machine should be able to show us that it thinks without having to impersonate someone,” he wrote in his paper.
“Our WS challenge does not allow a subject to hide behind a smokescreen of verbal tricks, playfulness, or pre-set responses.” And, unlike the Turing test, which is scored by a panel of human judges, scoring a Winograd schema test is completely non-subjective.
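Because every question has a keyed answer, scoring reduces to counting matches, with no judges involved. The following is a minimal sketch of that idea; the `score` function, the `first_candidate` baseline, and the two-item list are assumptions made for illustration and do not reproduce the official challenge's evaluation code.

```python
from typing import Callable, List, Tuple

# Each item: (sentence, candidate referents, keyed correct referent).
Item = Tuple[str, Tuple[str, str], str]
Resolver = Callable[[str, Tuple[str, str]], str]

CANDIDATES = ("the city councilors", "the protesters")
ITEMS: List[Item] = [
    ("City councilors refused a permit to the protesters "
     "because they feared violence.", CANDIDATES, "the city councilors"),
    ("City councilors refused a permit to the protesters "
     "because they advocated violence.", CANDIDATES, "the protesters"),
]


def score(resolver: Resolver, items: List[Item]) -> float:
    """Return the fraction of items where the resolver picks the keyed referent."""
    correct = sum(
        resolver(sentence, candidates) == answer
        for sentence, candidates, answer in items
    )
    return correct / len(items)


def first_candidate(sentence: str, candidates: Tuple[str, str]) -> str:
    """Hypothetical baseline: ignores the sentence and always picks candidate 0."""
    return candidates[0]


print(score(first_candidate, ITEMS))  # 0.5
```

The two variants of a schema have opposite answers, so any strategy that ignores the special word lands at exactly 50% on paired items. That is the chance baseline against which a reported accuracy should be read.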
However, in 2022, the test developers published an article titled “The Defeat of the Winograd Schema Challenge”, claiming that most of the Winograd Schema Challenge had been overcome. Similarly, a 2021 paper, “WinoGrande: An Adversarial Winograd Schema Challenge at Scale,” shows how neural language models have saturated benchmarks like the WSC, with over 90% accuracy. The researchers asked, “Have the neural language models succeeded in acquiring common sense or are we overestimating the true capabilities of machine common sense?”
Coffee test
Apple co-founder Steve Wozniak suggested the coffee test, in which a robot would be challenged to enter your home, find the kitchen, and brew a cup of coffee. The robot should be able to enter any kitchen, find the necessary ingredients and equipment, and then make the coffee.
According to Wozniak, the day a robot could walk into an unfamiliar house and brew a nice cup of coffee would be the day AI really arrived. To pass the coffee test, a robot must be multimodal, able to generalize across tasks, and able to orchestrate a series of actions to brew a hot cup of coffee. As cheeky as it sounds, the coffee test seems like a plausible way to judge whether a machine has reached AGI.