“Artificial intelligence shouldn’t feel to people like clever software, but rather like an expert helper and assistant,” Google representatives explained last week as they unveiled their new advanced language model. “Today, we’re moving closer to that vision with the introduction of Gemini, the most advanced generative model we’ve ever introduced.”
Google has been talking about this model since the spring. At its developer conference in May 2023, the company, which has long been at the forefront of machine-learning research, needed to show that it was not about to be left behind. But its current PaLM 2 model simply fell short, in most respects, of the most advanced tool available so far: OpenAI’s GPT-4.
So everyone expected Gemini to be better than this competing language model, itself introduced back in March. And in its announcement, Google accordingly took care to, so to speak, wipe the floor with GPT-4.
Gemini availability and versions
Google announced three versions of the Gemini model, from the most powerful Ultra through the mid-sized Pro to the smallest Nano. None of them is officially available in Europe yet. In the rest of the world, users can try out Gemini Pro inside Google’s Bard assistant.
“We’ve thoroughly tested our Gemini models and evaluated their performance on a variety of tasks,” Google said. “Gemini Ultra’s performance exceeds the current state of the art on 30 of 32 widely used benchmarks.”
But most people have no idea what AI benchmarks with odd names like MMMU, BIG-Bench-Hard or HellaSwag actually measure. For them, Google prepared a brisk, original and in some respects stunning demonstration of what Gemini Ultra can do. As a multimodal AI, it is said to respond to video, voice, text and images, and to combine all of these seamlessly.
We added Czech subtitles to part of the sample:
“It’s a duck!” the model confidently declares from a simple drawing. It also tracks the movement of a paper ball under cups, which at first glance is a clear demonstration of real-time video understanding. And finally it lets itself be fooled by a sleight-of-hand trick, then immediately describes what happened, which in turn is meant to show an understanding of human context.
Anyone unfamiliar with the new AI tools who saw the demo must have been blown away. But even experts already accustomed to voice and image recognition were enthusiastic, because the whole interaction was remarkably fluid.
But within two days, it came to light that this demonstration, to put it diplomatically, does not faithfully represent what the model can actually do. Harsher critics even call it a “fake”.
Clever editing, or a hoax?
“We tested the capabilities of Gemini, our new multimodal AI model. We shot footage to test it on a wide range of challenges and presented it with a series of images,” the demo video begins. “And we asked it to reason about what it saw.”
To be fair, at the beginning of the video Google warns the viewer in small print that “sequences have been shortened”. That is not unusual and would hardly upset anyone by itself. We, too, usually have to edit our examples of working with ChatGPT, Photoshop and other tools a little, so that an impatient viewer does not have to wait tens of seconds for a response to be generated.
It is also to be expected that the published clips are a selection of the best takes, with various misunderstandings and failures left on the cutting-room floor. In the video description, we then learn: “For the purposes of this demonstration, the response time has been reduced and the Gemini outputs have been shortened.” That is already more of a warning sign, because the ability to answer briefly and to the point, rather than in the long “chatty” paragraphs language models are known for, was precisely what drew many people to the demonstration.
But the process by which the Google Gemini Ultra demo was created goes beyond ordinary “shortening”. Parmy Olson, a Bloomberg technology reporter, pressed Google on how the video was made, and Google sent her an example of how it was put together.
She learned that “the model was given individual still frames from the video, along with text prompts”. That is definitely not the impression viewers got from the video. Moreover, this is not the first time Google has staged a demo for marketing purposes to present its AI’s capabilities in a better light.
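To make the difference concrete, here is a minimal sketch of the workflow Google described: still frames cut from the footage are sent one by one, each paired with a hand-written text prompt. The send_to_model() helper is a hypothetical placeholder, not a real Gemini API call, and the file names and prompt are our own illustration.

```python
# A minimal sketch, assuming a hypothetical send_to_model() helper, of the
# workflow Google described to Bloomberg: still frames cut from the video
# are sent individually, each with a carefully written text prompt.
from pathlib import Path

def send_to_model(prompt: str, image_path: Path) -> str:
    """Placeholder for one multimodal call (text + a single still image)."""
    raise NotImplementedError("stand-in for a real model endpoint")

# Stills extracted from the recorded footage, not a live video stream.
frames = sorted(Path("demo_frames").glob("frame_*.png"))

prompt = ("The ball is hidden under one of the cups. "
          "Tell me where it ended up and explain your reasoning.")

# Each seemingly real-time exchange in the video is in reality one or more
# independent frame+prompt requests, answered separately.
answers = [send_to_model(prompt, frame) for frame in frames]
```

Nothing in this is forbidden, of course; it is simply a far cry from the fluid, live video conversation the edited demo suggests.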
Is it really better than GPT-4?
The Gemini Ultra demos presented in the technical documentation (PDF) look more realistic. In the first of them, the model solves a physics word problem.
It is certainly a nice demonstration of several abilities: understanding the assignment, reading handwritten text, and spotting the error in a student’s calculation. The model then explained the correct procedure step by step and arrived at the right result.
But that is nothing new these days. When we feed the exact same input to ChatGPT Plus, we get an answer at least as good:
In other words, Gemini Ultra has shown it can handle what the current favorite, GPT-4, already handles. But that is not the claim Google is making. On the contrary, it wants to give the impression of being clearly better: “With a score of 90.0%, Gemini Ultra is the first model to outperform human experts on MMLU (massive multitask language understanding), which uses a combination of 57 subjects such as mathematics, physics, history, law, medicine and ethics to test both world knowledge and problem-solving skills,” says the Google blog, for example. It even comes with this incredible chart:
The first problem with this chart is, of course, the visual exaggeration. The y-axis starts at 86 percent and ends at 90, so a relatively small difference between the two numbers looks enormous. On closer inspection, however, this “milestone” is even more dubious.
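Anyone can replicate the trick at home. A short matplotlib sketch (our own illustration, not a reconstruction of Google’s actual chart) plots the published MMLU scores once with a truncated axis and once with an honest one:

```python
# Illustration of how a truncated y-axis exaggerates the gap; the scores
# are the published MMLU figures (GPT-4 86.4 %, Gemini Ultra 90.0 %).
import matplotlib.pyplot as plt

scores = {"GPT-4": 86.4, "Gemini Ultra": 90.0}
labels, values = list(scores), list(scores.values())

fig, (ax_trunc, ax_full) = plt.subplots(1, 2, figsize=(8, 3))

ax_trunc.bar(labels, values)
ax_trunc.set_ylim(86, 90.5)   # axis starts at 86 % -> the gap looks huge
ax_trunc.set_title("Truncated axis")
ax_trunc.set_ylabel("MMLU score (%)")

ax_full.bar(labels, values)
ax_full.set_ylim(0, 100)      # full axis -> the gap almost disappears
ax_full.set_title("Full axis")

plt.tight_layout()
plt.show()
```

On the full axis, the celebrated lead shrinks to a sliver of a few percentage points.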
To reach the celebrated 90% figure, Google evidently had to experiment with different evaluation methods for quite some time, as the technical documentation mentioned above reveals:
To put it simply: in a straightforward comparison, GPT-4 and Gemini Ultra achieved practically the same results, with GPT-4 slightly ahead. So the researchers tried various prompting strategies to coax better performance out of their model. They succeeded, but the same tricks helped the rival too. In the end they arrived at the “uncertainty-routed chain-of-thought@32” method which, simply put, has the model reason through the task step by step, generates 32 candidate answers and lets them vote; if the vote is confident enough, the majority answer wins, and otherwise the model falls back to a plain, direct answer.
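For the curious, here is a minimal sketch of that voting scheme. The sample_with_cot() and answer_directly() helpers are hypothetical stand-ins for real model calls, and the threshold value is our assumption; in practice it would be tuned on a validation set.

```python
# A simplified sketch of "uncertainty-routed chain-of-thought @32" as the
# technical report describes it; the helpers below are hypothetical.
from collections import Counter

K = 32            # number of chain-of-thought samples per question
THRESHOLD = 0.6   # assumed consensus required to trust the majority vote

def sample_with_cot(question: str) -> str:
    """Placeholder: one sampled step-by-step run ending in a final answer."""
    raise NotImplementedError

def answer_directly(question: str) -> str:
    """Placeholder: a single direct answer without chain-of-thought."""
    raise NotImplementedError

def uncertainty_routed_cot(question: str) -> str:
    answers = [sample_with_cot(question) for _ in range(K)]
    best, votes = Counter(answers).most_common(1)[0]
    # If the 32 samples agree strongly enough, trust the majority vote;
    # otherwise fall back to the plain answer.
    if votes / K >= THRESHOLD:
        return best
    return answer_directly(question)
```

Note that the benchmark result then depends as much on this evaluation harness as on the model itself.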
Such a comparison is simply built on sand. OpenAI’s researchers could just as easily devise some other method under which their model would win. In other words, we will have to treat such “comparisons” with great skepticism.
Still, this is great news
Does this mean the new Gemini Ultra model is bad? Not at all. We can say very little about it so far; we haven’t seen it yet. Based on a study of the technical documentation, it is safe to say it will be “roughly at the level of what GPT-4 can do now”, which is a tremendous achievement in itself.
Until now, OpenAI’s GPT-4 has been practically alone in its versatility and “deliberation”. Now it will likely get a rival that can match it and, in some areas, even surpass it (albeit marginally).
Competition is necessary, and companies will welcome the chance to choose which model to deploy. When GPT-4 was the only option, the fight over OpenAI’s leadership was largely seen as a fight over the future of AI development. Google reminds us that a large share of the discoveries behind today’s generative AI based on language models (transformers) were made in its laboratories. And it clearly wanted to show that it can not only catch up but also pull ahead.
The polished video that bends reality, like the absurdly exaggerated charts, says above all how important it was for Google to create the impression of having “the best on the market”.
But perhaps there is another lesson here: if two independent labs have arrived at similarly powerful models, it could mean we are approaching the limits of what this technology can achieve, and that further progress will be somewhat slower than originally expected.
More likely, though, it means a battle for public perception. And since most companies and people have no way to meaningfully compare these sci-fi technologies, we can expect a genuinely interesting tug-of-war.