AI Arms Race Begins – A Review of the Google DeepMind Report on the Newest Gemini AI Model
On December 6th, 2023, Google and its DeepMind team released a report, Gemini: A Family of Highly Capable Multimodal Models, that immediately fascinated the tech industry with the company's newest AI model. The announcement was essentially the dawn of the AI arms race. The overview of the 62-page report below notes the solutions Gemini can provide and its current limitations.
Gemini is a historical term referring to the constellation associated with Castor and Pollux of Greek mythology. The name also evokes NASA's Project Gemini, which came after Project Mercury and before Project Apollo. With Gemini, Google takes AI accessibility a step further by offering the model at several scales. Other systems, notably ChatGPT, have largely confined users to a single interface; OpenAI does provide a mobile app, but in an era where most individuals are attached to their smartphones, those who use AI personally or professionally have lacked a model that runs on the device itself.
Google Gemini, by contrast, appears to have been built with accessibility in mind, with three variants available: Gemini Ultra, Pro, and Nano. While this may look like a simple tiered list, its implications reach further. Gemini provides accessible, scalable solutions: a larger, more powerful model for business, a mid-sized model suitable for everyday use, and an on-device mobile option.
The most extensive offering, Ultra, has raised the bar for AI capabilities, setting new records in 30 of 32 benchmarks, achieving human-expert performance on the well-known Massive Multitask Language Understanding (MMLU) exam, and advancing the state of the art on 20 multimodal benchmarks. Multimodality, the ability to process and combine several types of input within a single model, is a significant advancement: users can now interface with AI without being restricted to text, working with photos, video, voice, and even documents.
These new reasoning abilities are a step toward generalist agents that can handle complex, multi-step problems. Building on this work, Google's AlphaCode team developed AlphaCode 2, which combines Gemini's reasoning with AlphaCode's competitive-programming approach. The combination ranks within the top 15% of entrants on Codeforces, a competitive programming platform.
Section 2 – Model Architecture
Gemini is built on Transformer decoders, a type of neural network architecture pivotal to natural language processing (NLP). In Gemini's case, these decoders received architectural enhancements and optimization improvements, enabling training at larger scale, with context lengths up to 32k tokens, and efficient serving on Google's Tensor Processing Units (TPUs). (TPUs are hardware accelerators built for tensor operations; tensors, the generalized form of matrices, are fundamental data structures in machine learning.) This allowed the DeepMind team to deliver three optimized models: Ultra, serviceable at scale on TPUs with the proprietary architecture; Pro, a model optimized for cost and latency with significant performance across a wide variety of tasks and reasoning capabilities; and Nano, designed for efficiency as an on-device model and trained by distillation from larger Gemini models.
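To ground the terminology, here is a minimal sketch of the causally masked self-attention at the heart of a Transformer decoder. This is an illustrative NumPy toy, not Gemini's implementation; the single-head design, dimensions, and names are assumptions for clarity.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head decoder self-attention: every position may attend
    only to itself and earlier positions (the causal mask)."""
    seq_len, _ = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len)), 1)     # 1s above the diagonal
    scores = np.where(mask == 1, -1e9, scores)         # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(8, d))                 # 8 tokens, 16-dim embeddings
w = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
print(causal_self_attention(x, *w).shape)   # (8, 16)
```

The causal mask is what makes this a decoder: each token can attend only to itself and earlier tokens, which is what allows the model to generate text left to right.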
Section 3 – Training Infrastructure
Gemini models were trained on Google's proprietary Tensor Processing Units, TPUv4 and TPUv5e (fourth- and fifth-generation TPUs). These are optimized for the heavy computations involved in training and serving neural networks, such as the matrix multiplications and convolutions vital to deep learning, with an emphasis on efficiency and high performance. Gemini Ultra's training relied on a large fleet of TPUv4 accelerators, scaled well beyond the fleet used for the prior PaLM 2 models. Scaling to this level proportionately decreases the mean time between hardware and system failures. TPUv4 accelerators are deployed in "SuperPods" of 4,096 chips each, connected to a dedicated optical switch that can dynamically reconfigure 4x4x4 chip cubes into various 3D torus topologies.
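As a back-of-the-envelope illustration of that topology (arithmetic only, not Google's scheduling code): a 4x4x4 cube holds 64 chips, so one SuperPod decomposes into 64 such cubes, and in a 3D torus every chip has six wrap-around neighbors.

```python
# Illustrative arithmetic only (not Google's scheduling code).
CHIPS_PER_SUPERPOD = 4096
CUBE = (4, 4, 4)                                   # one reconfigurable unit
chips_per_cube = CUBE[0] * CUBE[1] * CUBE[2]       # 4 * 4 * 4 = 64 chips
print(CHIPS_PER_SUPERPOD // chips_per_cube)        # 64 cubes per SuperPod

def torus_neighbors(x, y, z, dims):
    """In a 3D torus, each chip links to six neighbors, and the edges of
    the grid wrap around (hence 'torus' rather than a plain grid)."""
    dx, dy, dz = dims
    return [((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
            (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
            (x, y, (z + 1) % dz), (x, y, (z - 1) % dz)]

print(torus_neighbors(0, 0, 0, CUBE))  # a corner chip wraps to the far faces
```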
To develop Ultra, the DeepMind team combined SuperPods using Google's intra-cluster and inter-cluster networks. This arrangement permitted a synchronous training paradigm, exploiting model parallelism within SuperPods and data parallelism across SuperPods. Training also used the single-controller programming model of JAX and Pathways, in which a single central process, the controller, orchestrates the entire system's operation.
Leveraging this model, a single Python process could oversee the entire training run, simplifying the development workflow significantly. The XLA compiler and its GSPMD partitioner handled partitioning of the training step, dividing computational tasks across multiple processors, while the MegaScale XLA compiler pass scheduled collective operations to overlap with ongoing computation, minimizing variation in step time.
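As a loose sketch of how the data-parallel half of this synchronous paradigm looks from a single controller, the following toy JAX program has one Python process drive an identical, gradient-averaged update on every local device. The loss, shapes, and learning rate are placeholders; DeepMind's actual training code is not public.

```python
from functools import partial

import jax
import jax.numpy as jnp

# One controller process defines the step; pmap replicates it across all
# local devices and keeps them in lockstep, i.e. the data-parallel half
# of the synchronous paradigm described above.
@partial(jax.pmap, axis_name="devices")
def train_step(params, x, y):
    def loss_fn(p):
        return jnp.mean((x @ p - y) ** 2)  # toy linear-regression loss
    grads = jax.grad(loss_fn)(params)
    grads = jax.lax.pmean(grads, axis_name="devices")  # all-reduce gradients
    return params - 0.01 * grads

n = jax.local_device_count()
params = jnp.zeros((n, 4))                    # one parameter replica per device
x, y = jnp.ones((n, 8, 4)), jnp.ones((n, 8))  # per-device data shards
params = train_step(params, x, y)             # one call drives every device
```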
Maintaining a high rate of productive computation ("goodput") at this volume required more than conventional checkpointing. Instead, DeepMind kept redundant in-memory copies of the model state, so that when a failure occurred, training could recover and resume quickly from an intact replica. This approach yielded substantially higher overall goodput for even the most extensive training tasks: a rate of 97%, compared to 85% previously.
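A toy sketch of the idea, with hypothetical names (DeepMind's actual mechanism is far more involved): every replica holds the same state in memory, so a failed replica can be restored from any healthy peer instead of waiting on a checkpoint read from disk.

```python
import copy

class ReplicatedState:
    """Toy illustration: every replica holds the full model state in
    memory, so recovery copies from a healthy peer rather than reloading
    a (much slower) checkpoint from persistent storage."""
    def __init__(self, state, n_replicas: int):
        self.replicas = [copy.deepcopy(state) for _ in range(n_replicas)]

    def recover(self, failed_id: int):
        donor = next(i for i in range(len(self.replicas)) if i != failed_id)
        self.replicas[failed_id] = copy.deepcopy(self.replicas[donor])

cluster = ReplicatedState({"step": 1200, "weights": [0.0] * 4}, n_replicas=4)
cluster.recover(failed_id=2)   # replica 2 resumes from an intact peer's state
```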
Training at this scale also surfaces well-known issues, most prominently silent data corruption (SDC), where data is modified or corrupted without any warning that the alteration occurred. Multiple new techniques were developed to proactively detect SDC and isolate incorrect computations, including SDC scanners on idle machines and hot standbys coupled with deterministic replay.
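Deterministic replay makes corruption detectable because a correct deterministic step, re-run with identical inputs and seeds, must reproduce bit-identical outputs. A toy version of that check (the `step` function here is a stand-in, not Gemini's training step):

```python
import hashlib
import numpy as np

def step(weights, batch, seed):
    rng = np.random.default_rng(seed)  # fixed seed makes the step deterministic
    noise = rng.normal(size=weights.shape) * 1e-3
    return weights - 0.01 * (weights - batch.mean(axis=0)) + noise

def digest(arr):
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()

weights, batch = np.zeros(8), np.ones((4, 8))
first = step(weights, batch, seed=42)
replay = step(weights, batch, seed=42)  # re-run on a hot standby machine
assert digest(first) == digest(replay), "mismatch: possible silent data corruption"
```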
Section 4 – Training Dataset
Gemini is trained to be both multilingual and multimodal. The datasets DeepMind used span a diverse range of data types, including web documents, books, code, images, audio, and video.
The team used the SentencePiece tokenizer, trained on a large sample of the full corpus, which improved the inferred vocabulary and, in turn, model performance; notably, this makes Gemini highly adept at tokenizing non-Latin scripts. Regarding token counts, smaller models were trained on substantially more tokens to optimize performance for a given inference budget, while the larger models followed the scaling approaches outlined in current research used at DeepMind.
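For readers unfamiliar with SentencePiece, the library is open source, and a toy training-and-encoding round trip looks like the following. The corpus, file names, and vocabulary size here are placeholders; production tokenizers are trained on vastly larger samples with much bigger vocabularies.

```python
import sentencepiece as spm

# Toy corpus; real training data is far larger and more diverse.
sentences = [
    "Gemini is a family of multimodal models.",
    "Tokenizers split text into subword pieces.",
    "Non-Latin scripts benefit from learned vocabularies.",
]
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.writelines(s + "\n" for s in sentences * 100)

spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="toy_tok", vocab_size=80
)
sp = spm.SentencePieceProcessor(model_file="toy_tok.model")
print(sp.encode("Gemini tokenizes text into subword pieces.", out_type=str))
```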
DeepMind applied quality filters to all datasets, using both model-based classifiers and heuristic rules to ensure the integrity and relevance of the data. Safety filtering removed harmful content, and evaluation sets were filtered out of the training corpus so that benchmark results would reflect genuine generalization. As training progressed, the data mixture was carefully adjusted and curated, with increasing weight on domain-relevant data toward the end of training. The team's observation from this process was that data quality is paramount to a high-performing model.
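A minimal sketch of what such filtering can look like, with made-up thresholds; the report does not publish its actual rules or classifiers. The second function shows one common way evaluation sets are filtered out of training data, via n-gram overlap:

```python
import re

def heuristic_quality_ok(doc: str) -> bool:
    """Toy stand-ins for heuristic quality rules; the report's actual
    rules and model-based classifiers are not public."""
    words = doc.split()
    if len(words) < 50:                      # too short to be informative
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive text
        return False
    return True

def ngrams(text: str, n: int = 8):
    toks = re.findall(r"\w+", text.lower())
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_docs, eval_docs, n: int = 8):
    """Drop training documents that share any n-gram with an evaluation
    set: one simple way to keep benchmark data out of the training corpus."""
    eval_grams = set().union(*(ngrams(d, n) for d in eval_docs))
    return [d for d in train_docs if not ngrams(d, n) & eval_grams]
```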
Section 5 – Evaluation and Key Takeaways
From an academic standpoint, Gemini in its Ultra and Pro variants exceeded previous benchmarks when tested against competing language models, including both GPT-3.5 and GPT-4 (the models behind ChatGPT), Claude 2, and PaLM 2, previously considered one of the best text-based models.
On MMLU (Massive Multitask Language Understanding), an examination spanning an established set of 57 subjects across STEM, history, law, the humanities, mathematics, and computer science, Gemini Ultra surpassed the human-expert benchmark of 89.8% accuracy, scoring 90.04%.
In specific fields such as mathematics, the evaluation dataset is GSM8K, which contains 8,500 high-quality grade-school math problems; Gemini Ultra answers these with 94.4% accuracy, surpassing the industry's next best, GPT-4, at 92.0%. The Ultra model is also particularly successful at coding in Python, leading the industry on both the HumanEval and Natural2Code datasets with accuracies of 74.4% and 74.9%, respectively.
While this is not at the level of human-expert performance, it is a strong accuracy rate considering the next-best models: Claude 2 scored 70% on HumanEval, and GPT-4 scored 73.9% on Natural2Code. At these accuracy rates, Gemini has far-reaching implications for the academic world; with accuracy approaching or surpassing human-expert level, AI can be utilized as a research tool or as an AI tutor offering tailored learning solutions. However, it is advisable to consider model size in conjunction with accuracy.
Gemini Ultra is the largest model Google anticipates launching, alongside the smaller Pro and mobile Nano models. During evaluation, Gemini Pro proved to be a capable system, though less accurate than Ultra, with accuracy of 79.13% on MMLU, 86.5% on GSM8K, 67.7% on HumanEval, and 69.6% on Natural2Code. This is not necessarily negative; if anything, it indicates that the largest AI models can be reserved for advanced industries where critical thought and nuance are pivotal, while smaller but still capable offerings serve end users at a more appropriate level.
Nano
Accessibility is a crucial feature of a developing technology, especially as its applications grow. Google's DeepMind team anticipated this and developed the Gemini Nano series, Nano 1 and Nano 2, with 1.8B and 3.25B parameters, respectively. Nano makes AI deployable to a variety of devices, broadening the reach and impact of Gemini. Despite their smaller size, both Nano 1 and Nano 2 proved effective for their scale; while Nano's accuracy does not reach the levels of the Pro or Ultra models, some dataset evaluations came close, and some matched or beat competing AI models.
One test run on all three Gemini variants was MMLU, which encompasses 57 distinct subjects. While the Ultra and Pro versions reached respective accuracy levels of 90.04% and 79.13%, as noted above, Nano 1 achieved 45.9% and the larger Nano 2 scored 55.8%.
Nano performed strongest on a dataset called BoolQ, which is composed of Boolean questions. A Boolean question yields one of two results, an answer of yes or no, and such questions are common in programming, logic, and decision-making. While no BoolQ data was provided for the Ultra and Pro versions of Gemini, Nano was highly accurate, with Nano 1 scoring 71.6% and Nano 2 scoring 79.3%.
High accuracy on these questions helps ensure reliability and effectiveness in pragmatic applications, such as solutions that operate primarily on Boolean decisions, like a consumer chatbot. While smaller in scale, the Nano models are promising in their accuracy and can enhance AI applications, especially in the context of user experience; a sketch of how such a benchmark is scored follows.
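For concreteness, scoring a yes/no benchmark like BoolQ can be as simple as normalizing free-form answers to a label and computing accuracy; the exact protocol used in the report may differ from this sketch.

```python
def normalize_bool(answer: str):
    """Map a free-form model response onto a yes/no label."""
    a = answer.strip().lower()
    if a.startswith(("yes", "true")):
        return "yes"
    if a.startswith(("no", "false")):
        return "no"
    return None  # unparseable answers simply count as wrong

def boolq_accuracy(predictions, labels):
    hits = sum(normalize_bool(p) == gold for p, gold in zip(predictions, labels))
    return hits / len(labels)

print(boolq_accuracy(["Yes, it is.", "No.", "Probably"], ["yes", "no", "yes"]))
# 0.666...
```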
Multilingual
Gemini is a multilingual model that can translate between various languages, both into and out of English, an essential capability on an increasingly connected planet. Gemini models were evaluated on tasks requiring multilingual understanding, cross-lingual generalization, and text generation in various languages, using machine translation benchmarks and translated variants of standard benchmarks. For machine translation, the standard benchmark used was WMT 23.
WMT 23 is a benchmark that encompasses multiple language pairs. Results showed that Gemini Ultra and the other Gemini models excelled at translating between English and other languages, with overall BLEURT scores across all languages of 74.4 for Ultra, 71.7 for Pro, 67.4 for Nano 2, and 64.8 for Nano 1. The other LLMs tested, GPT-4 and PaLM 2-L, scored 73.8 and 72.7, respectively.
Further benchmarks evaluated Gemini on complex tasks across languages, such as math and summarization. MGSM served as the benchmark for multilingual mathematical reasoning, on which Gemini Ultra scored 79.0% accuracy, with Pro at 63.5%.
Gemini Ultra's score surpassed those of GPT-4 and PaLM 2-L, which rated 74.5% and 74.7%, respectively. For summarization, the benchmarks were XLSum and WikiLingua. The Gemini models scored highest on XLSum, earning ROUGE-L scores of 17.6 for Ultra and 16.2 for Pro, whereas PaLM 2-L scored 15.4.
Data for GPT-4 was not provided for XLSum or WikiLingua. On the WikiLingua benchmark, however, Gemini trails PaLM 2-L's BLEURT score of 50.4, with Ultra at 48.9 and Pro lower at 47.8.
Long Content
As the industry progresses and new AI systems are developed, such as personalized AI solutions, limitations have surfaced, notably input-length restrictions that cap how much text one can submit in a prompt. These limits are measured in tokens, the building blocks of text, and depending on the industry or application, content length can vary considerably. Gemini models are designed to run with a context length of up to 32,768 tokens, with the Ultra model retrieving information across that full context at 98% accuracy. For comparison, the original ChatGPT was limited to 4,096 tokens.
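The report does not spell out its retrieval protocol, but long-context accuracy is typically probed with synthetic tests in the spirit of this sketch, where a fact is buried at a random position in a long prompt and the model is asked to recall it (`ask_model` is a placeholder for a real model API):

```python
import random

def make_haystack(n_words: int, needle: str) -> str:
    """Bury a 'needle' fact at a random position inside filler text."""
    filler = ["lorem"] * n_words
    filler.insert(random.randrange(n_words), needle)
    return " ".join(filler)

def retrieval_accuracy(ask_model, trials: int = 100) -> float:
    hits = 0
    for i in range(trials):
        secret = f"code-{i}"
        prompt = make_haystack(30_000, f"The secret is {secret}.")
        hits += secret in ask_model(prompt + " What is the secret?")
    return hits / trials

# `ask_model` is a placeholder for a real model API; this stub always
# answers "code-0", so it scores 1/100.
print(retrieval_accuracy(lambda prompt: "The secret is code-0."))
```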
Human Preference
The analysis of human preference highlights the importance of assessing model quality alongside automated evaluations. Gemini models underwent side-by-side blind evaluations in which human raters compared two responses generated from the same prompt, one of them from an instruction-tuned model. The instruction-tuned model was evaluated on three factors, creativity, instruction following, and safety, using win rate: how often its answer was favored of the two. Against a PaLM 2 model API, the instruction-tuned Gemini Pro consistently achieved higher win rates: 65% on creativity, 59.2% on instruction following, and 68.5% on safety of responses.
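A win rate is straightforward to compute from rater judgments; here is a small sketch, with the convention (an assumption on my part, not stated in the report) that ties count as half a win:

```python
from collections import Counter

def win_rate(judgments) -> float:
    """Fraction of side-by-side comparisons the candidate wins; ties are
    counted as half a win here (a convention, not stated in the report)."""
    counts = Counter(judgments)              # entries: "win", "loss", "tie"
    return (counts["win"] + 0.5 * counts["tie"]) / sum(counts.values())

ratings = ["win"] * 59 + ["loss"] * 33 + ["tie"] * 8
print(f"{win_rate(ratings):.1%}")            # 63.0%
```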
Complex Reasoning
AI has changed the world and is here to stay, and as accessibility has increased, so have innovations. With Gemini, DeepMind has also developed an advanced complex-reasoning solution targeted at competitive programming, built on the Pro model. That solution is AlphaCode 2, which was evaluated on Codeforces, the same platform used for the original AlphaCode, across 12 contests featuring 77 coding problems at both the Division 1 and Division 2 levels.
For those unfamiliar, Codeforces is an online platform for competitive coding, with Division 1 targeting experienced competitors and Division 2 newer ones. When the original AlphaCode was first tested, it solved roughly 25% of the problems it was given; AlphaCode 2, powered by Gemini Pro, solved 43%. While currently targeted at programming, specialized tools built on this technology could be leveraged across industries with specific domains, leading to more efficient problem-solving and innovation.
Multimodal Capabilities
Gemini models are natively multilingual and multimodal and are highly adept at integrating and analyzing information across formats, including video, text, and images. Google's report provides insight into the accuracy of the Gemini models across the individual modalities of image, video, and audio.
Visually, Gemini was evaluated on four capabilities: high-level object recognition via question answering and captioning (VQAv2); fine-grained transcription requiring identification of low-level details (TextVQA and DocVQA); chart and spatial understanding (ChartQA and InfographicVQA); and multimodal reasoning (MMMU, MathVista, and AI2D). In these evaluations, Gemini Ultra proved to be the leading model for image understanding, setting the benchmark above all currently available technologies with the highest accuracy scores on every benchmark listed above.
The MMMU test in particular is a relatively new benchmark consisting of college-level questions across six disciplines. Gemini set the highest accuracy score in five of the six; science was the exception, where GPT-4 was more accurate by 5.4%, but Gemini outperformed GPT-4 by 5% or more in the other five subjects. Overall, Gemini Ultra reached an accuracy of 62.4%, whereas GPT-4 scored 56.8%.
Gemini models can also, as touched on elsewhere, work across languages on image understanding and generation tasks. In an evaluation using the XM-3600 benchmark, measured by the Flamingo protocol (Alayrac et al., 2022), Gemini models showed a significant improvement over the previous best model, PaLI-X. Gemini Ultra had the highest rating in six of seven languages, losing English to Gemini Pro yet still beating PaLI-X, with English scores of 86.4 for Ultra, 87.1 for Pro, and 77.8 for PaLI-X.
Video understanding was evaluated by sampling 16 equally spaced frames per video and asking the models to perform captioning and video question answering; both Gemini Ultra and Pro surpassed the few-shot state of the art, reflecting strong temporal reasoning.
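Sampling equally spaced frames is simple to express; here is a sketch of how 16 indices might be chosen from an arbitrary-length clip (an illustration, not DeepMind's pipeline):

```python
import numpy as np

def sample_frame_indices(n_frames: int, n_samples: int = 16) -> np.ndarray:
    """Choose n_samples equally spaced frame indices from a clip."""
    return np.linspace(0, n_frames - 1, n_samples).round().astype(int)

print(sample_frame_indices(480))
# [  0  32  64  96 128 160 192 224 255 287 319 351 383 415 447 479]
```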
Gemini Pro and Nano 1 were analyzed for their audio capabilities, with benchmarks covering automatic speech recognition and automatic speech translation from various languages into English. They were compared against two systems: OpenAI's Whisper and Google's Universal Speech Model (USM). The data in the report shows Gemini Pro outperforming both models on translation and speech-recognition tasks, especially on the FLEURS benchmark, a dataset covering 62 languages on which the Gemini models were also trained. Ultra had not yet been tested for audio.
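Speech-recognition accuracy is conventionally reported as word error rate (WER): the word-level edit distance between a reference transcript and the model's hypothesis, divided by the reference length. A compact implementation for illustration:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words divided by reference length,
    the standard metric for automatic speech recognition (lower is better)."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))       # rolling row of edit distances
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution
    return d[len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167
```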
Section 6 – Responsible Deployment
The most prominent concerns with artificial intelligence surround its development and deployment. Google, a name hardly new to the tech industry, and the DeepMind team have dedicated themselves to the responsible development of their multimodal solutions, aiming for an ethical, safe integration that benefits society.
Central to their approach is an initial impact assessment that evaluates a model's potential effects on society, which helps the team avoid developing technology that could be harmful, particularly regarding employment, privacy, and security. Once a model is evaluated, the team follows a stringent framework to adhere to legal, ethical, and societal expectations, addressing concerns such as transparency, accountability, user privacy, and consent.
Upon deployment, models are monitored to ensure that solutions perform as they were designed and are not being used maliciously, producing inaccuracies, or being misused in any other way. Mitigation strategies may vary from one product to the next, but with strong oversight, any potential misuse of the technology can be identified and quickly rectified, showing a dedication to the cautious and gradual integration of ethical AI.
Section 7 – Discussion and Conclusion as Developed by Google
The report introduced Gemini as a family of multimodal models capable of understanding text, images, audio, video, and code. The data included in the technical report evaluates and validates Gemini's accuracy on numerous industry-standard benchmarks, especially for the most capable model, Ultra. In natural language, Gemini Ultra achieves state-of-the-art results that surpass human-expert performance on the MMLU benchmark. Multimodally, Gemini Ultra excels at image, video, and audio understanding without task-specific alterations.
However, while the performance is excellent, the report's authors note the current limitations of LLMs. One is "hallucination," the generation of inaccurate or unreliable outputs, which highlights the importance of ongoing research into output quality. Additionally, LLMs struggle with tasks requiring high-level reasoning, such as causal understanding, logical deduction, and counterfactual reasoning, making it necessary to develop new benchmarks that evaluate a model's true understanding.
Though the authors acknowledge these limitations, they conclude that Gemini models have real-world potential. As they stand, these models can analyze complex images, reason over diverse data modalities, generate interwoven text and images, support applications in education, communicate multilingually, summarize and extract information, solve problems, and encourage creative solutions. The authors anticipate that users will discover beneficial applications of Gemini well beyond what has been explored in DeepMind's controlled environment, through organic exploration of the technology. Gemini builds on Google's consistent pursuit of innovation, advancing science, solving intelligence, and benefiting humanity. These models serve as a foundation for the broader goal of developing a large-scale, modularized system with comprehensive generalization capabilities across multiple modalities.
Section 8 – Contributions and Acknowledgments
In the report, the authors take the time to acknowledge the work and dedication of everyone who contributed to Gemini at DeepMind. Midway through the report, before the discussion and conclusion, 22 pages of contributors are listed.
Section 9 – Appendix
The report ends with an appendix of examples of Gemini and the multimodal questions it is capable of answering. Listed here are the titles of the examples, with the report's page numbers for ease of reference.
- Physics Problem by Student – Figure 1 on Page 2
- Gemini’s Multimodal Reasoning Capabilities – Figure 5 on Page 14
- Image Generation – Figure 6 on Page 16
- Modality Combination – Table 13 on Page 18
- Chart Understanding and Reasoning Over Data – Figure 8 on Page 49
- Multimodal Question Answering – Figure 9 on Page 50
- Interleaved Image and Text Generation – Figure 10 on Page 51
- Image Understanding and Reasoning – Figure 11 on Page 52
- Geometrical Reasoning – Figure 12 on Page 53
- Information Seeking About Objects – Figure 13 on Page 53
- Multimodal Reasoning Based on Visual Cues – Figure 14 on Page 54
- Multimodal Humor Understanding – Figure 15 on Page 55
- Commonsense Reasoning in a Multilingual Setting – Figure 16 on Page 56
- Reasoning and Code Generation – Figure 17 on Pages 57-58
- Mathematics: Calculus – Figure 18 on Page 59
- Multi-Step Reasoning and Mathematics – Figure 19 on Page 60
- Complex Image Understanding, Code Generation, and Instruction Following – Figure 20 on Page 61
- Video Understanding and Reasoning – Figure 21 on Page 62
Gemini is a tremendous step forward in the development of AI, and in such a dynamic industry, it establishes one certainty: the AI arms race has begun. Google has stated that it anticipates fully releasing Gemini in early 2024, and given the nature of the industry, this blog will remain live and be updated as events unfold.