Google trains AI vision model with two billion parameters


Google Brain researchers have announced a two-billion-parameter deep learning computer vision (CV) model. The model was trained on three billion images and set a new state-of-the-art record of 90.45% top-1 accuracy on ImageNet.

The ViT-G/14 model builds on Google's recent work on Vision Transformers (ViT). ViT-G/14 outperforms previous state-of-the-art systems on several benchmarks, including ImageNet, ImageNet-v2, and VTAB-1k. For example, its accuracy on few-shot image recognition improved by more than five percentage points. The researchers also trained several smaller versions of the model to search for a scaling law for the architecture, observing that performance follows a power-law function, similar to the Transformer models used for natural language processing (NLP).
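The exact parametrization used in the paper may differ, but scaling studies of this kind typically relate downstream error to a scale variable (compute, parameter count, or dataset size) through a saturating power law, roughly of the form below; the error-floor term is a common addition, assumed here for illustration.

```latex
% Illustrative power-law form (the paper's exact parametrization may differ):
%   E    = downstream error rate
%   X    = scale variable (compute, parameters, or data)
%   a, b = fitted constants
%   c    = irreducible error floor (assumed term)
E(X) \approx a \, X^{-b} + c
```

The a·X^(-b) term appears as a straight line on a log-log plot, which is why such fits are usually performed in log space.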

The Transformer architecture, first introduced by Google researchers in 2017, quickly became the most popular design for NLP deep learning models, with OpenAI's GPT-3 among the most famous. In a study published last year, OpenAI described the rules for scaling these models: by training several similar models of different sizes while varying the amount of training data and compute, OpenAI derived a power-law function for predicting a model's accuracy. In addition, OpenAI found that larger models not only perform better but are also more compute-efficient.
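As a rough illustration of how such a power-law fit works (this is not OpenAI's or Google's code, and the data points below are invented purely for demonstration), one can regress log-error against log-scale:

```python
# Illustrative sketch: fitting a power law  error ~ a * N^(-b)  by linear
# regression in log-log space (the error floor from the form above is ignored
# here for simplicity). All numbers are made up for demonstration.
import numpy as np

# Hypothetical (parameter count, validation error) pairs for models of increasing size.
params = np.array([1e7, 1e8, 1e9, 2e9])
error = np.array([0.35, 0.22, 0.14, 0.12])

# A power law error = a * N^(-b) becomes a straight line after taking logs:
#   log(error) = log(a) - b * log(N)
slope, intercept = np.polyfit(np.log(params), np.log(error), deg=1)
a, b = np.exp(intercept), -slope

print(f"fitted power law: error ~ {a:.3g} * N^(-{b:.3g})")
# Extrapolate to a hypothetical larger model.
print("predicted error at N = 1e10:", a * 1e10 ** (-b))
```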

Unlike NLP models, most state-of-the-art CV deep learning models use a convolutional neural network (CNN) architecture, which rose to prominence after a CNN model won the ImageNet competition in 2012. With Transformers' recent success in NLP, researchers have begun to examine their performance on vision problems; for example, OpenAI built an image-generation system based on GPT-3. Google has been very active in this area, training a 600-million-parameter ViT model on its proprietary JFT-300M dataset in late 2020.

The new ViT-G/14 model was pre-trained on JFT-3B, an upgraded version of that dataset containing around three billion images. The research team improved the memory efficiency of the ViT architecture so that the model could fit into a single TPUv3 core. To evaluate ViT-G/14 and the smaller models, the researchers used few-shot transfer learning as well as fine-tuning of the pre-trained models (a minimal sketch of few-shot linear probing follows the list below). The results were used to derive scaling rules, analogous to the NLP scaling laws:

  • Scaling up compute, model size, and data improves accuracy according to a power-law function.
  • Accuracy can be a bottleneck in smaller models.
  • Large models benefit from larger datasets.
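
For context, the sketch below shows one common way few-shot transfer is evaluated for pre-trained vision backbones: freeze the backbone, extract features for a handful of labelled images per class, and fit a closed-form linear classifier on top. The feature extractor is replaced here by random stand-in features, and the shapes and regularization constant are assumptions, not the paper's actual setup.

```python
# Minimal sketch of few-shot "linear probe" transfer on frozen features.
# Stand-in data and hyperparameters; not the paper's code or configuration.
import numpy as np

rng = np.random.default_rng(0)
num_classes, shots, feat_dim = 10, 5, 256

# Stand-in for frozen backbone features of the few-shot training images
# (`shots` labelled examples per class).
train_feats = rng.normal(size=(num_classes * shots, feat_dim))
train_labels = np.repeat(np.arange(num_classes), shots)

# One-hot targets so the probe can be fit as a ridge regression.
targets = np.eye(num_classes)[train_labels]

# Closed-form ridge regression: W = (X^T X + lambda * I)^-1 X^T Y
lam = 1e-2
X = train_feats
W = np.linalg.solve(X.T @ X + lam * np.eye(feat_dim), X.T @ targets)

# Classify a held-out example by taking the argmax over class scores.
test_feat = rng.normal(size=(1, feat_dim))
pred_class = int(np.argmax(test_feat @ W, axis=1)[0])
print("predicted class:", pred_class)
```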

ViT-G/14 currently ranks #1 on the ImageNet leaderboard. The next eight highest-scoring models were also created by Google researchers, while the tenth was created by Facebook. In addition, Google has published the code and weights for last year's 600-million-parameter ViT model on GitHub.

Article: https://arxiv.org/pdf/2106.04560.pdf

ViT model code and weights on GitHub: https://github.com/google-research/vision_transformer


