Tremendous advances have been seen in the field of computer vision due to the success of deep learning, in particular on low- and mid-level tasks, such as image segmentation or object recognition. These advances have fueled researchers’ confidence for tackling more complex tasks that combine vision with language and high-level reasoning.
This article presents the ongoing work in the field and the current approaches to VQA based on deep learning.
While the field of VQA has seen recent successes, it remains a largely unsolved task.
Deep visual understanding can be defined as the ability of an algorithm to extract high-level information from images and to perform reasoning based on that information. In this regard, VQA is an alternative to other tasks proposed to evaluate this capability. Examples include the visual Turing test [23], the task of image captioning [20], [73], and recent works on visual dialogs [18].
A second parallel motivation for the study of VQA is its utility in its own right.
Note, however, that current VQA data sets do not directly address this setting, because questions are typically collected in a non-goal-oriented setting.
Realistic, motivated questions would likely require information not present in the image and involve rare words and concepts.
Historically, one of the earliest integrations of computer vision with language was the SHRDLU system, dating back to 1972 [78].
SHRDLU, the first model to combine images with NLP, was built by T. Winograd at the Massachusetts Institute of Technology in 1972 [...]
Computer: OK. (Picks up the large red block.)
However, these early works were often limited to specific domains and/or simple language.
Deep learning has now been applied to virtually every problem imaginable in computer vision, and convolutional neural networks (CNNs) are approaching human performance in tasks such as image segmentation [39] or object recognition [19], [24].
Task definition and data sets:
The task for the machine is to determine the correct answer, which is, in current data sets, typically a few words or a short phrase.
Two practical variants are usually considered, an open-ended and a multiple-choice setting [5], [92]. In the latter, a set of candidate answers is proposed. This makes the evaluation of a generated answer easier than in the open-ended setting, where the comparison between the machine’s output and a ground-truth (i.e., human-provided) answer faces issues with synonyms.
VQA is also related to the task of textual question answering [10], [14], [88], in which the answer is to be found in a textual narrative (i.e., reading comprehension) or in large knowledge bases (KBs) (i.e., information retrieval).
The additional challenge of a visual input is significant because images are simply much higher dimensional than text. Images capture the richness of the real world in a noisy manner, whereas natural language already represents a certain level of abstraction.
While, to some extent, the processing of language is possible with discrete and rule-based approaches, such as syntactic parsers and regular expression matching, the complexity of images renders such engineered methods intractable.
Modern computer vision is based on statistical learning, and recent works combining vision and language (including image captioning and VQA) similarly evolved from machine-learning techniques.
The two tasks are complementary as they evaluate different capabilities. Captioning requires mostly descriptive capabilities that involve almost purely visual information. VQA, in comparison, often requires reasoning with common sense and with other information not present in the given image.
Data sets for training and evaluating VQA:
We now examine data sets that have been specifically compiled for research on VQA. These data sets contain, at a minimum, triples each made up of an image, a question, and its correct answer.
Those data sets are designed for both evaluating and training VQA systems in a supervised setting, and the latter demands large amounts of data. As will be discussed in the section “Directions of Current and Future Research,” this very need for large amounts of data is a significant limit of current approaches.
For the purpose of standardized comparisons and benchmarking of different algorithms, data sets are split into predetermined sets of instances for training, validation, and testing.
Existing data sets vary mainly along three dimensions: 1) their size, i.e., the number and variety of concepts represented in the images and questions; 2) the amount of required reasoning, e.g., whether the detection of a single object is sufficient or whether inference is required over multiple facts or concepts; and 3) how much information beyond what is present in the input image is necessary to infer an answer, e.g., common sense or subject-specific information.
Most data sets lean toward visual-level questions and require little external knowledge beyond common sense. These characteristics reflect the fact that current state-of-the-art methods still struggle with simple visual questions.
The first VQA data set designed as a benchmark was the DAtaset for QUestion Answering on Real-world images (DAQUAR) [45].
DAQUAR contains 6,794 training question/answer pairs and 5,674 test question/answer pairs, with all images taken from the NYU-Depth V2 data set. This amounts to about nine question/answer pairs per image on average. The data set's quality [...]
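The most popular modern data sets [5], [35], [92] use images sourced from Microsoft Common Objects in Context (COCO) [40], a data set initially devised for image recognition, which is itself composed of images from Flickr.
1. COCO is a large, rich data set for object detection, segmentation, and captioning. It targets scene understanding and is drawn mainly from [...]
2,500,000 labels. It is the largest data set with semantic segmentation to date, providing 80 categories and more than 330,000 images, of which 200,000 are annotated, [...]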
The most widely used data set is currently the one proposed by a team of researchers from Virginia Tech and is commonly referred to as VQA [5]. It comprises two parts, one using natural images named VQA-real, and a second one with clipart images named VQA-abstract (discussed at the end of this section).
Visual Genome and Visual7W:
The Visual Genome QA data set [35] is currently the largest one designed for VQA, with 1.7 million question/answer pairs.
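Visual Genome: the images come from COCO and YFCC100M, 108,249 images in total, with 1.7 million QA pairs. As of the publication of the article (October 2016), it is the largest VQA data set. Its questions follow the 6W scheme: What, Where, How, When, Who, and Why. Its answers are markedly more diverse than those of other data sets and contain more words. The data set also contains no yes/no questions.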
The Visual7W [92] data set is a subset of the Visual Genome that allows evaluation in a multiple-choice setting, as each question is provided with four plausible but incorrect candidate answers.
Visual7W: this data set extends the previous one with additional annotations; 7W stands for What, Where, How, When, Who, Why, and Which. It contains 47,300 images. To answer questions precisely, bounding boxes are used to mark the four candidate answers.
Zero-shot VQA:
A special version of the Visual7W data set was proposed in [70]. The authors redefined the training and test splits such that every test instance includes one or several words that were not present in any training example.
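A split in the spirit of the one described above could be built roughly as follows. This is only a sketch under stated assumptions: the regular-expression tokenizer, the function names, and the way held-out words are chosen are all hypothetical, and this is not the procedure used in [70].

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def zero_shot_split(instances, held_out_words):
    """Send every instance that mentions a held-out word to the test split.

    Because any instance containing a held-out word leaves the training
    split, each test instance holds at least one word unseen in training.
    """
    held_out = {word.lower() for word in held_out_words}
    train, test = [], []
    for question, answer in instances:
        words = tokenize(question) | tokenize(answer)
        if words & held_out:
            test.append((question, answer))
        else:
            train.append((question, answer))
    return train, test


instances = [
    ("What color is the cat?", "black"),
    ("Where is the zebra?", "field"),
    ("Who rides the horse?", "man"),
]
train, test = zero_shot_split(instances, held_out_words={"zebra"})
```

Here the question about the zebra ends up in the test split, and the word "zebra" never appears in the training split.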
Clipart images:
Data sets for VQA have also been proposed with synthetic clipart images (referred to as abstract scenes in [5]). These images were created manually with cartoon representations of characters and objects from a predefined set.
That data set contains only binary (yes/no) questions and each question appears twice in the data set, with two different images that give rise to opposite answers.
Despite undeniable advantages, VQA data sets of clipart images have seen little use [5], [69], [90] compared to their counterparts of real images.
Video-based QA:
Zhu et al. [91] assembled a data set of over 100,000 videos and 400,000 questions, using existing collections of videos from different domains, from cooking scenarios to movies and web videos.
VQA systems are evaluated by inferring the answers on the test split of a given data set. Recent data sets [92] recommend the multiple-choice setting, since there is only one correct answer among the multiple choices. The evaluation is thus straightforward, as one can simply measure the mean accuracy over test questions. In an open-ended setting, several answers could be equally valid, because of synonyms and paraphrasing.
The usual workaround is to restrict answers, at the time of the creation of the data sets, to short phrases, typically one to three words.
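The accuracy measure described above can be sketched in a few lines. The function names and the normalization rule (lowercasing and collapsing whitespace) are assumptions made for this illustration; each benchmark defines its own evaluation protocol.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def mean_accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Fraction of test questions whose predicted answer matches the reference."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    if not predictions:
        raise ValueError("cannot compute accuracy over an empty test set")
    correct = sum(
        normalize(p) == normalize(g) for p, g in zip(predictions, ground_truths)
    )
    return correct / len(predictions)


# "two" and "2" count as a mismatch, which is the synonym problem of the
# open-ended setting mentioned above.
score = mean_accuracy(["Red", "two", "cat"], ["red", "2", "cat "])
```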
Deep neural networks for VQA:
The common approach to VQA is to train, with supervision, a deep neural network that maps the given image and question to a relative scoring of candidate answers. The main idea is to learn a joint embedding of the visual and textual inputs.
Image encoding:
On the computer vision side, the input image xI is processed with a deep convolutional neural network (CNN) to extract image features described as a vector yI.
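In comparison to classical handcrafted image features such as scale-invariant feature transform (commonly known as SIFT) [41] or histogram of oriented gradients (commonly known as HOG) [16], CNN features provide higher-level representations of the contents of the image, and are naturally produced as a fixed-size vector. The size of this vector is typically on the order of 1,024 or 2,048.
- Scale-invariant feature transform (SIFT) is a computer vision algorithm for detecting and describing local features in images. It searches for extrema in scale space and extracts their position, scale, and rotation invariants. The algorithm was published by David Lowe in 1999 and refined and summarized in 2004.
- In computer vision and digital image processing, the histogram of oriented gradients (HOG) is a descriptor based on shape and edge features that is used for object detection. Its basic idea is that gradient information reflects the edges of objects in an image well, and that the magnitudes of local gradients characterize the local appearance and shape of the image.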
Question encoding:
Initially, the ith word of the question is represented by an index xQi in the input vocabulary. Each word is then turned into a vector.
This uses a mapping implemented as a lookup table [·]W that associates the index of any word of the input vocabulary to a learned vector.
An alternative implementation initially represents each word with a one-hot vector (a vector of all zeros, except for a one at the location of the word index in the vocabulary), which is then multiplied with a dense weight matrix that contains the embeddings of all words.
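The two implementations give the same result, which a small NumPy check can make concrete. The toy vocabulary, the embedding size, and the random matrix `W` below are assumptions for illustration; in a trained model the rows of `W` are learned.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding size, chosen only for illustration.
vocab = ["what", "color", "is", "the", "cat"]
word_to_index = {word: i for i, word in enumerate(vocab)}
embedding_dim = 4

rng = np.random.default_rng(0)
# One row per vocabulary word; random here, learned in a real model.
W = rng.normal(size=(len(vocab), embedding_dim))


def embed_lookup(indices: np.ndarray) -> np.ndarray:
    """Lookup-table form: pick the row of W for each word index."""
    return W[indices]


def embed_one_hot(indices: np.ndarray) -> np.ndarray:
    """One-hot form: build one-hot rows, then multiply by the dense matrix W."""
    one_hot = np.zeros((len(indices), len(vocab)))
    one_hot[np.arange(len(indices)), indices] = 1.0
    return one_hot @ W


question = ["what", "color", "is", "the", "cat"]
indices = np.array([word_to_index[word] for word in question])
same = np.allclose(embed_lookup(indices), embed_one_hot(indices))
```

The lookup form avoids materializing the one-hot matrix, which matters once the vocabulary has thousands of entries.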
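A simple option for this purpose is a bag-of-words (BoW) representation, which corresponds to simply averaging the word vectors.
The bag-of-words model was originally used in text classification to represent a document as a feature vector. Its basic idea is to assume that, for a given text, word order and [...] are ignored [...]
(it is called a bag of words because the bag contains nothing but words), one then looks at which words the bag contains and [...]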
Another popular option is to feed the word vectors into a recurrent neural network (RNN) such as a long short-term memory (LSTM). An RNN processes words sequentially and can capture the sequential relationships between them. In comparison, a BoW does not account for word order.
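The difference in order sensitivity can be checked numerically. The sketch below is an illustration under assumptions: it substitutes a plain (Elman) RNN for the LSTM to stay short, and all weights and word vectors are random placeholders rather than learned values.

```python
import numpy as np


def bow_encode(word_vectors: np.ndarray) -> np.ndarray:
    """Bag-of-words: average the word vectors, ignoring their order."""
    return word_vectors.mean(axis=0)


def rnn_encode(word_vectors: np.ndarray, W_in: np.ndarray, W_rec: np.ndarray) -> np.ndarray:
    """Plain RNN (not an LSTM): the final hidden state summarizes the sequence."""
    hidden = np.zeros(W_rec.shape[0])
    for x in word_vectors:  # words are consumed one at a time, in order
        hidden = np.tanh(W_in @ x + W_rec @ hidden)
    return hidden


rng = np.random.default_rng(0)
words = rng.normal(size=(5, 4))          # 5 word vectors of dimension 4
reordered = words[::-1]                  # same words, reversed order
W_in = 0.5 * rng.normal(size=(3, 4))     # input-to-hidden weights
W_rec = 0.5 * rng.normal(size=(3, 3))    # hidden-to-hidden weights

bow_same = np.allclose(bow_encode(words), bow_encode(reordered))
rnn_same = np.allclose(
    rnn_encode(words, W_in, W_rec), rnn_encode(reordered, W_in, W_rec)
)
```

With these inputs the BoW encoding is identical for both orderings, while the RNN encoding changes.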
Combination of image and question features:
They are each passed through a learned function before being combined. The intuition here is to map the features to a joint space, in which distances between both modalities become comparable.
The output stage of a VQA system can be seen either as a generation or as a classification task.
The generation of a free-form answer has the advantage of being able to compose complex sentences. In practice, however, such a model is difficult to learn [22], [46], [80]. Current data sets are limited to short answers, and a practical alternative is instead to learn a classifier over candidate answers [22], [44], [46], [57].
For training the model, the classifier is followed by a cross-entropy loss, and the whole network is trained end-to-end by backpropagation to minimize this loss over the set of training examples.
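A forward pass of such a joint embedding classifier, together with its cross-entropy loss, might look like the following. This is a minimal sketch under assumptions: single `tanh` layers stand in for the learned functions, the two modalities are combined by an element-wise product (one of several possible choices), all dimensions and weights are made up, and backpropagation is left out.

```python
import numpy as np


def log_softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable log-probabilities over candidate answers."""
    shifted = scores - scores.max()
    return shifted - np.log(np.exp(shifted).sum())


def answer_scores(y_img: np.ndarray, y_q: np.ndarray, params: dict) -> np.ndarray:
    """Map image and question features to a joint space, then score answers."""
    h_img = np.tanh(params["W_img"] @ y_img)   # image features in joint space
    h_q = np.tanh(params["W_q"] @ y_q)         # question features in joint space
    joint = h_img * h_q                        # assumed element-wise combination
    return params["W_out"] @ joint             # one score per candidate answer


def cross_entropy(scores: np.ndarray, target: int) -> float:
    """Negative log-probability assigned to the ground-truth answer."""
    return float(-log_softmax(scores)[target])


rng = np.random.default_rng(0)
dims = {"img": 8, "q": 6, "joint": 5, "answers": 3}   # hypothetical sizes
params = {
    "W_img": rng.normal(size=(dims["joint"], dims["img"])),
    "W_q": rng.normal(size=(dims["joint"], dims["q"])),
    "W_out": rng.normal(size=(dims["answers"], dims["joint"])),
}
scores = answer_scores(
    rng.normal(size=dims["img"]), rng.normal(size=dims["q"]), params
)
loss = cross_entropy(scores, target=1)
predicted_answer = int(np.argmax(scores))
```

In training, the gradient of `loss` with respect to every weight matrix would be computed by backpropagation, usually through an automatic differentiation library.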
Encoding the question and the image with a single recurrent neural network (an LSTM) by passing the image features together with each word embedding [22] or only once prior to the question words [46], [57].
Encoding the question with a bidirectional RNN, i.e.,
Adding additional multiplicative interactions within the network and between the features of the image and of the question. For
Alternative schemes for combining image and question representations, such as element-wise sums and products [33], bilinear operations [30] such as multimodal compact bilinear pooling (MCB) [21], etc.
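Fukui et al. proposed a pooling method for combining the two feature vectors, called multimodal compact bilinear pooling (MCB). It randomly projects the image features and the text features into a higher-dimensional space; the convolution of the two resulting vectors can then be computed as an element-wise product in Fourier space.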
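The projection and Fourier-space product can be sketched with NumPy. This is an illustration under assumptions rather than the authors' implementation: the random projection is written as a count sketch (random bucket indices and random signs), the function names are hypothetical, and the dimensions are tiny.

```python
import numpy as np


def make_sketch_params(input_dim: int, output_dim: int, rng: np.random.Generator):
    """Draw a random bucket index and a random sign for every input dimension."""
    buckets = rng.integers(0, output_dim, size=input_dim)
    signs = rng.choice([-1.0, 1.0], size=input_dim)
    return buckets, signs


def count_sketch(x: np.ndarray, buckets: np.ndarray, signs: np.ndarray, output_dim: int) -> np.ndarray:
    """Project x by adding each signed entry into its assigned bucket."""
    sketch = np.zeros(output_dim)
    np.add.at(sketch, buckets, signs * x)   # accumulates when buckets repeat
    return sketch


def compact_bilinear(x, y, params_x, params_y, output_dim: int) -> np.ndarray:
    """Circular convolution of the two sketches, computed as a product of FFTs."""
    sketch_x = count_sketch(x, *params_x, output_dim)
    sketch_y = count_sketch(y, *params_y, output_dim)
    product = np.fft.rfft(sketch_x) * np.fft.rfft(sketch_y)
    return np.fft.irfft(product, n=output_dim)


rng = np.random.default_rng(0)
output_dim = 16
image_vec, question_vec = rng.normal(size=5), rng.normal(size=7)
params_img = make_sketch_params(5, output_dim, rng)
params_q = make_sketch_params(7, output_dim, rng)
pooled = compact_bilinear(image_vec, question_vec, params_img, params_q, output_dim)
```

The result equals a count sketch of the full outer product of the two vectors, so the bilinear interaction is approximated without ever forming that outer product.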
Advanced techniques:
Attention mechanisms:
One of the most effective improvements to the joint embedding model is to use visual attention. Humans have the ability to quickly understand visual representations by attending to regions of the image instead of processing the entire scene at once [58].
The main idea behind attention mechanisms is to allow the model to focus on certain regions of the image. The technique involves 1) using region-specific image features and 2) including multiplicative interactions within the neural network.
The attention weights computed for a given question/image can be visualized in the form of “attention maps” for purposes of introspection into the VQA model.
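One way such attention weights could be computed is sketched below. The particular scoring function (a product of `tanh` projections followed by a softmax), the parameter names, and the 2x3 grid of regions are assumptions made for illustration; published models differ in these details.

```python
import numpy as np


def attend(region_features, question_vec, W_r, W_q, w):
    """Weight image regions by their relevance to the question.

    region_features: (num_regions, d_img), question_vec: (d_q,).
    Returns the attended image vector and the per-region weights.
    """
    region_hidden = np.tanh(region_features @ W_r.T)    # (num_regions, d_h)
    question_hidden = np.tanh(W_q @ question_vec)       # (d_h,)
    # Multiplicative interaction between each region and the question.
    logits = (region_hidden * question_hidden) @ w      # (num_regions,)
    logits = logits - logits.max()                      # for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()     # softmax over regions
    attended = weights @ region_features                # weighted sum, (d_img,)
    return attended, weights


rng = np.random.default_rng(0)
grid_h, grid_w, d_img, d_q, d_h = 2, 3, 8, 6, 5         # hypothetical sizes
regions = rng.normal(size=(grid_h * grid_w, d_img))
question_vec = rng.normal(size=d_q)
attended, weights = attend(
    regions,
    question_vec,
    W_r=rng.normal(size=(d_h, d_img)),
    W_q=rng.normal(size=(d_h, d_q)),
    w=rng.normal(size=d_h),
)
# Reshaping the weights to the region grid gives an attention map to plot.
attention_map = weights.reshape(grid_h, grid_w)
```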
Pretraining language representations:
Each word of the input vocabulary (i.e., any word appearing in the training set) is associated with its own embedding, and those embeddings are normally learned alongside the other parameters of the network via backpropagation.
A solution to these issues is to pretrain word embeddings on a larger auxiliary data set. This practice is known in the field of natural language processing and has shown benefit in many tasks besides VQA.
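Popular methods for pretraining word embeddings include Global Vectors for Word Representation [53] (GloVe) and word2vec [48], which we outline next.
2. The word2vec network has a simple structure: an input layer, a hidden layer, and an output layer. The input layer takes the one-hot vector of a (context) word over a vocabulary of V words. The output layer is a probability distribution over the words that co-occur with the input word, that is, for every word in the vocabulary, the probability that it appears in this context. The hidden layer consists of N neurons. There are two main models, skip-gram and CBOW. Intuitively, skip-gram predicts the context given an input word, whereas CBOW predicts the input word given its context.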
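The difference between the two models shows up in how training examples are formed from a sentence. The sketch below only builds those examples; the window size and function names are assumptions, and the network and its training are left out.

```python
def skip_gram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Skip-gram: each (input word, context word) pair is one training example."""
    pairs = []
    for i, center in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(start, stop) if j != i)
    return pairs


def cbow_examples(tokens: list[str], window: int = 2) -> list[tuple[list[str], str]]:
    """CBOW: the surrounding context words jointly predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, stop) if j != i]
        if context:
            examples.append((context, center))
    return examples


tokens = ["the", "cat", "sat"]
sg = skip_gram_pairs(tokens, window=1)
# [("the", "cat"), ("cat", "the"), ("cat", "sat"), ("sat", "cat")]
cb = cbow_examples(tokens, window=1)
# [(["cat"], "the"), (["the", "sat"], "cat"), (["cat"], "sat")]
```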
Memory-augmented neural networks:
The variant proposed in [37] and [83], named dynamic memory networks (DMNs), was successfully applied to VQA. It is built around four modules (see Figure 5).
The input module transforms the input data into a set of discrete vectors called facts. A question module computes a vector representation of the question, using a gated recurrent unit (GRU), a variant of the LSTM. An episodic memory module retrieves the facts required to answer the question. Finally, the answer module uses the final state of the memory and the question to predict the final output, using a classic classifier over candidate answers.
Run time retrieval of additional information:
One limitation of the basic joint embedding approach is that it attempts to capture all of the information of the training examples within the parameters of a neural network. This cannot scale arbitrarily, however. On one hand, any network has a finite capacity and, on the other hand, training examples also provide finite information.
Several works explored the idea of connecting a VQA system with external sources of information that can be virtually infinite (e.g., web searches) or extensible without needing to retrain the VQA model (e.g., structured KBs).
In [75] and [82], the authors train a model to interface with a KB. Such KBs, like DBpedia [7] and Freebase [12], are databases compiled with facts ranging from common sense to encyclopedic knowledge.
Directions of current and future research:
State-of-the-art methods have consistently improved performance on this data set over the past few years, from an accuracy of about 58% to over 70% today.
Issues of data set biases:
The text questions alone often provide strong cues that can be sufficient to answer them correctly, with no regard to the contents of the input image.
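A simple way to probe for such bias is a "blind" baseline that never looks at the image. The sketch below is an assumption-laden illustration: it defines the question type as the first two words of the question and always predicts the most frequent training answer for that type; the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def question_type(question: str, n_words: int = 2) -> str:
    """Assumed notion of question type: the first few words, lowercased."""
    return " ".join(question.lower().split()[:n_words])


def fit_prior(train: list[tuple[str, str]]) -> dict:
    """Record the most frequent answer per question type, plus a global default."""
    per_type = defaultdict(Counter)
    overall = Counter()
    for question, answer in train:
        per_type[question_type(question)][answer] += 1
        overall[answer] += 1
    return {
        "per_type": {t: c.most_common(1)[0][0] for t, c in per_type.items()},
        "default": overall.most_common(1)[0][0],
    }


def predict_blind(model: dict, question: str) -> str:
    """Answer from the question text alone; the image is never used."""
    return model["per_type"].get(question_type(question), model["default"])


train = [
    ("Is there a dog?", "yes"),
    ("Is there a cat?", "yes"),
    ("Is there a car?", "no"),
    ("What color is the sky?", "blue"),
]
model = fit_prior(train)
guess = predict_blind(model, "Is there a bird?")
```

If a baseline of this kind scores well on a benchmark, the questions themselves are leaking the answers.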
Zhang et al. [90] first proposed a data set of clipart images where each binary question is accompanied by two different images that elicit “yes” and “no” answers, respectively.
Issues with unknown and novel words:
The current paradigm of training VQA systems with supervision, i.e., with data sets of questions and their ground-truth answers, can only cover a limited set of objects and concepts. Although VQA data sets have grown in size, no finite set of exemplars will ever cover the diversity of objects, actions, relations, etc.
These benchmarks do not encourage addressing rare words and concepts, but rather focus on the concepts most frequent in the data set.
We expect that VQA will ultimately require similar principled approaches, such as differentiable computing [26], [50], rather than brute-force learning from limited sets of examples.
External knowledge:
This requires the system not only to capture actual information from training examples, but to learn to retrieve and use novel information, i.e., learn to learn.
Modular approaches:
Compositional models:
Compositional models were proposed by Hendricks et al. on the task of image captioning [27]. Andreas et al. [3], [4], [29] were the first to propose a compositional architecture for VQA, named neural module networks.
An alternative approach that addresses compositionality is that of relational networks.
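The text gives no details of this architecture, so the following is only a sketch of the commonly described formulation, in which a function g is applied to every pair of object features together with the question, the results are summed, and a function f maps the sum to answer scores. Single linear layers stand in for g and f, and all names, sizes, and weights are assumptions.

```python
import numpy as np


def relation_network(objects, question, W_g, W_f):
    """Sum g(o_i, o_j, q) over all ordered object pairs, then apply f."""
    pair_sum = np.zeros(W_g.shape[0])
    for o_i in objects:
        for o_j in objects:
            pair = np.concatenate([o_i, o_j, question])
            pair_sum += np.maximum(W_g @ pair, 0.0)   # g: linear layer + ReLU
    return W_f @ pair_sum                             # f: linear answer scores


rng = np.random.default_rng(0)
num_objects, d_obj, d_q, d_rel, num_answers = 4, 3, 2, 6, 5   # hypothetical sizes
objects = rng.normal(size=(num_objects, d_obj))
question = rng.normal(size=d_q)
W_g = rng.normal(size=(d_rel, 2 * d_obj + d_q))
W_f = rng.normal(size=(num_answers, d_rel))

scores = relation_network(objects, question, W_g, W_f)
# Summing over all pairs makes the output independent of object order.
scores_shuffled = relation_network(objects[::-1], question, W_g, W_f)
```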
We reviewed popular approaches based on deep learning, which treat the task as a classification problem over a set of candidate answers. We described the common joint embedding model, and additional improvements that build upon this concept, such as attention mechanisms.
