[Deep Learning Paper Notes][Attention] Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention

Overview

Xu, Kelvin, et al. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." arXiv preprint arXiv:1502.03044 (2015). (Citations: 401)


1 Motivation

In previous image captioning models, the RNN decoder only looks at the whole image once. Moreover, the CNN encoder produces fc7 representations, which distill the information in the image down to the most salient objects.


However, this has the potential drawback of losing information that could be useful for richer, more descriptive captions. Using lower-level representations (conv4/conv5 features) can help preserve this information. Working with these features, however, necessitates an attention mechanism that learns to fix its gaze on salient objects while generating the corresponding words in the output sequence, which also relieves the computational burden. Another benefit of the attention model is the ability to visualize what the model "sees".


The attention model is also in accord with the human visual system. Rather than compressing an entire image into a static representation, attention allows salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image.


2 Pipeline
See the pipeline figure in the paper. Here $\vec{z}$ is the context vector, capturing the visual information associated with attention. $L$ is the number of possible locations (the different conv4/conv5 grid cells in our case), each of which is represented by a $D$-dimensional embedding vector $\vec{a}_i$. The distribution $\vec{p}$ over the $L$ locations satisfies

$$p_i \ge 0, \qquad \sum_{i=1}^{L} p_i = 1,$$

with $p_i$ produced by a softmax over attention scores $e_i = f_{att}(\vec{a}_i, \vec{h}_{t-1})$ that depend on each annotation vector and the previous decoder hidden state.

Note that $\vec{p}$ here is the $\vec{a}$ used in the figure (not to be confused with the annotation vectors $\vec{a}_i$).
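To make the shapes concrete, here is a minimal NumPy sketch of one way to compute this distribution; $f_{att}$ is realized as a small MLP, and the parameter names (`W_a`, `W_h`, `w`) are illustrative placeholders of my own, not from the paper's released code.

```python
import numpy as np

def attention_distribution(a, h_prev, W_a, W_h, w):
    """Distribution p over the L locations given annotations and decoder state.

    a:      (L, D) annotation vectors, one per conv grid cell
    h_prev: (H,)   previous decoder hidden state
    W_a: (K, D), W_h: (K, H), w: (K,) -- parameters of the f_att MLP
    """
    # Unnormalized scores e_i = w^T tanh(W_a a_i + W_h h_prev)
    e = np.tanh(a @ W_a.T + h_prev @ W_h.T) @ w      # shape (L,)
    # Softmax ensures p_i >= 0 and sum_i p_i = 1
    e -= e.max()                                     # numerical stability
    p = np.exp(e)
    return p / p.sum()

# Tiny smoke test with random parameters (e.g. a 14x14 conv grid -> L = 196)
L, D, H, K = 196, 512, 1000, 256
rng = np.random.default_rng(0)
a = rng.standard_normal((L, D))
h_prev = rng.standard_normal(H)
p = attention_distribution(a, h_prev,
                           rng.standard_normal((K, D)) * 0.01,
                           rng.standard_normal((K, H)) * 0.01,
                           rng.standard_normal(K) * 0.01)
assert np.isclose(p.sum(), 1.0) and (p >= 0).all()
```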



3 Hard Attention

At each time step, $\vec{z}$ is taken from a single location of the annotation grid:

$$l^\star = \arg\max_i \, p_i, \qquad \vec{z} = \vec{a}_{l^\star}.$$


Because of the $\arg\max$, $\partial \vec{z} / \partial \vec{p}$ is zero almost everywhere, since a slight change in $\vec{p}$ will not affect $l^\star$. Therefore it cannot be trained using SGD; reinforcement learning is used instead.
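A minimal sketch of the selection step; here I show sampling $l$ from $\vec{p}$ (the stochastic variant the paper trains with REINFORCE) rather than the arg max, with the non-differentiability noted in the comments. Baselines, entropy terms, and the reward definition are omitted details.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_attention(a, p):
    """z is the annotation vector of ONE location sampled from p."""
    l = rng.choice(len(p), p=p)      # l ~ Multinoulli(p)
    return a[l], l

# Why SGD fails here: z depends on p only through the discrete choice l,
# so dz/dp is zero almost everywhere. Training instead uses a REINFORCE /
# score-function estimator: grad ~= reward * d log p[l] / d(params),
# where only log p[l] is differentiated.
```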


4 Soft Attention
At each time step, $\vec{z}$ is a summary (expectation) over all locations:

$$\vec{z} = \sum_{i=1}^{L} p_i \, \vec{a}_i.$$


This form is easy to differentiate, so it can be trained with SGD.
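In code the soft version is a single weighted sum; a minimal NumPy sketch, reusing the shapes from the pipeline sketch above:

```python
import numpy as np

def soft_attention(a, p):
    """Expected context vector z = sum_i p_i * a_i.

    a: (L, D) annotation vectors, p: (L,) attention weights.
    Returns z with shape (D,). Every op here is differentiable, so the
    whole encoder-attention-decoder stack trains end-to-end with SGD.
    """
    return p @ a
```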


5 Doubly Stochastic Attention

Besides the softmax constraint $\sum_i p_{t,i} = 1$, we also encourage $\sum_t p_{t,i} \approx 1$ by adding the penalty $\lambda \sum_i \left(1 - \sum_t p_{t,i}\right)^2$ to the training loss. This can be interpreted as encouraging the model to pay equal attention to every part of the image over the course of generation. In practice, we found that this regularization leads to richer and more descriptive captions.
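A sketch of the corresponding penalty term; the weight `lam` is a hyperparameter, and `P` stacks the per-step attention distributions:

```python
import numpy as np

def doubly_stochastic_penalty(P, lam=1.0):
    """Penalty encouraging sum_t p_{t,i} ~= 1 for every location i.

    P: (T, L) attention weights over T decoding steps; each row already
       sums to 1 by the softmax constraint. Add the result to the loss.
    """
    return lam * np.square(1.0 - P.sum(axis=0)).sum()
```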


6 Results
See the results figure in the paper. The model can also attend to "non-object" salient regions.

