Semantics-Aligned Representation Learning for Person Re-identification (Microsoft Research Asia): Paper Reading Notes and Engineering Implementation Summary


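Give me a bottle of wine and a cigarette; when it is time to code, I code, and time is the one thing I do not have.
Dear readers, welcome, and please take a seat.
Author's GitHub: https://github.com/wencoast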

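Pipeline

Abstract

In short: the SAN is first trained with the reID supervision removed. Because this model is trained on the PIT dataset, it is called SAN-PG. The trained SAN-PG is then loaded in a separate program, together with its network definition, to generate pseudo groundtruth texture images for the reID datasets.

a decoder (SA-Dec) for reconstructing or regressing the densely semantics aligned full texture image. We jointly train the SAN under the supervisions of person re-identification and aligned texture generation (the generated synthetic texture is aligned). Moreover, at the decoder, besides the reconstruction loss, we add Triplet ReID constraints over the feature maps as the perceptual losses. So the decoder carries not only a reconstruction loss but also a triplet ReID constraint. The reconstruction loss is not shown in the pipeline figure.

Then there is the file 1_004_1_01.png_ep_1.jpg. Is this the final regressed result? That needs to be verified against the code.

The idea is to use a semantically aligned carrier (the texture image) as supervision, so that what the network learns is itself semantically aligned.

Start reading from here:

In one sentence: this is still representation learning, but the learned representation is semantically aligned for the Re-ID task.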

In this paper, we propose a framework that drives the re-ID network to
learn semantics-aligned feature representation through delicate
supervision designs

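The paper proposes a framework that drives the reID network to learn semantics-aligned feature representations through delicate supervision designs.

build a Semantics Aligning Network (SAN) (how is this network made semantically aligned?) which consists of a base network as encoder (SA-Enc, the semantics-aligned encoder for re-ID), and a decoder (SA-Dec) for reconstructing or regressing the densely semantics aligned full texture image (the semantics-aligned decoder, which reconstructs or regresses the densely semantically aligned full texture image).

So the overall SAN consists of a semantics-aligned encoder and a semantics-aligned decoder, each with its own job. How is the work divided?

After several months of working through it: the encoder extracts features, and the decoder regresses the texture image, which in turn improves the reID feature.

The supervision signals are the person re-identification objective and aligned texture generation, i.e. training runs under the supervision of person re-identification and aligned texture generation.

In the decoder, besides the reconstruction loss, Triplet ReID constraints over the feature maps are added as perceptual losses. At inference time the decoder is discarded, which makes inference computationally more efficient. Ablation studies confirm the effectiveness of these designs.

The main challenges are

large variation in

human pose
capturing view points
incompleteness of the bodies (due to occlusion)

All of these result in semantics misalignment across 2D images.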

What is a full texture image?

What is a texture image?

A texture image on the UV coordinate system represents the aligned full texture of the 3D surface of the person. That is, a texture image in UV space expresses the aligned full texture of the person's 3D surface (because humans share a generic 3D model). Besides, a texture image contains all the texture of the full 3D surface of a person.

Whoever the person is, the texture information of texture images in UV space is aligned.

Note that the texture images across different persons are densely semantically aligned. DensePose is used to obtain dense semantics from person images. So are the texture images used in this paper obtained with DensePose?

Note that using an aligned texture image to synthesize a person image of another pose or view is not a contribution of Microsoft Research Asia; that work was done by Facebook AI Research and by Wang et al. in 2019.

For input images of different persons, the corresponding texture images are well semantics aligned.
At the same spatial position of different texture images, the semantics are the same.
For person images with different visible semantics/regions, their texture images are semantics consistent/aligned since each one contains the full texture/information of the 3D person surface.

In this paper, the learned feature representation is inherently semantically aligned.

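Why use texture images?

As the person identity is mainly characterized by textures.
Person identity is mainly characterized by texture: humans share a generic 3D body model, and everyone can perform the same actions and poses, so the main difference lies in the texture of the appearance. I therefore think texture belongs to appearance.
Texture images for different persons/viewpoints/poses are densely semantically aligned, as illustrated in the paper's figure.
For input images of different persons, the corresponding texture images are well semantics aligned.
- First, at the same spatial position of different texture images, the semantics are the same: the region that represents the arm always represents the arm, and the region that represents the leg always represents the leg.
- Second, for person images with different visible semantics/regions (for example, one image shows the complete upper body while another shows the upper body without the head), their texture images are still semantically aligned, since each one contains the full texture/information of the 3D person surface.

This does not necessarily mean that the texture I generated contains the full texture information of the 3D person surface; their texture images are indeed very complete and look like a 360-degree view.

They also take the original image as input, but their SAN model has been trained on a synthetic dataset and is then used to generate the pseudo ground-truth texture images.

So the question is: starting from what I have now, how do I obtain the full texture information of the 3D person surface?

The texture I generated is 64×64, while the one released by the authors is 256×256.

First, the texture image comes from the 3D human surface, and the 3D human surface relies on a dedicated surface-based coordinate system, the UV space.

Each position (u,v) on the 3D human surface has a unique semantic identity on the texture image. For example, the pixel at the bottom-right corner of the texture image corresponds to some semantics of a hand.

In addition, a texture image contains all the texture of the full 3D surface of a person, whereas an ordinary 2D person image contains only part of the surface texture.

Does this mean the texture covers 360 degrees, while an ordinary 2D person image shows only one viewpoint?

What exactly does "full" mean in "the full 3D surface of a person"? I could ask cena.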

How is Pseudo Groundtruth Texture Images Generation done?

The strangest part: the texture image is obtained from a single image, so why do the authors call it a pseudo groundtruth texture image?

For any given input person image, we use a simplified SAN (i.e., SAN-PG) which consists of the SA-Enc and SA-Dec, but with only the reconstruction loss. Does this reconstruction loss exist only in encoder-decoder architectures?

They take a 3D-scanned texture dataset released by other authors (SURREAL), pair each texture with its input person images, and assemble a Paired Image Texture dataset (PIT).

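What is semantic misalignment?

Examples of semantic misalignment

Spatial semantics misalignment: although the viewpoints are similar, the same position in different images corresponds to different semantics of the human body (that is, which body part it really is: leg, belly, arm and so on).
For example, one shows a leg while the other shows the abdomen.
Inconsistency of visible body regions/semantics: the visible semantics differ. For example, one image shows the front side of the legs while the other shows the back side. Both are legs, but the semantics are different, because one is the front and the other is the back of a person.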

Alignment

Explicitly exploit human pose/landmark information (body part alignment). However, body part alignment is coarse.
There is still spatial misalignment within the parts.
Based on estimated dense semantics (what does this mean? Can it estimate which body attribute each pixel corresponds to?)
The benefit of semantic alignment:
To achieve fine-granularity spatial alignment.
Does the earliest work on semantic alignment come from the 2018 paper by Guler, Neverova et al.?

About Densely Semantically Aligned Person Re-Identification (CVPR 2019):

The idea is to warp images that are originally semantically misaligned into the canonical UV coordinate system, which yields semantically aligned images. In other words, first obtain the semantically aligned images, then feed these densely semantics aligned images as input to the subsequent ReID task?

But the CVPR 2019 paper still has a problem:

the invisible body regions result in many holes in the warped images and thus the inconsistency of visible body regions across images, so dense semantics misalignment still exists.

Our work

The paper introduces an aligned texture generation subtask. Unlike the CVPR 2019 work, the alignment is carried by the texture: the supervision is a densely semantics aligned texture image.

Encoder

SA-Enc can be any baseline network used for person reID.
It produces a feature map of size $h \times w \times c$, which I first assumed is then flattened to 1D.
More precisely, pooling comes before flattening: average pooling on the feature map gives the reID feature.
The reID losses are then applied to this reID feature.

To encourage SA-Enc to learn semantically aligned features, the paper introduces SA-Dec with the following setup.
SA-Dec is required to regress/generate the densely semantically aligned full texture image (called texture image for short) under pseudo ground-truth supervision.
So these semantics aligned texture images are generated by SA-Dec, and the synthetic dataset is what makes texture image generation possible.

Why does introducing the decoder this way achieve semantic alignment?

Because of: empowering the encoded feature map with aligned full texture generation capability. My impression is that the encoder first produces the reID feature, and the decoder then endows it with aligned texture generation during decoding.

The semantic alignment constraint comes from giving the encoded feature map the capability of aligned full texture generation; it seems to be the alignment of the generated texture that makes the features aligned.

So how the texture is generated matters a lot, i.e. how SA-Dec works.

Decoder: for generating the densely semantically aligned full texture image with supervision.

At the SA-Dec, besides the reconstruction loss, Triplet ReID constraints over the feature maps are used as the perceptual metric.
The encoder has the reID loss; here there are the reconstruction loss and the Triplet ReID constraints.
ReID datasets themselves have no groundtruth aligned texture images, so pseudo groundtruth texture images are generated by leveraging synthesized data with person image and aligned texture image pairs (where do these aligned texture image pairs come from?).
This is possible because of Figure 4: a texture image, a 3D mesh and a background, combined with suitable rendering parameters, produce a synthesized person image. No decoder is involved at this point, so the generated textured person image should not yet be semantically aligned.

Related Work

Semantics Aligned Human Texture
A human body could be represented by a 3D mesh (e.g. SMPL) and a texture image, as illustrated in the paper's figure. Given a texture image and a 3D mesh, the person image can be obtained through rendering.

Note: the claim is not that every point on the 2D image has a semantic identity, but that every point on the 3D mesh has a unique semantic identity, expressed by its (u,v) coordinates in UV space.

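3. The Semantic Alignment Network

In this network, densely semantically aligned full texture images are taken as supervision to drive the learning of semantics aligned features.

How is this done? How is another kind of information brought in and used as supervision?

How is it brought in?

First generate a separate folder of texture images, then read the texture images in with the code below.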

img = read_image(img_path)
img_texture = read_image(img_path.replace('images_labeled', 'texture_cuhk03_labeled'))

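Once read in, how are they handed to the network? Through the return value of __getitem__:

def __getitem__(self, index):
    # ... image loading elided; the full method appears later in this post ...
    return img, pid, camid, img_path, img_texture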
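
The path substitution above can be wrapped in a small helper. This is only a sketch: the function name `texture_path_for` and its keyword arguments are hypothetical and not part of the authors' code; the two directory names are the ones used in the snippet above. Unlike a bare `str.replace`, it swaps only a whole directory component and fails loudly when the image directory is missing from the path.

```python
from pathlib import PurePosixPath


def texture_path_for(img_path,
                     image_dir="images_labeled",
                     texture_dir="texture_cuhk03_labeled"):
    """Map a person-image path to the path of its texture image.

    Hypothetical helper: only the directory component equal to
    `image_dir` is swapped for `texture_dir`; the file name is kept.
    """
    parts = PurePosixPath(img_path).parts
    if image_dir not in parts:
        raise ValueError(f"{image_dir!r} is not a directory in {img_path!r}")
    swapped = [texture_dir if part == image_dir else part for part in parts]
    return str(PurePosixPath(*swapped))


example = texture_path_for("/data/cuhk03/images_labeled/1_003_1_01.png")
```

The texture folder has to mirror the file names of the image folder for this mapping to find anything.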

At this point the texture is available. Next, how exactly is it used as the supervision signal?

The framework figure shows an encoder for ReID (which is simply a network), plus a decoder sub-network. Only with this SA-Dec can the densely semantically aligned full texture be generated with supervision. So the texture image is actually used as supervision through SA-Dec, right?

model = models.init_model(name=args.arch, num_classes=dm.num_train_pids, loss={'xent', 'htri'})

This uses a ResNet-50 pretrained on ImageNet (with a 512-dimensional FC layer) as the architecture.

Note the loss here (values from the debugger):

loss={set:2}{'htri','xent'}
num_classes={int}767 # used together with xent

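How do the encoder and decoder work?

The decoder SA-Dec enforces constraints over the encoder. Surprisingly, it is the decoder that constrains the encoder, by requiring the encoded feature (the encoded one, not the decoded one as I first thought) to be able to predict/regress the semantically aligned full texture image. How should this be interpreted, and how is it done?

In the decoder, the number of channels gradually decreases, from an input_channel of 2048 to a final_channel of 16, while the 2D spatial size keeps increasing.
Workflow
The figure also shows that the reID feature vector f and the network's FC output are not the same thing. The FC output feeds the ID loss, while f feeds the triplet loss directly. Why attach this triplet loss, and how does it work here?

How does the encoder work?

The input image goes through the encoder for ReID, which produces the reID feature vector after pooling (more precisely, average pooling over the feature map of the encoder's last layer). The question: are this reID feature vector and the FC output the same thing? Probably not. The network parameters are supervised by the ReID loss, which is essentially cross entropy.

Answering my own question: they should not be the same thing, because

what is stored in self.global_avgpool and what is stored in self.fc are clearly two different things.

And the role of the triplet loss here? It is the ranking loss of triplet loss with batch hard mining.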
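
To make "triplet loss with batch hard mining" concrete, here is a minimal pure-Python sketch. It is not the repository's implementation: the function names are hypothetical, Euclidean distance is assumed, and the margin of 0.3 is borrowed from the decoder-side code quoted later in this post. For each anchor it takes the farthest sample of the same identity and the closest sample of a different identity.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(features, pids, margin=0.3):
    """Mean over anchors of max(d(a, hardest pos) - d(a, hardest neg) + margin, 0)."""
    losses = []
    for i, (anchor, anchor_pid) in enumerate(zip(features, pids)):
        pos = [euclidean(anchor, f)
               for j, (f, p) in enumerate(zip(features, pids))
               if p == anchor_pid and j != i]
        neg = [euclidean(anchor, f)
               for f, p in zip(features, pids) if p != anchor_pid]
        if not pos or not neg:
            continue  # an anchor needs at least one positive and one negative
        losses.append(max(max(pos) - min(neg) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0


# Two identities, well separated: no triplet violates the margin.
separated = batch_hard_triplet_loss([[0, 0], [0, 1], [5, 0], [5, 1]], [0, 0, 1, 1])
# Two identities interleaved on a line: every anchor violates the margin.
interleaved = batch_hard_triplet_loss([[0, 0], [3, 0], [1, 0], [4, 0]], [0, 0, 1, 1])
```

This is why the sampler has to put several images of each identity into one batch: without them there is no positive to mine.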

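How does the decoder work?

The loss to look at is $L_{Rec}$.

The decoder provides supervision through the densely semantically aligned full texture image.

The decoder is attached right after the last layer of the SA-Enc, so that under the constraint of the pseudo groundtruth texture image, SA-Dec reconstructs or regresses the densely semantically aligned full texture image. (Seen this way, the regressed image may look different from the pseudo groundtruth, so the two are not the same thing; I can print and display it later.) This amounts to supervised learning with the pseudo groundtruth texture images of cuhk03: the network learns by imitating them.

Indeed, the generated texture image is compared against the pseudo groundtruth texture image, and training minimizes the L1 distance between the two.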
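
As a small illustration of the reconstruction term, the sketch below computes the mean absolute difference between two equally shaped nested lists, which is what an L1 loss with mean reduction measures. The function name `l1_loss` and the toy 2×2 "images" are made up for this example; the real training code applies the loss to image tensors.

```python
def l1_loss(pred, target):
    """Mean absolute difference over all elements of two nested lists."""
    def flatten(x):
        if isinstance(x, (list, tuple)):
            for item in x:
                yield from flatten(item)
        else:
            yield float(x)

    p, t = list(flatten(pred)), list(flatten(target))
    if len(p) != len(t):
        raise ValueError("pred and target must have the same number of elements")
    return sum(abs(a - b) for a, b in zip(p, t)) / len(p)


recon_texture = [[0.0, 1.0], [2.0, 3.0]]    # toy stand-in for the decoder output
pseudo_texture = [[0.5, 1.0], [2.0, 1.0]]   # toy stand-in for the pseudo groundtruth
loss_recon = l1_loss(recon_texture, pseudo_texture)  # (0.5 + 0 + 0 + 2) / 4
```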

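The authors wrote a dedicated class for the decoder.

Why is the input 2048 channels? Isn't the decoder attached right after the last layer of the SA-Enc, and shouldn't that last layer be 512-dimensional?

There is first a UNet structure (does this mean the decoder architecture is a U-Net rather than a ResNet?)

followed by further, similar network-structure definitions.

We can see that

there is another triplet term here. The earlier one is called Triplet Loss, while this one is called Triplet ReID Constraints ( $L_{TR}$ ).

In the SA-Dec, Triplet REID constraints are further incorporated at different layers/blocks as the high level perceptual metric to encourage identity preserving reconstruction

They are applied not only at the end but at several layers and blocks. I initially could not find the matching code, because it lives in the train function rather than in the .py file that defines the loss functions.

Because it is a high-level perceptual metric, it encourages identity-preserving reconstruction: features of the same identity are pulled as close as possible, and features of different identities are pushed as far apart as possible.

It can be regarded as part of the reconstruction objective of the encoder-decoder, and it further affects the quality of the reconstruction.

====

The Triplet ReID Constraints make features of the same identity closer and features of different identities farther apart, so that the reconstruction preserves identity. The reconstruction loss, in turn, minimizes the L1 differences between the generated texture image (presumably the one showing the person, not the strange-looking texture image) and its corresponding pseudo groundtruth texture image.

There is also a loss at the decoder that helps the encoder make different identities more separable.

What does that mean? This loss minimizes the L2 difference between features of the same class and maximizes the difference between features of different classes.

The texture generation process.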

Engineering implementation

dm = ImageDataManager(use_gpu, **image_dataset_kwargs(args)) # dm is the data manager

image_dataset_kwargs is a helper function that serves ImageDataManager, and ImageDataManager is a class defined in data_manager.py. So we need to look at what this class takes as input and what it produces.

In more detail:

class ImageDataManager(BaseDataManager):
    """
    Image-ReID data manager
    """

    def __init__(self,
                 use_gpu,
                 source_names,
                 target_names,
                 root,
                 split_id=0,
                 height=256,
                 width=128,
                 train_batch_size=32,
                 test_batch_size=100,
                 workers=4,
                 train_sampler='',
                 num_instances=4, # number of instances per identity (for RandomIdentitySampler)
                 cuhk03_labeled=False, # use cuhk03's labeled or detected images
                 cuhk03_classic_split=False # use cuhk03's classic split or 767/700 split
                 ):

Before diving into ImageDataManager, first look at the image_dataset_kwargs function.

def image_dataset_kwargs(parsed_args):
    """
    Build kwargs for ImageDataManager in data_manager.py from
    the parsed command-line arguments.
    """
    return {
        'source_names': parsed_args.source_names, # {list:1}['cuhk03']: only the cuhk03 dataset is processed
        'target_names': parsed_args.target_names, # {list:1}['cuhk03']: the target matches the source, so also cuhk03
        'root': parsed_args.root, # {str}'/project/snow_datasets/Re_ID_datasets/data': parent directory of cuhk03 and the other datasets
        'split_id': parsed_args.split_id, # 0: split index (0-based); where exactly does it take effect?
        'height': parsed_args.height, # 256: default image height
        'width': parsed_args.width, # 128: default image width (raw re-id images do not come in this size)
        'train_batch_size': parsed_args.train_batch_size, # 4
        'test_batch_size': parsed_args.test_batch_size, # 4
        'workers': parsed_args.workers, # 4
        'train_sampler': parsed_args.train_sampler, # 'RandomIdentitySampler': seems to sample identities, rather than samples of a fixed identity
        'num_instances': parsed_args.num_instances, # 4: number of instances per identity (for RandomIdentitySampler)
        'cuhk03_labeled': parsed_args.cuhk03_labeled, # True
        'cuhk03_classic_split': parsed_args.cuhk03_classic_split # True, but Lan et al.'s project uses the new split protocol (767/700)
    }

# Input: the parsed args (the actual argument is args from main.py).
# Output: a dict holding a subset of the parsed args' keys and values.

Later I removed --cuhk03_classic_split from the command line, so the value passed through **kwargs became cuhk03_classic_split=False. The entries in the return statement of image_dataset_kwargs determine which keyword arguments **kwargs actually carries.
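
The interaction between the returned dict and `**kwargs` can be tried in isolation. The sketch below uses hypothetical stand-ins (`make_manager` for `ImageDataManager.__init__`, `build_kwargs` for `image_dataset_kwargs`) and shows one way a keyword can fall back to its default: when a key is left out of the dict, the callee's default value is used.

```python
from types import SimpleNamespace


def make_manager(source_names, cuhk03_labeled=False, cuhk03_classic_split=False):
    """Hypothetical stand-in for ImageDataManager.__init__; echoes what it received."""
    return {
        "source_names": source_names,
        "cuhk03_labeled": cuhk03_labeled,
        "cuhk03_classic_split": cuhk03_classic_split,
    }


def build_kwargs(parsed_args, keys):
    """Hypothetical stand-in for image_dataset_kwargs: copy selected attributes into a dict."""
    return {key: getattr(parsed_args, key) for key in keys}


args = SimpleNamespace(source_names=["cuhk03"],
                       cuhk03_labeled=True,
                       cuhk03_classic_split=True)

# All three keys forwarded: the callee sees the parsed values.
full = make_manager(**build_kwargs(
    args, ["source_names", "cuhk03_labeled", "cuhk03_classic_split"]))

# Key omitted from the dict: the callee falls back to its default (False).
partial = make_manager(**build_kwargs(args, ["source_names", "cuhk03_labeled"]))
```

A flag dropped from the command line has the same visible effect if its argparse default is False, but in that case the key is still present in the dict.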

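For each point on the model: which points are visible, and what they correspond to.

At this step, all that matters is that the model can produce DensePose output.

Start with CUHK03 (labeled).

Dataset statistics:

Split protocol: 767/700
Identities involved: 843+440+77+58+49; the data from all five camera pairs are used.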

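query={list:1400}: the images to search for. Each element of the list is one image, in the following format: ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_003_1_01.png', 3, 0]
File naming rules:
First number: index of the camera pair that captured the image; here it is pair 1.
Second field (three digits): index of the identity. No camera pair has more than 843 identities, so three digits suffice.
Third number: camera 1 or camera 2 within the camera pair.
Fourth number: index of this person's image, at most 10 (from 1 to 10).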
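Copyright notice: this is an original article by CSDN blogger 「贝勒的杭盖VanDebiao」, licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/HeavenerWen/article/details/106248257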
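Meaning of the remaining 3 and 0:
3 should be the identity index within that camera pair, the same 3 as in 1_003_1_01.png.
0 should indicate which of the two cameras in the pair took the image. Camera 0 can be thought of as the side-view camera and camera 1 as the back-view camera.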
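
The naming rule can be checked with a small parser. This is a sketch: `Cuhk03Name` and `parse_cuhk03_name` are hypothetical names, not part of the dataset code. Judging from the sample entries in this post, the camera id stored in the list appears to be the third field minus one (camera 1 becomes camid 0).

```python
from typing import NamedTuple


class Cuhk03Name(NamedTuple):
    camera_pair: int   # first field: which camera pair
    identity: int      # second field: identity index within the pair
    camera: int        # third field: camera 1 or 2 inside the pair
    image_index: int   # fourth field: image number of this person (1 to 10)


def parse_cuhk03_name(path):
    """Split a file name such as '1_003_1_01.png' into its four numeric fields."""
    filename = path.rsplit("/", 1)[-1]
    stem = filename.rsplit(".", 1)[0]
    pair, identity, camera, index = (int(field) for field in stem.split("_"))
    return Cuhk03Name(pair, identity, camera, index)


parsed = parse_cuhk03_name(
    "/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_003_1_01.png")
camid = parsed.camera - 1  # assumed 0-based camera id, matching the sample entry
```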

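gallery={list:5332}: the full gallery. Each element of the list is one image.

Example from the gallery set

Example from the query set

The query example shows that each identity has 2 images in the query set, one from camera 0 and one from camera 1, i.e. one side view and one back view. They also always seem to be the second and the eighth image. Is that to guarantee that both views are represented among the query samples?

num_gallery_cams={int}2

num_query_cams={int}2

num_train_cams={int}2

There is a set of JSON files here, generated automatically from the program and the dataset in use. I am not sure whether they would still be generated correctly for a different dataset.

Looking at the format of the training images: there are 767 training identities, and none of them is seen at test time.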

1. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_001_1_01.png', 0, 0]
2. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_001_2_06.png', 0, 1]
3. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_002_1_01.png', 1, 0]
4. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_002_2_06.png', 1, 1]
5. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_004_1_01.png', 2, 0]
6. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_004_2_06.png', 2, 1]
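# Here the second item is the label of the 767 ids, from 0 to 766. I checked: it is indeed 0 to 766.
# The third item indicates the camera: 0 for the side-view camera, 1 for the back-view camera.

Sample examples from the training set

Under this evaluation protocol, the data from all 5 camera pairs are used.

pid += self._num_train_pids
# Inspecting self shows that _num_train_pids is 0 at this point, so:
# pid = pid + self._num_train_pids = 0 + 0
#                                  = 1 + 0
#                                  = 2 + 0

After the statement self._num_train_cams += dataset.num_train_cams has run, self.train holds the final list of training entries.

Each entry is effectively the combination (img_path, pid, camid). Actual training has not started yet; in fact, even loading the training data has not started. The training data are really loaded from the following code onward: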

if self.train_sampler == 'RandomIdentitySampler':
    self.trainloader = DataLoader(
        ImageDataset(self.train, transform=transform_train),  # ImageDataset comes from: from .dataset_loader import ImageDataset
        sampler=RandomIdentitySampler(self.train, self.train_batch_size, self.num_instances),
        batch_size=self.train_batch_size, shuffle=False, num_workers=self.workers,
        pin_memory=self.pin_memory, drop_last=True
    )

The key piece here is DataLoader, imported at the top: from torch.utils.data import DataLoader.

It is a PyTorch class; see DataLoader in the PyTorch documentation.

Matching this against the official PyTorch API, the dataset argument is ImageDataset(self.train, transform=transform_train). So there is yet another class, ImageDataset, which is easy to confuse with ImageDataManager.

from .dataset_loader import ImageDataset
# ImageDataset is also a class.
# More precisely, it is the dataset class dedicated to the ReID training set.

class ImageDataset(Dataset):
    """Image Person ReID Dataset"""
    def __init__(self, dataset, transform=None):
        self.dataset = dataset
        self.transform = transform
        self.totensor = ToTensor()
        self.normalize = Normalize([.5, .5, .5], [.5, .5, .5])

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        img_path, pid, camid = self.dataset[index]
        img = read_image(img_path)

        # Add by Xin Jin, for getting texture:
        img_texture = read_image(img_path.replace('images_labeled', 'texture_cuhk03_labeled'))
        
        if self.transform is not None:
            img = self.transform(img)
            img_texture = self.normalize(self.totensor(img_texture))
        
        return img, pid, camid, img_path, img_texture

Stripped down, the class looks like this:

class ImageDataset():
    """Image Person ReID Dataset"""
    def __init__(self, dataset, transform=None):
        self.dataset = dataset
        self.transform = transform
        self.totensor = ToTensor()
        self.normalize = Normalize([.5, .5, .5], [.5, .5, .5])

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        img_path, pid, camid = self.dataset[index]
        img = read_image(img_path)

Combined with dataset = ImageDataset(self.train, transform=transform_train), we can see that

self.dataset = dataset = self.train
self.transform = transform_train
# Two more attributes are created on self (the ImageDataset instance,
# i.e. the dataset class dedicated to the ReID training set):
self.totensor = ToTensor()
self.normalize = Normalize([.5, .5, .5], [.5, .5, .5])

The next step is __getitem__, which returns a single image sample and also attaches the synthesized texture. I should check the paper for how they describe obtaining and using this texture.

if self.transform is not None:
    img = self.transform(img)  # apply the transform to the original image
    # Convert the texture image to a tensor, then normalize it; this normalize step is kept unchanged.
    img_texture = self.normalize(self.totensor(img_texture))

return img, pid, camid, img_path, img_texture

Next comes DataLoader's second argument, sampler:

sampler=RandomIdentitySampler(self.train, self.train_batch_size, self.num_instances)

RandomIdentitySampler is a class imported as follows:

from .samplers import RandomIdentitySampler

Details of this class:

class RandomIdentitySampler(Sampler):
    """
    Randomly sample N identities, then for each identity,
    randomly sample K instances, therefore batch size is N*K.
    (The same holds for PIT: N identities, K instances each.)

    Args:
    - data_source (list): list of (img_path, pid, camid).
    - num_instances (int): number of instances per identity in a batch.
    - batch_size (int): number of examples in a batch.
    """

In practice, this class decides how a batch is assembled.

    def __init__(self, data_source, batch_size, num_instances):
        self.data_source = data_source
        self.batch_size = batch_size  # 4 for both training and testing here; 64 for PIT
        self.num_instances = num_instances  # instances sampled per identity; also 4 for PIT
        self.num_pids_per_batch = self.batch_size // self.num_instances  # identities per batch
        # batch size = instances per identity * identities per batch
        # With batch_size=4: 4 // 4 = 1 identity per batch.
        # With batch_size=64: 64 // 4 = 16 identities per batch. This does not mean
        # that all 9 or 10 images of each of those 16 identities are used.
        self.index_dic = defaultdict(list)
        for index, (_, pid, _) in enumerate(self.data_source):
            self.index_dic[pid].append(index)
        self.pids = list(self.index_dic.keys())

parser.add_argument('--num-instances', type=int, default=4,
                    help="number of instances per identity")
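
As a rough illustration of the "N identities times K instances" rule, the sketch below assembles a single batch of indices. It is a simplification, not the repository's sampler: `sample_batch` is a hypothetical function, the toy `data_source` is made up, and identities with fewer than K images are sampled with replacement.

```python
import random
from collections import defaultdict


def sample_batch(data_source, batch_size, num_instances, rng):
    """Pick batch_size // num_instances identities, then num_instances indices each."""
    index_dic = defaultdict(list)
    for index, (_, pid, _) in enumerate(data_source):
        index_dic[pid].append(index)

    num_pids_per_batch = batch_size // num_instances
    chosen_pids = rng.sample(sorted(index_dic), num_pids_per_batch)

    batch = []
    for pid in chosen_pids:
        idxs = index_dic[pid]
        if len(idxs) < num_instances:
            # Too few images for this identity: sample with replacement.
            picked = [rng.choice(idxs) for _ in range(num_instances)]
        else:
            picked = rng.sample(idxs, num_instances)
        batch.extend(picked)
    return batch


# Toy data: 8 identities with 5 images each, entries are (img_path, pid, camid).
data_source = [(f"img_{i}.png", i // 5, i % 2) for i in range(40)]
batch = sample_batch(data_source, batch_size=8, num_instances=4, rng=random.Random(0))
pids_in_batch = [data_source[i][1] for i in batch]
```

With batch_size=8 and num_instances=4 the batch covers exactly 2 identities; with batch_size=4 it would cover only 1.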

Next comes a key piece, defaultdict:

from collections import defaultdict
# part of the Python standard library

There is also another RandomIdentitySampler, the one in SAN-PG.

From the official Python 3.7 API documentation:

What is a container datatype?

Leaving that detail aside, what does defaultdict do, and why is it used here? Because: It gets more interesting when the values in a dictionary are collections (lists, dicts, etc.)

defaultdict: dict subclass that calls a factory function to supply missing values

Further reading: an explanation of Python factory functions on Quora, and a blog post on how defaultdict works and how to use it.

The statement self.index_dic = defaultdict(list) initially yields an empty index_dic:

index_dic={defaultdict:0}defaultdict(<class 'list'>, {})

In this case, the value (an empty list or dict) must be initialized the first time a given key is used. While this is relatively easy to do manually, the defaultdict type automates and simplifies these kinds of operations.

defaultdict(<class 'list'>, {0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 1: [10, 11, 12, 13, 14]})
# The values of this dictionary are lists, which is why defaultdict is used.

The dictionary type is:

defaultdict(<class 'list'>, ...) # the factory names the kind of collection

The dictionary is:

{0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 1: [10, 11, 12, 13, 14]}

The keys are:

0

1

The values are:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

[10, 11, 12, 13, 14]

This level of understanding is enough for the rest of the program; the finer points of defaultdict can wait.
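
A short comparison shows what defaultdict saves. The `pairs` list is made-up toy data of (pid, sample index) tuples; the first loop initializes each list by hand, the second lets `defaultdict(list)` do it, and the last lines show that merely reading a missing key creates an empty list for it.

```python
from collections import defaultdict

pairs = [(0, 0), (0, 1), (1, 10), (0, 2), (1, 11)]  # (pid, sample index)

# Manual version: the empty list must be created on first use of a key.
manual = {}
for pid, idx in pairs:
    if pid not in manual:
        manual[pid] = []
    manual[pid].append(idx)

# defaultdict version: list() is called automatically for a missing key.
auto = defaultdict(list)
for pid, idx in pairs:
    auto[pid].append(idx)

# Side effect worth knowing: reading a missing key inserts it.
probe = defaultdict(list)
_ = probe[7]
```

The side effect in the last two lines is the reason to use `key in d` or `d.get(key)` when a lookup should not modify the dictionary.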

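self.pids = list(self.index_dic.keys())
# This yields:
pids={list:767}[0, 1, 2, 3, ..., 766]
# so sampling is done over all identities.

Then comes the __iter__ method. It need not be studied in depth for now, as long as the images get no special treatment.

    def __iter__(self):
        batch_idxs_dict = defaultdict(list)  # same technique, now grouping the indices of each batch

        for pid in self.pids:
            # Plain assignment in Python does not copy objects; it only binds a name to the same object.
            # deepcopy leaves the original self.index_dic[pid] untouched.
            idxs = copy.deepcopy(self.index_dic[pid])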
            if len(idxs) < self.num_instances:
                idxs = np.random.choice(idxs, size=self.num_instances, replace=True)
            random.shuffle(idxs)
            batch_idxs = []
            for idx in idxs:
                batch_idxs.append(idx)
                if len(batch_idxs) == self.num_instances:
                    batch_idxs_dict[pid].append(batch_idxs)
                    batch_idxs = []

For collections that are mutable or contain mutable items, a copy is sometimes needed so one can change one copy without changing the other.

What copy.deepcopy does:

copy.deepcopy(x[, memo])
Return a deep copy of x.

A deep copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original.

Two problems often exist with deep copy operations that don't exist with shallow copy operations:

- Recursive objects (compound objects that, directly or indirectly, contain a reference to themselves) may cause a recursive loop.
- Because deep copy copies everything it may copy too much, such as data which is intended to be shared between copies.
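
The difference between binding a name and copying can be seen in a few lines. This sketch mirrors the sampler's situation with a toy `index_dic`: the alias is the very same list object, while the deep copy can be shuffled and extended without touching the stored index list.

```python
import copy
import random

index_dic = {0: [0, 1, 2, 3]}

# Plain assignment: both names refer to the same list object.
alias = index_dic[0]
alias_is_same = alias is index_dic[0]

# Deep copy: an independent list that is safe to shuffle or modify.
idxs = copy.deepcopy(index_dic[0])
random.Random(0).shuffle(idxs)
idxs.append(99)
```

For a flat list of integers a shallow copy such as `list(index_dic[0])` would protect the original just as well; deepcopy only matters when the elements themselves are mutable.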

Depending on the train_sampler, there are two different kinds of self.trainloader.

The code is nicely written. With the training data organized, the test part comes next. All data handling (train and test, the latter split into query and gallery) lives in a single file, data_manager.py.

Because I am currently training, the test-related structures are still empty in the debugger.

During the train phase, the test data take no part.

However, after the cuhk03 dataset is read once more,

        for name in self.target_names:
            dataset = init_imgreid_dataset(
                root=self.root, name=name, split_id=self.split_id, cuhk03_labeled=self.cuhk03_labeled,
                cuhk03_classic_split=self.cuhk03_classic_split
            )

the contents of testloader_dict change.

Even in the train phase, once the data have been read again, query and gallery are populated.

After the following code:

self.testdataset_dict[name]['query'] = dataset.query
self.testdataset_dict[name]['gallery'] = dataset.gallery

testdataset_dict changes as well.

Only now is the data setup (train and test) complete. And this only gives us dm, the object that manages the data; nothing has actually been loaded yet.

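dm = ImageDataManager(use_gpu, **image_dataset_kwargs(args))
trainloader, testloader_dict = dm.return_dataloaders()
# Why is testloader_dict returned rather than testdataset_dict?

trainloader

A few numbers here are quite important, for example 1492 and 7365; their exact meaning is worth checking later.

The trainloader is very important, because it is what organizes the data during training.

The training code that uses it:

for batch_idx, (imgs, pids, _, img_paths, imgs_texture) in enumerate(trainloader):
# batch_idx: I expected it to start from 0, and it does.
# imgs: I expected the original images; actually it is the tensor of the original images.
# pids: I expected a single identity index; actually it is a list such as [556, 556, 556, 556].
# img_paths: the paths of the sampled images, as several absolute path strings.
# imgs_texture: I expected the texture images that were read in; actually it is their tensor.

What follows shows the actual values (with batch_size=4), not my expectations.

imgs is not the raw images but their tensor, with shape torch.Size([4, 3, 256, 128]). This is probably related to why the texture images they provide are 256×256.

From the debugger, the batch indeed contains 4 images of a single identity.

Likewise, imgs_texture has shape torch.Size([4, 3, 256, 256]).

The _ placeholder holds the camera ids: here the first three images come from the side-view camera and the last one from the back-view camera.

Several intermediate tensors are also important: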

outputs, features, feat_texture, x_down1, x_down2, x_down3 = model(imgs)

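Several tensors on the decoder side are important as well:

recon_texture, x_sim1, x_sim2, x_sim3, x_sim4 = model_decoder(feat_texture, x_down1, x_down2, x_down3)

What are x_sim1, x_sim2, x_sim3, x_sim4 for, and are they used later?

The most important one is certainly recon_texture. outputs, features, feat_texture and recon_texture should be converted from tensors to numpy arrays and saved, to see what they look like, especially feat_texture and recon_texture. Since they currently show as {size:4}, they need to be split and processed further.

recon_texture has shape torch.Size([4, 3, 256, 256]).

The place where the reconstruction loss actually takes effect:

loss_rec = nn.L1Loss()
loss_tri = nn.MSELoss()
loss_recon = loss_rec(recon_texture, imgs_texture)
# L1 constrains the gap between recon_texture and imgs_texture.

But this alone does not show any triplet term; it only shows that L1 is used to improve the reconstruction.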

# L1 loss to push same id's feat more similar:
loss_triplet_id_sim1 = 0.0
loss_triplet_id_sim2 = 0.0
loss_triplet_id_sim3 = 0.0
loss_triplet_id_sim4 = 0.0

This is where the triplet idea is really used again, this time on the reconstruction side.

for i in range(0, ((args.train_batch_size//args.num_instances)-1)*args.num_instances, args.num_instances):
    loss_triplet_id_sim1 += max(loss_tri(x_sim1[i], x_sim1[i+1])-loss_tri(x_sim1[i], x_sim1[i+4])+0.3, 0.0)
    loss_triplet_id_sim2 += max(loss_tri(x_sim2[i+1], x_sim2[i+2])-loss_tri(x_sim2[i+1], x_sim2[i+5])+0.3, 0.0)
    loss_triplet_id_sim3 += max(loss_tri(x_sim3[i+2], x_sim3[i+3])-loss_tri(x_sim3[i+2], x_sim3[i+6])+0.3, 0.0)
    loss_triplet_id_sim4 += max(loss_tri(x_sim4[i], x_sim4[i+3])-loss_tri(x_sim4[i+3], x_sim4[i+4])+0.3, 0.0)
loss_same_id = loss_triplet_id_sim1 + loss_triplet_id_sim2 + loss_triplet_id_sim3 + loss_triplet_id_sim4
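
To see which samples the loop above compares, the sketch below lists the (anchor, positive, negative) batch positions it touches. `triplet_index_groups` is a hypothetical helper that only mirrors the index arithmetic of the quoted loop. It assumes the batch is laid out as consecutive blocks of num_instances images per identity, so that position i+1 shares the identity of i while i+4 belongs to the next identity (the offset 4 is hard-coded in the loop and matches num_instances=4).

```python
def triplet_index_groups(train_batch_size, num_instances):
    """Return, per loop step, the (anchor, positive, negative) positions for each x_sim level."""
    groups = []
    last = ((train_batch_size // num_instances) - 1) * num_instances
    for i in range(0, last, num_instances):
        groups.append({
            "sim1": (i, i + 1, i + 4),
            "sim2": (i + 1, i + 2, i + 5),
            "sim3": (i + 2, i + 3, i + 6),
            # The quoted line pairs (i, i+3) and (i+3, i+4), so i+3 acts as the anchor.
            "sim4": (i + 3, i, i + 4),
        })
    return groups


groups_64 = triplet_index_groups(train_batch_size=64, num_instances=4)
groups_4 = triplet_index_groups(train_batch_size=4, num_instances=4)
```

With a batch of 64 the loop runs 15 times. With the batch size of 4 used in the debugging session above, the range is empty, so under that setting these triplet terms stay at 0.0.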

How exactly these are used, and why, I have not had time to write up; I will follow up later.

First, outputs has shape torch.Size([4, 767]). What is the 767 dimension, is it the FC layer, and why 767?
Because 767 is the total number of classes. Here outputs is a Tensor ({Tensor:4}), not a list or tuple, so the isinstance check returns False.

Next, features has shape torch.Size([4, 512]). features is likewise a Tensor rather than a tuple or list.

Then feat_texture has shape torch.Size([4, 2048, 16, 8]).

Then x_down1, x_down2, x_down3 have shapes torch.Size([4, 256, 64, 32]), torch.Size([4, 512, 32, 16]) and torch.Size([4, 1024, 16, 8]). These should correspond to $f_{e1}$ , $f_{e2}$ , $f_{e3}$ in the pipeline figure.

The main things to tell apart are outputs, features and feat_texture. Is outputs used later, and how?

Yes: cross entropy uses outputs, and the triplet loss uses features.

Also, return_dataloaders() is defined in data_manager.py and decorated with @property.

Warm-up settings in person re-identification

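# warm_up settings:
optimizer_warmup = torch.optim.Adam(model.parameters(), lr=8e-06, weight_decay=args.weight_decay, betas=(0.9, 0.999))
# the warm-up optimizer
scheduler_warmup = lr_scheduler.ExponentialLR(optimizer_warmup, gamma=1.259)
# the warm-up scheduler; does a learning-rate warm-up always have to go through a dedicated warm-up optimizer?

The baseline starts with a large constant learning rate. Other papers have shown that a warm-up strategy works better for person re-identification models: start from a small learning rate and raise it gradually over the first few epochs (the red curve in the figure), instead of starting with a large learning rate right away (the blue curve).

The figure is quoted from the Zhihu column of 那些板砖的日子.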
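
The effect of these warm-up numbers can be worked out without PyTorch. An exponential scheduler multiplies the learning rate by gamma once per epoch, so after k epochs the rate is lr0 * gamma**k. The sketch below uses the values from the snippet above (8e-06 and 1.259); the choice of 10 warm-up epochs and the function name are assumptions made for illustration.

```python
def warmup_learning_rates(base_lr=8e-6, gamma=1.259, num_epochs=10):
    """Learning rate at the start of each epoch under per-epoch multiplication by gamma."""
    return [base_lr * gamma ** epoch for epoch in range(num_epochs + 1)]


lrs = warmup_learning_rates()
growth = lrs[-1] / lrs[0]  # total growth factor after 10 epochs
```

Since 1.259 is close to the tenth root of 10, ten epochs raise the rate by about a factor of ten, from 8e-06 to roughly 8e-05.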

Here the encoder and decoder are optimized separately.

At first I thought there was only optimizer_encoder:

optimizer_encoder = torch.optim.Adam(model.parameters(), lr=0.00001, betas=(0.5, 0.999))

But that is not the case; the decoder has its own optimizer:

optimizer_decoder = torch.optim.Adam(model_decoder.parameters(), lr=0.00001, betas=(0.5, 0.999))  # hyperparameters assumed to match the encoder's

About the printed output

Every 10 batches, the script prints Time, Data, Loss and the value of Loss_recon.

Epoch: [1][10/1492]	Time 0.281 (424.272)	Data 0.0095 (4.4704)	Loss 6.7790 (6.6711)	Loss_recon 0.4306 (0.5146)	
Epoch: [1][20/1492]	Time 0.282 (212.276)	Data 0.0096 (2.2401)	Loss 6.6996 (6.6891)	Loss_recon 0.5033 (0.4998)	
Epoch: [1][30/1492]	Time 0.277 (141.610)	Data 0.0096 (1.4967)	Loss 6.5957 (6.6648)	Loss_recon 0.5269 (0.4921)	
Epoch: [1][40/1492]	Time 0.277 (106.276)	Data 0.0097 (1.1249)	Loss 6.7289 (6.6543)	Loss_recon 0.4795 (0.4902)	
Epoch: [1][50/1492]	Time 0.287 (85.077)	    Data 0.0106 (0.9019)	Loss 6.6434 (6.6531)	Loss_recon 0.3934 (0.4831)

Why does the printing stop here?

Epoch: [1][480/1492]	Time 0.285 (9.119)	Data 0.0097 (0.1031)	Loss 6.3775 (6.6790)	Loss_recon 0.2461 (0.3105)	
Epoch: [1][490/1492]	Time 0.280 (8.939)	Data 0.0099 (0.1013)	Loss 6.1843 (6.6781)	Loss_recon 0.2337 (0.3099)	
Epoch: [1][500/1492]	Time 0.284 (8.766)	Data 0.0098 (0.0995)	Loss 6.9592 (6.6812)	Loss_recon 0.3021 (0.3084)	
Epoch: [1][510/1492]	Time 0.279 (8.600)	Data 0.0098 (0.0977)	Loss 6.8927 (6.6811)	Loss_recon 0.2927 (0.3065)	
Epoch: [1][520/1492]	Time 0.286 (8.440)	Data 0.0106 (0.0960)	Loss 6.8176 (6.6848)	Loss_recon 0.3276 (0.3054)	
Epoch: [1][530/1492]	Time 0.287 (8.286)	Data 0.0102 (0.0944)	Loss 6.5900 (6.6861)	Loss_recon 0.2101 (0.3042)
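
Each pair of numbers in the log reads naturally as "current value (running average)", a common logging pattern in PyTorch training scripts. The class below is a minimal sketch of such a meter, written for this note rather than copied from the repository, and the timings fed into it are made-up values.

```python
class AverageMeter:
    """Track the latest value and the running average of a quantity."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0.0
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


batch_time = AverageMeter()
batch_time.update(100.0)      # one very slow first iteration (illustrative value)
for _ in range(9):
    batch_time.update(0.28)   # nine ordinary iterations
summary = f"Time {batch_time.val:.3f} ({batch_time.avg:.3f})"
```

Under this reading, the averaged Time in the log above (424.272 at batch 10, 212.276 at batch 20, 141.610 at batch 30) is consistent with a single very slow first iteration whose cost is spread over a growing number of batches.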

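Training details: ResNet-50 is used to build the SA-Enc, and ResNet-50 without the decoder (with the ID loss and the triplet loss) serves as the baseline.

This is already enough for Re-ID, because it extracts Re-ID features and, with cross entropy on top, forms a classifier. In addition, the triplet loss pushes different classes further apart. The trained model is then used to extract features.

Train SAN-PG first

To provide the image pairs, i.e. the person image and its texture image, the authors synthesized a Paired-Image-Texture (PIT) dataset. The paper says there are in total 9,290 different synthesized (person image, texture image) pairs. Something looked off to me: there are 9,290 person images but not 9,290 texture images. There are 929 textures, and each one corresponds to 10 images in the images folder.

images: 9290 person images with different poses, viewpoints, and texture
GT_texture: 929 kinds of texture, each texture corresponds to 10 person images that are stored in images folder
image-texture-label: save the corresponding relationship of (person image, texture), which is used for person texture prediction/synthesis training.

To make the data as realistic as possible, the background images used for rendering are randomly sampled from the COCO dataset.

Aren't the texture images 512×512 rather than 256×256?

The SAN-PG is trained with the synthesized PIT dataset (SAN-PG consists of the SA-Enc and SA-Dec, but with only the reconstruction loss).

How is the PIT dataset loaded?
And how do I make sure the SAN-PG structure is the one being used?
The U-Net variant I used directly relied on a pretrained model and a U-Net architecture, so I should look at what the U-Net architecture looks like without pretrained weights.

SAN-PG

SAN-PG encoder

Which network is used?

A ResNet (whether it is ResNet-50 is not yet confirmed), with a custom $3 \times 3$ convolution conv3x3 and a custom $3 \times 3$ deconvolution deconv3x3. The deconvolution uses Upsample: it acts like upsampling, shrinking the channel count while enlarging the spatial size. A basic ResNet block is also defined.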

def deconv3x3(in_planes, out_planes, stride=1):
    return nn.Sequential(
        nn.Upsample(scale_factor=stride, mode='bilinear'),
        nn.ReflectionPad2d(1),
        nn.Conv2d(in_planes, out_planes,
                  kernel_size=3, stride=1, padding=0)
    )

# Basic resnet block:
# x ---------------- shortcut ---------------x
# ___conv___norm____relu____conv____norm____/
这个是shortcut的组成, 指的是下图这个结构。

在这里插入图片描述
那BasicResBlock和ConvResBlock啥区别呢？

ConvResBlock是那种最经典的带2个conv layer的残差块. 然后，它的结构可以描述如下：

#ResBlock: A classic ResBlock with 2 conv layers and a up/downsample conv layer. (2+1)
#x ---- BasicConvBlock ---- ReLU ---- conv/upconv ----

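It looks like the network is built from BasicResBlock and ConvResBlock together.

Next is how the encoder is initialized, which involves changing the losses.

model = models.init_model(name=args.arch, num_classes=dm.num_train_pids, loss={'xent', 'htri'})
# How should xent and htri be changed here?
# The re-ID supervision probably has to be removed!

The network is the same as the original SAN, only "with the reID supervisions removed".

Given an input person image, the SAN-PG outputs predicted texture image as the pseudo groundtruth.

This sentence makes me doubt whether the one-to-one correspondence I set up as supervision is right. Isn't SAN-PG meant to be trained on synthetic data and then generate pseudo texture images? Why would the pseudo texture images be used as supervision?

How is it initialized?

SAN-PG decoder

Which network is used?

Does it use U-Net? Yes, the decoder architecture is U-Net style.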

# For UNet structure:
        self.embed_layer3 = nn.Sequential(
                        nn.Conv2d(in_channels=1024, out_channels=512,
                                  kernel_size=1, stride=1, padding=0),
                        nn.BatchNorm2d(512),
                        nn.ReLU(inplace=True)
                    )
        self.embed_layer2 = nn.Sequential(
                        nn.Conv2d(in_channels=512, out_channels=256,
                                  kernel_size=1, stride=1, padding=0),
                        nn.BatchNorm2d(256),
                        nn.ReLU(inplace=True)
                    )
        self.embed_layer1 = nn.Sequential(
                        nn.Conv2d(in_channels=256, out_channels=64,
                                  kernel_size=1, stride=1, padding=0),
                        nn.BatchNorm2d(64),
                        nn.ReLU(inplace=True)
                    )

        self.reduce_dim = nn.Sequential(
                          nn.Conv2d(input_channel, input_channel//4, kernel_size=1, stride=1, padding=0), # 2048, 512, 1, 1, 0
                          nn.BatchNorm2d(512),
                          nn.ReLU(inplace=True)
                    )     # torch.Size([64, 512, 16, 8]) # 2048//4 = 512

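First, ConvResDecoder is defined; it acts as the stack of conv residual blocks used for upsampling.
Why "upsampling"? Presumably because a low-resolution feature map is turned back into an image. Is torch.Size([64, 512, 16, 8]) the final output of the U-Net part above? Judging from where the comment sits, it is the output of reduce_dim, and the four numbers follow PyTorch's (batch, channels, height, width) layout: a batch of 64, 512 channels, and a 16×8 feature map. I first misread 64 and 512 as width and height.

# 'up' means the block produces a larger feature map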
self.up1 = ConvResBlock(512, 256, direction='up', stride=2, norm_layer=nn.BatchNorm2d, activation_layer=nn.ReLU(inplace=True)) # torch.Size([64, 256, 32, 16])
self.up2 = ConvResBlock(256, 64, direction='up', stride=2, norm_layer=nn.BatchNorm2d, activation_layer=nn.ReLU(inplace=True))  # torch.Size([64, 64, 64, 32])
self.up3 = ConvResBlock(64, 32, direction='up', stride=2, norm_layer=nn.BatchNorm2d, activation_layer=nn.ReLU(inplace=True))   # torch.Size([64, 32, 128, 64])
self.up4 = ConvResBlock(32, 16, direction='up', stride=2, norm_layer=nn.BatchNorm2d, activation_layer=nn.ReLU(inplace=True))   # torch.Size([64, 16, 256, 128])

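What do the torch.Size values in the snippet above mean? They use the (batch, channels, height, width) layout, so torch.Size([64, 16, 256, 128]) is a batch of 64 feature maps with 16 channels, height 256, and width 128.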
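The shape comments next to up1 to up4 can be reproduced with a few lines of arithmetic. This is a sketch, assuming each `direction='up'` block with `stride=2` doubles height and width and sets the channel count to its second argument; `trace_decoder_shapes` is a hypothetical helper, not a function from the repository.

```python
def trace_decoder_shapes(start, out_channels, stride=2):
    """Return the (N, C, H, W) shape after each upsampling block."""
    n, _, h, w = start
    shapes = []
    for c_out in out_channels:
        h, w = h * stride, w * stride   # spatial size doubles per block
        shapes.append((n, c_out, h, w))
    return shapes


# Start from the reduce_dim output noted in the code comments.
shapes = trace_decoder_shapes((64, 512, 16, 8), [256, 64, 32, 16])
for name, shape in zip(["up1", "up2", "up3", "up4"], shapes):
    print(name, shape)
```

The last entry, (64, 16, 256, 128), matches the 256×128 input resolution, which is why four blocks are needed after a 16×8 bottleneck.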

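How is it initialized?

Handling the PIT dataset

How do I link this dataset's images folder to GT_texture?

[image]

[image]
How is this correspondence implemented in code?

What we have are images 00000, 00001, 00002, …, 00009, and every ten of them correspond to ten consecutive lines of the label file.

One option is to read 10 lines at a time as a subset, then map the subset's entries [0], [1], [2], … to images 0 through 9.

It probably works like this: read the txt file line by line, and increment the image index in the path by 1 for each subsequent line.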
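The grouping described above (929 textures, 10 person images each) can be sketched as integer division on the image index. This is only an illustration under assumptions: the zero-padded `.png` file names and the helper `pair_images_with_textures` are hypothetical, and the real image-texture-label file should be treated as the source of truth once its format is confirmed.

```python
def pair_images_with_textures(num_images, images_per_texture=10):
    """Pair image i with texture i // images_per_texture."""
    pairs = []
    for image_idx in range(num_images):
        texture_idx = image_idx // images_per_texture
        # Hypothetical naming scheme: five-digit, zero-padded indices.
        image_name = "{:05d}.png".format(image_idx)
        texture_name = "{:05d}.png".format(texture_idx)
        pairs.append((image_name, texture_name))
    return pairs


pairs = pair_images_with_textures(9290)
print(len(pairs), len({texture for _, texture in pairs}))
```

If the label file turns out to list the pairs explicitly, reading it line by line is safer than relying on the index arithmetic.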

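How do args.height and args.width take effect in the program, and where?

My plan: load PIT the same way the re-ID data is organized and loaded, then write a file operation that associates each image with its texture for the reconstruction constraint.

Load the PIT images first; shuffling is still needed.

After reading the original authors' code: how the data gets from the .mat file into the labeled or detected folder is not what matters here; what matters is how the prepared data is fed to the model.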

def _extract_img(name):
    print("Processing {} images (extract and save) ...".format(name))
    meta_data = []
    imgs_dir = self.imgs_detected_dir if name == 'detected' else self.imgs_labeled_dir
    for campid, camp_ref in enumerate(mat[name][0]):
        camp = _deref(camp_ref)
        num_pids = camp.shape[0]
        for pid in range(num_pids):
            img_paths = _process_images(camp[pid,:], campid, pid, imgs_dir)
            assert len(img_paths) > 0, "campid{}-pid{} has no images".format(campid, pid)
            meta_data.append((campid+1, pid+1, img_paths))
        print("- done camera pair {} with {} identities".format(campid+1, num_pids))
    return meta_data

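_extract_img and _process_images only serve to pull the images out of the .mat file and name them, which is not my focus.

At the point where this line in _extract_img runs, self.imgs_detected_dir is still an empty directory; the images are about to be extracted and saved one camera pair at a time.

for pid in range(num_pids):
    img_paths = _process_images(camp[pid,:], campid, pid, imgs_dir)

What is the img_paths returned here: one path or several? Judging from the len(img_paths) > 0 assertion, it is a list of paths for that identity.

[image]
In main, pid seems to start from 1.

Mine seems to start from 0.

Delete all the unused code related to query and gallery; that yields a clean SAN-PG whose model is the same as SAN.

Below is the SAN-PG case.

Below is the main case.
[image]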

Different Reconstruction Guidance

SAN-PG obviously reconstructs. Does the re-ID SAN reconstruct as well?

[image]

Yes, because "the addition of a reconstruction subtask helps improve the reID performance which encourages the encoded feature to preserve more original information". So the original SAN already includes reconstruction.

What Table 2 tells us: on CUHK03 the authors also tried reconstructing the input image and the pose aligned person image (generated alongside the PIT dataset), as well as pose aligned images from PN-GAN (another method for generating them). None of these worked as well as reconstructing the texture. All of these experiments used the combination of reconstruction loss, ID loss, and triplet loss.

The paper studies the effect of using different reconstruction guidance and reports the results in Table 2: the same encoder-decoder network (SAN-basic, where "basic" means trained with the reconstruction loss, ID loss, and triplet loss) with three different reconstruction targets:

reconstructing the input image
reconstructing the pose aligned person image
reconstructing the texture image

Note:

To have the pose aligned person image as supervision, while synthesizing the PIT dataset the authors also synthesized, for each projected person image, a person image of a given fixed pose (a frontal pose here). It is produced at the same time as PIT, following the figure below.

[image]
This pose aligned person image is also semantically aligned (texture images are not the only aligned representation). The code that produces it has not been open-sourced either.

[image]

This kind of reconstruction also keeps only part of the texture, and the authors point out that it does incur information loss.

[image]
In this case, only the partial texture (front body regions) of the full 3D surface texture is retained, with information loss. In other words, the pose aligned person image keeps only the front body regions. Besides their own synthesized pose aligned person images, the authors also used pose aligned person images generated by PN-GAN.

[image]

4. Experiment

4.1 Datasets and Evaluation Metrics

4.2 Implementation Details

4.3 Ablation Study

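What kind of network is SAN-basic exactly? As summarized in the table: ResNet50 + ConvResDecoder.

How many network variants are involved in total? Six.

Does every variant have an encoder and a decoder, with ResNet-50 as the encoder? Not every variant has a decoder; whether there is a decoder comes down to whether any decoder-side loss is used.

Baseline (ResNet-50): does this mean no decoder? Yes, no decoder.

SA-Enc (ResNet-50) with the ID loss and the triplet loss: correct, no decoder (there is no triplet ReID constraint and no reconstruction). Having a decoder or not can be read as whether the decoder losses are present, or equivalently whether their weights are 0.

Did I train with just those two losses? No, I also included the triplet ReID constraint and the reconstruction loss. My results match the baseline, but what I ran is not the real baseline; it has all four losses.

I am fairly sure the last two losses were enabled during my training, although this still needs to be confirmed.
[image]

the last spatial down-sample operation in the last Conv layer is removed
This refers to the spatial down-sampling in the last conv layer of ResNet-50 being removed; it has nothing to do with the decoder.

[image]

This says the decoder (SA-Dec) is made of 4 residual up-sampling blocks, and that SA-Dec has only about one third of the parameters of SA-Enc (ResNet-50).

Nothing here says the baseline involves SA-Dec.

From the paper: "We also take it as our baseline (Baseline) with both ID loss and triplet loss."

At the very least you need to know which losses each training run used; the loss configuration is what this whole experiment hinges on.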

SAN-basic
Trained with the supervision of the pseudo groundtruth texture images (re-ID datasets have no real scanned textures, hence "pseudo groundtruth"), using:

Reconstruction loss, ID loss, triplet loss

[image]

SAN-PG

[image]
Reconstruction loss only (ID loss weight 0, triplet loss weight 0)

SAN w/ $L_{TR}$

[image]
Reconstruction loss, ID loss, triplet loss, Triplet ReID Constraints

SAN w/ syn. data

[image]
Uses the PIT dataset, with the following losses:

Reconstruction loss, ID loss, triplet loss

[image]
Uses the PIT dataset and also the Triplet ReID Constraints, so the losses are:

Reconstruction loss, ID loss, triplet loss, Triplet ReID Constraints

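Loss settings

For `main.py`:

[image]

For `SAN-PG.py`:

[image]
Since I am not sure yet, try this first. The other option is to set a combination of $\lambda$ weights.

[image]
I need to find out why these images are generated; they really make no sense.

[image]
I want to know how to use the pth.tar checkpoint. That probably comes down to which SAN flags to turn on.

I also need to confirm that my SAN-PG's input and output match the paper's SAN-PG.

How to obtain SAN-PG

Keeping only the reconstruction loss gives SAN-PG. Whether to generate pseudo groundtruth is a later step. The resulting checkpoint is then used to initialize SAN, and SAN is retrained.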
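One way to keep track of which losses each run uses is to treat every variant as a set of weights over the same four loss terms. The sketch below is an assumption-laden illustration: the key names and the unit weights are placeholders, not the actual flags or $\lambda$ values in `main.py`.

```python
# Hypothetical weight table; 1.0 means "enabled", 0.0 means "disabled".
VARIANTS = {
    "Baseline":    {"id": 1.0, "triplet": 1.0, "recon": 0.0, "tr_constraint": 0.0},
    "SAN-basic":   {"id": 1.0, "triplet": 1.0, "recon": 1.0, "tr_constraint": 0.0},
    "SAN-PG":      {"id": 0.0, "triplet": 0.0, "recon": 1.0, "tr_constraint": 0.0},
    "SAN w/ L_TR": {"id": 1.0, "triplet": 1.0, "recon": 1.0, "tr_constraint": 1.0},
}


def total_loss(losses, weights):
    """Weighted sum of the individual loss values."""
    return sum(weights[name] * losses[name] for name in weights)


# Made-up loss values, only to show how the weights select terms.
example = {"id": 6.0, "triplet": 0.5, "recon": 0.25, "tr_constraint": 0.125}
for variant, weights in VARIANTS.items():
    print(variant, total_loss(example, weights))
```

Logging the weight table at the start of each run would make it unambiguous afterwards which configuration produced a given result.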

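First work out how to use the pretrained SAN-PG weights in SAN.

Paper's SAN-PG input: re-ID person images
Paper's SAN-PG output: it reads as if the output is the texture image, but aren't texture images used as the supervision? ("the SAN-PG outputs predicted texture image as the pseudo groundtruth.") Presumably the point is to obtain the weights and then run the generate_texture.py script, which would also need the network definition. Set that aside for now, since we already have SAN-PG weights.
My SAN-PG input: PIT data
My SAN-PG output: SAN-PG weights and the regressed semantically aligned full texture image.

An example regressed after 300 epochs of training:

The 300-epoch run seems to take more than a day.

Training log

[image]

I checked: args.print_freq is 10, so a line is printed every 10 batches.

[1][10/232]: 1 is the epoch, 10 is the index of the current batch within the epoch (the 10th batch), and 232 is the length of the trainloader.

232 is len(trainloader)
batch_time = AverageMeter(): not very important
data_time = AverageMeter(): not very important either
Loss = losses: important; since I set both lambdas to 0, this value should indeed stay at 0
Loss_recon = losses_recon: the loss that really deserves attention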
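The `value (average)` pairs in the log lines, such as `Loss_recon 0.2101 (0.3042)`, are consistent with the running-average helper commonly found in PyTorch training scripts. A minimal sketch, assuming AverageMeter follows that usual pattern (the repository's own class is not shown in these notes):

```python
class AverageMeter:
    """Track the most recent value and the running average."""

    def __init__(self):
        self.val = 0.0    # last value passed to update()
        self.sum = 0.0    # weighted sum of all values
        self.count = 0    # total weight (e.g. number of samples)
        self.avg = 0.0    # running average

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


meter = AverageMeter()
for batch_loss in [2.0, 4.0]:
    meter.update(batch_loss)
print("Loss_recon {:.4f} ({:.4f})".format(meter.val, meter.avg))
```

Under this reading, the number in parentheses is the more stable one to watch, since the first number is a single-batch value.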

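What we know:

How the loss evolves:

loss_rec = nn.L1Loss()

"Creates a criterion that measures the mean absolute error (MAE) between each element in the input x and target y." With the default mean reduction this is the average of the element-wise absolute differences, $\frac{1}{n}\sum_{i=1}^{n} |x_i - y_i|$.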
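The mean absolute error can be written out in plain Python to make the definition concrete. This sketch works on flat lists of numbers; `mean_absolute_error` is an illustrative helper, whereas nn.L1Loss applies the same averaging over all tensor elements when its reduction is the default mean.

```python
def mean_absolute_error(predicted, target):
    """Average of |predicted_i - target_i| over all elements."""
    if len(predicted) != len(target):
        raise ValueError("inputs must have the same length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


# Values in [-1, 1], like normalized texture pixels.
mae = mean_absolute_error([0.0, 1.0, -1.0], [0.5, 1.0, 1.0])
print(mae)
```

Because textures are normalized to [-1, 1], a per-pixel error can be at most 2, so a reconstruction loss far above that range would point to unnormalized inputs or an extra scaling factor.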

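Loss over the 300-epoch SAN run

Epochs 1 to 10: from 36.3492 to 9.2652

The above is the data for the first 11 epochs. The loss varies a lot, from a minimum of about 2 to a maximum of about 36.

Epochs 11 to 20: from 2.3651 to

On hyperparameter tuning

The infimum of the cross-entropy loss is 0, but it can never actually be reached.

Reaching 0 would require every predicted distribution to match the true label exactly, i.e. to degenerate into a one-hot vector. After a softmax layer with finite logits, every non-target component is strictly positive, so the target probability stays below 1 and the cross entropy stays above 0. This is why the cross-entropy loss can keep being optimized, which is often cited as one of its advantages over the MSE loss.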
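A small numeric check illustrates the point: as the target logit grows, the cross entropy shrinks toward 0 but stays positive. The sketch uses only the standard library; the three-class logits are made-up values for illustration.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross entropy for one sample, computed in a numerically stable way."""
    shift = max(logits)
    sum_exp = sum(math.exp(z - shift) for z in logits)
    return math.log(sum_exp) - (logits[target] - shift)


# Increase the margin between the target logit and the others.
margins = [1.0, 5.0, 10.0, 20.0]
ce_values = [cross_entropy([m, 0.0, 0.0], target=0) for m in margins]
for m, value in zip(margins, ce_values):
    print(m, value)
```

In floating point the value eventually rounds to 0 for very large margins, but mathematically it is positive for any finite logits.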

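Not converging

Non-convergence is a broad problem with many possible causes:

The learning rate is set too high
Possibly because I removed label smoothing
Also: I removed the -t flag
Also, as FrankZhu suggested, possibly because the images and textures are not matched up correctly.
Possibility 5

Learning rate too high

Too large a learning rate causes severe oscillation that prevents convergence, and in some cases the loss keeps growing without bound.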
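The effect is easy to reproduce on a toy problem. The sketch below runs plain gradient descent on $f(x) = x^2$, whose gradient is $2x$, so each step multiplies $x$ by $(1 - 2\,\text{lr})$; the learning rates are arbitrary illustrative values, unrelated to the ones used for SAN.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x**2 with a fixed learning rate and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x
        x = x - lr * grad
    return x


small_lr_result = gradient_descent(lr=0.1)   # |1 - 0.2| < 1, so x shrinks toward 0
large_lr_result = gradient_descent(lr=1.1)   # |1 - 2.2| > 1, so x oscillates and blows up
print(small_lr_result, large_lr_result)
```

For this function any learning rate above 1.0 diverges; in a real network the threshold is unknown, which is why lowering the learning rate is usually the first thing to try.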

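The loss still has not dropped much, so tomorrow I will look at the reconstruction results.

Checking whether the loss is faulty

This is the loss from the wht program with the ×10 factor.
[image]
This is the loss from the wht program with the ×10 factor removed.

[image]
What I need to figure out now is whether cuhk03 actually works in the wht program, or whether -s and -t are simply written there but never used.

Confirming the correspondence

In the original Microsoft Asia code: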

class ImageDataset(Dataset):
    """Image Person ReID Dataset"""
    def __init__(self, dataset, transform=None):
        self.dataset = dataset
        self.transform = transform
        self.totensor = ToTensor()
        self.normalize = Normalize([.5, .5, .5], [.5, .5, .5])

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        img_path, pid, camid = self.dataset[index]
        img = read_image(img_path)

        # Add by Xin Jin, for getting texture:
        img_texture = read_image(img_path.replace('images_labeled', 'texture_cuhk03_labeled'))
        
        if self.transform is not None:
            img = self.transform(img)
            img_texture = self.normalize(self.totensor(img_texture))
        
        return img, pid, camid, img_path, img_texture

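Debugging:

1. Start from batch 0, because the loss is updated per batch. Check whether the reconstruction loss is about 0.6; it should start at 0.6. The total can then rise from 0.6 to about 13 once the other losses are added, since the re-ID loss contributes to it.
2. The baseline has no decoder. Did you train it that way? No, I did not.

After adding the other losses, was there no effect? I have not tried; I have mainly been working on SAN-PG.

It seems I never ran SAN-basic, i.e. setting one loss weight to 0 and keeping the other three.

cena: Which step have you reached?
Me: The second one.
cena: You did no pretraining before, and adding all the losses directly gave no improvement at all, right?
Me: You first have to know which configuration you actually ran.
cena: So before you only ran the baseline, without any of the later losses?
Me: I have not tried the effect of adding or removing losses.
cena: Did you run with all the losses? The paper uses them, so why didn't you? Is there code for the constraint?
Me: I said I probably had not added the constraint (but in fact I had).

Me:

The version I ran: cross-entropy (confirmed), triplet (confirmed), constraint, recon.
The version I did not run: the baseline with only the first two.
My results are close to the baseline even though I used four losses.

cena: You ran the baseline too, right? And your baseline is close to the paper's, right?

cena:

Run the baseline first, then see what changes relative to it.
What if the extra losses do help and your baseline is just low? Good suggestion.
If adding the two losses gives no improvement, they are not helping; if it gives a large improvement, they matter a lot.

Was the third loss not added? Were none of the later losses added?

First, establish which losses the earlier training runs used, and put them in a table.
Compare the baseline without the two extra losses against the decoder version with them. If there is no difference, the losses are either useless or something is wrong. This amounts to reproducing the Table 1 experiments.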

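Code walkthrough

Optimizer:

There are two:

The optimizer for normal training, which applies the gradients from backpropagation
The optimizer used during warmup

Learning rate schedulers

There are two:

lr_scheduler.MultiStepLR: changes the learning rate automatically at preset milestones during the 300 epochs
lr_scheduler.ExponentialLR: controls how the learning rate changes during the warmup stage.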
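Both schedulers have simple closed forms, which helps when reading the learning rate off a log. The sketch below reimplements those formulas in plain Python; the base learning rates, milestones, and gamma values are hypothetical, and a gamma above 1 for the warmup stage is an assumption about how the warmup is configured.

```python
import bisect


def multistep_lr(base_lr, milestones, gamma, epoch):
    """MultiStepLR: multiply by gamma once for every milestone already reached."""
    return base_lr * gamma ** bisect.bisect_right(sorted(milestones), epoch)


def exponential_lr(base_lr, gamma, epoch):
    """ExponentialLR: multiply by gamma once per epoch."""
    return base_lr * gamma ** epoch


# Hypothetical settings for illustration only.
for epoch in (0, 99, 100, 200):
    print("main", epoch, multistep_lr(0.1, [100, 200], 0.1, epoch))
for epoch in range(4):
    print("warmup", epoch, exponential_lr(0.001, 1.5, epoch))
```

Printing the schedule like this before a long run is a cheap way to confirm that the milestones fall where intended.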

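[image]

close_specified_layers(model, ['fc','classifier'])

Before closing, the parts containing fc and classifier look like this:

[image]
This shows the structure of the data but not its contents. That does not matter; it is still possible to confirm that the values are no longer updated.

Because of the following code:

[image]
My impression is that the two frozen parts stop changing, while the weights of the layers that are not frozen keep changing. So the two losses added later do affect the parameters of the ResNet-50 encoder.

Check the length of the trainloader and of the query set, etc.

Difference between the epochs at which the two save routines fire

Saved every 50 epochs

if (epoch + 1) % 50 == 0:

The reconstruction result is checked only at epochs 49, 99, 149, 199, 249, 299.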

Test

if (epoch + 1) > args.start_eval and args.eval_freq > 0 and (epoch + 1) % args.eval_freq == 0 or (epoch + 1) == args.max_epoch: 
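# The key part of this condition is (epoch + 1) % args.eval_freq == 0,
# i.e. which epochs give a multiple of 80 after adding 1:
# 79, 159, 239, 319 (out of range); do not forget 299.
# Note the extra case (epoch + 1) == 300, i.e. epoch = 299.

So testing happens at epochs 79, 159, 239, 299.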
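Both schedules can be enumerated directly to confirm the lists above. The sketch mirrors the two conditions from the training loop; `start_eval=0` is an assumption, while `eval_freq=80` and `max_epoch=300` come from these notes. Note that `and` binds tighter than `or`, so the final-epoch test applies regardless of the other three checks.

```python
def eval_epochs(max_epoch, eval_freq, start_eval=0):
    """Epochs at which the test routine runs."""
    return [
        epoch for epoch in range(max_epoch)
        if ((epoch + 1) > start_eval and eval_freq > 0 and (epoch + 1) % eval_freq == 0)
        or (epoch + 1) == max_epoch
    ]


def check_epochs(max_epoch, every=50):
    """Epochs at which the reconstruction check runs."""
    return [epoch for epoch in range(max_epoch) if (epoch + 1) % every == 0]


test_at = eval_epochs(max_epoch=300, eval_freq=80)
check_at = check_epochs(max_epoch=300)
print(test_at, check_at, sorted(set(test_at) & set(check_at)))
```

Epoch 299 is the only one where both routines write files, which is the case the loader suffix in the file name has to handle.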

Even though 299 appears in both lists, nothing gets overwritten, because:

# Reconstruction check during training
print('finish: ', os.path.join(args.save_dir, img_paths[0].split('images_labeled/')[-1].split('.png')[0]+'_ep_'+str(epoch)+'_trainloader_'+'.jpg'))
cv2.imwrite(os.path.join(args.save_dir,img_paths[0].split('images_labeled/')[-1].split('.png')[0] + '_ep_' + str(epoch) +'_trainloader_'+ '.jpg'), out)

print('finish: ', os.path.join(args.save_dir, img_paths[0].split('images_labeled/')[-1].split('.png')[0]+'_ep_'+str(epoch)+'_queryloader_'+'.jpg'))
cv2.imwrite(os.path.join(args.save_dir, img_paths[0].split('images_labeled/')[-1].split('.png')[0]+'_ep_'+str(epoch)+'_queryloader_'+'.jpg'), out)

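Even when both routines fire at epoch 299, the file names carry a suffix naming the loader they came from, so the test output does not overwrite the reconstruction check saved at the same epoch.

[image]
The figure above shows that 299, 239, 159, 79 are indeed written by the test routine and 49 by the texture check. But why is there an output for epoch 0?

And why are the train_loader and query_loader images identical?

One more check: the SAN-PG results

[image]
In the SAN-PG code I commented out the test part, so outputs should only appear at the texture-check epochs 49, 99, 149, 199, 249, 299.

[image]

The figure above shows that, for the same image, the results saved from colon and from out differ; out is more bluish. Let's run 300 epochs, move the result over from SAN-PG, and take a closer look.

Try deleting the cython folder
Try adding the environment variable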

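Tensors cannot be converted to CUDA variables

Before doing the setup described in the Git repo's README:

[image]
Currently the debugger hangs at the "connected" screen. Next, try the setup and see whether it still hangs there.

After reinstalling PyCharm with the Linux apt command, the "connected" problem during train and evaluate disappeared.

This step, however, depends on whether you:

configure with the environment activated
configure without the environment activated. This does not work; it reports "Cython evaluation is unavailable".

The setup must be done with the environment activated.

Configuring without the environment activated:

[image]
An extra file, eval_metrics_cy.so, does appear. Now check whether it still hangs at "connected".

[image]
Running make again with the environment activated produces an additional `x86_64_linux_gnu.so` file.

So the setup has to be done with the environment activated; otherwise it reports "Cython evaluation is unavailable".

The re-cloned version that got rid of the "connected" hang already has it.

Possible sources of the problem:

PyCharm
The train part of the code itself
The evaluation_only part of the code itself, where the actual arguments changed.
Too many factors; I would rather set up VS Code.

File contents that work when running under VS Code

Original program

[image]
With the file above, both train and evaluate_only can be debugged. Verified.

Reclone program

[image]
With the file above, both train and evaluate_only can be debugged. Verified.

Both also run smoothly in the pytorch conda environment, so the environment should be fine too.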

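Value range before executing out = (out / 2.0 + 0.5) * 255.:
[image]
And after executing it. My impression:

The raw values are all around -1, which is odd; they should lie in 0 to 1 or in -1 to 1. I first suspected the line out = (out / 2.0 + 0.5) * 255. and thought it should look like batch = np.transpose(batch, [0, 2, 3, 1]) * 0.5 + 0.5 instead. The two are the same affine map, though: out / 2.0 + 0.5 equals out * 0.5 + 0.5, and the first form additionally scales the result to 0-255.

This is the usual way to map -1..1 to -0.5..0.5 and then to 0..1.

[image]
Then, after converting the dtype to np.uint8:

[image]
Why are there 9 prints before the tidy log line appears?

[image]
Because the condition "epoch is a multiple of 1" is satisfied. Every image of every batch is then saved, so to save disk space it is better to use fewer batches with a larger batch size.

[image]
That is, when batch_idx is 9, 19, 29, …, i.e. whenever batch_idx + 1 is a multiple of 10, an Epoch: [1][XX/XXX] line is printed.

Some parts still need to be changed to None

How the `triplet constraint` takes effect

With training_batch_size = 16:

[image]

Execution has not yet reached the point where the triplet constraint is added.

The figure below is after adding the triplet constraint: with training_batch_size = 16, i takes the values 0, 4, 8.

[image]
The reconstruction loss after adding the constraint:

[image]

[image]
OUT still looks out of range! Not necessarily; note the e-05 exponent.

[image]
So it is not out of range this time; as the figure shows, the values are still around (127.49, 127.49).

Even when values really are out of range they can still be cast to integers, but that does not mean there is no problem.

[image]

This one is out of range: it exceeds 1, though it does not go below -1. The safest option is to clip. Without clipping we even see values like 308.

[image]

A value of 1.54 becomes 323 after the transform, which is also out of range. After the uint8 cast everything appears to lie in 0-255, but the out-of-range values have been silently mangled, which corrupts the saved image.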
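The arithmetic behind the 323 can be checked by hand, together with the effect of clipping first. This sketch works on single floats with the standard library only; `to_pixel` is a hypothetical helper, and the modulo-256 wrap shown for the uint8 cast is the commonly observed behaviour rather than something guaranteed on every platform.

```python
def to_pixel(value, clip=True):
    """Map a decoder output in [-1, 1] to the 0-255 pixel range."""
    if clip:
        value = max(-1.0, min(1.0, value))   # keep the value inside [-1, 1]
    return (value / 2.0 + 0.5) * 255.0


unclipped = int(to_pixel(1.54, clip=False))   # out of the valid 0-255 range
wrapped = unclipped % 256                     # typical result of a uint8 cast
clipped = int(to_pixel(1.54, clip=True))      # saturates at the maximum instead
print(unclipped, wrapped, clipped)
```

Clipping before the cast turns an over-bright prediction into pure white instead of an arbitrary dark value, which is usually the less damaging failure mode for a saved image.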

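[image]

If recon_texture is out of range, it will affect training, because of this line:

loss_recon = loss_rec(recon_texture, imgs_texture)

Whenever recon_texture is off, its differences from the normal imgs_texture become strange, which distorts the loss and in turn the whole training.

But recon_texture in turn comes from:

recon_texture, x_sim1, x_sim2, x_sim3, x_sim4 = model_decoder(feat_texture, x_down1, x_down2, x_down3)

Should I trace the problem further back to feat_texture, x_down1, x_down2, x_down3?

Not for now. First try clipping feat_texture and see.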

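On data size

Whatever the size of the person images in the re-ID dataset, the program resizes them to 256×128 (height × width).

[image]

The figure above confirms that images of any size are resized to 256 and 128.

Initial loss before clipping = 0.4975:

[image]
Initial loss after clipping = 0.4542:

[image]
Conclusion: the clip has no large effect on the loss value, and the loss stays in a sensible range.

At the next print the loss went up, which seems normal; a single reading should not be over-interpreted.

[image]
For query: 79, 159, 239, 299. 79, 159, 239 are saved by the test function; 299 could come from either routine.

[image]
Epoch 299 indeed covers both trainloader and queryloader.

For trainloader: 49, 99, 149, 199, 249, 299. Because of the suffix, the epoch-299 files do not overwrite each other.

[image]

Get the GitHub release running and see what results it gives; also track the specific changes with GitHub. Done.
Use their SAN-PG first (remember to modify the dict handling in the part that loads the san_basic weights). In progress, not finished yet.
If that does not work, multiply the reconstruction loss by 0.0001, so that the emphasis stays on re-ID rather than on texture reconstruction.

[image]
The checkpoint does contain both state_dict and decoder_state_dict.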
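The dict handling mentioned above usually amounts to copying only the pretrained entries whose keys exist in the target model and skipping layers that should stay freshly initialized. The sketch below shows that filtering on plain dictionaries, with floats standing in for tensors; the key names, the skipped prefixes, and `merge_pretrained` itself are assumptions for illustration, not the repository's code.

```python
def merge_pretrained(model_state, pretrained_state, skip_prefixes=("classifier", "fc")):
    """Copy matching pretrained entries into a copy of the model's state."""
    merged = dict(model_state)
    for key, value in pretrained_state.items():
        if key in model_state and not key.startswith(skip_prefixes):
            merged[key] = value
    return merged


# Toy stand-ins for checkpoint['state_dict'] and the SAN encoder's own state.
model_state = {"conv1.weight": 0.0, "layer4.weight": 0.0, "classifier.weight": 0.0}
pretrained_state = {"conv1.weight": 0.2248, "layer4.weight": 0.5, "classifier.weight": 9.9,
                    "extra.weight": 1.0}
merged = merge_pretrained(model_state, pretrained_state)
print(merged)
```

In a PyTorch script the merged dictionary would then be passed to the model's load_state_dict, once for the encoder using state_dict and once for the decoder using decoder_state_dict.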

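[image]
Below: before being updated with the SAN-PG weights.

[image]
Check whether the value changes from 2.248e-01 to 2.1573e-01 after the update. Yes, it did. Below: after the update.

[image]
Next, check whether the initial loss is already much lower when the SAN-PG weights are used,

rather than as large as without them.

First, a screenshot without SAN-PG:

[image]
Then one with SAN-PG:

Below: without warmup and with a very small batch_size.

[image]
Seen this way, the loss is indeed smaller than without SAN-PG. Run one epoch locally first; then check the test part, and if it is fine, upload and train.

This is the result of the local small-batch run. Nothing obvious to see. Better to finish checking and upload the full version.
[image]

What I need to do now is retrain SAN-PG and obtain new SAN-PG weights.

Retrain the SAN-PG weights based on wht, then load them again and check.
Do not overhaul my own SAN-PG code for now; it is too time-consuming.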

Published: 2023-11-12 00:45:02
Link: https://www.kaopuke.com/article/k-p-k_13_u_23_o_14_fy_14__23_c4.html
