A Simple Guide to t-SNE (t-distributed Stochastic Neighbor Embedding)

I'm 明亮云朵, a blogger on 靠谱客. This article gives a simple introduction to t-SNE (t-distributed Stochastic Neighbor Embedding); I'm sharing it here in the hope that it serves as a useful reference.

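1. Introduction

t-SNE is a method for exploring high-dimensional data and is widely used in machine learning. It represents high-dimensional data in a low-dimensional space, usually 2-D, so that the data can be visualized.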

2. Usage

This section covers sklearn.manifold.TSNE in Python.

class sklearn.manifold.TSNE(n_components=2, *, perplexity=30.0,
    early_exaggeration=12.0, learning_rate='warn', n_iter=1000,
    n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean',
    metric_params=None, init='warn', verbose=0, random_state=None,
    method='barnes_hut', angle=0.5, n_jobs=None, square_distances='deprecated')

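n_components: dimension of the embedded space; the default is 2.

perplexity: related to the number of nearest neighbors considered for each point. Values between 5 and 50 are typical, and the value must be less than the number of samples.

early_exaggeration: controls how tightly the natural clusters of the original space are packed in the embedded space and how much space lies between them. Larger values leave more space between clusters.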

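learning_rate: the learning rate, usually chosen in the range [10.0, 1000.0]. If it is too high, the embedding may look like a ball with every point roughly equidistant from its neighbors; if it is too low, most points may be compressed into a dense cloud.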

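n_iter: the maximum number of iterations for the optimization; it should be at least 250. (scikit-learn 1.5 renamed this parameter to max_iter.)

n_iter_without_progress: the maximum number of iterations without progress before the optimization is aborted. From the documentation: "Maximum number of iterations without progress before we abort the optimization, used after 250 initial iterations with early exaggeration."

min_grad_norm: the gradient-norm threshold; the optimization stops once the gradient norm falls below this value.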

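metric: the metric used to compute distances between samples. With 'precomputed', X is interpreted as a distance matrix.

metric_params: additional keyword arguments passed to the metric function.

init: how the embedding is initialized: 'random', 'pca', or an ndarray of shape (n_samples, n_components).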

verbose: the verbosity level of the log output. 0 (the default) is silent; higher values print more progress information.

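random_state: determines the random number generator. Pass an int to get reproducible results across calls.

method: the algorithm used to compute the gradient: 'barnes_hut' (an approximation that runs in O(N log N) time) or 'exact' (O(N^2) time).

angle: only used when method='barnes_hut'; it trades off speed against accuracy for that algorithm.

n_jobs: the number of parallel jobs to run for the neighbor search.

square_distances: deprecated; current versions no longer need this parameter.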

>>> import numpy as np
>>> from sklearn.manifold import TSNE
>>> X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
>>> X_embedded = TSNE(n_components=2, learning_rate='auto',
...                   init='random', perplexity=3).fit_transform(X)
>>> X_embedded.shape
(4, 2)

The snippet above is a simple example.

Methods

fit(X): fits X into an embedded space.

X is an ndarray of shape (n_samples, n_features), or of shape (n_samples, n_samples) when metric='precomputed'.

fit_transform(X): fits X into an embedded space and returns the transformed output.

get_params(deep=True): gets the parameters of the estimator.

set_params(**params): sets the parameters of the estimator.

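3. About perplexity

'perplexity' is a tunable global parameter, and in my view the most important one. Roughly speaking, it controls how attention is balanced between the local and the global aspects of the data. In a sense, it is a guess about how many close neighbors each point has. A value between 5 and 50 is the usual choice, but larger values are sometimes needed.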
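The "number of close neighbors" reading can be made concrete. In the original t-SNE formulation, the perplexity of the neighbor distribution P_i of a point is 2^H(P_i), where H is the Shannon entropy in bits, so a distribution spread evenly over k neighbors has perplexity k. The sketch below uses two hypothetical helpers, perplexity and conditional_probs, and a hand-picked Gaussian bandwidth sigma. The real algorithm works the other way round: it searches for a sigma per point so that the requested perplexity is reached.

```python
import numpy as np


def perplexity(p):
    """Perplexity 2**H(p) of a discrete distribution p (entropy in bits)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # terms with p = 0 contribute nothing to the entropy
    entropy = -np.sum(nz * np.log2(nz))
    return 2.0 ** entropy


def conditional_probs(sq_dists, sigma):
    """Gaussian similarities p_{j|i} from squared distances of point i to the others."""
    w = np.exp(-np.asarray(sq_dists, dtype=float) / (2.0 * sigma ** 2))
    return w / w.sum()


uniform5 = perplexity(np.full(5, 0.2))  # five equally likely neighbours
one_hot = perplexity([1.0, 0.0, 0.0])   # a single certain neighbour

# Squared distances from one point to six others (illustrative values).
sq_dists = np.array([1.0, 1.5, 2.0, 9.0, 16.0, 25.0])
narrow = perplexity(conditional_probs(sq_dists, sigma=0.5))
wide = perplexity(conditional_probs(sq_dists, sigma=5.0))

print(uniform5, one_hot)
print(narrow, wide)
```

A small sigma concentrates the probability on the few nearest points and gives a low perplexity, while a large sigma spreads it over almost all points. That is the local versus global balance described above.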