UMAP: Introduction and Code Examples


I'm 现实吐司, a blogger on 靠谱客. This post introduces UMAP and walks through some code examples; I hope it serves as a useful reference.

Installation

pip install umap-learn
pip install umap-learn[plot]

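UMAP ships with a subpackage, umap.plot, for plotting the results of UMAP embeddings. It has to be imported separately because it has extra dependencies (matplotlib, datashader and holoviews). It makes plotting quick and simple, and tries to make sensible choices to avoid overplotting and other pitfalls.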

Basic concepts:

Uniform Manifold Approximation and Projection (UMAP)
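**Manifold:** A manifold is a space that locally has the properties of Euclidean space. Examples include curves and surfaces of any dimension, such as the surface of a sphere or a curved sheet. Every small neighborhood of a manifold is homeomorphic to Euclidean space, so we can treat each local patch as Euclidean, which makes it easier to study.
**Riemannian manifold:** A differentiable manifold equipped with a Euclidean inner product on the tangent space at every point, chosen so that it varies smoothly from point to point.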
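
To make "locally Euclidean" concrete, here is a small sketch that uses the unit sphere as the manifold (my choice for illustration; the helper names are hypothetical and not part of any library). It compares the distance measured along the surface with the straight-line Euclidean distance. For two nearby points the two almost coincide, while for two distant points they clearly differ.

```python
import numpy as np

def point_on_equator(theta):
    """Point on the unit sphere's equator at angle theta (radians)."""
    return np.array([np.cos(theta), np.sin(theta), 0.0])

def geodesic_distance(p, q):
    """Great-circle distance between two unit vectors (measured along the surface)."""
    return float(np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)))

def euclidean_distance(p, q):
    """Straight-line distance through the surrounding 3D space."""
    return float(np.linalg.norm(p - q))

origin = point_on_equator(0.0)
near = point_on_equator(0.01)    # a close neighbor
far = point_on_equator(np.pi)    # the opposite side of the sphere

# Ratio close to 1 means the local patch behaves like flat Euclidean space
near_ratio = euclidean_distance(origin, near) / geodesic_distance(origin, near)
far_ratio = euclidean_distance(origin, far) / geodesic_distance(origin, far)
print(near_ratio, far_ratio)
```

This is why neighborhood-based methods such as UMAP only trust distances between nearby points.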

Differences from PCA and t-SNE:

https://pair-code.github.io/understanding-umap/

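The algorithm rests on three assumptions about the data:

The data is uniformly distributed on a Riemannian manifold;
The Riemannian metric is locally constant (or can be approximated as such);
The manifold is locally connected.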

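UMAP can be divided into two main steps:

Learn the manifold structure in the high-dimensional space;
Find a low-dimensional representation of that manifold.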

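Step 1: Learning the manifold structure
1. Find the nearest neighbors, using the Nearest-Neighbor-Descent algorithm.
**Hyperparameter:** n_neighbors specifies how many nearest neighbors to use.
A small n_neighbors value asks for a very local view that captures the fine detail of the structure accurately. A larger value bases the estimates on bigger regions, so they are more broadly accurate across the manifold as a whole, at the cost of fine detail.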
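
The effect of n_neighbors can be sketched with a plain NumPy neighbor search. Note the swap: UMAP uses the approximate Nearest-Neighbor-Descent algorithm, whereas this sketch uses an exact brute-force search, which is only practical for small data. The function name `knn_brute_force` and the random demo data are assumptions made for illustration.

```python
import numpy as np

def knn_brute_force(X, n_neighbors):
    """Exact k-nearest neighbors by Euclidean distance (self excluded)."""
    # Pairwise differences via broadcasting, then distances: shape (n, n)
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    # Indices of the n_neighbors closest points, nearest first
    idx = np.argsort(dists, axis=1)[:, :n_neighbors]
    knn_dists = np.take_along_axis(dists, idx, axis=1)
    return idx, knn_dists

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))  # hypothetical data: 200 points, 5 features

idx_small, d_small = knn_brute_force(X_demo, n_neighbors=5)
idx_large, d_large = knn_brute_force(X_demo, n_neighbors=50)

# Average radius of the neighborhood each setting looks at
print(d_small[:, -1].mean(), d_large[:, -1].mean())
```

The larger setting reaches much further out from each point, which is what trades local detail for a broader view.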

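2. Build a graph by connecting the nearest neighbors identified in the previous step.
**Hyperparameter:** local_connectivity (default 1) means that every point in the high-dimensional space is connected to at least one other point.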

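One way to think about these two parameters is as a lower bound and an upper bound:
local_connectivity (default 1): each point is connected to at least one other point with 100% certainty (the lower bound on the number of connections).
n_neighbors (default 15): the probability that a point is directly connected to its 16th or a more distant neighbor is 0%, because those points fall outside the local region UMAP uses when building the graph.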
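
The two bounds can be illustrated with a simplified sketch of how edge weights are assigned, following the formulation in the UMAP paper as I understand it: each point's nearest-neighbor distance rho is subtracted, and a per-point scale sigma is found by binary search so the weights sum to log2(k). This is not the library's code; `membership_weights`, `fuzzy_union` and the synthetic distances are assumptions for illustration.

```python
import numpy as np

def membership_weights(knn_dists, n_iter=64, tol=1e-5):
    """Turn sorted kNN distances (n, k) into directed edge weights in (0, 1]."""
    n, k = knn_dists.shape
    target = np.log2(k)
    rho = knn_dists[:, 0]  # distance to the nearest neighbor (local_connectivity=1)
    weights = np.empty_like(knn_dists, dtype=float)
    for i in range(n):
        shifted = np.maximum(knn_dists[i] - rho[i], 0.0)
        lo, hi, sigma = 0.0, np.inf, 1.0
        # Binary search for sigma so that the weights sum to log2(k)
        for _ in range(n_iter):
            total = np.exp(-shifted / sigma).sum()
            if abs(total - target) < tol:
                break
            if total > target:
                hi = sigma
                sigma = (lo + hi) / 2.0
            else:
                lo = sigma
                sigma = sigma * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        weights[i] = np.exp(-shifted / sigma)
    return weights

def fuzzy_union(w_ij, w_ji):
    """Combine the two directed weights of an edge (probabilistic OR)."""
    return w_ij + w_ji - w_ij * w_ji

rng = np.random.default_rng(0)
# Hypothetical sorted distances from 10 points to their 15 nearest neighbors
demo_dists = np.sort(rng.uniform(0.5, 2.0, size=(10, 15)), axis=1)
w = membership_weights(demo_dists)
print(w[0].round(3))
```

The nearest neighbor always receives weight 1 (the lower bound), and anything beyond the k-th neighbor has no entry at all, so its weight is 0 (the upper bound).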

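Step 2: Finding a low-dimensional representation
Hyperparameter: min_dist (default 0.1) defines the minimum distance between embedded points.
Cross-entropy: UMAP looks for the best edge weights in the low-dimensional representation. These optimal weights emerge as the cross-entropy between the high-dimensional and low-dimensional graphs is minimized, and that minimization can be carried out with stochastic gradient descent.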
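
As a rough sketch of that objective, the snippet below evaluates the cross-entropy between high-dimensional edge weights p and the weights q implied by distances in the embedding. It only computes the cost; it does not perform the stochastic gradient descent with negative sampling that UMAP uses. The function names are hypothetical, and a = b = 1 are placeholder values (UMAP derives a and b from min_dist and spread).

```python
import numpy as np

def low_dim_similarity(dist, a=1.0, b=1.0):
    """Edge weight implied by a low-dimensional distance: 1 / (1 + a * d^(2b))."""
    return 1.0 / (1.0 + a * dist ** (2.0 * b))

def fuzzy_cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between two sets of edge weights, summed over edges."""
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    attract = p * np.log(p / q)                          # penalizes strong edges placed far apart
    repel = (1.0 - p) * np.log((1.0 - p) / (1.0 - q))    # penalizes weak edges placed close together
    return float((attract + repel).sum())

# Two edges: one strong (0.9) and one weak (0.1) in the high-dimensional graph
p_high = np.array([0.9, 0.1])

# Good layout: strong edge is short, weak edge is long
ce_good = fuzzy_cross_entropy(p_high, low_dim_similarity(np.array([0.1, 3.0])))
# Bad layout: the distances are swapped
ce_bad = fuzzy_cross_entropy(p_high, low_dim_similarity(np.array([3.0, 0.1])))
print(ce_good, ce_bad)
```

A layout that keeps strongly connected points close and weakly connected points apart yields the lower cost, which is the direction the optimizer pushes in.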

At this point UMAP is done, and the result is an array holding the coordinates of every data point in the chosen low-dimensional space.

Example 1:

Separate handwritten digits (MNIST-style data) and display them in two dimensions:

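import umap
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)  # assumption: the original does not show how X is loaded
reducer = umap.UMAP(random_state=42)
X_trans = reducer.fit_transform(X)
print(X_trans.shape)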

Plotting

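import numpy as np
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits

digits = load_digits()  # assumption: scikit-learn's digits dataset, as the plot title suggests
reducer = umap.UMAP(random_state=42)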
embedding = reducer.fit_transform(digits.data)
print(embedding.shape)

plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('UMAP projection of the Digits dataset')
plt.show()

Parameter settings

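n_components
Controls the number of dimensions after projection; the default is 2. When the data has many features, 2D may not be enough to preserve its underlying topology. Try values between 2 and 20 in steps of 5, and evaluate different baseline models to see how accuracy changes.
n_neighbors
Determines the number of neighboring points used in the local approximation of the manifold structure. Larger values preserve more global structure at the cost of detailed local structure. In general this parameter should be between 5 and 50, with 10 to 15 a reasonable default.
min_dist
Controls how tightly the embedding is allowed to pack points together. Larger values spread the embedded points more evenly, while smaller values let the algorithm optimize more accurately for local structure. Sensible values lie between 0.001 and 0.5, with 0.1 a reasonable default.
metric
The formula used to compute the distance between points in the input space; the default is euclidean. A wide variety of metrics are already implemented, and a user-defined function can be passed as long as it has been JIT-compiled with numba.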
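
To see why min_dist changes how tightly points pack, here is a sketch of the target similarity curve that, as far as I understand the reference implementation, the low-dimensional kernel is fitted to: similarity stays at 1 up to min_dist and then decays exponentially at a rate set by spread. The function name `target_similarity` is hypothetical, and the curve-fitting step itself is omitted.

```python
import numpy as np

def target_similarity(dist, min_dist=0.1, spread=1.0):
    """Desired similarity of two embedded points at a given distance."""
    dist = np.asarray(dist, dtype=float)
    # Flat at 1.0 inside min_dist, exponential decay beyond it
    return np.where(dist < min_dist, 1.0, np.exp(-(dist - min_dist) / spread))

d = np.array([0.0, 0.05, 0.1, 0.5, 1.0])
tight = target_similarity(d, min_dist=0.001)  # small min_dist: similarity drops almost immediately
loose = target_similarity(d, min_dist=0.5)    # large min_dist: points within 0.5 already count as "close enough"
print(tight.round(3))
print(loose.round(3))
```

With a large min_dist there is nothing to gain from pulling points closer than that distance, so they end up more evenly spread. With a small min_dist, closer always scores better, so dense clumps can form.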

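UMAP can consume a lot of memory, especially while fitting and while building graphs such as the connectivity graph. If that becomes a problem, set low_memory to True.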

reducer = umap.UMAP(
    n_neighbors=100, # default 15, The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation.
    n_components=3, # default 2, The dimension of the space to embed into.
    metric='euclidean', # default 'euclidean', The metric to use to compute distances in high dimensional space.
    n_epochs=1000, # default None, The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings.
    learning_rate=1.0, # default 1.0, The initial learning rate for the embedding optimization.
    init='spectral', # default 'spectral', How to initialize the low dimensional embedding. Options are: {'spectral', 'random', A numpy array of initial embedding positions}.
    min_dist=0.1, # default 0.1, The effective minimum distance between embedded points.
    spread=1.0, # default 1.0, The effective scale of embedded points. In combination with ``min_dist`` this determines how clustered/clumped the embedded points are.
    low_memory=False, # default False in older releases (recent umap-learn releases default to True), For some datasets the nearest neighbor computation can consume a lot of memory. If you find that UMAP is failing due to memory constraints consider setting this option to True.
    set_op_mix_ratio=1.0, # default 1.0, The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
    local_connectivity=1, # default 1, The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level.
    repulsion_strength=1.0, # default 1.0, Weighting applied to negative samples in low dimensional embedding optimization.
    negative_sample_rate=5, # default 5, Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
    transform_queue_size=4.0, # default 4.0, Larger values will result in slower performance but more accurate nearest neighbor evaluation.
    a=None, # default None, More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
    b=None, # default None, More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
    random_state=42, # default None, If int, random_state is the seed used by the random number generator.
    metric_kwds=None, # default None, Arguments to pass on to the metric, such as the ``p`` value for Minkowski distance.
    angular_rp_forest=False, # default False, Whether to use an angular random projection forest to initialise the approximate nearest neighbor search.
    target_n_neighbors=-1, # default -1, The number of nearest neighbors to use to construct the target simplicial set. If set to -1 use the ``n_neighbors`` value.
    #target_metric='categorical', # default 'categorical', The metric used to measure distance for a target array if using supervised dimension reduction. By default this is 'categorical' which will measure distance in terms of whether categories match or are different.
    #target_metric_kwds=None, # dict, default None, Keyword argument to pass to the target metric when performing supervised dimension reduction. If None then no arguments are passed on.
    #target_weight=0.5, # default 0.5, weighting factor between data topology and target topology.
    transform_seed=42, # default 42, Random seed used for the stochastic aspects of the transform operation.
    verbose=False, # default False, Controls verbosity of logging.
    unique=False, # default False, Controls if the rows of your data should be uniqued before being embedded.
)

Plotting a 3D chart with plotly

import numpy as np
import pandas as pd
import plotly.express as px

def chart_plotly(X, y):
    # --------------------------------------------------------------------------#
    # This section is not mandatory as its purpose is to sort the data by label
    # so, we can maintain consistent colors for digits across multiple graphs

    # Concatenate X and y arrays
    arr_concat = np.concatenate((X, y.reshape(y.shape[0], 1)), axis=1)
    # Create a Pandas dataframe using the above array
    df = pd.DataFrame(arr_concat, columns=['x', 'y', 'z', 'label'])
    # Convert label data type from float to integer
    df['label'] = df['label'].astype(int)
    # Finally, sort the dataframe by label
    df.sort_values(by='label', axis=0, ascending=True, inplace=True)
    # --------------------------------------------------------------------------#

    # Create a 3D graph
    fig = px.scatter_3d(df, x='x', y='y', z='z', color=df['label'].astype(str), height=900, width=950)

    # Update chart looks
    fig.update_layout(title_text='UMAP',
                      showlegend=True,
                      legend=dict(orientation="h", yanchor="top", y=0, xanchor="center", x=0.5),
                      scene_camera=dict(up=dict(x=0, y=0, z=1),
                                        center=dict(x=0, y=0, z=-0.1),
                                        eye=dict(x=1.5, y=-1.4, z=0.5)),
                      margin=dict(l=0, r=0, b=0, t=0),
                      scene=dict(xaxis=dict(backgroundcolor='white',
                                            color='black',
                                            gridcolor='#f0f0f0',
                                            title_font=dict(size=10),
                                            tickfont=dict(size=10),
                                            ),
                                 yaxis=dict(backgroundcolor='white',
                                            color='black',
                                            gridcolor='#f0f0f0',
                                            title_font=dict(size=10),
                                            tickfont=dict(size=10),
                                            ),
                                 zaxis=dict(backgroundcolor='lightgrey',
                                            color='black',
                                            gridcolor='#f0f0f0',
                                            title_font=dict(size=10),
                                            tickfont=dict(size=10),
                                            )))
    # Update marker size
    fig.update_traces(marker=dict(size=3, line=dict(color='black', width=0.1)))

    fig.show()

# Make sure the reducer was created with n_components=3
X_trans = reducer.fit_transform(X)
# Check the shape of the new data
print('Shape of X_trans: ', X_trans.shape)
chart_plotly(X_trans, y)