Overview
Installation
pip install umap-learn
pip install umap-learn[plot]
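UMAP includes a subpackage, umap.plot, for plotting the results of UMAP embeddings. It must be imported separately because it has extra dependencies (matplotlib, datashader, and holoviews). It allows quick and simple plotting and tries to make sensible choices to avoid overplotting and other pitfalls.
Basic concepts: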
Uniform Manifold Approximation and Projection (UMAP)
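**Manifold:** A manifold is a space that locally has the properties of Euclidean space. Examples include curves and surfaces of various dimensions, such as a sphere or a curved plane. Locally, a manifold is homeomorphic to Euclidean space, so a small neighborhood can be treated as Euclidean, which makes it easier to study.
**Riemannian manifold:** A differentiable manifold equipped with a Euclidean inner product on the tangent space at every point, chosen so that it varies smoothly from point to point.
Differences from PCA and t-SNE: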
https://pair-code.github.io/understanding-umap/
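The algorithm is based on three assumptions about the data:
- The data is uniformly distributed on a Riemannian manifold;
- The Riemannian metric is locally constant (or can be approximated as such);
- The manifold is locally connected.
UMAP can be divided into two main steps:
- Learn the manifold structure in the high-dimensional space;
- Find a low-dimensional representation of that manifold.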
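Step 1: Learning the manifold structure
1. Find the nearest neighbors: the Nearest-Neighbor-Descent algorithm
**Hyperparameter:** n_neighbors specifies how many nearest neighbors to use.
A small n_neighbors value gives a very local view that captures the fine detail of the structure accurately. A larger value bases the estimate on a larger region, so it is more broadly accurate across the whole manifold, at the cost of fine detail.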
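To make the role of n_neighbors concrete, here is a minimal sketch of a nearest-neighbor search. It uses exact brute force rather than Nearest-Neighbor-Descent (which is approximate and much faster on large data), and it excludes the query point from its own neighbor list, which is a simplification. The function name `knn_brute_force` and the toy points are made up for illustration.

```python
import math

def knn_brute_force(points, n_neighbors):
    """For each point, return (indices, distances) of its n_neighbors nearest other points."""
    result = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with that point's index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        nearest = dists[:n_neighbors]
        result.append(([j for _, j in nearest], [d for d, _ in nearest]))
    return result

# Toy data: three points close together and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
neighbors = knn_brute_force(points, n_neighbors=2)
for i, (idx, dist) in enumerate(neighbors):
    print(i, idx, [round(d, 3) for d in dist])
```

With a small n_neighbors, each point only "sees" its immediate surroundings; raising it pulls more distant points into every neighborhood, which is the local versus global trade-off described above.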
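2. Build a graph: connect the nearest neighbors identified above.
**Hyperparameter:** local_connectivity (default = 1) means that every point in the high-dimensional space is connected to at least one other point.
One way to understand these two parameters is as a lower bound and an upper bound:
local_connectivity (default 1): each point is connected to at least one other point with 100% certainty (the lower bound on the number of connections)
n_neighbors (default 15): a point has a 0% chance of being directly connected to its 16th or any further neighbor, because those neighbors fall outside the local region UMAP uses when building the graph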
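The lower and upper bounds can be illustrated with a small sketch of how edge weights are assigned. It follows the description in the UMAP paper under simplifying assumptions: local_connectivity is fixed at 1, the per-point scale `sigma` is found by plain bisection so that the weights sum to log2(k), and the two directed weights are merged with a pure fuzzy union (the set_op_mix_ratio=1.0 case). The function names are illustrative, and umap-learn's own implementation differs in its details.

```python
import math

def membership_weights(dists, n_iter=64):
    """Directed edge weights from one point to its k nearest neighbors.

    dists: distances to the k nearest neighbors, sorted in ascending order.
    """
    k = len(dists)
    rho = dists[0]            # distance to the closest neighbor
    target = math.log2(k)     # the weights are normalized to sum to log2(k)

    def total(sigma):
        return sum(math.exp(-max(0.0, d - rho) / sigma) for d in dists)

    # total(sigma) grows with sigma, so bisection finds the matching scale.
    lo, hi = 1e-6, 1e6
    for _ in range(n_iter):
        mid = (lo + hi) / 2
        if total(mid) > target:
            hi = mid
        else:
            lo = mid
    sigma = (lo + hi) / 2
    return [math.exp(-max(0.0, d - rho) / sigma) for d in dists]

def fuzzy_union(w_ij, w_ji):
    """Combine the two directed weights of an edge into one undirected weight."""
    return w_ij + w_ji - w_ij * w_ji

weights = membership_weights([1.0, 2.0, 3.0, 4.0])
print([round(w, 3) for w in weights])
print(fuzzy_union(0.5, 0.5))
```

The nearest neighbor always receives weight 1.0 (the guaranteed connection), weights decay for farther neighbors, and any point beyond the k-th neighbor never enters the list, so its weight is implicitly 0.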
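Step 2: Finding a low-dimensional representation
Hyperparameter: min_dist (default = 0.1), the effective minimum distance between embedded points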
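A rough sketch of what min_dist does to the embedding: UMAP aims for a similarity that stays at 1 up to min_dist and then decays exponentially (scaled by spread), and approximates it with a smooth curve 1 / (1 + a·d^(2b)) whose parameters a and b are fitted to that target. The values a ≈ 1.577 and b ≈ 0.895 used below are the approximate fit for the defaults min_dist=0.1 and spread=1.0; they are hard-coded assumptions here, not computed by this snippet, and the function names are illustrative.

```python
import math

def target_similarity(d, min_dist=0.1, spread=1.0):
    """Target similarity at embedded distance d: flat up to min_dist, then decaying."""
    if d <= min_dist:
        return 1.0
    return math.exp(-(d - min_dist) / spread)

def smooth_similarity(d, a, b):
    """Differentiable curve 1 / (1 + a * d^(2b)) used in place of the target."""
    return 1.0 / (1.0 + a * d ** (2 * b))

# Approximate fitted values for min_dist=0.1, spread=1.0 (assumed, not fitted here).
A_DEFAULT, B_DEFAULT = 1.577, 0.895

for d in (0.0, 0.05, 0.1, 0.5, 1.0, 2.0):
    t = target_similarity(d)
    s = smooth_similarity(d, A_DEFAULT, B_DEFAULT)
    print(f"d={d:<4}  target={t:.3f}  smooth={s:.3f}")
```

Points closer than min_dist are already treated as fully similar, so the optimizer gains nothing by pulling them closer; that is why a smaller min_dist produces tighter clumps.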
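Cross-entropy is the cost function used to find the optimal edge weights in the low-dimensional representation. The optimal weights emerge as the cross-entropy between the high-dimensional and low-dimensional graphs is minimized, and this minimization can be carried out with stochastic gradient descent.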
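As a sketch of the cost being minimized, the function below computes the fuzzy-set cross-entropy between high-dimensional edge weights and their low-dimensional counterparts, as given in the UMAP paper. The function name, the clamping constant `eps`, and the sample weights are illustrative assumptions; the real optimizer never evaluates this sum in full and instead samples edges and negative examples.

```python
import math

def fuzzy_cross_entropy(w_high, w_low, eps=1e-12):
    """Sum over edges of the cross-entropy between two sets of edge weights."""
    total = 0.0
    for wh, wl in zip(w_high, w_low):
        wl = min(max(wl, eps), 1.0 - eps)  # keep the logarithms finite
        if wh > 0.0:
            # Attractive term: penalizes placing strongly connected points far apart.
            total += wh * math.log(wh / wl)
        if wh < 1.0:
            # Repulsive term: penalizes placing weakly connected points close together.
            total += (1.0 - wh) * math.log((1.0 - wh) / (1.0 - wl))
    return total

w_high = [0.9, 0.1]
matched = fuzzy_cross_entropy(w_high, [0.9, 0.1])
mismatched = fuzzy_cross_entropy(w_high, [0.1, 0.9])
print(matched, mismatched)
```

The cost is zero when the low-dimensional weights reproduce the high-dimensional ones and grows as they diverge, which is what gradient descent on the embedding coordinates exploits.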
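At this point UMAP's work is done, and the result is an array containing the coordinates of each data point in the specified low-dimensional space.
Example 1:
Use MNIST-style handwritten-digit data to separate the digits and display them in two dimensions: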
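import umap
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)  # assumption: scikit-learn's digits dataset
reducer = umap.UMAP(random_state=42)
X_trans = reducer.fit_transform(X)
print(X_trans.shape)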
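Plotting
import matplotlib.pyplot as plt
import numpy as np
import umap
from sklearn.datasets import load_digits
digits = load_digits()  # assumption: scikit-learn's digits dataset
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(digits.data)
print(embedding.shape)
# Color each embedded point by its digit label
plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('UMAP projection of the Digits dataset')
plt.show()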
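Parameter settings
n_components
Controls the dimensionality after projection; the default is 2. With many features, however, 2D may not be enough to preserve the underlying topology of the data. Try values between 2 and 20 in steps of 5, and evaluate a baseline model on each to see how accuracy changes.
n_neighbors
Determines the number of neighboring points used in the local approximation of the manifold structure. Larger values preserve more global structure at the expense of detailed local structure. This parameter should generally be between 5 and 50, with 10 to 15 a reasonable default.
min_dist
Controls how tightly the embedding is allowed to pack points together. Larger values spread the embedded points more evenly, while smaller values let the algorithm optimize more accurately for local structure. Reasonable values are between 0.001 and 0.5, with 0.1 a reasonable default.
metric
The formula used to compute distances between points; the default is euclidean. This determines how distance is measured in the input space. A wide variety of metrics are already implemented, and a user-defined function can be passed as long as it has been JIT-compiled with numba.
UMAP can consume a lot of memory, especially while fitting and building the connectivity graph; setting low_memory to True can help.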
import umap
reducer = umap.UMAP(
    n_neighbors=100, # default 15, The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation.
    n_components=3, # default 2, The dimension of the space to embed into.
    metric='euclidean', # default 'euclidean', The metric to use to compute distances in high dimensional space.
    n_epochs=1000, # default None, The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings.
    learning_rate=1.0, # default 1.0, The initial learning rate for the embedding optimization.
    init='spectral', # default 'spectral', How to initialize the low dimensional embedding. Options are: {'spectral', 'random', A numpy array of initial embedding positions}.
    min_dist=0.1, # default 0.1, The effective minimum distance between embedded points.
    spread=1.0, # default 1.0, The effective scale of embedded points. In combination with ``min_dist`` this determines how clustered/clumped the embedded points are.
    low_memory=False, # default False, For some datasets the nearest neighbor computation can consume a lot of memory. If you find that UMAP is failing due to memory constraints consider setting this option to True.
    set_op_mix_ratio=1.0, # default 1.0, The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
    local_connectivity=1, # default 1, The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level.
    repulsion_strength=1.0, # default 1.0, Weighting applied to negative samples in low dimensional embedding optimization.
    negative_sample_rate=5, # default 5, Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
    transform_queue_size=4.0, # default 4.0, Larger values will result in slower performance but more accurate nearest neighbor evaluation.
    a=None, # default None, More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
    b=None, # default None, More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
    random_state=42, # default None, If int, random_state is the seed used by the random number generator.
    metric_kwds=None, # default None, Arguments to pass on to the metric, such as the ``p`` value for Minkowski distance.
    angular_rp_forest=False, # default False, Whether to use an angular random projection forest to initialise the approximate nearest neighbor search.
    target_n_neighbors=-1, # default -1, The number of nearest neighbors to use to construct the target simplicial set. If set to -1 use the ``n_neighbors`` value.
    #target_metric='categorical', # default 'categorical', The metric used to measure distance for a target array when using supervised dimension reduction. By default this is 'categorical' which will measure distance in terms of whether categories match or are different.
    #target_metric_kwds=None, # dict, default None, Keyword argument to pass to the target metric when performing supervised dimension reduction. If None then no arguments are passed on.
    #target_weight=0.5, # default 0.5, weighting factor between data topology and target topology.
    transform_seed=42, # default 42, Random seed used for the stochastic aspects of the transform operation.
    verbose=False, # default False, Controls verbosity of logging.
    unique=False, # default False, Controls if the rows of your data should be uniqued before being embedded.
)
Plotting a 3D chart with plotly
import numpy as np
import pandas as pd
import plotly.express as px

def chart_plotly(X, y):
    # --------------------------------------------------------------------------#
    # This section is not mandatory as its purpose is to sort the data by label
    # so, we can maintain consistent colors for digits across multiple graphs
    # Concatenate X and y arrays
    arr_concat = np.concatenate((X, y.reshape(y.shape[0], 1)), axis=1)
    # Create a Pandas dataframe using the above array
    df = pd.DataFrame(arr_concat, columns=['x', 'y', 'z', 'label'])
    # Convert label data type from float to integer
    df['label'] = df['label'].astype(int)
    # Finally, sort the dataframe by label
    df.sort_values(by='label', axis=0, ascending=True, inplace=True)
    # --------------------------------------------------------------------------#
    # Create a 3D graph
    fig = px.scatter_3d(df, x='x', y='y', z='z', color=df['label'].astype(str), height=900, width=950)
    # Update chart looks
    fig.update_layout(title_text='UMAP',
                      showlegend=True,
                      legend=dict(orientation="h", yanchor="top", y=0, xanchor="center", x=0.5),
                      scene_camera=dict(up=dict(x=0, y=0, z=1),
                                        center=dict(x=0, y=0, z=-0.1),
                                        eye=dict(x=1.5, y=-1.4, z=0.5)),
                      margin=dict(l=0, r=0, b=0, t=0),
                      scene=dict(xaxis=dict(backgroundcolor='white',
                                            color='black',
                                            gridcolor='#f0f0f0',
                                            title_font=dict(size=10),
                                            tickfont=dict(size=10),
                                            ),
                                 yaxis=dict(backgroundcolor='white',
                                            color='black',
                                            gridcolor='#f0f0f0',
                                            title_font=dict(size=10),
                                            tickfont=dict(size=10),
                                            ),
                                 zaxis=dict(backgroundcolor='lightgrey',
                                            color='black',
                                            gridcolor='#f0f0f0',
                                            title_font=dict(size=10),
                                            tickfont=dict(size=10),
                                            )))
    # Update marker size
    fig.update_traces(marker=dict(size=3, line=dict(color='black', width=0.1)))
    fig.show()
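# The reducer must be configured with n_components=3 (as in the parameter listing above)
X_trans = reducer.fit_transform(X)
# Check the shape of the new data
print('Shape of X_trans: ', X_trans.shape)
# Call the plotting function defined above (it is named chart_plotly, not chart)
chart_plotly(X_trans, y)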