PySpark: preparing categorical variables for KMeans

I am 爱听歌航空, a blogger on 靠谱客. This post covers preparing categorical variables for KMeans in PySpark; I am sharing it here as a reference.

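I know KMeans is not well suited to categorical data, but Spark 1.4 does not give us many options for clustering categorical data. That aside, I am getting an error with the code below. I read my table from Hive, run it through a pipeline with OneHotEncoder, and then pass the result to KMeans.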

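I get an error when I run this code. Could the error be caused by the data type of the input to KMeans? Does it expect NumPy array data? If so, how do I convert my indexed data into a NumPy array? All comments are appreciated, thanks for your help!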
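On the data-type question: as far as I know, MLlib's KMeans.train accepts an RDD whose elements are MLlib vectors, NumPy arrays, or plain lists of numbers, so a separate conversion to NumPy should not be needed. What matters is that every row ends up as one numeric vector. The sketch below imitates in plain Python what the StringIndexer and OneHotEncoder stages do to a single column. The function names and the sample column are made up for illustration, and the alphabetical tie-break is my own choice, not Spark's documented behaviour. Spark's OneHotEncoder may also drop one category to avoid redundancy, which this sketch does not do.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically so the result is deterministic
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories):
    """Return a dense list with 1.0 at position `index` and 0.0 elsewhere."""
    vector = [0.0] * num_categories
    vector[index] = 1.0
    return vector


# Hypothetical categorical column, with "none" standing in for a missing value
colours = ["red", "blue", "red", "none", "red", "blue"]

index_of = fit_string_index(colours)
encoded = [one_hot(index_of[c], len(index_of)) for c in colours]

for label, vector in zip(colours, encoded):
    print(label, vector)
```

Each row is now a fixed-length list of floats, which is the shape of input a k-means implementation works on. With several categorical columns, the per-column vectors are concatenated, which is the role VectorAssembler plays in the pipeline.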

The error I get:

Traceback (most recent call last):
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 544, in read_int
    raise EOFError
EOFError
  File "", line 1
Traceback (most recent call last):

My code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.regression import LabeledPoint

# hCtx is assumed to be an existing HiveContext
# aline will be passed in from another RDD
aline = ["xxx", "yyy"]

# Get the data from the Hive table, select the columns, and convert to an RDD
rddRes2 = hCtx.sql("select XXX, YYY from table1 where xxx <> ''")
rdd3 = rddRes2.rdd

# Fill the NA values with "none"
Rdd4 = rdd3.map(lambda line: [x if len(x) else 'none' for x in line])

# Convert it back to a DataFrame
DataDF = Rdd4.toDF(aline)

# Indexers encode strings as doubles
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    for x in DataDF.columns if x not in ''
]

# Encoders turn each index column into a one-hot vector column
encoders = [
    OneHotEncoder(inputCol="idx_{0}".format(x), outputCol="enc_{0}".format(x))
    for x in DataDF.columns if x not in ''
]

# Assemble multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["enc_{0}".format(x) for x in DataDF.columns if x not in ''],
    outputCol="features")

pipeline = Pipeline(stages=string_indexers + encoders + [assembler])
model = pipeline.fit(DataDF)
indexed = model.transform(DataDF)

labeled_points = indexed.select("features").map(lambda row: LabeledPoint(row.features))

# Build the model (cluster the data)
clusters = KMeans.train(labeled_points, 3, maxIterations=10, runs=10, initializationMode="random")
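Judging only from the code shown, two lines look like plausible causes of the worker failure, though I have not run this against the original cluster. First, LabeledPoint takes a label and a feature vector, so the one-argument call LabeledPoint(row.features) raises a TypeError inside the Python workers once KMeans.train forces the lazy map to run. KMeans is unsupervised and only needs the vectors themselves. Second, len(x) raises a TypeError if a Hive NULL arrives as None. The sketch below shows one way to address both. It assumes Spark 1.x, where the pipeline's features column already holds MLlib vectors. fill_missing and cluster_features are hypothetical helper names, and the runs argument is left out because later Spark releases deprecated it.

```python
def fill_missing(row, placeholder="none"):
    """Replace None or empty-string fields with a placeholder string."""
    # A falsy value covers both None and '', so len() is never called on None
    return [value if value else placeholder for value in row]


def cluster_features(indexed_df, k=3, max_iterations=10):
    """Cluster the assembled `features` column with MLlib KMeans (Spark 1.x sketch)."""
    # Imported lazily so fill_missing can be tried out without Spark installed
    from pyspark.mllib.clustering import KMeans

    # Pass bare feature vectors to KMeans instead of wrapping them in LabeledPoint
    vectors = indexed_df.select("features").rdd.map(lambda row: row.features)
    return KMeans.train(vectors, k, maxIterations=max_iterations,
                        initializationMode="random")


# Quick check of the missing-value helper on a made-up row
print(fill_missing(["a", "", None]))
```

In the original script this would mean using rdd3.map(fill_missing) for the fill step and clusters = cluster_features(indexed) for the final step.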


This content was provided by users for study and reference, or collected from the web; copyright belongs to the original authors.

Category: spark kmeans java
Views: 187
Published: 2023-12-19 14:35:21
Link: https://www.kaopuke.com/article/k-p-k_13_u_23_o_2_f0_13__7__14_4.html
