[Spark GraphX] Graph Operators Best Practices


Posted by 炙热秋天 on 靠谱客. This article, collected during recent development work, covers best practices for the Spark GraphX graph operators and is shared here for reference.

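Overview

Graph and GraphOps together define a large set of operators for graph computation.

The sections below cover them in these groups: graph construction, property operators, transformation operators, structural operators, join operators, aggregation operators, and caching operators.

The examples that follow put these operators into practice.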

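1. Constructing a graph

There are generally two ways to construct a graph: through the Graph object or through a graph builder. The Graph object is the usual choice. The following code takes the vertex and edge data of a small property graph and builds a directed multigraph from it: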

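// Imports used by the examples in this article; sc is an existing SparkContext (for example in spark-shell)
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Create an RDD for vertices
val users: RDD[(VertexId, (String, String))] = sc.parallelize(
Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
(5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))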
// Create an RDD for edges
val relationships: RDD[Edge[String]] = sc.parallelize(
Array(Edge(3L, 7L, "collab"),
Edge(5L, 3L, "advisor"),
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Define a default user in case there are relationship with missing user
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)
// Count all the edges where src > dst
val count=graph.edges.filter(e => e.srcId > e.dstId).count
println("start > end counts :"+count)
// print all users which are postdocs
graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.foreach(println)
//create triplet and print them
graph.triplets.map(triplets=>{
triplets.srcAttr._1+" is the "+triplets.attr+" of the "+triplets.dstAttr._1
}).foreach(println)

The second way is the edgeListFile method of GraphLoader, which loads a graph from a list of edges stored on disk. The source of edgeListFile is shown below:

/**
 * @example Loads a file in the following format:
 * {{{
 * # Comment Line
 * # Source Id <\t> Target Id
 * 1   -5
 * 1    2
 * 2    7
 * 1    8
 * }}}
 *
 * @param sc SparkContext
 * @param path the path to the file (e.g., /home/data/file or hdfs://file)
 * @param canonicalOrientation whether to orient edges in the positive
 *        direction
 * @param numEdgePartitions the number of partitions for the edge RDD
 * Setting this value to -1 will use the default parallelism.
 * @param edgeStorageLevel the desired storage level for the edge partitions
 * @param vertexStorageLevel the desired storage level for the vertex partitions
 */
def edgeListFile(
sc: SparkContext,
path: String,
canonicalOrientation: Boolean = false,
numEdgePartitions: Int = -1,
edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
: Graph[Int, Int] =
{
val startTime = System.currentTimeMillis
 // Parse the edge data table directly into edge partitions
val lines =
if (numEdgePartitions > 0) {
sc.textFile(path, numEdgePartitions).coalesce(numEdgePartitions)
} else {
sc.textFile(path)
}
val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
val builder = new EdgePartitionBuilder[Int, Int]
iter.foreach { line =>
if (!line.isEmpty && line(0) != '#') {
val lineArray = line.split("\\s+")
if (lineArray.length < 2) {
throw new IllegalArgumentException("Invalid line: " + line)
}
val srcId = lineArray(0).toLong
val dstId = lineArray(1).toLong
if (canonicalOrientation && srcId > dstId) {
builder.add(dstId, srcId, 1)
} else {
builder.add(srcId, dstId, 1)
}
}
}
Iterator((pid, builder.toEdgePartition))
}.persist(edgeStorageLevel).setName("GraphLoader.edgeListFile - edges (%s)".format(path))
edges.count()
logInfo("It took %d ms to load the edges".format(System.currentTimeMillis - startTime))
GraphImpl.fromEdgePartitions(edges, defaultVertexAttr = 1, edgeStorageLevel = edgeStorageLevel,
vertexStorageLevel = vertexStorageLevel)
} // end of edgeListFile

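This method can load files from the local disk or from HDFS; for HDFS, use the hdfs:// prefix. In the file, lines that start with "#" are comments, and every other line holds two values separated by whitespace, the srcId and the dstId. All edge and vertex attributes default to 1. The canonicalOrientation parameter allows edges to be reoriented in the positive direction (srcId < dstId).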

// Build a graph from an edge-list file; the file contains only srcId and dstId pairs,
// and vertex and edge attributes default to 1
val graph = GraphLoader.edgeListFile(sc, "F:\\BaiduNetdiskDownload\\web-Google.txt")
graph.vertices.take(10).foreach(println)
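
As a rough illustration of the parsing rules described above, the following plain-Scala sketch (no Spark involved) applies the same comment skipping, whitespace splitting, and optional reorientation to a few lines of text. The helper name parseEdges and the sample lines are made up for this example; the sketch mirrors the logic of edgeListFile but is not the GraphX implementation.

```scala
// Hypothetical helper: parses edge-list lines into (srcId, dstId) pairs.
def parseEdges(lines: Seq[String], canonicalOrientation: Boolean = false): Seq[(Long, Long)] =
  lines
    .filter(line => line.nonEmpty && line(0) != '#') // skip blank lines and comments
    .map { line =>
      val fields = line.split("\\s+")
      require(fields.length >= 2, s"Invalid line: $line")
      val src = fields(0).toLong
      val dst = fields(1).toLong
      // With canonical orientation, every edge is stored with srcId < dstId
      if (canonicalOrientation && src > dst) (dst, src) else (src, dst)
    }

// Made-up sample input covering a comment, a tab, a space, and an empty line
val sample = Seq("# Comment Line", "1\t-5", "1 2", "7 2", "")
val asWritten = parseEdges(sample)
val canonical = parseEdges(sample, canonicalOrientation = true)
```

With canonicalOrientation enabled, the edges 1 -> -5 and 7 -> 2 come back reversed, while 1 -> 2 is unchanged.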

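A random graph can also be generated:

import org.apache.spark.graphx.util.GraphGenerators

// Generate a random graph and count its vertices
val count=GraphGenerators.logNormalGraph(sc,numVertices = 100).mapVertices((id,_)=>id.toDouble).vertices.count()
println(count)
// Initialize a random graph whose vertex out-degrees follow a log-normal distribution;
// edge attributes are initialized to 1
val graph2=GraphGenerators.logNormalGraph(sc,100,100).mapVertices((id,_)=>id.toDouble)
// println(graph2.edges.count())
graph2.vertices.take(10).foreach(println)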

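2. Property operators
Among the basic members of a graph, the most commonly used are vertices, edges, and triplets.

// Information about the graph
val numEdges: Long
val numVertices: Long
val inDegrees: VertexRDD[Int]   // in-degree of each vertex
val outDegrees: VertexRDD[Int]  // out-degree of each vertex
val degrees: VertexRDD[Int]     // total degree of each vertex
// Views of the graph as collections
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
val triplets: RDD[EdgeTriplet[VD, ED]]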

3. Transformation operators

Like the map operation on RDDs, a property graph offers the following transformation operators:

class Graph[VD, ED] {
def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
}

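Each operator yields a new graph whose vertex or edge attributes have been modified by the map function. In every case the structure of the graph is unaffected. This is the key property of these operators, because it lets the new graph reuse the structural indices of the original graph.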
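
To make the "attributes change, structure stays" point concrete, here is a minimal sketch using plain Scala collections rather than GraphX. SimpleGraph is a made-up stand-in type for this example only; its two methods imitate what mapVertices and mapEdges do to the attributes.

```scala
// Hypothetical stand-in for a property graph: vertex attributes plus (src, dst, attr) edges.
final case class SimpleGraph[VD, ED](vertices: Map[Long, VD], edges: Seq[(Long, Long, ED)]) {
  // Like mapVertices: new vertex attributes, same vertex ids, same edges
  def mapVertices[VD2](f: (Long, VD) => VD2): SimpleGraph[VD2, ED] =
    SimpleGraph(vertices.map { case (id, attr) => (id, f(id, attr)) }, edges)

  // Like mapEdges: new edge attributes, same endpoints
  def mapEdges[ED2](f: ((Long, Long, ED)) => ED2): SimpleGraph[VD, ED2] =
    SimpleGraph(vertices, edges.map(e => (e._1, e._2, f(e))))
}

val g = SimpleGraph(
  Map(3L -> ("rxin", "student"), 7L -> ("jgonzal", "postdoc")),
  Seq((3L, 7L, "collab")))

// Keep only the user name as the vertex attribute
val namesOnly = g.mapVertices((_, attr) => attr._1)
// Replace each edge label by its length
val edgeLengths = g.mapEdges(e => e._3.length)
```

In both results the set of vertex ids and the edge endpoints are identical to those of g; only the attribute types differ.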

4. Structural operators

class Graph[VD, ED] {
def reverse: Graph[VD, ED]
def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,
vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
}

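The reverse operator returns a new graph in which the direction of every edge is reversed. The subgraph operator takes a vertex predicate and an edge predicate and returns a graph that contains only the vertices for which the vertex predicate is true, together with the edges that satisfy the edge predicate and connect two such vertices. The code below uses it to remove broken links: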

val users: RDD[(VertexId, (String, String))] = sc.parallelize(
Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
(5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] = sc.parallelize(
Array(Edge(3L, 7L, "collab"),
Edge(5L, 3L, "advisor"),
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
Edge(2L,0L,"student"),Edge(5L,0L,"colleague")))
// Define a default user in case there are relationship with missing user
val defaultUser = ("John Doe", "Missing")
val graph=Graph(users,relationships,defaultUser)
graph.triplets.map(triplet=>{
triplet.srcAttr._1+" is the "+triplet.attr+" of "+triplet.dstAttr._1
}).collect().foreach(println)
println("==========build subgraph============")
// Build a subgraph by removing the missing vertices and the edges attached to them
val validGraph=graph.subgraph(vpred = (id,attr)=>attr._2!="Missing")
validGraph.vertices.foreach(println)
validGraph.triplets.map(triplet=>{
triplet.srcAttr._1+" is the "+triplet.attr+" of "+triplet.dstAttr._1
}).collect().foreach(println)
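
The filtering rule behind subgraph can be sketched with plain Scala collections. This is only an illustration of the semantics described above, not the GraphX implementation; the standalone subgraph function is hypothetical, and vertex 0 is written out by hand with the default attribute that Graph(...) would assign to it.

```scala
// Hypothetical helper: keep the vertices accepted by vpred, then keep the edges
// accepted by epred whose two endpoints both survived.
def subgraph[VD, ED](vertices: Map[Long, VD], edges: Seq[(Long, Long, ED)])(
    vpred: (Long, VD) => Boolean,
    epred: ((Long, Long, ED)) => Boolean): (Map[Long, VD], Seq[(Long, Long, ED)]) = {
  val keptVertices = vertices.filter { case (id, attr) => vpred(id, attr) }
  val keptEdges = edges.filter { e =>
    keptVertices.contains(e._1) && keptVertices.contains(e._2) && epred(e)
  }
  (keptVertices, keptEdges)
}

val vertices = Map(
  3L -> ("rxin", "student"), 7L -> ("jgonzal", "postdoc"),
  5L -> ("franklin", "prof"), 2L -> ("istoica", "prof"),
  0L -> ("John Doe", "Missing")) // default attribute for the unknown vertex 0
val edges = Seq(
  (3L, 7L, "collab"), (5L, 3L, "advisor"), (2L, 5L, "colleague"),
  (5L, 7L, "pi"), (2L, 0L, "student"), (5L, 0L, "colleague"))

val (validVertices, validEdges) =
  subgraph(vertices, edges)((_, attr) => attr._2 != "Missing", _ => true)
```

The two edges that point to vertex 0 disappear because one of their endpoints fails the vertex predicate, even though the edge predicate accepts everything.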

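The mask operator returns a subgraph that contains only the vertices and edges that are also found in the input graph. Combined with subgraph, it can restrict one graph based on the properties of another, related graph, as in the following code:

// Compute the connected components; each vertex value becomes the id of its component
val ccGraph=graph.connectedComponents()
// ccGraph.vertices.foreach(println)
// Restrict the result to the valid subgraph with mask
val validCCGraph=ccGraph.mask(validGraph)
validCCGraph.edges.foreach(println)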

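5. Join operators
In many cases it is necessary to join data from external collections (RDDs) with a graph. For example, we may have extra user properties to merge into an existing graph, or we may want to pull vertex properties from one graph into another. These tasks can be accomplished with the join operators.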

class Graph[VD, ED] {
  def joinVertices[U](table: RDD[(VertexId, U)])(map: (VertexId, VD, U) => VD)
    : Graph[VD, ED]
  def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])(map: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]
}

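The joinVertices operator joins the vertices with the input RDD and returns a new graph whose vertex attributes are obtained by applying the user-defined map function to each joined vertex. Vertices without a matching value in the RDD keep their original value.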

val graph=GraphLoader.edgeListFile(sc,"F:\\BaiduNetdiskDownload\\web-Google.txt")
graph.vertices.take(10).foreach(println)
println("=====================================")
val rawGraph=graph.mapVertices((id,attr)=>0)
rawGraph.vertices.take(10).foreach(println)
println("======================================")
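// Compute the out-degree of each vertex
val outDegress=rawGraph.outDegrees
// Replace the original attribute with the out-degree
// joinVertices uses multiple parameter lists (curried style): f(a)(b) instead of f(a,b)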
val tmp=rawGraph.joinVertices(outDegress)((_,attr,optDeg)=>optDeg)
tmp.vertices.take(10).foreach(println)
println("======================================")

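The outerJoinVertices operator behaves like joinVertices, except that the user-defined map function is applied to all vertices and may change the vertex attribute type. Because not every vertex has a matching value in the input RDD, the map function receives an Option.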


/**
* outerJoinVertices
*/
val degressGraph=graph.outerJoinVertices(outDegress)((id,oldAttr,outDegOpt)=>{
outDegOpt match {
case Some(outDegOpt) => outDegOpt
case None => 0
}
})
degressGraph.vertices.take(10).foreach(println)
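
The difference between the two joins can be seen on a tiny example with plain Scala maps. The functions below are hypothetical stand-ins that only imitate the matching behavior described above (a Map plays the role of the input RDD); they are not the GraphX methods.

```scala
// Hypothetical stand-in: vertices without a match keep their original attribute.
def joinVertices[VD, U](vertices: Map[Long, VD], table: Map[Long, U])(
    f: (Long, VD, U) => VD): Map[Long, VD] =
  vertices.map { case (id, attr) =>
    (id, table.get(id).map(u => f(id, attr, u)).getOrElse(attr))
  }

// Hypothetical stand-in: f runs on every vertex, sees an Option, and may change the type.
def outerJoinVertices[VD, U, VD2](vertices: Map[Long, VD], table: Map[Long, U])(
    f: (Long, VD, Option[U]) => VD2): Map[Long, VD2] =
  vertices.map { case (id, attr) => (id, f(id, attr, table.get(id))) }

val vertexAttrs = Map(1L -> 0, 2L -> 0, 3L -> 0)
val outDegrees = Map(1L -> 2, 2L -> 1) // vertex 3 has no outgoing edges, so no entry

val joined = joinVertices(vertexAttrs, outDegrees)((_, _, deg) => deg)
val outerJoined =
  outerJoinVertices(vertexAttrs, outDegrees)((_, _, degOpt) => degOpt.getOrElse(0).toString)
```

Vertex 3 keeps its old value 0 in joined, while in outerJoined the map function decides what a missing value becomes and the attribute type changes from Int to String.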

6. Aggregation operators

class Graph[VD, ED] {
def mapReduceTriplets[Msg](
map: EdgeTriplet[VD, ED] => Iterator[(VertexId, Msg)],
reduce: (Msg, Msg) => Msg)
: VertexRDD[Msg]
}

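The mapReduceTriplets operator takes a user-defined map function and applies it to every triplet. The function can read the attributes of both vertices of the triplet and, to make the aggregation easier to optimize, can only emit messages addressed to the source or the destination vertex of that triplet. The user-defined reduce function merges all messages destined for the same vertex. The result is a VertexRDD[Msg] that holds the aggregated message (of type Msg) for each vertex; vertices that receive no message are not included in the returned VertexRDD. Note that mapReduceTriplets was deprecated in Spark 1.2 in favor of aggregateMessages and is no longer part of the public Graph API in Spark 2.x, so the code below applies to older releases.
The following example uses mapReduceTriplets to compute, for each vertex, the average age of its older followers.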


// Create a graph whose vertex attribute is an age
val graph:Graph[Double,Int] = GraphGenerators.logNormalGraph(sc,100,100).mapVertices((id,_)=>id.toDouble)
// For each vertex, count its older followers and sum their ages
val olderFollowers:VertexRDD[(Int,Double)] = graph.mapReduceTriplets[(Int,Double)](
  triplet => { // map function
    if(triplet.srcAttr > triplet.dstAttr){
      // Send a (counter, age) message to the destination vertex
      // println(triplet.dstId+"==="+triplet.srcAttr)
      Iterator((triplet.dstId,(1,triplet.srcAttr)))
    }else{
      // Send no message
      Iterator.empty
    }
  },
  // reduce function: merge all messages sent to the same destination vertex
  // by adding up the counters and the ages
  (a,b) => (a._1+b._1,a._2+b._2)
)
// olderFollowers.count()
// Divide the total age by the count to get the average age
val avgAgeOfOlderFollowers=olderFollowers.mapValues((id,value)=>value match {
  case (count,totalAge) => totalAge/count
})
// Print the result
avgAgeOfOlderFollowers.collect().foreach(println)
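
The map and reduce phases above can be traced by hand on a three-vertex graph. The sketch below uses plain Scala collections with made-up ages and follower edges; it imitates the message flow of the example and is not the GraphX operator itself.

```scala
// Made-up data: vertex id -> age, and (follower, followed) edges
val ages: Map[Long, Double] = Map(1L -> 30.0, 2L -> 20.0, 3L -> 40.0)
val follows: Seq[(Long, Long)] = Seq((1L, 2L), (3L, 2L), (2L, 3L), (3L, 1L))

// "map" phase: one (destination, (1, age)) message per edge whose source is older
val messages: Seq[(Long, (Int, Double))] = follows.collect {
  case (src, dst) if ages(src) > ages(dst) => (dst, (1, ages(src)))
}

// "reduce" phase: merge the messages that share a destination vertex
val olderFollowers: Map[Long, (Int, Double)] = messages
  .groupBy(_._1)
  .map { case (dst, msgs) =>
    (dst, msgs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2)))
  }

// Average age = total age / number of older followers
val avgAgeOfOlderFollowers: Map[Long, Double] =
  olderFollowers.map { case (id, (count, total)) => (id, total / count) }
```

Vertex 3 is the oldest, so it receives no message and does not appear in the result, which matches the rule that vertices without messages are left out of the returned collection.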

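7. Caching operators

def persist(newLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]
def cache(): Graph[VD, ED]
def unpersistVertices(blocking: Boolean = true): Graph[VD, ED]

To avoid recomputation, call Graph.cache() first on any graph that will be used more than once.
When memory fills up, cached data also has to be released: by default, cached RDDs and graphs stay in memory until they are evicted in LRU order.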

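Common graph algorithms

1. PageRank
PageRank is a technique that ranks web pages based on the hyperlinks between them. Google uses it to reflect the relevance and importance of a page, and in search engine optimization it is often used as one of the factors for judging how effective an optimization has been.

PageRank determines the rank of a page from the vast web of hyperlinks. A link from page A to page B is interpreted as a vote from A for B, and the new rank is decided from the ranks of both the voting page and the page being voted for.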

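Consider a small set of four pages: A, B, C, and D. If every page links only to A, the PR (PageRank) value of A is the sum of the values of B, C, and D:

PR(A)=PR(B)+PR(C)+PR(D)

Now suppose B also links to C, and D links to three pages including A. A page cannot vote twice, so B gives half a vote to each page it links to, and only one third of D's vote counts toward A:

PR(A)=PR(B)/2+PR(C)+PR(D)/3

In other words, the PR value of a page is divided evenly among its outgoing links. With L(X) denoting the number of outgoing links of page X:

PR(A)=PR(B)/L(B)+PR(C)/L(C)+PR(D)/L(D)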
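
The vote-splitting rule PR(A)=PR(B)/L(B)+PR(C)/L(C)+PR(D)/L(D) can be checked numerically with a few lines of plain Scala. This sketch assumes a link structure in which B links to A and C, C links to A, and D links to A, B, and C, with every page starting at 0.25; the targets of D's other two links and the starting values are assumptions for illustration. It applies the simplified formula exactly once, without the damping factor that full PageRank implementations add.

```scala
// Assumed out-links for the four-page example (page -> pages it links to)
val links: Map[String, Seq[String]] = Map(
  "A" -> Seq.empty,
  "B" -> Seq("A", "C"),
  "C" -> Seq("A"),
  "D" -> Seq("A", "B", "C"))

// One application of PR(p) = sum over pages q linking to p of PR(q) / L(q)
def step(pr: Map[String, Double]): Map[String, Double] =
  links.keys.map { p =>
    val votes = links.toSeq.collect {
      case (q, outs) if outs.contains(p) => pr(q) / outs.size
    }
    (p, votes.sum)
  }.toMap

// Assumed starting point: every page has the same value
val initial: Map[String, Double] = links.keys.map(p => (p, 0.25)).toMap
val afterOne: Map[String, Double] = step(initial)
```

Here afterOne("A") equals 0.25/2 + 0.25 + 0.25/3, which is the second formula above evaluated with equal starting values.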

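PageRank comes in a static and a dynamic implementation. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge.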

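The following example uses the data set that ships with the GraphX source to compute page ranks. The users are listed in users.txt and the relationships between them in followers.txt.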

// Load the edges of the graph
val graph=GraphLoader.edgeListFile(sc,"F:\\spark-2.0.0\\SparkApp\\src\\cn\\just\\shinelon\\txt\\followers.txt")
// Run dynamic PageRank with a convergence tolerance of 0.0001
val ranks=graph.pageRank(0.0001).vertices
ranks.foreach(println)
// Load the user vertices and join the ranks with the usernames
val users=sc.textFile("F:\\spark-2.0.0\\SparkApp\\src\\cn\\just\\shinelon\\txt\\users.txt").map(line=>{
  val fields=line.split(",")
  (fields(0).toLong,fields(1))
})
val ranksByUsername=users.join(ranks).map({
  case (id,(username,rank))=>(username,rank)
})
// Print the result
println(ranksByUsername.collect().mkString("\n"))

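2. Triangle counting
The TriangleCount object computes the number of triangles that pass through each vertex. TriangleCount requires the edges to be in canonical orientation (srcId < dstId); the code below also partitions the graph with partitionBy.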


// Load the edges in canonical orientation and partition the graph
val graph = GraphLoader.edgeListFile(sc,"F:\\spark-2.0.0\\SparkApp\\src\\cn\\just\\shinelon\\txt\\followers.txt",true)
  .partitionBy(PartitionStrategy.RandomVertexCut)
// Find the triangle count of each vertex
val triCounts = graph.triangleCount().vertices
// Join the triangle counts with the usernames
val users = sc.textFile("F:\\spark-2.0.0\\SparkApp\\src\\cn\\just\\shinelon\\txt\\users.txt").map(lines=>{
  val fields=lines.split(",")
  (fields(0).toLong,fields(1))
})
val triCountByUsername=users.join(triCounts).map({
  case (id,(username,tc))=>(username,tc)
})
println(triCountByUsername.collect().mkString("\n"))

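3. Connected components
The ConnectedComponents object labels each connected component of the graph with the id of its lowest-numbered vertex, so every vertex ends up carrying the lowest vertex id found in its component.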

// Load the relationship graph
val graph=GraphLoader.edgeListFile(sc,"F:\\spark-2.0.0\\SparkApp\\src\\cn\\just\\shinelon\\txt\\followers.txt")
// Find the connected components
val cc=graph.connectedComponents().vertices
// Join the connected components with the usernames
val users=sc.textFile("F:\\spark-2.0.0\\SparkApp\\src\\cn\\just\\shinelon\\txt\\users.txt").map(line=>{
  val fields=line.split(",")
  (fields(0).toLong,fields(1))
})
val ccByUsername=users.join(cc).map({
  case (id ,(username,cc))=>(username,cc)
})
println(ccByUsername.collect().mkString("\n"))
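
What "labeled with the lowest vertex id" means can be shown with a small plain-Scala sketch. It repeatedly pushes the smaller label across every edge until nothing changes. The function and sample edges are made up for illustration; this is a simple fixed-point loop, not the algorithm GraphX actually runs.

```scala
// Hypothetical helper: label every vertex with the smallest vertex id in its component.
def connectedComponents(edges: Seq[(Long, Long)]): Map[Long, Long] = {
  val vertexIds: Set[Long] = edges.flatMap { case (a, b) => Seq(a, b) }.toSet
  // Every vertex starts in its own component, labeled with its own id
  var label: Map[Long, Long] = vertexIds.map(v => (v, v)).toMap
  var changed = true
  while (changed) {
    // Edge direction is ignored: both endpoints take the smaller of their two labels
    val next = edges.foldLeft(label) { case (acc, (a, b)) =>
      val m = math.min(acc(a), acc(b))
      acc.updated(a, m).updated(b, m)
    }
    changed = next != label
    label = next
  }
  label
}

// Made-up sample: two components, {1, 2, 3} and {6, 7}
val components = connectedComponents(Seq((1L, 2L), (2L, 3L), (6L, 7L)))
```

Vertices 1, 2, and 3 all end up with label 1, and vertices 6 and 7 with label 6.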

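4. Shortest path algorithm
Pregel casts graph problems into a computation model in which an algorithm runs iteratively on the vertices of the graph. Each iteration is called a superstep, and the main concepts of the data model are vertices, edges, and messages. In every superstep, each vertex executes the same user-defined function to process its data, update its own state, and even change the topology of the graph. The edges of a vertex link it to its target vertices, and data is passed on by sending messages to other vertices. Receiving and processing data is synchronized by superstep: a message sent in one superstep is received and processed by its target vertex only in the next superstep, where it may trigger a state change. After processing its data in the current superstep, each vertex votes on whether to halt; a halted vertex is not scheduled in later supersteps unless a message reactivates it. When all vertices have halted, the iteration ends.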

class GraphOps[VD, ED] {
def pregel[A]
(initialMsg: A,
maxIter: Int = Int.MaxValue,
activeDir: EdgeDirection = EdgeDirection.Out)
(vprog: (VertexId, VD, A) => VD,
sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
mergeMsg: (A, A) => A)
: Graph[VD, ED] = {
// Receive the initial message at each vertex
var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
// compute the messages
var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
var activeMessages = messages.count()
// Loop until no messages remain or maxIterations is achieved
var i = 0
while (activeMessages > 0 && i < maxIter) {
// Receive the messages: -----------------------------------------------------------------------
// Run the vertex program on all vertices that receive messages
val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
// Merge the new vertex values back into the graph
g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
// Send Messages: ------------------------------------------------------------------------------
// Vertices that didn't receive a message above don't appear in newVerts and therefore don't
// get to send messages. More precisely the map phase of mapReduceTriplets is only invoked
// on edges in the activeDir of vertices in newVerts
messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache()
activeMessages = messages.count()
i += 1
}
g
}
}
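
To connect the superstep model to the shortest-path problem named in the heading, here is a minimal single-source shortest-path sketch in plain Scala. It follows the same loop shape as the Pregel code above (send messages, merge them per vertex, update the vertices, stop when no messages remain), but it is a simplified illustration with made-up edge data and a hypothetical shortestPaths function, not the GraphX API.

```scala
// Made-up directed, weighted edges: (src, dst, weight)
val edges: Seq[(Long, Long, Double)] =
  Seq((1L, 2L, 1.0), (1L, 3L, 4.0), (2L, 3L, 2.0), (3L, 4L, 1.0))
val vertexIds: Set[Long] = edges.flatMap(e => Seq(e._1, e._2)).toSet

def shortestPaths(source: Long): Map[Long, Double] = {
  // Initial state: distance 0 at the source, infinity everywhere else
  var dist: Map[Long, Double] =
    vertexIds.map(v => (v, if (v == source) 0.0 else Double.PositiveInfinity)).toMap
  // Only vertices updated in the previous superstep send messages
  var active: Set[Long] = Set(source)
  while (active.nonEmpty) {
    // sendMsg: offer a shorter distance along each outgoing edge of an active vertex;
    // mergeMsg: keep the smallest offer per destination vertex
    val messages: Map[Long, Double] = edges
      .collect { case (src, dst, w) if active(src) && dist(src) + w < dist(dst) =>
        (dst, dist(src) + w)
      }
      .groupBy(_._1)
      .map { case (dst, msgs) => (dst, msgs.map(_._2).min) }
    // vprog: each vertex that received a message keeps the smaller distance
    dist = dist ++ messages.map { case (v, m) => (v, math.min(dist(v), m)) }
    active = messages.keySet
  }
  dist
}

val result = shortestPaths(1L)
```

On this sample the loop runs four supersteps: vertex 3 is first reached with distance 4 over the direct edge and is then improved to 3 via vertex 2, which in turn improves vertex 4 from 5 to 4.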