reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
scala> sc.textFile("file:///root/data/word").collect
res0: Array[String] = Array("this is demo ", haha haha, hello hello yes, good good study, day day up)
scala> sc.textFile("file:///root/data/word").map(_.length).reduce(_+_)
res1: Int = 62
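The requirement that func be commutative and associative can be illustrated without a cluster. The following is a minimal sketch in plain Scala that imitates a partitioned reduce: each partition is reduced locally, then the partial results are combined. The object name, the helper partitionedReduce, and the two partition layouts are hypothetical; the line lengths are those of the five sample lines above.

```scala
// Plain Scala sketch (no Spark needed): imitate a reduce over partitions.
object ReduceOrderDemo {
  // Reduce every non-empty partition locally, then combine the partial results.
  def partitionedReduce(parts: Seq[Seq[Int]])(f: (Int, Int) => Int): Int =
    parts.filter(_.nonEmpty).map(_.reduce(f)).reduce(f)

  // Lengths of the five sample lines: 13 + 9 + 15 + 15 + 10 = 62.
  val lengths = Seq(13, 9, 15, 15, 10)

  // Two hypothetical ways the same data could be split into partitions.
  val splitA = Seq(lengths.take(2), lengths.drop(2))
  val splitB = Seq(lengths.take(3), lengths.drop(3))

  // Addition is associative and commutative: the split does not matter.
  val sumA = partitionedReduce(splitA)(_ + _)
  val sumB = partitionedReduce(splitB)(_ + _)

  // Subtraction is neither: the result depends on the split.
  val diffA = partitionedReduce(splitA)(_ - _)
  val diffB = partitionedReduce(splitB)(_ - _)

  def main(args: Array[String]): Unit = {
    println(s"sum:  $sumA vs $sumB")
    println(s"diff: $diffA vs $diffB")
  }
}
```

With addition both layouts give 62, the same value as res1. With subtraction the two layouts give 14 and -16, so the result of such a reduce would depend on how the data happens to be partitioned.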
collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
scala> sc.textFile("file:///root/data/word").collect
res3: Array[String] = Array("this is demo ", haha haha, hello hello yes, good good study, day day up)
foreach(func)
Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
scala> sc.textFile("file:///root/data/word").foreach(line=>println(line))
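As a sketch of the accumulator use case mentioned above, the following standalone program sums the line lengths inside foreach. It assumes Spark 2.x or later on the classpath (for sc.longAccumulator); the object name, the accumulator name, the local master setting, and the in-memory copy of the sample lines are assumptions made so that the example does not depend on the file path used in this article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ForeachAccumulatorDemo {
  // Sum the line lengths through an accumulator that is updated inside foreach.
  def totalLength(sc: SparkContext, lines: Seq[String]): Long = {
    val totalChars = sc.longAccumulator("totalChars") // hypothetical accumulator name
    sc.parallelize(lines).foreach(line => totalChars.add(line.length.toLong))
    totalChars.value // read on the driver once the action has finished
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run;
    // inside spark-shell the provided `sc` would be used instead.
    val conf = new SparkConf().setMaster("local[2]").setAppName("foreach-demo")
    val sc = new SparkContext(conf)
    try {
      val sample = Seq("this is demo ", "haha haha", "hello hello yes",
        "good good study", "day day up")
      println(totalLength(sc, sample))
    } finally sc.stop()
  }
}
```

For the five sample lines the total should be 62, the same value the reduce example returns. Note that when running on a cluster, a println inside foreach writes to the executors' standard output rather than to the driver console, which is one reason an accumulator is the usual way to bring a side-effect result back to the driver.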
count()
Return the number of elements in the dataset.
scala> sc.textFile("file:///root/data/word").count()
res5: Long = 5
first()
Return the first element of the dataset (similar to take(1)).
scala> sc.textFile("file:///root/data/word").first
res6: String = "this is demo "
take(n)
Return an array with the first n elements of the dataset.
scala> sc.textFile("file:///root/data/word").take(1)
res7: Array[String] = Array("this is demo ")
scala> sc.textFile("file:///root/data/word").take(2)
res8: Array[String] = Array("this is demo ", haha haha)
takeSample(withReplacement, num, [seed])
Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
scala> sc.textFile("file:///root/data/word").takeSample(false,2)
res9: Array[String] = Array(hello hello yes, "this is demo ")
scala> sc.textFile("file:///root/data/word").takeSample(false,2)
res10: Array[String] = Array(haha haha, hello hello yes)
scala> sc.textFile("file:///root/data/word").takeSample(false,2,1)
res11: Array[String] = Array(day day up, hello hello yes)
scala> sc.textFile("file:///root/data/word").takeSample(false,2,1)
res12: Array[String] = Array(day day up, hello hello yes)
takeOrdered(n, [ordering])
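Return the first n elements of the RDD using either their natural order or a custom comparator. The session below assumes a case class such as User(name: String, deptNo: Int, salary: Double) was defined earlier; only deptNo and salary are used by the comparator, and the name of the first field is an assumption.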
scala> var userRDD=sc.parallelize(List(User("zhangsan",1,1000.0),User("lisi",2,1500.0),User("wangwu",2,1000.0)))
userRDD: org.apache.spark.rdd.RDD[User] = ParallelCollectionRDD[29] at parallelize at <console>:26
scala> userRDD.takeOrdered(3)
<console>:26: error: No implicit Ordering defined for User.
scala> implicit var userOrder=new Ordering[User]{
     |   override def compare(x:User,y:User):Int={
     |     if(x.deptNo!=y.deptNo){
     |       x.deptNo.compareTo(y.deptNo)
     |     }else{
     |       x.salary.compareTo(y.salary)* -1
     |     }
     |   }
     | }
userOrder: Ordering[User] = $anon$1@5965eb71
scala> userRDD.takeOrdered(3)(userOrder)
res14: Array[User] = Array(User(zhangsan,1,1000.0), User(lisi,2,1500.0), User(wangwu,2,1000.0))
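The ordering used above (department ascending, then salary descending) can be checked locally before it is handed to takeOrdered. The following is a minimal sketch in plain Scala: the field name `name` and the object name are hypothetical, and sorted.take(2) on a local List is only a stand-in for takeOrdered(2) on an RDD, not the Spark implementation.

```scala
object UserOrderingDemo {
  // `name` is a hypothetical field name; deptNo and salary come from the comparator above.
  case class User(name: String, deptNo: Int, salary: Double)

  // Department ascending, then salary descending.
  implicit val userOrdering: Ordering[User] = new Ordering[User] {
    def compare(x: User, y: User): Int = {
      val byDept = Integer.compare(x.deptNo, y.deptNo)
      if (byDept != 0) byDept
      else java.lang.Double.compare(y.salary, x.salary) // arguments swapped for descending order
    }
  }

  // Same three users as in the session, deliberately listed out of order.
  val users = List(
    User("wangwu", 2, 1000.0),
    User("lisi", 2, 1500.0),
    User("zhangsan", 1, 1000.0))

  // Local stand-in for takeOrdered(2): sort with the implicit ordering, keep the first two.
  val firstTwo: List[User] = users.sorted.take(2)

  def main(args: Array[String]): Unit = firstTwo.foreach(println)
}
```

Here firstTwo contains zhangsan followed by lisi, which agrees with the first two entries of res14. Because the ordering is declared implicit, a call such as userRDD.takeOrdered(3) would also resolve it without passing it explicitly.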
saveAsTextFile(path)
Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
scala> sc.textFile("file:///root/data/word").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._1,true,1).map(t=>t._1+"\t"+t._2).saveAsTextFile("hdfs:///demo/results02")
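The final map in the command above matters because each element is written using toString. The following plain Scala sketch, with a hypothetical object name and two sample pairs taken from the word counts, compares the line a tuple would produce by default with the tab-separated line the example writes.

```scala
object TextLineFormatDemo {
  // Two (word, count) pairs as produced by the word count above.
  val counts = Seq(("day", 2), ("demo", 1))

  // Without the extra map, each line would be the tuple's own toString.
  val defaultLines: Seq[String] = counts.map(_.toString)

  // With the map from the example, each line is "word<TAB>count".
  val tabLines: Seq[String] = counts.map(t => t._1 + "\t" + t._2)

  def main(args: Array[String]): Unit = {
    defaultLines.foreach(println)
    tabLines.foreach(println)
  }
}
```

The default form yields lines such as (day,2), while the mapped form yields day and 2 separated by a tab, which is usually easier for downstream tools to parse.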
saveAsSequenceFile(path)
Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop’s Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
This method can only be used on an RDD[(K, V)], and both K and V must implement the Writable interface. Because we are programming in Scala, Spark's implicit conversions automatically convert types such as Int, Double, and String to Writable.
scala> sc.textFile("file:///root/data/word").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._1,true,1).saveAsSequenceFile("hdfs:///demo/results03")
scala> sc.sequenceFile[String,Int]("hdfs:///demo/results03").collect
res17: Array[(String, Int)] = Array((day,2), (demo,1), (good,2), (haha,2), (hello,2), (is,1), (study,1), (this,1), (up,1), (yes,1))