Parallel Programming With Apache Spark
Reynold Xin

What is Spark?
Fast and Expressive Cluster Computing System Compatible with Apache Hadoop
Efficiency (up to 10× faster on disk, 100× in memory)
  General execution graphs
  In-memory storage
Usability (2-10× less code)
  Rich APIs in Java, Scala, Python
  Interactive shell

Project History
Spark started in 2009, open sourced 2010
In use at Intel, Yahoo!, Adobe, Alibaba Taobao, Conviva, Ooyala, Bizo and others
Entered Apache Incubator in June

Open Source Community
1300+ meetup members
90+ code contributors
20 companies contributing

This Talk
Introduction to Spark
Tour of Spark operations (in Python)
Job execution
Standalone apps

Key Idea
Write programs in terms of transformations on distributed datasets
Concept: resilient distributed datasets (RDDs)
  Collections of objects spread across a cluster
  Built through parallel transformations (map, filter, etc)
  Automatically rebuilt on failure
  Controllable persistence (e.g. caching in RAM)

Operations
Transformations (e.g. map, filter, groupBy)
  Lazy operations to build RDDs from other RDDs
Actions (e.g. count, collect, save)
  Return a result or write it to storage

Example: Log Mining
Load error messages from
a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")                      # Base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "foo" in s).count()             # Action
    messages.filter(lambda s: "bar" in s).count()
    . . .

[Diagram labels: Block 1, Block 2, Block 3; tasks, results; Cache 1, Cache 2, Cache 3]
Result: full-text search of Wikipedia in 0.5 sec (vs 20 s for on-disk data)
Result: scaled to 1 TB data in 5 sec (vs 180 sec for on-disk data)

Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data
Ex:
    msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
                   .map(lambda s: s.split("\t")[2])
Lineage: HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD

[Figure: Behavior with Less RAM]

Spark in Scala and Java
    // Scala:
    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()

    // Java:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) {
        return s.contains("error");
      }
    }).count();

Which Language Should I Use?
Standalone programs can be written in any of the three, but the interactive shell supports only Python and Scala
  Python users: can do Python for both
  Java users: consider learning Scala for the shell
Performance: Java & Scala are faster due to static typing, but Python is often fine

Scala Cheat Sheet
Variables:
    var x: Int = 7
    var x = 7        // type inferred
    val y = "hi"     // read-only
Functions:
    def square(x: Int): Int = x*x
    def square(x: Int): Int = {
      x*x            // last line returned
    }
Collections and closures:
    val nums = Array(1, 2, 3)
    nums.map((x: Int) => x + 2)    // {3, 4, 5}
    nums.map(x => x + 2)           // same
    nums.map(_ + 2)                // same
    nums.reduce((x, y) => x + y)   // 6
    nums.reduce(_ + _)             // same
Java interop:
    import java.net.URL
    new URL("http://...").openStream()
More details: scala-lang.org

Learning Spark
Easiest way: the shell (spark-shell or pyspark)
  Special Scala / Python interpreters for cluster use
Runs in local mode on 1 core by default, but can control with the MASTER environment variable:
    MASTER=local ./spark-shell              # local, 1 thread
    MASTER=local[2] ./spark-shell           # local, 2 threads
    MASTER=spark://host:port ./spark-shell  # cluster

First Stop: SparkContext
Main entry point to Spark functionality
Available in shell as variable sc
In standalone programs, you'd
make your own (see later for details)

Creating RDDs
    # Turn a Python collection into an RDD
    sc.parallelize([1, 2, 3])

    # Load text file from local FS, HDFS, or S3
    sc.textFile("file.txt")
    sc.textFile("directory/*.txt")
    sc.textFile("hdfs://namenode:9000/path/file")

    # Use existing Hadoop InputFormat (Java/Scala only)
    sc.hadoopFile(path, inputFormat, keyClass, valClass)

Basic Transformations
    nums = sc.parallelize([1, 2, 3])

    # Pass each element through a function
    squares = nums.map(lambda x: x*x)              # => {1, 4, 9}

    # Keep elements passing a predicate
    even = squares.filter(lambda x: x % 2 == 0)    # => {4}

    # Map each element to zero or more others
    nums.flatMap(lambda x: range(x))               # => {0, 0, 1, 0, 1, 2}
    # range(x) is a Range object (sequence of numbers 0, 1, ..., x-1)

Basic Actions
    nums = sc.parallelize([1, 2, 3])

    # Retrieve RDD contents as a local collection
    nums.collect()                     # => [1, 2, 3]

    # Return first K elements
    nums.take(2)                       # => [1, 2]

    # Count number of elements
    nums.count()                       # => 3

    # Merge elements with an associative function
    nums.reduce(lambda x, y: x + y)    # => 6

    # Write elements to a text file
    nums.saveAsTextFile("hdfs://file.txt")

Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs
Python:
    pair = (a, b)
    pair[0]   # => a
    pair[1]   # => b
Scala:
    val pair = (a, b)
    pair._1   // => a
    pair._2   // => b
Java:
    Tuple2 pair = new Tuple2(a, b);
    pair._1   // => a
    pair._2   // => b

Some Key-Value Operations
    pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
    pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
    pets.groupByKey()                      # => {(cat, [1, 2]), (dog, [1])}
    pets.sortByKey()                       # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also automatically implements combiners on the map side

Example: Word Count
    lines = sc.textFile("hamlet.txt")
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)

Other Key-Value Operations
    visits = sc.parallelize([("index.html", "1.2.3.4"),
                             ("about.html", "3.4.5.6"),
                             ("index.html", "1.3.3.1")])

    pageNames = sc.parallelize([("index.html", "Home"),
                                ("about.html", "About")])

    visits.join(pageNames)
    # ("index.html", ("1.2.3.4", "Home"))
    # ("index.html", ("1.3.3.1", "Home"))
    # ("about.html", ("3.4.5.6", "About"))

    visits.cogroup(pageNames)
    # ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
    # ("about.html", (["3.4.5.6"], ["About"]))

Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter for number of tasks
    words.reduceByKey(lambda x, y: x + y, 5)
    words.groupByKey(5)
    visits.join(pageViews, 5)

Using Local Variables
Any external variables you use in a closure will automatically be shipped to the cluster:
    query = sys.stdin.readline()
    pages.filter(lambda x: query in x).count()
Some caveats:
  Each task gets a new copy (updates aren't sent back)
  Variable must be Serializable / Pickle-able
  Don't use fields of an outer object (ships all of it!)

Closure Mishap Example
    class MyCoolRddApp {
      val param = 3.14
      val log = new Log(...)
      ...

      def work(rdd: RDD[Int]) {
        rdd.map(x => x + param)
           .reduce(...)
      }
    }
NotSerializableException: MyCoolRddApp (or Log)

How to get around it:
    class MyCoolRddApp {
      ...

      def work(rdd: RDD[Int]) {
        val param_ = param
        rdd.map(x => x + param_)
           .reduce(...)
      }
    }
References only local variable instead of this.param

Other RDD Operators
map filter groupBy sort union join leftOuterJoin rightOuterJoin
reduce count fold
reduceByKey groupByKey cogroup cross zip
sample take first partitionBy mapWith pipe save ...
More details: spark-project.org/docs/latest/

Demo

Software Components
Spark runs as a library in your program (1 instance per app)
Runs tasks locally or on cluster
  Mesos, YARN or standalone mode
Accesses storage systems via Hadoop InputFormat API
  Can use HBase, HDFS, S3, ...
[Diagram labels: Your application, SparkContext, Local threads, Cluster manager, Worker with Spark executor (×2), HDFS or other storage]

Task Scheduler
General task graphs
Automatically pipelines functions
Data locality aware
Partitioning aware to avoid shuffles
[Diagram legend: cached partition, RDD]

Advanced Features
Controllable partitioning
  Speed up joins against a dataset
Controllable storage formats
  Keep data serialized for efficiency, replicate to multiple nodes,
  cache on disk
Shared variables: broadcasts, accumulators
See online docs for details!

Add Spark to Your Project
Scala / Java: add a Maven dependency on
  groupId: org.apache.spark
  artifactId: spark-core_2.9.3
  version: 0.8.0
Python: run program with our pyspark script

Create a SparkContext
Scala:
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))
Java:
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext sc = new JavaSparkContext(
        "masterUrl", "name", "sparkHome", new String[] {"app.jar"});
Python:
    from pyspark import SparkContext

    sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])
Constructor arguments, in order:
  Cluster URL, or local / local[N]
  App name
  Spark install path on cluster
  List of JARs with app code (to ship)

Example: PageRank
Good example of a more complex algorithm
  Multiple stages of map & reduce
Benefits from Spark's in-memory caching
  Multiple iterations over the same data

Basic Idea
Give pages ranks (scores) based on links to them
  Links from many pages → high rank
  Link from a high-rank page → high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

Algorithm
Start each page at a rank of 1
On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
Set each page's rank to 0.15 + 0.85 × contribs
[Figure: ranks of the four example pages over successive iterations: 1.0, 1.0, 1.0, 1.0; then 0.58, 1.0, 1.85, 0.58; then 0.39, 1.72, 1.31, 0.58; ...]

Scala Implementation
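Before the Spark code, it may help to trace the update rule above by hand. The sketch below does that in plain Python with no Spark dependency. The four-page graph and the names `links`, `pagerank_step`, and `pagerank` are invented for illustration and do not come from the talk; giving pages without incoming links the 0.15 floor is also an assumption of this sketch.

```python
# Hypothetical four-page link graph: page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def pagerank_step(links, ranks):
    """Run one iteration of the update rule from the Algorithm slide."""
    contribs = {}
    for page, neighbors in links.items():
        # Page p contributes rank_p / |neighbors_p| to each of its neighbors.
        share = ranks[page] / len(neighbors)
        for dest in neighbors:
            contribs[dest] = contribs.get(dest, 0.0) + share
    # New rank is 0.15 + 0.85 * (sum of received contributions).
    # Assumption: a page that received nothing gets the 0.15 floor.
    return {page: 0.15 + 0.85 * contribs.get(page, 0.0) for page in links}


def pagerank(links, iterations):
    """Start every page at rank 1 and apply the update rule repeatedly."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        ranks = pagerank_step(links, ranks)
    return ranks
```

Calling `pagerank(links, 10)` returns a dict mapping each page to its rank. Because every page in this toy graph has at least one outgoing link, the ranks keep summing to the number of pages from one iteration to the next.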
    val sc = new SparkContext("local", "PageRank", sparkHome, Seq("pagerank.jar"))

    val links = // load RDD of (url, neighbors) pairs
    var ranks = // load RDD of (url, rank) pairs

    // ITERATIONS: number of iterations, assumed to be defined elsewhere
    for (i <- 1 to ITERATIONS) {
      val contribs = links.join(ranks).flatMap {
        case (url, (links, rank)) =>
          links.map(dest => (dest, rank/links.size))
      }
      ranks = contribs.reduceByKey(_ + _)
                      .mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile(...)

[Figures: PageRank Performance; Other Iterative Algorithms. Axis: Time per Iteration (s)]

Getting Started
Download Spark: spark-project.org/downloads
Documentation and video tutorials: www.spark-project.org/documentation
Several ways to run:
  Local mode (just need Java), EC2, private clusters

Local Execution
Just pass local or local[k] as master URL
Debug using local debuggers
  For Java / Scala, just run your program in a debugger
  For Python, use an attachable debugger (e.g. PyDev)
Great for development & unit tests

Cluster Execution
Easiest way to launch is EC2:
    ./spark-ec2 -k keypair -i id_rsa.pem -s slaves [launch|stop|start|destroy] clusterName
Several options for private clusters:
  Standalone mode (similar to Hadoop's deploy scripts)
  Mesos
  Hadoop YARN
Amazon EMR

Conclusion
Spark offers a rich API to make data analytics fast: both fast to write and fast to run
Achieves 100x speedups in real applications
Growing community with 20+ companies contributing
www.spark-project.org