

An Introduction to the Berkeley Data Analytics Stack.pptx

Parallel Programming With Apache Spark
Reynold Xin

What is Spark?

Fast and Expressive Cluster Computing System Compatible with Apache Hadoop

- Efficiency (up to 10x faster on disk, 100x in memory)
  - General execution graphs
  - In-memory storage
- Usability (2-10x less code)
  - Rich APIs in Java, Scala, Python
  - Interactive shell

Project History

- Spark started in 2009, open sourced 2010
- In use at Intel, Yahoo!, Adobe, Alibaba Taobao, Conviva, Ooyala, Bizo and others
- Entered Apache Incubator in June

Open Source Community

- 1300+ meetup members
- 90+ code contributors
- 20 companies contributing

This Talk

- Introduction to Spark
- Tour of Spark operations (in Python)
- Job execution
- Standalone apps

Key Idea

- Write programs in terms of transformations on distributed datasets
- Concept: resilient distributed datasets (RDDs)
  - Collections of objects spread across a cluster
  - Built through parallel transformations (map, filter, etc.)
  - Automatically rebuilt on failure
  - Controllable persistence (e.g. caching in RAM)

Operations

- Transformations (e.g. map, filter, groupBy)
  - Lazy operations to build RDDs from other RDDs
- Actions (e.g. count, collect, save)
  - Return a result or write it to storage

Example: Log Mining

Load error messages from
a log into memory, then interactively search for various patterns.

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(lambda s: s.startswith("ERROR"))
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "foo" in s).count()
    messages.filter(lambda s: "bar" in s).count()
    . . .

[Figure labels: Base RDD, Transformed RDD, Action; tasks and results; Block 1-3; Cache 1-3]

- Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data)
- Result: scaled to 1 TB data in 5 sec (vs 180 sec for on-disk data)

Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data. Ex:

    msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
                   .map(lambda s: s.split("\t")[2])

[Figure: HDFS File -> Filtered RDD -> Mapped RDD, via filter (func = _.contains(...)) and map (func = _.split(...))]

Behavior with Less RAM

[Chart not recoverable from the text preview]

Spark in Scala and Java

    // Scala:
    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()

    // Java:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) {
        return s.contains("error");
      }
    }).count();

Which Language Should I Use?

- Standalone programs can be written in any of the three, but the interactive shell supports only Python & Scala
  - Python users: can use Python for both
  - Java users: consider learning Scala for the shell
- Performance: Java & Scala are faster due to static typing, but Python is often fine

Scala Cheat Sheet

Variables:

    var x: Int = 7
    var x = 7      // type inferred
    val y = "hi"   // read-only

Functions:

    def square(x: Int): Int = x*x
    def square(x: Int): Int = {
      x*x   // last line returned
    }

Collections and closures:

    val nums = Array(1, 2, 3)
    nums.map((x: Int) => x + 2)   // {3,4,5}
    nums.map(x => x + 2)          // same
    nums.map(_ + 2)               // same
    nums.reduce((x, y) => x + y)  // 6
    nums.reduce(_ + _)            // same

Java interop:

    import java.net.URL
    new URL("http://...").openStream()

More details: scala-lang.org

Learning Spark

- Easiest way: the shell (spark-shell or pyspark)
  - Special Scala / Python interpreters for cluster use
- Runs in local mode on 1 core by default, but this can be controlled with the MASTER environment variable:

    MASTER=local ./spark-shell              # local, 1 thread
    MASTER=local[2] ./spark-shell           # local, 2 threads
    MASTER=spark://host:port ./spark-shell  # cluster

First Stop: SparkContext

- Main entry point to Spark functionality
- Available in shell as variable sc
- In standalone programs, you'd
  make your own (see later for details)

Creating RDDs

    # Turn a Python collection into an RDD
    sc.parallelize([1, 2, 3])

    # Load text file from local FS, HDFS, or S3
    sc.textFile("file.txt")
    sc.textFile("directory/*.txt")
    sc.textFile("hdfs://namenode:9000/path/file")

    # Use existing Hadoop InputFormat (Java/Scala only)
    sc.hadoopFile(path, inputFormat, keyClass, valClass)

Basic Transformations

    nums = sc.parallelize([1, 2, 3])

    # Pass each element through a function
    squares = nums.map(lambda x: x*x)   # => {1, 4, 9}

    # Keep elements passing a predicate
    even = squares.filter(lambda x: x % 2 == 0)   # => {4}

    # Map each element to zero or more others
    nums.flatMap(lambda x: range(x))   # => {0, 0, 1, 0, 1, 2}

range(x) is a Range object (the sequence of numbers 0, 1, ..., x-1).

Basic Actions

    nums = sc.parallelize([1, 2, 3])

    # Retrieve RDD contents as a local collection
    nums.collect()   # => [1, 2, 3]

    # Return first K elements
    nums.take(2)   # => [1, 2]

    # Count number of elements
    nums.count()   # => 3

    # Merge elements with an associative function
    nums.reduce(lambda x, y: x + y)   # => 6

    # Write elements to a text file
    nums.saveAsTextFile("hdfs://file.txt")

Working with Key-Value Pairs

Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.

Python:

    pair = (a, b)
    pair[0]   # => a
    pair[1]   # => b

Scala:

    val pair = (a, b)
    pair._1   // => a
    pair._2   // => b

Java:

    Tuple2 pair = new Tuple2(a, b);
    pair._1   // => a
    pair._2   // => b

Some Key-Value Operations

    pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
    pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
    pets.groupByKey()                      # => {(cat, [1, 2]), (dog, [1])}
    pets.sortByKey()                       # => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.

Example: Word Count

    lines = sc.textFile("hamlet.txt")
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)

Other Key-Value Operations

    visits = sc.parallelize([("index.html", "1.2.3.4"),
                             ("about.html", "3.4.5.6"),
                             ("index.html", "1.3.3.1")])

    pageNames = sc.parallelize([("index.html", "Home"),
                                ("about.html", "About")])

    visits.join(pageNames)
    # ("index.html", ("1.2.3.4", "Home"))
    # ("index.html", ("1.3.3.1", "Home"))
    # ("about.html", ("3.4.5.6", "About"))

    visits.cogroup(pageNames)
    # ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
    # ("about.html", (["3.4.5.6"], ["About"]))

Setting the Level of Parallelism

All the pair RDD operations take an optional second parameter for the number of tasks:

    words.reduceByKey(lambda x, y: x + y, 5)
    words.groupByKey(5)
    visits.join(pageViews, 5)

Using Local Variables

Any external variables you use in a closure will automatically be shipped to the cluster:

    query = sys.stdin.readline()
    pages.filter(lambda x: query in x).count()

Some caveats:

- Each task gets a new copy (updates aren't sent back)
- Variable must be Serializable / Pickle-able
- Don't use fields of an outer object (ships all of it!)

Closure Mishap Example

    class MyCoolRddApp {
      val param = 3.14
      val log = new Log(...)
      ...

      def work(rdd: RDD[Int]) {
        rdd.map(x => x + param)
           .reduce(...)
      }
    }

Result: NotSerializableException: MyCoolRddApp (or Log)

How to get around it:

    class MyCoolRddApp {
      ...

      def work(rdd: RDD[Int]) {
        val param_ = param
        rdd.map(x => x + param_)
           .reduce(...)
      }
    }

This references only the local variable instead of this.param.

Other RDD Operators

    map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
    reduce, count, fold,
    reduceByKey, groupByKey, cogroup, cross, zip,
    sample, take, first, partitionBy, mapWith, pipe, save, ...

More details: spark-project.org/docs/latest/

Demo

Software Components

- Spark runs as a library in your program (1 instance per app)
- Runs tasks locally or on cluster
  - Mesos, YARN or standalone mode
- Accesses storage systems via Hadoop InputFormat API
  - Can use HBase, HDFS, S3, ...

[Figure labels: Your application, SparkContext, Local threads, Cluster manager, Worker (Spark executor) x2, HDFS or other storage]

Task Scheduler

- General task graphs
- Automatically pipelines functions
- Data locality aware
- Partitioning aware to avoid shuffles

[Figure legend: cached partition, RDD]

Advanced Features

- Controllable partitioning
  - Speed up joins against a dataset
- Controllable storage formats
  - Keep data serialized for efficiency, replicate to multiple nodes,
    cache on disk
- Shared variables: broadcasts, accumulators
- See online docs for details!

Add Spark to Your Project

- Scala / Java: add a Maven dependency on
    groupId: org.apache.spark
    artifactId: spark-core_2.9.3
    version: 0.8.0
- Python: run program with our pyspark script

Create a SparkContext

Scala:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))

Java:

    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext sc = new JavaSparkContext(
        "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python:

    from pyspark import SparkContext

    sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])

The four arguments are, in order:

- Cluster URL, or local / local[N]
- App name
- Spark install path on cluster
- List of JARs with app code (to ship)

Example: PageRank

- Good example of a more complex algorithm
  - Multiple stages of map & reduce
- Benefits from Spark's in-memory caching
  - Multiple iterations over the same data

Basic Idea

Give pages ranks (scores) based on links to them:

- Links from many pages -> high rank
- Link from a high-rank page -> high rank

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

Algorithm

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 x contribs

[Figure: four-page example graph shown over successive iterations; every page starts at rank 1.0]
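
The three steps of the algorithm can be tried out in plain Python, without Spark, to make the arithmetic concrete. This is a minimal single-machine sketch, not the deck's implementation: the function name pagerank, the dict-of-lists graph representation, and the four-page link graph are all assumptions made for illustration.

```python
def pagerank(links, iterations=10):
    """Iterate the rank update on a small in-memory link graph.

    links maps each page to the list of pages it links to (a hypothetical
    representation); every linked page is assumed to appear as a key.
    """
    # Step 1: start each page at a rank of 1.
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Step 2: page p contributes rank_p / |neighbors_p| to each neighbor.
        contribs = {page: 0.0 for page in links}
        for page, neighbors in links.items():
            if not neighbors:
                continue  # a page without outgoing links contributes nothing
            share = ranks[page] / len(neighbors)
            for dest in neighbors:
                contribs[dest] += share
        # Step 3: set each rank to 0.15 + 0.85 * (sum of contributions).
        ranks = {page: 0.15 + 0.85 * total for page, total in contribs.items()}
    return ranks


# Hypothetical four-page link graph, for illustration only.
example_links = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["a", "d"],
    "d": ["a"],
}

if __name__ == "__main__":
    for page, rank in sorted(pagerank(example_links, iterations=2).items()):
        print(page, round(rank, 2))
```

One design difference worth noting: this sketch keeps a page that receives no contributions at rank 0.15, whereas a version built on reduceByKey over the contributions would only produce ranks for pages that received at least one contribution.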

[Figure frames: ranks 0.58, 1.0, 1.85, 0.58 after the first iteration; 0.39, 1.72, 1.31, 0.58 after the second]

Scala Implementation

    val sc = new SparkContext("local", "PageRank", sparkHome, Seq("pagerank.jar"))

    val links = // load RDD of (url, neighbors) pairs
    var ranks = // load RDD of (url, rank) pairs

    // The loop header and the join were lost in the text preview. The lines
    // marked "reconstructed" are inferred from the surviving fragments, and
    // ITERATIONS is an assumed name.
    for (i <- 1 to ITERATIONS) {                      // reconstructed
      val contribs = links.join(ranks).flatMap {      // reconstructed
        case (url, (links, rank)) =>                  // reconstructed
          links.map(dest => (dest, rank/links.size))
      }
      ranks = contribs.reduceByKey(_ + _)
                      .mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile(...)

PageRank Performance

[Chart not recoverable from the text preview]

Other Iterative Algorithms

[Chart; axis label: Time per Iteration (s)]

Getting Started

- Download Spark: spark-project.org/downloads
- Documentation and video tutorials: www.spark-project.org/documentation
- Several ways to run: local mode (just need Java), EC2, private clusters

Local Execution

- Just pass local or local[k] as master URL
- Debug using local debuggers
  - For Java / Scala, just run your program in a debugger
  - For Python, use an attachable debugger (e.g. PyDev)
- Great for development & unit tests

Cluster Execution

- Easiest way to launch is EC2:

    ./spark-ec2 -k keypair -i id_rsa.pem -s slaves [launch|stop|start|destroy] clusterName

- Several options for private clusters:
  - Standalone mode (similar to Hadoop's deploy scripts)
  - Mesos
  - Hadoop YARN
- Amazon EMR

Conclusion

- Spark offers a rich API to make data analytics fast: both fast to write and fast to run
- Achieves 100x speedups in real applications
- Growing community with 20+ companies contributing

www.spark-project.org
