Haiying Che
Institute of Data Science and Knowledge Engineering
School of Computer Science
Beijing Institute of Technology

Spark MLlib

Overview slide: the big-data technology stack, from storage to application: Data Storing System; Computing Platform & Engine (where Spark MLlib sits); Data Processing System; Computing Model; Computing Algorithm; Big Data Application; plus Data Visualization and Data Products and
Data Services. Example application systems: TensorFlow, Recommendation System, Social Networking.

Why Spark MLlib?
MLlib is Apache Spark's scalable machine learning library.
- Ease of use: usable in Java, Scala, Python, and R.
- Performance: high-quality algorithms, 100x faster than MapReduce.
- Runs everywhere: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, against diverse data sources.
To support Python with Spark, the Apache Spark community released a tool, PySpark. Using PySpark, one can work with RDDs in the Python programming language.

1 Spark MLlib Algorithms
ML algorithms include:
- Classification: logistic regression, naive Bayes, ...
- Regression: generalized linear regression, survival regression, ...
- Decision trees, random forests, and gradient-boosted trees
- Recommendation: alternating least squares (ALS)
- Clustering: K-means, Gaussian mixture models (GMMs), ...
- Topic modeling: latent Dirichlet allocation (LDA)
- Frequent item sets, association rules, and sequential pattern mining

2 Spark MLlib Workflow Utilities
ML workflow utilities include:
- Feature transformations: standardization, normalization, hashing, ...
- ML Pipeline construction
- Model evaluation and hyper-parameter tuning
- ML persistence: saving and loading models and Pipelines
Other utilities include:
- Distributed linear algebra: SVD, PCA, ...
- Statistics: summary statistics, hypothesis testing, ...

3 Machine Learning Pipeline

3.1 Transformer
- Abstraction that includes feature transformers and learned models
- Transforms data into a consumable format
- Takes an input column and transforms it into an output column
- Examples: normalizing the data; tokenization (splitting sentences into words); converting categorical values into numbers

3.2 Estimator
- A learning algorithm that trains (fits) on data
- Returns a model, which is a type of Transformer
- Example: LogisticRegression.fit() returns a LogisticRegressionModel

3.3 Evaluator
- Evaluates model performance based on a certain metric (e.g., ROC, RMSE)
- Helps automate the model tuning process: comparing model performance and selecting the best model for generating predictions
- Examples: BinaryClassificationEvaluator, CrossValidator

3.4 Pipeline
A Pipeline is used to
represent an ML workflow.
- Consists of a set of stages
- Leverages the uniform API of Transformer & Estimator
- Is itself a type of Estimator: calling fit() runs the whole workflow
- Can be persisted
See: https://spark.apache.org/docs/latest/ml-pipeline.html

3.5 Parameters
MLlib Estimators and Transformers use a uniform API for specifying parameters.
- A Param is a named parameter with self-contained documentation.
- A ParamMap is a set of (parameter, value) pairs.
Parameters belong to specific instances of Estimators and Transformers. For example, if we have two LogisticRegression instances lr1 and lr2, we can build a ParamMap with both maxIter parameters specified: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20). This is useful if there are two algorithms with the maxIter parameter in a Pipeline.
There are two main ways to pass parameters to an algorithm:
- Set parameters for an instance. E.g., if lr is an instance of LogisticRegression, one could call lr.setMaxIter(10) to make lr.fit() use at most 10 iterations. This API resembles the API used in the spark.mllib package.
- Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will override parameters previously specified via setter methods.

3.6 Automating model tuning
- ParamGridBuilder
- CrossValidator (k-fold)

3.7 Model persistence

Hands-on
- Algorithms: Classification, Regression, Clustering, Dimensionality reduction
- High-level tools: Parameter tuning, Pipeline

Questions?