Using R for Scalable Data Science: Single Machines to Hadoop Spark Clusters

author: Hang Zhang, Microsoft
author: Vanja Paunić, Microsoft
published: Nov. 14, 2017,   recorded: August 2017,   views: 1253

Related Open Educational Resources

Related content

Report a problem or upload files

If you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Lecture popularity: You need to login to cast your vote.

 Watch videos:   (click on thumbnail to launch)

Watch Part 1
Part 1 1:40:32
Watch Part 2
Part 2 1:30:28


In this tutorial, we will demonstrate how to create scalable, end-to-end data analysis processes in R on single machines as well as in-database in SQL Server and on Hadoop clusters running Spark. We will provide hands-on exercises as well as code in a public GitHub repository for attendees to adopt in their data science practice. In particular, the attendees will see how to build, persist, and consume machine learning models using distributed machine learning functions in R.

R is one of the most used languages in the data science, statistical and machine learning (ML) community. Although open-source R (CRAN library) now has in excess of 10,000 packages and functions for statics and ML, when it comes to scalable analysis using R, or deployment of trained models into production, many data scientists are blocked or hindered by (a) its limitations of available functions to handle large datasets efficiently, and (b) knowledge about the appropriate computing environments to scale R scripts from desktop analysis to elastic and distributed cloud services. In this tutorial, we will discuss how to create end-to-end data science solutions that utilize distributed compute resources. During the tutorial, we will provide presentations, worked-out examples, and hands-on exercises with sample code. In addition, we will provide a public GitHub code repository that attendees will be able to access and adapt to their own practice. We believe this tutorial will be of strong interest to a large and growing community of data scientists and developers who are using R for creating and deploying analytical solutions.

Link this page

Would you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !

Write your own review or comment:

make sure you have javascript enabled or clear this field: