Scalable R on Spark
author: Debraj GuhaThakurta, Microsoft
author: Robert Horton, Microsoft
author: Mario Inchiosa, Microsoft
author: Srini Kumar, Microsoft
author: Vanja Paunić, Microsoft
author: Hang Zhang, Microsoft
author: Mengyue Zhao, Microsoft
published: Sept. 16, 2016, recorded: August 2016, views: 3362
R is one of the most popular languages in the data science, statistics, and machine learning (ML) communities. However, when it comes to scalable data analysis and ML in R, many data scientists are hindered by (a) the limited set of available functions that handle large data sets efficiently, and (b) limited knowledge of the computing environments needed to scale R scripts from exploratory analysis on the desktop to elastic, distributed cloud services. In this tutorial we will discuss distributed compute environments and end-to-end solutions for R. We will cover the topics through presentations and hands-on examples with sample code. In addition, we will provide a public code repository that attendees can access and adapt to their own practice. We believe this tutorial will be of strong interest to the large and growing community of data scientists and developers who use R for data analysis and modeling.
Prerequisites: A laptop with a web browser and an SSH client that supports port forwarding. Access to cloud-based clusters will be provided. For R scripts, download details, and suggested reading, see the Readme.md file at https://github.com/Azure/Azure-MachineLearning-DataScience/tree/master/Misc/KDDCup2016.