Learning Spark: Bridging HPC and Big Data Analytics
Abstract
A growing number of users want to combine HPC and big data analytics on the same infrastructure. To support these new kinds of workloads, it is essential to understand the best that both worlds have to offer. This workshop will focus on big data analytics.
In 2015, Apache Spark had over 1000 contributors, making it one of the most active
open source big data projects. Based on the concept of in-memory data
processing, it can achieve performance up to 100 times that of Hadoop/MapReduce. Spark can interface with multiple data sources, including POSIX file systems, and can run in standalone mode, making it easy to integrate and use on shared HPC infrastructure such as that deployed by Compute Canada.
In this workshop, we will first introduce the main principles behind Spark. Then,
using the Jupyter notebook platform and Python, participants will complete
a series of guided exercises that walk them through Spark's Python API
and have them analyze a real-world dataset interactively.
Participants will need to have access to a modern web browser to do
the exercises.
Authors
- Félix-Antoine Fortin (Calcul Québec - Université Laval)
- Frédérick Lefebvre (Calcul Québec - Université Laval)
Topic Areas
Advanced Research Computing (ARC): ARC applications in any discipline (i.e. the sciences,
Advanced Research Computing (ARC): Innovations in computational research (i.e. software, s
Session
WK3 » ARC Workshop 3 (13:15 - Tuesday, 21st June, BS-M229)