PyCon UK


Big Data: Using Spark from Python and Jupyter

Andrew Crozier

Monday 17th, 15:30 (Ferrier Hall)


A talk (25 minutes)

Apache Spark is a de facto standard tool for processing big data. It can process massive datasets, often much faster than Apache Hadoop MapReduce, especially for iterative algorithms such as those used in common machine learning tasks.

Spark is also relatively easy to get started with and to use for exploratory data analysis. It offers interactive Scala, Python and R shells in which a user can quickly try out different ways of manipulating their data, avoiding the slow write, compile, submit, wait, inspect-output loop of other frameworks. Spark also provides a high-level DataFrame API and scalable machine learning libraries, making it a compelling tool for data scientists.

In this talk I’ll introduce you to some tools for interacting with an Apache Spark cluster from Python and Jupyter, including:
* Connecting to a Spark cluster from a Jupyter notebook using sparkmagic
* Loading data into Spark from external sources
* Basics of data manipulation in Spark
* Getting data out of Spark and into pandas DataFrames for plotting, further modelling or analysis
* Introduction to Spark machine learning tools
* How to run Spark code from Python scripts, Plotly Dash web apps and Flask web APIs using the pylivy client library
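
Both sparkmagic and pylivy reach the cluster through Apache Livy's REST interface. As a rough illustration of what such a client sends, the sketch below builds (but does not send) the two basic requests: one to start an interactive PySpark session and one to run a snippet of code in it. The server URL, the session id and the helper name `build_request` are assumptions made for this example, not details from the talk.

```python
import json
from urllib.request import Request

# Hypothetical Livy endpoint; replace with your own server's address.
LIVY_URL = "http://localhost:8998"


def build_request(path, payload):
    """Build (without sending) a JSON POST request for the Livy REST API."""
    return Request(
        url=f"{LIVY_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 1. Ask Livy to start an interactive PySpark session.
create_session = build_request("/sessions", {"kind": "pyspark"})

# 2. Submit a snippet of code to run in that session on the cluster.
#    Session id 0 is a placeholder; the real id comes from Livy's reply to step 1.
#    `spark` is assumed to be the SparkSession that Livy provides in the session.
code = "spark.range(100).selectExpr('sum(id) AS total').show()"
run_statement = build_request("/sessions/0/statements", {"code": code})

# Passing either request to urllib.request.urlopen would send it to the server.
```

In practice a client library hides these calls, along with polling for the statement's result, behind a session object.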

I am an engineer at ASI, a London-based data science consultancy and the developer of the SherlockML data science platform. This talk comes directly from our experience helping our consultants and SherlockML users to use Spark effectively as part of an integrated data science workflow.


  • The speaker suggested this session is suitable for data scientists.
