Tuesday, February 16, 2016

Enabling interactive Spark in Jupyter Notebooks

There seem to be three ways to enable interactive Spark in Jupyter notebooks:

  1. Use PySpark with IPythonKernel
  2. Sparkmagic kernel
  3. Apache Toree (formerly IBM's Spark Kernel)

Using PySpark with IPythonKernel

This is the easiest, most cheapskate way to get going.

Simply create a JSON file to define a new kernel at /usr/local/share/jupyter/kernels/pyspark/kernel.json:
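
Here is a minimal sketch of what that kernel.json can look like. The CDH parcel paths, the py4j version, and the Python binary below are assumptions for a typical CDH install; adjust them to match your cluster:

    {
      "display_name": "PySpark",
      "language": "python",
      "argv": ["python", "-m", "ipykernel", "-f", "{connection_file}"],
      "env": {
        "SPARK_HOME": "/opt/cloudera/parcels/CDH/lib/spark",
        "PYTHONPATH": "/opt/cloudera/parcels/CDH/lib/spark/python:/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.9-src.zip",
        "PYTHONSTARTUP": "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/shell.py",
        "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
      }
    }

Note that on Spark 1.4+, PYSPARK_SUBMIT_ARGS needs to end with "pyspark-shell" for the startup script to pick up the options.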

The above assumes CDH is installed and that we are running Spark in yarn-client mode.

If you don't have a Hadoop cluster set up, the same approach should work with "--master local" and a local copy of Spark.
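
Something along these lines in the env block, with /opt/spark standing in for wherever you unpacked Spark (an assumed path):

    "SPARK_HOME": "/opt/spark",
    "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"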

This method takes only minutes to get going. You'll need to use either yarn-client mode or Spark standalone (yarn-cluster won't work).

Sparkmagic

Sparkmagic is a kernel that communicates via REST with Livy, a Spark job server that comes with Hue.

Sparkmagic is at version 0.1 (a public preview) and Livy is currently in beta; that's a little too bleeding edge for me, so I didn't spend a lot of time with it, although someone on my team was able to get it working.

Livy is a promising project that looks like it will enable interactive Spark even in yarn-cluster mode; it is the REST backend for Spark notebooks in Hue. It does not yet seem to support security.
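
To give a feel for what Sparkmagic is doing under the hood, here is a rough sketch of driving Livy's REST API directly from Python. The host and port are assumptions (8998 is Livy's usual default), and the response fields follow Livy's documented API, which may shift while it is in beta:

    import json
    import time

    import requests

    LIVY_URL = "http://localhost:8998"   # assumed Livy endpoint
    HEADERS = {"Content-Type": "application/json"}

    # Ask Livy to start an interactive PySpark session (a long-lived Spark app)
    resp = requests.post(LIVY_URL + "/sessions",
                         data=json.dumps({"kind": "pyspark"}), headers=HEADERS)
    session_url = LIVY_URL + resp.headers["Location"]

    # Wait for the session to come up (YARN containers can take a while)
    while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
        time.sleep(1)

    # Submit a statement, then poll until its output is available
    resp = requests.post(session_url + "/statements",
                         data=json.dumps({"code": "sc.parallelize(range(100)).sum()"}),
                         headers=HEADERS)
    stmt_url = LIVY_URL + resp.headers["Location"]
    stmt = requests.get(stmt_url, headers=HEADERS).json()
    while stmt["state"] != "available":
        time.sleep(1)
        stmt = requests.get(stmt_url, headers=HEADERS).json()
    print(stmt["output"])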


Both projects seem like they will improve rapidly over the coming months.

Apache Toree

Toree is an incubating Apache project originally created by developers at IBM.

This project also seems early, but perhaps a little further along than Sparkmagic.

I highly recommend this very insightful presentation:
  • The beginning explains Jupyter notebook kernels
  • 38 minutes in, he starts demoing an application called Livesheets, which has some cool notebook-like functionality
  • 43 minutes in, he shows a REST server for talking to the kernel, and then he puts the kernel into Zeppelin
  • In the Q&A he talks about using PySpark directly

You can try out Toree at try.jupyter.org: simply choose "New -> Spark" and then try typing in some example code.
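
For example, a first cell like this little Scala sketch should work, since the Toree kernel preconfigures a SparkContext for you:

    // sc is the SparkContext the Toree kernel creates at startup
    val rdd = sc.parallelize(1 to 100)
    rdd.sum()   // evaluates to 5050.0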

Toree looks like it officially supports only Scala so far, but you can see references to PySpark, Spark SQL, and R in its code base, so broader language support seems like just a matter of time.
