Tuesday, February 16, 2016

Enabling interactive Spark in Jupyter Notebooks

There seem to be three ways to enable interactive Spark in Jupyter notebooks:

  1. Use pySpark with IPythonKernel
  2. Sparkmagic kernel
  3. Apache Toree (IBM Kernel)

Using pySpark with IPythonKernel

This is the easiest, most cheapskate way to get going.

Simply create a JSON file to define a new kernel at /usr/local/share/jupyter/kernels/pyspark/kernel.json.
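
A minimal sketch of such a kernel.json, assuming CDH parcels live under /opt/cloudera/parcels/CDH, that Python 2 with ipykernel is installed, and that the py4j zip version matches your Spark (check the actual filename under lib/spark/python/lib on your system):

    {
      "display_name": "pySpark",
      "language": "python",
      "argv": ["python2", "-m", "ipykernel", "-f", "{connection_file}"],
      "env": {
        "SPARK_HOME": "/opt/cloudera/parcels/CDH/lib/spark",
        "HADOOP_CONF_DIR": "/etc/hadoop/conf",
        "PYTHONPATH": "/opt/cloudera/parcels/CDH/lib/spark/python:/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip",
        "PYTHONSTARTUP": "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/shell.py",
        "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
      }
    }

The trick is that this is just a normal IPython kernel whose environment points PYTHONSTARTUP at pyspark/shell.py (which IPython executes by default), so every new notebook gets a SparkContext named sc created for it on startup.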


The above assumes CDH is installed and that we are running Spark in yarn-client mode.

If you don't have a Hadoop cluster set up, the same should work with "--master local" and a plain copy of Spark.
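
For example, the only changes to the sketch above would be pointing SPARK_HOME (and the PYTHONPATH entries) at your unpacked Spark directory and swapping the submit args to something like:

    "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"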

This method only takes minutes to get going.  You'll need to use either yarn-client mode or Spark standalone (yarn-cluster won't work).
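
Once the new kernel shows up in Jupyter, a quick smoke test from a notebook cell (this assumes the pyspark shell startup script ran and created sc, as in the sketch above):

    # sc should already exist, courtesy of PYTHONSTARTUP -> pyspark/shell.py
    print(sc.version)
    # a trivial job to prove executors are actually running
    print(sc.parallelize(range(100)).map(lambda x: x * x).sum())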

Sparkmagic

Sparkmagic is a kernel that communicates via REST with Livy, a Spark job server that comes with Hue.

Sparkmagic is at version 0.1, a public preview, and Livy is currently in beta; that's a little too bleeding edge for me, so I didn't spend a lot of time with it, although someone on my team was able to get it working.

Livy is a promising project that looks like it will enable interactive Spark even in yarn-cluster mode.  Livy is the REST backend for Spark Notebooks in Hue.  It does not yet seem to support security.
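
To get a feel for what Livy exposes, here is a minimal sketch in Python of driving a pySpark session over its REST API; it assumes a Livy server at localhost:8998 (its default port) and, per the note above, no security:

    import json
    import time
    import requests

    livy = "http://localhost:8998"
    headers = {"Content-Type": "application/json"}

    # ask Livy for a new interactive pySpark session
    r = requests.post(livy + "/sessions",
                      data=json.dumps({"kind": "pyspark"}), headers=headers)
    session = "{}/sessions/{}".format(livy, r.json()["id"])

    # wait for the session to spin up on the cluster
    while requests.get(session, headers=headers).json()["state"] != "idle":
        time.sleep(1)

    # submit a statement, then poll for its result
    r = requests.post(session + "/statements",
                      data=json.dumps({"code": "sc.parallelize(range(100)).sum()"}),
                      headers=headers)
    statement = "{}/statements/{}".format(session, r.json()["id"])
    while True:
        result = requests.get(statement, headers=headers).json()
        if result["state"] == "available":
            print(result["output"]["data"]["text/plain"])
            break
        time.sleep(1)

Sparkmagic wraps essentially this conversation in a Jupyter kernel, which is what lets the notebook process live on a different machine than the Spark driver.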


Both projects seem like they will improve rapidly over the coming months.

Apache Toree

Toree is an Apache Incubator project originally created by developers at IBM.

This project also seems early but maybe a little further along than Sparkmagic.

I highly recommend this very insightful presentation:
  • The beginning explains Jupyter notebook kernels
  • 38 minutes in, he starts demoing an application called Livesheets, which has some cool notebook-like functionality
  • 43 minutes in, he shows a REST server for talking to the kernel, and then he puts the kernel into Zeppelin
  • In the Q&A he talks about using pySpark directly

You can try out Toree at try.jupyter.org: simply choose "New -> Spark" and then try typing in some example code.

Toree looks like it only officially supports Scala thus far, but you can see references to pySpark, Spark SQL, and R in the code base, so broader language support seems like a matter of time.


Tuesday, February 9, 2016

What Spark Notebook should I use?

I was recently looking at Notebook applications for doing interactive Spark and was surprised at both the number of options and the immaturity of nearly all of them.

  1. Databricks
  2. Spark-notebook.io
  3. Hue Spark Notebooks
  4. Jupyter
  5. Zeppelin

Databricks

Databricks Notebooks appear to be the nicest for working with interactive Spark, but alas they are not open source.  Every time I see them they are noticeably improved.  So far they are definitely setting the bar.

Spark Summit 2015 demo: "Creating an end-to-end machine learning data pipeline with Databricks" (from Databricks on Vimeo).


spark-notebook.io

This is a nice looking project by Andy Petrella.  It is open source with an option for commercial support.  It seems like a good option for interactive Spark using Scala.  The project seems very active, with many releases, and is packaged in a variety of ways to work with different versions.

Unfortunately, I cared more about Python support, so I removed this from my list for now.

Hue Spark Notebooks

When I read that Hue was adding a Spark Notebook feature, I was very excited.  I already had Hue on my system; maybe I already had a Spark Notebook I could use and didn't even know it.

Livy is the REST server backing these notebooks.  It is a promising project that looks like it will enable interactive Spark even in yarn-cluster mode.


I was very enamored with Livy until I saw Apache Toree and became equally enamored with it.

Unfortunately, as of CDH 5.5.1 this feature is still very much in beta.

Jupyter

Jupyter is a long-lived project for Python notebooks.  More recently, the project has expanded to include more languages and now boasts an impressive list.

Jupyter just needs a Kernel to provide interactive Spark and there are three options:
  1. Use pySpark with IPythonKernel
  2. Sparkmagic kernel
  3. Apache Toree (IBM Kernel)

I've explored these options more here.

Zeppelin

Zeppelin is a newer Apache Incubator project building notebook functionality on the JVM.  It appears to be partly inspired by Databricks notebooks.

This is a nice looking project that is definitely going to keep getting better.  Things that caught my eye right away were the built-in pivot tables, as well as the interpreter groups, which enable sharing things like a SparkContext between multiple languages (pySpark, Spark SQL, etc.).
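
As a sketch of what that sharing looks like (the paragraph markers are Zeppelin's; the table name is just a made-up example), a pySpark paragraph can register a temp table that a Spark SQL paragraph in the same interpreter group then queries:

    %pyspark
    # sqlContext is provided by the Spark interpreter group (Spark 1.x API)
    df = sqlContext.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "letter"])
    df.registerTempTable("letters")  # "letters" is a made-up name

    %sql
    SELECT letter, COUNT(*) AS n FROM letters GROUP BY letter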

Conclusion

There is no clear winner.  Each tool might be the best choice depending on the scenario.  Additionally, all of these tools appear to be rapidly evolving.