Tuesday, February 9, 2016

What Spark Notebook should I use?

I was recently looking at notebook applications for interactive Spark and was surprised at both the number of options and the immaturity of nearly all of them.

  1. Databricks
  2. Spark-notebook.io
  3. Hue Spark Notebooks
  4. Jupyter
  5. Zeppelin

Databricks

Databricks Notebooks appear to be the nicest for working with interactive Spark, but alas they are not open source.  Every time I see them they are noticeably improved.  So far they are definitely setting the bar.

(Embedded video: Spark Summit 2015 demo, "Creating an end-to-end machine learning data pipeline with Databricks," from Databricks on Vimeo.)


spark-notebook.io

This is a nice-looking project by Andy Petrella.  It is open source with an option for commercial support, and it seems like a good choice for interactive Spark using Scala.  The project is very active, with many releases, and is packaged in a variety of ways to work with different Spark versions.

Unfortunately, I cared more about Python support, so I removed this from my list for now.

Hue Spark Notebooks

When I read that Hue was adding a Spark Notebook feature, I was very excited.  I already had Hue on my system; maybe I already had a Spark Notebook I could use and didn't even know it.




Livy is the REST server backing these notebooks.  It is a promising project that looks like it will enable interactive Spark even in yarn-cluster mode.
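To make that concrete, here is a rough sketch of driving Livy over HTTP, which is roughly the kind of exchange the Hue notebook performs behind the scenes.  It assumes a Livy server on its default port (8998) and uses the /sessions and /statements endpoints via the requests library; payload fields and state names can vary between Livy versions, so treat it as illustrative rather than definitive.

```python
import json
import time

import requests

# Assumption: a Livy server is running locally on its default port.
LIVY_URL = "http://localhost:8998"
HEADERS = {"Content-Type": "application/json"}

# Ask Livy to start a PySpark session.  Livy launches the Spark driver for us,
# which is what makes interactive use in yarn-cluster mode possible.
resp = requests.post(LIVY_URL + "/sessions",
                     data=json.dumps({"kind": "pyspark"}),
                     headers=HEADERS)
session_url = LIVY_URL + resp.headers["Location"]

# Wait for the session to come up before submitting any code.
while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
    time.sleep(1)

# Submit a statement, exactly as a notebook cell would.
resp = requests.post(session_url + "/statements",
                     data=json.dumps({"code": "sc.parallelize(range(100)).sum()"}),
                     headers=HEADERS)
statement_url = LIVY_URL + resp.headers["Location"]

# Poll until the statement finishes, then print whatever output came back.
while True:
    statement = requests.get(statement_url, headers=HEADERS).json()
    if statement["state"] == "available":
        print(statement["output"])
        break
    time.sleep(1)
```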


I was very enamored with Livy until I saw Apache Toree and became equally enamored with it.

Unfortunately, as of CDH 5.5.1, this feature is still very much in beta.

Jupyter

Jupyter is a long-lived project for Python notebooks.  More recently, the project has expanded to include more languages and now boasts an impressive list of kernels.

Jupyter just needs a kernel to provide interactive Spark, and there are three options:
  1. Use pySpark with the IPython kernel (sketched below)
  2. Sparkmagic kernel
  3. Apache Toree (formerly the IBM Spark Kernel)

I've explored these options more here.
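As a taste of the first option, here is a minimal sketch of what a cell looks like once pySpark is launched with the Jupyter/IPython driver (for example via the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables).  The SparkContext named sc is created by pySpark at startup; exact setup details vary by Spark version, so this is illustrative only.

```python
# Run inside a Jupyter cell after starting the notebook with something like:
#   PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook pyspark
# pySpark pre-creates a SparkContext named `sc`, so no setup code is needed.

rdd = sc.parallelize(range(1000))

# A trivial distributed computation to confirm the context is wired up.
evens = rdd.filter(lambda x: x % 2 == 0)
print(evens.count())   # 500
print(evens.take(5))   # [0, 2, 4, 6, 8]
```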

Zeppelin

Zeppelin is a newer Apache Incubator project building notebook functionality on the JVM.  It appears to be partly inspired by Databricks notebooks.

This is a nice-looking project that is definitely going to keep getting better.  Things that caught my eye right away were the built-in pivot tables and the interpreter groups, which allow sharing things like a SparkContext across multiple languages (pySpark, Spark SQL, etc.).
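To illustrate the interpreter-group idea, here is a rough sketch of a Zeppelin %pyspark paragraph that registers a temporary table which a separate %sql paragraph can then query and pivot, all through the shared SparkContext/SQLContext.  The table name and data are made up, and the sqlContext/registerTempTable API reflects Spark 1.x, so adjust for your setup.

```python
%pyspark
# Zeppelin's Spark interpreter group injects `sc` and `sqlContext` for us.
df = sqlContext.createDataFrame(
    [("notebooks", 5), ("kernels", 3), ("notebooks", 2)],
    ["topic", "n"])
df.registerTempTable("posts")

# Because the interpreter group shares one SparkContext/SQLContext, a separate
# %sql paragraph such as:
#     SELECT topic, SUM(n) AS total FROM posts GROUP BY topic
# sees the same "posts" table and renders the result in the pivot/chart views.
```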




Conclusion

There is no clear winner.  Each tool might be the best choice depending on the scenario.  Additionally, all of these tools appear to be rapidly evolving.
