- Use pySpark with IPythonKernel
- Sparkmagic kernel
- Apache Toree (IBM Kernel)
Using PySpark with IPythonKernel
This is the easiest, most cheapskate way to get going. Simply create a JSON file to define a new kernel at /usr/local/share/jupyter/kernels/pyspark/kernel.json
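A minimal sketch of such a kernel.json, assuming a CDH-style layout under /usr/lib/spark (the py4j zip version and paths vary by Spark release, so check your install):

```json
{
  "display_name": "pySpark",
  "language": "python",
  "argv": [
    "python", "-m", "IPython.kernel", "-f", "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/usr/lib/spark",
    "PYTHONPATH": "/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip",
    "PYTHONSTARTUP": "/usr/lib/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
  }
}
```

PYTHONSTARTUP points at Spark's shell.py so the notebook starts with `sc` already defined, just like the pyspark shell.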
The above assumes CDH is installed and that we are running Spark in yarn-client mode.
If you don't have a Hadoop cluster set up, the same approach should work with "--master local" and a plain copy of Spark.
This method takes only minutes to get going. You'll need to use either yarn-client mode or Spark standalone (yarn-cluster won't work).
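For the no-cluster case, only the env block of the kernel.json needs to change: point the master at local and at wherever the Spark tarball was unpacked (the /opt/spark path and py4j version here are assumptions, adjust for your download):

```json
"env": {
  "SPARK_HOME": "/opt/spark",
  "PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.8.2.1-src.zip",
  "PYTHONSTARTUP": "/opt/spark/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"
}
```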
Sparkmagic
Sparkmagic is a kernel that communicates via REST with Livy, a Spark job server that ships with Hue.
Sparkmagic is at version 0.1 (a public preview) and Livy is currently in beta, a little too bleeding-edge for me, so I didn't spend a lot of time with it, although someone on my team was able to get it working.
Livy is a promising project that looks like it will enable interactive Spark even in yarn-cluster mode. Livy is the REST backend for Spark Notebooks in Hue. It does not yet seem to support security.
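To give a feel for what Livy's REST interface looks like, here is a sketch of the two request bodies involved: one to POST /sessions to start a Spark session, and one to POST /sessions/{id}/statements to run code in it. The host and port are assumptions (8998 is Livy's default); no server is contacted here, this just builds the payloads:

```python
import json

# Assumed Livy endpoint -- adjust host/port for your cluster.
LIVY_URL = "http://localhost:8998"

def session_payload(kind="pyspark"):
    # Body for POST /sessions: asks Livy to start a Spark session
    # of the given kind (e.g. "pyspark" or "spark" for Scala).
    return {"kind": kind}

def statement_payload(code):
    # Body for POST /sessions/{id}/statements: runs a code snippet
    # inside the already-running session.
    return {"code": code}

print(json.dumps(session_payload()))
print(json.dumps(statement_payload("sc.parallelize(range(100)).count()")))
```

Sparkmagic wraps exactly this kind of exchange, which is what lets the notebook run against a remote cluster instead of a local SparkContext.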
Both projects seem like they will improve rapidly over the coming months.
Apache Toree
Toree is an Apache Incubating project originally created by developers at IBM. This project also seems early, but maybe a little further along than Sparkmagic.
I highly recommend this very insightful presentation:
- The beginning explains Jupyter notebook kernels
- 38 minutes in, he starts demoing an application called Livesheets, which has some cool notebook-like functionality
- 43 minutes in, he shows a REST server for talking to the kernel, and then he puts the kernel into Zeppelin
- In the Q&A he talks about using pySpark directly
You can try out Toree at try.jupyter.org: simply choose "New -> Spark" and then try typing in some example code.
Toree looks like it officially supports only Scala so far, but you can see references to pySpark, Spark SQL, and R in the code base, so broader support seems like only a matter of time.
References
- https://github.com/ibm-et/spark-kernel/wiki
- http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark-2-2/
- https://github.com/cloudera/hue/tree/master/apps/spark/java
- http://arnesund.com/2015/09/21/spark-cluster-on-openstack-with-multi-user-jupyter-notebook/
- http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/
