I work with Jupyter notebooks and Python kernels that have a SparkContext. A coworker wrote Python code that bridges Spark events with ipykernel events. When we import their module from a notebook cell, it works in all the combinations we need to support: Python 2.7 and 3.5, Spark 1.6 and 2.x, Linux only.
Now we want to include this code automatically for all Python kernels, so I import it from our sitecustomize.py. This works fine for Spark 2.x, but not for Spark 1.6: kernels with Spark 1.6 no longer get sc, and something gets confused enough that unrelated imports, for example matplotlib.cbook, fail. If I delay the import by a few seconds using a timer, it works. Apparently the code in sitecustomize.py runs too early to import the module that connects Spark to ipykernel.
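For reference, the timer workaround in sitecustomize.py looks roughly like this (a minimal sketch; spark_ipykernel_bridge is a placeholder for our actual module name, and the 5-second delay is arbitrary):

    # sitecustomize.py -- sketch of the delayed import that happens to work
    import threading

    def _deferred_import():
        try:
            import spark_ipykernel_bridge  # placeholder for the coworker's module
        except Exception:
            pass  # don't break kernels that have no Spark at all

    # An arbitrary delay is enough to make Spark 1.6 kernels work again,
    # which is exactly why I want a real "startup finished" hook instead.
    threading.Timer(5.0, _deferred_import).start()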
I am looking for a way to defer this import until Spark and/or ipykernel are fully initialized, but it still needs to run as part of the kernel launch, before any notebook cells are executed. I found this trick to delay code execution until sys.argv is initialized, but I don't think it can work with globals like sc, given that Python globals are still local to their modules. The best I can come up with so far is a timer that checks every second whether certain modules are present in sys.modules, but that is not very reliable, because I don't know how to distinguish a module that is fully initialized from one that is still in the process of loading.
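Concretely, the polling idea would look something like this (sketch only; the module names are just examples, and the "fully initialized" check is the part I don't know how to do):

    # sitecustomize.py -- sketch of the sys.modules polling idea
    import sys
    import threading

    def _poll():
        # Presence in sys.modules does not mean the module has finished loading,
        # which is why this feels fragile.
        if 'pyspark.context' in sys.modules and 'ipykernel.kernelapp' in sys.modules:
            import spark_ipykernel_bridge  # placeholder for our module
        else:
            threading.Timer(1.0, _poll).start()

    threading.Timer(1.0, _poll).start()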
Any ideas on how to hook into startup code that runs at the end of the kernel launch? A solution specific to pyspark and/or ipykernel would meet my needs.
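To make the question concrete, the kind of hook I am hoping for would have roughly this shape (hypothetical sketch using IPython's events API; I don't know whether anything like this can be registered early enough, since get_ipython() is presumably still None while sitecustomize.py runs):

    from IPython import get_ipython

    ip = get_ipython()
    if ip is not None:
        def _run_once(*args, **kwargs):
            # Fire just before the first cell executes, then remove ourselves.
            ip.events.unregister('pre_run_cell', _run_once)
            import spark_ipykernel_bridge  # placeholder for our module

        ip.events.register('pre_run_cell', _run_once)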