
Using PySpark in a TIBCO Data Science Team Studio Notebook



PySpark is the Python API for Apache Spark. Spark is designed to run applications in parallel on a distributed cluster, which is one of the data sources you can work with in Team Studio. Because PySpark is written in Python, it can also be combined with other common open source packages to speed up development, for example by using multiple nodes to experiment with different hyperparameters in a deep learning model.

The Jupyter Notebooks in Team Studio have a helper function that makes it very easy to initialize PySpark on your cluster and read data from HDFS as a Spark DataFrame.

 

Create and open a new Notebook under Work Files in your Team Studio Workspace.

Click on the Data menu.

Select "Initialize Pyspark for Cluster...". This will automatically create a cell with the following code (the information will be specific to your cluster):

 

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
import os

# Modify the line below to change the number of executors and the amount of memory that they use
# You can run Spark in local mode by changing 'yarn-client' to 'local', and setting the 'YARN_CONF_DIR' to ''
os.environ['PYSPARK_SUBMIT_ARGS'] = "--master yarn-client --num-executors 1 --executor-memory 1g --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.11:3.0.1 pyspark-shell"

os.environ['YARN_CONF_DIR'] = '/data/hdfs_configs/xx-xx.xx.xx.xx'
cc.datasource_name = 'Your datasource name'

# Each worker node in the cluster needs Python 2.7.
# If this is not the default Python on the node, provide the Python path here
# os.environ['PYSPARK_PYTHON'] = ''

# Do not remove or modify the following line:
# [[performPysparkInit(15)]]

# This environment variable has the value 'workflow' when the notebook is being executed as part of an analytics workflow
os.environ['NOTEBOOK_EXECUTION_ENVIRONMENT'] = 'notebook'

# This will stop the SparkContext if there is one left over from a different notebook execution
try:
    sc.stop()
except NameError:
    pass

APP_NAME = 'PySpark example.ipynb-my_team_studio_username'
conf = SparkConf().setAppName(APP_NAME)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
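Run this cell to start Spark. As a quick sanity check, you can print the Spark version and the master that the context connected to (a minimal sketch using the 'sc' variable created by the generated cell):

# Confirm that the SparkContext started and which master it is using
print(sc.version)
print(sc.master)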

 

Specify the path to your data and read it into a Spark DataFrame:

 

df = sqlContext.read.csv("/your/file/path.csv", header=True)

You can then perform any operations on 'df' using PySpark.
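For example, a short sketch of typical DataFrame operations; the column names 'amount' and 'category' below are placeholders rather than columns from any particular dataset:

# Inspect the inferred schema, then filter and aggregate
# ('amount' and 'category' are placeholder column names; substitute your own)
df.printSchema()
df.filter(df['amount'] > 100).groupBy('category').count().show()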

If you do not have access to a Hadoop cluster, you can run your PySpark job in local mode. To do that, you can still use the helper function, but change the following parameters:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local pyspark-shell'

os.environ['YARN_CONF_DIR'] = ''
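Putting these pieces together, a minimal local-mode cell might look like the sketch below; the app name and file path are placeholders, not values generated by Team Studio:

import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Run Spark locally instead of on a YARN cluster
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local pyspark-shell'
os.environ['YARN_CONF_DIR'] = ''

conf = SparkConf().setAppName('PySpark local example')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# In local mode, paths without an HDFS scheme resolve against the local file system
df = sqlContext.read.csv('/path/to/local/file.csv', header=True)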
