
Using PySpark in a TIBCO Data Science Team Studio Notebook



PySpark is the Python API for Apache Spark. Spark is designed to run applications in parallel on a distributed cluster, which is one of the data sources you can work with in Team Studio. Because PySpark is written in Python, it can also be combined with other common open source packages to speed up development, for example by using multiple nodes to experiment with different hyperparameters in a deep learning model.

The Jupyter Notebooks in Team Studio have a helper function that makes it very easy to initialize PySpark on your cluster and read data from HDFS as a Spark DataFrame.

 

Create and open a new Notebook under Work Files in your Team Studio Workspace.

Click on the Data menu.

Select "Initialize Pyspark for Cluster...". This will automatically create a cell with the following code (the information will be specific to your cluster):

 

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
import os

# Modify the line below to change the number of executors and the amount of memory that they use
# You can run Spark in local mode by changing 'yarn-client' to 'local', and setting the 'YARN_CONF_DIR' to ''
os.environ['PYSPARK_SUBMIT_ARGS'] = "--master yarn-client --num-executors 1 --executor-memory 1g --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.11:3.0.1 pyspark-shell"

os.environ['YARN_CONF_DIR'] = '/data/hdfs_configs/xx-xx.xx.xx.xx'
cc.datasource_name = 'Your datasource name'

# Each worker node in the cluster needs Python 2.7.
# If this is not the default Python on the node, provide the Python path here
# os.environ['PYSPARK_PYTHON'] = ''

# Do not remove or modify the following line:
# [[performPysparkInit(15)]]

# This environment variable has the value 'workflow' when the notebook is being executed as part of an analytics workflow
os.environ['NOTEBOOK_EXECUTION_ENVIRONMENT'] = 'notebook'

# This will stop the SparkContext if there is one left over from a different notebook execution
try:
    sc.stop()
except NameError:
    pass

APP_NAME = 'PySpark example.ipynb-my_team_studio_username'
conf = SparkConf().setAppName(APP_NAME)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
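Run this cell to start Spark. As a quick sanity check, you can print the Spark version and the master that the context connected to (a minimal sketch using the 'sc' variable created by the generated cell):

# Confirm that the SparkContext started and which master it is using
print(sc.version)
print(sc.master)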

 

Specify the path to your data and read it into a Spark DataFrame:

 

df = sqlContext.read.csv("/your/file/path.csv", header=True)

You can then perform any operations on 'df' using PySpark.
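For example, a short sketch of typical DataFrame operations; the column names 'amount' and 'category' below are placeholders rather than columns from any particular dataset:

# Inspect the inferred schema, then filter and aggregate
# ('amount' and 'category' are placeholder column names; substitute your own)
df.printSchema()
df.filter(df['amount'] > 100).groupBy('category').count().show()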

If you do not have access to a Hadoop cluster, you can run your PySpark job in local mode. To do that, you can still use the helper function, but change the following parameters:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local pyspark-shell'

os.environ['YARN_CONF_DIR'] = ''
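Putting these pieces together, a minimal local-mode cell might look like the sketch below; the app name and file path are placeholders, not values generated by Team Studio:

import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Run Spark locally instead of on a YARN cluster
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local pyspark-shell'
os.environ['YARN_CONF_DIR'] = ''

conf = SparkConf().setAppName('PySpark local example')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# In local mode, paths without an HDFS scheme resolve against the local file system
df = sqlContext.read.csv('/path/to/local/file.csv', header=True)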
