Tomas Jurczyk, posted June 4, 2019

Hello. I have a practical question; maybe somebody already has a solution. I have CSV files in a Hadoop folder, all with the same column structure. In Team Studio I can pick a single CSV as the data source, but I can also pick the whole folder as the data source (by dragging and dropping it onto the canvas). In that case I get one dataset containing all the data. I would also like a column holding the name of the file each row came from. Does anybody know how to do that?
Chia-Yui Lee, posted June 4, 2019

Hi Tomas,

You can use a PySpark script in a Notebook within TIBCO Data Science Team Studio (formerly TIBCO Spotfire Data Science) to achieve that. You would loop through all the files in the folder, read each file into a Spark dataframe, add a filename column to each dataframe, and then union them all to form one table.

You can also do the equivalent with a Python script that uses pandas dataframes instead of PySpark. The difference is that the data is moved to the Python environment for the manipulation. This is fine as long as the dataset is not huge.

TIBCO Data Science Team Studio provides a convenient Python helper class called Chorus Commander ('cc') with APIs for reading (and writing) data in Notebooks. Here's an example of how it is used:

    # Path of the file on HDFS
    input_path = '/myfolder/myfile.csv'

    # Read input as Spark dataframe
    input_df = cc.read_input_file(input_path, sqlContext=sqlContext, header=True, use_input_substitution=False)

When sqlContext is provided as an argument, the file is read as a Spark dataframe. If it is left out, the file is read as a pandas dataframe.

I'll be writing some posts on using PySpark in TIBCO Data Science Team Studio Notebooks and will post the links here.

Chia-Yui
TIBCO Data Science Team
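For reference, the pandas variant described above could look roughly like the sketch below. It is only an illustration under some assumptions: the helper name combine_with_filename, the column name source_file, and the throwaway local demo files are all made up here, and pd.read_csv stands in for whatever reader is available in your environment.

```python
import os
import tempfile

import pandas as pd


def combine_with_filename(paths, reader=pd.read_csv, column="source_file"):
    """Read each path, tag its rows with the file's base name, and stack them.

    `combine_with_filename`, `reader` and `column` are hypothetical names
    used for illustration; they are not part of any Team Studio API.
    """
    frames = []
    for path in paths:
        df = reader(path)
        # Record which file each row came from.
        df[column] = os.path.basename(path)
        frames.append(df)
    # All files share the same columns, so a plain row-wise concat is enough.
    return pd.concat(frames, ignore_index=True)


# Small local demo: two temporary CSVs with the same column structure.
demo_dir = tempfile.mkdtemp()
demo_files = [
    ("a.csv", "id,value\n1,10\n2,20\n"),
    ("b.csv", "id,value\n3,30\n"),
]
for name, content in demo_files:
    with open(os.path.join(demo_dir, name), "w") as fh:
        fh.write(content)

demo_paths = sorted(os.path.join(demo_dir, n) for n in os.listdir(demo_dir))
combined = combine_with_filename(demo_paths)
print(combined)
```

Inside a Team Studio Notebook, the reader argument could presumably be a small wrapper such as lambda p: cc.read_input_file(p, header=True, use_input_substitution=False), since the call returns a pandas dataframe when sqlContext is left out. The list of HDFS paths still has to be supplied by you; how to list the folder is not covered in this thread. The PySpark version has the same shape: add the column with withColumn and lit from pyspark.sql.functions, then combine the per-file dataframes with union.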
Chia-Yui Lee, posted June 6, 2019

Hi Tomas,

I've added a post here on using PySpark in Team Studio Notebooks. Hope it's useful.

https://community.spotfire.com/questions/using-pyspark-tibco-data-science-team-studio-notebook

Chia-Yui
TIBCO Data Science Team