Jump to content

IronPython for importing multiple files with header rows

Sundaram Sridharan

Recommended Posts

I need to import multiple text files with header rows followed by name row and data. The number of header rows and number of columns in data can vary from one set to next but will be the same within a set that we are trying to import. I liked ideas expressed here


however I am not sure CreateFileDataSource can skip header lines. To add to this, I also need to import the file name as an additional column and add them to the final table created. I am looking at TextFileDataSource class. I can scan the file in ironpython to find where the name row is located and also scan a few rows of data to find column data types and can set up textdatareadersettings. I think I can use table.AddRows to add subsequent blocks of data from successive files. what I don't know how to do is create a new column and add the file name as data under it. I explored addcalculatedcolumn etc but not sure how to make it work for me

Link to comment
Share on other sites

Combining my previous answer with this article I just found:


..this seemed to work. I added extra comment rows to my input files and it did not seem to create a problem (they were skipped automatically) but let me know if it is. Adding the origin column is the tricky bit, it seems that you specify it differently for the first, second, then subsequent files.

[The indentation in this loaded file snippet still looks a bit funny]

# Copyright © 2022. TIBCO Software Inc. Licensed under TIBCO BSD-style license. import clrimport Spotfire.Dxp.Data.DataTablefrom Spotfire.Dxp.Data.Import import *from Spotfire.Dxp.Data import * #select files dynamicallyclr.AddReference("System.Windows.Forms")from System.Windows.Forms import OpenFileDialogfrom Spotfire.Dxp.Data import *myDataManager = Document.Datad1 = OpenFileDialog()d1.Multiselect=True #lets you select more than one filed1.InitialDirectory='C:\MyData' d1.ShowDialog()files=d1.FileNames #load the first file into the desired output tableds=myDataManager.CreateFileDataSource(files[0])table_name='myImportedData'try: #if it exists, replace it table = Document.Data.Tables[table_name] table.ReplaceData(ds)except: #if not, create a new one table=Document.Data.Tables.Add(table_name, ds) #name of column to contain the origin of dataorigin='OriginOfData'#if it exists, remove it (avoid duplications)cols=table.Columnstry: cols.remove[origin]except: pass #load the second file, and now specity an origin column, the new value, and the previous valueif len(files)>1: ds=myDataManager.CreateFileDataSource(files[1]) settings = AddRowsSettings(table, ds, origin, files[1],files[0]) table.AddRows(ds, settings) #load third etc. filesif len(files)>2: for i in range(2,len(files)): ds=myDataManager.CreateFileDataSource(files) #specify only the new value for the origin column settings = AddRowsSettings(table,ds,origin,files) table.AddRows(ds,settings)
Link to comment
Share on other sites

thank you - I haven't tried the entire script yet to see how it works for me but I hit a wall with header lines - the problem is the header lines had lots of spaces and it started using space as separator and included the entire file content - whereas the separator for data was comma - all of the data was clumped into the first column. I tried the spotfire file import and the default behavior was the same (as if it was using CreateFileDataSource behind the scenes) - I then tried the spotfire import - this time changing the delimiter to comma, it imported the row before the actual name row as name row as it had two entries separated by comma in the header line. Hence I need to specify the code to explicitly skip certain number of lines as header rows - I am not sure if CreateFileDataSource has any such option.

Also I noticed that there is a new column added "Origin" to indicate file names - I am not sure where in the code is the content of that column defined.

Link to comment
Share on other sites

It turns out that this solution worked wonderfully for me. I did use textfiledatasource and my own code to automatically figure out number of header rows, delimiter, nameRow and column types (all of these come as freebie's in createfiledatasource). However the thing that made all the difference is options for AddRowsSettings - did not know it took an argument for specifying the file name. I played with it for a while and was able to use just the base file name without the path and file suffix. Kudos for suggesting this solution

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
  • Create New...