Is there a way to make Spotfire automatically "guess" the DataTypes of columns when importing data using IronPython

Charles Randall 2 · June 22, 2022

Hello,

When importing data from a text file using IronPython using code such as this:

readerSettings = TextDataReaderSettings()

readerSettings.Separator = "t" # Our text file is tab-delimited

readerSettings.AddColumnNameRow(0)

readerSettings.AddIgnoreRow(1) # Ignore the bad data type row completely

# create a data source to read in the file

textDataSource = TextFileDataSource(dataPath, readerSettings)

newTable = Document.Data.Tables.Add("My Table", textDataSource)

tableSettings = DataTableSaveSettings(newTable, True, False) # UseLinkedData = True; = False

tableSettings.UseLinkedData = True # To be safe

Document.Data.SaveSettings.DataTableSettings.Add(tableSettings)is there a way to make Spotfire automatically "guess" the data types of columns (like when you open the "Import settings" window for a data source, then click "Refresh," and all the types get set automatically) Currently, every single column gets imported as a String, regardless of whether it is solely numeric or not.

I know that specifying a TypeRow is the preferred method for setting DataTypes automatically, but the data I am working with unfortunately has unhelpful type rows (which, for example, list some - but not all - numeric columns as being of type "STRING"), and they are out of my control.

Thank you very much,

Charles

Gaia Paolini · June 27, 2022

I cannot find a way to do it on import, however it is possible to change the data types on a table after it's been created.

See script below. This snippet checks for floating point. It is pretty basic, I hope you can adapt it to your needs and maybe not check all rows but a subset.

---------------------------------------

from Spotfire.Dxp.Data.Transformations import ExpressionTransformation,ColumnSelection

from Spotfire.Dxp.Data import *

table = Document.Data.Tables['MyNewTableWithWrongDataTypes']

rowsToInclude = IndexSet(table.RowCount,True)

t = ExpressionTransformation()

#get the list of columns

columns = table.Columns

#iterate on columns, collect values, try to turn them into floating point. If they are not actually floating point, do nothing, otherwise change the column type.

for cc in columns:

ccname=cc.Name

cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])

values=[]

for row in table.GetRows(rowsToInclude,cursor):

values.append(cursor.CurrentValue)

try:

values = [float(x) for x in values ]

t.ColumnReplacements.Add(

ccname,'real(['+ccname+'])',ColumnSelection(ccname)

)

except:

print ('was not real')

pass

table.AddTransformation(t)

Sundaram Sridharan · July 31, 2023

this approach works great in my case - was able to even nest a try: / except: to check for Integer first and then for real. However whenever it encounters a column name that has spaces it sometimes throws an error

ValueError: The column 'my column with spaces' does not exist in the collection. it only happens to one column in a file that has a bunch of other columns with spaces in their names

the line of code which throws this error is

cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])

Gaia Paolini · August 1, 2023

is there a way you can provide the dxp you are trying, with an anonymized dataset? If it does not happen with other columns with spaces in their names, it might be something else.

Sundaram Sridharan · August 1, 2023

it only happens to columns that have another related column - let's say you have two columns "Well" and "Well Type" - then it is able to process "Well" fine but complains that "Well Type" cannot be found in the collection - if you go around it and complete the code, it brings in a new column in table called "Well(2)" that has the same content as the "Well" column and "Well Type" is completely gone. If I comment out this piece of code, then the file imports fine with all columns intact. If it would help to share DXP then I can try to recreate a smaller version

Gaia Paolini · August 1, 2023

this is useful information. Can you send the code you are using now and I can have a look?

Sundaram Sridharan · August 1, 2023

here is part of code that imports files and try to change the data types

#load the first file into the desired output table

file0 = fileList[0]

f_name0,f_ext = os.path.splitext(os.path.basename(file0))

ds = TextFileDataSource(file0, readerSettings)

table_name="Imported Data"

try: #if it exists, replace it

table = Document.Data.Tables[table_name]

table.ReplaceData(ds)

except: #if not, create a new one

table=Document.Data.Tables.Add(table_name, ds)

#load the second file, and now specity an origin column, the new value, and the previous value

if len(fileList)>1:

file1 = fileList[1]

f_name1,f_ext = os.path.splitext(os.path.basename(file1))

ds=TextFileDataSource(file1, readerSettings)

settings = AddRowsSettings(table, ds, "File Name", f_name1,f_name0)

table.AddRows(ds, settings)

#load third etc. files

if len(fileList)>2:

for i in range(2,len(fileList)):

file = fileList

f_name,f_ext = os.path.splitext(os.path.basename(file))

ds=TextFileDataSource(file, readerSettings)

#specify only the new value for the origin column

settings = AddRowsSettings(table,ds,"File Name",f_name)

table.AddRows(ds,settings)

#change column types

rowsToInclude = IndexSet(table.RowCount,True)

t = ExpressionTransformation()

#get the list of columns

columns = table.Columns

#iterate on columns, collect values, try to turn them into floating point. If they are not actually floating point, do nothing, otherwise change the column type.

for cc in columns:

ccname=cc.Name

# print ccname

try:

cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])

except:

continue

values=[]

for row in table.GetRows(rowsToInclude,cursor):

if row.Index < 100:

values.append(cursor.CurrentValue)

try:

values = [int(x) for x in values ]

t.ColumnReplacements.Add(ccname,'Integer(['+ccname+'])',ColumnSelection(ccname))

except:

print ('was not Integer')

try:

values = [float(x) for x in values ]

t.ColumnReplacements.Add(ccname,'Real(['+ccname+'])',ColumnSelection(ccname))

except:

print ('was not Real')

pass

table.AddTransformation(t)

Gaia Paolini · August 1, 2023

Thanks, but the indentation is all gone. At the bottom of your editor window you see a symbol like </> which you can click on and enter the code there so it is not distorted by the editor. Please also add the library imports.

Sundaram Sridharan · August 1, 2023

import clr
import sys
import re
import os
#import chardet 
import io
clr.AddReference("System.Windows.Forms")
clr.AddReference('System')
from System.IO import FileStream,FileMode, File,MemoryStream,SeekOrigin,StreamWriter,StringReader,StreamReader
from System.Windows.Forms import OpenFileDialog
from Spotfire.Dxp.Data.DataOperations import DataOperation
import System.String
from Spotfire.Dxp.Data import *
from Spotfire.Dxp.Data.Import import *
from Spotfire.Dxp.Framework.ApplicationModel import NotificationService
from Spotfire.Dxp.Application.Visuals import VisualContent
from System.Collections.Generic import List
from Spotfire.Dxp.Data.Transformations import ExpressionTransformation,ColumnSelection
 
#load the first file into the desired output table
	file0 = fileList[0]
	f_name0,f_ext = os.path.splitext(os.path.basename(file0))
	ds = TextFileDataSource(file0, readerSettings)
	table_name="Imported Data"
	try: #if it exists, replace it
		table = Document.Data.Tables[table_name]
		table.ReplaceData(ds)
	except: #if not, create a new one
		table=Document.Data.Tables.Add(table_name, ds)
 
#load the second file, and now specity an origin column, the new value, and the previous value
	if len(fileList)>1:
		file1 = fileList[1]
		f_name1,f_ext = os.path.splitext(os.path.basename(file1))
		ds=TextFileDataSource(file1, readerSettings)
		settings = AddRowsSettings(table, ds, "File Name", f_name1,f_name0)
		table.AddRows(ds, settings)
#load third etc. files
	if len(fileList)>2:
		for i in range(2,len(fileList)):
			file = fileList[i]
			f_name,f_ext = os.path.splitext(os.path.basename(file))
			ds=TextFileDataSource(file, readerSettings)
#specify only the new value for the origin column
			settings = AddRowsSettings(table,ds,"File Name",f_name)
			table.AddRows(ds,settings)
#change column types
	rowsToInclude = IndexSet(table.RowCount,True)
	t = ExpressionTransformation()
	#get the list of columns
	columns = table.Columns
	#iterate on columns, collect values, try to turn them into floating point. If they are not actually floating point, do nothing, otherwise change the column type.
	for cc in columns:
		ccname=cc.Name
#		print ccname
		try:
			cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])
		except:
			continue
		values=[]
		for row in table.GetRows(rowsToInclude,cursor):
			if row.Index < 100:
				values.append(cursor.CurrentValue)
		try:
			values = [int(x) for x in values ]
			t.ColumnReplacements.Add(ccname,'Integer(['+ccname+'])',ColumnSelection(ccname))
		except:
			print ('was not Integer')
			try:
				values = [float(x) for x in values ]
				t.ColumnReplacements.Add(ccname,'Real(['+ccname+'])',ColumnSelection(ccname))
			except:
				print ('was not Real')
				pass
		table.AddTransformation(t)

Gaia Paolini · August 1, 2023

Thanks, I could reproduce the problem, but I haven't found a solution yet

Sundaram Sridharan · August 1, 2023

thanks for verifying this - is there any other way to fix data types during /after import?

Gaia Paolini · August 2, 2023

I found a workaround to the workaround, which loops us back to the original question...the script below should add a table and specify the data types all in one go. Disclaimer: this is testing the boundaries of my Spotfire API prowess! I am not sure whether there is a better way, but this one worked for me.

You need to supply valid values for dataPath and new_table_name.

from Spotfire.Dxp.Data.Import import *
from Spotfire.Dxp.Data import *
 
dataPath='C:/blahblah/mydata.csv'
new_table_name="My Table"
readerSettings = TextDataReaderSettings()
 
# Initial reader settings
readerSettings.Separator = "," # Our text file is tab-delimited 
readerSettings.AddColumnNameRow(0) 
readerSettings.AddIgnoreRow(1) # Ignore the bad data type row completely 
readerSettings.ClearDataTypes(True) # To be sure, clear all defined data types (all loaded into strings)
# Create a data source to read in the file
textDataSource = TextFileDataSource(dataPath, readerSettings)
 
# Create the new table
try:
	newTable = Document.Data.Tables.Add(new_table_name, textDataSource)
except:
	newTable = Document.Data.Tables[new_table_name]
 
# Loop through the columns and inspect the values, try to set them to Real
new_columns = newTable.Columns
new_columns_list=list(new_columns)
for icol in range(len(new_columns_list)):
	col=new_columns_list[icol]
	#print (col.Name)
	enum=col.RowValues.GetEnumerator()
	values = [tt.ValidValue for tt in enum]
	try:
		values = [float(x) for x in values ]
		readerSettings.SetDataType(icol, DataType.Real)
	except:
		print ('was not real',col.Name)
		pass
 
#Since we updated the reader settings, we need to reload the new data table from file
textDataSource = TextFileDataSource(dataPath, readerSettings)
 
Document.Data.Tables.Remove(new_table_name)
newTable = Document.Data.Tables.Add(new_table_name, textDataSource) 
 
tableSettings = DataTableSaveSettings(newTable, True, False) 
tableSettings.UseLinkedData = True # To be safe  
Document.Data.SaveSettings.DataTableSettings.Add(tableSettings)

Sundaram Sridharan · August 2, 2023

Thanks a lot - this works perfectly for me

I had some trouble with the list() function and hence changed things slightly

#Change Column Data Types
	columns = table.Columns
	for icol,col in enumerate (columns):
		enum=col.RowValues.GetEnumerator()
		values = [tt.ValidValue for tt in enum]
		numV = len(values)
		if numV > 100:
			numV = 100
		try:
			values = [int(x) for x in values[:numV]]
			readerSettings.SetDataType(icol, DataType.Integer)
			print ('was Int',col.Name)
		except:
			try:
				values = [float(x) for x in values[:numV]]
				readerSettings.SetDataType(icol, DataType.Real)
				print ('was Real',col.Name)	
			except:
				pass
#Since we updated the reader settings, we need to reload the new data table from file
	ds = TextFileDataSource(file0, readerSettings)
	table.ReplaceData(ds)

Is there a way to make Spotfire automatically "guess" the DataTypes of columns when importing data using IronPython

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in