Jump to content

Is there a way to make Spotfire automatically "guess" the DataTypes of columns when importing data using IronPython


Charles Randall 2

Recommended Posts

Hello,

 

When importing data from a text file using IronPython using code such as this:

readerSettings = TextDataReaderSettings()

readerSettings.Separator = "t" # Our text file is tab-delimited

readerSettings.AddColumnNameRow(0)

readerSettings.AddIgnoreRow(1) # Ignore the bad data type row completely

 

# create a data source to read in the file

textDataSource = TextFileDataSource(dataPath, readerSettings)

 

newTable = Document.Data.Tables.Add("My Table", textDataSource)

 

tableSettings = DataTableSaveSettings(newTable, True, False) # UseLinkedData = True; = False

tableSettings.UseLinkedData = True # To be safe

Document.Data.SaveSettings.DataTableSettings.Add(tableSettings)is there a way to make Spotfire automatically "guess" the data types of columns (like when you open the "Import settings" window for a data source, then click "Refresh," and all the types get set automatically) Currently, every single column gets imported as a String, regardless of whether it is solely numeric or not.

I know that specifying a TypeRow is the preferred method for setting DataTypes automatically, but the data I am working with unfortunately has unhelpful type rows (which, for example, list some - but not all - numeric columns as being of type "STRING"), and they are out of my control.

 

Thank you very much,

Charles

Link to comment
Share on other sites

I cannot find a way to do it on import, however it is possible to change the data types on a table after it's been created.

See script below. This snippet checks for floating point. It is pretty basic, I hope you can adapt it to your needs and maybe not check all rows but a subset.

---------------------------------------

from Spotfire.Dxp.Data.Transformations import ExpressionTransformation,ColumnSelection

from Spotfire.Dxp.Data import *

table = Document.Data.Tables['MyNewTableWithWrongDataTypes']

rowsToInclude = IndexSet(table.RowCount,True)

t = ExpressionTransformation()

#get the list of columns

columns = table.Columns

#iterate on columns, collect values, try to turn them into floating point. If they are not actually floating point, do nothing, otherwise change the column type.

for cc in columns:

ccname=cc.Name

cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])

values=[]

for row in table.GetRows(rowsToInclude,cursor):

values.append(cursor.CurrentValue)

try:

values = [float(x) for x in values ]

t.ColumnReplacements.Add(

ccname,'real(['+ccname+'])',ColumnSelection(ccname)

)

except:

print ('was not real')

pass

 

table.AddTransformation(t)

Link to comment
Share on other sites

  • 1 year later...

this approach works great in my case - was able to even nest a try: / except: to check for Integer first and then for real. However whenever it encounters a column name that has spaces it sometimes throws an error

ValueError: The column 'my column with spaces' does not exist in the collection. it only happens to one column in a file that has a bunch of other columns with spaces in their names

the line of code which throws this error is

cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])

Link to comment
Share on other sites

it only happens to columns that have another related column - let's say you have two columns "Well" and "Well Type" - then it is able to process "Well" fine but complains that "Well Type" cannot be found in the collection - if you go around it and complete the code, it brings in a new column in table called "Well(2)" that has the same content as the "Well" column and "Well Type" is completely gone. If I comment out this piece of code, then the file imports fine with all columns intact. If it would help to share DXP then I can try to recreate a smaller version

Link to comment
Share on other sites

here is part of code that imports files and try to change the data types

#load the first file into the desired output table

file0 = fileList[0]

f_name0,f_ext = os.path.splitext(os.path.basename(file0))

ds = TextFileDataSource(file0, readerSettings)

table_name="Imported Data"

try: #if it exists, replace it

table = Document.Data.Tables[table_name]

table.ReplaceData(ds)

except: #if not, create a new one

table=Document.Data.Tables.Add(table_name, ds)

#load the second file, and now specity an origin column, the new value, and the previous value

if len(fileList)>1:

file1 = fileList[1]

f_name1,f_ext = os.path.splitext(os.path.basename(file1))

ds=TextFileDataSource(file1, readerSettings)

settings = AddRowsSettings(table, ds, "File Name", f_name1,f_name0)

table.AddRows(ds, settings)

#load third etc. files

if len(fileList)>2:

for i in range(2,len(fileList)):

file = fileList

f_name,f_ext = os.path.splitext(os.path.basename(file))

ds=TextFileDataSource(file, readerSettings)

#specify only the new value for the origin column

settings = AddRowsSettings(table,ds,"File Name",f_name)

table.AddRows(ds,settings)

#change column types

rowsToInclude = IndexSet(table.RowCount,True)

t = ExpressionTransformation()

#get the list of columns

columns = table.Columns

#iterate on columns, collect values, try to turn them into floating point. If they are not actually floating point, do nothing, otherwise change the column type.

for cc in columns:

ccname=cc.Name

# print ccname

try:

cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])

except:

continue

values=[]

for row in table.GetRows(rowsToInclude,cursor):

if row.Index < 100:

values.append(cursor.CurrentValue)

try:

values = [int(x) for x in values ]

t.ColumnReplacements.Add(ccname,'Integer(['+ccname+'])',ColumnSelection(ccname))

except:

print ('was not Integer')

try:

values = [float(x) for x in values ]

t.ColumnReplacements.Add(ccname,'Real(['+ccname+'])',ColumnSelection(ccname))

except:

print ('was not Real')

pass

table.AddTransformation(t)

Link to comment
Share on other sites

 

import clr
import sys
import re
import os
#import chardet 
import io
clr.AddReference("System.Windows.Forms")
clr.AddReference('System')
from System.IO import FileStream,FileMode, File,MemoryStream,SeekOrigin,StreamWriter,StringReader,StreamReader
from System.Windows.Forms import OpenFileDialog
from Spotfire.Dxp.Data.DataOperations import DataOperation
import System.String
from Spotfire.Dxp.Data import *
from Spotfire.Dxp.Data.Import import *
from Spotfire.Dxp.Framework.ApplicationModel import NotificationService
from Spotfire.Dxp.Application.Visuals import VisualContent
from System.Collections.Generic import List
from Spotfire.Dxp.Data.Transformations import ExpressionTransformation,ColumnSelection
 
#load the first file into the desired output table
	file0 = fileList[0]
	f_name0,f_ext = os.path.splitext(os.path.basename(file0))
	ds = TextFileDataSource(file0, readerSettings)
	table_name="Imported Data"
	try: #if it exists, replace it
		table = Document.Data.Tables[table_name]
		table.ReplaceData(ds)
	except: #if not, create a new one
		table=Document.Data.Tables.Add(table_name, ds)
 
#load the second file, and now specity an origin column, the new value, and the previous value
	if len(fileList)>1:
		file1 = fileList[1]
		f_name1,f_ext = os.path.splitext(os.path.basename(file1))
		ds=TextFileDataSource(file1, readerSettings)
		settings = AddRowsSettings(table, ds, "File Name", f_name1,f_name0)
		table.AddRows(ds, settings)
#load third etc. files
	if len(fileList)>2:
		for i in range(2,len(fileList)):
			file = fileList[i]
			f_name,f_ext = os.path.splitext(os.path.basename(file))
			ds=TextFileDataSource(file, readerSettings)
#specify only the new value for the origin column
			settings = AddRowsSettings(table,ds,"File Name",f_name)
			table.AddRows(ds,settings)
#change column types
	rowsToInclude = IndexSet(table.RowCount,True)
	t = ExpressionTransformation()
	#get the list of columns
	columns = table.Columns
	#iterate on columns, collect values, try to turn them into floating point. If they are not actually floating point, do nothing, otherwise change the column type.
	for cc in columns:
		ccname=cc.Name
#		print ccname
		try:
			cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])
		except:
			continue
		values=[]
		for row in table.GetRows(rowsToInclude,cursor):
			if row.Index < 100:
				values.append(cursor.CurrentValue)
		try:
			values = [int(x) for x in values ]
			t.ColumnReplacements.Add(ccname,'Integer(['+ccname+'])',ColumnSelection(ccname))
		except:
			print ('was not Integer')
			try:
				values = [float(x) for x in values ]
				t.ColumnReplacements.Add(ccname,'Real(['+ccname+'])',ColumnSelection(ccname))
			except:
				print ('was not Real')
				pass
		table.AddTransformation(t)

 

Link to comment
Share on other sites

I found a workaround to the workaround, which loops us back to the original question...the script below should add a table and specify the data types all in one go. Disclaimer: this is testing the boundaries of my Spotfire API prowess! I am not sure whether there is a better way, but this one worked for me.

You need to supply valid values for dataPath and new_table_name.

from Spotfire.Dxp.Data.Import import *
from Spotfire.Dxp.Data import *
 
dataPath='C:/blahblah/mydata.csv'
new_table_name="My Table"
readerSettings = TextDataReaderSettings()
 
# Initial reader settings
readerSettings.Separator = "," # Our text file is tab-delimited 
readerSettings.AddColumnNameRow(0) 
readerSettings.AddIgnoreRow(1) # Ignore the bad data type row completely 
readerSettings.ClearDataTypes(True) # To be sure, clear all defined data types (all loaded into strings)
# Create a data source to read in the file
textDataSource = TextFileDataSource(dataPath, readerSettings)
 
# Create the new table
try:
	newTable = Document.Data.Tables.Add(new_table_name, textDataSource)
except:
	newTable = Document.Data.Tables[new_table_name]
 
# Loop through the columns and inspect the values, try to set them to Real
new_columns = newTable.Columns
new_columns_list=list(new_columns)
for icol in range(len(new_columns_list)):
	col=new_columns_list[icol]
	#print (col.Name)
	enum=col.RowValues.GetEnumerator()
	values = [tt.ValidValue for tt in enum]
	try:
		values = [float(x) for x in values ]
		readerSettings.SetDataType(icol, DataType.Real)
	except:
		print ('was not real',col.Name)
		pass
 
#Since we updated the reader settings, we need to reload the new data table from file
textDataSource = TextFileDataSource(dataPath, readerSettings)
 
Document.Data.Tables.Remove(new_table_name)
newTable = Document.Data.Tables.Add(new_table_name, textDataSource) 
 
tableSettings = DataTableSaveSettings(newTable, True, False) 
tableSettings.UseLinkedData = True # To be safe  
Document.Data.SaveSettings.DataTableSettings.Add(tableSettings)

 

Link to comment
Share on other sites

Thanks a lot - this works perfectly for me

I had some trouble with the list() function and hence changed things slightly

#Change Column Data Types
	columns = table.Columns
	for icol,col in enumerate (columns):
		enum=col.RowValues.GetEnumerator()
		values = [tt.ValidValue for tt in enum]
		numV = len(values)
		if numV > 100:
			numV = 100
		try:
			values = [int(x) for x in values[:numV]]
			readerSettings.SetDataType(icol, DataType.Integer)
			print ('was Int',col.Name)
		except:
			try:
				values = [float(x) for x in values[:numV]]
				readerSettings.SetDataType(icol, DataType.Real)
				print ('was Real',col.Name)	
			except:
				pass
#Since we updated the reader settings, we need to reload the new data table from file
	ds = TextFileDataSource(file0, readerSettings)
	table.ReplaceData(ds)

 

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...