Jump to content

Is there a way to make Spotfire automatically "guess" the DataTypes of columns when importing data using IronPython


Charles Randall 2

Recommended Posts

Hello,

 

When importing data from a text file using IronPython using code such as this:

readerSettings = TextDataReaderSettings()

readerSettings.Separator = "t" # Our text file is tab-delimited

readerSettings.AddColumnNameRow(0)

readerSettings.AddIgnoreRow(1) # Ignore the bad data type row completely

 

# create a data source to read in the file

textDataSource = TextFileDataSource(dataPath, readerSettings)

 

newTable = Document.Data.Tables.Add("My Table", textDataSource)

 

tableSettings = DataTableSaveSettings(newTable, True, False) # UseLinkedData = True; = False

tableSettings.UseLinkedData = True # To be safe

Document.Data.SaveSettings.DataTableSettings.Add(tableSettings)is there a way to make Spotfire automatically "guess" the data types of columns (like when you open the "Import settings" window for a data source, then click "Refresh," and all the types get set automatically) Currently, every single column gets imported as a String, regardless of whether it is solely numeric or not.

I know that specifying a TypeRow is the preferred method for setting DataTypes automatically, but the data I am working with unfortunately has unhelpful type rows (which, for example, list some - but not all - numeric columns as being of type "STRING"), and they are out of my control.

 

Thank you very much,

Charles

Link to comment
Share on other sites

I cannot find a way to do it on import, however it is possible to change the data types on a table after it's been created.

See script below. This snippet checks for floating point. It is pretty basic, I hope you can adapt it to your needs and maybe not check all rows but a subset.

---------------------------------------

from Spotfire.Dxp.Data.Transformations import ExpressionTransformation,ColumnSelection

from Spotfire.Dxp.Data import *

table = Document.Data.Tables['MyNewTableWithWrongDataTypes']

rowsToInclude = IndexSet(table.RowCount,True)

t = ExpressionTransformation()

#get the list of columns

columns = table.Columns

#iterate on columns, collect values, try to turn them into floating point. If they are not actually floating point, do nothing, otherwise change the column type.

for cc in columns:

ccname=cc.Name

cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])

values=[]

for row in table.GetRows(rowsToInclude,cursor):

values.append(cursor.CurrentValue)

try:

values = [float(x) for x in values ]

t.ColumnReplacements.Add(

ccname,'real(['+ccname+'])',ColumnSelection(ccname)

)

except:

print ('was not real')

pass

 

table.AddTransformation(t)

Link to comment
Share on other sites

  • 1 year later...

this approach works great in my case - was able to even nest a try: / except: to check for Integer first and then for real. However whenever it encounters a column name that has spaces it sometimes throws an error

ValueError: The column 'my column with spaces' does not exist in the collection. it only happens to one column in a file that has a bunch of other columns with spaces in their names

the line of code which throws this error is

cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])

Link to comment
Share on other sites

it only happens to columns that have another related column - let's say you have two columns "Well" and "Well Type" - then it is able to process "Well" fine but complains that "Well Type" cannot be found in the collection - if you go around it and complete the code, it brings in a new column in table called "Well(2)" that has the same content as the "Well" column and "Well Type" is completely gone. If I comment out this piece of code, then the file imports fine with all columns intact. If it would help to share DXP then I can try to recreate a smaller version

Link to comment
Share on other sites

here is part of code that imports files and try to change the data types

#load the first file into the desired output table

file0 = fileList[0]

f_name0,f_ext = os.path.splitext(os.path.basename(file0))

ds = TextFileDataSource(file0, readerSettings)

table_name="Imported Data"

try: #if it exists, replace it

table = Document.Data.Tables[table_name]

table.ReplaceData(ds)

except: #if not, create a new one

table=Document.Data.Tables.Add(table_name, ds)

#load the second file, and now specity an origin column, the new value, and the previous value

if len(fileList)>1:

file1 = fileList[1]

f_name1,f_ext = os.path.splitext(os.path.basename(file1))

ds=TextFileDataSource(file1, readerSettings)

settings = AddRowsSettings(table, ds, "File Name", f_name1,f_name0)

table.AddRows(ds, settings)

#load third etc. files

if len(fileList)>2:

for i in range(2,len(fileList)):

file = fileList

f_name,f_ext = os.path.splitext(os.path.basename(file))

ds=TextFileDataSource(file, readerSettings)

#specify only the new value for the origin column

settings = AddRowsSettings(table,ds,"File Name",f_name)

table.AddRows(ds,settings)

#change column types

rowsToInclude = IndexSet(table.RowCount,True)

t = ExpressionTransformation()

#get the list of columns

columns = table.Columns

#iterate on columns, collect values, try to turn them into floating point. If they are not actually floating point, do nothing, otherwise change the column type.

for cc in columns:

ccname=cc.Name

# print ccname

try:

cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])

except:

continue

values=[]

for row in table.GetRows(rowsToInclude,cursor):

if row.Index < 100:

values.append(cursor.CurrentValue)

try:

values = [int(x) for x in values ]

t.ColumnReplacements.Add(ccname,'Integer(['+ccname+'])',ColumnSelection(ccname))

except:

print ('was not Integer')

try:

values = [float(x) for x in values ]

t.ColumnReplacements.Add(ccname,'Real(['+ccname+'])',ColumnSelection(ccname))

except:

print ('was not Real')

pass

table.AddTransformation(t)

Link to comment
Share on other sites

import clr

import sys

import re

import os

#import chardet

import io

clr.AddReference("System.Windows.Forms")

clr.AddReference('System')

from System.IO import FileStream,FileMode, File,MemoryStream,SeekOrigin,StreamWriter,StringReader,StreamReader

from System.Windows.Forms import OpenFileDialog

from Spotfire.Dxp.Data.DataOperations import DataOperation

import System.String

from Spotfire.Dxp.Data import *

from Spotfire.Dxp.Data.Import import *

from Spotfire.Dxp.Framework.ApplicationModel import NotificationService

from Spotfire.Dxp.Application.Visuals import VisualContent

from System.Collections.Generic import List

from Spotfire.Dxp.Data.Transformations import ExpressionTransformation,ColumnSelection

 

#load the first file into the desired output table

file0 = fileList[0]

f_name0,f_ext = os.path.splitext(os.path.basename(file0))

ds = TextFileDataSource(file0, readerSettings)

table_name="Imported Data"

try: #if it exists, replace it

table = Document.Data.Tables[table_name]

table.ReplaceData(ds)

except: #if not, create a new one

table=Document.Data.Tables.Add(table_name, ds)

 

#load the second file, and now specity an origin column, the new value, and the previous value

if len(fileList)>1:

file1 = fileList[1]

f_name1,f_ext = os.path.splitext(os.path.basename(file1))

ds=TextFileDataSource(file1, readerSettings)

settings = AddRowsSettings(table, ds, "File Name", f_name1,f_name0)

table.AddRows(ds, settings)

#load third etc. files

if len(fileList)>2:

for i in range(2,len(fileList)):

file = fileList

f_name,f_ext = os.path.splitext(os.path.basename(file))

ds=TextFileDataSource(file, readerSettings)

#specify only the new value for the origin column

settings = AddRowsSettings(table,ds,"File Name",f_name)

table.AddRows(ds,settings)

#change column types

rowsToInclude = IndexSet(table.RowCount,True)

t = ExpressionTransformation()

#get the list of columns

columns = table.Columns

#iterate on columns, collect values, try to turn them into floating point. If they are not actually floating point, do nothing, otherwise change the column type.

for cc in columns:

ccname=cc.Name

# print ccname

try:

cursor = DataValueCursor.CreateFormatted(table.Columns[ccname])

except:

continue

values=[]

for row in table.GetRows(rowsToInclude,cursor):

if row.Index < 100:

values.append(cursor.CurrentValue)

try:

values = [int(x) for x in values ]

t.ColumnReplacements.Add(ccname,'Integer(['+ccname+'])',ColumnSelection(ccname))

except:

print ('was not Integer')

try:

values = [float(x) for x in values ]

t.ColumnReplacements.Add(ccname,'Real(['+ccname+'])',ColumnSelection(ccname))

except:

print ('was not Real')

pass

table.AddTransformation(t)

Link to comment
Share on other sites

I found a workaround to the workaround, which loops us back to the original question...the script below should add a table and specify the data types all in one go. Disclaimer: this is testing the boundaries of my Spotfire API prowess! I am not sure whether there is a better way, but this one worked for me.

You need to supply valid values for dataPath and new_table_name.

from Spotfire.Dxp.Data.Import import *from Spotfire.Dxp.Data import * dataPath='C:/blahblah/mydata.csv'new_table_name="My Table"readerSettings = TextDataReaderSettings() # Initial reader settingsreaderSettings.Separator = "," # Our text file is tab-delimited readerSettings.AddColumnNameRow(0) readerSettings.AddIgnoreRow(1) # Ignore the bad data type row completely readerSettings.ClearDataTypes(True) # To be sure, clear all defined data types (all loaded into strings)# Create a data source to read in the filetextDataSource = TextFileDataSource(dataPath, readerSettings) # Create the new tabletry: newTable = Document.Data.Tables.Add(new_table_name, textDataSource)except: newTable = Document.Data.Tables[new_table_name] # Loop through the columns and inspect the values, try to set them to Realnew_columns = newTable.Columnsnew_columns_list=list(new_columns)for icol in range(len(new_columns_list)): col=new_columns_list[icol] #print (col.Name) enum=col.RowValues.GetEnumerator() values = [tt.ValidValue for tt in enum] try: values = [float(x) for x in values ] readerSettings.SetDataType(icol, DataType.Real) except: print ('was not real',col.Name) pass #Since we updated the reader settings, we need to reload the new data table from filetextDataSource = TextFileDataSource(dataPath, readerSettings) Document.Data.Tables.Remove(new_table_name)newTable = Document.Data.Tables.Add(new_table_name, textDataSource)  tableSettings = DataTableSaveSettings(newTable, True, False) tableSettings.UseLinkedData = True # To be safe Document.Data.SaveSettings.DataTableSettings.Add(tableSettings)
Link to comment
Share on other sites

Thanks a lot - this works perfectly for me

I had some trouble with the list() function and hence changed things slightly

#Change Column Data Types columns = table.Columns for icol,col in enumerate (columns): enum=col.RowValues.GetEnumerator() values = [tt.ValidValue for tt in enum] numV = len(values) if numV > 100: numV = 100 try: values = [int(x) for x in values[:numV]] readerSettings.SetDataType(icol, DataType.Integer) print ('was Int',col.Name) except: try: values = [float(x) for x in values[:numV]] readerSettings.SetDataType(icol, DataType.Real) print ('was Real',col.Name) except: pass#Since we updated the reader settings, we need to reload the new data table from file ds = TextFileDataSource(file0, readerSettings) table.ReplaceData(ds)
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...