Jump to content

How to calculate pairwise euclidean distances between datasets on heatmap


Miguel Mano

Recommended Posts

So I understand you want a distance matrix for each pair of records (samples) in your data table, and you want to use this distance matrix to build a heat map. Is this correct I assume all the variables in your table are numeric

 

You would need a data function that takes your data table as input and generates the distance matrix as output. Are you familiar with R or Python

Link to comment
Share on other sites

To clarify my comment, I am still trying to understand the requirements. If you have a large data table. producing a full distance matrix would generate an enormous amount of data. So there is a big difference between generating the distance between selected samples and all samples.
Link to comment
Share on other sites

Thank you. Data is all numeric, matrix of 200 samples with 60 parameters.

 

I want to calculate the distance between all samples - I believe this generates a 200x200 symetric matrix.

 

I am not familiar with R, though I am abe to execute sommands.

 

Thanks

Link to comment
Share on other sites

Attached is an example using the iris dataset (150 records).

I original posted this answer suggesting a data function. I then thought of an alternative not using data functions, see UPDATE below.

Method with data function:

See this for an intro to data functionshttps://www.youtube.com/watchv=IMw4P3INThU

It uses a simple dist function in R. I tried to minimise the packages to use, but if you want to return the result as a tall data table (as well as a distance matrix) you need to install package reshape2:

Go to Tools > TERR Tools

choose Package Management tab

if you don't see reshape2 in the list of installed packages, click on Load

then type in reshape2 under Available Packages

select the result in the list and click Install

Otherwise if you only want the distance matrix, edit the data function script, comment all lines after comment #turn into tall table, and remove dist_dt from the output items.

The data function is embedded in the dxp:

 

#load library

suppressWarnings(suppressPackageStartupMessages(library(reshape2)))

# dt the input data table

# dist_matrix is the output distance matrix

# dist_dt the output tall distance table

#determine the datatypes of the variables

column_datatypes= sapply(dt,function(x) class(x)[1])

#isolate the numeric variables

numeric_columns = names(column_datatypes[column_datatypes %in% c('numeric','integer')])

#create a matrix with the numeric variables of the data table

dt_numeric=as.matrix(dt[,numeric_columns])

#calculate the distance matrix

dist_matrix = as.matrix(dist(dt_numeric,method = "euclidean"))

#turn into tall table

dist_dt=(melt(dist_matrix))

#remove diagonal and one triangle

#comment these lines if you want all data

dist_dt$pair = ifelse(dist_dt$Var1>dist_dt$Var2,paste(dist_dt$Var1,dist_dt$Var2,sep='-'),paste(dist_dt$Var2,dist_dt$Var1,sep='-'))

dist_dt = dist_dt[!duplicated(dist_dt$pair) & dist_dt$Var1 != dist_dt$Var2,]

UPDATE: method without data function

I found an alternative method that only involves Spotfire joins and calculated columns. It is a bit involved:

1 - add a column to your original table, called e.g. [forJoin] with a single value, it does not matter what. This column will be used for the join

2 - add a column to your original table, called e.g. [iD]=rowid(). This will be used to identify the rows after the join.

3 - create a new table based on your original table.

go to + and choose 'Other' then select the original data table: choose 'add as new data table'.

this creates a new table based on the original data. I called it 'Distance Table 2'

add columns to this table, coming again from the original table. Join on [forJoin]. This creates all the row combinations.

save the joined table as 'always new table', as it would refresh every time you update the original table.

 

Now the joined table Distance Table 2 is created and we need to calculate the euclidean distance row by row.

method A: purely manual, just for comparison in this dataset, since we have few columns. Calculated column [distance] as

Sqrt((([sepal_Length] - [sepal_Length (2)]) ^ 2) + (([sepal_Width] - [sepal_Width (2)]) ^ 2) + (([Petal_Length] - [Petal_Length (2)]) ^ 2) + (([Petal_Width] - [Petal_Width (2)]) ^ 2))

method B: via two document properties, since we may have in reality many columns to match:

create two multiple selection list boxes.

One, called 'list1' to select all the columns in the joined table that come from the original table.

The other, called 'list2', to select all the columns in the joined table that come from the second original table. They are identified by the (2) appended to their names.

 

Select the appropriate columns for list1 and list2, then add a calculated column [distance2] to the joined table, as

Sqrt($map("([${list1}]-[${list2}])^2", "+"))

This calculates the distance between all column selected in list1 and the corresponding columns in list2.

Note: the relative order needs to match, e.g. Sepal_Length (2) needs to be subtracted from Sepal_length, but this should come automatically if you set up your lists carefully.

From my tests, limited to this datase (see dxp) these methods were equivalent.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...