Miguel Mano Posted May 11, 2022 Posted May 11, 2022 I would like to calculate pairwise euclidean distances between different samples. I use the data to build heatmap, but need the pairwise distance between all samples. Thank you
Gaia Paolini Posted May 13, 2022 Posted May 13, 2022 So I understand you want a distance matrix for each pair of records (samples) in your data table, and you want to use this distance matrix to build a heat map. Is this correct I assume all the variables in your table are numeric You would need a data function that takes your data table as input and generates the distance matrix as output. Are you familiar with R or Python
Fay Anabelle Posted May 20, 2022 Posted May 20, 2022 Hello. I have the same issue. And I'm pretty notfamiliar with R or Python. Can you give more specific instructions five nights at freddy's
Gaia Paolini Posted May 20, 2022 Posted May 20, 2022 To clarify my comment, I am still trying to understand the requirements. If you have a large data table. producing a full distance matrix would generate an enormous amount of data. So there is a big difference between generating the distance between selected samples and all samples.
Miguel Mano Posted May 20, 2022 Author Posted May 20, 2022 Thank you. Data is all numeric, matrix of 200 samples with 60 parameters. I want to calculate the distance between all samples - I believe this generates a 200x200 symetric matrix. I am not familiar with R, though I am abe to execute sommands. Thanks
Gaia Paolini Posted May 20, 2022 Posted May 20, 2022 Attached is an example using the iris dataset (150 records). I original posted this answer suggesting a data function. I then thought of an alternative not using data functions, see UPDATE below. Method with data function: See this for an intro to data functionshttps://www.youtube.com/watchv=IMw4P3INThU It uses a simple dist function in R. I tried to minimise the packages to use, but if you want to return the result as a tall data table (as well as a distance matrix) you need to install package reshape2: Go to Tools > TERR Tools choose Package Management tab if you don't see reshape2 in the list of installed packages, click on Load then type in reshape2 under Available Packages select the result in the list and click Install Otherwise if you only want the distance matrix, edit the data function script, comment all lines after comment #turn into tall table, and remove dist_dt from the output items. The data function is embedded in the dxp: #load library suppressWarnings(suppressPackageStartupMessages(library(reshape2))) # dt the input data table # dist_matrix is the output distance matrix # dist_dt the output tall distance table #determine the datatypes of the variables column_datatypes= sapply(dt,function(x) class(x)[1]) #isolate the numeric variables numeric_columns = names(column_datatypes[column_datatypes %in% c('numeric','integer')]) #create a matrix with the numeric variables of the data table dt_numeric=as.matrix(dt[,numeric_columns]) #calculate the distance matrix dist_matrix = as.matrix(dist(dt_numeric,method = "euclidean")) #turn into tall table dist_dt=(melt(dist_matrix)) #remove diagonal and one triangle #comment these lines if you want all data dist_dt$pair = ifelse(dist_dt$Var1>dist_dt$Var2,paste(dist_dt$Var1,dist_dt$Var2,sep='-'),paste(dist_dt$Var2,dist_dt$Var1,sep='-')) dist_dt = dist_dt[!duplicated(dist_dt$pair) & dist_dt$Var1 != dist_dt$Var2,] UPDATE: method without data function I found an alternative method that only involves Spotfire joins and calculated columns. It is a bit involved: 1 - add a column to your original table, called e.g. [forJoin] with a single value, it does not matter what. This column will be used for the join 2 - add a column to your original table, called e.g. [iD]=rowid(). This will be used to identify the rows after the join. 3 - create a new table based on your original table. go to + and choose 'Other' then select the original data table: choose 'add as new data table'. this creates a new table based on the original data. I called it 'Distance Table 2' add columns to this table, coming again from the original table. Join on [forJoin]. This creates all the row combinations. save the joined table as 'always new table', as it would refresh every time you update the original table. Now the joined table Distance Table 2 is created and we need to calculate the euclidean distance row by row. method A: purely manual, just for comparison in this dataset, since we have few columns. Calculated column [distance] as Sqrt((([sepal_Length] - [sepal_Length (2)]) ^ 2) + (([sepal_Width] - [sepal_Width (2)]) ^ 2) + (([Petal_Length] - [Petal_Length (2)]) ^ 2) + (([Petal_Width] - [Petal_Width (2)]) ^ 2)) method B: via two document properties, since we may have in reality many columns to match: create two multiple selection list boxes. One, called 'list1' to select all the columns in the joined table that come from the original table. The other, called 'list2', to select all the columns in the joined table that come from the second original table. They are identified by the (2) appended to their names. Select the appropriate columns for list1 and list2, then add a calculated column [distance2] to the joined table, as Sqrt($map("([${list1}]-[${list2}])^2", "+")) This calculates the distance between all column selected in list1 and the corresponding columns in list2. Note: the relative order needs to match, e.g. Sepal_Length (2) needs to be subtracted from Sepal_length, but this should come automatically if you set up your lists carefully. From my tests, limited to this datase (see dxp) these methods were equivalent.
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now