Manon Carvalho - Posted September 30

Hello,

I have a dataset with several Groups and several Parameters. Each row has a Value associated with a Time. I would like to normalize this dataset using the values of one of the parameters (A in the example below). However, the times do not match exactly, so I would like to normalize each value using the value at the closest time within each group.

As I could not find a way to do this with calculated columns using regular Spotfire custom expressions, I tried a custom expression function with TERR. As a first step I created a calculated column, Time from A:

If([Parameter]="A",[Time])

Then I created an expression function:

```r
Closest <- function(col_to_cut, col_with_cuts) {
  col_with_cuts <- sort(col_with_cuts)
  cuts <- c(-Inf, col_with_cuts[-1] - diff(col_with_cuts) / 2, Inf)
  cuts <- sort(cuts[!is.na(cuts)])
  index <- findInterval(col_to_cut, cuts)
  return(col_with_cuts[index])
}

output <- Closest(input1, input2)
```

And I inserted a calculated column calling this function:

Closest([Time],[Time from A])

But this only works on the whole dataset (Groups 1 and 2 together; see "Wrong intermediate result" in the image above). Is it possible to apply an expression function over groups of data, using something like Closest([Time],[Time from A]) OVER ([Group])?

Thank you
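For readers who want to follow the logic outside Spotfire: the nearest-value lookup that Closest performs (bucket each time into the interval of its nearest reference time) can be sketched in plain Python. This is a hypothetical standalone equivalent for illustration, not Spotfire/TERR wiring; the function name closest is made up here.

```python
# Standalone sketch of the nearest-value lookup done by the TERR Closest()
# function: for each value, return the nearest reference value.
import bisect

def closest(values, references):
    """For each element of values, return the nearest element of references."""
    refs = sorted(r for r in references if r is not None)
    out = []
    for v in values:
        # Index where v would be inserted to keep refs sorted; the nearest
        # reference is either the element just before or just after.
        i = bisect.bisect_left(refs, v)
        candidates = refs[max(i - 1, 0):i + 1]
        out.append(min(candidates, key=lambda r: abs(r - v)))
    return out

print(closest([1, 4, 9], [2, 8]))  # [2, 2, 8]
```

Like the findInterval version, this is a global lookup: it has no notion of groups, which is exactly the limitation discussed below.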
David Boot-Olazabal - Posted October 1

Hi Manon,

Did you already try that OVER clause in the Closest([Time],[Time from A]) expression? And if so, did it solve the issue? If not, can you share your example dataset or dxp file?

Kind regards,
David
Manon Carvalho - Posted October 1 (Author)

Hi David,

Yes, I tried the expression Closest([Time],[Time from A]) OVER ([Group]), but "OVER expressions can only be used with aggregation methods". And if I use an aggregation method, I get the same value everywhere within each group, which is not what I want. For example, with

Avg(Closest([Time],[Time from A])) OVER ([Group])

I get 11 for all rows of Group 1 and 8 for all rows of Group 2.

Attached is the dxp file containing the data table with the (wrong) calculations and the data table with the expected results.

Thank you

Normalization Expression Function.dxp
David Boot-Olazabal - Posted October 1

Hi Manon,

Yes, indeed, OVER requires an aggregation method. I'm not sure how big your actual dataset is, but a workaround could be to create, for each occurrence of Group/Parameter A, a calculated column using a CASE WHEN THEN statement. In my screenshot these are highlighted:

CASE WHEN ([Group]=1) AND ([Parameter]="A") THEN [Time] END

Once you have that, you can calculate the 'Closest time..' columns for each of the 'Time from..' columns:

Closest([Time],[Time from 1A])

And finally, the Result calculated column brings it all together with another CASE statement:

CASE WHEN [Group]=1 THEN [Closest time from 1A] WHEN [Group]=2 THEN [Closest time from 2A] END

That gives you the requested values in the Result column. The disadvantage is that you may end up with quite a few calculated columns and a large CASE statement. I'll try to find a better, more flexible solution and ask some of my colleagues as well.

Kind regards,
David
David Boot-Olazabal - Posted October 1

Hi Manon,

The calculated column below eliminates the need for the two 'Closest time from..' columns. For your Result column, you can also use this expression:

CASE WHEN [Group]=1 THEN Closest([Time],[Time from 1A]) WHEN [Group]=2 THEN Closest([Time],[Time from 2A]) END

That already makes it a bit more flexible for larger datasets.

Kind regards,
David
Manon Carvalho - Posted October 1 (Author)

Hi David,

Sadly this is not a solution for me, because my actual dataset is much bigger than this example one and, above all, it is loaded on demand and therefore dynamic (as is the parameter used as the reference, but for that I will use a document property). So I have a dataset for which I don't know in advance how many groups I will get, and therefore how many calculated columns I would need.

Thank you for checking
David Boot-Olazabal - Posted October 1

Hi Manon,

Yes, that is indeed a complicating factor. I've asked my colleagues to see if they can 'redesign' the original expression, Closest([Time],[Time from A]). That would probably be a better fit for your use case.

Kind regards,
David
David Boot-Olazabal Posted October 1 Share Posted October 1 (edited) Hi Manon, You may use a different approach to calculate the nearest (closest) time value. Currently, you've used a custom expression function to calculate the closest time. But what if you use a 'normal' (Python) data function to calculate this? And store the results in a data table (output parameter)? Would that be an option to consider? Or are you able to alter your current custom expression function in such a way, that it takes into account the Group and Parameter columns? Kind regards, David Edited October 1 by David Boot-Olazabal Link to comment Share on other sites More sharing options...
Solution

Manon Carvalho - Posted October 2 (Author)

Hi David,

The second option is the one that fits my needs best. I modified the custom expression function to take groups into account:

```r
Closest <- function(col_to_cut, col_with_cuts) {
  col_with_cuts <- sort(col_with_cuts)
  cuts <- c(-Inf, col_with_cuts[-1] - diff(col_with_cuts) / 2, Inf)
  cuts <- sort(cuts[!is.na(cuts)])
  index <- findInterval(col_to_cut, cuts)
  return(col_with_cuts[index])
}

ClosestInGroup <- function(col_to_cut, col_with_cuts, col_to_group) {
  indices <- c()
  result <- c()
  for (group in unique(col_to_group)) {
    indexes <- which(col_to_group %in% group)
    closest <- Closest(col_to_cut[indexes], col_with_cuts[indexes])
    indices <- append(indices, indexes)
    result <- append(result, closest)
  }
  # Reassemble the per-group results in the original row order.
  df <- data.frame(result, indices)
  df <- df[with(df, order(indices)), ]
  return(df$result)
}

output <- ClosestInGroup(input1, input2, input3)
```

Then I have a first calculated column, Closest Time from A:

ClosestInGroup([Time],If([Parameter]="A",[Time]),[Group])

A second calculated column, Value of the Closest Time from A:

Avg(If([Parameter]="A",[Value])) OVER ([Closest Time from A],[Group])

And a final calculated column, Normalized Value:

[Value] / [Value of the Closest Time from A]

This works as expected. Thank you for your support.
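The key trick in ClosestInGroup is to slice the columns per group, compute the closest values on each slice, and then restore the original row order via the saved indices. That pattern can be mirrored in plain Python; this is a hypothetical standalone sketch (invented function names and example-style data, not the TERR input1/input2/input3 wiring), using a simple linear nearest-value scan instead of findInterval.

```python
# Mirror of ClosestInGroup: per-group nearest-value lookup, with results
# written back at each row's original position.
def closest(values, references):
    refs = [r for r in references if r is not None]
    return [min(refs, key=lambda r: abs(r - v)) for v in values]

def closest_in_group(values, references, groups):
    result = [None] * len(values)
    for g in set(groups):
        # Row positions belonging to this group.
        idx = [i for i, grp in enumerate(groups) if grp == g]
        sub = closest([values[i] for i in idx], [references[i] for i in idx])
        for i, c in zip(idx, sub):  # restore original row order
            result[i] = c
    return result

times   = [10, 11, 14, 7, 8, 20]
times_a = [10, None, 14, 7, None, None]  # Time where Parameter == "A"
groups  = [1, 1, 1, 2, 2, 2]
print(closest_in_group(times, times_a, groups))  # [10, 10, 14, 7, 7, 7]
```

Because the reference column is None outside Parameter A rows (like the If([Parameter]="A",[Time]) column), each group is matched only against its own A times.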
David Boot-Olazabal - Posted October 2

Hi Manon,

Great find, and thanks for sharing the solution. It could be very helpful for others as well.

Thanks,
David