YB1 Posted September 9

Hi, I am fairly new to Spotfire data functions and I am trying to start simple. I am running the K-means clustering from Spotfire (Tools menu). The results sometimes change, so I would like to preserve them after calculation by copying the values explicitly, rather than creating a column that only refers to the values. To be honest, I am struggling because I am not sure how to pass the input column (or should it be the full table?) and then use pandas to create and fill a new column in the same table. I guess once I manage that, I will be able to do more on my own, or use existing data functions as inspiration. I hope some kind Spotfire guru can give me a hand here. Thanks
Gaia Paolini Posted September 9

What version of Spotfire are you using, and do you have any knowledge of Python?
YB1 Posted September 9

Hi Gaia, thanks for answering. I am currently using 12.0.3. I have some basics of Python and pandas, and of course GPTs can help, but the main issue is interfacing with Spotfire: knowing how to obtain and write columns in Spotfire in a way that is compatible with pandas. I am sure it should be fairly simple, but I am missing the starting bricks. I did a few tries, also using AI GPTs, but failed miserably so far. I am more of a scientist than a software programmer, so I spend no more than 10% of my time on code, which doesn't help build the comfort.
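For reference, the "starting bricks" in a native Spotfire Python data function are just the parameter names you register in the data function dialog: an input of type Column arrives in the script as a pandas Series, an input of type Table arrives as a pandas DataFrame, and whatever you assign to a registered output name is written back to Spotfire. A minimal sketch, assuming an input Column parameter named kmeansLabels and an output Column parameter named frozenLabels (both names are placeholders you would define yourself when registering the function):

import pandas as pd

# kmeansLabels: registered input of type Column  -> arrives as a pandas Series
# frozenLabels: registered output of type Column -> written back to Spotfire
# Copy the values so the output no longer refers to the live K-means calculation
frozenLabels = pd.Series(kmeansLabels).copy()

In the data function's output configuration you can then choose to add frozenLabels as a new column in the same table.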
YB1 Posted September 9

Also, I wouldn't bother if the K-means did not appear to get corrupted every 5 minutes, which forces me to recompute/update it. This might have to do with the setup: I am using Spotfire in the Schlumberger DELFI environment, with ingestion from OSDU through Dataiku. Maybe the K-means results would be more stable outside of this setup.
Gaia Paolini Posted September 10

I do not understand what you mean by K-means getting corrupted. Maybe, if your data changes all the time, the clusters do change, or K-means is not the right method to capture the structure in your data?

I asked Spotfire Copilot to come up with a K-means Python data function (with column normalization, a search for the optimal number of clusters, and results in a separate table). It did a good job, with some debugging needed. I saved this example in Spotfire 14.0, which I hope you can open. Script below. (There may be a warning about cyclic dependencies; you can ignore it.)

First you need to have (or create) a column containing the id of each row. I created a calculated column called "idColumn" using the RowId() expression function.

The data function accepts the input data table (you can input all columns, as it will use the numeric columns only), the name of the id column, and a min/max number of clusters. If you want a pre-determined number of clusters K, just set min = max = your desired K.

The output is a separate table, which can be column-matched to the original one via this id column. I joined the original columns back to this table to visualize the results (see screenshot). If you want to calculate different clusters, you can change the output table of the data function.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np

# Isolate numeric columns and normalize
numeric_cols = inputData.select_dtypes(include=np.number)
scaler = StandardScaler()
normalized_data = scaler.fit_transform(numeric_cols)

# Determine the optimal number of clusters
range_n_clusters = list(range(minClusters, maxClusters + 1))
silhouette_avg = []
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(normalized_data)
    cluster_labels = kmeans.labels_
    silhouette_avg.append(silhouette_score(normalized_data, cluster_labels))

# Select the optimal number of clusters
optimal_clusters = range_n_clusters[silhouette_avg.index(max(silhouette_avg))]

# Apply KMeans with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_clusters, random_state=0).fit(normalized_data)
inputData['Cluster'] = kmeans.labels_

# Prepare the silhouette score curve data
curve_data = pd.DataFrame({'Clusters': range_n_clusters, 'SilhouetteScore': silhouette_avg})

# Outputs
outputData = inputData[[idColumnName, 'Cluster']].copy()
optimalCurve = curve_data

Attachment: k_means.dxp
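If you want to sanity-check the script logic outside Spotfire first (which can make debugging easier than iterating inside the data function dialog), the same steps can be run standalone, with a synthetic DataFrame standing in for the data function inputs. A minimal sketch; the column names, cluster range, and the two-blob test data below are made up purely for illustration:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-ins for the Spotfire data function inputs (names and values are made up)
rng = np.random.default_rng(0)
inputData = pd.DataFrame({
    'idColumn': [str(i) for i in range(100)],  # id kept as text so it is not clustered
    'A': np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)]),
    'B': np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)]),
})
idColumnName = 'idColumn'
minClusters, maxClusters = 2, 5

# Same steps as the data function body
numeric_cols = inputData.select_dtypes(include=np.number)
normalized_data = StandardScaler().fit_transform(numeric_cols)

range_n_clusters = list(range(minClusters, maxClusters + 1))
silhouette_avg = []
for num_clusters in range_n_clusters:
    labels = KMeans(n_clusters=num_clusters, random_state=0).fit(normalized_data).labels_
    silhouette_avg.append(silhouette_score(normalized_data, labels))

optimal_clusters = range_n_clusters[silhouette_avg.index(max(silhouette_avg))]
inputData['Cluster'] = KMeans(n_clusters=optimal_clusters, random_state=0).fit(normalized_data).labels_

outputData = inputData[[idColumnName, 'Cluster']].copy()
print(optimal_clusters)   # expect 2 for this two-blob test data
print(outputData.head())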
YB1 Posted September 10

This is what my time series look like: [screenshot]

After K-means, they should look the same, just with the K-means category colors on top. However, after a while, the time series look like this (I use column names as the color control, as I have 2 columns on the same plot): [screenshot]

So this is what I mean by the K-means values becoming corrupted. That is why I have been trying to generate a column that does not depend on the K-means process but contains the K-means results.

My data doesn't really change, but it is read from the Spotfire Library and needs to be associated with a token value to be loaded (Dataiku + SLB DELFI environment). So it could be that a refresh of the data triggers a mess in the K-means process. If I just re-run it, it looks fine, but doing that 20x a day is not.

Thanks for your support
David Boot-Olazabal Posted September 10

Hi YB1,

Not sure why your data looks weird after a while. You mentioned you had to add a token to connect to Dataiku + SLB DELFI. Could it be that this token becomes invalid after a while and you need to refresh it?

To go back to your original request, copying the data to another data table so that it becomes independent of the back end: could you do that with this IronPython script?

Kind regards,
David
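The linked IronPython script itself is not reproduced in the thread. For context, a commonly used IronPython pattern for snapshotting a data table into a new, independent in-memory table looks roughly like the sketch below; the table names are placeholders, and the linked script may differ in the details:

# IronPython script in Spotfire: copy the current values of a data table into a
# new table that no longer recalculates (table names below are placeholders)
from Spotfire.Dxp.Data.Import import DataTableDataSource

source = Document.Data.Tables["Data Table"]   # the table holding the K-means results
Document.Data.Tables.Add("K-means Snapshot", DataTableDataSource(source))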
YB1 Posted September 10

Yes, it could be the token. Refreshing it could be as annoying as refreshing the K-means. Regarding your second question, that was actually my first question. This is exactly what I want to do: copy the values using Python so that the results of K-means get preserved, token or not.
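If the data function route is preferred over IronPython, the same "copy the values" idea only takes a couple of lines, following the structure of Gaia's example. A minimal sketch, assuming a registered input Table parameter inputTable (containing at least the id column and the K-means result column), two string Value parameters idColumnName and kmeansColumnName, and an output Table parameter frozenTable; all of these names are placeholders:

# Spotfire Python data function: freeze the current K-means labels into a
# separate output table that can be column-matched back on the id column.
# inputTable       - registered input of type Table (arrives as a pandas DataFrame)
# idColumnName     - registered input of type Value (string)
# kmeansColumnName - registered input of type Value (string)
# frozenTable      - registered output of type Table
frozenTable = inputTable[[idColumnName, kmeansColumnName]].copy()
frozenTable = frozenTable.rename(columns={kmeansColumnName: 'Frozen K-means'})

Because frozenTable is a plain copy, it keeps its values until the data function is deliberately re-run, as long as the function is not set to refresh automatically.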
YB1 Posted September 10

David, thanks, I will give it a try.