I am not sure what you mean by K-Means getting corrupted. If your data changes all the time, the clusters will naturally change with it; alternatively, K-Means may simply not be the right method to capture the structure in your data.
I asked Spotfire Copilot to generate a K-Means Python data function (with column normalization, a search for the optimal number of clusters, and results in a separate table). It did a good job, with some debugging needed. I saved this example in Spotfire 14.0, which I hope you can open; the script is below. (There may be a warning about cyclic dependencies, which you can ignore.)
First you need to have (or create) a column containing the id of each row. I created a calculated column called "idColumn" using the RowId() expression function.
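If you are preparing the data in Python rather than with a Spotfire expression, an equivalent id column can be added in pandas. A minimal sketch, assuming `inputData` is the data-function input table and using the same column name `idColumn` (the sample frame here is hypothetical):

```python
import pandas as pd

# Sample frame standing in for the data-function input table
inputData = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})

# Add a 1-based row id, mirroring Spotfire's RowId() expression
inputData['idColumn'] = range(1, len(inputData) + 1)
```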
The data function accepts the input data table (you can input all columns, as it will use the numeric columns only), the name of the id column, and a min/max number of clusters.
If you want to have a pre-determined number of clusters K, just set min=max=your desired K.
The output is a separate table, which can be column-matched to the original one via this id column. I joined the original columns back to this table to visualize the results (see screenshot).
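The same column matching can be reproduced in pandas if you want to inspect the joined result outside Spotfire. A sketch with hypothetical tables, joining the cluster labels back onto the original rows via the id column:

```python
import pandas as pd

# Hypothetical original table and data-function output, both carrying the id column
original = pd.DataFrame({'idColumn': [1, 2, 3], 'value': [10.0, 20.0, 30.0]})
outputData = pd.DataFrame({'idColumn': [1, 2, 3], 'Cluster': [0, 1, 0]})

# Left join keeps every original row and attaches its cluster label
joined = original.merge(outputData, on='idColumn', how='left')
```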
If you want to compute alternative clusterings, you can point the data function at a different output table.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np
# Isolate numeric columns and normalize
numeric_cols = inputData.select_dtypes(include=np.number)
scaler = StandardScaler()
normalized_data = scaler.fit_transform(numeric_cols)
# Determine the optimal number of clusters
range_n_clusters = list(range(minClusters, maxClusters+1))
silhouette_avg = []
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(normalized_data)
    cluster_labels = kmeans.labels_
    # silhouette_score requires at least 2 clusters, so minClusters should be >= 2
    silhouette_avg.append(silhouette_score(normalized_data, cluster_labels))
# Select the optimal number of clusters
optimal_clusters = range_n_clusters[silhouette_avg.index(max(silhouette_avg))]
# Apply KMeans with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_clusters, n_init=10, random_state=0).fit(normalized_data)
inputData['Cluster'] = kmeans.labels_
# Prepare the silhouette score curve data
curve_data = pd.DataFrame({'Clusters': range_n_clusters, 'SilhouetteScore': silhouette_avg})
# Outputs
outputData = inputData[[idColumnName, 'Cluster']].copy()
optimalCurve = curve_data
k_means.dxp