Python toolkit for data science and machine learning in Spotfire

We describe what spotfire-dsml is and our motivation for creating a Python toolkit: to enhance data science and machine learning capabilities in Spotfire. We also introduce its components.

Introducing spotfire-dsml: Enhancing Data Science and Machine Learning in Spotfire

If you ever create data functions in Spotfire, our Python package will be a new, valuable resource for you. In the ever-evolving landscape of data science and analytics, ease of access and efficiency are of paramount importance. That's precisely where spotfire-dsml comes into play. This Python package is set to augment the way we approach data science within the Spotfire platform. With a robust vision, spotfire-dsml seeks to empower data scientists and analysts by significantly reducing the time to value for creating analytics-rich applications within Spotfire.

The vision behind spotfire-dsml is to create reproducible machine learning pipelines seamlessly integrated with Spotfire. This integration enhances the analytic capabilities within Spotfire, making it a powerhouse for data scientists and analysts across industries such as pharmaceuticals, high-tech manufacturing, and energy.

By providing ready-to-use Python functions spanning various data science, analytics, and data manipulation use cases, spotfire-dsml aims to democratize data science within Spotfire. The package aims to evolve continuously, ensuring it stays on top of the latest and greatest in data science.

What is inside spotfire-dsml?

The spotfire-dsml package includes the following modules:

ML Modeling (ml_modeling): Dive into pipeline-centric model training and evaluation. Whether you're a seasoned data scientist or just starting, this module equips you with the tools to build robust machine learning models effortlessly.
Time Series (time_series): Time series can be messy and challenging to work with. This module contains functions for time-series preprocessing, smoothing, decomposition, pattern exploration, and forecasting, helping keep your analyses fast, accurate, and reliable.
NLP (nlp_preprocessing): For those delving into the world of text analytics, this module offers pipeline-centric preprocessing solutions. It simplifies text data preparation, a critical step in natural language processing tasks.
Explainability Module (ml_explain): Uncover the mysteries of model explainability using the XWiN methodology. Gain insights into your models, making your predictions more transparent and trustworthy.
Monitoring Module (ml_drift): Detect and measure drift in your models with ease. Keeping your models up-to-date and accurate is crucial. This module simplifies the process by enabling you to decide when to trigger a new rebasing or retraining process.
Distribution Fitting (distribution_fitting): Distribution fitting and normality testing are useful, and at times even critical, processes across numerous industries. This module aims to simplify them with functions that can be applied to full datasets, rather than working with one column at a time.
Missing Data (missing_data): Across industries, handling missing data is a crucial step in any project. Without properly handling missing data, models can become biased, and results can be inaccurate. This module aims to simplify the process by summarizing, removing, and imputing missing values for tabular data.
Geoanalytics (geo_analytics): Geospatial analytics involves analysing data with a focus on location awareness, managing relationships between different locations, and measuring quantities at various sites. Location data poses additional challenges: the Earth is not flat, and it is not a sphere or even a perfect ellipsoid. This module aims to simplify the process of handling geospatial data in Spotfire by integrating diverse coordinate reference systems, creating and transforming shapes, performing spatial joins and proximity analysis, and exporting geographic datasets as Shapefiles or GeoJSON.

Let's delve deeper into each module, providing detailed insights into how spotfire-dsml can transform your data science workflows.

ML Modeling

This module addresses the need to train reproducible machine learning models. The functions within this module help users create general-purpose Python pipelines for regression and binary classification on tabular data (to begin with). In most cases, data preprocessing must be done before a model can be trained. Pipelines that include both preprocessing and model training are therefore unavoidable: every step, including learned preprocessing steps such as imputation or encoding, must be executed again, unchanged, at scoring time.
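To make the point about learned preprocessing concrete, here is a minimal sketch written directly against scikit-learn. It does not use the spotfire-dsml API; the column names, the toy data, and the choice of logistic regression are assumptions made purely for illustration. The imputation value, scaling parameters, and category encoding are all learned during fit and then reapplied automatically when new rows are scored.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data: one numeric column with a gap,
# one categorical column, and a binary target.
train = pd.DataFrame({
    "temperature": [20.1, 22.4, np.nan, 30.2, 28.7, 19.5, 31.0, 21.3],
    "machine": ["A", "B", "A", "C", "C", "A", "C", "B"],
    "failed": [0, 0, 0, 1, 1, 0, 1, 0],
})
features = ["temperature", "machine"]

# Learned preprocessing: the median and the scaling parameters come from
# the training data, as does the set of known categories.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["temperature"]),
    ("cat", categorical_steps, ["machine"]),
])

# Preprocessing and the estimator live in one object, so they are
# fitted together and applied together.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(train[features], train["failed"])

# Scoring time: the new rows contain a missing value and an unseen
# category; the fitted pipeline handles both with what it learned.
new_rows = pd.DataFrame({"temperature": [np.nan, 29.5], "machine": ["B", "D"]})
predictions = model.predict(new_rows)
```

Because the whole pipeline is a single fitted object, it can be stored and reused for scoring without re-implementing any preprocessing by hand.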

Learn More! - for Python developers using this package

Example Usage - download an example Spotfire application (dxp) whose data functions call functions from this spotfire-dsml package

Time Series

Time series can be messy and challenging to work with. This module makes time series analytics more accessible by providing functions for preprocessing, smoothing, decomposition, pattern exploration, and forecasting that address common issues such as irregular sampling and gaps in the data. Functionality includes normalization on both the time and measurement axes, resampling, missing-value imputation, and a handful of different smoothing techniques. With these functions, understanding your time series will be both easier and faster.
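As a rough illustration of these preprocessing steps, the sketch below uses plain pandas rather than the spotfire-dsml functions, whose names and signatures are not shown here. The readings, the hourly grid, linear interpolation, min-max scaling, and the three-point moving average are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sensor readings taken at irregular times, with no
# observation at all between 02:00 and 03:00.
timestamps = pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 00:07", "2024-01-01 00:31",
    "2024-01-01 01:02", "2024-01-01 01:58", "2024-01-01 03:05",
])
raw = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0], index=timestamps)

# Resampling: put the series on a regular hourly grid (mean per hour).
# The empty 02:00 bucket becomes NaN.
hourly = raw.resample("60min").mean()

# Missing-value imputation: fill the gap by linear interpolation.
filled = hourly.interpolate(method="linear")

# Normalization on the measurement axis: min-max scale to [0, 1].
normalized = (filled - filled.min()) / (filled.max() - filled.min())

# Smoothing: centered three-point moving average; min_periods=1 keeps
# the first and last points instead of turning them into NaN.
smoothed = normalized.rolling(window=3, center=True, min_periods=1).mean()
```

The order matters in this sketch: resampling first exposes the gap explicitly, so the imputation step has something to fill before scaling and smoothing are applied.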