Need help for splitting table and using a ML algorithm

veni · July 6

Hi! I want to split a data table into training and validation datasets. I am thinking of an 80% (training) / 20% (validation) split, stratified if possible.

The other part of my question is that I also want to run a Random Forest prediction on a categorical variable, but as far as I know Spotfire has no built-in Random Forest, only Logistic Regression and Decision Tree.

I have searched on Google for a while and am still having difficulty getting these two things done. I would appreciate any help.

David Boot-Olazabal · July 8

Hi veni,

Have you already had the chance to look at our data functions library?
https://community.spotfire.com/files/category/8-data-functions/

You may find a couple of data functions that could help you out.

And here is a direct link to a random forest data function:

The data splitting part is kind of different. What do you expect the code in Spotfire would do for you? Just the splitting or also the whole pipeline?
If you could elaborate a bit more on that, it would give us a better idea of what you're trying to achieve.

Kind regards

David

Edited July 8 by David Boot-Olazabal

Vincent Thuilot · July 10

Hi Veni,

In your place I would consider writing everything directly in Python from the start: you would have better control over all aspects of your algorithm, and you would be able to perform a stratified split.

Through data functions, you could use scikit-learn (make sure to import it first) and call the train_test_split function as usual, passing the target column to its stratify argument.

Something like this:
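A minimal sketch of such a stratified 80/20 split, assuming the table arrives as a pandas DataFrame with a binary `Diabetes` target column. The sample data and the names `df`, `train_df` and `valid_df` are made up for illustration; in a Spotfire data function the input table would take the place of `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Spotfire input table:
# 20 rows, 15 without diabetes (0) and 5 with diabetes (1).
df = pd.DataFrame({
    "Age": list(range(30, 50)),
    "Diabetes": [0] * 15 + [1] * 5,
})

# 80% training / 20% validation, keeping the class proportions
# of "Diabetes" the same in both parts (stratified split).
train_df, valid_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df["Diabetes"],
    random_state=42,  # fixed seed so the split is reproducible
)

print(train_df["Diabetes"].value_counts())
print(valid_df["Diabetes"].value_counts())
```

Both returned tables keep every column of the original table; only the rows are divided.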

Gaia Paolini · July 10

If you want to go down the Python route, there is a Python module called Spotfire DSML that can be downloaded and used within Spotfire.

One of the modules is ml_modelling. It offers an end-to-end modelling process.

Once you split your data, you need to do something with the training and test data.
If you have missing data or categorical variables, you need to handle those consistently in the training and test data.
Once you have a model, you need to validate it with the test data and usual performance indicators.

So it is advisable to be able to handle the entire process.
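To illustrate the steps above, here is a rough sketch of an end-to-end flow using plain scikit-learn rather than Spotfire DSML (whose API is not shown in this thread). The synthetic data, the column names `Age` and `Sex`, and the choice of median imputation and one-hot encoding are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic, made-up data standing in for the real table.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(20, 80, n).astype(float),
    "Sex": rng.choice(["F", "M"], n),
    "Diabetes": rng.integers(0, 2, n),
})
df.loc[::25, "Age"] = np.nan  # inject a few missing values

X = df[["Age", "Sex"]]
y = df["Diabetes"]

# Stratified 80/20 split on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Preprocessing is fitted on the training data only and then applied
# unchanged to the test data, so both are handled consistently.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Validate on the held-out data with the usual indicators.
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
print(classification_report(y_test, y_pred))
```

Wrapping the preprocessing and the model in one `Pipeline` is a design choice: it prevents the imputer and encoder from seeing the test rows during fitting.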
Useful links:

veni · July 14

On 7/8/2024 at 2:11 PM, David Boot-Olazabal said:

The data splitting part is kind of different. What do you expect the code in Spotfire would do for you? Just the splitting or also the whole pipeline?

Hi David,

My dataset has a categorical variable "Diabetes", which records whether a person has diabetes ("1") or not ("0").

What I want is a stratified split of my dataset into training and validation datasets, so that both keep the same proportion of each "Diabetes" class.

Regards,
Veni

veni · July 14

I also have some calculated columns in Spotfire, so I would want the split to retain those calculated columns.
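One way to keep every column is to label rows instead of producing two separate tables. The sketch below assumes the calculated columns reach the Python code as ordinary input columns; `AgeSquared` is a made-up stand-in for one, and the `Split` column name is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table; "AgeSquared" stands in for a Spotfire calculated column.
df = pd.DataFrame({
    "Age": list(range(30, 50)),
    "AgeSquared": [a * a for a in range(30, 50)],
    "Diabetes": [0] * 15 + [1] * 5,
})

# Split the row labels rather than the table, stratified on "Diabetes".
train_idx, valid_idx = train_test_split(
    df.index.to_numpy(),
    test_size=0.2,
    stratify=df["Diabetes"],
    random_state=42,
)

# Add a marker column; all existing columns are left untouched.
labelled = df.copy()
labelled["Split"] = "train"
labelled.loc[valid_idx, "Split"] = "validation"

print(labelled["Split"].value_counts())
```

The resulting table can then be filtered on `Split` to get the training or validation rows with all original columns intact.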

Gaia Paolini · July 15

You said you wanted to do modelling after splitting. Could you elaborate on what your goals are? Maybe post a sample dataset?
