Need help splitting a table and using an ML algorithm


veni

Hi! I currently want to split a data table to be used as training and validation datasets. I am thinking of splitting it 80% (training) 20% (validation) via a stratified split (if possible).

The other part of my question is that I also want to run a Random Forest prediction on a categorical variable, but as far as I know Spotfire does not include one, only Logistic Regression and Decision Tree.

I have searched on Google for a while and am still having difficulty getting these two things done. I would appreciate any help.

Hi veni,

Have you already had the chance to look at our data functions library?
https://community.spotfire.com/files/category/8-data-functions/

You may find a couple of data functions that could help you out.

And here is a direct link to a random forest data function:

The data splitting part is a different matter. What do you expect the code in Spotfire to do for you? Just the splitting, or the whole pipeline as well?
If you could elaborate a bit more on that, it would give us a better idea of what you're trying to achieve.

Kind regards

David


Hi Veni, 

In your place I would consider writing everything directly in Python from the start: you would have better control over all aspects of your algorithm, and you would be able to perform a stratified split.

Through data functions, you can use scikit-learn (make sure to import it first) and call the train_test_split function as usual; its stratify argument gives you the stratified split.

Something like this:

[Image: screenshot of example code, not reproduced here]
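Since the screenshot is not available, here is a minimal sketch of what a stratified 80/20 split with train_test_split could look like. The table and the column names ("age", "label") are made up for illustration; in a Spotfire data function, df would be your input table and the two resulting tables would be mapped to outputs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical example table: 50 rows, 20% of them in class 1.
df = pd.DataFrame({
    "age": range(20, 70),
    "label": [0] * 40 + [1] * 10,
})

# 80% training / 20% validation.
# stratify keeps the class proportions of "label" the same in both parts;
# random_state makes the split reproducible.
train_df, valid_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df["label"],
    random_state=42,
)

print(train_df["label"].value_counts())
print(valid_df["label"].value_counts())
```

Without stratify, a small or imbalanced class can end up under-represented in the validation table purely by chance.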


If you want to go down the Python route, there is a Python module called Spotfire DSML that can be downloaded and used within Spotfire.

One of the modules is ml_modelling. It offers an end-to-end modelling process.

 

  • Once you split your data, you need to do something with the training and test data.
  • If you have missing data or categorical variables, you need to handle them consistently in the training and test data.
  • Once you have a model, you need to validate it with the test data and usual performance indicators.

So it is advisable to be able to handle the entire process.
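To illustrate the points above, here is a rough end-to-end sketch using plain scikit-learn rather than the Spotfire DSML module (whose API is not shown in this thread). The data is synthetic and the column names ("bmi", "smoker", "target") are assumptions for the example only. Wrapping the preprocessing and the random forest in one Pipeline is one way to make sure missing values and categorical columns are treated identically in the training and test data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic, purely illustrative data.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "bmi": rng.normal(27, 4, n),
    "smoker": rng.choice(["yes", "no"], n),
    "target": rng.integers(0, 2, n),
})
df.loc[::25, "bmi"] = np.nan  # inject a few missing values

X = df[["bmi", "smoker"]]
y = df["target"]

# Step 1: stratified 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: preprocessing is fitted on the training data only and then
# re-applied unchanged to the test data by the pipeline.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["bmi"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
])

# Step 3: train, then validate on the held-out test data.
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print(f"Test accuracy: {accuracy:.2f}")
```

Because the target here is random noise, the accuracy will be close to chance; the point is only the structure of the process.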
Useful links:

 


In reply to David Boot-Olazabal's post of 7/8/2024, 2:11 PM:

Hi David,

My dataset has a categorical variable "Diabetes", which records whether a person has diabetes ("1") or not ("0").

What I want to do is a stratified split of my dataset into training and validation datasets.

Regards,
Veni
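For this specific case, one possible approach is to keep a single table and add a column that flags each row as training or validation, stratified on "Diabetes". The sketch below uses only pandas; the table contents and the "PatientId" and "Split" column names are assumptions for illustration, and only "Diabetes" comes from the description above.

```python
import pandas as pd

# Hypothetical stand-in for the real table: 30 rows with diabetes, 70 without.
df = pd.DataFrame({
    "PatientId": range(100),
    "Diabetes": [1] * 30 + [0] * 70,
})

# Draw 20% of the rows from each Diabetes group for validation.
valid_idx = (
    df.groupby("Diabetes")
      .sample(frac=0.2, random_state=42)
      .index
)

# Flag every row; everything not sampled above is training data.
df["Split"] = "training"
df.loc[valid_idx, "Split"] = "validation"

# Check that both parts keep the same share of Diabetes = 1.
print(df.groupby("Split")["Diabetes"].mean())
```

A flag column like this can be convenient in Spotfire because the training and validation rows can then be separated with an ordinary filter.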
