Jump to content
  • Data Transformations Inside Spotfire Statistica®


    This article will describe how to apply the data transformations within Spotfire Statistica® (or simply Statistica) together with highlighting the most typical transformations in the form of examples.

    Introduction

    Data transformations are an important part of any data analytics project, the ability to transform the raw data into the form for modeling is something, which each tool for data analysis needs to have. Data transformations implementation in Spotfire Statistica® (or simply Statistica) consists of several functionalities. All are available in menu Data. Most typically, the data analyst will chain multiple data transformation functionalities in order to conduct all the data preparation steps needed.

    In this article, we would like to address options of the functionality Data->Transforms (node name inside visual workflow environment would be Transformations of Variables).

    This Transforms functionality is dedicated to row-based computations which means we would like to define a new variable (column) or change an existing variable based on information from other variables using the same formula for each row. For example computing some mathematical or statistical calculation (logarithm of variable, inserting random numbers, creating conditional computations, extracting information from text or date, these are examples suited for Transforms functionality).

    Let us mention briefly the other useful functionalities for data transformations for which Transforms functionality is not the best option or not a suitable option:

    • Data->Stack->Stacking/Unstacking: It is changing the data shape to a stacked/unstacked form
    • Data->Redode: It is dedicated to the creation of categorical variables based on multiple conditions (typically for the smaller number of conditions 2-3 this can be done also using Transforms functionality)
    • Data->Standardize: Dedicated functionality for standardization
    • Data->Transpose: In case, the user needs to flip the data (after the operation rows are columns and columns are rows)
    • Data->Filter/Recode->...: In this menu, there are various statistical methods for handling missing data, outliers, duplicates, etc.
    • Data->Merge->...: This functionality is for merging tables together (a detailed "how to" article together with examples on this topic can be found here)
    • Data->Rules Builder: This is a more generic functionality that is mainly dedicated to creating computed variables based on a larger number of nested if-else conditions, but part of it is also utilizing the very same functions as Transforms functionality uses and which will be described in more detail below

    How to use it

    Let us describe the details of how to define and apply your transformations.

    Inside the node Transformations of Variables, click on the Edit Parameters button. In this dialog, you can insert your transformations...

    Basic rules for writing transformations

    • In the Transformations of Variables node, you can define one or more transformations, which will be conducted on the connected data after running the node.
    • Each row is one transformation and needs to have the following form: variable name is equal to the formula that should be computed (see examples below).
    • You can reference the variables based on either their name or based on a variable number (e.g. v3=v1+v2 means that there will be inserted sum of the first and the second variable into the third one, no matter what the names of variables are). If the user is referencing variables by name and if there is a space character in the variable name, the user needs to insert the whole variable name inside double quotes
    • You can rewrite the existing variable or insert a new one into the data set. Simply, if the variable name (before the equation sign) does not match any existing variable names then the new variable will be created and placed at the end of the resulting spreadsheet.  
    • If you are referencing a category (text value), you need to insert it into double quotes.
    • Do not forget that the formula is applied for the whole data (all rows have the same formula). Most of the functions are based on the values from the same row (which is logical concerning rows representing separate independent cases of gathered data which should be handled in the same way - customers/patients/censor readings in different time points/etc.). There are only several exceptions using information from different rows than actually computed one (like cusum, lag, etc.).

    If you want to find a function or review the definition of the function, please click on the Functions button. In the new dialog, you can browse available functions inserted into several logical groups (like MathDistributionsDate/Time, etc.).

    transforms_dialog.jpg.85ea0b933f252f428ab4d62a7a7276f1.jpg

     

    Examples and important functions

    In this section, we will highlight several very typical functions together with examples of how to use them. If you want to review the setting and results of all examples by yourself, please download the Workspace file from the attachment of this page.

    • Mathematical transformations (Example 1): This example shows a classical situation, I have a variable and I want to construct various transformations of it like square root, logarithm, etc. (mathematical transformations can be found in the Math category). Formulas will look for example this way (Yield is the name of the variable being transformed and we are constructing 5 new computed variables):
     Yield_log=log(Yield) 
     Yield_sqrt=sqrt(Yield) 
     Yield_2=Yield^2 
     Yield_3=Yield^3 
     Yield_4=Yield^4 
     
    • Datetime transforms (Example 2): If you have in the data date/time variable, it is typically useful to get from this variable info about the year, month, hour, or day of the week. There are several useful functions from the Date/Time category which are able to extract this info, they start with DT and the only parameter is typically the name of the variable containing date/time info.  Another useful date time variable could be DTTODAY or DTNOW - they are bringing back actual dates (date and time respectively) which can be useful if the user wants to compare the actual dates with dates in the column. Please look at Example 2. Here is the syntax used:
    Year=DTYear(Date) 
    Month=DTMonth(Date) 
    "Day of Month"=DTDay(Date) 
    "Day of week"=DTDayOfWeek(Date) 
    "Days from today"=DTToday-Date 
    example2.jpg.8485d4ff083b8e80d2341725741da189.jpg

    If you want to create a concrete date you can use DTDATE(year; month; the day [, hour, minute, second, millis]) function which will construct the date from inserted parameters.

    • Conditional computations (Example 3): The possibility to insert conditional computation is an essential function and it is delivered via operator iif. Please, do not confuse it with the name if (IF function is already allocated for inverse function for F distribution). You simply insert after equation sign the statement iif(condition; expression1; expression2). In other words, if "condition" is fulfilled, then "expression1" is returned from the function, or else "expression2" is returned. In the example below (first line), if the temp1 value is missing, insert into the new variable the value temp2, if not missing then leave there temp1. The last formula is to assign categories TrainTest, and Validation randomly where the Train category will be placed with a probability of 0.5, Validation and Test with 0.25 (this is an example of using iif function twice in one formula).
     temp1_2=iif(IsMD(temp1)=1;temp2;temp1)
     temp2_2=iif(IsMD(temp2)=1;temp1;temp2) 
     RandomNumber=rnd(1) 
     Sample_Identifier=iif(RandomNumber>=0.5; "Train"; iif(randomNumber>0.25;"Test";"Validation"))
     
    • Missing data indicator (inside Example 3): If you want to act based on missing data presence, IsMd is the function for that. It brings back 1 if there is a missing value in a specified variable, and 0 otherwise.
    • Random number generator (inside Example 3): There is a wide range of probabilistic functions and one kind of them are random number generators from various distributions. In Example 3 we are using function rnd which is generated from a uniform distribution. This is useful anytime you want to have some randomness inserted into the data.
    • Row Number (inside Example 4): There is a special function V0V0 is inserting case numbers (row numbers) into the column. Useful for example when you want to store information about original sorting before resorting to the file. Also, it can be used for filtering purposes for example inside the iif function (e.g. filter first 5 cases - as in Example 4)
    • Extracting names from the row headers (inside Example 4): Statistica spreadsheet allows to have text information in the header of the row. This is useful mainly for orientation. Once you want to use this information for computations, you can utilize the function CaseName. In Example 4, we have output from the Feature Selection node bringing back the 10 most important variables, to get variable names from the row header to the actual variable, we are using the CaseName function. 
    Ranking=v0 
    Best5=iif(v0<=5;1;0) 
    "Predictor Name"=CaseName 
    example4.jpg.86f871e53db8453ad66c625b0e04c26e.jpg
     
    • Text operations (Example 5): There is a wide range of text operations. Let us highlight several functions, one is len (a simple function for defining the length of the string). For trimming you can use mid (you are defining variable name, start, and length of the trimming). For finding concrete substrings in the text you can use Contains function which is bringing back the index where the substring starts. If you want to use concatenate (merging strings), you can simply use the + operator, the result will be a merge of the strings (see last formula below merging text variable and fixed string).  
    PredictorName_length=len("Predictor Name") 
    Par_Suffix=mid("Predictor Name";4; Len("Predictor Name")-3) 
    Par_Number=ToNumber(mid("Predictor Name";5; Len("Predictor Name")-4)) 
    "Name includes 1"=Contains("Predictor Name";"1") 
    "Name includes 4"=Contains("Predictor Name";"4") 
    RefName="Predictor Name" + ": FS" 
    example5_1.jpg.a5f35f5b09fe245cf61b1b879df9d724.jpg
     
    • Functions using info from more rows (Example 6): As was already mentioned there are also several functions using information from other rows, not only computed one. The most typical ones are lagcusum (used in Example 6). There is also function Data which is able to reference specific cell locations. 
    • The number of cases (included after Example 6): If for some reason, you need to include in your computations the number of rows, there are function NCases for that. As an example, you can see the usage of this function inside the condition of Subset to bring back only the last row in the data (Subset after example 6).

    Tips and Tricks

    • By Group option

    In most of the Statistica functionalities, there is a button inside the node settings to do By Group analysis. It is not that typical to use By Group in Transformations but still in some use cases it is useful. For example in the situations when you want to use some function that is not used only the actual row and at the same time if you want to do the same operation separately for each group. Example 7 is showing how to calculate the cumulative sum per customer (The By Group option is enabled in this node).

    • Simpler referencing of variables

    There is a function called VCUR which is the current variable, so if I am changing variable 3 and I will insert after equation sign in formula VCUR, it means the third variable. This is useful for simplifying formulas. An example of that can be found in Example 8 below.

    • Inserting missing data values

    Inside the variable specifications, you have also the information on how the missing data are represented ("MD code"). This means whenever you will write this MD code into the cell, this cell will be empty and methods will treat this cell as a missing value. If you want to insert empty spaces into the variable based on some condition, you can use this number in your formula. For example, the following formulas will insert missing data instead of the value "unknown" (the variable in question has MD code=-999999998). This is Example 8 and these 4 resulting variables will have the missing value instead of the "unknown" category. 

     v2=iif(vcur="unknown";-999999998;vcur) 
     v3=iif(vcur="unknown";-999999998;vcur) 
     v4=iif(vcur="unknown";-999999998;vcur) 
     v5=iif(vcur="unknown";-999999998;vcur) 
     
    • What if the variable has the same name as an existing function

    The simple rule is to avoid variable names with the same name as some function inside the tool. If you cannot avoid that, the variable name has priority. In case you want to specify that you want to use the function and not the variable, insert the special symbol @ before the name of the function.

     Sum=@sum(v1:v5)
     

    Important Remarks

    • Formulas themselves are executed in the order they are specified. This is all done case by case - for case 1 we do all formulas, for case 2 we do it again. etc.

    • Functions, variable names as well as strings inside the conditions are not case sensitive, therefore you can reference all the values with small letters.

    • How is it with the local setting - Statistica uses the setting of the decimal separator from Windows local settings. Once formulas are inside the nodes, they will automatically change based on changed local settings. Let us say we have formula Sample_Identifier=iif(RandomNumber>=0.5; "Train"; iif(randomNumber>0.25;"Test";"Validation")) in environment with dot as decimal separator, the same formula with comma separator will look this way: Sample_Identifier=iif(RandomNumber>=0,5; "Train"; iif(randomNumber>0,25;"Test";"Validation"))

    • The formulas can be used in other nodes as wellThe same functions and syntax can be used for example in the following places:
      • In the variable specification dialog in place for Long name, you can insert a function (syntax is inserting equation sign and formula, name of the variable before equation sign is not needed) and the variable will be recomputed (but this is useful only outside visual workflow usage because this operation is not recorded within the workflow).
      • A recorded version of the previous point would be using a formula in the Variable Format/Rename node where you can insert the formula in the Long name field in the variable specs dialog (this operation will be recorded by the Variable Format/Rename node). An example of that can be found in the example workspace in Variable Format/Rename (alternative to Example8)
      • In the Subset node where the functions can help define the case filter conditions (please see Subset after Example 6)
      • Rules Builder node where you can use functions for conditions as well as the definition of variables in the same way as in Transformation of Variables node.
    • How to save the data function transforms - In some special situations, you might benefit from the possibility of saving the transformations as a file or load the transformations from the file (buttons Save As and Open in the dialog for the definition of transformations). But typically these transforms will be saved inside the settings of the node for transformations.

    Relevant Links

    Attachment

    You can download examples mentioned in this article in this zip archive: examples_of_transformations (2).zip


    User Feedback

    Recommended Comments

    There are no comments to display.


×
×
  • Create New...