Data Science Term Glossary
A B C D F G H I J M N P R S T U
A
Analytics
The discovery, interpretation, and communication of meaningful patterns in data and the process of applying those patterns towards effective decision making. Traditional analytics is often used synonymously with Business Intelligence.
Advanced Analytics
The use of statistical and mathematical methods to generate business insights from data, beyond the simple aggregations of standard analytics or business intelligence. The main aim is generating models that predict future behaviour. Advanced Analytics is used more-or-less synonymously with "predictive analytics", "data science", and "machine learning".
Algorithm
A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.
Anomaly Detection
The identification of anomalous records in a dataset via analytics or advanced analytics techniques. These are records that differ markedly from the majority of the data. Although anomalies can be of various kinds, the term tends to be used as a synonym for outliers, which are unusually large or unusually small observations.
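A minimal sketch in Python of flagging outliers, assuming scikit-learn is installed (the data and the choice of IsolationForest are illustrative, not part of this definition):

    # Flag anomalous records: IsolationForest marks outliers with -1.
    from sklearn.ensemble import IsolationForest

    X = [[10.0], [11.0], [10.5], [9.8], [95.0]]  # one unusually large value
    labels = IsolationForest(contamination=0.2, random_state=0).fit_predict(X)
    print(labels)  # e.g. [ 1  1  1  1 -1]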
Apache Hadoop (see Hadoop)
Apache Spark (see Spark)
Artificial Intelligence
The creation of machines (especially computer systems) that exhibit human-like intelligent behavior. AI systems are capable of learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions) and self-correction. The difference from Machine Learning is that AI systems can adapt beyond their training examples.
Asset Management
The management of investments on behalf of individuals. The term also applies to managing other organizations' or companies' investments.
Augmented Analytics
An approach that uses machine learning and natural language generation to automate data preparation, insight discovery and data science. It is especially targeted at users without advanced data science skills, removing repetitive, time-consuming and error-prone tasks, and is often described as the next wave of disruption in the data and analytics market. Examples are suggestion engines, chat-bots and AutoML.
AutoML
Automated machine learning (AutoML) is the process of automating the end-to-end process of machine learning from data preparation to model scoring and model selection.
B
Big Data
Big data is data that has one or more of the following attributes: high Volume (e.g. a dataset that does not fit in a standard computer's memory), high Velocity (data that is produced at a very high rate, for instance social media tweets or sensor data), high Variety (data that comes in diverse forms, such as text, audio and video). Big data demands a shift in data storage capabilities and set-up, processing power, human skills and analytics techniques.
Bioinformatics (see Informatics)
C
Chemoinformatics (see Informatics)
Classification
In statistical modeling, classification is a set of statistical processes for estimating the relationship between a dependent variable (response) and a set of independent variables (predictors). As opposed to Regression, the response is expected to be a variable with only a few distinct values. Binary Classification refers to a response with two values (e.g. Yes, No) whereas Multinomial Classification handles responses with more than two values (e.g. Low, Medium, High).
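A minimal binary-classification sketch in Python, assuming scikit-learn is installed (the predictors, values and model choice are illustrative):

    # Binary classification: predict a Yes/No response from two predictors.
    from sklearn.linear_model import LogisticRegression

    X = [[25, 5.0], [40, 6.4], [35, 5.8], [50, 9.0]]  # e.g. age, income
    y = ["No", "Yes", "No", "Yes"]                    # binary response

    model = LogisticRegression(max_iter=1000).fit(X, y)
    print(model.predict([[45, 7.0]]))  # e.g. ['Yes']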
Correlation
A statistical measure that indicates the extent to which two or more variables change together. Correlation ranges from -1 to 1. A positive correlation means the variables increase or decrease together; a negative correlation means that one variable increases as the other decreases.
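A quick way to compute a correlation in Python, assuming NumPy is installed (the height/weight data is made up for illustration):

    # Pearson correlation coefficient, ranging from -1 to 1.
    import numpy as np

    height = np.array([150, 160, 170, 180, 190])
    weight = np.array([50, 58, 66, 74, 82])
    print(np.corrcoef(height, weight)[0, 1])  # 1.0: perfectly linear increase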
D
Data Access
Data access refers to a user's ability to access or retrieve data stored within a database or other repository. Users who have data access can store, retrieve, move or manipulate stored data.
Data Aggregation
Data aggregation is normally interpreted as summarization, a process in which information is gathered and expressed in a summary form. The simplest form of data aggregation is the average or mean. A common purpose of data aggregation is to compare the information expressed by a given variable grouped by another variable. For instance, the average of Age grouped by Gender.
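The Age-by-Gender example above could look like this in Python, assuming pandas is installed:

    # Average of Age grouped by Gender.
    import pandas as pd

    df = pd.DataFrame({"Gender": ["F", "M", "F", "M"],
                       "Age":    [34,  29,  40,  51]})
    print(df.groupby("Gender")["Age"].mean())  # F: 37.0, M: 40.0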
Data Blending
The process of combining data from multiple sources into a single target dataset containing meaningful information. To blend data effectively, we need to be able to map the meaning of the attributes from the different data sources into a unified model, or schema.
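A minimal blending sketch in Python, assuming pandas is installed (the tables and column names are illustrative): two sources are mapped onto a shared key and combined into one dataset.

    import pandas as pd

    crm   = pd.DataFrame({"cust_id": [1, 2], "name": ["Ada", "Bo"]})
    sales = pd.DataFrame({"customer": [1, 2], "total": [250, 90]})

    # Unify the schema: both tables must agree on the key attribute's name.
    sales = sales.rename(columns={"customer": "cust_id"})
    print(crm.merge(sales, on="cust_id"))  # one row per customer, all columns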
Data Cleaning
The first step of data preparation, mainly aimed at removing gross inaccuracies and handling missing data in an input dataset.
Data Integration (see Data Blending)
Data Mining
The process of extracting information from large datasets, usually from databases. It is also known as knowledge discovery from databases and effectively used as a synonym of advanced analytics when applied to databases.
Data Preparation
A set of techniques for processing a dataset into a form that is suitable for statistical analysis or machine learning. Data preparation is the first step in data analytics projects and it typically involves data cleaning and feature engineering.
DataRobot
DataRobot is an automated machine learning (AutoML) platform, as well as the name of the company that develops and sells it. It is aimed at providing a code-free, rapid way of developing and deploying machine learning and AI models en masse.
Data Science
A multidisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. This covers all aspects from data preparation, data transformation, data analysis, modelling, utilizing machine learning and AI, to communication and design of visual representations of models, and data.
Data Streaming
Data that is generated continuously, potentially from many data sources, which typically send data records simultaneously and in small sizes (on the order of kilobytes). Streaming data includes a wide variety of data, such as log files generated by customers using your mobile or web applications, ecommerce purchases, in-game player activity, information from social networks, financial trading floors or geospatial services, and telemetry from connected devices or instrumentation in data centers.
Data Virtualization
Data integration performed in real time, without storing the product of the integration in a data warehouse.
Deep Learning
A subset of machine learning using large artificial neural networks (algorithms inspired by the workings of the human brain). It has become the methodology of choice for large-scale machine learning on unstructured data, such as computer vision, speech recognition, natural language processing and audio recognition.
F
Feature
Often used as a synonym of field or predictor, to indicate an independent variable (original, or more often derived) that is useful for statistical analysis or machine learning.
Feature Engineering
An umbrella term encompassing all techniques that transform existing features, create new features or remove useless features, for the purpose of facilitating machine learning: for instance, Feature Extraction, Generation or Selection.
Feature Extraction
The process of extracting new features from existing ones. An example of feature extraction is generating Year, Month and Day out of a date field. The idea is that these new fields may contain useful information that is masked in the original field, and that is deemed useful for pattern recognition or machine learning.
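The Year/Month/Day example above in Python, assuming pandas is installed:

    # Extract Year, Month and Day features from a date field.
    import pandas as pd

    df = pd.DataFrame({"date": pd.to_datetime(["2019-03-01", "2020-11-15"])})
    df["Year"]  = df["date"].dt.year
    df["Month"] = df["date"].dt.month
    df["Day"]   = df["date"].dt.day
    print(df)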
Feature Generation
This is the process of taking raw, unstructured data and defining features for potential use in statistical analysis. For instance, in the case of text mining you may begin with a raw log of thousands of text messages (SMS, email, social network messages, etc.) and generate features by removing low-value words (stopwords), using blocks of words of a certain size (n-grams) or applying other rules.
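A minimal sketch in Python of generating features from raw text, assuming scikit-learn is installed (the messages are made up, and CountVectorizer is one possible tool among several):

    # Generate numeric features from text: drop stopwords, count 1- and 2-grams.
    from sklearn.feature_extraction.text import CountVectorizer

    messages = ["meeting moved to friday", "please confirm the meeting time"]
    vec = CountVectorizer(stop_words="english", ngram_range=(1, 2))
    X = vec.fit_transform(messages)     # one row of counts per message
    print(vec.get_feature_names_out())  # the generated features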
Feature Importance
A ranking, in order of usefulness, of the features that a machine-learning algorithm used when training; it is normally associated with a particular algorithm. Algorithm-agnostic feature importance techniques also exist, providing this ranking without being tied to a particular machine-learning algorithm.
Feature Selection
The process of filtering out features (fields) that are noisy, duplicated or irrelevant for the purpose of machine learning. Feature selection tends to be used when the number of fields is so high that machine learning becomes impractical. Feature selection is not an exact science and must be performed with care, in order to avoid removing important features.
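A minimal sketch in Python, assuming scikit-learn is installed (the data and the two filters shown are illustrative):

    # Drop a constant (useless) column, then keep the single best predictor.
    from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

    X = [[0, 2.0, 1.0], [0, 1.5, 0.2], [0, 3.0, 0.9], [0, 2.5, 0.1]]
    y = [1, 0, 1, 0]

    X = VarianceThreshold().fit_transform(X)             # removes column 0
    X = SelectKBest(f_classif, k=1).fit_transform(X, y)  # keeps the best column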
G
Graphics Processing Unit (GPU)
A GPU is found in video and graphics cards within a server or system such as a PC or laptop. Originally their sole purpose was image and video rendering; however, they have found prominence in data science because GPUs can often train models much faster than CPUs.
H
H2O
An open source machine learning and AI platform somewhat similar to TensorFlow. It is produced by the company H2O.ai, which also provides open source integration of H2O with Spark and with NVIDIA GPUs, allowing users to deploy models to GPUs rather than CPUs. H2O.ai also provides a paid-for enterprise version of H2O.
Hadoop
An open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is maintained by the Apache Software Foundation, which develops many open source projects.
Hybrid Cloud
A cloud computing environment that uses a mix of on-premises infrastructure and cloud-based infrastructure. Hybrid cloud is popular as it gives businesses greater flexibility, potentially enhanced security and more options as to where data or systems are housed. It may also allow for more control over cost management.
Hyperparameter
A parameter of a machine-learning algorithm that must be chosen before model training. A machine-learning algorithm may have a number of hyperparameters, and hyperparameter optimisation is the process of choosing the best set of them for a given algorithm. A popular approach is to evaluate candidate settings using cross-validation. The more hyperparameters there are, the more difficult and time-consuming it is to optimise them. An example of a hyperparameter is the number of layers in a Neural Network.
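A minimal sketch in Python of hyperparameter optimisation by cross-validation, assuming scikit-learn is installed (the algorithm and grid are illustrative):

    # Try several values of the hyperparameter k for a k-nearest-neighbours
    # classifier; each candidate is evaluated with 5-fold cross-validation.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    search = GridSearchCV(KNeighborsClassifier(),
                          param_grid={"n_neighbors": [1, 3, 5, 7]}, cv=5)
    search.fit(X, y)
    print(search.best_params_)  # e.g. {'n_neighbors': 5}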
I
Informatics
An interdisciplinary field combining computer science, information engineering, mathematics and statistics to analyze and interpret data. It shares many features with data science. The term is most prevalent in healthcare and pharmaceutical fields, where bioinformatics (biological informatics) is a common discipline focusing on biological information, and chemoinformatics focuses on chemical information. Healthcare informatics is another common example.
J
Java
A general-purpose programming language. Once compiled, Java code is intended to run on any computer platform that provides a Java Virtual Machine (JVM).
M
Machine Learning
A subset of Artificial Intelligence in which a model is generated from a training dataset and used to identify patterns or predict outcomes. Machine-learning algorithms are mathematical and statistical algorithms based on the principle that systems can learn from data, identify patterns and make decisions with minimal human intervention. Normally the training dataset needs to be a good representation of future data, as the machine-learning model is not able to extrapolate to new data that is very different from the data used for training. In that case, the model usually needs to be retrained to bring it up to date.
Missing Data
Holes in the dataset. Real-world datasets often have missing values. Data may be missing randomly or in a systematic way; the latter requires careful analysis and may be a symptom of issues in data collection or integration. It is important to treat missing data, as many machine-learning algorithms either do not handle missing data well or provide their own replacement, which may not be appropriate for the purpose at hand.
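A minimal sketch in Python of inspecting and treating missing values, assuming pandas is installed (the data and the mean-replacement choice are illustrative):

    import pandas as pd

    df = pd.DataFrame({"age": [34, None, 51], "city": ["Rome", "Oslo", None]})
    print(df.isna().sum())                          # missing count per column
    df["age"] = df["age"].fillna(df["age"].mean())  # replace with the mean
    df = df.dropna(subset=["city"])                 # or drop incomplete rows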
Model
In general, a model denotes an approximation or abstraction of a real system that is "good enough" for a given purpose and computationally feasible. A statistical model represents a mathematical (or probabilistic) relationship that exists between different variables. A machine-learning model can be "white-box" if the relationship can be expressed in mathematical or logical form, or "black-box" if the exact relationship is not known and all we see is the result output by the model for a given input.
Model Deployment
The process of integrating a model into a production environment, where the model can be used to score new data. In this context the model is intended as a physical object, normally a serialized representation of the model rules stored in a special file. It is important to consider how the model will be deployed early on during the planning phase, as the decision might influence how and which model is generated.
Model Evaluation
An integral part of the model development process. It aims to find the best model out of a set of candidate models. Evaluating model performance with the data used for training is not acceptable in data science because it can easily generate overoptimistic and overfitted models. Models are scored and compared on a test dataset, which was not used for training.
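A minimal sketch in Python of scoring a model on a held-out test set, assuming scikit-learn is installed (the dataset and model are illustrative):

    # Hold out 30% of the data; the model never sees it during training.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy on unseen data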
N
Natural Language Processing (NLP)
A subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular with how to program computers to process and analyze large amounts of text written in natural language.
Natural Language Querying (NLQ)
Much like natural language processing (NLP), NLQ is the use of natural human language to query information, data or analytics. For example, Alexa/Siri are NLQ based, and Spotfire contains NLQ to allow people to ask questions of their data, as though they were conversing with a human.
Notebook
An interactive virtual notebook environment for programming. It combines the functionality of word processing with that of executing code snippets, typically R or Python, so that notes and explanations can appear mixed with a view of the source code and its results. Examples are Jupyter Notebook (for Python) and R Markdown (for R).
P
Predictive Analytics
A branch of advanced analytics that is used to make predictions about future events. Predictive analytics uses many techniques from data mining, statistics, modeling, machine learning and artificial intelligence to analyze current data in order to make predictions about the future.
Predictive Model
A model built by a statistical or machine-learning algorithm to forecast a specified outcome.
Predictor
A variable or field in a data table that is used by a machine-learning algorithm to predict an outcome, or response.
Python
An open-source programming language very popular for data science. Python is general purpose, has a readable syntax and can be used for developing both desktop and web applications. It can also be used for developing complex scientific and numeric applications. Python code is considered easier to maintain and to apply to large-scale systems than R.
R
R
An open-source programming language very popular for data science. R has a vast ecosystem of several thousand libraries (or packages) available in CRAN, an open source repository. It is designed with statistical processing in mind.
RapidMiner
A low-code data science software platform developed by the company of the same name. It provides an integrated environment for data preparation, machine learning, deep learning, text mining and visualization. It has a Free Edition, limited to 1 logical processor and 10,000 data rows and available under the AGPL license.
Record
An item of information in a structured dataset. A record contains a number of columns, or fields, each of which contains a different piece of information that together describes the record. A set of records constitutes a data table, which can be stored e.g. in a file or in a database. For example, a personnel file might contain records that have three fields: a name field, an address field, and a phone number field. In relational database management systems, records are also called rows or tuples.
Regression
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationship between a dependent variable (response) and a set of independent variables (predictors). As opposed to Classification, the response is expected to be a continuous variable (i.e. a number that can take any value within a range).
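A minimal regression sketch in Python, assuming scikit-learn is installed (the data is made up for illustration):

    # Estimate a continuous response from one predictor.
    from sklearn.linear_model import LinearRegression

    X = [[1], [2], [3], [4]]     # predictor
    y = [1.1, 1.9, 3.2, 3.8]     # continuous response
    model = LinearRegression().fit(X, y)
    print(model.predict([[5]]))  # a value near 5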
Response
In supervised modelling, it indicates the variable that we want to predict. It is assumed to be known in the dataset we use for training the model.
S
Scala
A general-purpose programming language which runs on a Java platform and is compatible with existing Java programs. It is a compiled language. Spark was written in Scala. TIBCO Team Studio Custom Operators are developed in Scala.
Scoring
Also called prediction, it is the process of generating values based on a trained machine learning model, given some new input data. The values or scores that are created can represent predictions of future values, but they might also represent a likely category or outcome.
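A minimal scoring sketch in Python, assuming scikit-learn is installed (the trained model and the new data are illustrative):

    # Score new input data with a trained classifier.
    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression().fit([[1], [2], [8], [9]],
                                     ["low", "low", "high", "high"])
    print(model.predict([[7]]))        # the most likely category
    print(model.predict_proba([[7]]))  # the class probabilities (scores)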
Spark
An open-source, distributed, general-purpose cluster-computing framework with a (mostly) in-memory data processing engine. It can perform ETL, analytics, machine learning and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), and offers rich, concise, high-level APIs for Scala, Python, Java, R and SQL.
Stata
Statistical software produced by StataCorp. It provides data manipulation, visualization, statistics and reproducible reporting. It is primarily used by researchers in the fields of economics, biomedicine, and political science to examine data patterns.
Supervised Learning
A class of machine learning algorithms designed to extract information from data that does have labelled responses. An example is identifying fraudulent transactions using historic data that contains records which are labelled as fraudulent or non-fraudulent.
Semi-Supervised Learning
A class of machine learning algorithms that also make use of unlabeled data for training: typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning and supervised learning.
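A minimal sketch in Python, assuming a recent scikit-learn is installed (the data is illustrative, and SelfTrainingClassifier is one of several semi-supervised approaches):

    # Unlabeled samples are marked with -1; the wrapper labels them itself
    # during training using its base classifier's confident predictions.
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import SelfTrainingClassifier

    X = [[1.0], [2.0], [8.0], [9.0], [1.5], [8.5]]
    y = [0, 0, 1, 1, -1, -1]  # the last two samples are unlabeled

    model = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
    print(model.predict([[2.2], [7.7]]))  # e.g. [0 1]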
T
Text Mining
The process of examining large collections of written resources e.g. pdfs, web sites, social media etc. to generate new information, and to transform the unstructured text into structured data that is more suitable for analysis and machine learning. Text mining identifies facts, relationships and assertions that would otherwise remain buried in the mass of textual big data.
Trifacta
A company that develops Wrangler, a software system for data exploration and self-service data preparation for analysis. Trifacta works with cloud and on-premises data platforms.
TensorFlow
A popular open source platform for machine learning and AI development. Users of TensorFlow typically write models using Python (or, less commonly, C++). It was originally developed by Google and later released as an open source project.
U
Unsupervised Learning
A class of machine learning algorithms designed to extract information from data that does not have labelled responses. The most common is Cluster Analysis, where data is grouped into clusters without prior knowledge of what the clusters might mean.
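A minimal cluster-analysis sketch in Python, assuming scikit-learn is installed (the points and the choice of k-means are illustrative):

    # Group unlabelled points into two clusters; no responses are provided.
    from sklearn.cluster import KMeans

    X = [[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.3, 7.9]]
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)  # e.g. [0 0 1 1]: the two groups found in the data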
Unstructured Data
Data that does not have a pre-defined structure (as opposed to a database table with specified columns and data types). Unstructured data is therefore not easily searchable. Examples are text (e.g. email, social media), audio, video.