NLP and LLMs - Glossary


    This article explains commonly used NLP and LLM terminology and complements our community content on these topics.

    AI - Artificial intelligence is the broad field of building systems that mimic human intelligence. It includes different algorithmic methodologies such as machine learning, reinforcement learning, and deep learning.

    Corpus - the complete collection of documents in a dataset.

    Document - a single text data point or observation.

    Embedding/vector - a numerical representation of a text document. Vector sizes typically range from the hundreds to the tens of thousands of values, a size referred to as the vector's dimensions. Embeddings are at the core of many modern language models.
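
    For illustration, here is a minimal sketch of producing an embedding, assuming the open-source sentence-transformers package and its all-MiniLM-L6-v2 model (neither is mentioned elsewhere in this article):

```python
# A hedged sketch: encode a sentence into a fixed-size embedding vector.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("Spotfire visualizes data at scale.")
print(vector.shape)  # (384,) -- this particular model emits 384-dimensional vectors
```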

    Fine Tuning - training the final layers of a pre-trained model so that it learns one or more specific tasks. This requires retraining the model's last layers and sometimes modifying the model architecture.
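
    A hedged PyTorch sketch of the "train only the last layers" idea; the layer sizes and model here are illustrative assumptions, not from this article:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(),  # stand-ins for pre-trained layers
    nn.Linear(256, 2),               # the final, task-specific layer
)

for param in model.parameters():      # freeze the whole model...
    param.requires_grad = False
for param in model[-1].parameters():  # ...then unfreeze only the last layer
    param.requires_grad = True
```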

    Foundation Model - a reusable AI model versatile enough to perform many different, often multimodal, tasks.

    Generative AI - a type of AI system that generates text, images, and other outputs in response to prompts. Since late 2022, the term has largely come to describe a new era of AI.

    GPT - generative pre-trained transformer; a family of models from OpenAI based on the original GPT architecture.

    Hallucination - a phenomenon where the model generates text that is incorrect, nonsensical, or not grounded in reality.

    Indexing - much as an index helps you look up words in a book, an index file gives a database an expedited way to query data.
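
    A toy Python sketch of the idea (the documents are invented for illustration): an inverted index maps each word to the documents that contain it, so a lookup need not scan every document.

```python
from collections import defaultdict

docs = {1: "large language models", 2: "vector databases store embeddings"}
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

print(index["language"])  # {1} -- only document 1 contains "language"
```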

    LangChain - an open-source LLM integration framework that assists with end-to-end building and deployment by chaining together the various components of an application's architecture.

    Language Models - probability distributions over sequences of words; they often predict the next word(s) from the preceding sequence.
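
    As a toy worked example (the tiny corpus is invented for illustration), a bigram language model estimates the probability of the next word directly from counts:

```python
# Estimate P(next word | previous word) from raw bigram counts.
from collections import Counter

corpus = "the model predicts the next word the model learns".split()
pairs = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
prev = Counter(corpus[:-1])               # counts of each preceding word

def prob(w2, w1):
    return pairs[(w1, w2)] / prev[w1]

print(prob("model", "the"))  # P(model | the) = 2/3
```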

    LLM - Large Language Models are language models built on complex deep learning architectures. By nature, "deep" architectures have many neural network layers, and their complexity grows as the parameter count and model size increase.

    LSTMs - Long short-term memory networks (LSTMs) are a type of neural network, more specifically a type of RNN, that uses more complex cells to retain historical information.
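
    A minimal PyTorch sketch; the input and hidden sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 5, 8)       # (batch, sequence length, features)
output, (h_n, c_n) = lstm(x)   # h_n and c_n carry the retained history
print(output.shape)            # torch.Size([1, 5, 16])
```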

    N-grams - a contiguous sequence of n words or tokens that can be modeled statistically using computations like frequency counts or TF-IDF (Term Frequency-Inverse Document Frequency).
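
    A short sketch using scikit-learn (an assumed dependency; the three-document corpus is invented for illustration):

```python
# Compute TF-IDF weights over unigrams and bigrams.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # 1-grams and 2-grams
matrix = vectorizer.fit_transform(corpus)         # documents x n-grams
print(vectorizer.get_feature_names_out())
```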

    NLG - natural language generation; a branch of NLP concerned with generating output text from some input text (or another modality).

    NLP - natural language processing is the broader AI field concerned with modeling text data. It includes branches like NLG and NLU.

    NLQ - natural language querying; a branch of NLP concerned with interactively querying data through natural language. It is often seen in Business Intelligence tools, where natural language replaces the need for languages like SQL and query results are returned as tables, visuals, or whole dashboards.

    NLU - natural language understanding; a branch of NLP concerned with understanding the meaning of text.

    OpenAI Models - the family of models offered by OpenAI, such as GPT-3.5 Turbo, Davinci, and Codex.

    Orchestrator - in an LLM application, the building block in charge of arranging and calling the various services and routing data among them to generate the desired output.

    Prompt Engineering - an AI engineering technique that augments, refines, manipulates, or otherwise modifies the user's prompt before sending it to the foundation model. It does not change the core foundation model itself.

    Prompt Engineering, System Prompt - a prompt that includes specific instructions (such as assuming a certain role) and constraints (such as following a specific format) to guide the response. Typically set by the generative AI application developer.

    Prompt Engineering, User Prompt - an ordinary prompt that tends to be more open-ended than the system prompt; the user can ask anything. The sketch below shows how the two prompt types are commonly combined.
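
    A hedged sketch of the two prompt types in a chat-style API call. The messages format follows OpenAI's Python client; the model name and prompt contents are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # System prompt: role and format constraints set by the app developer.
        {"role": "system", "content": "You are a concise analytics assistant. Answer in one sentence."},
        # User prompt: the open-ended question from the end user.
        {"role": "user", "content": "What is a vector database?"},
    ],
)
print(response.choices[0].message.content)
```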

    RAG - Retrieval-Augmented Generation is the process of optimizing the output of a large language model so that it references an authoritative knowledge base outside of its training data sources before generating a response.
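
    A high-level sketch of the RAG flow. Both retrieve() and generate() are hypothetical placeholders standing in for a vector search and an LLM call; neither is a real API:

```python
def retrieve(question: str, knowledge_base: list[str], k: int = 3) -> list[str]:
    """Return the k passages most relevant to the question (stubbed here)."""
    return knowledge_base[:k]

def generate(prompt: str) -> str:
    """Stand-in for a call to a large language model."""
    return f"(model answer grounded in: {prompt!r})"

def answer(question: str, knowledge_base: list[str]) -> str:
    # Retrieve supporting passages, then ask the model to answer from them.
    context = "\n".join(retrieve(question, knowledge_base))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```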

    RNNs - Recurrent neural networks (RNNs) are a type of neural network that operates on sequential data and passes hidden history states through the model.

    Semantic Kernel - Microsoft's open-source LLM integration framework.

    Token - a single unit of text, often a word or part of a word (a subword).
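
    For illustration, tokenizing a sentence with the open-source tiktoken library (an assumed dependency, not something this article prescribes) shows that tokens are often subwords rather than whole words:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("Tokenization splits text into units.")
print(tokens)                                  # the integer token IDs
print([encoding.decode([t]) for t in tokens])  # the text of each token
```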

    Vector database - a type of database that stores and searches high-dimensional vectors. Popular vector databases include Milvus and Pinecone.
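
    The core operation such a database performs is nearest-neighbor search over embeddings. A NumPy sketch with made-up vectors:

```python
import numpy as np

vectors = np.random.rand(1000, 384)  # stored document embeddings
query = np.random.rand(384)          # embedding of the user's query

norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
similarity = vectors @ query / norms       # cosine similarity per document
top_k = np.argsort(similarity)[-5:][::-1]  # indices of the 5 best matches
print(top_k)
```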

     

    Further info on NLP and LLMs:

    • For the main TIBCO Community page on NLP and LLMs click here.

    • For the details on Spotfire Copilot click here.



