Glossary

We hope this glossary is a useful resource for anyone working with data.

Analytics

  • Analytics: The practice of using data and statistical techniques to understand and improve business performance.
  • Business intelligence (BI): The practice of using data and tools to analyze and visualize business performance, and make informed decisions.
  • Dashboard: A visual display of key performance indicators (KPIs) that is used to monitor and analyze business performance.
  • Descriptive analytics: The practice of using data to describe and summarize past events or trends.
  • Key performance indicator (KPI): A metric that is used to measure the progress of a business towards a specific goal.
  • Predictive analytics: The use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data.
  • Prescriptive analytics: The practice of using data and optimization algorithms to recommend actions or decisions that will achieve specific goals.
  • ROI (return on investment): A measure of the profitability of an investment, calculated by dividing the net gain by the cost of the investment.
  • Big data analytics: The process of analyzing large and complex datasets in order to extract insights and make decisions.
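
As a worked example of the ROI definition above, the following minimal sketch computes ROI as net gain divided by cost. The function name `roi` and the campaign figures are hypothetical and chosen only for illustration.

```python
def roi(net_gain: float, cost: float) -> float:
    """Return ROI as a fraction: net gain divided by the cost of the investment."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return net_gain / cost


# Hypothetical campaign: it cost 10,000 and brought in 12,500 in revenue.
cost = 10_000.0
revenue = 12_500.0
net_gain = revenue - cost

print(f"ROI: {roi(net_gain, cost):.0%}")  # prints "ROI: 25%"
```

ROI is often reported as a percentage, which is why the example formats the fraction with `%`.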

General Data Terms

  • Big data: Datasets that are too large or complex to be processed and analyzed using traditional data processing tools.
  • Cloud computing: The delivery of computing services over the internet, including storage, processing, and networking.
  • Data anonymization: The process of removing personal identifying information from data in order to protect the privacy of individuals.
  • Data backup: The process of creating copies of data for the purpose of protecting against data loss or corruption.
  • Data backup schedule: A plan for creating and maintaining copies of data at regular intervals in order to protect against data loss or corruption.
  • Data backup strategy: A plan for creating and maintaining copies of data in order to protect against data loss or corruption.
  • Data catalog: A centralized repository that contains metadata about the data assets in an organization.
  • Data catalog tool: A software application that helps organizations manage and discover their data assets by providing features such as metadata management, data asset search, and data asset classification.
  • Data classification: The process of organizing data into categories based on its level of sensitivity, importance, or value.
  • Data encryption: The process of converting data into a secure, encoded form that can only be accessed with a special key or password.
  • Data lineage: The history of data, including where it came from, how it was transformed, and where it is stored.
  • Data lineage tool: A software application that helps organizations trace the history of their data, including where it came from, how it was transformed, and where it is stored.
  • Data masking: The process of obscuring sensitive data in order to protect it from unauthorized access or disclosure.
  • Data privacy: The protection of personal information from unauthorized access, use, or disclosure.
  • Data privacy legislation: Laws that regulate the collection, use, and protection of personal information.
  • Data quality: The level of accuracy, completeness, consistency, and timeliness of data.
  • Data recovery: The process of restoring data that has been lost, damaged, or corrupted.
  • Data recovery plan: A plan for restoring data that has been lost, damaged, or corrupted.
  • Data retention policy: A set of rules that defines how long data should be kept and when it can be deleted.
  • Data security: The measures taken to protect data from unauthorized access, use, disclosure, disruption, modification, or destruction.
  • Data storage: The process of storing data in a way that is efficient, scalable, and secure.
  • Data governance: The processes, policies, and standards that are used to ensure that data is managed and protected in a way that meets the needs of the organization.
  • Data governance policy: A set of rules and guidelines that define how data is managed and protected within an organization.
  • Data governance program: A structured approach to managing and protecting data within an organization.
  • Data cleansing: The process of identifying and correcting inaccuracies, inconsistencies, and duplicates in data.
  • Data integration: The process of combining data from multiple sources into a single, coherent dataset.
  • Data mining: The process of discovering patterns and relationships in large datasets using machine learning algorithms.
  • Data modeling: The process of designing a data model, which is a logical representation of the data in a system.
  • Data quality assessment: The process of evaluating the quality of data in order to identify any issues or problems that need to be addressed.
  • Data quality improvement: The process of correcting or enhancing the quality of data in order to make it more useful and valuable.
  • Data quality management: The processes and tools used to ensure that data is accurate, complete, and consistent.
  • Data quality metrics: Standard measures used to quantify the quality of data, such as accuracy, completeness, and consistency.
  • Data visualization: The process of creating visual representations of data in order to understand and communicate trends and patterns.
  • Data warehousing ETL: The process of extracting data from various sources, transforming it into a consistent format, and loading it into a data warehouse.
  • Scalability: The ability of a system or process to handle an increased workload, typically by adding resources, without a disproportionate loss of performance.
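
To make the idea of data quality metrics more concrete, here is a small sketch that measures completeness and duplicate rate on a list of records. The function names, field names, and sample records are all assumptions made for this example; real projects usually define their own metrics and thresholds.

```python
from typing import Optional

Record = dict[str, Optional[str]]


def completeness(records: list[Record], field: str) -> float:
    """Fraction of records that have a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def duplicate_rate(records: list[Record], key: str) -> float:
    """Fraction of records whose `key` value repeats an earlier record."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        value = r.get(key)
        if value in seen:
            duplicates += 1
        else:
            seen.add(value)
    return duplicates / len(records)


# Hypothetical sample data: one missing email and one repeated id.
records = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": None},
    {"id": "3", "email": "c@example.com"},
    {"id": "3", "email": "c@example.com"},
]

print(completeness(records, "email"))  # 0.75
print(duplicate_rate(records, "id"))   # 0.25
```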

Data Processing

  • Extract, transform, load (ETL): The process of extracting data from various sources, transforming it into a consistent format, and loading it into a target system.
  • Hadoop: An open-source framework for storing and processing large amounts of data on clusters of commodity hardware.
  • MapReduce: A programming model for processing large datasets in a distributed computing environment.
  • Real-time processing: The processing of data as it is generated, rather than in batch mode.
  • Spark: An open-source, in-memory data processing engine that is used to perform big data analytics.
  • Stream processing: The real-time processing of data streams as they are generated, rather than processing data in batch mode.
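
The MapReduce programming model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases using the classic word-count task; it is not how Hadoop or Spark are implemented, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents: list[str]) -> dict[str, int]:
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data", "Big clusters process data"])
print(counts)
```

In a distributed setting, the map and reduce calls would run in parallel on different machines, and the shuffle step would move data across the network.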

Data Science & Machine Learning

  • Batch learning: A type of machine learning where the entire dataset is used to train a model, rather than training the model incrementally on new data.
  • Bias-variance tradeoff: The tradeoff between error caused by overly simple assumptions in a model (bias) and error caused by the model's sensitivity to fluctuations in the training data (variance). A model with high bias is prone to underfitting, while a model with high variance is prone to overfitting.
  • Cross-validation: The practice of repeatedly splitting the data into training and validation subsets (for example, k folds), training the model on one part and evaluating it on the other each time, and combining the results to estimate the model's performance and tune its hyperparameters.
  • Data (science) modeling: The process of building statistical or machine learning models to make predictions or draw insights from data.
  • Data science: The practice of using statistical and machine learning techniques to analyze and understand data in order to extract insights and make predictions.
  • Deep learning: A subfield of machine learning that uses layered artificial neural networks to learn patterns and representations from data.
  • Early stopping: A technique that is used to prevent overfitting by halting the training of a machine learning model when its performance on the validation set stops improving.
  • Feature engineering: The process of selecting, creating, and transforming features (data variables) in order to improve the performance of a machine learning model.
  • Machine learning: The process of using algorithms and statistical models to allow a computer to learn from data, without being explicitly programmed.
  • Model complexity: The number of features, parameters, or layers in a machine learning model, which can affect its ability to fit the data accurately.
  • Model deployment: The process of making a machine learning model available for use in a production environment.
  • Model evaluation: The process of measuring the performance of a machine learning model using various metrics, such as accuracy, precision, and recall.
  • Model optimization: The process of adjusting the hyperparameters of a machine learning model in order to improve its performance.
  • Natural language processing (NLP): The field of artificial intelligence that focuses on enabling computers to understand and process human language.
  • Online learning: A type of machine learning where the model is trained incrementally on new data as it becomes available.
  • Overfitting: The phenomenon where a machine learning model performs well on the training data but poorly on new data, due to being too closely fitted to the training data.
  • Regularization: A technique that is used to prevent overfitting by adding a penalty to the objective function of a machine learning model, based on the complexity of the model.
  • Reinforcement learning: A type of machine learning where an agent learns by interacting with its environment and receiving rewards or punishments for its actions.
  • Semi-supervised learning: A type of machine learning where the training data is partially labeled and the goal is to predict the label for new data.
  • Supervised learning: A type of machine learning where the training data is labeled and the goal is to predict the label for new data.
  • Transfer learning: The process of adapting a pre-trained machine learning model for a new task, by fine-tuning the model using additional data and task-specific layers.
  • Underfitting: The phenomenon where a machine learning model performs poorly on the training data and on new data, due to being too simplistic or not being able to capture the underlying patterns in the data.
  • Unsupervised learning: A type of machine learning where the training data is not labeled and the goal is to discover patterns or relationships in the data.
  • Self-supervised learning: A type of machine learning where the training data is not labeled; instead, the data itself is used to create pseudo-labels that the model is trained on.
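
One common form of cross-validation is k-fold cross-validation, where every sample is used for validation exactly once. The sketch below only generates the index splits and leaves model training aside; the function name `k_fold_splits` is hypothetical, and in practice the data is usually shuffled before splitting.

```python
from typing import Iterator


def k_fold_splits(n_samples: int, k: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_splits(6, 3):
    print("train:", train, "validation:", validation)
```

A model would be trained once per fold on the `train` indices and scored on the `validation` indices, with the scores averaged at the end.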

Data Storage

  • Data lake: A centralized repository that allows you to store all your structured and unstructured data at any scale.
  • Data lake analytics: The process of analyzing data in a data lake using a variety of tools and techniques.
  • Data lake architecture: The design of a data lake, including the hardware and software components, data ingestion and processing pipelines, and security and governance controls.
  • Data lake governance: The processes, policies, and standards that are used to ensure that data in a data lake is managed and protected in a way that meets the needs of the organization.
  • Data lake ingestion: The process of loading data into a data lake from various sources.
  • Data lake query: The process of searching for and retrieving data from a data lake using SQL or other query languages.
  • Data lake security: The measures taken to protect data in a data lake from unauthorized access, use, disclosure, disruption, modification, or destruction.
  • Data lake storage: The process of storing data in a data lake in a way that is efficient, scalable, and secure.
  • Data lake transformation: The process of cleansing, integrating, and transforming data in a data lake to prepare it for analysis.
  • Data lake use cases: The specific business scenarios in which a data lake can be used to store, process, and analyze data.
  • Data lake visualization: The process of creating visual representations of data in a data lake to understand and communicate trends and patterns.
  • Data mart: A subset of a data warehouse that is designed to meet the specific needs of a particular business unit or department.
  • Data warehouse: A structured repository for storing and managing large amounts of historical data that is used for reporting and analysis.
  • Data warehousing appliance: A pre-configured hardware and software system that is designed specifically for data warehousing.
  • Data warehousing schema: The logical structure of a data warehouse, including the organization of data into tables and the relationships between them.
  • Data warehousing tool: A software application that is used to build, maintain, and query a data warehouse.
  • Database: A structured collection of data that is stored electronically and accessed by computers.
  • Database index: A data structure that is used to speed up the search for specific records in a database.
  • Database management system (DBMS): Software that is used to create, maintain, and access databases.
  • Database migration: The process of transferring data from one database to another.
  • Database normalization: The process of organizing a database in a way that reduces redundancy and dependency, and improves data integrity.
  • Database schema: The logical structure of a database, including the organization of data into tables and the relationships between them.
  • Database security: The measures taken to protect a database from unauthorized access, use, disclosure, disruption, modification, or destruction.
  • Non-relational database: A database that does not store data in tables with rows and columns, but rather in a more flexible structure such as documents or key-value pairs.
  • NoSQL database: A class of non-relational databases designed for flexible schemas and horizontal scaling, often used to handle large amounts of semi-structured or unstructured data.
  • Relational database: A database that stores data in tables with rows and columns, and uses relationships between tables to retrieve data.
  • Structured query language (SQL): A programming language used to manage and query relational databases.
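
Several of the terms above (relational database, database schema, database index, and SQL) can be seen together in a short sketch using Python's built-in `sqlite3` module. The table names, column names, and rows are invented for this example.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Schema: two tables linked by a relationship (orders.customer_id -> customers.id).
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id), "
    "amount REAL NOT NULL)"
)

# Index: speeds up lookups of orders by customer.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)

# SQL query: join the tables and total the order amounts per customer.
totals = conn.execute(
    "SELECT c.name, SUM(o.amount) "
    "FROM customers AS c JOIN orders AS o ON o.customer_id = c.id "
    "GROUP BY c.id, c.name ORDER BY c.name"
).fetchall()

print(totals)  # [('Ada', 25.0), ('Grace', 12.5)]
conn.close()
```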

Ops Terms

  • AIOps: The application of artificial intelligence (AI) and machine learning to IT operations, for example to automate event correlation, anomaly detection, and incident response.
  • CloudOps: A set of practices and tools that aims to improve the management and operation of cloud computing environments.
  • Configuration management: The practice of managing and maintaining the configuration of software and infrastructure in a consistent and automated way.
  • Continuous delivery (CD): The practice of automatically building and testing code changes so that they are always ready to be released to production, with the release itself typically triggered by a manual approval.
  • Continuous deployment: The practice of automatically deploying code changes to production environments as soon as they are ready, without requiring manual approval.
  • Continuous integration (CI): The practice of integrating code changes into a shared repository multiple times a day, and automatically building and testing the code to ensure that it is always in a deployable state.
  • Data cleaning: The process of identifying and correcting errors or inconsistencies in data.
  • Data engineering: The practice of designing, building, and maintaining the infrastructure and pipelines that are used to collect, store, and process data.
  • Data exploration: The process of examining and summarizing data in order to understand its characteristics and identify patterns or trends.
  • Data ingestion: The process of loading data into a storage system from various sources.
  • Data operations: The practice of managing and maintaining the data infrastructure and pipelines that are used to deliver data products.
  • Data pipeline: The series of steps and tools that are used to collect, process, and store data.
  • Data preparation: The process of cleaning, transforming, and enriching data in order to make it ready for analysis.
  • Data processing: The process of applying transformations to data in order to extract insights or prepare it for analysis.
  • Data product: A product that is built using data, such as a machine learning model, a dashboard, or a recommendation system.
  • Data quality assurance: The practice of verifying that data meets certain quality standards before it is used in data products.
  • Data transformation: The process of cleansing, integrating, and transforming data to prepare it for analysis.
  • Data validation: The process of verifying that data meets certain quality standards before it is used in analysis or decision-making.
  • Data version control: The practice of tracking and managing changes to data in order to maintain the integrity and traceability of data products.
  • Data wrangling: The process of cleaning, transforming, and integrating data in order to make it ready for analysis.
  • DataOps: A set of practices and tools that aims to improve the collaboration and communication between data engineering, data science, and data operations teams in order to deliver data products faster and more reliably.
  • Deployment pipeline: The series of steps and tools that are used to build, test, and deploy code changes.
  • DevOps: A set of practices and tools that aims to improve the collaboration and communication between development and operations teams in order to deliver software faster and more reliably.
  • Infrastructure as code (IaC): The practice of managing infrastructure using configuration files that can be versioned, reviewed, and automated, just like code.
  • Log management: The practice of collecting, storing, and analyzing log data from systems and applications in order to identify and troubleshoot problems.
  • Microservices: Small, independent services that can be developed, deployed, and scaled independently.
  • MLOps: A set of practices and tools that aims to improve the collaboration and communication between machine learning development and operations teams in order to deliver machine learning products faster and more reliably.
  • ModelOps: A set of practices and tools that aims to improve the collaboration and communication between machine learning model development and operations teams in order to deploy and maintain machine learning models in a production environment.
  • Monitoring and alerting: The practice of monitoring systems and applications for signs of problems or performance issues, and triggering alerts when necessary.
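
Data validation, as defined above, often takes the form of simple rule checks that run before data enters a pipeline or data product. The sketch below assumes a hypothetical order record with `order_id`, `amount`, and `currency` fields; the rules themselves are illustrative, not a standard.

```python
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed set for this example


def validate_order(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passed."""
    errors = []
    if not isinstance(record.get("order_id"), int):
        errors.append("order_id must be an integer")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    if record.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("currency must be one of " + ", ".join(sorted(ALLOWED_CURRENCIES)))
    return errors


good = {"order_id": 1, "amount": 9.5, "currency": "USD"}
bad = {"order_id": "x", "amount": -1, "currency": "JPY"}

print(validate_order(good))  # []
for problem in validate_order(bad):
    print(problem)
```

Returning all violations at once, rather than stopping at the first one, makes it easier to report on data quality for a whole batch.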

Artificial Intelligence & GenAI/LLMs

  • Artificial Intelligence (AI): The simulation of human intelligence processes by machines, especially computer systems, which include learning, reasoning, and self-correction.
  • Machine Learning (ML): A subset of AI that involves the use of statistical techniques to enable machines to improve at tasks with experience.
  • Deep Learning: A specialized form of ML using neural networks with multiple layers to analyze various data types.
  • Natural Language Processing (NLP): The branch of AI that enables computers to understand, interpret, and respond to human language.
  • Generative AI (GenAI): AI models capable of generating new content, such as text, images, music, and code, based on learned patterns from vast datasets.
  • Large Language Models (LLMs): AI models trained on massive text corpora to understand and generate human-like text.
  • Fine-tuning: The process of adapting a pre-trained AI model to a specific task by training it further on a smaller, task-specific dataset.
  • Prompt engineering: The practice of crafting effective inputs to LLMs to achieve desired outputs efficiently.
  • Zero-shot learning: The ability of an AI model to perform tasks it has not explicitly been trained for, based on contextual understanding.
  • Few-shot learning: A learning paradigm where AI models make accurate predictions with only a few examples.
  • Hallucination: A phenomenon in which an AI model generates incorrect or misleading information that appears plausible.
  • Embedding: A numerical representation of text, images, or other data types in a continuous vector space, used for similarity comparisons.
  • Tokenization: The process of breaking down text into smaller components, such as words or subwords, for processing by AI models.
  • Inference: The process of using a trained AI model to make predictions or generate outputs based on new input data.
  • Bias in AI: Systematic and unfair discrimination in AI outputs due to biased training data or model design.
  • Explainability: The ability to understand and interpret the decision-making process of AI models.
  • AI Ethics: The field of study that examines the moral implications and responsibilities associated with AI development and deployment.
  • Model drift: The degradation of an AI model's performance over time as the underlying data distribution changes.
  • Retrieval-Augmented Generation (RAG): An AI technique that enhances model responses by retrieving relevant external information during generation.
  • Token limit: The maximum number of tokens (words, subwords, or characters) an AI model can process in a single request.
  • Knowledge Distillation: A technique where a smaller model learns from a larger, more complex model to achieve similar performance with fewer resources.
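
The similarity comparisons mentioned in the embedding entry above are commonly done with cosine similarity. The following sketch uses tiny, hand-made three-dimensional vectors purely for illustration; real embeddings come from a trained model and usually have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not produced by any real model.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))   # close to 1
print(cosine_similarity(embeddings["cat"], embeddings["invoice"]))  # close to 0
```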

AI Deployment & MLOps

  • MLOps (Machine Learning Operations): A set of practices that aim to streamline the deployment, monitoring, and management of machine learning models in production.
  • Model monitoring: The ongoing tracking of model performance to ensure accuracy and fairness in production environments.
  • A/B testing: The practice of comparing two versions of a machine learning model to determine which performs better.
  • Model retraining: The process of periodically updating an AI model with new data to maintain accuracy and relevance.
  • Feature store: A centralized repository for storing, sharing, and reusing machine learning features across multiple models.
  • Explainable AI (XAI): Tools and techniques designed to make AI models more transparent and interpretable.
  • AutoML (Automated Machine Learning): The use of automated tools to simplify the process of developing and tuning machine learning models.
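  • Hyperparameter tuning: The process of optimizing the hyperparameters of a machine learning model (settings such as the learning rate that are chosen before training rather than learned from the data) to improve performance.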
  • Federated learning: A technique that allows training models across decentralized data sources without moving the data itself.
  • Model interpretability: The degree to which humans can understand how an AI model makes decisions.
  • Data drift: The change in statistical properties of input data over time, which can lead to model degradation.
  • Continuous integration/continuous deployment (CI/CD): A methodology that automates the building, testing, and deployment of code and machine learning models into production.
  • Shadow deployment: Deploying a new AI model alongside the existing one to compare their outputs without impacting users.
  • Canary release: Gradually rolling out AI model updates to a subset of users to mitigate risks.
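
One possible way to pick the subset of users in a canary release is to hash each user ID into a bucket, so that the same user always sees the same model version. This is a minimal sketch under that assumption; the function name `routes_to_canary` and the user ID format are hypothetical, and production systems often rely on a feature-flag or traffic-management service instead.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary based on a hash bucket."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first four bytes of the hash to a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


users = [f"user-{i}" for i in range(10_000)]
share = sum(routes_to_canary(u, 10) for u in users) / len(users)
print(f"share of users on the canary model: {share:.1%}")
```

Because the bucket depends only on the user ID, raising `canary_percent` keeps existing canary users on the new model while adding more.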

AI Applications

  • Chatbots: AI-powered conversational agents that interact with users via text or voice interfaces.
  • Sentiment analysis: The use of AI to determine the emotional tone of text data.
  • Computer vision: AI technology that enables machines to interpret and analyze visual data from the world.
  • Recommender systems: AI algorithms that suggest products, content, or services based on user preferences and behavior.
  • Anomaly detection: The identification of unusual patterns in data that do not conform to expected behavior.
  • Speech recognition: AI systems that convert spoken language into text.
  • Fraud detection: The application of AI to identify potentially fraudulent transactions or behaviors.
  • AI-powered automation: The use of AI to perform repetitive tasks without human intervention.
  • Personalization: The customization of content or experiences for users based on AI-driven insights.
  • Digital twins: Virtual representations of physical objects or processes that can be analyzed and optimized using AI.
  • Autonomous systems: AI-driven systems that can operate independently, such as self-driving cars and drones.
  • Predictive maintenance: Using AI to forecast equipment failures before they happen, optimizing maintenance schedules.
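
As an illustration of anomaly detection, one of the simplest approaches is to flag values that lie far from the mean in units of standard deviation (a z-score). The sketch below assumes roughly normally distributed sensor readings; the readings and the threshold of 2.0 are invented for the example, and many real systems use more robust methods.

```python
from statistics import mean, stdev


def zscore_anomalies(values: list[float], threshold: float) -> list[int]:
    """Return the indices of values whose z-score exceeds the threshold."""
    if len(values) < 2:
        raise ValueError("at least two values are required")
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one obvious spike at index 6.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0, 10.1]

print(zscore_anomalies(readings, threshold=2.0))  # [6]
```

Note that a single extreme value inflates both the mean and the standard deviation, which is why a lower threshold is used here for a small sample.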