Glossary

We hope this glossary is a useful resource for anyone working with data.

Analytics

Analytics: The practice of using data and statistical techniques to understand and improve business performance.
Business intelligence (BI): The practice of using data and tools to analyze and visualize business performance, and make informed decisions.
Dashboard: A visual display of key performance indicators (KPIs) that is used to monitor and analyze business performance.
Descriptive analytics: The practice of using data to describe and summarize past events or trends.
Key performance indicator (KPI): A metric that is used to measure the progress of a business towards a specific goal.
Predictive analytics: The use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data.
Prescriptive analytics: The practice of using data and optimization algorithms to recommend actions or decisions that will achieve specific goals.
ROI (return on investment): A measure of the profitability of an investment, calculated by dividing the net gain by the cost of the investment.
Big data analytics: The process of analyzing large and complex datasets in order to extract insights and make decisions.
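
The ROI entry above describes a simple ratio, which can be sketched in a few lines of Python. This is only an illustration: the function name `roi` and the figures are hypothetical and do not come from any particular tool.

```python
def roi(net_gain: float, cost: float) -> float:
    """Return ROI as a fraction: net gain divided by the cost of the investment."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return net_gain / cost


# Hypothetical figures: an investment of 10,000 that returns 12,500 in total.
cost = 10_000.0
total_return = 12_500.0
net_gain = total_return - cost  # 2,500

print(f"ROI: {roi(net_gain, cost):.1%}")  # prints "ROI: 25.0%"
```

ROI is usually reported as a percentage, so the fraction is formatted with `%` when printed.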

General Data Terms

Big data: Datasets that are too large or complex to be processed and analyzed using traditional data processing tools.
Cloud computing: The delivery of computing services over the internet, including storage, processing, and networking.
Data anonymization: The process of removing personal identifying information from data in order to protect the privacy of individuals.
Data backup: The process of creating copies of data for the purpose of protecting against data loss or corruption.
Data backup schedule: A plan for creating and maintaining copies of data at regular intervals in order to protect against data loss or corruption.
Data backup strategy: A plan for creating and maintaining copies of data in order to protect against data loss or corruption.
Data catalog: A centralized repository that contains metadata about the data assets in an organization.
Data catalog tool: A software application that helps organizations manage and discover their data assets by providing features such as metadata management, data asset search, and data asset classification.
Data classification: The process of organizing data into categories based on its level of sensitivity, importance, or value.
Data encryption: The process of converting data into a secure, encoded form that can only be accessed with a special key or password.
Data lineage: The history of data, including where it came from, how it was transformed, and where it is stored.
Data lineage tool: A software application that helps organizations trace the history of their data, including where it came from, how it was transformed, and where it is stored.
Data masking: The process of obscuring sensitive data in order to protect it from unauthorized access or disclosure.
Data privacy: The protection of personal information from unauthorized access, use, or disclosure.
Data privacy legislation: Laws that regulate the collection, use, and protection of personal information.
Data quality: The level of accuracy, completeness, consistency, and timeliness of data.
Data recovery: The process of restoring data that has been lost, damaged, or corrupted.
Data recovery plan: A plan for restoring data that has been lost, damaged, or corrupted.
Data retention policy: A set of rules that defines how long data should be kept and when it can be deleted.
Data security: The measures taken to protect data from unauthorized access, use, disclosure, disruption, modification, or destruction.
Data storage: The process of storing data in a way that is efficient, scalable, and secure.
Data governance: The processes, policies, and standards that are used to ensure that data is managed and protected in a way that meets the needs of the organization.
Data governance policy: A set of rules and guidelines that define how data is managed and protected within an organization.
Data governance program: A structured approach to managing and protecting data within an organization.
Data cleansing: The process of identifying and correcting inaccuracies, inconsistencies, and duplicates in data.
Data integration: The process of combining data from multiple sources into a single, coherent dataset.
Data mining: The process of discovering patterns and relationships in large datasets using machine learning algorithms.
Data modeling: The process of designing a data model, which is a logical representation of the data in a system.
Data quality assessment: The process of evaluating the quality of data in order to identify any issues or problems that need to be addressed.
Data quality improvement: The process of correcting or enhancing the quality of data in order to make it more useful and valuable.
Data quality management: The processes and tools used to ensure that data is accurate, complete, and consistent.
Data quality metrics: Standard measures used to quantify the quality of data, such as accuracy, completeness, and consistency.
Data visualization: The process of creating visual representations of data in order to understand and communicate trends and patterns.
Data warehousing ETL: The process of extracting data from various sources, transforming it into a consistent format, and loading it into a data warehouse.
Scalability: The ability of a system or process to handle an increased workload, typically by adding resources, without an unacceptable loss of performance.
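
As one possible illustration of the data masking entry above, the sketch below hides all but the last few characters of a sensitive string. The function name `mask_value` and its parameters are assumptions made for this example. Real masking tools typically offer many more strategies.

```python
def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Replace all but the last `visible` characters with `mask_char`."""
    if visible < 0:
        raise ValueError("visible must be non-negative")
    if len(value) <= visible:
        # Short values are masked entirely so nothing is exposed.
        return mask_char * len(value)
    hidden = len(value) - visible
    return mask_char * hidden + value[hidden:]


# Hypothetical card number used purely as sample input.
print(mask_value("4111111111111111"))  # prints "************1111"
```

Note that masking of this kind only obscures the value for display. It does not replace encryption or anonymization.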

Data Processing

Extract, transform, load (ETL): The process of extracting data from various sources, transforming it into a consistent format, and loading it into a target system.
Hadoop: An open-source framework for storing and processing large amounts of data on clusters of commodity hardware.
MapReduce: A programming model for processing large datasets in a distributed computing environment.
Real-time processing: The processing of data as it is generated, rather than in batch mode.
Spark: An open-source, in-memory data processing engine that is used to perform big data analytics.
Stream processing: The real-time processing of data streams as they are generated, rather than processing data in batch mode.
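
The MapReduce entry above can be illustrated with the classic word-count example. The sketch below runs in a single Python process, so it only shows the shape of the model (map, shuffle, reduce) and not the distribution across a cluster that a framework such as Hadoop provides. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


print(word_count(["big data", "Big clusters"]))
```

In a distributed setting, the map and reduce functions would run in parallel on different machines, and the framework would handle the shuffle step between them.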

Data Science & Machine Learning

Batch learning: A type of machine learning where the entire dataset is used to train a model, rather than training the model incrementally on new data.
Bias-variance tradeoff: The tradeoff between error caused by overly simple assumptions in a model (bias) and error caused by the model's sensitivity to the particular training data it was fit on (variance). A model with high bias is prone to underfitting, while a model with high variance is prone to overfitting.
Cross-validation: The practice of repeatedly dividing the data into training and validation subsets (for example, k folds), training a machine learning model on one part and evaluating it on the other, and combining the results to estimate the model's performance and tune its hyperparameters.
Data (science) modeling: The process of building statistical or machine learning models to make predictions or draw insights from data.
Data science: The practice of using statistical and machine learning techniques to analyze and understand data in order to extract insights and make predictions.
Deep learning: A subfield of machine learning that involves training neural networks with many layers.
Early stopping: A technique that is used to prevent overfitting by interrupting the training of a machine learning model when its performance on the validation set stops improving.
Feature engineering: The process of selecting, creating, and transforming features (data variables) in order to improve the performance of a machine learning model.
Machine learning: The process of using algorithms and statistical models to allow a computer to learn from data, without being explicitly programmed.
Model complexity: The number of features, parameters, or layers in a machine learning model, which can affect its ability to fit the data accurately.
Model deployment: The process of making a machine learning model available for use in a production environment.
Model evaluation: The process of measuring the performance of a machine learning model using various metrics, such as accuracy, precision, and recall.
Model optimization: The process of adjusting the hyperparameters of a machine learning model in order to improve its performance.
Natural language processing (NLP): The field of artificial intelligence that focuses on enabling computers to understand and process human language.
Online learning: A type of machine learning where the model is trained incrementally on new data as it becomes available.
Overfitting: The phenomenon where a machine learning model performs well on the training data but poorly on new data, due to being too closely fitted to the training data.
Regularization: A technique that is used to prevent overfitting by adding a penalty to the objective function of a machine learning model, based on the complexity of the model.
Reinforcement learning: A type of machine learning where an agent learns by interacting with its environment and receiving rewards or punishments for its actions.
Semi-supervised learning: A type of machine learning where the training data is partially labeled and the goal is to predict the label for new data.
Supervised learning: A type of machine learning where the training data is labeled and the goal is to predict the label for new data.
Transfer learning: The process of adapting a pre-trained machine learning model for a new task, by fine-tuning the model using additional data and task-specific layers.
Underfitting: The phenomenon where a machine learning model performs poorly on the training data and on new data, due to being too simplistic or not being able to capture the underlying patterns in the data.
Unsupervised learning: A type of machine learning where the training data is not labeled and the goal is to discover patterns or relationships in the data.
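
One common form of cross-validation is k-fold cross-validation, in which each fold takes a turn as the validation set while the remaining folds are used for training. The sketch below only generates the index splits. The function name `k_fold_indices` is an assumption for this example, and the data is not shuffled, which real workflows often do first.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_indices(n_samples=6, k=3):
    print("train:", train, "validation:", validation)
```

A model would be trained and scored once per fold, and the k scores averaged to estimate how well it generalizes.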

Data Storage

Data lake: A centralized repository that allows you to store all your structured and unstructured data at any scale.
Data lake analytics: The process of analyzing data in a data lake using a variety of tools and techniques.
Data lake architecture: The design of a data lake, including the hardware and software components, data ingestion and processing pipelines, and security and governance controls.
Data lake governance: The processes, policies, and standards that are used to ensure that data in a data lake is managed and protected in a way that meets the needs of the organization.
Data lake ingestion: The process of loading data into a data lake from various sources.
Data lake query: The process of searching for and retrieving data from a data lake using SQL or other query languages.
Data lake security: The measures taken to protect data in a data lake from unauthorized access, use, disclosure, disruption, modification, or destruction.
Data lake storage: The process of storing data in a data lake in a way that is efficient, scalable, and secure.
Data lake transformation: The process of cleansing, integrating, and transforming data in a data lake to prepare it for analysis.
Data lake use cases: The specific business scenarios in which a data lake can be used to store, process, and analyze data.
Data lake visualization: The process of creating visual representations of data in a data lake to understand and communicate trends and patterns.
Data mart: A subset of a data warehouse that is designed to meet the specific needs of a particular business unit or department.
Data warehouse: A structured repository for storing and managing large amounts of historical data that is used for reporting and analysis.
Data warehousing appliance: A pre-configured hardware and software system that is designed specifically for data warehousing.
Data warehousing schema: The logical structure of a data warehouse, including the organization of data into tables and the relationships between them.
Data warehousing tool: A software application that is used to build, maintain, and query a data warehouse.
Database: A structured collection of data that is stored electronically and accessed by computers.
Database index: A data structure that is used to speed up the search for specific records in a database.
Database management system (DBMS): Software that is used to create, maintain, and access databases.
Database migration: The process of transferring data from one database to another.
Database normalization: The process of organizing a database in a way that reduces redundancy and dependency, and improves data integrity.
Database schema: The logical structure of a database, including the organization of data into tables and the relationships between them.
Database security: The measures taken to protect a database from unauthorized access, use, disclosure, disruption, modification, or destruction.
Non-relational database: A database that does not store data in tables with rows and columns, but rather in a more flexible structure such as documents or key-value pairs.
NoSQL: A class of non-relational databases that are designed to handle large amounts of unstructured or semi-structured data.
Relational database: A database that stores data in tables with rows and columns, and uses relationships between tables to retrieve data.
Structured query language (SQL): A programming language used to manage and query relational databases.
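
Several of the entries above (relational database, database schema, database index, and SQL) can be seen together in a small sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Schema: two tables linked by a relationship (orders.customer_id -> customers.id).
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id), "
    "amount REAL NOT NULL)"
)

# Index: speeds up lookups of orders by customer.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (1, 5.5), (2, 12.0)],
)

# SQL query that uses the relationship between the tables to total orders per customer.
totals = conn.execute(
    "SELECT c.name, SUM(o.amount) FROM customers AS c "
    "JOIN orders AS o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(totals)

conn.close()
```

Storing customer names once in `customers`, rather than repeating them on every order, is also a small example of the redundancy reduction described under database normalization.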

Ops Terms

AIOps: The application of artificial intelligence (AI) and machine learning to IT operations in order to automate tasks such as anomaly detection, event correlation, and incident response.
CloudOps: A set of practices and tools that aims to improve the management and operation of cloud computing environments.
Configuration management: The practice of managing and maintaining the configuration of software and infrastructure in a consistent and automated way.
Continuous delivery (CD): The practice of automatically building and testing code changes so that they are always ready to be released to production, with the final deployment typically triggered by a manual approval.
Continuous deployment: The practice of automatically deploying code changes to production environments as soon as they are ready, without requiring manual approval.
Continuous integration (CI): The practice of integrating code changes into a shared repository multiple times a day, and automatically building and testing the code to ensure that it is always in a deployable state.
Data cleaning: The process of identifying and correcting errors or inconsistencies in data.
Data engineering: The practice of designing, building, and maintaining the infrastructure and pipelines that are used to collect, store, and process data.
Data exploration: The process of examining and summarizing data in order to understand its characteristics and identify patterns or trends.
Data ingestion: The process of loading data into a storage system from various sources.
Data operations: The practice of managing and maintaining the data infrastructure and pipelines that are used to deliver data products.
Data pipeline: The series of steps and tools that are used to collect, process, and store data.
Data preparation: The process of cleaning, transforming, and enriching data in order to make it ready for analysis.
Data processing: The process of applying transformations to data in order to extract insights or prepare it for analysis.
Data product: A product that is built using data, such as a machine learning model, a dashboard, or a recommendation system.
Data quality assurance: The practice of verifying that data meets certain quality standards before it is used in data products.
Data transformation: The process of cleansing, integrating, and transforming data to prepare it for analysis.
Data validation: The process of verifying that data meets certain quality standards before it is used in analysis or decision-making.
Data version control: The practice of tracking and managing changes to data in order to maintain the integrity and traceability of data products.
Data wrangling: The process of cleaning, transforming, and integrating data in order to make it ready for analysis.
DataOps: A set of practices and tools that aims to improve the collaboration and communication between data engineering, data science, and data operations teams in order to deliver data products faster and more reliably.
Deployment pipeline: The series of steps and tools that are used to build, test, and deploy code changes.
DevOps: A set of practices and tools that aims to improve the collaboration and communication between development and operations teams in order to deliver software faster and more reliably.
Infrastructure as code (IaC): The practice of managing infrastructure using configuration files that can be versioned, reviewed, and automated, just like code.
Log management: The practice of collecting, storing, and analyzing log data from systems and applications in order to identify and troubleshoot problems.
Microservices: Small, independent services that can be developed, deployed, and scaled independently.
MLOps: A set of practices and tools that aims to improve the collaboration and communication between machine learning development and operations teams in order to deliver machine learning products faster and more reliably.
ModelOps: A set of practices and tools that aims to improve the collaboration and communication between machine learning model development and operations teams in order to deploy and maintain machine learning models in a production environment.
Monitoring and alerting: The practice of monitoring systems and applications for signs of problems or performance issues, and triggering alerts when necessary.
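
The data validation entry above can be sketched as a set of named rules that each record must pass before it is used. The rule names, the record fields (`id`, `amount`), and the `validate` function are all hypothetical. Dedicated validation libraries offer far richer checks.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Rule = Callable[[Record], bool]

# Hypothetical quality rules; each returns True when the record passes.
RULES: Dict[str, Rule] = {
    "id is present": lambda r: r.get("id") is not None,
    "amount is non-negative": lambda r: (
        isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
    ),
}


def validate(record: Record) -> List[str]:
    """Return the names of the rules that the record fails."""
    return [name for name, rule in RULES.items() if not rule(record)]


records = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": -5},
]
for record in records:
    failures = validate(record)
    print(record, "->", "ok" if not failures else failures)
```

In a data pipeline, records with failures might be rejected, quarantined, or flagged for review, depending on the organization's quality standards.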