Dataset
What is a Dataset?
A dataset is a structured collection of data gathered and stored for analysis or processing. The data in a dataset is typically related and either originates from a single source or is intended for a specific project. For instance, a dataset might consist of business data such as sales figures, customer contact information, and transactions. Datasets can contain various data types, including numerical values, text, images, and audio recordings. The data can be accessed individually, combined, or managed as a whole. Datasets are central to data analytics and machine learning (ML) because they provide the basis from which analysts derive insights and identify trends. Choosing the right dataset for an ML project is a critical first step in successfully training and deploying an ML model.
Dataset vs. Database
The terms dataset and database are often confused. Although both are related to data organization and management, they have several significant differences:
- A dataset, as described earlier, is a collection of data used for analysis and modeling, typically organized in a structured format like an Excel spreadsheet, a CSV file, or a JSON file. The data can be organized in various ways and come from diverse sources, such as customer surveys, experiments, or existing databases. Datasets serve multiple purposes, including training and testing ML models, data visualization, research, and statistical analysis. They can be shared publicly or privately and are generally smaller than databases.
- A database, on the other hand, is designed for long-term storage and management of large amounts of organized data stored electronically. This allows for easy access, manipulation, and updating of the data. In essence, a database is an organized collection of data stored as multiple datasets. There are various types of databases, including relational databases, document databases, and key-value databases.
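The distinction can be made concrete with a minimal sketch, assuming a tiny CSV dataset of invented sales figures and an in-memory SQLite database as a stand-in for a relational database. The table name `sales`, the column names, and the helper functions `load_dataset` and `build_database` are hypothetical and chosen only for illustration.

```python
import csv
import io
import sqlite3

# A small dataset in CSV form. The figures are invented for illustration.
SALES_CSV = """region,month,amount
north,2024-01,1200.50
south,2024-01,980.00
north,2024-02,1310.25
"""


def load_dataset(text):
    """Parse CSV text into a list of dict rows: the dataset held in memory."""
    return list(csv.DictReader(io.StringIO(text)))


def build_database(rows):
    """Store the rows in an in-memory SQLite table so they can be queried and updated."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
    # Named placeholders pull values from each dict row by key.
    conn.executemany(
        "INSERT INTO sales VALUES (:region, :month, :amount)", rows
    )
    conn.commit()
    return conn


rows = load_dataset(SALES_CSV)   # the dataset: a fixed collection of records
conn = build_database(rows)      # the database: managed storage that answers queries

# Aggregate sales per region with SQL rather than by looping over the rows.
totals = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
)
print(totals)
```

Here the CSV text plays the role of the dataset, while the SQLite table is where that data is kept for ongoing access, querying, and updating. A production database would add persistence, indexing, and access control that this sketch leaves out.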
Examples of Datasets
Datasets can include numbers, text, images, audio recordings, or even basic object descriptions. They can be organized in different forms, such as tables and files. Some examples of datasets are:
- A dataset listing all real estate sales in a specific area during a designated time period.
- A dataset containing information on all known meteorite landings.
- A dataset on regional air quality in a specific area during a designated time period.
- A dataset showing the attendance rate for public school students pre-K-12 by student group and district during the 2021–2022 school year.
Public Datasets
Public datasets are collections of data that anyone can access, usually organized around a theme or topic. They are particularly valuable to data scientists because they are usually free and provide easily accessible, downloadable data for training ML models.
For example, the National Oceanic and Atmospheric Administration (NOAA) offers data on water quality, climate change, and more. Automatic Dependent Surveillance-Broadcast (ADS-B) data tracks commercial aircraft movement in real time, and the U.S. General Services Administration provides Data.gov, which contains over 200,000 datasets in hundreds of categories.
Using Datasets
Datasets are used in various ways. Analysts employ them for data exploration and visualization in business intelligence, while data scientists use them to train ML models. Before datasets can be used in these ways, the data is typically ingested into a data lake or lakehouse through data engineering processes such as extract, transform, and load (ETL). ETL allows engineers to extract data from different sources, transform it into a usable and trusted resource, and load it into systems that end users can access to solve business problems.
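The three ETL stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the raw CSV records are invented, an in-memory SQLite table stands in for the data lake or lakehouse, and the function names `extract`, `transform`, and `load` as well as the `transactions` table are hypothetical.

```python
import csv
import io
import sqlite3

# Raw source data with typical quality problems (invented for illustration):
# stray whitespace, inconsistent casing, and one unparseable amount.
RAW_CSV = """customer,email,amount
Alice, ALICE@EXAMPLE.COM ,100.0
Bob,bob@example.com,not_a_number
Carol,carol@example.com,250.5
"""


def extract(text):
    """Extract: read raw records from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize fields and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip records that cannot be trusted
        clean.append(
            (row["customer"].strip(), row["email"].strip().lower(), amount)
        )
    return clean


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions "
        "(customer TEXT, email TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", records)
    conn.commit()


# SQLite in memory is only a stand-in for the real target system.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

loaded = conn.execute(
    "SELECT customer, email, amount FROM transactions ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping each stage as its own function makes it easier to test the cleaning rules in isolation and to swap the source or the target without touching the rest of the pipeline.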
Managing, Cataloging, and Securing Datasets
Datasets should be cataloged, governed, and securely stored before use. An effective data governance strategy enables organizations to make data readily available for data-driven decision-making while protecting it from unauthorized access and ensuring regulatory compliance.
To tackle data governance challenges, organizations can use unified governance solutions for data and AI assets on the lakehouse. These solutions allow seamless governance of structured and unstructured data, ML models, notebooks, dashboards, and files on any cloud or platform. Data scientists, analysts, and engineers can use these tools to securely discover, access, and collaborate on trusted data and AI assets.
Sharing Datasets
Data scientists often want to collect, analyze, and share datasets. Data sharing promotes connection and collaboration, potentially leading to significant new discoveries. Open source tools can help data scientists and analysts share data and AI assets across clouds, regions, and platforms without relying on proprietary formats, complex ETL processes, or costly data replication. For organizations, this can unlock new revenue streams and drive business value.