What is Data Mesh?
Data Mesh is a decentralized data architecture for managing data across multiple sources and systems. Users can access data where it lives, without first moving it into a central data lake or data warehouse. Ownership is distributed among domain teams, each of which manages its data as a product, independently and securely. This reduces bottlenecks and silos in data management and lets the organization scale without sacrificing data governance.
A monolithic data infrastructure stores, processes, and transforms data in one central data lake. A data mesh, on the other hand, treats data as a ‘product’: each domain handles its own data and presents it as a product rather than as raw data. The tissue connecting these domains and their data assets is a universal interoperability layer that applies the same syntax and data standards.
It is a socio-technical approach that requires changes across all three dimensions: people, processes, and technology. Organizations can direct most of their effort toward people and processes and comparatively less toward technology, which can help cut costs and reduce dependency on technology compared with data lakes. A data mesh may not suit every business: smaller businesses with relatively little data and few data nodes may not need one, whereas enterprises that want better scalability in their data management are more likely to benefit.
The four pillars of Data Mesh
Domain-oriented ownership
Ownership of data can be decentralized to the business domains closest to it, meaning either those who collect the data or those who use it. A good way to do this is to decompose the data into logical parts based on how each part relates to the rest of the organization. Each part of the organization's data can then be managed independently of the others, so if one part becomes more important than another, its development can be prioritized much faster. The aim is to give the organization a single source of truth even though its data assets are scattered and may not communicate with one another.
Data as a product
Business domains are held accountable for sharing their data as a product. Data products let multiple users gain value from data, and they serve people in different roles, including data analysts and data scientists. Because data products are easy to use and understand, finding, accessing, and sharing high-quality domain data peer-to-peer becomes simpler. Clear contracts between the domains' data products preserve autonomy and operational isolation (i.e., changing one domain's data product should not destabilize others).
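To make the idea of a "clear contract" concrete, here is a minimal Python sketch. The `DataContract` class, its fields, and the `orders.daily_summary` product are hypothetical names chosen for illustration; real data mesh implementations typically rely on a schema registry or a dedicated contract format rather than a hand-written class like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract that a domain publishes for one data product."""
    product: str        # name consumers use to refer to the data product
    owner_domain: str   # business domain accountable for the product
    version: str        # bumped whenever the schema changes
    schema: dict        # column name -> expected Python type

    def violations(self, record: dict) -> list:
        """Return human-readable problems found in a single record."""
        problems = []
        for column, expected_type in self.schema.items():
            if column not in record:
                problems.append(f"missing column: {column}")
            elif not isinstance(record[column], expected_type):
                problems.append(
                    f"{column}: expected {expected_type.__name__}, "
                    f"got {type(record[column]).__name__}"
                )
        return problems


# Illustrative contract owned by a (hypothetical) sales domain.
orders_contract = DataContract(
    product="orders.daily_summary",
    owner_domain="sales",
    version="1.0.0",
    schema={"order_date": str, "region": str, "total_amount": float},
)

good_record = {"order_date": "2024-01-31", "region": "EMEA", "total_amount": 1250.0}
bad_record = {"order_date": "2024-01-31", "total_amount": "1250"}

good_problems = orders_contract.violations(good_record)
bad_problems = orders_contract.violations(bad_record)
print(good_problems)
print(bad_problems)
```

Because consumers depend only on the published contract, the owning domain can change how the product is built internally without destabilizing other domains, as long as records still satisfy the schema.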
Self-serve data platform
A modern self-serve data platform gives domain teams the ability to manage the whole life cycle of their data products while maintaining a reliable, connected mesh. This lowers the overall cost of owning data in a decentralized, domain-oriented operating model and architecture. It not only hides the complexity of data management but also reduces the cognitive load on domain teams as they manage the full life cycle of their data products. It also lets a wider population of developers and technology generalists take part in data product development, reducing the need for specialization. Moreover, it provides the computational capabilities that governance needs, such as automatically enforcing policies at the right point in time, discovering data products, accessing a data product, and building or deploying a data product.
Federated computational governance
Federated computational governance is a data governance operating model built on a federated system of decision-making and accountability, made up of representatives from domains, the data platform, subject matter experts, legal, compliance, security, and so on. It creates an incentive and accountability framework that balances the autonomy and agility of domains with the global consistency, interoperability, and security of the mesh. The model relies heavily on codifying policies and executing them automatically at a fine-grained level, for each and every data product.
The governance model counteracts the potential adverse effects of domain-oriented decentralization: incompatibility and disconnection of domains. It provides a way to meet governance requirements such as security, privacy, and legal compliance across a mesh of distributed data products. It also minimizes the organizational overhead of constant alignment between domains and the governance function, and preserves flexibility in the face of continuous change and growth.
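The idea of encoding policies and applying them automatically to every data product can be sketched as "policy as code". The example below is an illustrative assumption, not a reference to any particular governance tool: the policy names, the metadata keys (`owner_domain`, `contains_pii`, `masking`), and the product names are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical metadata that each domain attaches to a data product.
ProductMetadata = Dict[str, object]
Policy = Callable[[ProductMetadata], bool]

# Global policies agreed on by the federated governance group.
# Each policy returns True when the data product complies.
GLOBAL_POLICIES: Dict[str, Policy] = {
    "has_owner": lambda m: bool(m.get("owner_domain")),
    "pii_is_classified": lambda m: "contains_pii" in m,
    "pii_is_masked": lambda m: (
        not m.get("contains_pii") or m.get("masking") == "enabled"
    ),
}


def failed_policies(metadata: ProductMetadata) -> List[str]:
    """Run every global policy and return the names of those that fail."""
    return [name for name, check in GLOBAL_POLICIES.items() if not check(metadata)]


compliant = {
    "name": "customers.profile",
    "owner_domain": "marketing",
    "contains_pii": True,
    "masking": "enabled",
}
non_compliant = {
    "name": "orders.raw",
    "owner_domain": "",
    "contains_pii": True,
}

compliant_failures = failed_policies(compliant)
non_compliant_failures = failed_policies(non_compliant)
print(compliant_failures)
print(non_compliant_failures)
```

In a setup like this, the platform could run `failed_policies` whenever a data product is published, so domains keep their autonomy while the same global rules are checked everywhere without manual alignment.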
Principles of Data Mesh and their interconnectivity
Data Mesh vs centralized data storage
Centralized data stores such as data lakes are a cost-effective architecture, but they come with a few drawbacks. The following table lists the drawbacks of centralized storage and how Data Mesh attempts to overcome them.
Drawbacks of data lakes / centralized data storage | Data Mesh solution |
---|---|
Access to the data needed is relatively slow. All data is stored centrally and is handled by a separate team that does not necessarily have expertise in the specific domain data. | Data can be obtained quickly because it is retrieved directly from the specific domain (Note: there are other ways of addressing this challenge, such as a sophisticated data catalog solution). |
Multiple copies of the same data may be stored, each in the format required by an individual domain or department, which increases redundancy, storage load, and cost to the organization. | Data is processed and provided to the consuming team as a product built to the requirement's specifications, so data duplication is less likely. |
As data volume increases, queries can become complicated and may require changes to a data pipeline that does not scale, slowing response times and costing productivity. | Data mesh moves dataset ownership from a central team to individual teams or business users, enabling agility and scalability. It supports real-time decision-making in businesses. |
Integrating Data Mesh architecture
Integrating data mesh into an organization's ecosystem mainly comprises the following three steps:
Step 1 - Connect data sources
The first and main aim is to connect the data sources where they reside, leveraging the organization's existing investments in data storage, i.e., its data lakes or warehouses. Unlike the single-source-of-truth approach, which centralizes all your data first, you query the data where it resides.
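Querying data where it resides can be illustrated with a small, self-contained sketch. Here SQLite's `ATTACH DATABASE` stands in for a federated query engine, and two in-memory databases stand in for stores owned by separate domains; the `sales` and `support` domains, their tables, and the sample rows are assumptions made up for this example.

```python
import sqlite3

# One connection plays the role of the query layer; each attached
# in-memory database plays the role of a separately owned domain store.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS support")

conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE support.tickets (customer_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?)",
    [(1, 120.0), (1, 30.0), (2, 75.5)],
)
conn.executemany(
    "INSERT INTO support.tickets VALUES (?, ?)",
    [(1, "open"), (2, "closed"), (2, "open")],
)

# Join across both stores without copying either into a central repository.
query = """
SELECT o.customer_id, o.total_spent, t.open_tickets
FROM (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM sales.orders
    GROUP BY customer_id
) AS o
JOIN (
    SELECT customer_id, COUNT(*) AS open_tickets
    FROM support.tickets
    WHERE status = 'open'
    GROUP BY customer_id
) AS t ON o.customer_id = t.customer_id
ORDER BY o.customer_id
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

In a real deployment the attached databases would be the organization's existing lakes or warehouses, and the join would run through whichever query engine the platform team has chosen.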
Step 2 - Create logical domains
The second goal is to create an interface through which business teams can find their data. The domains are logical because the data is not simply pulled from a central repository; it is obtained from the relevant autonomous domain or team and made available to each user based on their requirements. This promotes self-service, where data consumers can do more on their own.
Step 3 - Enable teams to create data products
Once the domain teams have access to the data they need, they should be taught to create data products that data consumers can use directly. These data products can be listed in a catalog from which consumers pick the data they need.
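A catalog of data products can be sketched in a few lines. The `DataProduct` and `Catalog` classes below, along with the product names and tags, are hypothetical and only illustrate the publish-and-discover flow; an actual data catalog would add access control, lineage, and persistent storage.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProduct:
    """Hypothetical catalog entry describing one data product."""
    name: str
    owner_domain: str
    description: str
    tags: List[str] = field(default_factory=list)


class Catalog:
    """Minimal in-memory catalog: domains publish, consumers search."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        # Reject duplicates so each product name has exactly one owner entry.
        if product.name in self._products:
            raise ValueError(f"{product.name} is already published")
        self._products[product.name] = product

    def search(self, tag: str) -> List[str]:
        """Return the names of all products carrying the given tag, sorted."""
        return sorted(p.name for p in self._products.values() if tag in p.tags)


catalog = Catalog()
catalog.publish(DataProduct("sales.monthly_revenue", "sales",
                            "Revenue per region and month", ["revenue", "finance"]))
catalog.publish(DataProduct("marketing.campaign_roi", "marketing",
                            "Return on investment per campaign", ["revenue", "campaigns"]))
catalog.publish(DataProduct("support.ticket_volume", "support",
                            "Tickets opened per day", ["operations"]))

revenue_products = catalog.search("revenue")
print(revenue_products)
```

Consumers search by tag rather than by knowing which team owns what, which is the self-service behavior the catalog is meant to support.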
Advantages of Data Mesh in data management
Agility and scalability - Data Mesh enables decentralized data operations, improving time-to-market, scalability, and business agility.
Flexibility and independence - Organizations that choose a data mesh architecture can avoid being locked into one data platform or data product.
Transparency for cross-functional use across teams - Centralized data ownership on traditional data platforms leaves teams of data experts isolated and highly dependent, resulting in a lack of transparency. Data Mesh decentralizes data ownership and distributes it across cross-functional domain teams.
Faster access to critical data - Data Mesh pairs shared, centrally provided infrastructure with a self-service model, which enables faster data access and faster SQL queries.
Closure
Data Mesh as a concept addresses data ownership and quality by placing responsibility for all aspects of specific data sets with decentralized, cross-functional teams. Data Mesh is one way (of many) to overcome data governance challenges. Although Data Mesh is still in the early stages of the “hype-curve”, we think it is an interesting data governance concept.