What Is A Data Product?
Data is being productized. We can now talk about so-called “data products” as defined information entities that exist in a more formally organized and correlated form, one that makes the data itself easier to consume. Data products are defined as live, refined, fully governed and ready-to-use data assets that are instantly discoverable, contextualized, trustworthy and reusable. Put simply, data products allow organizations to reuse data across a variety of use cases to save time and money.
Confluent’s 2024 Data Streaming Report examines how organizations use data streaming to accelerate AI adoption while also working to overcome data accessibility and management challenges. In the report, some 91% of IT leaders say that they are investing in data streaming platforms to drive their data goals, which include data product creation.
Why Are Data Products Useful?
Among the more notable benefits of data products is more confident data sharing across business units. They also enable more meaningful cost allocation or charging based on usage metrics and allow for more robust risk management.
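To make the idea of usage-based cost allocation concrete, here is a minimal sketch in Python. It is an illustration only: the function name `allocate_cost`, the event format and the figures are all hypothetical, and it assumes cost is split purely in proportion to how many times each business unit queried a data product.

```python
from collections import Counter


def allocate_cost(total_cost, usage_events):
    """Split total_cost across business units in proportion to usage.

    usage_events is a list of (business_unit, data_product) pairs,
    one pair per recorded access. Both names are hypothetical.
    """
    # Count how many accesses each business unit made.
    counts = Counter(unit for unit, _product in usage_events)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Each unit pays its share of the total accesses.
    return {unit: total_cost * n / total for unit, n in counts.items()}


# Illustrative data: marketing made three accesses, finance made one.
events = [
    ("marketing", "orders"),
    ("finance", "orders"),
    ("marketing", "orders"),
    ("marketing", "customers"),
]
shares = allocate_cost(1000.0, events)
print(shares)
```

A real chargeback model would likely weigh data volume, compute time or freshness requirements as well; a simple access count is used here only to keep the sketch short.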
“The world is clearly powered by a monolithic amount of big data – and the amount of data we have at our fingertips will only exponentially increase as we move further into the age of AI,” said Richard Timperlake, SVP EMEA, Confluent. “However, having all this data means nothing if a business can’t extract its full value. One of the more common issues that comes up in conversations with software engineering team managers is the spectre of siloed data. Companies have data, but find it impossible to locate and benefit from what is one of their most critical business assets.”
Timperlake says that most organizations come to his team when they need to create a dynamic and flexible data environment that will break the traditional data topography that is so heavily characterized by silos, stale data, deduplication disconnects and other forms of misconfiguration.
“By embracing the methodologies that enable us to focus on data products, businesses can move to achieve an amalgamation of data assets as they also take advantage of data streaming. With real-time data being refined and organized into data products so that both historical and real-time data can be available for analysis and informed decision-making, the IT function can move its information stance forward by an order of magnitude,” added Timperlake.
Reflecting on Timperlake’s comments, we can perhaps see the popularization of data product technologies helping enterprises to keep up with the speed of modern business and the operational and transactional data workload that companies in all sectors now have to shoulder.
Data As A Tangible Asset
“The speed of commerce, transactions and the underlying networks that form the core fabric of modern enterprise systems has of course rapidly increased over the last decade,” said James Malone, head of data storage and streaming, Snowflake.
“Business owners now increasingly understand that data should be viewed as a tangible asset to build internal and external information-rich products that can add value. By harnessing streaming data — elevating an organization’s data architecture competency to a level where it can ingest, transform and analyze data more quickly, often in real-time — business units can increase the value of these data products logarithmically.”
Commenting on the age of AI, Malone suggests that consumers of data products are expecting faster, more intelligent and increasingly customized technology services. As a result, he says that robust streaming pipelines are the key to powering these services. Put another way, users can run a search and get answers, which is fine… but now, people increasingly want answers specifically tailored to them – and that’s a level of functionality that requires a higher-grade data product in the mix, he says.
“Without the most current and relevant data, building these services is much more challenging,” explained Malone. “Moreover, some data has a very short window of usefulness, such as someone searching for what they might have for dinner that night. Combined and coalesced, use of data products in line with data streaming technology helps unlock the full use of all data in an organization, with much less waste.”
Malone’s comments reflect Snowflake’s wider stance and technology proposition: the organization brands itself as the AI Data Cloud company. So that’s not just a cloud computing service full of apps per se (although it is that too); it’s a cloud computing service potentially populated by data products, which is why Snowflake talks about data application development, data sharing and data marketplaces.
This resonates with comments made by Snowflake EVP of products Christian Kleinerman at Snowflake Summit 2024. The company has this year announced Polaris Catalog, a vendor-neutral, open catalog implementation for Apache Iceberg, the open standard of choice for implementing data lakehouses, data lakes and other modern architectures.
Data Productization Will Flourish
“We want to give organizations choice. We are fully committed to if customers want to mix and match Snowflake as a query engine with other engines, that sounds great to us from the perspective of letting customers adopt the data architecture that they feel best fits their needs. Our goal is to continue to differentiate Snowflake by providing customers with the best experience, ease-of-use and price performance,” said Kleinerman. “Most organizations have hundreds to thousands of times the amount of data sitting in cloud storage compared to what they bring into Snowflake. Of course, a lot of unstructured data is also sitting in cloud storage. We think our vendor-neutral, open catalog implementation for Apache Iceberg with Polaris Catalog vastly opens up the amount of data that Snowflake can act on… and, logically, further widens the chance for useful data productization to flourish.”
Some industries are farther along than others in switching to a data-as-a-product mindset. The life sciences industry, for example, faces far more data silos than we might see in financial services, retail and logistics, due to its highly fragmented vendor ecosystem, extremely complex scientific workflows and the distributed nature of scientific R&D.
“Life scientists are doing world-class work with one hand tied behind their backs,” said Siping ‘Spin’ Wang, co-founder and CTO of scientific data and AI cloud company TetraScience. He estimates that, on average, any one of the top 150 global biopharmaceutical firms produces 30 to 50 exabytes of scientific data. “That would make scientific data the world’s largest, fastest-growing and most valuable information asset. But if you ask any scientist or senior executive in charge of research, development, manufacturing what business value they’re getting from that data, they will tell you they’re spending far too much time managing data and not doing science based on it.”
The problems here, it appears, are not specific to any single life sciences company, which is why enterprises need to insist on working only with open, free-flowing and purpose-built data products for scientific data. Wang’s team says that a scientific-data-as-a-product mindset requires replacing endpoint-oriented data systems with vendor-agnostic platforms that make engineered and standardized data widely shareable, extensible and tailored for the vertical.
Decorating Data, For Detail
“A proper scientific-data-as-a-product mindset insists on enriching and decorating data with context, content, schema, taxonomy, ontology, lineage and interfaces. Getting those basics right is hard and takes deep domain expertise, but doing so is absolutely essential to make any organization’s data available as a product for apps and dashboards or to use it to unlock robust new scientific AI use cases that will tackle some of humanity’s most significant healthcare challenges,” says Wang.
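As a rough illustration of what “decorating” data might look like in practice, the sketch below models a data product descriptor in Python and checks which metadata is still missing. Everything here is an assumption made for illustration: the `DataProduct` class, its field names and the example values are hypothetical and are not drawn from TetraScience or any other vendor’s platform.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DataProduct:
    """Hypothetical descriptor: a dataset plus the metadata that decorates it."""
    name: str
    description: str = ""                          # context for consumers
    schema: dict = field(default_factory=dict)     # column name -> type
    taxonomy: list = field(default_factory=list)   # classification tags
    lineage: list = field(default_factory=list)    # upstream sources
    interfaces: list = field(default_factory=list) # how consumers reach it

    def missing_metadata(self):
        # Report every descriptor field that is still empty.
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative example: an assay dataset with no lineage recorded yet.
assay = DataProduct(
    name="plate_reader_results",
    description="Absorbance readings per well, per run",
    schema={"well": "string", "absorbance": "float"},
    taxonomy=["assay", "plate-reader"],
    interfaces=["sql"],
)
print(assay.missing_metadata())
```

A check like this could act as a simple gate: a dataset is only published as a product once nothing is reported missing. Real platforms would capture far richer ontology and lineage information than a flat list can hold.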
For too long, data was purely a function of IT, which still has a huge role to play here. However, treating data as a product that is “always on sale” brings other users into the picture.
Wang rounds out by saying that the moment we start to talk about data as a product in the life sciences market (and not as a project to deal with via a start-to-finish plan), we can start to realize how essential it is to bring the scientists and researchers into the design of the data product due to the complexity of scientific data and workflows. This then allows them to share their thoughts on how data products need to evolve over time. Realistic about the hurdles we face ahead, Wang advises that those conversations may add more complexity, but thinking about data as a product puts us more in tune with the type of results people want to achieve with their data.
Driving Use Of Data Products
“With the theories, practices and methodologies that underpin the development of data productization under our belt, we can start to think about using data products in practical implementation scenarios,” said Thomas Robinson, COO at Domino Data Lab. “Building any pre-existing, new or emerging data channel into an enterprise’s operational pipeline requires a unified platform approach to govern and centralize broad access to data, tools, compute power and AI models in order to deliver business outcomes across any IT environment.”
Robinson suggests that with the efficiency factor offered by data products, software engineering and business teams need to think about how they can foster data science collaboration and establish best practices for a more holistic and rapid approach to information ingestion. Given the need to now bring more open and extensible IT stacks into production, he insists that there is a real need for firms to be able to accelerate and scale AI – some of which will inevitably be through the use of data products – while ensuring governance and reducing costs.
“As this information dynamic plays out with a degree of codification and standardization, data products will play an important role in fueling the factory for all AI,” said Robinson. “As data, languages, IDE and microservice components and other packages all driven by different compute frameworks come together, the largest enterprises must rely on MLOps platforms to provide reproducibility, collaboration and governance which will accelerate new business value creation with AI. Always the unsung heroes of any IT shop, data science leaders may become the new business heroes.”
Data Products In The Future
While the platforms enabling, facilitating and championing data product technology are already running and working, mechanically at least, the adoption and implementation of these technologies is still at an embryonic stage for most enterprise organizations. We can reasonably expect to be sitting through “How airline X or supermarket giant Y (insert enterprise-scale service or product vendor of your choice) gained competitive advantage through data product innovation” keynotes at technology conventions between now and the end of the decade, so brace yourself for that.
In the interim period, data products and the act of data productization may be prone to simplification through abstraction tools, specialization through industry-specific alignment and acceleration through (you’ll never guess) the obvious implementation of artificial intelligence and machine learning.
Until then, data may still come in a database, but soon it will come in a packet… remember to recycle responsibly please.