Curating Cleaner Data In Messy Multimodal Models
AI is messy. We can suggest that artificial intelligence is “messy” because a proportion of it is developed within inappropriate guardrails, leading to security concerns. In some of the messiest cases, poorly managed AI models lead to decision bias and so-called AI hallucinations. We might also say that AI can be a little messy because user trust is still embryonic in many areas; we have as many machine intelligence scaremongers as we do AI evangelists. But perhaps our best validation for claiming that AI is messy comes down to the fact that its data streams are increasingly complex, varied and multimodal… and that makes AI data unstructured.
Some of the problem here lies in the challenge of processing unstructured data in large quantities and attempting to put these information streams through AI analytics algorithms. Software architects and engineers are all too aware of the missing link between structured data technologies and the newer AI workflows, which might now typically be based in modern programming languages like Python. Although Python does have its own processes for data cleansing and categorization, its use is emblematic of the issue we are seeking to explain here… so let’s go on.
What Is Multimodal Data?
While (older) analytical databases often provided plenty of control over data quality, unstructured multimodal AI data (such as text and images or video) prove much harder to assess and improve at scale.
To address these issues we need to start thinking about data more directly as a moving and morphing entity, not as some static set of values that resides in some repository or database, even if that data store is designed for real-time streaming data. We need to think about data in a lifecycle, data inside workflows and data as a chain of operational records.
Intelligent Data Curation
Iterative, the company dedicated to streamlining the workflows of AI engineers and creator of open source projects in MLOps, has a product designed to create impact at precisely this level. DataChain is a tool for processing and evaluating unstructured data. In a world where fewer than a quarter of firms are using generative AI in real world business applications, according to McKinsey’s Global Survey on the state of AI published in early 2024, the proliferation of sophisticated AI foundational models increases the call for cleaner (and more intelligent) curation and data processing.
“The biggest challenge in adopting artificial intelligence in the enterprise today is the lack of practices and tools for data curation and generative AI evaluation that can ensure the quality of results,” said Dmitry Petrov, CEO of Iterative. “As the next step, we need AI models that can evaluate and improve AI models. So far this has only happened at the industry forefront – take a look at DeepMind’s AlphaGo training against itself, or OpenAI’s DALL-E3 curating its own dataset. Our goal is to change this.”
The Iterative boss says that the absence of simple solutions to wrangle unstructured data using AI models in easy-to-manage formats keeps the technology barrier high. In practice, most AI engineers are still building custom code for converting their JSON model responses, adapting them to databases and running models in parallel with out-of-memory data.
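To make that custom glue code concrete, here is a minimal Python sketch of the three chores mentioned above: running a model over files in parallel, converting its JSON responses into typed objects, and adapting them to a database. It uses only the standard library and is not DataChain's API. The `Judgement` schema, the `fake_model` stand-in and the `curate` helper are all hypothetical names invented for illustration.

```python
import json
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Judgement:
    """Hypothetical schema for one model response (an assumption, not a real API)."""
    file: str
    label: str
    score: float


def fake_model(path: str) -> str:
    """Stand-in for a real model call; returns a JSON string like an LLM API might."""
    label = "image" if path.endswith((".jpg", ".png")) else "text"
    score = 0.9 if label == "image" else 0.4
    return json.dumps({"file": path, "label": label, "score": score})


def parse_response(raw: str) -> Judgement:
    """Convert a raw JSON response into a typed Python object."""
    data = json.loads(raw)
    return Judgement(
        file=str(data["file"]),
        label=str(data["label"]),
        score=float(data["score"]),
    )


def curate(paths, min_score=0.5):
    """Score files in parallel, store the results, and return the files that pass."""
    # Run the (stand-in) model calls concurrently; map() preserves input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        raw_responses = list(pool.map(fake_model, paths))

    records = [parse_response(raw) for raw in raw_responses]

    # Adapt the typed objects to a relational table (in-memory for this sketch).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE judgements (file TEXT, label TEXT, score REAL)")
    conn.executemany(
        "INSERT INTO judgements VALUES (?, ?, ?)",
        [(r.file, r.label, r.score) for r in records],
    )
    conn.commit()

    # Curate: keep only the files whose score clears the threshold.
    rows = conn.execute(
        "SELECT file FROM judgements WHERE score >= ? ORDER BY file",
        (min_score,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(curate(["a.jpg", "b.txt", "c.png"]))
```

Even in this toy form, the schema, the parsing step and the table definition have to be kept in sync by hand, which is the kind of repetitive plumbing the paragraph above describes.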
“DataChain democratizes the popular AI-based analytical capabilities like large language models, judging LLMs and multimodal generative AI evaluations, greatly leveling the playing field for data curation and pre-processing,” said Petrov and team. “DataChain can also store and structure Python object responses using the latest data model schemas – such as those utilized by leading LLM and AI foundational model providers.”
A Cleaner Data World?
The development of this type of tool is interesting and admirable enough, but it is arguably unlikely to be enough of a Swiss Army Knife to clean up all messy unstructured data, or to give AI model developers a completely clean pipe of pure information streams for every use case.
That being said, this is the type of tool that larger industry behemoths like to acquire and build into their wider enterprise platforms, so if Iterative does enjoy continued interest and success it will either need to scale intelligently itself or get ready to become part of a Salesforce, Snowflake, IFS, SAP or other (insert major enterprise software vendor platform company of your choice) going forward.
Either way, we do know now that data can often be unstructured, a bit messy and occasionally dirty. Now, wash your hands.