Over the new year I decided to get back to writing, and since this is a blog about experiences and thinking about data, I decided to start documenting my thoughts on GenAI and how data architecture will need to change to support it. Without a doubt, the hype around GenAI is transforming businesses and consumer behavior alike. Over the past few years, organizations have established AI committees, ethical use standards, AI innovation initiatives, and numerous other structures aimed at deriving value from AI. This has led to a seismic shift in thinking about how we process, manage, and leverage data.
The Current Landscape of AI
In the past decade, AI technologies have made monumental strides. While data science, analytics, and machine learning (ML) have been around for some time, the advent of powerful language models, such as OpenAI’s GPT series, has unlocked new capabilities for data processing. LLMs, in particular, are designed to understand, generate, and manipulate human language with high accuracy. This has created an explosion of possibilities for automating tasks, extracting insights from unstructured data, and providing real-time decision-making support.
A McKinsey report on the state of AI in 2024 finds that AI adoption is still surging, reaching 72% of organizations, with the biggest increase in professional services. Some of the largest cost reductions expected from GenAI are in human resources, service operations, and supply chain, and many organizations also expect revenue growth from efficiencies in legal/risk, IT, sales, and supply chain. In my own role, I've seen the same expectations and trends: budgets are shifting toward AI, and the business's expectations are shifting with them. We've gone from the business needing "AI" to the business wanting solutions to problems that we think AI can solve. The hype cycle around AI is moving faster than it did for many other technologies, and it's only accelerating as expectations grow.
Pressure on Data Architectures
Traditional data architectures, which rely on static models like data lakes, data warehouses, and data marts, are under increasing pressure to adapt to the demands of modern AI systems. AI models, especially LLMs and GenAI, require real-time data ingestion, fast processing, and seamless integration across multiple systems. Legacy data architectures, often siloed and rigid, struggle to meet the agility and performance demands of AI applications. At the same time, those architectures are the foundation for controlled financial reporting and analytics, which need sources that are complete and correct.
One of the most common questions I get asked is "how ready is our data for AI?", and the answer really does depend. It depends on whether your data was ready for ML, or data science, or analytics. The constant across all data work remains how good a data repository you can stand up and whether it ensures the level of quality the task requires. AI workloads can be different enough that the technical architecture you need to be successful will likely change, but your data needs are still very similar to every other data need of the last 50 years. What AI is pushing the limits on is managing the massive volume of data generated by AI systems and the combined need for high-performance storage and rapid access. Additionally, the variety of data types (text, image, video, and more) requires an architecture capable of supporting complex data relationships, high-throughput processing, and flexible querying mechanisms. These pressures have forced a rethink of how data is structured, stored, and accessed in the era of AI. For example, you can likely assess the quality of your financial data, but how do you know the quality of your unstructured data repository, or your images?
Approaches for AI-Driven Data Architectures
As AI continues to evolve, several architectural patterns can better support AI initiatives, and established patterns are being revisited frequently. Some of the concepts worth exploring are:
- Data as a Service (DaaS): In this approach, data is treated as a dynamic resource that can be accessed and manipulated on-demand through APIs. This model allows AI applications to tap into vast, constantly updated data sources without needing to rely on traditional, static data storage models.
- Hybrid and Multi-Cloud Architectures: Given the diverse needs of AI systems, hybrid cloud models that combine on-premises data centers with cloud infrastructure are gaining traction. Multi-cloud architectures, which distribute workloads across different cloud providers, offer additional flexibility and resiliency, enabling AI models to scale seamlessly. This approach is becoming increasingly common to maximize the tools available for AI workloads and keep costs down as well.
- Serverless Computing: Serverless models abstract the infrastructure layer, allowing organizations to focus solely on the business logic of their AI applications. This approach allows for elasticity and automatic scaling, reducing the overhead of managing infrastructure while ensuring AI systems can handle fluctuating workloads.
- Real-Time Data Processing Pipelines: AI applications demand near-instantaneous data processing. Modern data architectures increasingly rely on streaming data pipelines that process and analyze data in real-time, enabling rapid decision-making for AI systems. The Kappa architectural pattern is an example of this, where data is streamed through a single pipeline and exposed via streams in a speed layer. (See references below)
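The Kappa idea above can be sketched in a few lines. This is a toy illustration, not a production pipeline: in a real system the stream would come from a broker like Kafka and the loop would never terminate. The `Event` type and `stream_pipeline` name are hypothetical.

```python
# Kappa-style sketch: every record flows through a single streaming path,
# and the serving view is just an incremental materialization of the stream.
# There is no separate batch layer; reprocessing means replaying the stream.
from dataclasses import dataclass
from collections import defaultdict
from typing import Iterable

@dataclass
class Event:
    entity_id: str
    value: float

def stream_pipeline(events: Iterable[Event]) -> dict:
    """Fold the raw event stream into a queryable serving view."""
    view = defaultdict(float)
    for event in events:                  # in production this loop never ends
        view[event.entity_id] += event.value  # incremental update per record
    return dict(view)

events = [Event("cust-1", 10.0), Event("cust-2", 5.0), Event("cust-1", 2.5)]
print(stream_pipeline(events))  # {'cust-1': 12.5, 'cust-2': 5.0}
```

The point of the pattern is that there is only one code path to maintain: historical recomputation is a replay of the same stream, not a second batch implementation.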
Key Consideration: You may need to rethink how you're surfacing and providing data overall. Traditional approaches may be too rigid for AI workloads to take advantage of, which increases the likelihood that separate AI data silos will be stood up.
Impact on Data Lakes, Warehouses, and Marts
As previously noted, the rise of AI technologies has introduced new challenges for traditional data systems, particularly data lakes, warehouses, and marts.
- Data Lakes: While data lakes are well-suited for storing vast amounts of raw, unstructured data, they struggle with the complexity and speed required by AI systems. AI applications often need real-time access to data, which traditional data lakes were not originally designed to handle. As a result, we are seeing the rise of “AI-optimized” data lakes, where pre-processing and intelligent indexing enable faster retrieval of relevant data.
- Data Warehouses: Data warehouses, traditionally used for structured data and reporting, are evolving to support AI workflows. AI-driven data warehouses are increasingly integrating machine learning and real-time analytics to enable better data exploration and predictive insights. Cloud-based data warehouses like Snowflake and Google BigQuery have become popular due to their ability to scale horizontally, which is ideal for supporting the data needs of AI systems.
- Data Marts: Data marts are being redefined as AI systems require more specific data sets that are purpose-built for analysis. AI-driven data marts focus on organizing data into smaller, relevant subsets, often optimized for specific tasks or user groups. This specialization allows for faster access to the right data for training and running AI models.
Key Consideration: Your data architecture will still include lakes, warehouses, and marts, but your medallion architecture may leverage the lake and warehouse in different ways, such as storing golden data in a lake-like pattern while your traditional warehouse structures become presentation oriented.
Changing Data Storage
Traditional storage solutions, built around relational databases or file systems, are no longer sufficient for the demands of AI. AI workloads often require high-speed data access, parallel processing, and distributed storage capabilities to handle the vast amounts of data involved. To accommodate these needs, new storage technologies are emerging:
- Distributed Storage Systems: Technologies like Hadoop Distributed File System (HDFS) and cloud-native storage options (e.g., Amazon S3, Google Cloud Storage) allow for scaling storage across multiple nodes. These systems can store structured, semi-structured, and unstructured data efficiently and are particularly suited for AI workloads. Blob storage has been particularly successful for AI workloads, both for the data itself and for the metadata and outputs of AI systems.
- Graph Databases: Graph databases, such as Neo4j, are becoming more relevant as AI models often rely on relationships between entities, rather than traditional tabular data. These databases are optimized for navigating complex networks of data, which is ideal for training AI models, particularly in areas like recommendation systems and natural language processing.
- Object Storage: Object storage systems, such as AWS S3, allow for storing large amounts of unstructured data in a cost-efficient way. These systems are increasingly integrated with AI platforms to provide the speed and flexibility needed for AI applications to access and analyze diverse data types, from images to text.
- Vector Databases: The backbone of GenAI, vector databases efficiently handle high-dimensional vector data. They store data as mathematical representations (embeddings), making them ideal for machine learning models to recall previous inputs and build context.
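To make the vector database item concrete, here is a minimal sketch of the core retrieval operation: embed items as vectors, then rank by cosine similarity. The embeddings are toy values I made up for illustration; real systems use learned embedding models and approximate nearest-neighbor indexes (such as HNSW) to search billions of vectors.

```python
# Toy vector search: rank stored embeddings by cosine similarity to a query.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings; a real store would hold millions.
store = {
    "invoice":   [0.9, 0.1, 0.0],
    "contract":  [0.8, 0.2, 0.1],
    "cat-photo": [0.0, 0.1, 0.9],
}

def search(query_vector, k=2):
    # Exact (brute-force) ranking; vector databases approximate this at scale.
    ranked = sorted(store.items(),
                    key=lambda kv: cosine_similarity(query_vector, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

print(search([0.85, 0.15, 0.05]))  # ['invoice', 'contract']
```

Everything a vector database adds, such as indexing, filtering, and persistence, is in service of doing this one ranking operation fast over enormous collections.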
Key Consideration: Future modern platforms aren't about picking one new database or storage technology and embracing it; they're about embracing many of them and using them in concert for the right outcomes. It is very unlikely that all your storage should live in a traditional structured data set, so start thinking about how your data may need to be stored in different forms depending on the workload. For instance, is doing all your processing and storage in Snowflake or Databricks the right solution, or should you use one for processing, one for presentation, and land storage in a blob so it can be distributed to any workload?
Modeling Data for AI
Model-driven architecture (MDA) is gaining ground in AI-driven data architecture because it offers a way to bridge the gap between business requirements and technical implementations. MDA focuses on creating abstract models that can be easily adapted, allowing data systems to evolve alongside the models they support. In the context of AI, this means developing flexible models that can incorporate data from multiple sources and evolve as the AI systems themselves learn and adapt.
The integration of AI models into architecture involves continuous feedback loops, where the architecture is refined based on insights from AI outputs. This allows for systems that can self-optimize and adapt in real time, responding to changing data and business needs without constant manual intervention. The need for flexibility drives a "shift-right" approach to final modeling: complex structures and models are applied as the last step in the pipeline rather than too early. Invoking rigid structures too early can produce a sluggish, unresponsive data structure that can't respond to the changing needs of AI. The question becomes less about whether to use 3NF, 6NF, Dimensional, Vault, One-Big-Table, or another technique, and more about how to be able to produce any of them should the need arise.
Key Consideration: Your modeling techniques and skills will have to evolve as you think about how data architecture is changing. An approach for one solution may not be the right one for another solution. Begin thinking about how to add flexibility into your pipelines that enable multiple models and outputs that can feed different types of needs.
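A small sketch of the shift-right idea: keep the data in a flexible, lightly modeled form and only project it into a rigid shape (dimensional, one-big-table, and so on) at the last step of the pipeline. All the names and records here are hypothetical.

```python
# Flexible base layer: raw, lightly modeled records.
raw_orders = [
    {"order_id": 1, "customer": "Acme", "region": "EU", "amount": 120.0},
    {"order_id": 2, "customer": "Acme", "region": "EU", "amount": 80.0},
    {"order_id": 3, "customer": "Bolt", "region": "US", "amount": 50.0},
]

def to_one_big_table(rows):
    # Denormalized, AI/ML-friendly shape: one wide row per event.
    return [dict(r) for r in rows]

def to_dimensional(rows):
    # Star-schema-ish shape: a customer dimension plus a fact table.
    dim_customer = {
        r["customer"]: {"customer": r["customer"], "region": r["region"]}
        for r in rows
    }
    fact_orders = [
        {"order_id": r["order_id"], "customer_key": r["customer"],
         "amount": r["amount"]}
        for r in rows
    ]
    return list(dim_customer.values()), fact_orders

dims, facts = to_dimensional(raw_orders)
print(len(dims), len(facts))  # 2 3
```

Because the rigid shapes are projections computed at the end, adding a new consumer (say, a vault or a feature table) means adding one more projection function, not remodeling the base layer.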
Changes to ELT and ETL
ETL (Extract, Transform, Load) processes have long been a cornerstone of data management. If you think about it, we really haven't changed the concepts in 40 years: even with new technologies, we're still doing the same things we were doing in the '80s, only more efficiently. The rise of AI is driving changes to how data is moved and transformed. In traditional ETL, data is extracted from various sources, transformed into a consistent format, and loaded into a central repository. With AI, these processes must evolve to meet the needs of real-time processing and dynamic, multi-source ingestion. AI still leverages many of the same paradigms, but it demands far greater velocity.
- Continuous Data Pipelines: AI systems require continuous data ingestion, which means ETL processes must be real-time and continuous. This is facilitated by stream processing tools like Apache Kafka or AWS Kinesis, which allow for the real-time movement and transformation of data.
- AI-Optimized Transformations: Instead of relying on static transformation rules, AI-driven ETL processes incorporate machine learning to optimize data transformations dynamically. This could involve transforming raw data based on patterns recognized by AI models, ensuring the data is always ready for analysis.
- Decoupling ETL from Data Storage: As AI systems require data to be continuously updated, traditional ETL approaches are shifting toward a more decoupled model, where data transformations happen in a real-time streaming fashion, rather than during batch processing. This ensures that data is always fresh and available for immediate analysis.
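The continuous, decoupled shape described above can be sketched as three independent stages composed over a stream, so each record is transformed as it arrives rather than in a nightly batch. This is a toy illustration: in production the source would be a broker such as Kafka or Kinesis and the sink a warehouse or lake, but here a generator and a list stand in for them.

```python
import json

def extract():
    # Stand-in for a streaming source (Kafka, Kinesis, etc.).
    for raw in ['{"id": 1, "amt": "10.5"}', '{"id": 2, "amt": "3.0"}']:
        yield raw

def transform(records):
    # Per-record cleaning applied as data arrives, not batch-wide overnight.
    for raw in records:
        rec = json.loads(raw)
        rec["amt"] = float(rec["amt"])
        yield rec

def load(records, sink):
    # Stand-in for a warehouse/lake write, decoupled from the stages above.
    for rec in records:
        sink.append(rec)

sink = []
load(transform(extract()), sink)
print(sink)  # [{'id': 1, 'amt': 10.5}, {'id': 2, 'amt': 3.0}]
```

Because each stage only consumes and yields records, any stage can be swapped (a new source, a smarter transformation, a different sink) without touching the others, which is the decoupling the bullet points describe.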
Key Consideration: Expect your ETL/ELT pipelines to become more cloud driven, align more to streams and real-time data, and leverage next-gen capabilities to enhance security and governance through embedded AI technologies within the pipeline. AI will give us ways to produce more ETL jobs faster, providing automated cleaning and modeling and leveraging learning algorithms to adjust to changing data volumes.
Conclusion and Call to Action
The integration of LLMs and GenAI into data architectures marks a profound shift in how we handle and utilize data. AI-driven data architectures are providing greater flexibility, speed, and intelligence, reshaping the roles of data lakes, warehouses, and marts.
As a leader in data or AI, you should be thinking about how your data architecture will need to evolve to support the business. Hopefully, it doesn’t keep you up at night!
References