Before advancing with generative AI, it is essential to improve data quality

Published: June 11, 2024

Generative AI became widely popular with OpenAI's ChatGPT, sparking a series of concerns and projections for the coming years. One of the most critical concerns for an efficient AI strategy is the quality of the data used to train these models. Data does not appear by chance, so ensuring access to reliable sources is essential to harness the full potential of this technology.

To understand the importance of this point, we can examine how our search for information has evolved from paper to digital. In the book “Talk to Me,” which explores the evolution of voice computing, author James Vlahos discusses the development of search mechanisms at length. Decades ago, we sifted through hundreds of encyclopedia entries for information. With the advent of the internet, we began reviewing dozens of content pieces, a process further streamlined by the emergence of search engines. With the rise of smartphones, we now often look only at the top results of a Google search.

The emergence of voice assistants a few years ago and the now-amplified potential of GenAI bring us to “position zero” in search results: we ask for information, and it is delivered to us without much knowledge of the source’s reliability or whether there was any breach of intellectual property in generating the requested content.

Moreover, open solutions can be used by anyone. There are excellent use cases, such as assistants for code development and for brainstorming ideas, but because every organization has access to the same tools, they offer little organizational differentiation.

Hence, companies are building personalized GenAI solutions using their own databases. This autonomy gives them control over data quality and, most importantly, creates differentiation.

As Swami Sivasubramanian, Vice President of Database, Analytics, and Machine Learning at AWS, said:

“Your data is the differentiator and the key ingredient in creating remarkable products, exceptional customer experiences, or enhanced business operations.”

Indeed, many companies have put GenAI on their agendas simply because it is a trend. However, many of them lack a robust and well-prepared data strategy to support their initiatives.

Unveiling the path to AI maturity through data

The Gartner AI Maturity Model comprises five levels, as illustrated in the following image:

Today, many companies are at level 2, experimenting with solutions. However, they often do not realize that to advance in AI maturity, they must also progress on another dimension, the Gartner Data Maturity Model:

In this model, many companies are also at the second level (opportunistic), seeking to formalize data requirements and encourage the use of data.

However, for an efficient GenAI strategy, it is crucial to reach the differentiation level, with a specialized data department and data influencing all aspects of the organization. A Data Lake strategy that involves the entire organization helps secure the volume and quality of data needed to develop GenAI solutions.

Where to start

To build this data strategy, you can start by answering three guiding questions:

  1. What questions do I want to answer about my business?
  2. What questions do I need to answer about my customers?
  3. What questions do my customers need to answer about their businesses?


It is important to take the time to reflect until you reach relevant and meaningful answers that align with your needs. With this information in hand, you can identify the most relevant data, determine its location, and understand how to extract and enrich it to consolidate a Data Lake. This initial step already creates value and enables the development of new products and services for the end customer.
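Once a relevant data source has been identified, a basic quality profile can help decide whether it is ready to be consolidated into the Data Lake. The following is a minimal sketch in Python, not a prescribed method: the function name `quality_report`, the sample `customers` records, and the field names are all hypothetical and exist only for illustration. It measures how complete each required field is and flags duplicated keys.

```python
from collections import Counter


def quality_report(records, required_fields, key_field):
    """Summarize field completeness and duplicate keys for a list of dicts.

    Illustrative helper only; names and thresholds are assumptions.
    """
    total = len(records)
    missing = Counter()
    for rec in records:
        for field in required_fields:
            value = rec.get(field)
            # Treat None and blank strings as missing values.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1

    # Count how often each key appears to detect duplicated records.
    key_counts = Counter(rec.get(key_field) for rec in records)
    duplicates = sorted(
        key for key, count in key_counts.items() if key is not None and count > 1
    )

    # Share of records with a usable value, per required field.
    completeness = {
        field: (1 - missing[field] / total) if total else 0.0
        for field in required_fields
    }
    return {
        "total": total,
        "completeness": completeness,
        "duplicate_keys": duplicates,
    }


# Hypothetical sample data standing in for an extracted customer table.
customers = [
    {"customer_id": 1, "email": "ana@example.com", "segment": "retail"},
    {"customer_id": 2, "email": "", "segment": "retail"},
    {"customer_id": 2, "email": "bo@example.com", "segment": None},
    {"customer_id": 3, "email": "cy@example.com", "segment": "b2b"},
]

report = quality_report(customers, ["email", "segment"], "customer_id")
print(report)
```

A report like this can be run per source before ingestion, so that gaps and duplicates are visible early rather than surfacing later in a GenAI solution built on top of the data.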

Over time, you can add one more question: What type of content can be generated automatically to accelerate my business or my customers’ businesses?

Answering these questions is crucial for advancing your data strategy and enables a faster return on investment.

Conclusion

Data quality is essential for generative AI because the performance and reliability of AI models heavily depend on the data they are trained on. High-quality data ensures that the AI can produce accurate and relevant outputs, thereby maximizing its potential and utility. Poor data quality can lead to incorrect, biased, or incomplete results, undermining the effectiveness of AI applications. By ensuring access to reliable and well-structured data, companies can harness the full potential of generative AI to create remarkable products, enhance customer experiences, and improve business operations.

Therefore, investing in data maturity is not just a strategic choice but an urgent necessity for any company that wants to remain competitive and innovative in today’s market. If you want to know more about how e-Core can help your business advance on its data maturity journey, reach out through our contact form.

Filipe Barretto is the AWS Global Practice Leader at e-Core. He is responsible for leading AWS consulting services and guiding a team of Practice Leaders (PL) and Solution Architects (SA).