Deciphering Our Own Data
Never before has so much data been available to so many. Yet integrating that data and building useful analytical models from it remains an exceptionally difficult process. To fully understand data, its quality, and its enormous potential, a few data myths should be addressed -- and punctured.
Organizations are often oblivious to the location, size, and quality of their data. The problems to be solved here are visibility and quality; without metrics for either, organizations are stuck.
The problem has two roots: 1) organizations typically don't want to pay the price, in money or time, to get organized, and 2) once organized, managers and priority-mongers shortcut their way to savings, avoiding the full cost of data ownership by skipping proper calibration, movement, migration, and maintenance of databases.
All of this leads to what has been coined Data Quality. Though it may sound like a narrow label, the term covers the broad realm of the value of data, the investment in and around that data, and its overall usability.
As Head of Data Quality Management for the French Ministry of Economy and Finance, Muriel Foulonneau understands full well the impact of poor data quality, as she conveyed in her keynote address at the ENDORSE 2023 Conference on Reference Data and Semantics:
“A data quality framework is something very useful, because it gives good guidelines, and it’s usually built from the experience of how much data quality has impacted a certain number of services or usage contexts.”
A data model will only work, and will only be worth the analyst's time to build, run, and interpret, when high-quality data is used. Without it, models are worthless: their results can't be trusted or considered valid for decision-making. Whatever the application, data quality should be front of mind whenever we use data to drive business decisions, conduct research, or develop insights, because it forms the foundation on which reliable conclusions and actions are built.
Defining high-quality data
According to CloudMoyo, a leading cloud computing and AI innovation firm, “Data must be quantifiable, historical, uniform and categorical. It should be held at the lowest-level granularity, be clean, accurate, and complete, and displayed in business terminology, etc.” These characteristics can be the difference between accurate and useless results, and can help you identify where your data needs improving.
Executing a data quality strategy isn't as simple as purchasing a powerful data integration (DI) tool from Alteryx, Informatica, Talend, Xplenty, or Hitachi Vantara (to name just a few options), then connecting it to the company's enterprise data warehouse. Buying and running DI software is the easy part. Organizations need to work together to identify, review, clean, and share data with the goal of continuous improvement.
The 21st century might well be remembered as the Century of Data, with more of it captured and utilized than ever before. The sheer volume would be overwhelming if there weren't tools available to capture, cleanse, store, and model it. Even with those tools, integrating data and building useful analytical models is an exceptionally difficult task.
To fully understand the gravity and implications of the data quality task at hand, a few long-standing data myths must be addressed...and debunked.
Myth 1: Data governance & data quality are unrelated
Data governance ensures that data quality is maintained by establishing policies, procedures, and responsibilities for managing and safeguarding data assets effectively. Cloud data integration leader Talend believes data governance is a requirement in today’s fast-moving and highly competitive enterprise environment.
Data control and protection must be balanced with access, enablement, and crowdsourced insights. Since organizations can now capture massive amounts of diverse internal and external data, discipline is needed to maximize the data's value while minimizing privacy issues and risks. Modern data governance requires an agile, bottom-up approach that minimizes data risk while maximizing data usage. Raw data must be linked to business context so that it becomes meaningful, and organizations in turn must take full responsibility for data quality and security.
Myth 2: Data quality is an IT problem
Every user should be instilled with the idea that data quality is a company-wide endeavor, not something left to IT. A chain is only as strong as its weakest link, especially when that weakest link can accidentally bring malware, ransomware, or a data breach upon an organization with the simple click of an email attachment.
Often, the people who know a company's data sources best are not the IT experts, but the sales representatives, customer service reps, IT techs, and field marketing managers who use those sources on a daily basis. These users also know firsthand how data quality issues affect them, so their motivation to keep quality high is second to none.
Additional reasons why data quality isn’t just an IT issue include:
- Decision-Making Impact: Poor data quality can lead to inaccurate insights and decisions across various departments.
- Customer Satisfaction: Incorrect customer data can result in poor service delivery and dissatisfaction.
- Legal Compliance: Data inaccuracies can lead to regulatory violations and legal repercussions, affecting the entire organization, not just IT.
- Financial Implications: Inaccurate financial data can result in misreporting, financial losses, and damaged stakeholder trust.
- Operational Efficiency: Data inaccuracies can hinder operational processes and workflow efficiency.
- Reputation Management: Data errors can tarnish the organization's reputation, impacting relationships with stakeholders, partners, and the public.
- Cross-Functional Collaboration: Ensuring data quality requires collaboration among various departments, highlighting its significance beyond IT silos.
- Strategic Planning: Reliable data is essential for strategic decision-making and long-term planning across all business functions.
- Innovation: Data quality enables accurate analysis and insights necessary for innovation and staying competitive in the market.
Of course, most users will never become data quality experts, but they can be given smart tools that overcome the technical complexity. Many DI vendors offer data preparation tools with powerful yet simple analysis capabilities, letting users intuitively explore data sets and assess their quality with the help of indicators, trends, and patterns, along the lines of the sketch below.
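As a rough illustration of what such indicators look like under the hood, here is a minimal Python sketch using pandas; the customer table and its columns are invented for illustration, and real data preparation tools wrap this kind of logic in a visual interface:

```python
import pandas as pd

def quality_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column indicators a data preparation tool might surface."""
    return pd.DataFrame({
        "missing_pct": (df.isna().mean() * 100).round(1),  # completeness
        "distinct_count": df.nunique(),                    # cardinality
    })

# Hypothetical customer table with a duplicated row and missing values
customers = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", None, "b@y.com"],
    "country": ["US", "US", "FR", None],
})
print(quality_indicators(customers))
print(f"duplicate rows: {customers.duplicated().mean():.0%}")  # uniqueness
```

Even two or three indicators like these can quickly reveal which columns need attention before any modeling begins.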
Myth 3: Custom code easily fixes data quality issues
On the surface, data quality might appear to be an easy problem to solve using structured query language (SQL) tools: just write a few lines of custom scripts to profile the data, then a few more to sort it all out, and voilà, the data will be clean. But as Steve Sarsfield warns in his article, Six Myths About Data Quality, it can be a bit more complicated than that:
“The process of writing a custom data quality solution can incur additional costs if the project leader decides to leave the company, since much of the process and knowledge of the code may reside inside their head.”
In other words, enterprises that choose a band-aid approach might require expensive resources to code, debug, and revise any custom-built solutions. In the long run, data quality workarounds usually end up creating more work than they alleviate.
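For context, the band-aid approach often starts innocently. The sketch below, in Python with SQLite, shows a hypothetical one-off cleanup script (the crm.db database, customers table, and email rule are all invented); its hard-coded logic is exactly the kind of undocumented knowledge that walks out the door with its author:

```python
import sqlite3

# Connect to a hypothetical CRM database
conn = sqlite3.connect("crm.db")
cur = conn.cursor()

# Profiling step: count rows with a missing or malformed email
cur.execute("""
    SELECT COUNT(*) FROM customers
    WHERE email IS NULL OR email NOT LIKE '%_@_%._%'
""")
print("suspect emails:", cur.fetchone()[0])

# "Fix" step: a hard-coded, irreversible rule with no audit trail
cur.execute("UPDATE customers SET email = NULL WHERE email NOT LIKE '%_@_%._%'")
conn.commit()
conn.close()
```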
Myth 4: Data quality is a standalone problem
According to Talend, “Data quality is the process of conditioning data to meet the specific needs of business users.” Keeping data quality high is not a standalone problem. Many data governance initiatives fail because they aren’t a part of a wider structure or system. Modern data governance controls should be embedded in the data chain so that they can be operationalized and are impossible to circumvent.
For the same reason, data governance needs to be part of an ongoing IT process, with data cataloging and data profiling taking prominent roles. A data catalog makes data easier to find, letting users work with it in more meaningful and relevant ways. Data profiling, the process of discovering in-depth and granular details about a dataset, helps businesses assess data sources against the 13 dimensions of data quality.
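To make profiling concrete, here is a hedged sketch that scores a dataset on three commonly cited dimensions: completeness, uniqueness, and validity. The orders table, key column, and zip-code rule are invented for illustration:

```python
import re
import pandas as pd

def profile(df: pd.DataFrame, key: str, validators: dict) -> dict:
    """Score a dataset on three common data quality dimensions."""
    validity = {
        col: df[col].dropna().map(lambda v: bool(rule(v))).mean()
        for col, rule in validators.items()
    }
    return {
        "completeness": 1 - df.isna().mean().mean(),  # share of non-null cells
        "uniqueness": df[key].nunique() / len(df),    # distinct keys per row
        "validity": validity,                         # per-column rule pass rate
    }

# Hypothetical orders table with a duplicate key and a bad zip code
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "zip": ["75001", "ABCDE", "31000", None],
})
rules = {"zip": lambda v: re.fullmatch(r"\d{5}", str(v)) is not None}
print(profile(orders, key="order_id", validators=rules))
```

A real profiler would cover many more dimensions and rules, but the principle, measure first, then fix, is the same.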
The impact of AI on data quality
The advent of artificial intelligence has taken the importance of data quality to new heights. AI relies on Big Data for predictions, responses, and real-time decision-making, so the quality of data sources, from modeling through deployment and beyond, must be above reproach to avoid bias, errors, and misinformation. At the same time, AI and machine learning (ML) are invaluable tools for evaluating, organizing, and cleaning data on a grand scale.
Vikram Chatterji of the Forbes Technology Council has referred to data quality as the real bottleneck in AI adoption. This sentiment is justifiable, since many of the models being developed or currently in use are being called upon to make critical medical, societal, or legal decisions. For example, AI models can be used to identify criminals or detect early warning signs of cancer. While an 80-90% data accuracy rate might suffice for other applications, it could be catastrophic for many of these cutting-edge AI use cases.
At the same time, AI techniques such as machine learning algorithms, natural language processing (NLP), and computer vision can automate various data cleaning tasks such as deduplication, outlier detection, missing value imputation, and standardization. AI-powered data cleaning tools can significantly improve efficiency and accuracy in preparing data for analysis or machine learning models.
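As a rough sketch of what that automation can look like, the Python example below chains those four steps with pandas and scikit-learn; the transaction data is invented, and a production pipeline would tune and validate each step:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Invented numeric data with a duplicate, a gap, and an extreme value
df = pd.DataFrame({
    "amount": [10.0, 10.0, 12.5, np.nan, 11.8, 950.0],
    "items":  [1, 1, 2, 2, 1, 3],
})

df = df.drop_duplicates()                               # deduplication

X = SimpleImputer(strategy="median").fit_transform(df)  # missing value imputation

# Outlier detection: fit_predict returns -1 for predicted outliers
mask = IsolationForest(random_state=0).fit_predict(X) == 1
X = X[mask]

X = StandardScaler().fit_transform(X)                   # standardization
print(X)
```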
Improved data quality is within our reach
As data becomes increasingly central to business operations, it's important to dispel the myths surrounding data quality. Muriel Foulonneau's insights underscore the value of robust data quality frameworks: accurate, reliable data forms the bedrock of any successful endeavor, from decision-making to innovation. Contrary to common misconceptions, data quality isn't just an IT problem but a company-wide responsibility, with far-reaching implications for customer satisfaction, legal compliance, and strategic planning. And while AI offers promising help in cleaning and analyzing vast datasets, data quality remains, at its heart, a human endeavor, one that requires a collaborative effort to ensure data integrity and maximize its value.