Data Quality in Artificial Intelligence and Machine Learning

The value of any raw material is only realised once it is refined: with oil, it is the energy produced; with data, it is the insights extracted. The data we collect from our systems is extremely valuable. However, if that data cannot be refined into a curated, structured set, it has a detrimental downstream impact, especially on the artificial intelligence (AI) and machine learning (ML) processes that are becoming increasingly popular in the financial services industry.

 

In this article, we review the elements of data quality that produce robust data sets for AI and ML applications, and how deficiencies lead to data cascades that can have disastrous outcomes. We also unpack one of Google’s latest research papers on the topic, “Data Cascades in High-Stakes AI”, and look to the future of the financial services industry’s application of AI and ML.

Data quality 101

 

To “refine” raw data for downstream processes, the quality of the data needs to meet an acceptable standard. Data quality describes how usable, or fit for purpose, the data is. The five characteristics used to define data quality are:

 

  1. Accuracy

  2. Completeness

  3. Reliability

  4. Relevance

  5. Timeliness

Data accuracy refers to whether the data stored for a particular object is correct and, in addition, whether the form of that data is consistent and unambiguous. Dates of birth are a good example: 14 January 1994, 14/01/1994, 01/14/1994 and 14/01/94 all indicate the same date in different conventions. In a large database where each record follows a different format, no reliable analysis or automation can be performed on the field until it is standardised. Left as is, the data is effectively unusable, and any AI or ML model built on it will produce inaccurate results that should not be relied on.
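
As a brief illustration of the standardisation step this implies, the sketch below parses a date-of-birth field arriving in a handful of known formats into a single ISO format. The column name, the format list and the use of pandas are illustrative assumptions, and purely numeric day/month orderings cannot always be disambiguated automatically, so a day-first convention is assumed here.

    from datetime import datetime
    import pandas as pd

    # Illustrative formats observed in the raw records (day-first assumed for numeric dates).
    KNOWN_FORMATS = ["%d %B %Y", "%d/%m/%Y", "%d/%m/%y"]

    def normalise_dob(raw):
        """Return the date in ISO 8601 (YYYY-MM-DD), or None if no known format matches."""
        for fmt in KNOWN_FORMATS:
            try:
                return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        return None  # flag for manual review rather than guessing

    df = pd.DataFrame({"date_of_birth": ["14 January 1994", "14/01/1994", "14/01/94"]})
    df["date_of_birth_iso"] = df["date_of_birth"].map(normalise_dob)
    print(df)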

 

Completeness refers to the wholeness or comprehensiveness of the data: for a data set to be truly complete, there should be no gaps or missing information. Incomplete data should not be used; when this is overlooked and analysis or business decisions are based on incomplete data sets, the result is costly mistakes and false conclusions. Several steps can be taken to ensure that collected data is complete, such as marking all critical information fields as required, and establishing a data quality team and framework that applies proper data profiling techniques to identify missing values and errors in the data.
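
As a sketch of the profiling step described above, the example below measures missing values against a required-field list. The field names, the required-field list and the use of pandas are illustrative assumptions rather than a prescribed approach.

    import pandas as pd

    # Hypothetical customer extract with gaps in critical fields.
    df = pd.DataFrame({
        "customer_id": [101, 102, 103, 104],
        "date_of_birth": ["1994-01-14", None, "1988-06-02", None],
        "country": ["GB", "GB", None, "ZA"],
    })

    # Fields the business has marked as required.
    REQUIRED_FIELDS = ["customer_id", "date_of_birth", "country"]

    # Share of missing values per required field.
    print(df[REQUIRED_FIELDS].isna().mean().sort_values(ascending=False))

    # Records failing the completeness rule; quarantine these rather than model on them.
    incomplete = df[df[REQUIRED_FIELDS].isna().any(axis=1)]
    print(f"{len(incomplete)} of {len(df)} records are incomplete")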

 

System data collected through processes or instruments needs to be dependable. Reliability means that if the same data is collected through multiple systems or processes, the resulting data sets must not contradict one another. A stable, steady mechanism is needed to ensure that data is collected consistently and that there is no unexplained variance in what is collected. If multiple sources provide different versions of the same data, the automation and decisions that rely on it can lead to costly mistakes.
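
A minimal sketch of a cross-source reconciliation check of this kind follows, assuming the same account balances are extracted from two systems. The system names, field names and tolerance are illustrative assumptions.

    import pandas as pd

    # The same accounts as reported by two upstream systems.
    core_banking = pd.DataFrame({"account_id": [1, 2, 3], "balance": [100.0, 250.0, 75.0]})
    data_warehouse = pd.DataFrame({"account_id": [1, 2, 3], "balance": [100.0, 249.0, 75.0]})

    merged = core_banking.merge(data_warehouse, on="account_id", suffixes=("_core", "_dw"))
    merged["difference"] = (merged["balance_core"] - merged["balance_dw"]).abs()

    TOLERANCE = 0.01  # illustrative materiality threshold
    breaks = merged[merged["difference"] > TOLERANCE]
    print(breaks)  # contradictory records to resolve before either source feeds a model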

 

Relevance is an important data quality principle because collecting unnecessary data wastes resources and reduces the efficiency of the AI and ML models built to support important business decisions and automate operational processes. Storing data that does not assist the business also carries additional storage costs. It is important to consult domain experts on each data-driven project to establish which data points are critical and which should be removed or reduced.
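
One lightweight way to act on such a review is sketched below, assuming domain experts have signed off a list of critical fields. The field names and approved list are purely illustrative.

    import pandas as pd

    # Fields agreed with domain experts as relevant to the model's purpose.
    APPROVED_FIELDS = ["customer_id", "income", "tenure_months", "product_holding"]

    # Raw extract containing fields that are collected but never used downstream.
    raw = pd.DataFrame(columns=[
        "customer_id", "income", "tenure_months", "product_holding",
        "favourite_colour", "legacy_flag_1997",
    ])

    # Keep only approved fields, and log what is dropped so the decision is auditable.
    dropped = [col for col in raw.columns if col not in APPROVED_FIELDS]
    curated = raw[APPROVED_FIELDS]
    print("Dropped irrelevant fields:", dropped)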

 

Timeliness refers to whether the data is available when the business needs it. If data cannot be retrieved in a timely manner, then the information that could be derived from it, whether through AI and ML tools or any other analysis process, arrives too late to be useful and becomes an unnecessary cost to the business. Likewise, if data is needed to inform an urgent business decision and cannot be retrieved promptly, the business cannot use data-driven insights to inform that decision.
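
A sketch of a simple freshness check that could sit in front of a reporting or model-scoring job is shown below, assuming each record carries a load timestamp. The threshold and field names are illustrative assumptions.

    from datetime import datetime, timedelta, timezone
    import pandas as pd

    MAX_AGE = timedelta(hours=24)  # illustrative freshness requirement

    df = pd.DataFrame({
        "record_id": [1, 2],
        "loaded_at": pd.to_datetime(["2024-05-01 06:00", "2024-05-02 06:00"], utc=True),
    })

    # Age of the most recent load; stale inputs should block the downstream job.
    age_of_latest = datetime.now(timezone.utc) - df["loaded_at"].max()
    if age_of_latest > MAX_AGE:
        raise RuntimeError(f"Data is {age_of_latest} old and exceeds the freshness requirement")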

 

With the above in mind, there is no doubt that good data quality is crucial to conducting sound business analysis and to deriving the most value from AI and ML models fed with large data sets. Poor data quality is a major point of failure in these models.

 

Modern data systems need to enforce good data quality on both their inputs and their outputs. Poor data fed into models produces inaccurate results, which lead to poor business decisions and avoidable costs. The validity and reliability of the results produced by AI and ML models will also be called into question. The erosion of trust caused by poor data quality can ultimately shift the organisation’s culture away from innovation and from adopting a data-driven strategy for business operations.

Review of data cascades in high-stakes AI

 

“Everyone wants to do the model work, not the data work” - Data Cascades in High-Stakes AI [1]

 

The study, conducted by a group of Google researchers, investigates the impact that poor data quality has on artificial intelligence models, particularly in high-stakes environments such as cancer detection, loan granting and the prevention of wildlife poaching. In these environments, the downstream impact of poor data quality has far-reaching consequences and can be extremely costly if not rectified. The investigation also highlights a biased reward system in organisations: the end state of the analysis process, namely the development of AI and ML models and the results and capabilities they generate, is rewarded, while the work required to ensure that high-quality data is fed into these models is not. The research was conducted across India, the USA and the East and West African regions, and the results show that when data excellence is prioritised and sufficient investment is made in data quality, safer and more robust systems can be developed and put into production.

 

Data cascades are defined in the study as compounding events, triggered by data issues, that cause negative downstream effects and result in technical debt over time. Technical debt refers to the cost of keeping systems running smoothly and scaling efficiently; as it accumulates, the extra effort required to add new features sharply increases the operating cost of these models. Data cascades have intensifying impacts that can take years to manifest, and limited domain knowledge on the part of the modellers may allow them to go undetected, letting false assumptions continue to skew results. The investigation highlights how data cascades in these high-stakes environments have harmed the very beneficiaries the AI models were designed to assist.

 

The paper goes on to highlight specific triggers of data cascades that can cause a project to be abandoned or restarted:

 

  • Influence of the physical world’s complexity when deploying AI/ML systems

  • Inadequate application domain expertise to interrogate validity of results

  • Conflicting reward systems across collaborators causing misaligned priorities

  • Poor cross-organisational documentation to enforce a consistent understanding

In the healthcare sector, for example, technical errors filtered into AI models when the hardware used to collect patient data was not serviced every twelve months. In other cases, spurious events led to complete AI system failures: “Suppose an image is out of focus, there is a drop of oil, or a drop of water on the image, appearing blurry or diffused. A model which is looking at this can easily get confused that an out-of-focus image is cancer.” Further cascades appeared when inaccurate data was deeply embedded in systems. In insurance claims, for instance, if the historical decisions insurers took to accept or deny claims were themselves flawed, a model trained on them becomes skewed and inaccurate, and there is sometimes no way to go back through the historical data to correct these mistakes.

 

The impacts associated with these cascades often only became apparent after the final AI and ML models had been built. Client feedback and system performance issues then trigger long-winded diagnoses and, ultimately, costly modifications such as collecting further data and adding new data sources. This diverts data scientists from producing insights into monitoring for data quality issues and devising tactical corrective actions. Ensuring that the initial data sets are of good quality lowers the cost of running the system and, in turn, the cost of adding new features in the future.

The persistence of bias

 

Bias in artificial intelligence occurs when a model produces systematically prejudiced results due to erroneous assumptions in the machine learning process. For example, in 2014 Amazon created an AI tool to review résumés, with the goal of automating the company’s recruitment process. The tool taught itself that male candidates were preferable and penalised applications that contained the word “women’s”. Men had historically been hired into software developer roles at a far higher rate than women, reflecting the existing gender split in the industry; the model treated this split as correct, and female applications were downgraded and rejected as a result.

 

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is another example of systematic bias in AI and ML models. The COMPAS model was designed for use in the US court system to predict the likelihood that a convicted criminal would reoffend. It produced roughly twice as many false positives for people of colour (45%) as for white offenders (23%) [2].
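
The disparity described above is a gap in false positive rates between groups, which is straightforward to measure once a model’s predictions are compared with observed outcomes. The sketch below shows the calculation on entirely made-up data; it does not use the COMPAS data set.

    import pandas as pd

    # Illustrative outputs: 1 = predicted to reoffend, alongside the observed outcome.
    results = pd.DataFrame({
        "group":     ["A", "A", "A", "A", "B", "B", "B", "B"],
        "predicted": [1,   1,   0,   0,   1,   0,   0,   0],
        "actual":    [0,   1,   0,   1,   0,   0,   1,   0],
    })

    def false_positive_rate(g):
        """Share of actual negatives that the model flagged as positive."""
        negatives = g[g["actual"] == 0]
        return (negatives["predicted"] == 1).mean() if len(negatives) else float("nan")

    # A large gap between groups is the kind of bias described above.
    print(results.groupby("group")[["predicted", "actual"]].apply(false_positive_rate))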

 

There are several types of data bias that occur in models; a short sketch showing how the first of these can be detected follows the list:

 

  • Sample bias: this occurs when the sample used to train the model is too small or is not representative of the population to which the model will be applied.

  • Prejudice bias: this occurs when the data used to train the system reflects existing prejudices, stereotypes and faulty societal assumptions, as in the recruitment example above, where men were favoured for software developer roles over women.

  • Exclusion bias: this occurs when, without a sufficient understanding of the data being analysed, certain data points are excluded in the erroneous belief that they are not relevant when they are in fact significant.

  • Measurement bias: this occurs when the data collected to train the AI and ML models is measured or recorded in a way that does not reflect the real world in which the model will operate.
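
As a short illustration of the first item above, the sketch below compares group representation in a training sample against the population the model is meant to serve. The group names and figures are purely illustrative assumptions.

    import pandas as pd

    # Illustrative share of each group in the population the model will be applied to.
    population_share = pd.Series({"group_a": 0.5, "group_b": 0.5}, name="population")

    # Group membership of the records actually available for training.
    training = pd.Series(["group_a"] * 90 + ["group_b"] * 10)
    training_share = training.value_counts(normalize=True).rename("training")

    # Large gaps indicate the sample is not representative of the target population.
    comparison = pd.concat([population_share, training_share], axis=1)
    comparison["gap"] = (comparison["training"] - comparison["population"]).abs()
    print(comparison.sort_values("gap", ascending=False))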

Data quality in banking AI & ML

 

“To remain competitive, traditional banks must be able to make intelligent decisions on how best to serve their customers – and the crux of intelligent decision-making is quality data” [3]. New entrants to the industry carry no reliance on older business models and no dependency on legacy systems embedded in their operating models; for traditional banks to compete with them, embracing AI and ML is imperative. Banks must also adhere to stringent regulations and provide detailed reporting on their risk exposure and operating stability, all of which relies on data that is easily accessible and trusted.

 

Analytical models can provide a competitive edge in the financial industry by allowing banks to offer a tailored service, since the organisation gains access to in-depth information about the market and its customers. However, if the information extracted and presented to decision makers is built on poor data, that market view can be grossly inaccurate and harm the organisation, leading to costly investments that yield little to no return. The same applies to the reporting provided to regulatory bodies, such as regulatory capital disclosures and other required prudential reports: should automated models produce these reports from data that is not accurate, complete, relevant, reliable and timely, the organisation can face substantial penalties imposed by the financial regulator and suffer reputational damage that adversely affects all stakeholders.

 

Ultimately, poor data quality is costly to organisations, increasing the operational cost, in the form of technical debt, of troubleshooting issues that arise when AI and ML models are built on poor data. The 1-10-100 rule, well known and understood in the data modelling environment, refers to the hidden costs and waste associated with bad data: if capturing a record correctly costs $1, correcting an error in it costs $10, and resolving the downstream issues caused by the incorrect information costs $100 [4]. To fully realise the investment in digital platforms, data quality needs to be incorporated into the solution design, and an appropriate investment in producing and capturing good quality data must be prioritised.
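
As a rough, purely illustrative reading of the rule, the sketch below applies the three cost tiers to a hypothetical portfolio of one million records with an assumed 2% error rate; all figures are assumptions, not benchmarks.

    # Illustrative application of the 1-10-100 rule; all figures are assumptions.
    RECORDS = 1_000_000
    ERROR_RATE = 0.02                      # assumed share of records captured incorrectly
    bad_records = int(RECORDS * ERROR_RATE)

    COST_TO_PREVENT = 1    # $ per record to capture correctly at source
    COST_TO_CORRECT = 10   # $ per record to clean up after entry
    COST_OF_FAILURE = 100  # $ per record if bad data reaches downstream decisions

    print(f"Prevent at source:    ${bad_records * COST_TO_PREVENT:,}")
    print(f"Correct after entry:  ${bad_records * COST_TO_CORRECT:,}")
    print(f"Remediate downstream: ${bad_records * COST_OF_FAILURE:,}")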

 

“Quality data underpins agility and the ability to make the fast, accurate decisions essential for moving forward in a digital world” [5].

 

Reward structures in the software engineering environment are biased towards producing and productionising AI and ML models, with little to no recognition given to the work of ensuring that good quality data is captured and stored. A cultural change is required in the industry, one that rewards data excellence in line with the production of AI and ML models, so that organisations can derive the maximum return on investment from their digital transformation journeys.

How Monocle can assist

 

Monocle is a technical leader in data management consulting, with extensive experience in BCBS 239, risk aggregation and reporting, and the principles that ensure robust and trustworthy data quality. For over 20 years, we have worked throughout the financial services industry in South Africa, the United Kingdom and Europe, enabling our clients to successfully implement advanced analytics and machine learning capabilities on top of reliable and robust data architectures. We are therefore well positioned to assist our clients with the data quality and clean-up processes that add value to their new or existing AI and ML capabilities.

[1] Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P. and Aroyo, L., 2021. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. [online] Google Research. Available at: <https://research.google/pubs/pub49953/>

 

[2] Vashisht, R., 2021. How to reduce machine learning bias. [online] Medium. Available at: <https://medium.com/atoti/how-to-reduce-machine-learning-bias-eb24923dd18e>

 

[3] Allemann, G., 2020. Data quality is crucial for banks. [online] Fintech News. Available at: <https://www.fintechnews.org/data-quality-is-crucial-for-banks/>

 

[4] Ibid

 

[5] Ibid
