
Data Reliability & Validity

Plots related to vehicle affordability and risk of loss. These figures show loan balance and amortizing purchase ‘basis’ against the expected recovery value of an automobile throughout its expected life.

Most people who work with data soon realize that data reliability and data validity are two of the most fundamental problems to be faced early in the data acquisition and analysis process.

Plots related to vehicle affordability with various forward-looking estimates of ‘Total Ownership Cost’ as a fraction of borrower income. The time dimension involves assumptions about inflation and expected maintenance costs related to the vehicle in question.

Data Reliability

Reliability of data has to do with the consistency of the measurement and data collection process. If data are collected at multiple times and in multiple situations and yield consistent, reproducible results, then the measure is considered reliable. An unreliable measure is like an elastic ruler – it gives a reading that is likely to be different each time the measurement is taken. A good example of this is ‘self-reported’ income. Often a borrower will state, or ‘estimate’, on a loan application that they earn more (or sometimes less) than they really earn – so ‘self-reported’ income is not a very reliable measure of real, verifiable income.
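One rough way to put a number on the ‘elastic ruler’ idea is to compare the spread of repeated measurements of the same quantity, and to compare a stated figure against a verified one. The sketch below is only an illustration under stated assumptions: the function names, the sample readings, and the income figures are all hypothetical, and the coefficient of variation is just one of several possible consistency measures.

```python
from statistics import mean, stdev


def coefficient_of_variation(measurements):
    """Sample standard deviation divided by the mean.

    Smaller values mean the repeated measurements agree more closely.
    """
    return stdev(measurements) / mean(measurements)


def relative_discrepancy(stated, verified):
    """Signed gap between a stated value and a verified value,
    expressed as a fraction of the verified value."""
    return (stated - verified) / verified


# Hypothetical repeated readings of the same 100 cm object.
rigid_ruler = [100.0, 100.1, 99.9, 100.0]
elastic_ruler = [100.0, 92.0, 109.0, 97.0]

print("rigid ruler CV:  ", round(coefficient_of_variation(rigid_ruler), 4))
print("elastic ruler CV:", round(coefficient_of_variation(elastic_ruler), 4))

# Hypothetical borrower: states 60,000 on the application, verified at 50,000.
print("income overstatement:", relative_discrepancy(60_000, 50_000))
```

A measure whose readings scatter widely, or whose stated values drift far from verified ones, would be a candidate for closer scrutiny before it is used in any analysis.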

Data Validity

The validity of a measurement has to do with whether the measure taken truly relates to what one thinks is being measured. Often an assumption or some theory colors how a measurement is applied. In auto-finance, for example, it is common to calculate the ratio of the monthly ‘auto-loan payment’ to a person’s monthly income – and to presume that this ratio measures the ‘affordability’ of a vehicle for that person. The theory is that a person is less likely to stop making monthly loan payments if the vehicle is more ‘affordable’ given that person’s income.

One can see that the assumed relationship might be incorrect. Vehicles differ in their average monthly maintenance costs, some have irregular, large costs – like replacing batteries in a hybrid vehicle – and some have very different fuel efficiencies (which makes a big difference when fuel prices vary widely). So, a finance company might believe that it is measuring the ‘affordability’ of a vehicle by simply looking at the monthly loan payment – but it is easy to see that this is only one component of affordability. The theory of what constitutes affordability affects whether the measure is ‘valid’ or not.
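The gap between the two notions of affordability can be made concrete with a little arithmetic. The sketch below compares the simple payment-to-income ratio with a broader ratio that also counts maintenance, fuel, and an irregular cost spread evenly over the months between occurrences. Everything here is an assumption made for illustration: the function names, the cost components chosen, the even spreading of the irregular cost, and all of the dollar figures. It is not a description of how any particular lender calculates affordability.

```python
def payment_to_income(monthly_payment, monthly_income):
    """The narrow 'affordability' ratio: loan payment over income."""
    return monthly_payment / monthly_income


def ownership_cost_to_income(
    monthly_payment,
    monthly_income,
    monthly_maintenance,
    miles_per_month,
    miles_per_gallon,
    fuel_price_per_gallon,
    irregular_cost=0.0,
    months_between_irregular_costs=1,
):
    """A broader ratio: payment plus maintenance, fuel, and an irregular
    cost (spread evenly over its interval), all divided by income."""
    monthly_fuel = miles_per_month / miles_per_gallon * fuel_price_per_gallon
    monthly_irregular = irregular_cost / months_between_irregular_costs
    total = monthly_payment + monthly_maintenance + monthly_fuel + monthly_irregular
    return total / monthly_income


# Two hypothetical vehicles with the same loan payment and borrower income.
income, payment = 4000.0, 400.0

hybrid_ratio = ownership_cost_to_income(
    payment, income,
    monthly_maintenance=50.0,
    miles_per_month=1000.0, miles_per_gallon=40.0, fuel_price_per_gallon=4.0,
    irregular_cost=3000.0, months_between_irregular_costs=96,  # battery pack
)
truck_ratio = ownership_cost_to_income(
    payment, income,
    monthly_maintenance=120.0,
    miles_per_month=1000.0, miles_per_gallon=16.0, fuel_price_per_gallon=4.0,
)

print("payment-to-income (both):", payment_to_income(payment, income))
print("ownership cost, hybrid:  ", round(hybrid_ratio, 4))
print("ownership cost, truck:   ", round(truck_ratio, 4))
```

Under these made-up numbers both vehicles look identical on the payment-to-income ratio, yet the broader ratio separates them noticeably, which is the validity concern in miniature: the narrow ratio measures the loan payment burden, not affordability as a whole.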

With respect to the usefulness of data, then, it is important to understand the reliability of each data element – as well as to understand the validity of each measure, with respect to the theory of what is believed to be assessed from those data.

In more complex data analytics – such as the creation of financial models that are dependent upon both data and modeling assumptions – it is even more important to understand and assess the totality of the reliability and validity of the data collection and modeling process. Wonderfully complex and sophisticated data collection and modeling processes have been shown to be woefully inadequate at very critical times. This area of model risk management is one of the ‘cutting edge’ concerns of modern business and financial institutions.