Multi-variable data sets compound even minor errors into huge inaccuracies, especially data sets that are meant to predict human behavior.

The problem starts in defining data sets, which is incredibly difficult for any non-binary variable. The problems compound in the collection of the data. The errors compound exponentially in applying the data.

Finally, the problem is capped off by human bias in applying the results.

Here is a tiny real-world example which will show what I’m talking about:

A local U.S. Probation office wanted to track efforts to get probationers employed. They looked at the existing data, which was collected in a binary fashion: employed or unemployed.

The officers were the people collecting the data. They used the data to keep track of who they needed to focus their efforts on.

In the real world, many probationers had some employment, but it was inconsistent, or over-stated by the probationer, or unverified, or seasonal, etc. For example, one guy might be self-employed as a “concert promoter/D.J.” Some months he made good money, but mostly he was “working” by doing unpaid stuff. In that case, an officer might be prompted to mark the guy as unemployed, maybe as a way to remind himself to check on the probationer’s sketchy and unreliable employment.

Now, in this example, the probation office decided to evaluate its officers by looking at the “data” on the percentage of probationers on their caseload who were unemployed.

After that change was made, in one month’s time, the employment percentage for that office went from 35% to 80%. The administrators for that office didn’t question this astounding “data”, and reported this “data” to Congress, which promptly rewarded those administrators for such a remarkable turnaround.
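To make the mechanism concrete, here is a minimal sketch in Python. The caseload shares below are hypothetical numbers I picked only so the totals match the 35% and 80% in the story; the real breakdown is not known. The point is that nobody's actual employment changes, only the rule for coding the ambiguous cases.

```python
# Hypothetical caseload shares (out of 100 probationers), chosen only to
# reproduce the 35% and 80% figures from the story. Not real data.
CASELOAD = {
    "clearly_employed": 35,
    "ambiguous": 45,          # seasonal, unverified, self-reported, etc.
    "clearly_unemployed": 20,
}


def employment_rate(caseload, count_ambiguous_as_employed):
    """Percentage recorded as employed under a given coding rule."""
    total = sum(caseload.values())
    employed = caseload["clearly_employed"]
    if count_ambiguous_as_employed:
        employed += caseload["ambiguous"]
    return 100 * employed / total


# Before: officers mark ambiguous cases "unemployed" as a reminder to follow up.
before = employment_rate(CASELOAD, count_ambiguous_as_employed=False)
# After: officers are graded on the number, so ambiguous cases become "employed".
after = employment_rate(CASELOAD, count_ambiguous_as_employed=True)

print(f"before: {before:.0f}%  after: {after:.0f}%")
```

Under these assumed shares the recorded rate jumps from 35% to 80% with no change in the underlying reality, which is all a binary field can tell you about a non-binary situation.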

Except for simple, binary data sets, this type of result is all too common.

We simply and consistently over-estimate our ability to collect and analyze data, and vastly under-estimate the major errors that occur with only minor input variability.
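A rough back-of-the-envelope illustration of how small errors add up across variables, again in Python. It assumes each variable in a record is captured correctly with the same probability and that errors are independent, which is a simplification and probably too kind to real data. The function name and the 95% figure are mine, for illustration only.

```python
def fully_correct_share(per_variable_accuracy, n_variables):
    """Share of records with every variable correct.

    Assumes errors are independent across variables and equally likely
    for each one (an illustrative simplification, not a measured fact).
    """
    return per_variable_accuracy ** n_variables


# Hypothetical: each variable is recorded correctly 95% of the time.
for n in (1, 5, 10, 20):
    share = fully_correct_share(0.95, n)
    print(f"{n:>2} variables: {share:.1%} of records fully correct")
```

With ten variables at 95% accuracy each, only about 60% of records come out fully correct under these assumptions, and that is before anyone has a reason to shade the numbers.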