Reliability of Training Data Sets for ML Classifiers: A Lesson Learned from Mechanical Engineering
Original version: Juric, R., Danilchanka, N., & Mousavi, M. G. (2020, January). Reliability of Training Data Sets for ML Classifiers: a Lesson Learned from Mechanical Engineering. In T. X. Bui (Ed.), Proceedings of the 53rd Hawaii International Conference on System Sciences (pp. 891-900). https://doi.org/10.24251/HICSS.2020.111
The popularity of learning and predictive technologies across many problem domains is unprecedented. It is often underpinned by the fact that we can compute efficiently with vast amounts of data and data types, and thus should be able to resolve problems that we could not in the past. This view is particularly common among scientists who believe that the excessive amount of data we generate in real life is ideal for performing predictions and training algorithms. However, the truth might be quite different. The paper illustrates the process of preparing a training data set for an ML classifier that should predict certain conditions in mechanical engineering. The difficulty was not in defining and choosing classifiers in order to secure safe predictions. The problem was our inability to create a safe, reliable and trustworthy training data set from scientifically proven experiments. This casts serious doubt on the way we use learning and predictive technologies today. It remains debatable what the next step should be. However, if in ML algorithms, and classifiers in particular, the semantics built into data sets influence a classifier’s definition, it would be very difficult to evaluate and rely on them before we fully understand data semantics. In other words, we still do not know how the semantics, sometimes hidden in a data set, can adversely affect the algorithms trained on it.