Saturday, June 03, 2017

Using Machine Learning in a SAP project

SAP projects can be a very complex task of data mapping and data cleaning. People in data science spend most of their time doing this task called data wrangling. There is no doubt a lot of effort is needed at this stage. Most times in SAP projects this task is done manually using extractions and Excel files in a process that is very likely to be poluted by human errors. As data size and complexity increases final data quality decreases. This happens always, no matter how strong people are with Excel.

I think there is a lot from data science that can be very effective in SAP projects. Not only the tools for building the data wrangling tasks but also the machine learning tools that can be used to extract knowledge from data.

In a recent project the product pricing master data consisted of about 50 thousand entries. This is still a small dataset, but already too large for the user to validate line by line.

With machine learning it was possible to identify the errors in the data using algorithms for outlier detection. These methods work quite well when there is a lot of data. In this pricing data there were many very similar prices for the same material and different customers, and many similar prices for products from same product group. This allows the algorithm to identify areas in space with high density of entries and outliers are elements outside those areas.

The picture below illustrates what the algorithm does, the areas with high density of points are what the algorithm sees as normal values and everything else outside are the outliers.

In this specific case machine learning unconvered a large number of outliers and after analysis it was possible to identify several reasons:

  • prices can be maintained with quantity factor (eg. it can be a price for 1 unit, or a price for 100 units); wrong quantity factors originated prices that were order of magnitude different
  • prices can be maintained in different currencies (which were all converted to a base currency before using the algorithm) and there were cases where the price was calculated in a currency but then maintained in a different currency
  • there were software bugs in reading the upload data format

Using machine learning allowed to quickly extract the few hundred cases of errors from the dataset and this simplified the correction activity. Because it is such a generic tool it can be used with any data to find entries that are outliers. If after inspection the outliers are correct entries, then we can have more confidence on the data quality.

Master data quality is a big problem. Machine learning will not magically solve all master data issues, but it is a strong tool to help on that.