Strategy: crowdsourcing your data cleaning

Herve Blanc
Apr 18, 2021

Updated: Sep 27, 2023

Business need

I recently worked on automatically extracting information from documents. My customer is looking to expand a successful mobility service to new geographic territories. Scaling up is often a reason to revisit how things are done within your company; automating tedious, low-value-added tasks is often an opportunity to leverage #AI.

In this case, the existing process would mean manually processing thousands of documents per month, whereas it originally handled hundreds over a limited area. Efficiency helps you scale up and do more with the same teams, now refocused on the tasks where human intelligence adds real value, while #ArtificialIntelligence deals with the mundane drudgery.

Expect to change your process to support your efficiency needs

When I looked at the data my customer collected, there were many disparities in document formats (pdf, jpg, ...). Even for the same document type (e.g. proof of insurance), end users submitted many different layouts to the service. Luckily, there are strategies you can use to improve your data quality considerably, and educating your users is the best one. Handling data quality at the source greatly simplifies what you need the algorithms to do. It also amounts to #CrowdSourcing the effort of providing higher quality inputs. This is by far the cheapest and most efficient strategy for cleaning up the data mess.

Another positive side effect of crowdsourcing

It often requires only simple adjustments to your existing workflows, in the form of straightforward guidance to your end users. This can be an overlay template displayed in your mobile application's "Take picture of your XYZ document" interface. While it may seem at first that you are imposing constraints on your customers, think about the positive effect of drastically accelerating your service response time once these documents are processed automatically.

Various quality check opportunities

Gradually, you might also consider creating machine learning models that check the quality of the documents provided by your end users and give real-time feedback: "Your document quality is not acceptable for ABC reason". Again, this seems to impose on your customers, right? But think about the cost to your organisation of processing the same document again. And what about the user experience if you come back a few hours (or even worse, days) later to let your customers know they have to do this again? #RealTime is the norm, so better make sure you meet these expectations now.
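As a rough illustration of such a real-time check, here is a minimal Python sketch. It uses simple rules as a stand-in for a trained quality model; the accepted formats, the resolution limits, the sharpness threshold and the `sharpness` score itself (assumed to come from some upstream blur detector) are all hypothetical values chosen for the example.

```python
# Hypothetical limits, to be tuned on your own documents.
ACCEPTED_FORMATS = {"pdf", "jpg", "jpeg", "png"}
MIN_WIDTH, MIN_HEIGHT = 1000, 700   # pixels
MIN_SHARPNESS = 0.4                 # 0.0 (blurry) to 1.0 (sharp)


def check_quality(file_format, width, height, sharpness):
    """Return the list of reasons a submission is rejected (empty if accepted)."""
    reasons = []
    if file_format.lower() not in ACCEPTED_FORMATS:
        reasons.append(f"unsupported file format '{file_format}'")
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        reasons.append("image resolution is too low")
    if sharpness < MIN_SHARPNESS:
        reasons.append("image is too blurry")
    return reasons


def feedback_message(reasons):
    """Build the message shown to the end user right after the upload."""
    if not reasons:
        return "Your document was accepted."
    return "Your document quality is not acceptable: " + "; ".join(reasons) + "."
```

Returning every reason at once, rather than stopping at the first failure, lets the user fix everything in a single retake.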

Why do you need a data cleaning strategy?

Now, assuming you checked the document quality and automatically extracted the information from your customers' documents, how do you know the OCR (Optical Character Recognition) did not glitch? You know well that the machine learning model you developed to process a given document type is only >90% accurate, right? Which means, well, you have to deal with the corner cases. Here again, #CrowdSourcing the quality verification of the extracted information is the best approach. Give your customers the responsibility to check and correct the few mistakes, if any. They will happily do that at no extra cost to your business, other than a new "verification step" in your "document submit" application. Your back office should also cross-check information between the various documents your customers provided, which gives you another opportunity to get back to your customers in real time if an inconsistency is detected.
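The cross-check between documents could look like the following minimal Python sketch. The document types, field names and the normalization rule (ignore case and extra spaces) are assumptions made for the example, not the actual back-office logic.

```python
def normalize(value):
    """Make comparison tolerant to case and whitespace differences."""
    return " ".join(value.split()).casefold()


def find_inconsistencies(documents):
    """Compare fields that appear in several documents of one customer.

    `documents` maps a document type to its extracted fields, e.g.
    {"insurance_proof": {"plate": "AB-123-CD"}, ...}.
    Returns {field: {document_type: value}} for fields whose values differ.
    """
    seen = {}
    for doc_type, fields in documents.items():
        for field, value in fields.items():
            seen.setdefault(field, {})[doc_type] = value
    return {
        field: by_doc
        for field, by_doc in seen.items()
        if len({normalize(v) for v in by_doc.values()}) > 1
    }


# Hypothetical submission: the plate number differs between the two documents.
submission = {
    "insurance_proof": {"holder_name": "Jane  Doe", "plate": "AB-123-CD"},
    "registration": {"holder_name": "jane doe", "plate": "AB-123-CE"},
}
conflicts = find_inconsistencies(submission)
```

A non-empty result is the trigger for asking the customer, while still in the application, which value is correct.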

Human in the loop

Now that your customers have sent the documents your process needs and have checked and corrected the information, it is up to you to decide how often your organization will double-check it. Depending on your business criticality, you might need 50% of the documents initially identified as "bad quality" to be double-checked by your teams. Compared to a full manual process, that is only a few percent of the complete set of documents processed. It means more quality time for your collaborators to focus on these "corner case" documents, while the workflow gets most of the information into your databases automatically.
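One simple way to pick the documents your teams will double-check is random sampling among the flagged ones. The sketch below assumes each document is a dictionary with a hypothetical `flagged_bad_quality` key; the 50% rate is the example figure from the paragraph above.

```python
import math
import random


def select_for_review(documents, review_rate=0.5, seed=None):
    """Randomly pick a share of the flagged documents for human review."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    flagged = [d for d in documents if d["flagged_bad_quality"]]
    sample_size = math.ceil(len(flagged) * review_rate)
    return rng.sample(flagged, sample_size)


# Hypothetical batch: 10 documents, 4 of them flagged as bad quality.
batch = [{"id": i, "flagged_bad_quality": i < 4} for i in range(10)]
to_review = select_for_review(batch, review_rate=0.5, seed=42)
```

Rounding up with `math.ceil` means that at least one flagged document is reviewed whenever any exist and the rate is above zero.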

Learning opportunity

We have seen that "document quality" can be assessed by a specific "quality model". The "document extraction" model can help too, providing a refined insight via its own set of quality metrics. In fact, each field extracted from the OCR result also comes with a confidence level, stating how well the model thinks the text matches the data it learned for this field. Depending on how critical a text field is to your business, you might decide to reject documents if that field's recognition confidence level is below 95%. For other fields, it might be acceptable to let information flow in even if the confidence level is as low as 75%. The various checks described above, their results and the documents themselves should be kept and reviewed by your data science team, as an opportunity to improve your models. #MachineLearning systems should be monitored, their performance assessed with real-world data, and updated regularly as your business context and associated data evolve. New products, new customer needs, new document types, ... might challenge your deployed #AI models, hence the likely need to retrain your models on that real-world data.
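The per-field confidence thresholds can be expressed in a few lines of Python. In this minimal sketch the field names are hypothetical, and the 95% and 75% values are simply the example figures used above; the right thresholds depend on your own business.

```python
from dataclasses import dataclass

# Hypothetical fields: critical ones need 95%, the others 75%.
FIELD_THRESHOLDS = {
    "policy_number": 0.95,
    "expiry_date": 0.95,
    "insurer_name": 0.75,
}
DEFAULT_THRESHOLD = 0.75


@dataclass
class ExtractedField:
    name: str
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the extraction model


def fields_below_threshold(fields):
    """Return the names of fields whose confidence is under their threshold."""
    return [
        f.name
        for f in fields
        if f.confidence < FIELD_THRESHOLDS.get(f.name, DEFAULT_THRESHOLD)
    ]


def accept_document(fields):
    """A document is accepted only if every field meets its threshold."""
    return not fields_below_threshold(fields)
```

Logging the output of `fields_below_threshold` for every document also gives your data science team a direct view of which fields the model struggles with most.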