Abto Software built an AI document classification module for a construction management system. The cloud-based platform is used throughout the entire construction lifecycle, supporting projects from planning and design to operation and maintenance. The document management system (DMS) serves users in the United Kingdom, Ireland, Australia, Qatar, and the UAE.
The Abto data science team has combined our experience in OCR, NLP, and AI to upgrade the construction management software with an automated document classification tool. Read how we helped our client streamline document workflow processes for their customers.
- Phase 0. Dataset analysis
- Phase 1. OCR and text vectorization
- Phase 2. Investigation of document classification approaches
- Phase 3. AI model for document classification
- Phase 4. Deployment of the document classification tool
- What is a Document Management System?
- What is Document Classification?
- What technologies are used for Automatic Document Classification?
- What are the stages of Automatic Document Classification?
- What are the benefits of employing AI-based Document Classification?
Overview of the Client’s Document Management System
Our client is a European provider of project and document management tools used in construction and engineering. The comprehensive cloud-based solution is available both for mobile and desktop platforms. It is mostly used by architects, engineers, real estate developers, housebuilders, and contractors.
The company has more than two decades of experience in providing digital solutions for the construction industry. It strives to keep up with the times and offer its customers the most efficient construction management solutions. The company approached Abto Software with the task of implementing an automated document classification tool for its construction DMS (Document Management System).
AI Document Classification
Abto Software specializes in implementing AI to automate business processes. Our solutions range from customer support automation for FinTech to demand forecasting for ERP. This time, we had to implement an AI tool for automatic document classification within a construction DMS. The project followed several phases that reflect our usual approach to delivering AI automation solutions.
Phase 0. Dataset analysis
As a first step, we analyzed the dataset provided by our client. The documents submitted by users of the construction Document Management System included Microsoft Office files (DOC, XLS), scanned and digitally created PDFs, images (PNG, JPEG, BMP), and AutoCAD drawings (DWG).
The task was to assign three labels to each document. Each label has between 3 and 18 possible classes. The dataset included around 14,000 documents per label, with 200 to 11,000 documents per class.
This type of document classification is called multi-label document classification, as each document can be assigned more than one label. In our case, a document can have up to three labels. For example, a DOC file containing information about the materials used for a particular furniture item will have three labels:
'description'. Each of these labels also has its own unique name within the construction DMS –
Phase 1. OCR and text vectorization
AI models for document classification work by analyzing the text within documents. Text extraction is therefore one of the most crucial steps in building a comprehensive document classification tool.
We used Tesseract OCR to build a custom Parser API. First, it performs optical character recognition (OCR) on non-readable documents, that is, scanned documents, images, and schemes. Then, it converts the extracted text into a numerical representation. This NLP technique is called text vectorization.
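To make the vectorization step concrete, here is a minimal sketch of TF-IDF, one of the vectorization methods we investigated. It is written in plain Python for illustration only and is not the Parser API itself. The function names and the sample texts are hypothetical, and the strings stand in for text that OCR has already extracted.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real pipelines usually do more."""
    return text.lower().split()


def tfidf_vectorize(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty text
        vectors.append([
            # Term frequency times inverse document frequency.
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical sample texts standing in for OCR output.
extracted_texts = [
    "oak table material specification",
    "site safety inspection report",
    "oak chair material description",
]
vocabulary, vectors = tfidf_vectorize(extracted_texts)
```

Terms that occur in fewer documents receive a higher weight, so "table" outweighs "oak" in the first vector. Library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of this formula, so their numbers will differ from this sketch.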
Phase 2. Investigation of document classification approaches
While selecting the best document classification approach, we tested more than a dozen Machine Learning models. We focused on achieving the highest accuracy for each of the labels while accounting for an unbalanced dataset.
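One common way to account for an unbalanced dataset is to weight each class inversely to its frequency, so that rare classes count for more during training. The sketch below shows that heuristic only as an illustration. It is an assumption that such weighting fits this project, and the class names and counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in counts.items()
    }


# Hypothetical, deliberately skewed label distribution.
labels = ["drawing"] * 8 + ["report"] * 2
weights = balanced_class_weights(labels)
```

With this distribution, the rare "report" class receives four times the weight of the frequent "drawing" class.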
The rigorous investigation of the document classification approaches allowed us to select an optimal AI model for our customer. We adopted an ensemble learning approach, as it performed better on our type of data. This Machine Learning technique combines the individual outputs of several classifiers to reduce prediction errors. The results from the selected ensemble of classifiers are put to a vote to achieve the highest accuracy possible.
We decided to train three separate ensemble classification models, one per label. The ML models have the same architecture but are trained on different datasets.
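The voting step can be sketched as follows. This is a minimal illustration of majority (hard) voting, not the delivered model. The keyword-based classifiers and the class names "drawing" and "report" are hypothetical stand-ins for trained models.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by most classifiers.

    On a tie, the class that was predicted first wins.
    """
    return Counter(predictions).most_common(1)[0][0]


def keyword_classifier(keyword, label, default):
    """Build a toy classifier: `label` if the keyword occurs, else `default`."""
    def classify(text):
        return label if keyword in text.lower() else default
    return classify


# Hypothetical ensemble of three weak classifiers for a single label.
ensemble = [
    keyword_classifier("drawing", "drawing", "report"),
    keyword_classifier("plan", "drawing", "report"),
    keyword_classifier("inspection", "report", "drawing"),
]


def ensemble_predict(text):
    """Collect one prediction per classifier and put them to a vote."""
    return majority_vote([classify(text) for classify in ensemble])


result = ensemble_predict("Inspection of the floor plan")
```

In this example the classifiers disagree: two vote for "report" and one for "drawing", so the ensemble returns "report". For trained scikit-learn estimators, the library's VotingClassifier provides the same mechanism.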
Phase 3. AI model for document classification
We built an AI-based Document Classification API that receives the output of the Parser API. It returns the values for the three labels along with the accuracies of the predictions. We achieved 98% accuracy for classification within a single label and 96% accuracy in classifying documents across all three labels.
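The difference between the two accuracy figures can be illustrated with a short sketch. It assumes that document-level accuracy counts a document as correct only when all three of its labels are correct, which would explain why that figure is the lower one. The class names and predictions below are made up for the example.

```python
def per_label_accuracy(y_true, y_pred, label_index):
    """Share of documents whose class for one given label is correct."""
    correct = sum(
        truth[label_index] == pred[label_index]
        for truth, pred in zip(y_true, y_pred)
    )
    return correct / len(y_true)


def document_accuracy(y_true, y_pred):
    """Share of documents for which all labels are correct at once."""
    correct = sum(truth == pred for truth, pred in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical ground truth and predictions: three labels per document.
y_true = [("a1", "b1", "c1"), ("a2", "b1", "c2"),
          ("a1", "b2", "c1"), ("a2", "b2", "c2")]
y_pred = [("a1", "b1", "c1"), ("a2", "b1", "c1"),
          ("a1", "b2", "c1"), ("a2", "b2", "c2")]

label_scores = [per_label_accuracy(y_true, y_pred, i) for i in range(3)]
doc_score = document_accuracy(y_true, y_pred)
```

Under this definition, document-level accuracy can never exceed the accuracy of the weakest label.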
Phase 4. Deployment of the document classification tool
Our cloud systems administrators hosted the document classification solution on Amazon Web Services (AWS). This secure yet flexible cloud hosting platform ensured seamless integration of the document classification tool into the client’s document management system.
Document Classification Solution Structure
The delivered document classification solution comprises two APIs.
- Parser API. Performs document preprocessing, data extraction through OCR, and text vectorization;
- Classification API. Performs AI-enabled multi-label document categorization based on the output of the Parser API.
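The hand-off between the two APIs might look roughly like the sketch below. All names here (ParsedDocument, LabelPrediction, parse_document, classify_document, the label keys, and the toy models) are hypothetical and chosen for illustration. The real interfaces are not described in this case study, and the toy vectorizer and models stand in for OCR, text vectorization, and the trained ensembles.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Model = Callable[[Vector], Tuple[str, float]]


@dataclass
class ParsedDocument:
    """Output of the parsing step: extracted text plus its vector."""
    text: str
    vector: Vector


@dataclass
class LabelPrediction:
    """One predicted class and its score for a single label."""
    label: str
    predicted_class: str
    score: float


def parse_document(raw_text: str,
                   vectorize: Callable[[str], Vector]) -> ParsedDocument:
    """Stand-in for the Parser API; OCR is assumed to have run already."""
    return ParsedDocument(text=raw_text, vector=vectorize(raw_text))


def classify_document(parsed: ParsedDocument,
                      models: Dict[str, Model]) -> List[LabelPrediction]:
    """Stand-in for the Classification API: one model per label."""
    return [
        LabelPrediction(label, *model(parsed.vector))
        for label, model in models.items()
    ]


def toy_vectorize(text: str) -> Vector:
    """Toy vectorizer: a single feature holding the word count."""
    return [float(len(text.split()))]


def make_toy_model(short_class: str, long_class: str) -> Model:
    """Toy model that decides by document length only."""
    def model(vector: Vector) -> Tuple[str, float]:
        return (short_class, 0.9) if vector[0] < 5 else (long_class, 0.8)
    return model


models = {
    "label_1": make_toy_model("A", "B"),
    "label_2": make_toy_model("C", "D"),
    "label_3": make_toy_model("E", "F"),
}
parsed = parse_document("Oak table material description", toy_vectorize)
predictions = classify_document(parsed, models)
```

Keeping parsing and classification behind separate interfaces means either side can be replaced, for example a different vectorizer or a retrained model, without changing the other.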
Benefits of the Delivered Document Classification Solution
- Extensive document support. The delivered document classification solution supports multipage documents, image-only PDF files and other non-readable documents.
- Robust multi-label classification. The solution performs classification across three different labels, with 31 industry-specific classes in total.
- High classification accuracy. We have achieved 96% classification accuracy on the document level and 98% classification accuracy on the label level.
- Enhanced accessibility. As the document classification tool had to be integrated into the client’s document management system, we deployed it on the AWS cloud.
Team and Technologies
Team: project manager, solution architect, 2 data scientists, Python developer, cloud systems administrator
Project duration: 2.5 months
Tech stack and Data Science tools: Python, scikit-learn, Tesseract OCR, Amazon Web Services (AWS)
Investigated text vectorization methods: Word2vec, fastText, GloVe, TF-IDF, Universal Sentence Encoder, BERT
Investigated classification algorithms: LSTM, GRU, RNN, Bidirectional RNN, SVM, KNN, XGBoost, AdaBoost, Logistic Regression, Decision Trees, Naïve Bayes methods (Gaussian Naïve Bayes, Multinomial Naïve Bayes, Categorical Naïve Bayes)
What is a Document Management System?
A DMS, or Document Management System, is a software solution used to receive, store, update, track, and share digital documents. It automates document management workflows within an organization through a variety of tools. These tools include:
- document classification;
- version control;
- document editing and collaboration;
- robust searching;
- workflow automation;
- file format conversion;
- access control and permissions;
- digital signature, etc.
A substantial number of DMS modules are industry-specific. For example, robust document classification requires a deep understanding of the industry domain.
What is Document Classification?
Document classification, or document categorization, is the process of assigning predefined labels, or categories, to new documents based on their contents. Depending on who performs it, document classification can be manual (done by a person) or automated (done by an algorithm).
What technologies are used for Automatic Document Classification?
The state of the art is automatic document classification using Machine Learning, supported by Natural Language Processing (NLP) and Optical Character Recognition (OCR). ML classifiers work with vectorized text that is extracted from documents with NLP and OCR techniques.
What are the stages of Automatic Document Classification?
Automatic document classification usually follows three steps:
- Document preprocessing;
- Data extraction (OCR);
- Document classification using Machine Learning.
What are the benefits of employing AI-based Document Classification?
AI-based document classification has the usual advantages of business process automation (BPA). It is fast, efficient, cost-saving, and more accurate in comparison to manual document classification. Furthermore, building a custom document classifier enables in-depth labeling of the industry-specific documents.
AI-driven document classification is one of the business process automation (BPA) solutions. It can be integrated into a document management system or used separately. Its main advantage lies in replacing manual work with AI while simultaneously increasing the accuracy of the completed task.
Building an industry-specific document classification solution ensures robust classification and facilitates multi-label classification. Abto Software has years of experience in delivering automation solutions for different industries. We can help you build a document classification tool optimized for your particular use case. Discover how implementing Artificial Intelligence and Natural Language Processing can increase your bottom line – contact us by filling out the form below.