A personal document manager with machine learning

I have a ridiculous problem at home. When official letters arrive, be it for taxes, salary, banking, etc… I usually deal with the letter, pay what needs paying, read it, and then I don’t know what to do with it. Can I get rid of it? Should I keep it? Until now, my approach was to keep everything (just in case). And so over the years, I have accumulated a grand total of 3 archive binders filled with legal documents, contracts from my previous rentals, and so on. But I have let the unprocessed documents pile up on my coffee table to the point of towering pretty high. An accumulation of several years of documents, to be precise. I was waiting to come up with a better way to manage these documents.

And something happened. Again, it is because of my NAS (I love this little box, in case you hadn’t already noticed). One option I had always considered was to digitize all of these documents for better accessibility. Before my NAS, I only had computers and laptops to store data, plus a bunch of USB hard drives. No, the real problem was the scanning of the documents. Scanning was time consuming: it usually takes almost 30 seconds per page side, between taking the document, going to the scanner, back to the computer, clicking scan, scanning, and then putting the file in the right folder. And this year, I noticed that with my NAS, I can simply scan the document and send it directly to a folder on the server, duplicated across disks in case one of them fails.

And it got me thinking further: what if I could scan my document, send it automatically from my printer to my NAS (already doing that), and something would do the classification for me, putting the scanned doc in the right folder: tax, rent, pension planning, banking, etc… Basically, this means doing some machine learning to create a model that would classify my docs.

It got me excited to learn more about it. I dabble in machine learning and have already done some projects on Kaggle for learning, and even some at work to predict tabular data. This was different. The path I decided to take is the following:

  1. Scan the document and send it to an “input” folder
  2. Use OCR to extract text
  3. Use a machine learning model to predict the category of the document
  4. Send the document to the right folder

My first idea was to use LayoutLMv3 from Microsoft for the classification model, since it can use both text and visual cues to determine the category of a document. I would first need to fine-tune a pre-trained version of the model, which is widely available on Hugging Face. The OCR extraction is fairly “basic” these days and can be performed by several Python packages, most of them relying on Tesseract. I decided to use easyocr, mostly because it is faster than Tesseract.

After finding some examples online on how to train the LayoutLMv3 model, I got to work. Only issue: I don’t have a super powerful laptop. My MacBook Pro M4 with 16 GB of unified memory is sadly not enough. It is great for inference, but training is something else. I had approximately 100 MB of training data, and after starting my script, it never completed a single epoch, even after letting it run for 30 minutes. So I gave up trying on my machine. I was on the verge of buying computation time on Google Colab when, looking at my data, I noticed all the private information I would be sending to a server somewhere, maybe even to be used to train some AI. All my bank account info, all my taxes, etc… It was not possible for me. I could not send this to the cloud. The training had to happen on my machine.

So I went looking for an easier way. Distillation? Using the LayoutLMv3 model as a teacher to train a BERT model… Eh. Too complicated for now. A smaller model? That could work. So I took a pre-trained distilbert-base-uncased from Hugging Face, redid my script, and started the training. And it worked! An epoch took 5-10 seconds. This was finally going somewhere. Even better, the loss was decreasing, so I knew it was learning.

After training, I used the model to predict the label for some unseen documents, and I was happy to see that it classified everything correctly, except for a single document that went to a “failed classification” folder. That is good enough for me.

Now, let’s go a little deeper into the implementation of the training script.

I have all of my documents, which are JPG images straight from my scanner, sorted by hand into their respective folders inside the main data folder. It looks something like this: data/Tax/scan1234.jpg, data/Salary/scan321.jpg, etc…

So the first task is to retrieve all the categories I created, which are all the folders inside the data folder:

from pathlib import Path

# Each subfolder of data/ is a category. Filtering on is_dir() replaces the
# original [1:] slice, which was likely there to drop a stray non-folder entry.
DOC_CLASSES = sorted(p.name for p in Path("data/").glob("*") if p.is_dir())
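With a folder layout like the one above, this gives a sorted list of category names, something like:

print(DOC_CLASSES)
# e.g. ['Banking', 'Rent', 'Salary', 'Tax'] (illustrative category names)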

The next part is to gather all the images inside those category folders, using:

image_list = list(Path("data/").rglob("*.jpg"))

Now I need to run OCR on the files to extract the text from the images, using easyocr. Since all the files are in French, I use the ‘fr’ language model for the Reader. Easyocr also outputs the bounding box for every text block that it finds, so I need to do a little formatting to output a dict with the classification label of the document and a single string containing all the text.

import easyocr

reader = easyocr.Reader(['fr'])

def ocr(filepath: Path):
    # readtext returns a list of (bounding box, text, confidence) triples
    data = reader.readtext(str(filepath), batch_size=16)
    # Join the text blocks with spaces, consistent with the inference script
    return {'label': filepath.parent.name, 'text': ' '.join([x[1] for x in data])}
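Calling this on one of my scans gives a dict ready for training (the file name here is just illustrative):

from pathlib import Path

sample = ocr(Path("data/Tax/scan1234.jpg"))
print(sample['label'])      # 'Tax'
print(sample['text'][:80])  # beginning of the recognized text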

For the record, and also because it is time consuming to perform the OCR over and over on the same data, I save this dict into a JSON file next to the actual image for later use during training. I also plan to retrain the model on the entire data corpus maybe once a year; saving my OCR results is a way to accelerate that step later.

import json

def save_to_json(data: dict, filename: Path):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

Now, we can bunch all of this together to analyze all the documents and create the JSON files containing the label and the text of each of them:

import json
from pathlib import Path

import easyocr

reader = easyocr.Reader(['fr'])

def ocr(filepath: Path):
    # readtext returns a list of (bounding box, text, confidence) triples
    data = reader.readtext(str(filepath), batch_size=16)
    return {'label': filepath.parent.name, 'text': ' '.join([x[1] for x in data])}

def save_to_json(data: dict, filename: Path):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

if __name__ == '__main__':
    image_path_list = list(Path("data/").rglob("*.jpg"))

    for image_path in image_path_list:
        ocr_data = ocr(image_path)
        save_to_json(ocr_data, image_path.with_suffix('.json'))
		

So now, every document has been analyzed by the OCR. I need to gather all that data into a single file containing all the labels and the corresponding texts. To do this, I scan for all the JSON files in my data folder and build a list that I save in CSV format with the help of pandas. This also lets me normalize the data, making sure it is UTF-8 and ready for the next steps.

import json
from pathlib import Path

import pandas as pd

# Gather every OCR result saved next to the images
json_list = list(Path("data/").rglob("*.json"))

json_labeled_list = []
for item in json_list:
    with open(item, encoding='utf-8') as f:
        data = json.load(f)
    json_labeled_list.append(data)

df = pd.DataFrame.from_records(json_labeled_list)
df['label'] = df['label'].astype('string')
df['text'] = df['text'].astype('string')
# quoting=1 is csv.QUOTE_ALL, so commas and newlines in the text are safe
df.to_csv("full_training_data.csv", index=False, quoting=1)
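A quick sanity check at this point (an addition of mine, not part of the original script) is to look at the class balance and text lengths, since a category with only a handful of examples will be hard to learn:

print(df['label'].value_counts())
print(df['text'].str.len().describe())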

And now, I am ready for the training of the model. I will use the datasets and transformers packages from Huggingface for this.

from datasets import load_dataset
from transformers import AutoTokenizer

# Loading the data I just created
dataset = load_dataset('csv', data_files="full_training_data.csv")

# Splitting into train/test sets, with 20% of the data used as test
dataset = dataset["train"].train_test_split(test_size=0.2)

# Loading the autotokenizer for the model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
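At this point the dataset is a DatasetDict with a train and a test split; printing it is a quick way to check the split sizes:

print(dataset)
# DatasetDict({
#     train: Dataset({features: ['label', 'text'], num_rows: ...})
#     test: Dataset({features: ['label', 'text'], num_rows: ...})
# })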

I then need to convert the labels into numbers. I thought this would be done automatically, but apparently not… Possibly because I stored the labels as plain strings, they are not automatically recognized as a categorical feature. Anyway, I create a label map like this:

# Sorting makes the label-to-id mapping deterministic between runs
label_map = {label: i for i, label in enumerate(sorted(set(dataset["train"]["label"])))}

def convert_labels(example):
    example["label"] = label_map[example["label"]]
    return example

dataset = dataset.map(convert_labels)
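The inference script below needs this exact mapping to turn the predicted ids back into folder names, so it is worth saving it to disk. A minimal sketch, assuming a label_map.json file sitting next to the training script:

import json

with open("label_map.json", 'w', encoding='utf-8') as f:
    json.dump(label_map, f, ensure_ascii=False, indent=4)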

Now the labels are converted accordingly. I need to apply a tokenizing function to the dataset as well:

def tokenize_function(examples):
    # Pad/truncate every document to the model's 512-token limit
    return tokenizer(examples['text'], padding="max_length", truncation=True)

dataset = dataset.map(tokenize_function, batched=True)
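After this step, each example carries the tokenized fields next to the original columns:

print(dataset["train"][0].keys())
# dict_keys(['label', 'text', 'input_ids', 'attention_mask'])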

And now I can load the pre-trained model itself:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(label_map)
)

Next, I define the training arguments and the trainer:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./distilbert_model",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=100,
    weight_decay=0.01,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
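As written, the Trainer only reports the evaluation loss at each epoch. If you also want to track accuracy, a compute_metrics hook can be passed to the Trainer; a minimal sketch:

import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# then add compute_metrics=compute_metrics when building the Trainer above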

And all that is left to do is train the model and save the results!

trainer.train()
trainer.save_model("distilbert_document_classifier")
tokenizer.save_pretrained("distilbert_document_classifier")

This concludes the training of the model on my data. Next comes inference on new data and moving documents into the correct folders. Here is basically the whole script to do this:

from pathlib import Path
import json
import os

import easyocr
from transformers import pipeline

reader = easyocr.Reader(['fr'])

# Build the classification pipeline once instead of reloading the model
# for every document
classifier = pipeline("text-classification", model="distilbert_document_classifier",
                      tokenizer="distilbert_document_classifier")

# Load the label map saved during training (the label_map.json file from
# the training section); the model predicts an integer id and we want
# the category name back
with open("label_map.json", encoding='utf-8') as f:
    label_map = json.load(f)

def ocr(filename):
    ocr_result = reader.readtext(filename, batch_size=16)
    return ' '.join([x[1] for x in ocr_result])

def convert_label(label):
    for k, v in label_map.items():
        if v == label:
            return k

    raise ValueError('No match')

def classify(text):
    # The tokenizer truncates to 512 tokens anyway; slicing the text is just
    # a rough cap to keep the input small
    result = classifier(text[0:512])
    # Pipeline labels look like "LABEL_3"; extract the integer id
    label_num = int(result[0]['label'].split('_')[1])
    classify_label = convert_label(label_num)
    classify_confidence = result[0]['score']

    return {'label': classify_label, 'confidence': classify_confidence}

def classify_to_folders():
    # Read all the images waiting in the "Unclassified" folder
    image_list = list(Path("data/Unclassified").rglob("*.jpg"))

    for image in image_list:
        ocr_result = ocr(str(image))
        classify_result = classify(ocr_result)

        if classify_result['confidence'] > 0.8:
            # Confident prediction: move the file into its category folder
            target = Path("Classification_test") / classify_result['label']
        else:
            # Low confidence: park the file in a "Failed" folder for manual review
            target = Path("Classification_test/Failed_classification")

        # Create the target folder if needed, then move the file there
        target.mkdir(parents=True, exist_ok=True)
        os.rename(image, target / image.name)

if __name__ == '__main__':
    classify_to_folders()

There you go!

The next step is to make it run automatically every time I upload a file to my NAS input folder. I plan to do this using a container, as with my aurorawatch project. I also want to build a full-fledged app with Django and htmx to create some sort of local portal where I can see the results of the classification, merge pages into a single file (since my scanner doesn’t do that automatically), and correct the classification when it gets it wrong…