TableNet: Deep Learning model for end-to-end table detection and tabular data extraction from scanned images

Sadiva Madaan
Jun 21, 2021


This case study is about extracting tabular data from scanned images. Scanned documents such as retail receipts, invoices and insurance claims contain a lot of unstructured data, and these images often present information in tabular form.

Problem Statement

These scanned images are usually processed manually, which is labour-intensive and can lead to inefficient or inaccurate data processing. Our main aim is to accurately detect the tabular region and extract the information from its rows and columns.

How can Deep Learning be used to solve this problem?

Computer Vision is a branch of Artificial Intelligence that allows computers to understand and label images. Given a scanned image, we will use computer vision techniques to segment out the table and column regions.

Data Overview

We will be training our TableNet model on the Marmot data set.

The Marmot data set contains scanned images along with their respective .xml files, which hold the table and column coordinates. It is a public data set that is freely usable for research purposes.

Business Objective and Constraints

  1. High accuracy between the ground-truth table and column masks and the predicted masks.
  2. Maximize precision and recall.
  3. Column and table mask prediction should happen within a few seconds, so the user doesn’t have to wait long.

Performance Metrics

We will use the F1 score as our performance metric. The F1 score is the harmonic mean of precision and recall.

To measure precision and recall in an image segmentation task, we compare the predicted mask with the target mask at every pixel.
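A minimal sketch of how these pixel-wise metrics can be computed with NumPy is shown below; the function name and the 0.5 binarization threshold are my own assumptions, not code from the project.

def f1_score(pred_mask, true_mask, threshold=0.5, eps=1e-7):
    """Pixel-wise precision, recall and F1 between two binary masks."""
    import numpy as np
    pred = (pred_mask > threshold).astype(np.float32).ravel()
    true = (true_mask > threshold).astype(np.float32).ravel()

    tp = np.sum(pred * true)        # pixels predicted 1 that are really 1
    fp = np.sum(pred * (1 - true))  # pixels predicted 1 that are really 0
    fn = np.sum((1 - pred) * true)  # pixels predicted 0 that are really 1

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1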

Dataset Preparation and EDA

We have a total of 509 images, but 14 annotation files are missing from the original dataset, so we will be using 495 images.

14 .xml files are missing
Sample Input Image

For every image we have a corresponding .xml file, which gives us the table and column coordinates.

This code snippet is the .xml file for the above scanned image. It contains elements such as filename, size and path. From size we can get the height and width of the image, and for every table and column we have the xmin, xmax, ymin and ymax coordinates.

After reading the .xml file and getting the coordinates, we can build the table and column masks shown in the image above. We will also create a pandas dataframe that stores the path of each image along with its column mask and table mask.
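A rough sketch of this step is shown below; it is not the exact project code, and the tag names ('table'/'column'), XML layout and file paths are assumptions about the annotation format.

import xml.etree.ElementTree as ET
import numpy as np
import pandas as pd
from PIL import Image

def masks_from_xml(xml_path):
    """Build binary table/column masks from one annotation file."""
    root = ET.parse(xml_path).getroot()
    size = root.find('size')
    h, w = int(size.find('height').text), int(size.find('width').text)

    table_mask = np.zeros((h, w), dtype=np.uint8)
    column_mask = np.zeros((h, w), dtype=np.uint8)

    for obj in root.findall('object'):
        name = obj.find('name').text            # 'table' or 'column' (assumed)
        box = obj.find('bndbox')
        xmin, ymin = int(box.find('xmin').text), int(box.find('ymin').text)
        xmax, ymax = int(box.find('xmax').text), int(box.find('ymax').text)
        if name == 'table':
            table_mask[ymin:ymax, xmin:xmax] = 255
        else:
            column_mask[ymin:ymax, xmin:xmax] = 255
    return table_mask, column_mask

# Save the masks and record their paths alongside the image path.
table_mask, column_mask = masks_from_xml('annotations/sample.xml')
Image.fromarray(table_mask).save('masks/sample_table.png')
Image.fromarray(column_mask).save('masks/sample_column.png')

df = pd.DataFrame([{'image': 'images/sample.png',
                    'table_mask': 'masks/sample_table.png',
                    'column_mask': 'masks/sample_column.png'}])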

From the above histogram, we can see that most of our images have a height greater than 1000 pixels.

Most of our images have a width of around 800 pixels.

TableNet Architecture

We have one VGG-19 encoder and two decoders: one for segmentation of the table region and one for segmentation of the columns within a table region.

The model takes an input image and outputs two separate semantically labelled images, one for tables and one for columns. The pre-trained VGG-19 network helps us learn low-level features. The encoder is shared across both tasks, but the decoder splits into two different branches for tables and columns, so we effectively train two computational graphs. The input image is first converted to RGB and then resized to 1024x1024.

In the above image, conv1 to pool5 are the shared encoder layers; each of the five convolutional blocks is followed by a max pooling layer. conv7_table and conv7_column are the branching points of the two decoders. The conv6 layer uses ReLU activation with a dropout rate of 0.8, and the decoder branches split off after it. The intuition is that column regions are a subset of table regions, so a single encoder can filter out the active regions for both tasks. The final feature maps are upscaled to the original image dimensions using Conv2DTranspose and UpSampling2D layers.

Once we get the table and column masks from the model, they are used to extract text from the original image with Pytesseract OCR. OCR stands for Optical Character Recognition and is used to recognize text inside images. Tesseract is based on a neural network subsystem trained with an LSTM, which helps it predict sequences of characters.

Data Preprocessing

According to the research paper, we will:

  1. Resize images to 1024x1024.
  2. Normalize the images by dividing every pixel value by 255.

We will use TensorFlow’s tf.data input pipeline to load our data.

We will use a batch size of 1 for training. Our input images are RGB, while the column and table masks are grayscale. A sketch of the input pipeline is shown below.
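This is a minimal tf.data sketch following the preprocessing above: images resized to 1024x1024 and scaled to [0, 1], masks loaded as grayscale, batch size 1. It assumes PNG files and the path columns from the dataframe built earlier.

import tensorflow as tf

IMG_SIZE = 1024

def load_example(img_path, table_mask_path, col_mask_path):
    def _load(path, channels):
        img = tf.io.decode_png(tf.io.read_file(path), channels=channels)
        img = tf.image.resize(img, [IMG_SIZE, IMG_SIZE])
        return tf.cast(img, tf.float32) / 255.0          # normalize to [0, 1]
    image = _load(img_path, 3)                           # RGB input
    table_mask = _load(table_mask_path, 1)               # grayscale target
    column_mask = _load(col_mask_path, 1)                # grayscale target
    return image, {'table_mask': table_mask, 'column_mask': column_mask}

dataset = (tf.data.Dataset
           .from_tensor_slices((df['image'].tolist(),
                                df['table_mask'].tolist(),
                                df['column_mask'].tolist()))
           .map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(1)
           .prefetch(tf.data.AUTOTUNE))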

Model Training

Above is the code of our TableNet architecture with VGG-19 as the encoder.

From the above image we can see that the max pooling outputs of block 3 and block 4 are concatenated into both the table and column decoders. The output feature map has the same dimensions as the input image, i.e. 1024x1024, and it has 2 channels because we have two classes.
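Below is a compact Keras sketch of the architecture described above, not the project's exact code; the layer names, filter counts and compile settings are assumptions made for illustration.

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG19

def build_tablenet(input_shape=(1024, 1024, 3)):
    vgg = VGG19(include_top=False, weights='imagenet', input_shape=input_shape)
    pool3 = vgg.get_layer('block3_pool').output   # 128x128 skip connection
    pool4 = vgg.get_layer('block4_pool').output   # 64x64 skip connection
    pool5 = vgg.get_layer('block5_pool').output   # 32x32 encoder output

    # Shared conv6 block with heavy dropout, as described above.
    x = layers.Conv2D(512, 1, activation='relu')(pool5)
    x = layers.Dropout(0.8)(x)

    def decoder(name):
        d = layers.Conv2D(512, 1, activation='relu')(x)
        d = layers.UpSampling2D(2)(d)                      # 64x64
        d = layers.Concatenate()([d, pool4])               # skip from block 4
        d = layers.UpSampling2D(2)(d)                      # 128x128
        d = layers.Concatenate()([d, pool3])               # skip from block 3
        d = layers.UpSampling2D(2)(d)                      # 256x256
        d = layers.UpSampling2D(2)(d)                      # 512x512
        # 2 output channels (background / foreground), back to 1024x1024.
        return layers.Conv2DTranspose(2, 3, strides=2, padding='same',
                                      activation='softmax', name=name)(d)

    table_mask = decoder('table_mask')
    column_mask = decoder('column_mask')
    return Model(inputs=vgg.input, outputs=[table_mask, column_mask])

model = build_tablenet()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])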

Training Results

Plot of table mask accuracy and column mask accuracy
Training and Test loss

From the above three plots we can see that the table mask accuracy for both the train and test sets is around 90%, and the column mask accuracy is also around 90%. The loss also decreases with every epoch.

These are some of the predictions. As you can see, the model does a pretty good job of predicting the table and column masks. With the help of these masks we can extract text from the tabular region.

Extracting text from the predicted image using PyTesseract OCR -
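Here is a minimal sketch of that OCR step: keep only the pixels inside the predicted table mask and pass the masked image to Tesseract. The file names and the 128 threshold are assumptions for illustration.

import numpy as np
import pytesseract
from PIL import Image

image = Image.open('images/sample.png').convert('RGB')
# Predicted table mask saved by the model, resized back to the image size (assumed path).
mask = Image.open('masks/sample_table_pred.png').resize(image.size)

img_arr = np.array(image)
mask_arr = np.array(mask)
img_arr[mask_arr < 128] = 255          # white out everything outside the table

text = pytesseract.image_to_string(Image.fromarray(img_arr))
print(text)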

From the above images, we can see that the model did a good job of predicting the tabular region and masked out everything that was not part of the table. Our F1 scores on the test data points were:

As we had only 494 images, we used just 10% of the data for testing. We also calculated precision and recall for all 494 images.

We can see that around 75% of the column masks had a precision of around 0.89.

Around 75% of the table masks had a precision of around 0.91.

75% of the column masks had a recall of around 0.71, with the maximum recall being 0.89.

The majority of the table masks had a recall of around 0.93, with the maximum recall being 0.99, which is very good. The model clearly does a good job of detecting table masks, but the column mask accuracy could be improved by training for more epochs.

TableNet with ResNet Encoder

We also implemented the TableNet architecture with ResNet-50 as the encoder.

Results with ResNet-50 as encoder

Below are the results with ResNet-50.

We can observe that the column masks are not as accurate as they were with the VGG-19 encoder.

Our F1 score with ResNet-50 as the encoder is in the same range as that of the VGG-19 encoder.

Deployment

The model was deployed locally using Flask; a rough sketch of the endpoint is shown below.
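This is a bare-bones sketch of a Flask endpoint that accepts an uploaded scan, runs the trained model and returns the predicted table mask. The route name, saved-model path and response format are assumptions, not the deployed app's exact code.

import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)
model = tf.keras.models.load_model('tablenet_vgg19.h5')   # assumed saved model

@app.route('/predict', methods=['POST'])
def predict():
    # Read the uploaded scan, resize and normalize it as during training.
    image = Image.open(request.files['file']).convert('RGB').resize((1024, 1024))
    x = np.array(image, dtype=np.float32)[None] / 255.0
    table_mask, column_mask = model.predict(x)
    # The masked image could also be passed through PyTesseract here, as above.
    return jsonify({'table_mask': table_mask[0].argmax(-1).tolist()})

if __name__ == '__main__':
    app.run(debug=True)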

Video link —

Future Work

Both of our models were trained for only a few epochs. In the research paper, the authors trained for around 5000 epochs, but as that requires a very powerful GPU and a lot of computational power, I wasn’t able to train for that long. In the future I would also like to train with more data, since we currently trained the model on only around 490 images.

Reference

  1. https://arxiv.org/pdf/2001.01469.pdf
  2. https://www.appliedaicourse.com/
  3. https://www.jeremyjordan.me/evaluating-image-segmentation-models/
  4. https://arxiv.org/pdf/1409.1556.pdf
  5. https://arxiv.org/pdf/1512.03385.pdf

You can check out all the details and code in my GitHub repo —

Connect with me on LinkedIn —
