
OpenCV: Find columns in Arabic journals (Python)

WBOY
2024-02-22

Question content

I am new to OpenCV and new to Python. I tried to piece together code I found online to solve my research problem. I have an Arabic journal from 1870 that has hundreds of pages; each page contains two columns and has a thick black border. I want to extract the two columns as image files so that I can run OCR on them individually, while ignoring the header and footer. Here is an example page:

Page 3

I have ten pages of raw scans as separate PNG files. I wrote the following script to process each one. It works as expected on 2 of the 10 pages, but fails to generate the columns on the other 8. I don't understand all the functions well enough to know where I could tweak the values, or whether my entire approach is misguided. I think the best way to learn is to ask the community how you would solve this problem.

import cv2

def cutpage(fname, pnum):
    image = cv2.imread(fname)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (7, 7), 0)
    # inverse Otsu threshold: text and borders become white on black
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    # tall kernel joins lines of text vertically into column-shaped blobs
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 13))
    dilate = cv2.dilate(thresh, kernel, iterations=1)
    dilatename = "temp/dilate" + str(pnum) + ".png"
    cv2.imwrite(dilatename, dilate)
    cnts = cv2.findContours(dilate, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]  # OpenCV 3/4 compatibility
    cnts = sorted(cnts, key=lambda x: cv2.boundingRect(x)[0])  # sort left to right

    fullpage = 1
    column = 1
    for c in cnts:
        x, y, w, h = cv2.boundingRect(c)
        if h > 300 and w > 20:
            if (h / w) < 2.5:
                print("Found full page: ", x, y, w, h)
                filename = "temp/p" + str(pnum) + "-full" + str(fullpage) + ".png"
                fullpage += 1
            else:
                print("Found column: ", x, y, w, h)
                filename = "temp/p" + str(pnum) + "-col" + str(column) + ".png"
                column += 1
            roi = image[y:y+h, x:x+w]
            cv2.imwrite(filename, roi)
    return column - 1

for nr in range(10):
    filename = "p" + str(nr) + ".png"
    print("Checking page", nr)
    diditwork = cutpage(filename, nr)
    print("Found", diditwork, "columns")

Following the tutorial, I created a binary inversion of the blurred image and dilated it, so that the different rectangular areas can be identified as large white blobs. I also saved a copy of each dilated image so I could see what it looked like; here is the page above after processing:

Page 3 after dilation

The "for c in cnts" loop should find large rectangular areas in the image. If the aspect ratio is less than 2.5, I get a full page (without header and footer, which works fine); if the aspect ratio is greater than that, I know it is a column, and it is saved as e.g. temp/p2-col2.png.
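The classification rule inside that loop can be sketched as a small standalone helper (the thresholds 300, 20, and 2.5 are the ones from the script above; the function name itself is just for illustration):

```python
# Hypothetical helper mirroring the loop's logic: classify a contour's
# bounding box by its height/width ratio (thresholds from the script above).
def classify_box(w, h, min_h=300, min_w=20, ratio=2.5):
    if h <= min_h or w <= min_w:
        return "ignore"      # too small: noise, header, or footer fragments
    return "full page" if h / w < ratio else "column"

print(classify_box(1000, 1400))  # full page (1400/1000 = 1.4 < 2.5)
print(classify_box(450, 1400))   # column (1400/450 is about 3.1)
print(classify_box(50, 100))     # ignore (too small)
```

When the dilation merges both columns into one blob, the blob's aspect ratio drops below 2.5 and it gets classified as a full page, which matches the failure mode described below.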

I get some nice full pages without headers and footers, i.e. just the large black border, but they are not chopped into columns. On 2 of the 10 pages I got what I wanted:

Successful column from page 2

Since I sometimes get the desired results, something must be working, but I don't know how to improve it further.

edit:

Here are more page examples:

p0

p1

p5


Correct answer


I tried something without any dilation, because I wanted to see if I could just use the middle line as a "separator". This is the code:

im = cv2.cvtColor(cv2.imread("arabic.png"), cv2.COLOR_BGR2RGB) # read im as rgb for better plots
gray = cv2.cvtColor(im, cv2.COLOR_RGB2GRAY) # convert to gray
_, threshold = cv2.threshold(gray, 250, 255, cv2.THRESH_BINARY_INV) # inverse thresholding
contours, _ = cv2.findContours(threshold, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) # find contours
sortedcontours = sorted(contours, key = cv2.contourArea, reverse = True) # sort according to area, descending
bigbox = sortedcontours[0] # get the contour of the big box
middleline = sortedcontours[1] # get the contour of the vertical line
xmiddleline, _, _, _ = cv2.boundingRect(middleline) # get x coordinate of middleline
leftboxcontour = np.array([point for point in bigbox if point[0, 0] < xmiddleline]) # assign left of line as points from the big contour
rightboxcontour = np.array([point for point in bigbox if point[0, 0] >= xmiddleline]) # assign right of line as points from the big contour
leftboxx, leftboxy, leftboxw, leftboxh = cv2.boundingRect(leftboxcontour) # get properties of box on left
rightboxx, rightboxy, rightboxw, rightboxh = cv2.boundingRect(rightboxcontour) # get properties of box on right
leftboxcrop = im[leftboxy:leftboxy + leftboxh, leftboxx:leftboxx + leftboxw] # crop left
rightboxcrop = im[rightboxy:rightboxy + rightboxh, rightboxx:rightboxx + rightboxw] # crop right
# maybe do some assertions about aspect ratio??
cv2.imwrite("right.png", rightboxcrop) # save image
cv2.imwrite("left.png", leftboxcrop) # save image

I'm not using any assertions about the aspect ratio, so maybe this is still something you need to add.

Basically, the most important lines in this method are the ones generating the left and right contours based on the x-coordinate of the middle line. This is the final result I get:

There are still some black parts on the edges, but that shouldn't be a problem for OCR.

FYI: I'm using the following packages in jupyter:

import cv2
import numpy as np
%matplotlib notebook
import matplotlib.pyplot as plt

v2.0: an implementation using only detection of the large box:

So I did some dilation, and the big box was easily detectable. I used a horizontal kernel to ensure that the vertical lines of the large box are always thick enough to be detected. However, I could not solve the problem with the middle line, as it is very thin... Nonetheless, here is the code for the above method:

im = cv2.cvtColor(cv2.imread("1.png"), cv2.COLOR_BGR2RGB) # read im as rgb for better plots
gray = cv2.cvtColor(im, cv2.COLOR_RGB2GRAY) # convert to gray
gray[gray < 255] = 0 # added some contrast to make it either completely black or white
_, threshold = cv2.threshold(gray, 250, 255, cv2.THRESH_BINARY_INV) # inverse thresholding
thresholddilated = cv2.dilate(threshold, np.ones((1, 10)), iterations = 1) # dilate horizontally
contours, _ = cv2.findContours(thresholddilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) # find contours
sortedcontours = sorted(contours, key = cv2.contourArea, reverse = True) # sort according to area, descending
x, y, w, h = cv2.boundingRect(sortedcontours[0]) # get the bounding rect properties of the contour
left = im[y:y+h, x:x+int(w/2)+10].copy() # generate left, i included 10 pix from the right just in case
right = im[y:y+h, int(w/2)-10:w].copy() # and right, i included 10 pix from the left just in case
fig, ax = plt.subplots(nrows = 2, ncols = 3) # plotting...
ax[0,0].axis("off")
ax[0,1].imshow(im)
ax[0,1].axis("off")
ax[0,2].axis("off")
ax[1,0].imshow(left)
ax[1,0].axis("off")
ax[1,1].axis("off")
ax[1,2].imshow(right)
ax[1,2].axis("off")

These are the results. You can notice it's not perfect, but again, since your target is OCR, this shouldn't be a problem.

Please tell me if this works, if not I will rack my brain to find a better solution...

v3.0: a better way to get straighter images, which will improve the quality of the OCR.

Inspired by my other answer: it makes sense to straighten the image so that the OCR gets better results. Therefore, I used a four-point transform on the detected outer frame. This straightens the image slightly and makes the text more horizontal. This is the code:

im = cv2.cvtColor(cv2.imread("2.png"), cv2.COLOR_BGR2RGB) # read im as rgb for better plots
gray = cv2.cvtColor(im, cv2.COLOR_RGB2GRAY) # convert to gray
gray[gray < 255] = 0 # added some contrast to make it either completely black or white
_, threshold = cv2.threshold(gray, 250, 255, cv2.THRESH_BINARY_INV) # inverse thresholding
thresholddilated = cv2.dilate(threshold, np.ones((1, 10)), iterations = 1) # dilate horizontally
contours, _ = cv2.findContours(thresholddilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) # find contours
largest_contour = max(contours, key = cv2.contourArea) # get largest contour
hull = cv2.convexHull(largest_contour) # get the hull
epsilon = 0.02 * cv2.arcLength(largest_contour, True) # epsilon
pts1 = np.float32(cv2.approxPolyDP(hull, epsilon, True).reshape(-1, 2)) # get the points
result = four_point_transform(im, pts1) # using imutils
height, width = result.shape[:2] # get the dimensions of the transformed image
left = result[:, 0:int(width/2)].copy() # from the beginning to half the width
right = result[:, int(width/2):width].copy() # from half the width till the end
fig, ax = plt.subplots(nrows = 2, ncols = 3) # plotting...
ax[0,0].axis("off")
ax[0,1].imshow(result)
ax[0,1].axvline(width/2)
ax[0,1].axis("off")
ax[0,2].axis("off")
ax[1,0].imshow(left)
ax[1,0].axis("off")
ax[1,1].axis("off")
ax[1,2].imshow(right)
ax[1,2].axis("off")

Using the following packages:

import cv2
import numpy as np
%matplotlib notebook
import matplotlib.pyplot as plt
from imutils.perspective import four_point_transform

As you can see from the code, this is a better approach: thanks to the four-point transform you can force the image to be centered and horizontal. Furthermore, there is no need to include any overlap, since the images are well separated. Here is an example for your reference:


Statement:
This article is reproduced from stackoverflow.com. If there is any infringement, please contact admin@php.cn to have it deleted.