Tags

python pandas parsing pdf ocr

Answers (1)

Accepted Answer
November 1, 2025 Score: 1 Rep: 5,249 Quality: High Completeness: 60%

As mentioned in the comments, you can make your life much easier by working with pymupdf.Page.find_tables. This method has many arguments that can be used to identify the table properly or to extract useful information for getting the entries.

In the example PDF given in the OP, the table has no horizontal lines separating the values, and this influences the detection of the column entries (in this case they are \n-separated), so manual tuning is required.
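
To make the \n-separated layout concrete, here is a minimal sketch that uses a hand-written list in place of what tb.extract() might return for such a table. The sample strings and the names extracted, table, header_rows and columns are assumptions for illustration, not data taken from the OP's PDF.

```python
# Hypothetical stand-in for tb.extract(): two header rows, then one data row
# whose cells each hold a whole column as a newline-separated string.
extracted = [
    ["I", "II"],
    ["Column 1", "Column 2"],
    ["Item 1\nItem 2\nItem 3", "Item 4\nItem 5"],
]

# Split every non-empty cell into its individual lines.
table = [[cell.splitlines() for cell in row if cell] for row in extracted]

# Each header cell holds a single line; pair the two header rows column-wise.
header_rows = [tuple(cell[0] for cell in row) for row in table[:2]]
header = list(zip(*header_rows))

# The data row is now a list of columns, one list of entries per column.
columns = table[2]

print(header)
print(columns)
```

The header pairs are in the shape that pd.MultiIndex.from_tuples expects, and the columns still have to be transposed into rows before building the DataFrame.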

import itertools as it
import fitz
import numpy as np
import pandas as pd

CHECKED_STATUS_MSG = " [OK]"
DF_FILL_VALUE = ''

path = "your path.pdf"

with fitz.open(path) as doc:
    page = doc[0]

    # get table as df
    #################
    tf = page.find_tables()
    for tb in tf.tables:
        t = []
        for row in tb.extract():
            # each cell holds a whole column as "\n"-separated text
            r = [entry.splitlines() for entry in row if entry]
            if r:
                t.append(r)

    # header
    header = []
    header.extend(zip(*t[0]))
    header.extend(zip(*t[1]))
    header = list(zip(*header))
    header = pd.MultiIndex.from_tuples(header)

    # df entries
    data_cols = list(it.zip_longest(*t[2:].pop(), fillvalue=DF_FILL_VALUE))
    df = pd.DataFrame(data_cols)

    # get check-status
    ##################
    n_cols = set()
    n_rows = set()
    rs = []
    for t, r in page.get_bboxlog():  # https://pymupdf.readthedocs.io/en/latest/functions.html#Page.get_bboxlog
        if t == 'stroke-path':
            r = fitz.IRect(*r)
            a = f"{r.get_area():.1f}"

            # you need to do some pre-processing to get the "right" values
            if a in {"196.0", "210.0"}:
                # sample only the central third of the box, away from its border
                w, h = r.width/3, r.height/3
                subr = fitz.Rect(r.x0+w, r.y0+h, r.x1-w, r.y1-h)

                pix = page.get_pixmap(colorspace="gray", clip=subr, annots=False)
                is_r_checked = pix.is_unicolor is False

                n_cols.add(r.y0)
                n_rows.add(r.x0)

                rs.append(((r.y0, r.x0), is_r_checked))

    # replace the coordinates with grid indices
    rs.sort()
    n_cols = sorted(n_cols)
    n_rows = sorted(n_rows)
    rs = [((n_rows.index(y), n_cols.index(x)), v) for (x, y), v in rs]
    d = dict(rs)

# final df
##########
df_checked = pd.Series(d).unstack(level=1).T

mask = df_checked.fillna(False).astype(bool)

p = df + np.where(mask, CHECKED_STATUS_MSG, "")
pdf = pd.DataFrame(p)
pdf.columns = header

print(pdf)

Output

             I           II           III
      Column 1     Column 2      Column 3
0  Item 1 [OK]  Item 4 [OK]  Item 10 [OK]
1       Item 2       Item 5       Item 11
2       Item 3       Item 6       Item 12
3               Item 7 [OK]  Item 13 [OK]
4                    Item 8       Item 14
5                    Item 9  Item 15 [OK]
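
The check-status step can be hard to follow in one pass, so here is a minimal, library-free sketch of its two ideas: shrinking each box to its central third before sampling it, and turning box coordinates into grid indices. The helper names inner_third and grid_positions and the sample coordinates are hypothetical; only the arithmetic follows the code above, and the real pixel test is still done by PyMuPDF.

```python
def inner_third(x0, y0, x1, y1):
    """Return the central third of a box, away from its border stroke."""
    w, h = (x1 - x0) / 3, (y1 - y0) / 3
    return (x0 + w, y0 + h, x1 - w, y1 - h)


def grid_positions(boxes):
    """Map each box's top-left corner (x0, y0) to a (row, column) index pair."""
    xs = sorted({x0 for x0, _ in boxes})
    ys = sorted({y0 for _, y0 in boxes})
    return {(ys.index(y0), xs.index(x0)): checked
            for (x0, y0), checked in boxes.items()}


# Hypothetical checkbox corners (x0, y0) and whether each box is ticked.
boxes = {
    (50, 100): True,
    (50, 120): False,
    (200, 100): True,
    (200, 120): True,
}

print(inner_third(50, 100, 62, 112))

status = grid_positions(boxes)
print(status)

# Append the marker to the entries whose box is ticked.
entries = [["Item 1", "Item 4"], ["Item 2", "Item 5"]]
labelled = [
    [text + (" [OK]" if status.get((r, c)) else "") for c, text in enumerate(row)]
    for r, row in enumerate(entries)
]
print(labelled)
```

Sorting the distinct coordinates and using their positions as indices is what lets the boxes line up with the table cells without hard-coding any page geometry.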


If you don't want None for the missing entries, you can specify a default value with the fillvalue argument:

default_value = ''
it.zip_longest(*iterables, fillvalue=default_value)
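
As a quick illustration of the difference, the sketch below pads two columns of unequal length both ways. The sample columns are made up for the example.

```python
from itertools import zip_longest

# Hypothetical columns of unequal length.
columns = [["Item 1", "Item 2", "Item 3"], ["Item 4"]]

# Default padding uses None for the missing entries.
padded_none = list(zip_longest(*columns))

# An explicit fillvalue keeps every entry a string.
padded_empty = list(zip_longest(*columns, fillvalue=""))

print(padded_none)
print(padded_empty)
```

Padding with an empty string also keeps later string concatenation, such as appending a status marker, from failing on None.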