stackoverflow November 1, 2025 Rep: 961

How to extract table from PDF with boxes into pandas dataframe

Answers (1)

Accepted Answer Available

Accepted Answer

November 1, 2025 Score: 1 Rep: 5,249 Quality: High Completeness: 60%

As mentioned in the comments, you can make your life much easier by working with pymupdf.Page.find_tables. This method takes many arguments that help identify the table properly or extract information useful for recovering the entries.

In the example PDF given in the OP, the table has no horizontal lines separating the values. This affects the detection of the column entries (in this case they come back \n-separated within a cell), so some manual tuning is required.
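To make the \n-separated case concrete, here is a minimal sketch of the splitting step on its own. The cell strings are made up for illustration and do not come from the OP's PDF; they only mimic what an extractor might return when each column is captured as one cell.

```python
# Hypothetical extracted row: each non-empty cell holds a whole column,
# with the entries separated by newlines. None stands for an empty cell.
row = ["Item 1\nItem 2\nItem 3", None, "Item 4\nItem 5"]

# Drop empty cells and split every remaining cell into its entries.
columns = [cell.splitlines() for cell in row if cell]

print(columns)
print([len(col) for col in columns])  # columns may differ in length
```

Because the columns can have different lengths, they still need to be padded before they can be laid out as rows of a DataFrame.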


import itertools as it
import fitz  # PyMuPDF
import numpy as np
import pandas as pd

CHECKED_STATUS_MSG = " [OK]"
DF_FILL_VALUE = ''
path = "your path.pdf"
with fitz.open(path) as doc:
    page = doc[0]
    # get table as df
    #################
    tf = page.find_tables()
    for tb in tf.tables:
        t = []
        for row in tb.extract():
            # each non-empty cell is split into its newline-separated entries
            r = [entry.splitlines() for entry in row if entry]
            if r:
                t.append(r)
        # header (two levels, taken from the first two rows)
        header = []
        header.extend(zip(*t[0]))
        header.extend(zip(*t[1]))
        header = list(zip(*header))
        header = pd.MultiIndex.from_tuples(header)
        # df entries: transpose the columns of the last row, padding short ones
        data_cols = list(it.zip_longest(*t[2:].pop(), fillvalue=DF_FILL_VALUE))
        df = pd.DataFrame(data_cols)
    # get check-status
    ##################
    n_cols = set()
    n_rows = set()
    rs = []
    for t, r in page.get_bboxlog(): # https://pymupdf.readthedocs.io/en/latest/functions.html#Page.get_bboxlog
        if t == 'stroke-path':
            r = fitz.IRect(*r)
            a = f"{r.get_area():.1f}"
            # you need to do some pre-processing to get the "right" values
            if a in {"196.0", "210.0"}:
                # sample the inner third of the box; a ticked box is not unicolor
                w, h = r.width/3, r.height/3
                subr = fitz.Rect(r.x0+w, r.y0+h, r.x1-w, r.y1-h)
                pix = page.get_pixmap(colorspace="gray", clip=subr, annots=False)
                is_r_checked = pix.is_unicolor is False
                n_cols.add(r.y0)
                n_rows.add(r.x0)
                rs.append(((r.y0, r.x0), is_r_checked))
    rs.sort()
    n_cols = sorted(n_cols)
    n_rows = sorted(n_rows)
    # replace the coordinates with their rank in the sorted coordinate lists
    rs = [((n_rows.index(y), n_cols.index(x)), v) for (x, y), v in rs]
    d = dict(rs)
    # final df
    ##########
    df_checked = pd.Series(d).unstack(level=1).T
    mask = df_checked.fillna(False).astype(bool)
    p = df + np.where(mask, CHECKED_STATUS_MSG, "")
    pdf = pd.DataFrame(p)
    pdf.columns = header
    print(pdf)

Output

             I           II           III
      Column 1     Column 2      Column 3
0  Item 1 [OK]  Item 4 [OK]  Item 10 [OK]
1       Item 2       Item 5       Item 11
2       Item 3       Item 6       Item 12
3               Item 7 [OK]  Item 13 [OK]
4                    Item 8       Item 14
5                    Item 9  Item 15 [OK]
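The check-status step relies on turning page coordinates into grid positions by ranking them. The sketch below isolates that idea in plain Python. The coordinates and the names `boxes`, `xs`, `ys` and `grid` are invented for illustration, and the variable naming is simplified compared with the answer's code.

```python
# Hypothetical checkbox records: ((x0, y0) of the box, whether it is ticked).
# The coordinates are made up; real ones would come from the page.
boxes = [
    ((50, 100), True),
    ((50, 120), False),
    ((200, 100), True),
    ((200, 120), True),
]

# Sorted unique coordinates give the column order (x) and row order (y).
xs = sorted({x for (x, _), _ in boxes})
ys = sorted({y for (_, y), _ in boxes})

# Replace each coordinate pair by its (row, column) index in the grid.
grid = {(ys.index(y), xs.index(x)): checked for (x, y), checked in boxes}
print(grid)
```

A dictionary keyed by (row, column) like this is the kind of structure that can then be reshaped into a boolean mask aligned with the table's DataFrame.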

If you don't want None for the missing entries, you can specify a default value with fillvalue:


default_value = ''
it.zip_longest(..., fillvalue=default_value)
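As a small self-contained illustration of the difference, the following sketch pads two made-up columns of unequal length, first with the default and then with an empty string. The data is hypothetical.

```python
import itertools as it

# Made-up columns of unequal length.
columns = [["a", "b", "c"], ["d"]]

# Without fillvalue, the shorter column is padded with None.
padded_none = list(it.zip_longest(*columns))

# With fillvalue, the padding is whatever value you pass in.
padded_empty = list(it.zip_longest(*columns, fillvalue=""))

print(padded_none)
print(padded_empty)
```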
