Using re to extract unstructured tables from PDF files

Here is the problem: this unstructured table in a PDF file cannot be extracted directly as a table. We can only extract the full text of each page.

My task is to extract the Place ID, Place Name, and Title Details. Only Title Details entries that match a pattern like 00XXX0000 (digits + letters + digits) are kept.
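As a minimal sketch of that filter, assuming a lot/plan code is a run of digits, then letters, then digits with nothing in between. The helper name `find_lot_plans`, the word-boundary anchors, and the sample strings are assumptions made for illustration, not part of the original task description:

```python
import re

# Digits + letters + digits, e.g. "12RP34567". The \b anchors are an added
# assumption: they stop a match from starting or ending inside a longer token.
LOT_PLAN = re.compile(r"\b(\d+)([A-Za-z]+)(\d+)\b")


def find_lot_plans(title_details):
    """Return (lot, plan, lotplan) tuples found in a Title Details string."""
    results = []
    for lot, letters, number in LOT_PLAN.findall(title_details):
        plan = letters + number
        results.append((lot, plan, lot + plan))
    return results


# Hypothetical sample strings, not taken from the PDF.
print(find_lot_plans("12RP34567 and 3SP101"))
print(find_lot_plans("Refer to Queensland Heritage Register Place ID 600545"))
```

A string with digits but no embedded letters, such as the Place ID reference, yields an empty list and is therefore dropped.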

Another issue: the extracted text contains stray \n or \n\n sequences.

The script:

import re, os, PyPDF2
import pandas as pd 

# Specify the path to the PDF file
pdf_path = r"D:\Bingnan_Li\01_Tasks\11_20231109_PDF_reading\Planning_LGA\Fraser Coast Regional Council\DOCSHBCC__3131535_v6_Cover_sheet_of_Local_Heritage_Register_.pdf"

# Extract all the texts from the PDF file page by page
with open(pdf_path, "rb") as file:
    # Create a PDF reader object
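    # PdfReader, .pages and .extract_text() replace the older PdfFileReader,
    # getPage() and extractText() names, which were removed in PyPDF2 3.0
    pdf_reader = PyPDF2.PdfReader(file)
    
    page_text = ""
    
    # Page indices 2 to 6 (zero-based), i.e. the 3rd to the 7th page
    for i in range(2, 7):
        page = pdf_reader.pages[i]
        page_text += page.extract_text()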

a = page_text
# In order to match the text better, we replace the "\n" and "\n \n"
a = a.replace("\n \n", "#####") 
a = a.replace("\n", "") 
# Delete the "*" in the text
a = a.replace("*", "")

# Try to match the text like this
# "#####1#####Howard War Memorial#####Cnr William and#####Steley Streets Howard#####Refer to Queensland Heritage Register Place ID 600545#####A, B, D, E, G#####2##########"
# (###[#]+[\d]{1,3}###[#]+) try to match "#####1#####"
# (.*?) try to match the middle part
# (###[#]+[\d]{1,3}###[#]+) try to match "#####2##########"
# [\d]{1,3} means numbers with 1 digit, 2 digits or 3 digits
pattern = r"(###[#]+[\d]{1,3}###[#]+)(.*?)(###[#]+[\d]{1,3}###[#]+)"

# Create an empty DataFrame
df = pd.DataFrame(columns=["ID", "Heritage Name", "Lot", "Plan", "LotPlan"])   

# Get all the matches
# re.findall() cannot be used here: its matches do not overlap, so the closing delimiter
# of one entry (e.g. "#####2##########") is consumed and the entry that starts with it is missed
# So each time we find only the first match, then cut the string so the next search starts further to the right
# Repeating this gives us all the matches
while True:
    match = re.search(pattern, a)
    if not match: break 
    print(match.groups()[0], match.groups()[1])
    
    # From the Title Details, we need to match the lot and the plan
    pattern_2 = r"([0-9]+)([a-zA-Z]+)([0-9]+)"
    matches_2 = re.findall(pattern_2, match.groups()[1])
    
    for m_2 in matches_2:
        # Add this information in to the DataFrame
        df.loc[len(df)] = [match.groups()[0].replace("#", ""), 
                           match.groups()[1].split("#####")[0], 
                           m_2[0], 
                           m_2[1] + m_2[2], 
                           m_2[0]+m_2[1]+m_2[2]]

    # Resume at the closing delimiter (group 3) so that it can open the next match
    a = a[match.start(3):]
    
# drop_duplicates() returns a new DataFrame, so assign the result back
df = df.drop_duplicates()
df.index = range(len(df))
df
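
The slice-and-search loop above works, but the same effect can probably be had in one pass: if the closing delimiter is wrapped in a lookahead, it is checked without being consumed, so it can still open the next record and re.finditer() no longer skips entries. This is a minimal sketch of that alternative, not the method used in the script. It assumes the text has already been normalised to the "#####" form; the helper name `split_records` is made up, and the second record in the sample string is invented for illustration:

```python
import re

# Delimiter: 4+ hashes, a 1-3 digit ID, 4+ hashes (same shape as in the script).
DELIM = r"###[#]+(\d{1,3})###[#]+"
# The closing delimiter sits in a lookahead, so it is not consumed and can
# serve as the opening delimiter of the next record.
RECORD = re.compile(DELIM + r"(.*?)(?=###[#]+\d{1,3}###[#]+)")


def split_records(text):
    """Return (id, body) pairs for every record that is followed by another delimiter."""
    return [(m.group(1), m.group(2)) for m in RECORD.finditer(text)]


sample = (
    "#####1#####Howard War Memorial#####Cnr William and#####Steley Streets Howard"
    "#####Refer to Queensland Heritage Register Place ID 600545#####A, B, D, E, G"
    "#####2##########Example Hall#####12RP34567"  # hypothetical second record
    "#####3#####"
)

for record_id, body in split_records(sample):
    # The first "#####"-separated field of the body is the place name.
    print(record_id, body.split("#####")[0])
```

As with the original loop, the last record is only captured if another delimiter follows it, so a sentinel delimiter may need to be appended to the text.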
