Extracting data from documents with Python is not only fun but also saves a ton of time. Python provides tools for automating such repetitive tasks, along with many libraries that let us interact with documents programmatically. I have multiple scripts that do just that: extract data from hundreds of documents, clean it, and present it in a more useful format. All of this can be automated and done with the click of a button. The alternative would be spending hours scanning through documents manually. Over time, things change. The data we need changes, the structure of the documents we use changes, the goals change. This may require revisiting and updating the scripts, which becomes a bit more challenging if it has been a while since we wrote them. This was the case for me again this week.
I had a project to revisit some data extraction scripts because the structure of the documents they process has changed over time. While everything still worked as expected, tweaking the extraction and processing steps could improve the output. Python has many libraries that deal with PDF documents. pdfplumber is my favorite, and I have used it many times. One feature I hadn't experimented with yet was visual debugging. It is a very simple process, and using it saves a lot of time when writing the actual data extraction code. Sometimes when you extract data from PDFs, the results don’t match what you see on the page. For example, tables might look scrambled or text could be out of order. Visual debugging with pdfplumber lets you see how your code interprets the document so you can fix mistakes quickly.
If you don't have pdfplumber installed yet, install it first with pip install pdfplumber. Extracting text from a PDF document takes only a few lines of code, as shown below.
import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())
The code above gets all the text on the page. However, we may want the text only from specific locations on the page. For this we can use the .crop(bounding_box, relative=False, strict=True) method. Calling it on a page returns a version of that page that includes only the items within the bounding box we provide. The bounding box is a tuple of four coordinates, (x0, top, x1, bottom), measured from the top-left corner of the page. I just create a helper function like the one below to crop the areas I need. All we need to do is figure out our bounding box coordinates.
def get_rect_text(page, bounding_box):
    # bounding_box is an (x0, top, x1, bottom) tuple in page coordinates
    text = page.crop(bounding_box).extract_text().split('\n')
    return text
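One way to avoid hard-coding point values is to describe the crop area as fractions of the page size and convert them into the (x0, top, x1, bottom) tuple that .crop() expects. The sketch below is only an illustration: fractional_bbox and crop_text_lines are hypothetical helper names, not part of pdfplumber, and the numbers in the note after it are made up.

```python
def fractional_bbox(page_width, page_height, left, top, right, bottom):
    """Convert fractions of the page (0.0 to 1.0) into an (x0, top, x1, bottom) tuple."""
    for value in (left, top, right, bottom):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be between 0.0 and 1.0")
    if left >= right or top >= bottom:
        raise ValueError("left must be less than right, and top less than bottom")
    return (left * page_width, top * page_height,
            right * page_width, bottom * page_height)


def crop_text_lines(page, left, top, right, bottom):
    """Crop a pdfplumber page by fractions and return its text as a list of lines."""
    bbox = fractional_bbox(page.width, page.height, left, top, right, bottom)
    text = page.crop(bbox).extract_text() or ""  # treat an empty area as no text
    return text.split("\n") if text else []
```

For a page that is 612 by 792 points, fractional_bbox(612, 792, 0.0, 0.0, 0.5, 0.25) gives (0.0, 0.0, 306.0, 198.0), which is the left half of the top quarter of the page. Fractions like these tend to survive small layout shifts better than guessed point values, but they still need checking against the actual documents.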
We can guess roughly where x0, top, x1, and bottom are and play with the numbers until we get what we need. But this can introduce errors later, and trying different numbers is a very boring process. Alternatively, we can use the visual debugging features pdfplumber provides to see where things are. The simplest way is to draw horizontal and vertical lines at known coordinates, kind of creating a grid, which makes figuring out the numbers super simple. Plugging these numbers in, we can crop any area we need, and repeat the same process for all the pages and documents as needed.
def pdf_draw_lines(filename):
    with pdfplumber.open(filename) as pdf:
        # Number the output images starting from 1, one per page
        for count, page in enumerate(pdf.pages, start=1):
            page_img = page.to_image(resolution=250)
            # Each line runs from (x, 0) to (x, 800), so these are vertical guides
            page_img.draw_line(((60, 0), (60, 800)), stroke='red', stroke_width=1)
            page_img.draw_line(((63, 0), (63, 800)), stroke='blue', stroke_width=1)
            page_img.draw_line(((110, 0), (110, 800)), stroke='red', stroke_width=1)
            page_img.draw_line(((113, 0), (113, 800)), stroke='blue', stroke_width=1)
            page_img.save(f'/location/doc{count}.png', format="PNG", quantize=True, colors=256, bits=8)
Above is a small function that draws lines on each page of a document and saves the pages locally as images. We can examine these images to get a better understanding of the structure of the document and plan how we will extract and use the data. Drawing horizontal and vertical lines is the simplest way to visually debug the documents. pdfplumber provides much more interesting and powerful ways of accomplishing these tasks. Feel free to visit the pdfplumber documentation for more details.
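As a rough sketch of what some of those other helpers look like, the function below draws an evenly spaced grid with draw_vlines and draw_hlines and outlines every word pdfplumber detects with draw_rects. The helper names grid_positions and save_debug_image, the 50-point step, the resolution, and the output filename are all assumptions chosen for illustration.

```python
def grid_positions(length, step=50):
    """Return evenly spaced integer coordinates from 0 up to (not past) length."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(0, int(length) + 1, step))


def save_debug_image(pdf_path, page_number=0, out_path="debug_page.png", step=50):
    """Save one page as a PNG with a coordinate grid and a box around each word."""
    import pdfplumber  # imported here so grid_positions works without pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        img = page.to_image(resolution=150)
        # Vertical lines at x = 0, step, 2*step, ... and horizontal lines likewise
        img.draw_vlines(grid_positions(page.width, step), stroke="red", stroke_width=1)
        img.draw_hlines(grid_positions(page.height, step), stroke="blue", stroke_width=1)
        # Outline each extracted word so its position can be read off the grid
        img.draw_rects(page.extract_words())
        img.save(out_path, format="PNG")
```

Because the grid lines sit at known multiples of the step, the bounding box for a region can be estimated by counting lines, then refined by looking at the word outlines that fall inside it.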
This didn't work for me right away. I initially got errors complaining that the image-related dependencies needed to render pages were missing on my machine. This wasn't just a pip install. The error message suggested what to install, and the installation took a while to complete. In the end everything worked, except for the .show() method. I didn't need it, since I could just save the images and view them afterwards.
Pdfplumber works great with other Python libraries, like pandas, for handling data. For example, if you extract a table from a PDF, you can turn it into a pandas DataFrame to clean or analyze the data more easily. Debugging visually with pdfplumber first helps you confirm the extraction is correct before you move on to the next steps.
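Here is a minimal sketch of that hand-off, assuming the first row of the extracted table is the header. page.extract_table() returns a list of rows, where each row is a list of cell strings and empty cells can be None. The helper names clean_table_rows and table_to_dataframe are hypothetical.

```python
def clean_table_rows(rows):
    """Strip whitespace from cells, turn None into "", and drop fully empty rows."""
    cleaned = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def table_to_dataframe(rows):
    """Build a DataFrame from extract_table() output, using the first row as the header."""
    import pandas as pd  # imported here so clean_table_rows works without pandas

    cleaned = clean_table_rows(rows)
    if not cleaned:
        return pd.DataFrame()
    header, *body = cleaned
    return pd.DataFrame(body, columns=header)
```

Note that extract_table() returns None when it finds no table on the page, so it is worth checking for that before calling table_to_dataframe. Real tables may also have multi-row headers or merged cells, which this sketch does not handle.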
Pdfplumber is a simple yet powerful tool for working with PDFs. It’s especially useful for beginners because it gives you visual feedback, making it easier to see what’s happening and fix issues. Whether you’re working with text, tables, or images, pdfplumber helps make the process smoother and more reliable.