I have written before about how useful the pdfplumber library is for extracting data from PDF files. Its true power becomes evident when dealing with multiple PDF files that run to hundreds of pages. When you know what you are looking for, don't want to go through hundreds of pages manually, and have to deal with such files on a daily basis, the best thing to do is automate. That is what Python is great at. As the name suggests, pdfplumber works with PDF files and makes it easy to extract data from them. It works best with machine-generated PDF files rather than scanned ones.
When extracting data from PDF files we can use several approaches. If we just need some text, we can start with the simple .extract_text() method. However, pdfplumber also lets us extract the objects in the document by type, such as images, lines, rectangles, curves, and chars, or we can get all of them at once with .objects. Sometimes machine-generated PDF files use lines and rectangles to separate the information on the page. This can help us identify the type of text within those lines or rectangles. I recently came across some financial PDF data formatted this way. Using the locations of these lines and rectangles, we can select the text in a given area with pdfplumber's .crop() method.
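To get a feel for what a page contains before picking an approach, a quick count of each object type can help. The snippet below is only a minimal sketch: summarize_objects is a helper name I made up, and sample_objects is a hand-written stand-in for page1.objects, which is a dictionary that maps an object type to a list of objects.

```python
def summarize_objects(objects):
    """Map each object type to how many objects of that type were found."""
    return {kind: len(items) for kind, items in sorted(objects.items())}


# Hand-written stand-in for page1.objects (assumption: real pages hold
# full object dictionaries, here each object is just an empty dict).
sample_objects = {"rect": [{}, {}], "char": [{}, {}, {}], "line": [{}]}

summary = summarize_objects(sample_objects)
for kind, count in summary.items():
    print(kind, count)
```

With a real file, the call would be summarize_objects(page1.objects).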
First, let's take a look at basic text extraction with pdfplumber.
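import pdfplumber

# Open the PDF; the context manager closes the file when we are done
with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]  # pages is a list, so index 0 is page one
    page1_text = page1.extract_text().split('\n')  # one string per line
    for text in page1_text:
        print(text)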
We open the file with pdfplumber. The .pages property returns a list of the pages in the PDF along with all the data within them. Since it is a list, we can access the pages one by one. In the example above we are only looking at page one. Using the .extract_text() method, we get all the text of page one as one long string. To separate the text line by line, we use .split('\n'). Now that we have a list of the lines of text from page one, we can iterate through it and display each line.
In most cases, this might be all you need. But sometimes you may want to extract these lines of text and retain the layout formatting. To do this, we add the layout=True parameter to the .extract_text() method, like this: page1.extract_text(layout=True).split('\n'). Be careful when using layout=True, because this feature is experimental and not stable yet. It might work in most cases, but sometimes it may return unexpected results.
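If you switch between the two modes often, a small wrapper keeps the line splitting in one place. This is a rough sketch rather than anything pdfplumber ships: page_lines is a hypothetical helper, and FakePage is a stand-in I wrote so the snippet runs without a PDF. Its strings are invented for illustration and are not real pdfplumber output.

```python
def page_lines(page, layout=False):
    """Return the page text as a list of non-blank lines."""
    # Guard against an empty result (assumption: some versions may return None)
    text = page.extract_text(layout=layout) or ""
    return [line.rstrip() for line in text.split("\n") if line.strip()]


class FakePage:
    """Stand-in for a pdfplumber page; the returned text is made up."""

    def extract_text(self, layout=False):
        if layout:
            return "Name        Amount   \n\nRent        900.00   "
        return "Name Amount\nRent 900.00"


plain = page_lines(FakePage())
laid_out = page_lines(FakePage(), layout=True)
print(plain)
print(laid_out)
```

With a real document you would pass pdf.pages[0] instead of FakePage().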
Now that we know how to extract the text from the page, we can apply some string manipulation and regex to get only the data that we actually need. If we know the exact area on the page where our data is located, we can use the .crop() method and extract only that data using the same extraction methods described above.
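As an illustration of the regex step, here is a minimal sketch that keeps only the lines ending in a money amount. Everything in it is an assumption for the sake of the example: the pattern, the find_amounts helper, and the sample lines, which stand in for the list we get from page1.extract_text().split('\n'). Your own documents will need their own pattern.

```python
import re

# Hypothetical pattern: a label, then an amount such as "$1,200.00" or "15.50"
AMOUNT_LINE = re.compile(
    r"^(?P<label>.+?)\s+\$?(?P<amount>\d{1,3}(?:,\d{3})*\.\d{2})$"
)


def find_amounts(lines):
    """Return (label, amount) pairs for lines that end in a money amount."""
    results = []
    for line in lines:
        match = AMOUNT_LINE.match(line.strip())
        if match:
            amount = float(match.group("amount").replace(",", ""))
            results.append((match.group("label"), amount))
    return results


# Made-up lines standing in for the extracted text of a page
sample_lines = [
    "Account Summary",
    "Opening balance  $1,200.00",
    "Fees 15.50",
    "Page 1 of 3",
]

found = find_amounts(sample_lines)
print(found)
```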
The pdfplumber.Page class has properties like .page_number, .width, and .height. We can use the width and height of the page to determine which area we are going to crop. Let's take a look at a code example using .crop().
import pdfplumber

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    # Bounding box is (x0, top, x1, bottom), measured from the top-left corner
    bounding_box = (200, 300, 400, 450)
    crop_area = page1.crop(bounding_box)  # a page limited to the box
    crop_text = crop_area.extract_text().split('\n')
    for text in crop_text:
        print(text)
Once we have our page instance, we call the .crop(bounding_box) method. The result is still a page, but one that only covers the area defined by bounding_box. Think of it as a piece of the page that still behaves like a page, so we can apply other methods like .extract_text() to it.
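Hard-coded numbers like the ones above only fit one page size. Since the page exposes .width and .height, one option is to describe the box as fractions of the page and compute the coordinates. The helper below is a sketch under that assumption; relative_bbox is a name I made up, and the 612 by 792 page size is just the usual US Letter size in points.

```python
def relative_bbox(page_width, page_height, left, top, right, bottom):
    """Turn fractions of the page (0.0 to 1.0) into (x0, top, x1, bottom)."""
    for value in (left, top, right, bottom):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be between 0 and 1")
    if left >= right or top >= bottom:
        raise ValueError("box must have a positive width and height")
    # float() in case the page reports its size as a non-float number type
    width, height = float(page_width), float(page_height)
    return (width * left, height * top, width * right, height * bottom)


# Top half of a US Letter page (612 x 792 points)
top_half = relative_bbox(612, 792, 0.0, 0.0, 1.0, 0.5)
print(top_half)
```

With a real page this becomes page1.crop(relative_bbox(page1.width, page1.height, 0.0, 0.0, 1.0, 0.5)).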
Cropping can be very useful if you know the exact area your text is located in. This feature becomes even more useful when the PDF documents we are working with use lines and rectangles for formatting and separating information. We can extract all the lines and rectangles on the page and get their locations. Using these locations, we can easily identify which area of the page we need to crop. To get the lines on the page we use the .lines property, and to get the rectangles we use the .rects property. To see how many lines we have on the page and what properties a line has, we can run the following code.
import pprint

import pdfplumber

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    lines = page1.lines  # list of dicts, one per line object on the page
    print(len(lines))
    pprint.pprint(lines[0])  # raises IndexError if the page has no lines
The result shows the properties that line objects have, along with their values. Some of them will be useful; others we can ignore.
{'bottom': 130.64999999999998,
 'doctop': 130.64999999999998,
 'evenodd': False,
 'fill': False,
 'height': 0.0,
 'linewidth': 1,
 'non_stroking_color': [0.859],
 'object_type': 'line',
 'page_number': 1,
 'pts': [(18.0, 661.35), (590.25, 661.35)],
 'stroke': True,
 'stroking_color': (0, 0, 0),
 'top': 130.64999999999998,
 'width': 572.25,
 'x0': 18.0,
 'x1': 590.25,
 'y0': 661.35,
 'y1': 661.35}
Which properties to use will depend on the project. In my case I would be using top, bottom, x0, and x1. The top and bottom values are the same in this example because the line is horizontal (its height is 0.0), but I would still read both values in case that is not true for every line in future documents.
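Pulling out just those four keys is a one-liner. The sketch below uses a trimmed copy of the line object printed above; line_position is a hypothetical helper name, not part of pdfplumber. Note that top and bottom are measured from the top of the page, while y0 and y1 are measured from the bottom, which is why top (130.65) and y0 (661.35) add up to a page height of 792 here.

```python
def line_position(line):
    """Keep only the keys needed to locate a line on the page."""
    return {key: line[key] for key in ("x0", "x1", "top", "bottom")}


# Trimmed copy of the line object from the output above
sample_line = {
    "x0": 18.0,
    "x1": 590.25,
    "top": 130.64999999999998,
    "bottom": 130.64999999999998,
    "width": 572.25,
    "height": 0.0,
    "linewidth": 1,
    "object_type": "line",
}

position = line_position(sample_line)
is_horizontal = position["top"] == position["bottom"]
print(position, is_horizontal)
```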
We get the rectangles on the page the same way as we did with lines; we just change the property to .rects. With rects, the top and bottom values will differ, because a rectangle has a height. Now that we have the coordinates of the area we need to crop and extract text from, we just plug the values we get from .lines and .rects into our bounding_box for the .crop() method.
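To make that last step concrete, here is one way it could look when horizontal lines separate sections of a page. This is a sketch under my own assumptions, not the only way to do it: bands_between_lines and its tolerance parameter are made up for this example, and sample_lines is hand-written data shaped like the entries of page1.lines.

```python
def bands_between_lines(lines, tolerance=0.5):
    """Return (x0, top, x1, bottom) boxes for the gaps between horizontal lines."""
    # Horizontal lines have (almost) equal top and bottom values
    horizontal = [ln for ln in lines if abs(ln["bottom"] - ln["top"]) <= tolerance]
    horizontal.sort(key=lambda ln: ln["top"])
    boxes = []
    for upper, lower in zip(horizontal, horizontal[1:]):
        if lower["top"] - upper["bottom"] <= tolerance:
            continue  # lines sit on top of each other, nothing in between
        # Use the horizontal span that both lines share
        x0 = max(upper["x0"], lower["x0"])
        x1 = min(upper["x1"], lower["x1"])
        if x1 > x0:
            boxes.append((x0, upper["bottom"], x1, lower["top"]))
    return boxes


# Hand-written sample shaped like page1.lines (values are illustrative)
sample_lines = [
    {"x0": 18.0, "x1": 590.25, "top": 130.65, "bottom": 130.65},
    {"x0": 18.0, "x1": 590.25, "top": 300.0, "bottom": 300.0},
    {"x0": 50.0, "x1": 50.0, "top": 100.0, "bottom": 400.0},  # vertical, skipped
    {"x0": 18.0, "x1": 590.25, "top": 210.5, "bottom": 210.5},
]

boxes = bands_between_lines(sample_lines)
print(boxes)
```

Each box can then go straight into page1.crop(box).extract_text() to read the text of one section at a time.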
I just started using these features of pdfplumber today, and so far everything is working great and I have not seen any issues yet. If you work with many PDF files to extract data, and these documents have repeating lines and rectangles that separate information, you too may find pdfplumber useful for automating these tasks. Let me know your thoughts and experiences with text extraction from PDF documents in the comments.
Pdfplumber has great documentation. Feel free to visit the GitHub page: https://github.com/jsvine/pdfplumber