Creating and Modifying PDF Files in Python

by David Amos May 25, 2020 intermediate python

The PDF, or Portable Document Format, is one of the most common formats for sharing documents over the Internet. PDFs can contain text, images, tables, forms, and rich media like videos and animations, all in a single file.

This abundance of content types can make working with PDFs difficult. There are a lot of different kinds of data to decode when opening a PDF file! Fortunately, the Python ecosystem has some great packages for reading, manipulating, and creating PDF files.

In this tutorial, you’ll learn how to:

  • Read text from a PDF
  • Split a PDF into multiple files
  • Concatenate and merge PDF files
  • Rotate and crop pages in a PDF file
  • Encrypt and decrypt PDF files with passwords
  • Create a PDF file from scratch

Along the way, you’ll have several opportunities to deepen your understanding by following along with the examples.

Extracting Text From a PDF

In this section, you’ll learn how to read a PDF file and extract the text using the PyPDF2 package. Before you can do that, though, you need to install it with pip:

$ python3 -m pip install PyPDF2

Verify the installation by running the following command in your terminal:

$ python3 -m pip show PyPDF2
Name: PyPDF2
Version: 1.26.0
Summary: PDF toolkit
Author: Mathieu Fenniak
License: UNKNOWN
Location: c:\users\david\python38-32\lib\site-packages

Pay particular attention to the version information. At the time of writing, the latest version of PyPDF2 was 1.26.0. If you have IDLE open, then you’ll need to restart it before you can use the PyPDF2 package.
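If you would rather check the version from Python than from the terminal, the sketch below is one way to do it. It assumes Python 3.8 or later, where importlib.metadata is part of the standard library. The helper name installed_version is made up for this illustration and is not part of PyPDF2 or pip.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent.

    Hypothetical helper for illustration only.
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed")
else:
    print("PyPDF2 version:", version)
```

If the printed version differs from 1.26.0, some class and method names used in this tutorial may not match what your installation provides.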

Opening a PDF File

Let’s get started by opening a PDF and reading some information about it. You’ll use the Pride_and_Prejudice.pdf file located in the practice_files/ folder in the companion repository.

Open IDLE’s interactive window and import the PdfFileReader class from the PyPDF2 package:
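A minimal sketch of that import, assuming PyPDF2 1.26.0 is installed as shown earlier. The try/except guard and the None fallback are additions for illustration, so that the snippet reports a missing package instead of stopping with a traceback. In IDLE's interactive window, the single import line is all you need to type.

```python
# PdfFileReader is the reader class exposed by PyPDF2 1.26.0.
try:
    from PyPDF2 import PdfFileReader
except ImportError:
    # Assumption for illustration: fall back to None when PyPDF2 is missing.
    PdfFileReader = None

if PdfFileReader is None:
    print("PyPDF2 is not installed; run: python3 -m pip install PyPDF2")
else:
    print("Imported", PdfFileReader.__name__)
```

Later PyPDF2 releases renamed this class, so on a newer installation the import may succeed while the class itself behaves differently from what this tutorial describes.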

