Testing PDF content with PHP and Behat

If your app generates PDFs, you need a way to test them. Most of the libraries out there (FPDF, TCPDF) build the PDF content in an internal structure before writing it to the file system, so a good way to test it is to inspect that structure just before the rendering step.

Recently, however, because that process is a total pain in the ass, people have switched to tools like wkhtmltopdf or one of its PHP wrappers (phpwkhtmltopdf, snappy), which let you build your pages in HTML/CSS and use a browser engine to render the PDF for you. This technique is a lot more developer friendly, but you lose control over the building process.

So if you’re using one of those tools or just need to test for the existence of some string inside a PDF, here’s how to write a BDD style acceptance test for it using Behat.


Add this to your composer.json, then run composer install
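A minimal sketch of the dependencies, assuming Behat 3, PHPUnit for the assertions, and smalot/pdfparser as the PDF parser library mentioned in the note at the end. The version constraints are assumptions, so adjust them to your project:

```json
{
    "require-dev": {
        "behat/behat": "^3.0",
        "phpunit/phpunit": "^9.0",
        "smalot/pdfparser": "^2.0"
    }
}
```

Once composer install has finished, the Behat executable is available as vendor/bin/behat.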

Initialize Behat

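Running Behat with the --init option creates the initial features directory and a blank FeatureContext class.

If everything worked as expected, your project directory should now contain a features directory with bootstrap/FeatureContext.php inside it, next to composer.json and the vendor directory.

All right, it’s time to create some features. Create a new file inside the features directory; I’ll name mine pdf.feature.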
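As an illustration, a feature with four steps matching the walkthrough below could look like this. The wording of the steps, the scenario, and the sample path samples/invoice.pdf are all hypothetical, so adapt them to your own document:

```gherkin
Feature: PDF content
  In order to trust the documents my app generates
  As a developer
  I need to check what a generated PDF contains

  Scenario: The invoice PDF has the expected pages and text
    Given I have a PDF located at "samples/invoice.pdf"
    When I parse the PDF content
    Then the page count should be 2
    And the PDF should contain "Total due"
```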

Run Behat (I know we didn’t write any testing code yet, just run it, trust me!)

An awesome feature of Behat is that it detects any missing steps and provides you with boilerplate code you can use in your FeatureContext. The output of the last command includes those method definitions.

Cool, right? Copy/paste the method definitions into your FeatureContext.php and let’s get to it, step by step:

Step 1

In this step we only need to make sure the filename we provided is readable, then store it in a class property so we can use it in later steps.
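Here is a minimal sketch of that first step. The step text, the method name and the $pdfPath property are assumptions made for illustration, and the class is kept free of Behat types so the snippet runs on its own (in a Behat 3 project it would also implement Behat\Behat\Context\Context):

```php
<?php

class FeatureContext // in a Behat 3 project: implements \Behat\Behat\Context\Context
{
    /** @var string|null Path of the PDF under test, reused by later steps. */
    private $pdfPath;

    /**
     * @Given I have a PDF located at :path
     */
    public function iHaveAPdfLocatedAt(string $path): void
    {
        // Fail early with a clear message instead of letting the parser choke later.
        if (!is_file($path) || !is_readable($path)) {
            throw new RuntimeException(sprintf('PDF file "%s" is not readable.', $path));
        }

        $this->pdfPath = $path;
    }
}
```

Throwing an exception is enough to make Behat mark the step as failed, so no assertion library is needed here.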

Step 2

The heavy lifting is done here: we need to parse the PDF and store its content and metadata in a usable format.
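The parsing step could be sketched as follows, assuming the PDF parser library is smalot/pdfparser (its Parser::parseFile(), Document::getPages(), Page::getText() and Document::getDetails() calls are used below). The step text and the $pages and $metadata property names are hypothetical:

```php
<?php

use Smalot\PdfParser\Parser;

class FeatureContext // in a Behat 3 project: implements \Behat\Behat\Context\Context
{
    /** @var string|null Set by the previous step. */
    private $pdfPath;

    /** @var string[] Text of each page, in document order. */
    private $pages = [];

    /** @var array Document details such as title, author or page count. */
    private $metadata = [];

    /**
     * @When I parse the PDF content
     */
    public function iParseThePdfContent(): void
    {
        $parser = new Parser();
        $pdf = $parser->parseFile($this->pdfPath);

        // Keep one string per page so later steps can count pages and search text.
        $this->pages = [];
        foreach ($pdf->getPages() as $page) {
            $this->pages[] = $page->getText();
        }

        $this->metadata = $pdf->getDetails();
    }
}
```

Storing plain strings and arrays, rather than the library objects, keeps the later steps independent of the parser.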

Step 3

Since we already know how many pages the PDF contains, this is a piece of cake, so let’s not reinvent the wheel and use PHPUnit assertions.
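A possible version of the page count step, using the static PHPUnit\Framework\Assert::assertCount() assertion. It assumes the parsing step filled a $pages array with one entry per page; the step text and names are illustrative:

```php
<?php

use PHPUnit\Framework\Assert;

class FeatureContext // in a Behat 3 project: implements \Behat\Behat\Context\Context
{
    /** @var string[] Filled by the parsing step, one entry per page. */
    private $pages = [];

    /**
     * @Then the page count should be :count
     */
    public function thePageCountShouldBe($count): void
    {
        // Behat passes step arguments as strings, hence the cast.
        Assert::assertCount((int) $count, $this->pages);
    }
}
```

When the assertion fails, PHPUnit throws an exception and Behat reports the step as failed along with the assertion message.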

Step 4

Same approach: we have an array containing the content of every page, so a quick assertion does the trick.


Et voilà! You should have green.
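The content check can be sketched the same way. This version joins the pages into one string and relies on Assert::assertStringContainsString(), which requires PHPUnit 7.5 or later; the step text and the $pages property are assumptions:

```php
<?php

use PHPUnit\Framework\Assert;

class FeatureContext // in a Behat 3 project: implements \Behat\Behat\Context\Context
{
    /** @var string[] Filled by the parsing step, one entry per page. */
    private $pages = [];

    /**
     * @Then the PDF should contain :text
     */
    public function thePdfShouldContain(string $text): void
    {
        // Search the whole document at once; the page a string lands on may vary.
        Assert::assertStringContainsString($text, implode("\n", $this->pages));
    }
}
```

Because extracted text often carries odd whitespace, short and distinctive strings tend to make more reliable expectations than whole sentences.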


Note:

For the purpose of this article, we’re relying on the PDF parser library, which has many encoding and whitespace issues. Feel free to use any PHP equivalent or a system tool like xpdf for better results.

If you want to make your test more decoupled (and you should), one way is to create a PDFExtractor interface and then implement it for each tool you want to use. That way you can easily swap libraries.

The source code behind this article is provided here; any feedback is most welcome.
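One way such an interface might look, as a sketch under stated assumptions: the extractPages() method, both implementing classes and the pdfContains() helper are hypothetical names. The second implementation shells out to xpdf's pdftotext, which must be on the PATH and which ends each page with a form feed character:

```php
<?php

interface PDFExtractor
{
    /** @return string[] Text of each page, in document order. */
    public function extractPages(string $path): array;
}

/** Hypothetical in-memory implementation, handy for testing the steps themselves. */
final class InMemoryPdfExtractor implements PDFExtractor
{
    private $fixtures;

    /** @param array<string, string[]> $fixtures Pages keyed by file path. */
    public function __construct(array $fixtures)
    {
        $this->fixtures = $fixtures;
    }

    public function extractPages(string $path): array
    {
        if (!isset($this->fixtures[$path])) {
            throw new InvalidArgumentException(sprintf('Unknown PDF "%s".', $path));
        }

        return $this->fixtures[$path];
    }
}

/** Hypothetical implementation backed by the pdftotext command line tool. */
final class PdfToTextExtractor implements PDFExtractor
{
    public function extractPages(string $path): array
    {
        // "-" sends the extracted text to stdout instead of a file.
        $output = shell_exec('pdftotext ' . escapeshellarg($path) . ' -');
        if (!is_string($output)) {
            throw new RuntimeException(sprintf('Could not extract text from "%s".', $path));
        }

        // Pages are separated by form feeds; drop the trailing one before splitting.
        return explode("\f", rtrim($output, "\f"));
    }
}

/** The step definitions only depend on the interface, never on a concrete tool. */
function pdfContains(PDFExtractor $extractor, string $path, string $needle): bool
{
    return strpos(implode("\n", $extractor->extractPages($path)), $needle) !== false;
}
```

With this in place, FeatureContext can receive a PDFExtractor and the choice of library becomes a one-line change.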