How to Merge PDF Files with PDFBox in Java?

Merging PDF files in Java is made easier with Apache PDFBox.

The code below illustrates how to sort all PDF files found in a particular directory by their last modified date (newest first) and merge them:

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.util.PDFMergerUtility;

public class test {
 public static void main(String[] args) throws IOException, COSVisitorException {
  int maxPdf = 1000; // max number of PDFs to merge (0 = no limit); lower it when troubleshooting java.lang.OutOfMemoryError
  String pdfDirPath = "C:\\circulars_and_notices\\pdfs\\port_marine_notices";

  File pdfDir = new File(pdfDirPath);
  if (pdfDir.isDirectory()) {
   // proceed to crawl thru the folder and merge the pdf according to last mod date
   File[] pdfs = pdfDir.listFiles();
   int cnt = pdfs.length;

   if (cnt > 0) {
    // sort the pdfs by last mod date in desc order
    Arrays.sort(pdfs, new Comparator<File>() {
     public int compare(File f1, File f2) {
      return Long.compare(f2.lastModified(), f1.lastModified());
     }
    });

    if (maxPdf != 0 && maxPdf < cnt) {
     cnt = maxPdf;
    }

    // create the merger that will collect the source files
    PDFMergerUtility pdfMerger = new PDFMergerUtility();

    // set destination
    pdfMerger.setDestinationFileName("C:\\mergeDocs.pdf");

    // add in pdf source
    for (int i = 0; i < cnt; i++) {
     File file = pdfs[i];
     pdfMerger.addSource(file);
    }

    // merge pdfs
    pdfMerger.mergeDocuments();

   } else {
    System.out.println("Target directory is empty.");
   }
  } else {
   System.out.println("Target is not a directory (" + pdfDirPath + ").");
  }
 }
}
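
One caveat with the listing above: File.listFiles() returns every entry in the directory, including subfolders and non-PDF files, and it returns null if the directory cannot be read. Below is a minimal sketch of a helper that keeps only regular files ending in ".pdf" and sorts them newest first. The class and method names (PdfLister, listPdfsNewestFirst) are hypothetical and not part of PDFBox; it uses plain java.io only and assumes Java 8 or later for the lambdas.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox.
class PdfLister {

    // Returns the regular files in 'dir' whose names end in ".pdf" (case-insensitive),
    // sorted by last modified date in descending order.
    // Returns an empty array if 'dir' does not exist or cannot be listed.
    static File[] listPdfsNewestFirst(File dir) {
        File[] pdfs = dir.listFiles(
            f -> f.isFile() && f.getName().toLowerCase(Locale.ROOT).endsWith(".pdf"));
        if (pdfs == null) {
            return new File[0]; // listFiles() returns null on I/O error or non-directory
        }
        Arrays.sort(pdfs, (f1, f2) -> Long.compare(f2.lastModified(), f1.lastModified()));
        return pdfs;
    }
}
```

The returned array could stand in for the result of pdfDir.listFiles() in the listing above, which also removes the need for a separate null check.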

The PDFMergerUtility approach above works fine in most scenarios. But if you are merging large PDF files, as in my case, the chances of hitting a java.lang.OutOfMemoryError are high. A quick fix is to increase the JVM heap size, but that is only a temporary solution.

Luckily, PDFBox offers another way of merging PDFs: it can store the PDF streams in a temporary file instead of keeping them in memory. The code below illustrates this:

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Date;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFMergerUtility;

public class test {
 public static void main(String[] args) throws IOException, COSVisitorException {
  int maxPdf = 1500; // max number of PDFs to merge (0 = no limit); lower it when troubleshooting java.lang.OutOfMemoryError
  String pdfDirPath = "C:\\circulars_and_notices\\pdfs\\port_marine_notices";

  File pdfDir = new File(pdfDirPath);
  if (pdfDir.isDirectory()) {
   Date startTime = new Date();
   System.out.println("Start time: " + startTime.toString());

   // proceed to crawl thru the folder and merge the pdf according to last mod date
   File[] pdfs = pdfDir.listFiles();
   int cnt = pdfs.length;

   if (cnt > 0) {
    // sort the pdfs by last mod date in desc order
    Arrays.sort(pdfs, new Comparator<File>() {
     public int compare(File f1, File f2) {
      return Long.compare(f2.lastModified(), f1.lastModified());
     }
    });

    if (maxPdf != 0 && maxPdf < cnt) {
     cnt = maxPdf;
    }

    // create a temp file for temp pdf stream storage
    String tempFileName = (new Date()).getTime() + "_temp";
    File tempFile = new File("C:\\" + tempFileName);

    // proceed to merge
    PDDocument desPDDoc = null;
    PDFMergerUtility pdfMerger = new PDFMergerUtility();
    try {
     // traverse the files
     boolean hasCloneFirstDoc = false;
     for (int i = 0; i < cnt; i++) {
      File file = pdfs[i];
      PDDocument doc = null;
      try {
       if (hasCloneFirstDoc) {
        doc = PDDocument.load(file);
        pdfMerger.appendDocument(desPDDoc, doc);
       } else {
        desPDDoc = PDDocument.load(file, new RandomAccessFile(tempFile, "rw"));
        hasCloneFirstDoc = true;
       }
      } catch (IOException ioe) {
       System.out.println("Invalid PDF detected: " + file.getName());
       ioe.printStackTrace();
      } finally {
       if (doc != null) {
        doc.close();
       }
      }
     }

     System.out.println("Merging and saving the PDF to its destination");
     desPDDoc.save("C:\\mergeDoc.pdf");

     Date endTime = new Date();
     System.out.println("Process Completed: " + endTime);
     long timeTakenInMs = endTime.getTime() - startTime.getTime(); // getTime() is in milliseconds
     System.out.println("Time taken: " + (timeTakenInMs / 1000) + " secs " + (timeTakenInMs % 1000) + " ms");

    } catch (IOException | COSVisitorException e) {
     e.printStackTrace(); // we ran into this when merging more than 850 PDFs at once
    } finally {
     try {
      if (desPDDoc != null) {
       desPDDoc.close();
      }
     } catch (IOException ioe) {
      ioe.printStackTrace();
     }
     tempFile.delete();
    }
   } else {
    System.out.println("Target directory is empty.");
   }
  } else {
   System.out.println("Target is not a directory (" + pdfDirPath + ").");
  }
 }
}

[21/4/2014] We encountered org.apache.pdfbox.exceptions.COSVisitorException: java.lang.NullPointerException when we tried to merge a large number of PDFs (more than 850) at once. Hence we decided to revise our code to merge the PDFs in smaller batches first, and then merge those batches into one.
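
The batching idea can be sketched roughly as follows. This is a minimal illustration, not our exact revision: the names BatchMerger, BatchMergePlanner, partition and mergeInBatches, as well as the batch_N.pdf intermediate file names, are hypothetical. The actual merge step is passed in as a callback, which in practice could wrap PDFMergerUtility as in the first listing, so the sketch itself only needs the standard library.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback: merges 'sources' (in order) into 'destination'.
// An implementation could wrap PDFMergerUtility and rethrow its checked
// exceptions as unchecked ones.
interface BatchMerger {
    void merge(List<File> sources, File destination);
}

// Hypothetical helper, not part of PDFBox.
class BatchMergePlanner {

    // Splits the files into consecutive batches of at most batchSize, keeping their order.
    static List<List<File>> partition(List<File> files, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<File>> batches = new ArrayList<List<File>>();
        for (int start = 0; start < files.size(); start += batchSize) {
            int end = Math.min(start + batchSize, files.size());
            batches.add(new ArrayList<File>(files.subList(start, end)));
        }
        return batches;
    }

    // Merges each batch into an intermediate file under workDir, then merges the
    // intermediates into dest. Returns the intermediates so the caller can delete them.
    static List<File> mergeInBatches(List<File> files, int batchSize, File workDir,
                                     File dest, BatchMerger merger) {
        List<File> intermediates = new ArrayList<File>();
        int batchNo = 0;
        for (List<File> batch : partition(files, batchSize)) {
            File part = new File(workDir, "batch_" + (batchNo++) + ".pdf");
            merger.merge(batch, part);
            intermediates.add(part);
        }
        if (!intermediates.isEmpty()) {
            merger.merge(intermediates, dest); // final pass over the intermediate files
        }
        return intermediates;
    }
}
```

With 5 source files and a batch size of 2, this makes three intermediate merges followed by one final merge. The right batch size depends on the PDFs involved; we have not established a safe number beyond the failure we saw above 850 files.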
