Merging Files with Python
Below is an example of how multiple PDF documents, containing a varying number of pages, can be merged together into one file, with all page ones together, followed by all page twos and so on. For this to work the third-party module PyPDF2 must be installed.
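To make the target page order concrete, here is a minimal sketch that works on page counts only, with no PDF library involved. The function name `merge_order` and the sample file names are hypothetical and chosen purely for illustration.

```python
def merge_order(page_counts):
    """Return (name, page_number) pairs in merged order.

    page_counts is a list of (name, number_of_pages) tuples.
    Page numbers in the result are 1-based for readability.
    """
    max_pages = max((pages for _, pages in page_counts), default=0)
    order = []
    for page_index in range(max_pages):
        for name, pages in page_counts:
            # Skip files that are too short to have this page
            if page_index < pages:
                order.append((name, page_index + 1))
    return order


# Hypothetical input: three PDFs with 2, 1 and 3 pages
sample = [('a.pdf', 2), ('b.pdf', 1), ('c.pdf', 3)]
print(merge_order(sample))
# [('a.pdf', 1), ('b.pdf', 1), ('c.pdf', 1), ('a.pdf', 2), ('c.pdf', 2), ('c.pdf', 3)]
```

Shorter files simply drop out once their pages are used up, so the longest file supplies the final pages on its own.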
First, the file path is set and checked to make sure it exists, followed by a second check that the directory contains files to merge. The files are then processed one by one: any file without a ‘.pdf’ extension is ignored, and for each remaining PDF the number of pages is stored in a list along with the file name. During this pass, the largest page count of any single file is also recorded. The list and the maximum page count then drive the merge: for each page position, each file is checked to see whether it actually has that page, which is what allows PDFs of different lengths to be merged. A confirmation message states how many files have been merged. Finally, ‘try-except’ blocks handle errors with opening, saving and closing files.
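The file-selection step can be tried on its own using only the standard library. The sketch below is an illustration under stated assumptions: the helper name `find_pdf_names` and the sample file names are hypothetical, and the sorting is an addition (the script does not sort, and `os.listdir` returns names in arbitrary order).

```python
import os
import tempfile


def find_pdf_names(folder):
    """Return the names of candidate PDFs in folder, sorted by name."""
    if not os.path.isdir(folder):
        raise FileNotFoundError('File path does not exist: ' + folder)
    names = []
    for name in os.listdir(folder):
        # The same filter the script applies: '.pdf' extension, no '~' temp files
        if name.endswith('.pdf') and not name.startswith('~'):
            names.append(name)
    # Sorting is an assumption added here so the result is repeatable
    return sorted(names)


# Usage with a throwaway folder and hypothetical, empty files
with tempfile.TemporaryDirectory() as folder:
    for name in ('b.pdf', 'a.pdf', '~lock.pdf', 'notes.txt'):
        open(os.path.join(folder, name), 'wb').close()
    found = find_pdf_names(folder)

print(found)  # ['a.pdf', 'b.pdf']
```

Because the files within each page group are merged in the order they are listed, sorting the names is one way to make the merged order predictable from run to run.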
# Import required modules
import os
import PyPDF2

# Note: PdfFileReader, PdfFileWriter, getNumPages, getPage and addPage are the
# PyPDF2 1.x/2.x names; PyPDF2 3.0 and later replace them (PdfReader, PdfWriter).

# File path
filePath = 'c:\\demo'

# Name of the combined output file
outputName = 'combined.pdf'

# Check to see if the file path exists
if os.path.exists(filePath):
    # Change the current working directory
    os.chdir(filePath)
    # Check if there are any files in the chosen directory
    if len(os.listdir(filePath)) == 0:
        print('There are no files to merge.')
    else:
        # List for file information
        files = []
        # Maximum number of pages
        maxPages = 0
        # Process the files at the path to get information about each file
        for filename in os.listdir(filePath):
            # Check if the file is a PDF document, excluding temp files
            # and the output of a previous run
            if (filename.endswith('.pdf') and not filename.startswith('~')
                    and filename != outputName):
                try:
                    # Open the current PDF and assign it to a reader object;
                    # the 'with' block closes the file afterwards
                    with open(filename, 'rb') as pdfFile:
                        pdfReader = PyPDF2.PdfFileReader(pdfFile)
                        numPages = pdfReader.getNumPages()
                    # Assign the number of pages to the maximum if greater
                    # than current value
                    if numPages > maxPages:
                        maxPages = numPages
                    # Add the file information to the list
                    files.append((filename, numPages))
                except PermissionError:
                    # Message confirming the file could not be merged
                    print('The file "' + filename + '" cannot be merged.')
        # If there are PDFs to merge, process them
        if maxPages > 0 and len(files) > 1:
            # Writer object for new combined PDF
            pdfWriter = PyPDF2.PdfFileWriter()
            # Source files stay open until the combined PDF has been written
            openFiles = []
            try:
                # Open each PDF once and assign it to a reader object
                readers = []
                for file, pages in files:
                    pdfFile = open(file, 'rb')
                    openFiles.append(pdfFile)
                    readers.append((PyPDF2.PdfFileReader(pdfFile), pages))
                # Combine PDFs into one file using the file information list
                # Put all page 1s together, then all page 2s and so on
                for pageIndex in range(0, maxPages):
                    for pdfReader, pages in readers:
                        # Check if the current file has the desired page to merge
                        if pageIndex < pages:
                            # Add the page to the new PDF
                            pdfWriter.addPage(pdfReader.getPage(pageIndex))
                # Open a new PDF file in write binary mode and write the
                # PDF object to it
                with open(outputName, 'wb') as pdfCombined:
                    pdfWriter.write(pdfCombined)
                # Feedback that the file merge has been successful
                print(str(len(files)) + ' files have been merged successfully.')
            except PermissionError:
                # Display a message stating the merge was unsuccessful
                print('The file merge was unsuccessful.')
            finally:
                # Close the source PDF files to clean up
                for pdfFile in openFiles:
                    pdfFile.close()
        else:
            # Message to state there are no files to merge
            print('There are no files to merge.')
else:
    # Display a message stating that the file path does not exist
    print('File path does not exist.')
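The script is written against the older PyPDF2 method names. From PyPDF2 3.0 onwards those are replaced by `PdfReader`, the `pages` attribute and `PdfWriter.add_page`. As a hedged sketch, the round-robin step might look like this with the newer names. The function `interleave_pages` is hypothetical, and the two stand-in classes are not part of PyPDF2; they exist only so the logic can be run without any PDF files.

```python
def interleave_pages(readers, writer):
    """Add all page 1s, then all page 2s, and so on, to writer.

    readers: objects with a sequence-like 'pages' attribute
    writer: object with an 'add_page(page)' method
    Returns the number of pages added.
    """
    max_pages = max((len(r.pages) for r in readers), default=0)
    added = 0
    for page_index in range(max_pages):
        for reader in readers:
            # Only take the page if this reader is long enough
            if page_index < len(reader.pages):
                writer.add_page(reader.pages[page_index])
                added += 1
    return added


# Stand-ins for illustration only (assumptions, not PyPDF2 classes)
class FakeReader:
    def __init__(self, pages):
        self.pages = pages


class FakeWriter:
    def __init__(self):
        self.added = []

    def add_page(self, page):
        self.added.append(page)


writer = FakeWriter()
count = interleave_pages([FakeReader(['a1', 'a2']), FakeReader(['b1'])], writer)
print(count, writer.added)  # 3 ['a1', 'b1', 'a2']
```

With PyPDF2 installed, the readers would be `PyPDF2.PdfReader(stream)` objects and the writer a `PyPDF2.PdfWriter()`, with each source stream kept open until `writer.write(...)` has finished.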