AppleScript to merge MS Word Section PDFs
Mac OS X, 24 Oct 2006 01:11

If you use the built-in print-to-PDF feature of Mac OS X for an MS Word document with multiple sections, you will get multiple PDFs, one per section. You will then have to combine them manually using a tool like Adobe Acrobat, PDFLab or Combine PDFs.

Hitting the “Preview” button instead of “Save to PDF”, as explained in this tip, might work; unfortunately, it didn’t for me...

Merging the per-section PDFs with the tools mentioned above is admittedly convenient enough and only a minor issue, but it does get old if you have to do it often. (Actually I just wanted something to tinker with :-). I thought this would be a perfect job for Mac OS X’s PDF Workflow feature, so I wrote this little AppleScript. In case you didn’t know, PDF Workflows let you send PDF files from the standard print dialog directly to your own code, which can be written in AppleScript or in UNIX scripting languages like Perl, bash or Python. Apple has some example code.
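To make the mechanism a bit more concrete: a script-based PDF service is just an executable that receives the print job on its command line. The following is a minimal sketch, assuming the argument order given in the header comment of the Python version further down (program name, document title, printing options, spool file path). The `PrintJob` tuple and `parse_job` function are hypothetical names used only for illustration.

```python
import sys
from collections import namedtuple

# Hypothetical container for the arguments a PDF service receives.
PrintJob = namedtuple("PrintJob", "title options spool_pdf")


def parse_job(argv):
    """Split argv into title, options and spool file path.

    Assumes the order <program> <title> <options> <spool file>.
    """
    if len(argv) < 4:
        raise ValueError("expected: <program> <title> <options> <spool pdf>")
    return PrintJob(title=argv[1], options=argv[2], spool_pdf=argv[3])


if __name__ == "__main__" and len(sys.argv) >= 4:
    job = parse_job(sys.argv)
    print("would process %s (%s)" % (job.spool_pdf, job.title))
```

Everything a workflow does after that (copying, merging, renaming) is ordinary file handling on the spool file.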

I compiled the following code in my AppleScript editor and saved it in compiled script form as “~/Library/PDF Services/Merge Word Sections.scpt”. Now I can choose it from the PDF drop down menu in the print dialog. It will churn for a bit and produce a PDF with all sections neatly merged into one file on the desktop. It picks up the output file name from the document automatically, replacing the extension with “.pdf”.


    on open of theFiles
        -- store the name of the temp file created by the printing system
        set theFile to item 1 of theFiles
        set thePosixFile to POSIX path of theFile
        tell application "Finder"
            set parentFolder to (container of theFile) as alias
        end tell

        set cmd to "perl -n -e 'print m#/Title .+ - (.+?)(?:\\.\\w+)?\\)#' '" & thePosixFile & "'"
        set doctitle to do shell script cmd
        set outfile to (POSIX path of (path to desktop)) & doctitle & ".pdf"

        set outfileMissing to (do shell script "test -e '" & outfile & "' || echo missing")
        set outfileAge to (do shell script "perl -e 'print ((-M $ARGV[0]) * (24*60*60))' '" & outfile & "'") as number

        if (outfileMissing is "missing" or outfileAge > 10) then
            do shell script "cp '" & thePosixFile & "' '" & outfile & "'"
        else
            set tempfile to outfile & ".tmp"
            set pdfmerge to "echo f1b88000a1f3d3540030d605bce6380301c33efa8da271861931274a96e052a4e4963ae5a78a2aac1850ca60bd2362f8fbfea9094378a48ca167776667764f03eea5b3ed945364570037675b65c8d806d86b2c61a9e6fe4a9d32c661b4825ee9a42593d0a5127a9aea51516a40c41563e91cdc9166ab37e747e0339ac384e686a370ec988c85965730996b8b2b2c4d2372dd6c86be0e19d0047b134b76a7a247ad836c6095871f56e2c6578fa46bc891a51ebd26ee8a19dae380ca5cf06839818eb46279722697fd6a3323dac1e9c5661583cf49eae5298a21d0683814c0739c881d6aad20991dddf7eb3d47b920b08eaebaae247ed538aee467365f1461863cc65b01e4c5f7ff1f529c82a76618c2d7e841760e599bba56baf8a8c162c280cfa1e7f2fb7851aa2c0398df837a87edab66786fdbcd05fbd032872842a49831c01bc036b6c24a875d7a0dbe44ebf93fd1652597e5879b7bf26165c1fee6aeb5db9d52a916a0551d31924ac979259c6b6f8f71e6f7582435200000 | perl -e 'print pack(\"h*\", <>)' | gzip -d | python -"
            set cmd to pdfmerge & " '" & tempfile & "' '" & outfile & "' '" & thePosixFile & "'"
            do shell script cmd
            set cmd to "mv '" & tempfile & "' '" & outfile & "'"
            do shell script cmd
        end if

        tell application "Finder" to delete parentFolder
    end open

Because Word runs a separate print job for each section, each with a fresh invocation of this script, there is no clean way to find out whether a given print job is the first or a later one in a series. The script simply checks for an existing output file that was modified within the last 10 seconds. If it finds one, the current print job is appended to it; otherwise the script creates or overwrites the output file, and the following jobs in the series append to it. This means you don’t have a lot of time to confirm any dialogs Word pops up when it switches sections, for example a warning about narrow page margins.
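The append-or-overwrite decision described above boils down to a single check on the output file's modification time. Here is a minimal sketch of that check in Python; `should_append` and its `window_seconds` parameter are illustrative names, not part of the scripts in this post.

```python
import os
import time


def should_append(outpath, window_seconds=10, now=None):
    """Return True if outpath exists and was modified within the window.

    True means "this job continues a running series, append to the file";
    False means "start over and create or overwrite the file".
    """
    if not os.path.exists(outpath):
        return False
    now = time.time() if now is None else now
    age = now - os.path.getmtime(outpath)
    return age <= window_seconds
```

Making the window a parameter keeps the trade-off visible: a larger value leaves more time to click through Word's dialogs, but also raises the chance that an unrelated print job with the same title gets appended to an older file.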

The script is unfortunately a bit ugly because I embedded a Python script for the actual PDF merge, to make the AppleScript totally self-contained. I originally kept the Python script in my home directory’s “bin” directory as “pdfmerge.py”. In case you’re interested in the Python code alone, this is what it looks like:

    #!/usr/bin/env python
    # Merges multiple PDF files into one.
    # usage:
    #   pdfmerge.py <outfile> <infile1> <infile2> ...
    #
    from CoreGraphics import *
    import os, sys, getopt

    outfile = sys.argv[1]
    page_rect = CGRectMake(0, 0, 100, 100)
    c = CGPDFContextCreateWithFilename(outfile, page_rect)
    for arg in sys.argv[2:]:
        pdf = CGPDFDocumentCreateWithProvider(CGDataProviderCreateWithFilename(arg))
        if pdf:
            for page in range(1, pdf.getNumberOfPages() + 1):
                rect = pdf.getMediaBox(page)
                c.beginPage(rect)
                c.drawPDFDocument(rect, pdf, page)
                c.endPage()
    c.finish()
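For the curious, the long hex string in the AppleScript is undone by `perl -e 'print pack("h*", <>)' | gzip -d`, which means it is gzip data written as hex with the low nybble of each byte first. Its first digits, `f1b8`, decode to the gzip magic bytes `1f 8b`. The sketch below reproduces that encoding in Python 3. How the blob was originally produced is an assumption on my part, and the function names are made up for illustration.

```python
import gzip


def encode_h(data):
    """Hex-encode bytes low nybble first, like Perl's unpack("h*", ...)."""
    return "".join("%x%x" % (b & 0x0F, b >> 4) for b in data)


def decode_h(text):
    """Inverse of encode_h, like Perl's pack("h*", ...)."""
    text = text.strip()
    if len(text) % 2:
        text += "0"  # an odd trailing digit fills only the low nybble
    return bytes(int(text[i + 1] + text[i], 16) for i in range(0, len(text), 2))


def embed_script(source):
    """Compress a script and return it as a shell-safe hex string."""
    return encode_h(gzip.compress(source))


def extract_script(blob):
    """Recover the original script bytes from the hex string."""
    return gzip.decompress(decode_h(blob))
```

The point of the encoding is that the result contains only hex digits, so it can sit inside an AppleScript string and a shell command without any quoting trouble.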

Now that everything works, let’s rip it all apart and rewrite it completely in Python :-)

I originally wrote this PDF workflow in AppleScript because it used some GUI / user interaction stuff which is no longer present in the version above. Without the need for AppleScript’s GUI features, we can just as well write everything in Python, which is a lot cleaner:

    #!/usr/bin/env python
    # args: <programname> <document title> <printing ticket options> <spool file PDF name>

    from CoreGraphics import *
    import os, sys, re, time, shutil, fcntl

    def main():
        lockfile = file("/tmp/merge_word_sections.lock", "w")
        fcntl.flock(lockfile, fcntl.LOCK_EX)

        outfile = re.compile(r'(\.\w+|$)').sub('.pdf', sys.argv[1])
        outfile = re.compile(r'^Microsoft Word - ').sub('', outfile)
        outpath = os.popen("echo ~/Desktop/" + outfile).readline().rstrip()
        spoolfile = sys.argv[3]

        outfileExists = os.path.exists(outpath)
        outfileAge = outfileExists and time.time() - os.path.getmtime(outpath) or 0

        if not outfileExists or outfileAge > 10:
            shutil.copyfile(spoolfile, outpath)
        else:
            outpathTemp = outpath + ".tmp"
            merge_pdfs(outpathTemp, [outpath, spoolfile])
            shutil.move(outpathTemp, outpath)

    def merge_pdfs(outfile, infiles):
        page_rect = CGRectMake(0, 0, 100, 100)
        context = CGPDFContextCreateWithFilename(outfile, page_rect)
        for infile in infiles:
            pdf = CGPDFDocumentCreateWithProvider(CGDataProviderCreateWithFilename(infile))
            if pdf:
                for page in range(1, pdf.getNumberOfPages() + 1):
                    rect = pdf.getMediaBox(page)
                    context.beginPage(rect)
                    context.drawPDFDocument(rect, pdf, page)
                    context.endPage()
        context.finish()

    if __name__ == '__main__':
        main()

You can store this in a .py file in the same location, e.g. “Merge Word Sections.py” and it will show up in the print dialog PDF services popup menu.
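One detail worth a second look is how the output path is derived from the document title. The script above uses two regular expressions and a round trip through the shell (`echo ~/Desktop/...`), which can get confused by titles containing several dots or shell-special characters. Below is a sketch of an alternative that stays inside Python, written for Python 3 and not tested in an actual print workflow; `output_path` and `WORD_PREFIX` are hypothetical names.

```python
import os
import re

WORD_PREFIX = "Microsoft Word - "


def output_path(title, folder="~/Desktop"):
    """Derive the merged PDF's path from the print job's document title."""
    name = title
    if name.startswith(WORD_PREFIX):
        name = name[len(WORD_PREFIX):]
    base, ext = os.path.splitext(name)
    if not re.fullmatch(r"\.\w+", ext):
        base = name  # no recognisable extension: keep the whole name
    return os.path.join(os.path.expanduser(folder), base + ".pdf")
```

Only the last extension is replaced, so a title like “my.report.doc” becomes “my.report.pdf”, and no shell is involved in expanding the home directory.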

Comments
Posted by Dan Connolly on 2 Dec 2006 07:53

nifty; I had no idea you could merge PDFs by importing CoreGraphics right into python. Thanks!

Posted by Niels on 24 Jan 2007 19:16

Thanks! What a fantastic solution. Had to put the timer on 30secs though because it took over 10secs between some sections I used. Still, that was pretty damn easy due to your explanation. So, once again, many many thanks!

Posted by Harold Kyle on 14 Jun 2007 03:33

This tool is great. Is there any reason why a landscape section sandwiched between two portrait sections would disappear? Any help greatly appreciated.
Thanks,
Harold

Posted by Ian Chua on 3 Aug 2007 15:34

So, if I have 3 Word documents (e.g. f1.doc, f2.doc, and f3.doc), how can I merge them into a single PDF or Word document?
