I am a big fan of personal finance and I always like to keep my books up to date. My favourite accounting software is GnuCash. It’s free, powerful, and lets you import transactions in various established financial interchange formats,
such as QIF (Quicken) and OFX. Unfortunately, some institutions only let you export your monthly statements as M$ Excel,
or worse, PDF.
In my particular case it was AMEX Canada, which only provides monthly statements as downloadable
PDFs. Manually copying the transactions over into GnuCash is not an option for me. I have better things to
do with my time. So I set out to find a way to convert my AMEX statements into a format that GnuCash understands,
with QIF being the least painful one to convert to.
The pain of making sense of PDFs
PDF is an evil format. Even though it is called a document format, it is closer to an image format and carries far less
structure than, for example, XML, HTML, or EPUB. There have been several attempts to parse PDFs
in Python in the past; however, the packages PyPDF and PyPDF2 are completely oblivious to the layout of the PDF. All you
get is a stream of characters (without any spacing or formatting information).
Yusuke Shinyama has a three-part video series explaining how to make sense of the raw format. Also feeling the need
to make sense of PDF data, he developed a Python package called PDFMiner that lets you extract strings and layout
information from PDFs. He has written elaborate documentation explaining the design of his miner.
After a few tries with PyPDF2 I decided to give PDFMiner a chance. Below you find a code snippet that lets you parse
a PDF and get some structured plain-text content out of it.
# Python 3 version; assumes the pdfminer.six fork, which keeps these module paths.
import sys
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser


class MyParser(object):
    def __init__(self, pdf):
        # Layout-aware text converter that writes into an in-memory buffer.
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        laparams = LAParams()
        codec = 'utf-8'
        device = TextConverter(rsrcmgr, retstr,
                               codec=codec,
                               laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # Keep the file open only while the pages are being processed.
        with open(pdf, 'rb') as fp:
            parser = PDFParser(fp)
            document = PDFDocument(parser)
            if not document.is_extractable:
                raise PDFTextExtractionNotAllowed
            for page in PDFPage.create_pages(document):
                interpreter.process_page(page)

        # Hand every extracted line to the handler hook.
        self.records = []
        lines = retstr.getvalue().splitlines()
        for line in lines:
            self.handle_line(line)

    def handle_line(self, line):
        # Override this method to parse individual statement lines.
        self.records.append(line)


if __name__ == '__main__':
    p = MyParser(sys.argv[1])
    print('\n'.join(p.records))
With this sample in hand, it was a piece of cake to develop a simple parsing grammar for the transaction records and dump them into a QIF file that could be imported into GnuCash. Since my QIF implementation was quite elaborate in order to handle all the formatting corner cases, I leave you with the conceptual line-by-line parser shown above to illustrate the approach.
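To give a rough idea of what the parsing and QIF steps could look like, here is a minimal sketch. It is not my actual implementation. The line layout (transaction date, optional posting date, description, amount), the names parse_line and to_qif, and the sign convention (charges on the statement become negative amounts) are all assumptions made for illustration, so expect to adapt the pattern to whatever your own statements produce. The QIF side only uses the basic D (date), T (amount), and P (payee) fields, with ^ closing each record.

```python
import re
from datetime import date
from decimal import Decimal

# Month abbreviations as they are assumed to appear on the statement.
MONTHS = {name: number for number, name in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}

# Hypothetical layout: "<Mon> <day> [<Mon> <day>] <description> <amount>".
# Real statements will differ; adjust the pattern to your own PDF output.
TRANSACTION = re.compile(
    r'^(?P<month>[A-Z][a-z]{2}) (?P<day>\d{1,2})\s+'  # transaction date
    r'(?:[A-Z][a-z]{2} \d{1,2}\s+)?'                  # optional posting date
    r'(?P<payee>.+?)\s+'                              # free-text description
    r'(?P<amount>-?[\d,]+\.\d{2})$'                   # amount, e.g. 1,250.00
)


def parse_line(line, year):
    """Return (date, payee, amount) for a transaction line, else None."""
    match = TRANSACTION.match(line.strip())
    if match is None or match.group('month') not in MONTHS:
        return None
    when = date(year, MONTHS[match.group('month')], int(match.group('day')))
    amount = Decimal(match.group('amount').replace(',', ''))
    return when, match.group('payee'), amount


def to_qif(transactions):
    """Serialize (date, payee, amount) tuples as a QIF credit card register."""
    out = ['!Type:CCard']
    for when, payee, amount in transactions:
        out.append('D' + when.strftime('%m/%d/%Y'))
        # Assumption: statement charges are positive, so flip the sign
        # to record them as outflows of the credit card account.
        out.append('T%s' % -amount)
        out.append('P' + payee)
        out.append('^')  # end of record
    return '\n'.join(out) + '\n'


if __name__ == '__main__':
    sample = [
        'Statement period Mar 1 to Mar 31',
        'Mar 14 Mar 15 COFFEE SHOP TORONTO 4.50',
        'Mar 3 PAYMENT RECEIVED - THANK YOU -1,250.00',
    ]
    parsed = [parse_line(line, 2014) for line in sample]
    print(to_qif([record for record in parsed if record is not None]))
```

In the parser above, handle_line would be the natural place to call something like parse_line and keep only the lines that match. The year has to be supplied separately in this sketch, because the assumed statement lines do not carry one.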
Crossposted from my old blog
Published: 2014-04-26
Updated : 2025-10-04