Wednesday, January 28, 2015

HTML text to PDF with Beautiful Soup and xtopdf

By Vasudev Ram



Recently, I thought of getting the text from HTML documents and putting that text to PDF. So I did it :)

Here's how:

"""
HTMLTextToPDF.py
A demo program to show how to convert the text extracted from HTML 
content, to PDF. It uses the Beautiful Soup library, v4, to 
parse the HTML, and the xtopdf library to generate the PDF output.
Beautiful Soup is at: http://www.crummy.com/software/BeautifulSoup/
xtopdf is at: https://bitbucket.org/vasudevram/xtopdf
Guide to using and installing xtopdf: http://jugad2.blogspot.in/2012/07/guide-to-installing-and-using-xtopdf.html
Author: Vasudev Ram - http://www.dancingbison.com
Copyright 2015 Vasudev Ram
"""

import sys
from bs4 import BeautifulSoup
from PDFWriter import PDFWriter

def usage():
    sys.stderr.write("Usage: python " + sys.argv[0] + " html_file pdf_file\n")
    sys.stderr.write("which will extract only the text from html_file and\n")
    sys.stderr.write("write it to pdf_file\n")

def main():

    # Create some HTML for testing conversion of its text to PDF.
    html_doc = """
    <html>
        <head>
            <title>
            Test file for HTMLTextToPDF
            </title>
        </head>
        <body>
        This is text within the body element but outside any paragraph.
        <p>
        This is a paragraph of text. Hey there, how do you do?
        The quick red fox jumped over the slow blue cow.
        </p>
        <p>
        This is another paragraph of text.
        Don't mind what it contains.
        What is mind? Not matter.
        What is matter? Never mind.
        </p>
        This is also text within the body element but not within any paragraph.
        </body>
    </html>
    """

    pw = PDFWriter("HTMLTextTo.pdf")
    pw.setFont("Courier", 10)
    pw.setHeader("Conversion of HTML text to PDF")
    pw.setFooter("Generated by xtopdf: http://slid.es/vasudevram/xtopdf")
 
    # Use method chaining this time.
    for line in BeautifulSoup(html_doc).get_text().split("\n"):
        pw.writeLine(line)
    pw.savePage()
    pw.close()

if __name__ == '__main__':
    main()


The program uses the Beautiful Soup library for parsing and extracting information from HTML, and xtopdf, my Python library for PDF generation.
Run it with:
python HTMLTextToPDF.py
and the output will be in the file HTMLTextTo.pdf.
Screenshot below:


- Vasudev Ram - Python training and programming - Dancing Bison Enterprises

Read more of my posts about Python or read posts about xtopdf (latter is subset of former)


Signup to hear about my new software products or services.


Contact Page

1 comment:

Vasudev Ram said...

My apologies to anyone who sees this post twice via Planet Python or other aggregators / feed readers. I made a last minute edit to the post, adding links to relevant libraries, due to which duplicate posts may be seen.