Wednesday, January 28, 2015

HTML text to PDF with Beautiful Soup and xtopdf

By Vasudev Ram



Recently, I thought of extracting the text from HTML documents and writing that text to PDF. So I did it :)

Here's how:

"""
HTMLTextToPDF.py
A demo program to show how to convert the text extracted from HTML 
content, to PDF. It uses the Beautiful Soup library, v4, to 
parse the HTML, and the xtopdf library to generate the PDF output.
Beautiful Soup is at: http://www.crummy.com/software/BeautifulSoup/
xtopdf is at: https://bitbucket.org/vasudevram/xtopdf
Guide to using and installing xtopdf: http://jugad2.blogspot.in/2012/07/guide-to-installing-and-using-xtopdf.html
Author: Vasudev Ram - http://www.dancingbison.com
Copyright 2015 Vasudev Ram
"""

import sys
from bs4 import BeautifulSoup
from PDFWriter import PDFWriter

# Note: usage() is not called in this demo; main() converts a built-in HTML string.
def usage():
    sys.stderr.write("Usage: python " + sys.argv[0] + " html_file pdf_file\n")
    sys.stderr.write("which will extract only the text from html_file and\n")
    sys.stderr.write("write it to pdf_file\n")

def main():

    # Create some HTML for testing conversion of its text to PDF.
    html_doc = """
    <html>
        <head>
            <title>
            Test file for HTMLTextToPDF
            </title>
        </head>
        <body>
        This is text within the body element but outside any paragraph.
        <p>
        This is a paragraph of text. Hey there, how do you do?
        The quick red fox jumped over the slow blue cow.
        </p>
        <p>
        This is another paragraph of text.
        Don't mind what it contains.
        What is mind? Not matter.
        What is matter? Never mind.
        </p>
        This is also text within the body element but not within any paragraph.
        </body>
    </html>
    """

    pw = PDFWriter("HTMLTextTo.pdf")
    pw.setFont("Courier", 10)
    pw.setHeader("Conversion of HTML text to PDF")
    pw.setFooter("Generated by xtopdf: http://slid.es/vasudevram/xtopdf")
 
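    # Use method chaining this time. Naming the parser explicitly avoids
    # Beautiful Soup's warning that no parser was specified.
    for line in BeautifulSoup(html_doc, "html.parser").get_text().split("\n"):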
        pw.writeLine(line)
    pw.savePage()
    pw.close()

if __name__ == '__main__':
    main()


The program uses the Beautiful Soup library for parsing and extracting information from HTML, and xtopdf, my Python library for PDF generation.
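One possible refinement, sketched here under assumptions rather than taken from the program above: the text returned by get_text() keeps the indentation and blank lines of the HTML source, so those end up in the PDF as well. The helper below, clean_lines (a hypothetical name, not part of Beautiful Soup or xtopdf), strips each line and drops the empty ones. It uses only plain string methods, so it runs without either library installed.

```python
def clean_lines(text):
    """Return the non-blank lines of text, each stripped of surrounding whitespace."""
    return [line.strip() for line in text.split("\n") if line.strip()]

# Sample input shaped like get_text() output: indented lines and blank lines.
sample = "\n    Test file\n\n        This is a paragraph.\n    "
for line in clean_lines(sample):
    print(line)
```

In the program above, the loop could then iterate over clean_lines(...) applied to the get_text() result instead of splitting on "\n" directly, if a more compact PDF is wanted.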
Run it with:
python HTMLTextToPDF.py
and the output will be in the file HTMLTextTo.pdf.
Screenshot below:


- Vasudev Ram - Python training and programming - Dancing Bison Enterprises

Read more of my posts about Python or read posts about xtopdf (latter is subset of former)


Signup to hear about my new software products or services.


Contact Page

Friday, January 23, 2015

PrettyTable to PDF is pretty easy with xtopdf

By Vasudev Ram




"PrettyTable to PDF is pretty easy with xtopdf."

How's that for some alliteration? :)

PrettyTable is a Python library to help you generate nice tables with ASCII characters as the borders, plus alignment of text within columns, headings, padding, etc.

Excerpt from the site:

[ PrettyTable is a simple Python library designed to make it quick and easy to represent tabular data in visually appealing ASCII tables.
...
PrettyTable lets you control many aspects of the table, like the width of the column padding, the alignment of text within columns, which characters are used to draw the table border, whether you even want a border, and much more. You can control which subsets of the columns and rows are printed, and you can sort the rows by the value of a particular column.

PrettyTable can also generate HTML code with the data in a <table> structure. ]

I came across PrettyTable via this blog post:

11 Python Libraries You Might Not Know.

Then I thought of using it with my PDF creation toolkit, to generate such ASCII tables, but as PDF. Here's a program, PrettyTableToPDF.py, that shows how to do that:
"""
PrettyTableToPDF.py
A demo program to show how to convert the output generated 
by the PrettyTable library, to PDF, using the xtopdf toolkit 
for PDF creation from other formats.
Author: Vasudev Ram - http://www.dancingbison.com
xtopdf is at: http://slides.com/vasudevram/xtopdf

Copyright 2015 Vasudev Ram
"""

from prettytable import PrettyTable
from PDFWriter import PDFWriter

pt = PrettyTable(["City name", "Area", "Population", "Annual Rainfall"])
pt.align["City name"] = "l" # Left align city names
pt.padding_width = 1 # One space between column edges and contents (default)
pt.add_row(["Adelaide", 1295, 1158259, 600.5])
pt.add_row(["Brisbane", 5905, 1857594, 1146.4])
pt.add_row(["Darwin", 112, 120900, 1714.7])
pt.add_row(["Hobart", 1357, 205556, 619.5])
pt.add_row(["Sydney", 2058, 4336374, 1214.8])
pt.add_row(["Melbourne", 1566, 3806092, 646.9])
pt.add_row(["Perth", 5386, 1554769, 869.4])
lines = pt.get_string()

pw = PDFWriter('Australia-Rainfall.pdf')
pw.setFont('Courier', 12)
pw.setHeader('Demo of PrettyTable to PDF')
pw.setFooter('Demo of PrettyTable to PDF')
for line in lines.split('\n'):
    pw.writeLine(line)
pw.close()
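
The last three lines of the program are a pattern that recurs in these xtopdf demos: split a multi-line string and hand each line to the writer. A minimal sketch of that pattern as a reusable function follows. The names write_text_lines and ListWriter are hypothetical; ListWriter is only a stand-in that records lines, so the sketch runs without xtopdf installed. The one assumption about the real PDFWriter is the writeLine method already used above.

```python
class ListWriter:
    """Stand-in for PDFWriter (hypothetical); it only records the lines it is given."""
    def __init__(self):
        self.lines = []

    def writeLine(self, line):
        self.lines.append(line)

def write_text_lines(writer, text):
    """Write each line of a multi-line string via writer.writeLine; return the line count."""
    count = 0
    for line in text.split("\n"):
        writer.writeLine(line)
        count += 1
    return count

# A tiny hand-written table in the style of PrettyTable output.
table = "+------+\n| City |\n+------+"
w = ListWriter()
n = write_text_lines(w, table)
print(n, w.lines)
```

With xtopdf available, the same function should accept the pw object from the program above in place of ListWriter, since it relies only on writeLine.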

You can run the program with:
$ python PrettyTableToPDF.py
And here is a screenshot of the output PDF in Foxit PDF Reader:


- Enjoy.


- Vasudev Ram - Dancing Bison Enterprises - Python programming and training


Monday, January 19, 2015

Music video: Dire Straits - Walk of Life

By Vasudev Ram



An old favorite, listened to it again today. The video is good too.

Dire Straits

Dire Straits - Walk of Life:



- Vasudev Ram - Dancing Bison Enterprises

Sunday, January 18, 2015

Manas National Park, Assam, India

By Vasudev Ram


From Wikipedia, the free encyclopedia:

"Manas National Park or Manas Wildlife Sanctuary (Pron:ˈmʌnəs) (Assamese: মানস ৰাষ্ট্ৰীয় উদ্যান) is a National Park, UNESCO Natural World Heritage site, a Project Tiger Reserve, an Elephant Reserve and a Biosphere Reserve in Assam, India. Located in the Himalayan foothills, it is contiguous with the Royal Manas National Park[1] in Bhutan. The park is known for its rare and endangered endemic wildlife."



- Vasudev Ram - Dancing Bison Enterprises



Thursday, January 15, 2015

Publish databases to PDF with PyDAL and xtopdf

By Vasudev Ram


Some days ago, I blogged about pyDAL, a pure Python Database Abstraction Layer.

Today I thought of writing a program to publish database data to PDF, using PyDAL and xtopdf, my open source Python library for PDF creation from other file formats.

(Here is a good online overview about xtopdf, for those new to it.)

So here is the code for PyDALtoPDF.py:
"""
Author: Vasudev Ram
Copyright 2014 Vasudev Ram - www.dancingbison.com
This program is a demo of how to use the PyDAL and xtopdf Python libraries 
together to publish database data to PDF.
PyDAL is at: https://github.com/web2py/pydal/blob/master/README.md
xtopdf is at: https://bitbucket.org/vasudevram/xtopdf
and info about xtopdf is at: http://slides.com/vasudevram/xtopdf or 
at: http://slid.es/vasudevram/xtopdf
"""

# imports
from pydal import DAL, Field
from PDFWriter import PDFWriter

SEP = 60

# create the database
db = DAL('sqlite://house_depot.db')

# define the table
db.define_table('furniture',
    Field('id'), Field('name'), Field('quantity'), Field('unit_price')
)

# insert rows into table
items = (
    (1, 'chair', 40, 50),
    (2, 'table', 10, 300),
    (3, 'cupboard', 20, 200),
    (4, 'bed', 30, 400)
)
for item in items:
    db.furniture.insert(id=item[0], name=item[1], quantity=item[2], unit_price=item[3])

# define the query
query = db.furniture
# the above line shows an interesting property of PyDAL; it seems to 
# have some flexibility in how queries can be defined; in this case,
# just saying db.table_name tells it to fetch all the rows 
# from table_name; there are other variations possible; I have not 
# checked out all the options, but the ones I have seem somewhat 
# intuitive.

# run the query
rows = db(query).select()

# setup the PDFWriter
pw = PDFWriter('furniture.pdf')
pw.setFont('Courier', 10)
pw.setHeader('     House Depot Stock Report - Furniture Division     '.center(SEP))
pw.setFooter('Generated by xtopdf: http://google.com/search?q=xtopdf')

pw.writeLine('=' * SEP)

field_widths = (5, 10, 10, 12, 10)

# print the header row
pw.writeLine(''.join(header_field.center(field_widths[idx]) for idx, header_field in enumerate(('#', 'Name', 'Quantity', 'Unit price', 'Price'))))

pw.writeLine('-' * SEP)

# print the data rows
for row in rows:
    # methinks the writeLine argument gets a little long here ...
    # the first version of the program was taller but thinner :)
    pw.writeLine(''.join(str(data_field).center(field_widths[idx]) for idx, data_field in enumerate((row['id'], row['name'], row['quantity'], row['unit_price'], int(row['quantity']) * int(row['unit_price'])))))

pw.writeLine('=' * SEP)
pw.close()
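
The two long writeLine calls above do the same thing: center each field in its column width and join the results. A minimal sketch of that step as a helper follows. The name format_row is hypothetical (it is not part of PyDAL or xtopdf), and it uses zip instead of enumerate with an index, which pairs fields with widths in the same order.

```python
def format_row(fields, widths):
    """Center each field (converted to str) in its column width and join the columns."""
    return ''.join(str(field).center(width) for field, width in zip(fields, widths))

field_widths = (5, 10, 10, 12, 10)

# Header row and one data row, mirroring the report layout above.
header = format_row(('#', 'Name', 'Quantity', 'Unit price', 'Price'), field_widths)
data = format_row((1, 'chair', 40, 50, 40 * 50), field_widths)
print(header)
print(data)
```

Each writeLine call in the program could then shrink to something like pw.writeLine(format_row(..., field_widths)), which keeps the program taller but thinner.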

I ran it (on Windows) with:
$ py PyDALtoPDF.py 2>NUL
Here is a screenshot of the output in Foxit PDF Reader:


- Enjoy.


- Vasudev Ram - Python programming and training
