Ryan Kanno: The diary of an Enginerd in Hawaii

Everything you've ever thought, but never had the balls to say.


web2email.py – A web to email Python backup script

I’m back, at least for the time being. There’s definitely a calm before the impending storm, but until then, I’m back posting little tidbits of uselessness. Enjoy!

Python goodness

While introducing the concept of automation to a friend of mine, I came across a requirement to archive a series of URLs on a daily basis. Luckily for me, the URLs consisted primarily of plain text. Loading up Vim, I concocted this Python script in a few hours, most of which was spent searching Googs <3.

If you're looking for a true web crawler, this won't be for you, though loading up lxml/Beautiful Soup, cssutils, and a JavaScript parser to determine what artifacts need to be downloaded shouldn’t be all that difficult…

But, I’ll leave that as an exercise for the reader (That’s you, btw!)
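
If you want a head start on that exercise, here's a minimal sketch of the "figure out what else needs downloading" step. Fair warning: it assumes Python 3 and swaps in the standard library's html.parser instead of lxml/Beautiful Soup, it only looks at img, script, and stylesheet link tags, and the names AssetCollector and find_assets are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collects URLs of images, scripts, and stylesheets referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            # Resolve relative paths against the page's own URL
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.assets.append(urljoin(self.base_url, attrs["href"]))


def find_assets(html, base_url):
    """Returns absolute URLs of the assets found in html, in document order."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    return collector.assets
```

Each URL it returns could then be fetched and attached to the email. Anything pulled in by CSS or JavaScript at runtime would still be missed.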

In any case, the following script fetches a URL and emails the page, either through Googs or Webfaction using SMTP-AUTH or through a plain SMTP server of your choosing. Sorta-kinda like having your own Wayback Machine. Cut and paste the following into a neat file called web2email.py.

#! /usr/bin/env python2.5
# -*- coding: utf-8 -*-
#
# Copyright (c) 2008 Ryan Kanno (ryankanno@localkinegrinds.com)
# License: GNU GPLv3
 
import urllib2
 
import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEBase import MIMEBase
from email.MIMEText import MIMEText
from email.Utils import COMMASPACE, formatdate
from email import Encoders
import datetime
 
from optparse import OptionParser
import os, sys, logging
 
__doc__ = """
 
This script retrieves a URL and sends its contents via email to 
a list of recipients.  Typically, this script is run from a cron
job that sends emails to a Gmail account to archive the contents
of a URL.
 
Mail can be sent via normal or authenticated SMTP.  Tested using 
Gmail SMTP (authenticated), Webfaction SMTP (authenticated), and
localhost (normal).
 
Example:
 
Sends the contents of http://www.espn.com to friend@domain.com using your Gmail settings
 
    python web2email.py -u gmail_username \
                        -p gmail_password \
                        -f gmail_username@gmail.com \
                        -r friend@domain.com http://www.espn.com
 
Sends the contents of http://www.espn.com to friend@domain.com using your Webfaction settings
 
    python web2email.py -u webfaction_username \
                        -p webfaction_password \
                        -f webfaction_account@webfaction_domain.com \
                        -s smtp.webfaction.com \
                        -r friend@domain.com http://www.espn.com
 
Sends the contents of http://www.espn.com to friend@domain.com using your local settings
 
    python web2email.py -f your_email@domain.com \
                        -s localhost \
                        --port 25 \
                        -r friend@domain.com http://www.espn.com
"""
 
__author__  = "ryankanno@localkinegrinds.com"
__url__     = "http://blog.localkinegrinds.com"
__version__ = "0.1"
 
USAGE = "usage: %prog [options] url" 
DESC  = __doc__.split('\n\n')[0]
 
def configure_logging(log_level, format='%(asctime)s %(levelname)s %(message)s'):
    logging.basicConfig(level=log_level, format=format)
 
def _validate_options_and_args(parser, options, args):
    logging.debug("Validating options and arguments.")
    # parser.error() prints the usage message and exits with status 2,
    # so no explicit sys.exit() is needed after it.
    if len(args) != 1:
        parser.error("Incorrect number of arguments.  Script expects 1 (URL to backup), but received %i." % len(args))
    elif not options.recipients:
        parser.error("You must include at least one recipient.")
    elif (options.username and options.password is None) or (options.username is None and options.password is not None):
        parser.error("You must include both a username and password.")
    elif not options.from_email:
        parser.error("You must include a valid from email address.")
 
def getPage(url):
    logging.debug("Attempting to retrieve %s" % url)
    try:
        response = urllib2.urlopen(url)
        return response.read()
    except urllib2.HTTPError, e:
        logging.error("HTTPError (%s) occurred retrieving %s" % (e.code, url))
        sys.exit(1)
    except urllib2.URLError, e:
        logging.error("URLError (%s) occurred retrieving %s" % (e.reason, url))
        sys.exit(1)
 
def mail(send_from, send_to, subject, text, content_type, files=[], server='localhost', port=25, username=None, password=None):
 
    def _auth(server, port, username, password):
        # Log the username only; never write the password to the log
        logging.debug("Attempting to send email via %s:%i as user %s." % (server, port, username))
        smtp = smtplib.SMTP(server, port) 
        smtp.ehlo()
        smtp.starttls()
        smtp.ehlo()
        smtp.login(username, password)
        smtp.sendmail(username, send_to, msg.as_string())
        smtp.close()
 
    def _unauth(server, port):
        logging.debug("Attempting to send email via %s:%i" % (server, port))
        smtp = smtplib.SMTP(server, port)
        smtp.sendmail(send_from, send_to, msg.as_string())
        smtp.close()
 
    assert isinstance(send_to, list)
 
    msg=MIMEMultipart()
    msg['From'] = send_from
    msg['To'] = COMMASPACE.join(send_to)
    msg['Date'] = formatdate(localtime=True)
    msg['Subject'] = subject
 
    text = MIMEText(text)
    text.set_type(content_type)
    text.set_param('charset', 'UTF-8')
 
    msg.attach(text)
 
    for f in files:
        part = MIMEBase('application', "octet-stream")
        # Open the attachment named by the loop variable f (not the builtin `file`)
        part.set_payload(open(f, "rb").read())
        Encoders.encode_base64(part)
        part.add_header('Content-Disposition', 'attachment; filename="%s"' % os.path.basename(f))
        msg.attach(part)
 
    if not username and not password:
        _unauth(server, port)
    else:
        _auth(server, port, username, password) 
 
def main():
    parser = OptionParser(usage=USAGE, description=DESC)
 
    parser.add_option("-u", "--username", dest="username", metavar="USER", help="Username to SMTP server")
    parser.add_option("-p", "--password", dest="password", metavar="PWD", help="Password to SMTP server")
    parser.add_option("-s", "--server", dest="server", metavar="SERVER", help="SMTP server (Defaults to Gmail)")
    parser.add_option("--port", dest="port", metavar="PORT", type="int", help="SMTP server port (Defaults to Gmail)")
    parser.add_option("-f", "--from", dest="from_email", metavar="FROM", help="From address")
    parser.add_option("-r", "--recipient", action="append", dest="recipients", metavar="RCPT", type="string", help="Email recipient")
    parser.add_option('-t', '--test', action="store_true", dest="test", metavar="TEST", help="Run tests")
    parser.add_option('-v', '--verbose', action='store_const', dest='log_level', const=logging.DEBUG, help='Verbose output')
    parser.set_defaults(server="smtp.gmail.com", port=587, test=False, log_level=logging.INFO)
    (options, args) = parser.parse_args()
 
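    # Configure logging first: the first logging call otherwise installs a
    # default handler, after which basicConfig() (and so -v) has no effect.
    configure_logging(options.log_level)
    _validate_options_and_args(parser, options, args)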
 
    if options.test:
        _test() # Too lazy to write a test for this script.  @TODO - use mocks 
 
    # Retrieve URL and return html
    html = getPage(args[0])
 
    # Send mail with returned html as body 
    mail(options.from_email, options.recipients, 
         '%s @ %s' % (args[0], (datetime.datetime.now().strftime("%A %B %d %I:%M:%S %p %Y"))), 
         html, 'text/html', 
         server=options.server, port=options.port, username=options.username, password=options.password)
 
    # Return with appropriate exit code
    sys.exit(0)
 
def _test():
    import doctest
    doctest.testmod(sys.modules[__name__])
 
if __name__ == '__main__':
    main()
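
If you're on Python 3, the script above won't run as-is (urllib2 and the email.MIME* module paths are Python 2 names). Here's a rough sketch, not a tested port, of how the message-building and sending halves of mail() might look with the standard library's EmailMessage. The names build_message and send are mine and aren't part of web2email.py.

```python
import smtplib
from email.message import EmailMessage
from email.utils import formatdate


def build_message(send_from, send_to, subject, html):
    """Builds an HTML email with the same headers mail() sets."""
    msg = EmailMessage()
    msg["From"] = send_from
    msg["To"] = ", ".join(send_to)
    msg["Date"] = formatdate(localtime=True)
    msg["Subject"] = subject
    # subtype="html" yields a text/html body encoded as UTF-8
    msg.set_content(html, subtype="html", charset="utf-8")
    return msg


def send(msg, server="localhost", port=25, username=None, password=None):
    """Sends msg, switching to STARTTLS plus login when credentials are given."""
    with smtplib.SMTP(server, port) as smtp:
        if username and password:
            smtp.starttls()
            smtp.login(username, password)
        smtp.send_message(msg)
```

Usage would be along the lines of send(build_message(from_email, recipients, subject, html), server="smtp.gmail.com", port=587, username=user, password=pwd).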

All right stop, cron time! (imagine a 90’s pop song)

As an added bonus, you can install this script to run via cron so you’ll magically end up with webpages archived in your inbox! Neat. You can read my previous post on cron, or you can create the following crontab.

MAILTO=ryankanno@CHANGE_TO_YOUR_EMAIL.com
# minute (0-59),
# |      hour (0-23),
# |      |       day of the month (1-31),
# |      |       |       month of the year (1-12),
# |      |       |       |       day of the week (0-6 with 0=Sunday).
# |      |       |       |       |       commands
  0      0       *       *       *      /usr/bin/python2.5 /PATH/TO/web2email.py -u GMAIL_USER -p GMAIL_PWD -f FROM_USER -r RECIPIENT URL

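As a side note, don’t forget to put double quotes around the URL if it contains spaces or characters the shell treats specially (such as & or ?). Also, cron turns an unescaped % in a command into a newline, so write any % in the URL as \%.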

Note: change ryankanno@CHANGE_TO_YOUR_EMAIL.com to your email address (or comment the line out with a # if you don’t want emails sent to you), GMAIL_USER to your Google username, GMAIL_PWD to your Google password, FROM_USER to the from address in the mail header, RECIPIENT to the recipient email address, and URL to the URL you want backed up.

I know, I know. The critics.

The critics will say that your Gmail username and password are in cleartext. I know. They are. So… I’m hoping that since you just need an archive of a publicly available URL on the Internets, the data doesn’t need to be super-duper-Fort-Knox-protected. If it does, this script isn’t for you. :( Oh, yeah, before I forget… here’s a hint… *cough*create another Google account*cough*. With that said, archive to your heart’s content!
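
If the cleartext password on the command line still bugs you, one small step (a sketch, assuming Python 3) is to let the script pick the password up from an environment variable instead of -p. This doesn't encrypt anything. It only keeps the password off the command line, where other users on the same box can usually see it via ps. The variable name WEB2EMAIL_PASSWORD and the function resolve_password are hypothetical and aren't something the script supports today.

```python
import os


def resolve_password(option_value, env_var="WEB2EMAIL_PASSWORD", environ=None):
    """Prefers a password from the environment over one passed with -p."""
    # environ is injectable so the lookup can be exercised without touching os.environ
    environ = os.environ if environ is None else environ
    return environ.get(env_var) or option_value
```

In main(), this would amount to passing password=resolve_password(options.password) to mail(); the username/password validation would need the same treatment.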

Enjoy!


  1. This is cool, is there a way to add user/password authentication for the URL as well?

  2. @Vudu12 – Interesting… are you referring to Basic Auth or another type of authentication? I’ve never had the need for Basic Auth before, but this script is using urllib2, so a little patch to it would be quite easy. Here are a few links showing how to use Basic Auth in urllib2:

    http://www.voidspace.org.uk/python/articles/urllib2.shtml#id6
    http://www.voidspace.org.uk/python/articles/authentication.shtml
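
For the Basic Auth question in the comments above, here's roughly what such a patch could look like. It's a sketch under assumptions: it uses Python 3's urllib.request (the successor to urllib2, with the same handler classes), it hasn't been wired into web2email.py, and the function names build_basic_auth_opener and get_page_with_auth are invented for illustration.

```python
import urllib.request


def build_basic_auth_opener(url, username, password):
    """Returns an opener that answers HTTP Basic Auth challenges for url."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # None as the realm means: use these credentials for any realm at this URL
    password_mgr.add_password(None, url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def get_page_with_auth(url, username, password):
    """Fetches url, supplying the credentials if the server asks for them."""
    opener = build_basic_auth_opener(url, username, password)
    with opener.open(url) as response:
        return response.read()
```

The script would also need two new options for the URL's username and password, kept separate from the SMTP ones.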
