web2email.py - A web to email Python backup script
Python goodness
While introducing the concept of automation to a friend of mine, I came across a requirement to archive a series of URLs on a daily basis. Luckily for me, the URLs consisted primarily of plain text. Loading up Vim, I concocted this Python script in a few hours - most of which was spent searching Googs <3.
If you’re looking for a true web crawler, this won’t be for you - though loading up lxml/Beautiful Soup, cssutils, and a JavaScript parser to determine what artifacts need to be downloaded shouldn’t be all that difficult…
But I’ll leave that as an exercise for the reader (that’s you, btw!).
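If you want a head start on that exercise, here is a minimal Python 3 sketch. It swaps lxml/Beautiful Soup for the standard library's `html.parser` and only looks at `img`, `script`, and stylesheet `link` tags; it does not follow CSS `url()` references or anything loaded by JavaScript. `AssetCollector` and `find_assets` are hypothetical names, not part of web2email.py.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collects URLs of images, scripts, and stylesheets referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            # Resolve relative paths against the page's own URL.
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            # rel may hold several space-separated tokens.
            if "stylesheet" in (attrs.get("rel") or "").lower().split():
                self.assets.append(urljoin(self.base_url, attrs["href"]))


def find_assets(html, base_url):
    """Return absolute asset URLs in the order they appear in the page."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    return collector.assets
```

Feed it the page you fetched plus the URL it came from, and you get back a list of absolute URLs you would still need to download yourself.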
In any case, the following script fetches a single URL and emails the page to you, either through Googs or Webfaction using SMTP-AUTH, or through a plain SMTP server of your choosing. Sorta-kinda like having your own Wayback Machine. Cut and paste the following into a neat file called web2email.py.
#! /usr/bin/env python2.5
# -*- coding: utf-8 -*-
#
# Copyright (c) 2008 Ryan Kanno (ryankanno@localkinegrinds.com)
# License: GNU GPLv3

import urllib2
import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEBase import MIMEBase
from email.MIMEText import MIMEText
from email.Utils import COMMASPACE, formatdate
from email import Encoders
import datetime
from optparse import OptionParser
import os  # needed for os.path.basename() when attaching files
import sys, logging

__doc__ = """
This script retrieves a URL and sends its contents via email to a list of
recipients. Typically, this script is run from a cron job that sends emails
to a Gmail account to archive the contents of a URL. Mail can be sent via
normal or authenticated SMTP. Tested using Gmail SMTP (authenticated),
Webfaction SMTP (authenticated), and localhost (normal).

Example:
    Sends the contents of http://www.espn.com to friend@domain.com using
    your Gmail settings

    python web2email.py -u gmail_username \
                        -p gmail_password \
                        -f gmail_username@gmail.com \
                        -r friend@domain.com http://www.espn.com

    Sends the contents of http://www.espn.com to friend@domain.com using
    your Webfaction settings

    python web2email.py -u webfaction_username \
                        -p webfaction_password \
                        -f webfaction_account@webfaction_domain.com \
                        -s smtp.webfaction.com \
                        -r friend@domain.com http://www.espn.com

    Sends the contents of http://www.espn.com to friend@domain.com using
    your local settings

    python web2email.py -f your_email@domain.com \
                        -s localhost \
                        --port 25 \
                        -r friend@domain.com http://www.espn.com
"""

__author__ = "ryankanno@localkinegrinds.com"
__url__ = "http://blog.localkinegrinds.com"
__version__ = "0.1"

USAGE = "usage: %prog [options] url"
DESC = __doc__.split('\n\n')[0]


def configure_logging(log_level, format='%(asctime)s %(levelname)s %(message)s'):
    logging.basicConfig(level=log_level, format=format)


def _validate_options_and_args(parser, options, args):
    # parser.error() prints the message and exits with status 2
    # (command line syntax error), so no explicit sys.exit() is needed.
    logging.debug("Validating options and arguments.")
    if len(args) != 1:
        parser.error("Incorrect number of arguments. Script expects 1 "
                     "(URL to backup), but received %i." % len(args))
    elif not options.recipients:
        parser.error("You must include at least one recipient.")
    elif bool(options.username) != bool(options.password):
        parser.error("You must include both a username and password.")
    elif not options.from_email:
        parser.error("You must include a valid from email address.")


def getPage(url):
    logging.debug("Attempting to retrieve %s" % url)
    try:
        response = urllib2.urlopen(url)
        return response.read()
    except urllib2.HTTPError, e:
        logging.error("HTTPError (%s) occurred retrieving %s" % (e.code, url))
        sys.exit(1)
    except urllib2.URLError, e:
        logging.error("URLError (%s) occurred retrieving %s" % (e.reason, url))
        sys.exit(1)


def mail(send_from, send_to, subject, text, content_type, files=None,
         server='localhost', port=25, username=None, password=None):

    def _auth(server, port, username, password):
        # The password is deliberately left out of the log message.
        logging.debug("Attempting to send email via %s:%i as user %s."
                      % (server, port, username))
        smtp = smtplib.SMTP(server, port)
        smtp.ehlo()
        smtp.starttls()
        smtp.ehlo()
        smtp.login(username, password)
        # Use the From address as the envelope sender; the SMTP username
        # is not necessarily an email address.
        smtp.sendmail(send_from, send_to, msg.as_string())
        smtp.close()

    def _unauth(server, port):
        logging.debug("Attempting to send email via %s:%i" % (server, port))
        smtp = smtplib.SMTP(server, port)
        smtp.sendmail(send_from, send_to, msg.as_string())
        smtp.close()

    assert isinstance(send_to, list)

    msg = MIMEMultipart()
    msg['From'] = send_from
    msg['To'] = COMMASPACE.join(send_to)
    msg['Date'] = formatdate(localtime=True)
    msg['Subject'] = subject

    text = MIMEText(text)
    text.set_type(content_type)
    text.set_param('charset', 'UTF-8')
    msg.attach(text)

    # Attach any extra files as base64-encoded binary parts.
    for f in (files or []):
        part = MIMEBase('application', "octet-stream")
        part.set_payload(open(f, "rb").read())
        Encoders.encode_base64(part)
        part.add_header('Content-Disposition',
                        'attachment; filename="%s"' % os.path.basename(f))
        msg.attach(part)

    if not username and not password:
        _unauth(server, port)
    else:
        _auth(server, port, username, password)


def main():
    parser = OptionParser(usage=USAGE, description=DESC)
    parser.add_option("-u", "--username", dest="username", metavar="USER",
                      help="Username to SMTP server")
    parser.add_option("-p", "--password", dest="password", metavar="PWD",
                      help="Password to SMTP server")
    parser.add_option("-s", "--server", dest="server", metavar="SERVER",
                      help="SMTP server (Defaults to Gmail)")
    parser.add_option("--port", dest="port", metavar="PORT", type="int",
                      help="SMTP server port (Defaults to Gmail)")
    parser.add_option("-f", "--from", dest="from_email", metavar="FROM",
                      help="From address")
    parser.add_option("-r", "--recipient", action="append", dest="recipients",
                      metavar="RCPT", type="string", help="Email recipient")
    parser.add_option('-t', '--test', action="store_true", dest="test",
                      metavar="TEST", help="Run tests")
    parser.add_option('-v', '--verbose', action='store_const', dest='log_level',
                      const=logging.DEBUG, help='Verbose output')
    parser.set_defaults(server="smtp.gmail.com", port=587, test=False,
                        log_level=logging.INFO)

    (options, args) = parser.parse_args()

    # Configure logging before the first logging call; otherwise that call
    # sets up the default handler and the -v flag has no effect.
    configure_logging(options.log_level)
    _validate_options_and_args(parser, options, args)

    if options.test:
        _test()  # Too lazy to write a test for this script. @TODO - use mocks

    # Retrieve URL and return html
    html = getPage(args[0])

    # Send mail with returned html as body
    mail(options.from_email,
         options.recipients,
         '%s @ %s' % (args[0], datetime.datetime.now().strftime("%A %B %d %I:%M:%S %p %Y")),
         html,
         'text/html',
         server=options.server,
         port=options.port,
         username=options.username,
         password=options.password)

    # Return with appropriate exit code
    sys.exit(0)


def _test():
    import doctest
    doctest.testmod(sys.modules[__name__])


if __name__ == '__main__':
    main()
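The script above targets Python 2.5, so its `email.MIME*` imports and `except X, e` syntax won't run on Python 3. As a rough sketch of what the mail-building and sending steps might look like there, assuming the standard `email.message.EmailMessage` and `smtplib` APIs (the function names `build_message` and `send` are mine, not the script's):

```python
import smtplib
from email.message import EmailMessage
from email.utils import formatdate


def build_message(send_from, send_to, subject, html):
    """Build a single-part HTML email addressed to a list of recipients."""
    msg = EmailMessage()
    msg["From"] = send_from
    msg["To"] = ", ".join(send_to)
    msg["Date"] = formatdate(localtime=True)
    msg["Subject"] = subject
    # Mark the body as text/html with a UTF-8 charset, like the original.
    msg.set_content(html, subtype="html", charset="utf-8")
    return msg


def send(msg, server="localhost", port=25, username=None, password=None):
    """Send msg; upgrade to TLS and log in only when credentials are given."""
    with smtplib.SMTP(server, port) as smtp:
        if username and password:
            smtp.starttls()
            smtp.login(username, password)
        smtp.send_message(msg)
```

Something like `send(build_message(...), "smtp.gmail.com", 587, user, pwd)` would mirror the authenticated path, while leaving out the credentials mirrors the plain SMTP path. Attachments are left out of this sketch.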
All right stop, cron time! (imagine a 90’s pop song)
As an added bonus, you can install this script to run via cron so you’ll magically end up with webpages archived in your inbox! Neat. You can read my previous post on cron, or you can create the following crontab.
MAILTO=ryankanno@CHANGE_TO_YOUR_EMAIL.com
# minute (0-59),
# |  hour (0-23),
# |  |  day of the month (1-31),
# |  |  |  month of the year (1-12),
# |  |  |  |  day of the week (0-6 with 0=Sunday).
# |  |  |  |  |  commands
  0  0  *  *  *  /usr/bin/python2.5 /PATH/TO/web2email.py -u GMAIL_USER -p GMAIL_PWD -f FROM_USER -r RECIPIENT URL
As a side note, don’t forget to put double quotes around the URL if it contains spaces!
I know, I know. The critics.
The critics will say that your Gmail username and password are in cleartext. I know. They are. So… I’m hoping that since you just need an archive of a publicly available URL on the Internets, the data doesn’t need to be super-duper-Fort-Knox-protected. If it does, this script isn’t for you.
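If the critics get to you anyway, one option is to keep the credentials off the command line (where they also show up in `ps` output) and read them from environment variables instead. This is only a Python 3 sketch of the idea, not something the script does: `WEB2EMAIL_USER` and `WEB2EMAIL_PASSWORD` are made-up variable names, and the password is still cleartext wherever you set them.

```python
import os


def smtp_credentials(environ=os.environ):
    """Return (username, password) from the environment, or (None, None)."""
    # Hypothetical variable names; pick whatever suits your setup.
    username = environ.get("WEB2EMAIL_USER")
    password = environ.get("WEB2EMAIL_PASSWORD")
    # Same rule the script enforces for -u and -p: both or neither.
    if bool(username) != bool(password):
        raise ValueError(
            "Set both WEB2EMAIL_USER and WEB2EMAIL_PASSWORD, or neither.")
    return (username or None, password or None)
```

The returned pair could then be passed along as the `username` and `password` arguments when sending.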
Oh, yeah, before I forget… here’s a hint… *cough*create another Google account*cough*. With that said, archive to your heart’s content!






