How I moved to The Cloud
When I go, all that will be left will be my home directory. Might as well keep it safe.
I asked around, among friends, on Mastodon, on Twitter, and a good number of people wanted to hear about my Cloud Data Migration Architecture Solution Strategy. On Twitter, 83% voted that a write-up would be useful, and 11% voted that it would be pure navel-gazing.
So, here goes. This might be interesting if you’ve ever thought about moving your own things out of your house and to servers maintained by professionals. Think of the essay as a box of things for you to pick and choose from, including some initial organizational suggestions about where to keep the pepper grinder, then more small parts than anybody who isn’t me would want to use at once: a scanner, a mailinabox virtual machine running tmux, S3s, backup databases, drive docks, FUSE mounts. Backups will be covered toward the end, because I don’t think your Cloud Data Migration Architecture Solution Strategy is complete without backing up. From there I leave it to you to pick out the parts of this long document most apropos to you.
Some, but not nearly all, of the below is about not using walled gardens. A walled garden is a portion of The Internet that is curated and maintained by a single entity, possibly linked with other sites it owns and otherwise a little cut off from the rest of the world. If your entire digital life is photos, emails, and a dozen documents you have to hold on to, a walled garden might be fine. But if you have a more extensive digital work/life, walled gardens may start to feel like that creepy kind of utopia where everything looks nice on the surface but may not be underneath. If your account runs into trouble, can they close it without notice? Can you move your data elsewhere (e.g., Google Takeout)? Now that you’re locked in, is extra space at a reasonable price? Is your data tied to a single set of tools which may or may not be your ideal, and may or may not change every time you refresh your browser? Will you have access on your telephone or a six-year-old laptop? If you lose favor with your government, will the curators of your garden hand over all your data? There are also political considerations, like whether you want alternatives to the GAFA oligopoly to continue to exist, but that’s your call.
Why rocket to The Cloud?
They promised us jet packs. They promised us a future where we could work anywhere, without necessarily carrying around some magic box. When you smell smoke and only have time to grab the kids, when the revolution comes, we should be able to up and go and wind up OK wherever we may land. As a complication to this, the U.S. gov’t claims that you have no right of privacy within 100 miles of a border and can detain you if you don’t give them access to any equipment you’re carrying, so having all your data on whatever you carry across the border may be a bad idea.
How close can we get to picking up any piece of hardware and feeling at home? It’s hard to enter a building that doesn’t have some sort of Internet-connected thing with a keyboard. (Tip: if you get the right adapter you can plug a USB keyboard into your telephone. Or, ask the Internet where to buy a folding portable bluetooth keyboard for your back pocket.) This essay lives in The Cloud, so I am writing this on my phablet with a keyboard I borrowed from a friend.
(Borrowing-a-laptop tips: most operating systems allow separate users, meaning that you can screw around with settings and download basic programs without bothering your hosts, who can delete your disposable account after you leave. Everything we do is in a browser these days, and Firefox has an excellent account system that lets you pull settings and tabs from other instances of Firefox on other computers; it makes a browser feel more like home.
For a quick check of something online, browsers have a “privacy mode” with an entirely different set of cookies. Say your host is logged in to their Google account; you can pop open privacy mode and log in to your own without interference. Firefox also has a “multi-account containers” add-on which provides another way to set up a segregated space for yourself.)
And forget jet packs: I’m a klutz. If you often use an object in close proximity to beverages, then that object is not reliable. I first alluded to this in my first navel-gazing post.
I’m past that point in life when I’m happier getting something new, and into the period when I feel better when I throw something out. Taking the filing cabinet to the curb out front felt good and I hope to never have a new one. When I walk into coworkers’ offices with desks covered in papers, I feel it.
Or, there was that one time when I left my laptop on the kitchen table, which has a high-up ventilation window maybe three meters up from the ground outside. I go do other things for a while, and when I get back I find that somebody got a ladder, reached through the ventilation window, and snatched the laptop. It was incredibly annoying, but I didn’t lose any work. (Also, I wasn’t logged in to my Google account or otherwise in an “If you have this physical object you have access to everything I own” state, so I didn’t lose any bank accounts either.)
Paper
Books are nice, and I still have a bookshelf in a room I still call the library. There will be a day when physical books are far less frequent, when I do most of my reading by checking out books from the library, and I’m not sure how we’ll decorate our rooms. But articles, old notes (meaning notes more than about a week old), bureaucratic documents — not a lot of pleasure in handling or even seeing them.
I scanned all of it, with a ten-year-old printer/scanner somebody left at my house one day because the printer part ran out of toner. The Salvation Army will have one for you, or buy a used one on eBay, flatbed or wand, for cheap. I emptied the four-drawer filing cabinet, piled it up (a meter and a half!), threw out a good percent of it, found electronic copies of what I could, paid a company $160/box to scan most of it, and hand-scanned a good number of personal or oddly-shaped things. The only papers I have left are those where lawyers believe that the physical object has some sort of magic that gets lost in digitization.
Post cards were easier to photograph with my telephone than put on the flatbed scanner. It was fun to look through them again, and to be reminded that there was once a time when people hand-wrote letters. I may look at them again some day, but the primary benefit of the scanning may have been to make throwing away the objects emotionally easy. Celebrity item-chucker Marie Kondo points out that the purpose of a postcard or letter is to be sent and received, and once that purpose has been achieved it can be discarded. It’s a nice sentiment that did help me to let go of the paper.
** Overall review: very worth it.
It was nice to be reminded of everything I’ve done over the last twenty-ish years, and now it’s easier to find than when it was in a filing cabinet, or in stacks all over. Taking a bag full of papers out to the recycling bin outside was a visceral joy.
At home and the office, I have desks exactly big enough to hold the monitors and the tea. There is literally not room for new paper.
In case you ever live with me, let me tell you one of my biggest cohabitant pet peeves: using active space for storage. If the coffee grinder gets used maybe once every week, it shouldn’t be on the counter right next to the toaster oven, which we use all the time. The darn grinder gets in the way when you want to prep a bagel. Is the counter clean and clear? It’s hard to tell from the spice jars that migrated onto it from the spice rack.
I started making a more solid distinction between active work and archives with my data, and it has changed everything.
I had had a home directory with some things that don’t really change, like PDFs for reference, right next to my current projects. It was hard to find what was in need of active handling in the midst of what should have been in a cabinet.
(Tip: be sure that you have a home directory and know where to find it. If you’re on a POSIX box, you absolutely do. If you’re using a Mac you do, something like /Users/yourname, and the system will take care to put all your stuff there. Windows sometimes makes a mess, putting documents in all sorts of places. This is bad and needs to be fought against.)
I divided my world into a few levels of active-ness:
- Projects I’m working on, which update daily.
- Administrative junk, like financial records and job-related paperwork. Updated every few days.
- The media library, including photos, music I’ve backed up from CDs I’d bought or mp3s from Bandcamp, journal articles that would have otherwise been in a stack on my desk. These do nothing but grow.
- My home directories of old, or archived projects, which will never change. I have them going back to the 1900s, even including copies of floppy discs I carried around from computer lab to computer lab (USB floppy disc reader: about $12 on eBay or Ali Express, then give it to a friend when yours are all copied off.) Add those above-mentioned digitized papers to this category. These aren’t 100% static, because they get cleaned up and moved around sometimes.
“A place for everything, everything in its place.” Separating by level of active-ness is why I now have a place for everything — I certainly didn’t before. In each case, I want three or four versions:
- The canonical copy, where the data really lives
- The backup, in case of fire
- The other backup, in case of revolution
- Disposable copies as needed, to be deleted fearlessly
Current projects
My last navel-gazing post was about keeping a clean home directory. That was four years ago, and not only did I keep up with it, but it’s how I organize my entire work flow now.
There’s a space for every ongoing project in a project archive. If I’m working on a project, I check out a copy from the archive. If I’m not working on it for the rest of the day, then I check the project back in, and delete it from my active work space.
This is backwards from how most people do it, so I’ll stress the point: the canonical copy is on the shelf, and what I see and handle and improve in my work space is a disposable copy. At the end of the day, I check all my changes in and throw out the local disposable copy. Develop no attachment to it, for attachment leads to suffering.
Is there anything pending from yesterday? Just check the home directory and see what’s there. Has a project been checked out into the home directory for a long time; are there too many sitting out? Time to take a few minutes to set a stopping point for each and put a few back on the shelf. Does something need to be filed in the archive? It should stand out, and if you know what shelf it’s on there’s no need to keep it out in active space.
And, as per the theme of this essay, such a reversal of what is canonical allows us to move everything to The Cloud and not really care when we spill tea all over the laptop keyboard.
If you aren’t using revision control (though I do recommend considering it), your active canonical store might be a Dropbox-type drive.
My home directory has an archive folder. On some machines I named it “a” and on others “+”, but never more than one letter because I want it to be out of the way. [Technical: Not sure why, but I’m uncomfortable putting something this important in an invisible dot file.] That’s the shelf where I put away projects when they’re done for the day.
[Technical (until that end-bracket a few paragraphs down): I have one git revision control repository per project. This provides a lot of structure, and if you’re used to git (me, I’ve written a textbook chapter about it) then that structure is very beneficial and not just bureaucratic.
That above navel-gazing post discusses the details, which revolve around a short script I’d written to answer the question: if I delete this clone of the canonical repository, will I lose any work?
There are services that will host your version control repositories, GitHub being the most famous, but there are a lot of others (I’ll give a shout-out to Sourcehut, which I love for being 100% indie and doing exactly what you need and no more), or you can self-host if you have a virtual machine up somewhere.
[Very technical: Below I’ll use an S3 for storage, but you can’t have an S3-backed git repository. Git sometimes has to modify but not overwrite a file; I’ve found that git sometimes looks for empty files or directories that S3s will sometimes throw out (i.e., if git doesn’t recognize it as a repository, you may have to make some empty directories to match what other git repositories look like); and the S3 latency is annoying for larger checkouts. So any repositories I may be actively using are in a segregated space (currently $HOME/a/r) on plain old storage on that virtual machine. It gets included in the backups below.]
]
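[Technical sketch: the script itself lives in that earlier post, so what follows is not it. It is only a minimal guess at the kind of check involved, assuming the canonical repository is the clone’s remote. The function name safe_to_delete is made up for illustration. It looks for uncommitted or untracked files, stashes, and commits on local branches that no remote-tracking branch has. Ignored files are not checked, so anything precious hiding behind a .gitignore would still be lost.

```shell
#!/bin/sh
# Sketch: does this clone hold any work that exists nowhere else?
# Returns 0 if the clone looks safe to delete, 1 otherwise.
# safe_to_delete is a hypothetical name, not the author's script.
safe_to_delete() {
    dir=$1
    # Uncommitted or untracked files in the working tree?
    if [ -n "$(git -C "$dir" status --porcelain)" ]; then
        echo "uncommitted changes in $dir"
        return 1
    fi
    # Work parked in the stash?
    if [ -n "$(git -C "$dir" stash list)" ]; then
        echo "stashed changes in $dir"
        return 1
    fi
    # Commits on local branches that no remote-tracking branch contains?
    if [ -n "$(git -C "$dir" log --branches --not --remotes --oneline)" ]; then
        echo "unpushed commits in $dir"
        return 1
    fi
    echo "safe to delete $dir"
}

# End-of-day usage might look like this (the path is a placeholder):
#   safe_to_delete "$HOME/myproject" && rm -rf "$HOME/myproject"
```
]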
** The console
If you don’t spend time in a terminal, skip this part.
For those of you who are still around, I’m paying Digital Ocean about $10/month for an always-on virtual PC.
I used to have a server on the bar/server rack here in the house, but now somebody else is paying for the electricity and extra hard disks. I’m not under the illusion that when I order a take-out dinner, my dinner was kitchenless, but it’s probably more eco-friendly to have my setup be side-by-side on a blade with a hundred others, and more importantly, it feels better to me that one more object is out of the house.
It’s a Mailinabox server, which is as close to trivial to set up as this sort of thing gets. It’s a relatively generic virtual machine and not a crazy mess of little services like Amazon’s web services, though I haven’t yet worked out how to get the Docker container that is my VM away from Digital Ocean.
Tmux is essential for this sort of thing (install it via your package manager). By its name, it’s a terminal multiplexer, allowing you to have sub-terminals in the one terminal you have on the screen. For our purposes, its key power is that it allows us to disconnect — just close the terminal or shut down the laptop — then reconnect exactly where we left off.
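[Technical sketch: one way to get the reconnect-where-you-left-off behavior is tmux’s new-session command with the -A flag, which attaches to a named session if it exists and creates it if it doesn’t. The function name home_session, the session name main, and the TMUX_CMD override are all made up for illustration.

```shell
#!/bin/sh
# Attach to the named tmux session, creating it first if it doesn't exist.
# TMUX_CMD is a hypothetical override, handy for dry runs (TMUX_CMD=echo).
home_session() {
    ${TMUX_CMD:-tmux} new-session -A -s "${1:-main}"
}

# From a borrowed machine (user@example.com is a placeholder):
#   ssh -t user@example.com tmux new-session -A -s main
# Detach with Ctrl-b d; the session keeps running on the server.
```
]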
Amazon Web Services has a service called Cloud9, which provides a terminal in your browser. C9 used to be an independent company, and its level of service has gone down since Amazon bought it, but it still makes it easy to log in to that Digital Ocean box from anywhere, no matter whether port 22 is blocked or PuTTY is not downloadable, and get that tmux session attached.
Let me tell you about simple storage services (S3s).
These are very low-cost systems that let you dump ‘objects’, whatever they may be, into an online storage bucket. Ten-ish years ago, S3-type services weren’t really a thing, and now you probably look at objects (bits of web pages, videos, &c) retrieved from S3 buckets all day long. S3 is officially what Amazon calls its storage, maybe it’s trademarked, but there are a dozen vendors who provide something equivalent, often enough to be about 100% drop-in compatible. They’re all simple storage services, and I have no obligation to defend Amazon’s trademarks, and I think some of them are even reselling Amazon storage, so I’ll call them all S3s.
They can be used as cheap, slow file systems—exactly what you need for your long-term, infrequently changing archive. Every S3 has a web interface, so you can get to your stuff anywhere you can find a web browser. You can hand people links to certain files if so desired, or below I’ll use buckets for backups.
So, I’m storing everything that is not in active use in an S3 bucket.
There are some glitches if you’re used to standard storage. Big moves or deletions can take several minutes, during which time the directory listings may be inaccurate. Though, it may be to my benefit that I can’t constantly micro-tweak the organization of the archive all day. [Micromanaging parenthetical: they lose time stamps, which are sometimes truly useful for archives, but I think I’m over the loss by now. In retrospect, I would have migrated to The Cloud by first generating the restic backup (see below) and then checking out to The Cloud, so that the restic backup could retain time stamps as needed. Or pull date stamps from your git checkouts.]
There are many systems that will mount a simple storage bucket to your local computer so they’ll look like any old drive in the file explorer/finder/thunar. On the Mac over there (*points across the room*), I’ve had some success with Cyberduck. As above, it can be laggy, but you’re looking up files in a filing cabinet, not doing active work.
[Technical for Linux: I’ve mostly been using goofys, though it sometimes eats up all my memory and has to be restarted, and it depends on FUSE, which is not especially robust to network hiccoughs.]
So, now you’re a renter, not an owner. Digital Ocean will give you S3 space for one cent per gigabyte per month. Backblaze’s B2 charges half that, but they don’t offer a terminal you can dial in to. So, 1 Terabyte of data on Backblaze for $60/year. For comparison, a 2TB hard drive is going to be about $60, though you’re not going to trust your data to a single physical object so you’ll need to buy two anyway. Costs aside, I’m paying somebody else to worry about uptime, replacing bad disks, redundancy, and making sure the hard drive doesn’t get lost under the couch cushions.
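[Technical sketch: the rent arithmetic above, written out so you can plug in your own numbers. The rates are the ones quoted in this essay, not a current price list, and the function name yearly_cost is made up for illustration. A terabyte is counted as 1000 gigabytes.

```shell
#!/bin/sh
# yearly_cost GB DOLLARS_PER_GB_PER_MONTH
# Prints the yearly storage bill in dollars: gigabytes * monthly rate * 12.
yearly_cost() {
    LC_ALL=C awk -v gb="$1" -v rate="$2" \
        'BEGIN { printf "%.2f\n", gb * rate * 12 }'
}

yearly_cost 1000 0.005   # half a cent per GB per month, 1 TB
yearly_cost 1000 0.01    # one cent per GB per month, 1 TB
```

The first line comes out to the $60/year quoted above, and the second to twice that, which is also about what the pair of 2TB drives would cost once.]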
Media
Yes, I still have a personal media library. I know that Amazon will license me revocable-at-any-time permission to read one of their e-books on a Kindle. My local library lends me e-books (to their great expense, via Overdrive or the Libby application, which your library offers). Spotify will let me stream a million hours of music a day. But sometimes you like to walk around town and discover new things and sometimes you like to stay home and enjoy what is already in your sphere. And, uh, sometimes you like to consume media in a manner such that the artists get paid.
Once you’ve bought some new piece of media and filed it away, it’s a read-only file. If you’ve read this far, you have the tools in place to deal with a personal library. Mine is in a restic database (see below) on Backblaze’s simple storage. I access it remotely not-infrequently, when I’m at somebody’s house and just have to play them some odd track that’s not on YouTube. Or make disposable local copies of whatever you’re in the mood for. Or run an airsonic docker container and listen from any web browser anywhere.
The ingredients of a backup system
Backups protect against two sources of failure. The first potential failure is hardware.
The second source of failure is me. If I delete a file or mis-modify it while inebriated, I still want to keep the old copy around for a while in case the deletion was the wrong thing to do. I want a snapshot of the system as it looked on any given day.
A backup system has two parts: a store of files, and an index that lets you view your directories as they were on a given date. From week to week, most of your files won’t change, even if you move or rename them, so it makes sense for each week’s directory snapshot to point to the same file in the file store if there were no changes. Typically, loading up such a system takes a long time for the first snapshot, and then takes a fraction of the time for the next incremental snapshot.
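[Technical sketch: a toy version of those two parts, to show why a renamed file costs nothing in the next snapshot. This is not how restic or Time Machine are implemented; it is a minimal illustration, assuming GNU sha256sum is available (on a Mac, shasum -a 256 is the usual stand-in) and that no file name contains a newline. The function name snapshot and the store/index layout are made up.

```shell
#!/bin/sh
# Toy backup: a content-addressed file store plus one index file per snapshot.
# snapshot SRC_DIR BACKUP_DIR LABEL   (give SRC_DIR without a trailing slash)
snapshot() {
    src=$1; backup=$2; label=$3
    mkdir -p "$backup/store" "$backup/index"
    : > "$backup/index/$label"          # start an empty index for this snapshot
    find "$src" -type f | sort | while IFS= read -r f; do
        # Name each file's contents by its hash.
        hash=$(sha256sum < "$f" | cut -d' ' -f1)
        # Copy the contents only if the store has never seen them before.
        [ -e "$backup/store/$hash" ] || cp "$f" "$backup/store/$hash"
        # The index records which contents lived at which path on this date.
        printf '%s  %s\n' "$hash" "${f#"$src"/}" >> "$backup/index/$label"
    done
}

# Usage (paths are placeholders):
#   snapshot "$HOME/a" /mnt/backup "$(date +%Y-%m-%d)"
```

The first run copies everything into the store. Later runs copy only contents that are new, and a moved or renamed file just gets a new line in that week’s index pointing at the same stored contents.]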
If you have a million files named project-1April20.doc, project-2April20.doc, project-3April20.doc, maybe you can delete them after you’ve set up the backups. If you put the project itself under revision control like git or mercurial, then you won’t even need one of those homemade backups.
Apple offers Time Machine, which works in a manner comparable to the one-liner in the technical parenthetical you will be skipping below. You can pay Apple for space on their cloud for the backup database, at prices competitive with Digital Ocean, Backblaze, and all the others, at the cost of being tied to Apple products for life.
I’ve been using Restic, an open-source backup program designed around using S3s to store that sequence of snapshots. Also, it encrypts everything, which is nice if you’re feeling paranoid.
Borg, the lead alternative, didn’t work reliably for me on S3s mounted via ssh. Maybe I did something wrong. I didn’t try rclone. It seems to be a reimplementation of a Linux/GNU one-liner for the S3s.
[Technical: here it is, the one-line complete not-for-S3s backup system.
rsync -aPH --delete /directory/to/back/up ./current; cp -al current `date +%y-%m-%d`
It’s a space and two dashes in front of ‘delete’, in case Medium turns it into an em dash. If you’re on a Macintosh personal computer cp won’t have the -l option to generate hard links (it’s GNU), so replace the cp with:
mkdir `date +%d-%b-%y`; pax -r -w -l current `date +%d-%b-%y`
]
So, this is not complicated stuff. If you’re uncomfortable with the command line (Don’t be — there are millions of people like you who got over the initial intimidation and worked it out), there are GUI front-ends; just search for ‘restic gui’ or what-have-you. There are also commercial vendors who provide GUIs to do the same thing, which I sometimes suspect are also front-ends to the above tools but now you pay somebody for them.
For me, once a week, a script on that Digital Ocean virtual machine in the Cloud/Ocean writes a new incremental snapshot of all the above bits (the media and the things on the shelf both inactive and active) to a backup database stored in a separate bucket. Or if you aren’t sure how to schedule a task [Linux/Mac: ask The Internet how to use cron or crontab.] you can set a calendar reminder.
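[Technical sketch: roughly what such a weekly script could look like, assuming restic is installed and the repository has already been created with restic init. This is not my actual script. The endpoint, bucket name, paths, and the weekly_backup and RESTIC_CMD names are placeholders for illustration; RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE are the environment variables restic reads.

```shell
#!/bin/sh
# weekly_backup.sh: sketch of a weekly restic snapshot to an S3-style bucket.
# The endpoint, bucket, and password file below are placeholders.
export RESTIC_REPOSITORY="${RESTIC_REPOSITORY:-s3:https://s3.example.com/my-backup-bucket}"
export RESTIC_PASSWORD_FILE="${RESTIC_PASSWORD_FILE:-$HOME/.restic-password}"
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must also be set for S3 access.

weekly_backup() {
    # RESTIC_CMD is a hypothetical override; RESTIC_CMD=echo gives a dry run.
    ${RESTIC_CMD:-restic} backup "$@" &&
    ${RESTIC_CMD:-restic} snapshots
}

# Run only when directories are given on the command line.
if [ "$#" -gt 0 ]; then
    weekly_backup "$@"
fi

# Example crontab entry (edit with `crontab -e`): every Sunday at 03:00.
#   0 3 * * 0  $HOME/bin/weekly_backup.sh $HOME/a $HOME/media
```

After the first run, each later backup only uploads what changed, and restic snapshots lists the dated snapshots you can restore from.]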
** The other half
We can trust the storage providers to have redundancy for hardware failures. But what if the revolution comes and we can’t access the Internet?
So, I also keep hard drives at home for backups, which are copies of whatever is in The Cloud. Having hard drives on the shelf is a step back from being fully in The Cloud, but paranoia is good for you and having only one copy of your entire digital life is not advisable. You could instead get a second set of S3 buckets with a different provider to get some redundancy.
(Tip: you can buy an empty SATA drive dock or drive case which you can use to plug in a bare SATA disk drive. This is more efficient if you want a few drives for different purposes, and is cheaper than, but functionally identical to, buying a more consumer-focused external drive where you can’t open the case. To give you a sense of how hi-tech this will make you feel, there was a Joss Whedon sci-fi TV show, Dollhouse, where they prominently used such a system as a key part of the sci-fi technology.)
It’s not jet packs, but when the revolution comes and all you have is a telephone or a laptop you stole from somebody’s kitchen, you can still have all your data history, all the media that brings you joy, and can still be most of the way to fully work- and data-functional.