Data science boot camp, in one Medium post

Ben Klemens
21 min read · Apr 29, 2024

Dear Colleague,

I’ve been asked to give you some things to practice on while you wait for access to the servers. Think of this as the intro level to the video game, where you learn what the controls do before the real threats come after you. I’m the non-player character in the robe and long white beard.

[Maybe you’re not my colleague — I’m marking this post as public, because although there are some things I focus on because they’re somehow more important within our office, most of what I’m covering is very broadly applicable to anybody who is sitting down in front of their first production server. I invite you to please enjoy free-riding on this work product.]

We use data; we do science; we count as data scientists. The Spanish Spanish for a box that runs software is ordenador, which is descriptive of what most people do with their boxes: order things. We compute (Mexican Spanish: computadora), but that requires a relatively specialized set of tools for ordering things. Everything through section VII will be ordenating; later sections (SQL ff) will start asking questions of data. You’ll see from the section headers that most of the topics are relatively low-level, because these are the things that make life bearable when you are working through a pipeline to a remote terminal, and because textbooks and teachers often skip this stuff as somehow too mechanical or not data science-y enough. Or, because it’s so taken for granted that you’re assumed to know it.

This post should take you a few days to get through. It is a meta-tutorial, in which I advise you on what is important for our purposes in the Internet’s vast sea of tutorials for the subtopics I’ll point out. As a point of preface and encouragement to your efforts, I’ll note that, because the command-line way of doing things has persisted for so long and is expected to continue to do so, the return on investment for learning command-line and coding basics is life-long. In cyberpunk fiction, people are always coming up with new ways to interface, but somehow real-world hackers continue to rely on plain text. Who knows what revolutions will come, but the decades-old methods here seem to still be growing in utility and usage.

People in 1993 looking at a computer screen, where the file system is presented as a red box with blue boxes on top and connected to other red boxes by green lines.
A scene from Jurassic Park (1993) in which the characters use a Silicon Graphics 3-D file manager, now forgotten.

I. Get your command line.

  • Go to aws.amazon.com.
  • Set up a new AWS account. All of this can and should be done using AWS’s introductory free tier, and you shouldn’t have to provide a credit card. Don’t forget to shut down any servers you create when you’re done with them.
  • Once you’re in, search for cloud9 (no space).
  • Once on the c9 control page, create a new environment. Give it whatever name you wish, then leave the other defaults (notably, the “Free-tier eligible” setting so you won’t get charged; see free tier details here).
  • Click your new instance and open in cloud9.

You’re now at an IDE (integrated development environment). Down at the bottom, you’ll see a tab named bash (Bourne-again shell, a sequel to the Bourne shell). Maximize that using the little icon in the top right. There’s your command line. Welcome to your new $HOME.

Note that you can also create new terminals via the Window menu and the green + at the tab level.

As a brief note, you’ll see a lot of instructions here and in the tutorials I’m going to recommend for you to work through, but the above is the last time any of the instructions will be awkward hand-waving about clicking here, finding this button, filling in that form. (Almost) everything here on out will be plain text, which you can unambiguously copy and paste. I’m pointing this out because it’s hard to notice the absence of annoyances.

The absolute basics you could type at that command line:

  • ls is “list what files or subdirectories are in this directory”.
  • cat a_file will print a_file to the screen.
  • Tab completion: the shell (bash) will try to complete file or command names if you hit <tab>. For
    example, type
    cat R<tab>
    and you’ll see it complete to
    cat README.md
  • less a_file will let you read longer files in a pager. Once in the pager, the space bar or page down will scroll down, or ‘q’ to quit. [The original version was named more, because at the bottom of a long file, there’d be a caption saying “more…”, and this is the successor, where less is more. None of this was ever funny and now you just have to memorize these command names that make little sense.]

Continue when: you have a terminal up and running.

II. Tmux

Tmux is short for “terminal multiplexer”, because its primary utility is to turn a single terminal into several.

I don’t have a specific tutorial to recommend; whatever you google for “tmux tutorial” should be fine, but see the Cloud9 caveats below. There are a million tutorials because software that people directly interact with tends to get emotional reactions from people, and when geeks feel strong positive emotions, they write tutorials.

Cloud9 puts you in a tmux session from the start; it’s how c9 works. On a more typical Linux box like ours, you would start a session, via plain tmux, or attach an extant session, via tmux attach. That’s what you’ll do on our servers, while in c9 when you get to the part in the tutorial about detaching you’ll see that c9 more-or-less won’t let you.

Nonetheless, when you are going through several layers of mostly-reliable systems to get to your data, the ability to reattach and resume with one command after a disconnection will be life-saving.

[One of our servers has tmux’s predecessor, GNU Screen, which behaves largely identically, especially if you do what many people do and remap tmux’s attention key to <ctrl>-A. You’d do that by pasting this onto the command line: printf "unbind C-b\nset -g prefix C-a\n" >> ~/.tmux.conf , then restarting tmux.]

Continue when: you’ve split your screen into left and right panes (“horizontal split”; c9’s vertical split was buggy for me), and successfully copied text from the left pane and pasted it into the right.

III. Work out a text editor

Most of what you will be doing is creating and manipulating text files. For the creation part, you’ll need a text editor. We have an IDE comparable to cloud9’s on the server, but it isn’t rock-solid reliable. Our always-on choices are limited on the remote server, so we’ll cover only two here:

- Nano

The name stresses how small, and how few features, this one has. I recommend spending ten minutes getting to know it via any of a thousand tutorials.

- Vim

Vim is vi improved, and vi is ‘visual’, because vi is old enough (1976) that the idea of a text editor where you can see what you’re working on was novel. Decades later, it is still among the most popular text editors, despite being deeply frustrating initially. The initial frustration comes from its modal nature: there is an “insert mode” where typing “j” puts a j on the screen, and an edit mode (vim’s own documentation calls it normal mode), where typing “j” moves the cursor down one line. Having an edit mode means you have a keystroke for all sorts of things that usually take a mouse or banging on arrow keys to do. It’s wonderful, the curb cut effect at work: it was designed to be efficient and usable in a very constrained situation (1976), and those efficiencies are still there for us when less constrained. In fact, the vi keymap has been adopted all over (like the less pager’s navigation and tmux’s copy mode keys).

It is a lot to get into muscle memory, and you’re encouraged to give it ten minutes a day, as you would a musical instrument you want to be good at but don’t actually want to practice. Maybe learn three new commands per day, and within a week you’ll be doing fine with it.

Start with (1) at your command prompt, typing vimtutor and following through there, and (2) https://vim-adventures.com/ , or, when you get tired of it or don’t pass the paywall, google “vim tutorial” and pick your favorite. Maybe print out a vim cheat sheet to put by your desk.

Continue when: you can open a file in nano, modify it, and write out/quit. Get to know vim as you go.

IV. Command-line basics

The Internet believes that people under a certain age can’t understand the concept of a file hierarchy, wherein you have a home directory, which has some files and some folders, and those folders also have some files and subfolders. I take this to be a moral panic; it’s not all that complicated. Also, it is non-negotiable, being how The Internet works, still.

You met ls already, and now that you’ve created some files via your text editor, this is your chance to get to know the hierarchy.

I’ll recommend this tutorial, or google “command line tutorial” for a few thousand others. Please get to know the content on p 4, 5, 6, mostly two-letter commands: ls, cd, cp, mv, rm, mkdir. The other pages are of limited relevance to us in our context.

On p7: When surfing The Web for help, you’ll find that many authors have a sense of “root privilege”, the belief that everybody has root access (aka superuser permissions) on their systems. On our system, you don’t, and everywhere you see somebody say “just become root and do this thing”, you’ll have to work out how to work around it. There’s usually a way.

Page 8 goes a long way to make a simple point, which is worth a one-¶ note here: if a file begins with a dot (.), then it is “hidden”, by custom. That means when you ls a directory, you won’t see it, and many other utilities will ignore it. This is custom only. There is zero security preventing you or anybody else from viewing dot files. You can ask for “all” in your home directory, including the dot files, via

ls -a ~ 

, where the ~ (upper left on your U.S. keyboard) is a bash convenience for referring to your home directory. This will show you all the configuration junk hidden away. One of these items, .bashrc, will become salient below.

You may not have known to ask “what’s the difference between an absolute and relative path?”, but here’s an answer for you: https://www.redhat.com/sysadmin/linux-path-absolute-relative

Continue when: all those two-letter commands make sense to you.

V. Shell basics

Programs on a POSIX system, by default, have a standard input and standard output. Frequently the input is a text file, and the output is plain text that will be written to the screen. A pipe lets you redirect the standard output from one program to feed directly into the standard input to the next. The word count utility, wc, will tell you how many lines, words, and characters its input has; e.g., wc -l README.md is the line count for the Readme file c9 gives you by default. As above, ls writes a list of files to standard out. Putting them together, ls | wc -l will tell you how many files you have in a directory.
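For when you reach the Python section (§IX): here is a minimal sketch of the plumbing the shell sets up for ls | wc -l, written with Python’s standard subprocess module. The scratch directory and its three file names are made up for illustration, and the sketch assumes a POSIX system with ls and wc available.

```python
import subprocess
import tempfile
from pathlib import Path


def count_entries(directory):
    """Do what `ls directory | wc -l` does and return the count as an int."""
    # Start ls with its standard output connected to a pipe.
    ls = subprocess.Popen(["ls", directory], stdout=subprocess.PIPE)
    # Feed the read end of that pipe to wc as its standard input.
    wc = subprocess.run(["wc", "-l"], stdin=ls.stdout,
                        capture_output=True, text=True, check=True)
    ls.stdout.close()
    ls.wait()
    return int(wc.stdout.strip())


# Hypothetical scratch directory holding three small files.
with tempfile.TemporaryDirectory() as scratch:
    for name in ("a.txt", "b.txt", "c.txt"):
        Path(scratch, name).write_text("hello\n")
    n_files = count_entries(scratch)

print(n_files)
```

The shell’s one-character | does all of that for you, which is a fair argument for learning the shell.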

Try this tutorial on some basic things you can do with comma-separated files (CSV). At the top, it will ask you to download a sample CSV file, and you can do so from the command line using wget (web get):

wget https://raw.githubusercontent.com/opensourceway/command-line-data-analysis/master/jan2017articles.csv 

Continue when: you’ve gotten through the tutorial.

VI. Scripting basics

The command line, the shell, does far more than list what’s in your current directory. It is Turing complete, meaning that effectively any program you’d want to run, you could write in the shell.

Caveats: Bash does two jobs at once: run scripts, and facilitate your interactive work on the command line, and those two interact in glitchy ways. It is a “macro language”, which takes text, then replaces that text with other text, until everything has been reduced to a final answer. It has a lot of macro types, and using them all is, within the context of our office, bad form. Macro languages tend to be harder to maintain and control, but can be wonderful when used with restraint. [Other salient macro languages: Stata, SAS.]

OK, enough with the negativity: part of my goal is to shift your thinking of your work away from finding answers to questions about data and toward building little machines that can replicably answer questions about data. Bash scripts are the most basic way to build those machines. You should be able to tell your colleague to take a script you wrote, run it, and generate exactly the same results you got. Four months from now, you will have no recollection of how you got a result, and you will discover that your notes assumed some steps which seemed obvious at the time and are now forgotten. But if you have the start-to-finish script, then you by construction have everything you need.

Let’s build one now.

  • Ask tmux to split your window into a left and right pane.
  • On the right, open your text editor. Let’s create a one-line script; put wc -l $* in the file, and save as counter. The $* will be replaced with whatever the person who runs the script chooses to put on the command line.
  • On the left, make your file an executable program. Change the “executable mode” in the file’s metadata to on via chmod +x counter.
  • Run it on a file, via ./counter README.md, or on multiple files, like ./counter * . [If you don’t specify a file, then $* in the script will be replaced with a blank, then bash is running plain wc -l, and when wc gets no file names it waits for input from stdin. If you get into that state, break back to the command line via <ctrl>-c.]

Congratulations, you’re a programmer. A one-line script isn’t so exciting (maybe add an intro line like echo "Counting lines" if you’d like), but you’re now in a position to record what you’re doing in a script and run the script, rather than manually typing in every step every time.

You’ll quickly find a need for loops and branches, like a loop reapplying a step to every file (try for f in *; do ./counter $f; done) or checking whether a file exists before running. Here’s one tutorial that covers the basics. I encourage you to rapidly skim §4, 5, 6, though. Arrays are especially obtuse in the shell and are bash-specific. But the if/then statements, variables, and functions eventually become essential.

Also, please remember: you will not be tested on this. Get the gist; look up the details as you need them; you’ll memorize what needs memorizing as you repeat it in practice.

After you’ve gone through the tutorial, have a look at $HOME/.bashrc (where you now recognize $HOME as a predefined shell variable, defined as the location of your home directory). This script is what runs every time bash starts, meaning little preparatory bits such as variable assignments happen here. Sample usage: I have alias tm="tmux attach" in my .bashrc, because as soon as I log in I will always start tmux, and now I can type nine characters fewer every time I re-login. The command prompt itself can be set here (ask The Internet about the PS1 variable), as are many site-specific details. Some of this file will be legible to you, some won’t be; take this as a caveat about not using bash too much.

To be pedantic, when you type a command at the shell, it creates a new sandboxed environment, then runs the program you specified in that environment. So there are two types of variables: regular within-script variables, and exported variables. One especially salient exported variable is PATH, which specifies where to look for programs. Even wc is just another file sitting on a hard drive, namely /usr/bin/wc, and bash knows to look in /usr/bin for it because it is in the path. Get your exported environment variables via env, search for lines with a given text via grep, so list exported paths via env | grep PATH.

The current directory is . or ./, and it is usually not on the PATH, which is why you had to specify ./counter . Put it there if you’d like: open ~/.bashrc, and add this line

export PATH=./:$PATH

to set the PATH variable to be ./:, followed by whatever the PATH variable had been (elements in paths are separated by colons). N.b. that bash will have a fit if you put spaces around equals signs. Once you re-source the file (source ~/.bashrc) so the commands are read, counter README.md should work, without the ./.
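If you come back here after the Python section (§IX), the lookup that bash does with PATH can be sketched in a few lines. This is an illustration of the idea and not bash’s actual implementation: the search_path function is my own name, the scratch directory stands in for wherever your counter script lives, and /nonexistent_dir is assumed not to exist on your machine.

```python
import os
import stat
import tempfile


def search_path(program, path_string):
    """Return the first executable file named `program` found in the
    colon-separated list of directories, or None if there is none."""
    for directory in path_string.split(os.pathsep):
        candidate = os.path.join(directory or ".", program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


with tempfile.TemporaryDirectory() as scratch:
    # Recreate the one-line counter script and mark it executable,
    # which is the Python spelling of chmod u+x.
    script = os.path.join(scratch, "counter")
    with open(script, "w") as f:
        f.write("#!/bin/sh\nwc -l $*\n")
    os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR)

    # Not found while its directory is off the path...
    before = search_path("counter", "/nonexistent_dir")
    # ...and found once the directory is prepended, as in the export line.
    after = search_path("counter", scratch + os.pathsep + "/nonexistent_dir")
    found_ok = (after == script)

print(before, found_ok)
```

Python’s standard library has a ready-made version of this search, shutil.which, which reads the PATH environment variable by default.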

Finally, there are some conveniences on the command line itself. You’ve already seen tab completion. The exclamation point does much of the command history work. To repeat the prior command, use: !! which does what the up arrow does, but somehow feels good to type. To reuse the arguments to a command, use !*, like
mkdir new_dir
cd !*

to enter the directory you just created. There’s far more that bash does with history; see, e.g., everything after “reviewing your previous bash history” in
https://www.digitalocean.com/community/tutorials/how-to-use-bash-history-commands-and-expansions-on-a-linux-vps

Note also that there are other shells than bash. I personally prefer zsh. There is also a standard for the shell as a programming language, which includes the basic functions and variables but not the extravagances around arrays and unusual text substitutions that I’d advised against focusing on above.

Continue when: you’ve written a script with one function (using the shell’s fn() {} syntax) which is called repeatedly. For example, write an x-checker script: write a function which, for the file whose name is given as function input, uses grep and a pipe to wc to count how many lines have the letter x in them, then have the script call that function for each file in the directory provided by the script user.

VII. Git

OK, you have a replicable script you’re sharing with a coworker. How do you get it to them, and how do they get it back to you after edits? The manner which solves a lot of problems at once is to use a revision control system. Git has become the standard in revision control, and we use it for some models, and hope to use it for more.

If nothing else, we have a very locked-down system, and copying a file to a colleague can run into permissions problems. If you create a repository in a shared space via git init --bare --shared=group , whatever you put into that repository can be checked out by your colleague, and you can check out what they contributed. You are working in your home directory, they in theirs, and you have a neatly-ordered shared neutral space for going back and forth.

I wrote a book chapter on git once, and it was well-received, so I’m going to self-recommend that for your git tutorial. After you read that, you might also enjoy a write-up in an entirely different direction (one word: blockchain!). In the book chapter you should read first, I mention a diff-snapshot duality, and the chapter approaches git as diffs while the second write-up approaches it as snapshots.

Continue when: You’ve gone through the tutorial linked above.

VIII. SQL

The source of all our data is in a database to which you will interface via structured query language. SQL is another of those old standards (1974) which is still very alive today, though it has its share of oddities and warts (e.g., there’s no standardized way to get metadata). The core of it is eminently sensible: select some columns, from some tables, joined together by matching certain keys, where certain restrictions are met, grouped in certain manners.

SQL is also Turing complete. Scroll down to §3.5 of this manual page and you’ll see queries that solve Sudoku grids and draw the Mandelbrot set. I’ve found that sort of power to be primarily useful for traversing node-and-edge networks; the thing about selecting from a table where certain restrictions are met is more than enough for now.

I’m going to self-recommend my chapter on SQL, (ch 3, p 90–97), or see an alternative below. SQLite3 is pre-installed on our servers and Cloud9 — it is fundamental to a lot of how computing works in the present day — so you can follow along directly. Get the sample data via

wget https://modelingwithdata.org/mwd_book_samples.zip
unzip mwd_book_samples.zip
cd mwd_book_samples

The data-* files are all you’ll need from that mess.

Please use this as an opportunity to enjoy the two-pane arrangement. In one, you might have a script with an SQLite command, like

sqlite3 data-wb.db "
select *
from pop
where population <=50
"

and in the other you can repeatedly re-run the script as you modify your query.

(An aside: Looking at this book I wrote over a decade ago now, skim the appendices, pp 397ff, if you want more on shell scripting. I’ve apparently been convinced that people who do math benefit from knowing these things for a long time.)

If you’d like to hear a different writing voice for a while, this in-browser tutorial covers the SQL basics. Consider page 12 to be the conclusion, because our data is read-only, so table alteration and creation are things to look up should you need them.

I have a regret about that tutorial you just read (either one): it doesn’t cover the WITH clause. [Archive.org tells me SQLite’s manual page on the WITH clause (the one linked above that solves Sudoku puzzles) didn’t appear until 2014, years after the book came out]. Its usage is simple: give a query a name, then use that table as named in your main query. Here we produce a table with the count of elements in each class, then we get the average count from that table:

sqlite3 data-wb.db "
with grps as (
select count(*) as ct
from classes
group by class)
select avg(ct)
from grps"

Next, I want to get the total GDP and count of countries in each region as defined in the classes table, then use that to get a category average (where each country counts as one unit in the average regardless of population), then report the average of those categories for high-income countries and for all others. The I want…then I want…then I want format becomes a sequence of queries:

sqlite3 data-wb.db "
with grps as (
select count(*) as ct, sum(gdp) as total_class_gdp, class,
class like 'High-%' as is_high
from classes, gdp
where gdp.country==classes.country
group by class),
regional_avgs as (
select total_class_gdp/ct as avg_gdp, is_high
from grps
)
select avg(avg_gdp), is_high
from regional_avgs
group by is_high
"

[Maybe also try swapping the 'High-%' pattern that defines is_high for '%Asia%' or '%Africa%'.]

This breaking down into subtables makes complex queries much more readable, and even though SQL is declarative, you wind up with a more procedural flow. You also get a speed boost, because query optimizers are incredible systems designed by brilliant people, and putting your entire inquiry into one query allows those optimizers to better do their job.
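Once you have Python in hand (§IX), you can try the WITH pattern without any data files at all. Below is a minimal sketch using Python’s built-in sqlite3 module and an in-memory database. The four rows are made up and only borrow the classes table’s name and column names, so don’t read anything into the numbers.

```python
import sqlite3

# An in-memory database with a toy stand-in for the classes table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table classes (country text, class text);
insert into classes values
    ('A', 'High-income'), ('B', 'High-income'), ('C', 'High-income'),
    ('D', 'Low-income');
""")

# Step one names a subtable (grps); step two queries it like any table.
query = """
with grps as (
    select class, count(*) as ct
    from classes
    group by class)
select avg(ct)
from grps
"""
avg_ct = conn.execute(query).fetchone()[0]
conn.close()

print(avg_ct)  # the group counts are 3 and 1, so the average is 2.0
```

Because the toy table is small enough to check by hand, it is a convenient place to test each named step before pointing the same query at real data.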

WITH clauses are also important because sending temp tables back and forth to our database server is complicated and annoying, and a recent ratcheting-up of security effectively broke it. Do as much data prep as possible in one query, keeping it structured.

Continue when: you’ve gotten through p97 of the above SQL chapter or p12 of the online tutorial.

IX. Python basics

Python is the lingua franca of computing in the present day. From a computer-science-y perspective, it is … fine. There’s a type for everything (lists, sets, dictionaries, tuples, collections, iterators, data frames, plus types for the sort of large-scale numeric computing we aim to do; see below). There is an enjoyable list comprehension mechanism. You don’t have to worry about the redundancy of indentation and delimiters setting apart blocks like { }, because only the indentation matters.

You’ll see that python3 is preinstalled on Cloud9. Just type its name on the command line to get the interactive Python prompt, which will be useful at the beginning of your tutorial journey. (You can exit both the Python command line and the shell via <ctrl>-D.) Or put your script in s.py in your text editor on the right and execute python3 s.py on the command-line on the left, as will be essential as you continue. [A useful third way of working, Jupyter notebooks, is beyond the scope, because setting them up on c9 is nontrivial. It works on our in-house server though. Also, I’m trying to stress building replicable scripts, and that’s not quite the focus of Jupyter notebooks.]

Because Python is the consensus language for teaching, the number of python tutorials is countably infinite. One reader recommended this tutorial out of the many. It comes with an in-browser Python interpreter for when your terminal isn’t up. Read until the page about commenting, then skip to the two-page section on modules, then continue with the several pages about built-in types, the ‘deep dive’ section which covers actually essential things like list comprehension, on to file read/write, but stop at virtual environments. You’ll be working on a server largely disconnected from The Net, you won’t be able to install packages, and as a result virtual environments are basically impossible for us. Other concerns of the tutorial like dealing with Web-oriented JSON data are not ours.

Given our context, I want to especially draw your attention to the relatively early section on writing functions with def. Let me tell you how I identify whether I can trust somebody’s code: I check whether they use functions. Self-taught coders who learned while confronting questions about math and the real world, myself included, usually started off by writing scripts that start at the top and flow to the bottom, breaking into parts with /**** banners specifying sections *****/. Professional code files are a sequence of functions, each of which does one thing and is only a few lines long, and a short main routine at the bottom calling all those functions. If there is a problem, you can focus on the function that seems at fault, without worrying about the rest. What variables are in play while working on that function are the limited few handed to the function. Repetition is reduced. The main flow is legible.

It’s hard to dictate how to write good functions — you just have to get the feel from doing it. But please do make an effort to break your procedures into small functions right from the start. Good function writing is what makes good code.
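To make the shape concrete, here is a small sketch of a file built from short functions plus a brief main routine at the bottom. The task (turning name,count lines into shares) and the data are invented for illustration; the structure is the point.

```python
def read_counts(lines):
    """Parse 'name,count' lines into a dict mapping name to an int count."""
    counts = {}
    for line in lines:
        name, count = line.strip().split(",")
        counts[name] = int(count)
    return counts


def total(counts):
    """Sum all the counts."""
    return sum(counts.values())


def shares(counts):
    """Convert each count into its share of the total."""
    denom = total(counts)
    return {name: count / denom for name, count in counts.items()}


def main():
    raw = ["red,2", "blue,6"]  # made-up input
    print(shares(read_counts(raw)))


if __name__ == "__main__":
    main()
```

If the shares look wrong, there are only three places to look, each a few lines long, and each can be called on its own from the interactive prompt.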

Continue when: you’ve read through the tutorial enough to feel you’ve got the Python basics down, and have written yourself some toy scripts that make use of functions.

X. Numeric Python

Because Python is a general-purpose language, it is not restricted to predefined notions of how math is done. For example, in R, the model object is a generalized linear model, and Stata does the sort of regression analysis mainstream economists do very well, and stretches to do anything else. As methods evolve and “statistics” morphs into “machine learning”, focus has moved to the general-purpose languages.

Mainstream numeric Python has a stack of elements. [On Cloud9, you’ll need to pip install pandas for the below.]

  • At the base, Numpy provides arrays, which are lists in a sufficiently regimented format that math can be done quickly on them — basically the arrays you’d have in C or Fortran, with a nice Python front-end. Pythonland, linked above, has a one-page intro. The Numpy “absolute basics” is enough to make you an expert. I recommend close-skimming that page but stopping just before “how to create an array…”, because we’ll reserve that for the next layer.
  • Pandas provides a data frame, which wraps numpy arrays into a friendlier format, notably giving you column names. It gives each row an “index”, which is like a row name, but more regimented in various ways. I personally often have trouble with index management, but more importantly, Pandas’s referencing mechanism is another layer which can slow down things significantly. In practice, this means care must be taken to balance work in your data frame, let’s name it df, and the numpy array it holds, df.values.
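A small sketch of the two layers side by side, assuming numpy and pandas are installed. The ridership numbers and line names are made up. I use to_numpy() here, which returns the same array as df.values for this data.

```python
import numpy as np
import pandas as pd

# A toy data frame: named columns, plus an index that acts as row names.
df = pd.DataFrame(
    {"riders": [100.0, 250.0, 50.0], "stations": [4.0, 5.0, 2.0]},
    index=["red", "blue", "green"],
)

# Pandas layer: column operations by name.
df["per_station"] = df["riders"] / df["stations"]

# Numpy layer: drop down to the bare array for the numeric work.
arr = df[["riders", "stations"]].to_numpy()
col_means = arr.mean(axis=0)

print(df)
print(col_means)
```

The names and index stay behind in df when you drop down to arr, so the balancing act is deciding which steps need labels and which only need numbers.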

On creating data, the most common means is to either get a data set from an SQL query, or from a CSV file. On the SQL front, this tutorial quickly got across that Python’s SQLite package, like most database systems, uses a connection object to talk to the database and a cursor to step through table results, but Pandas provides a read_sql_query function that encapsulates all the cursor work for you. Among the data sets you downloaded in §VIII, you also have a few CSV files about DC Metro ridership; try getting those into a data frame using Pandas’s read_csv function.
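Here is a sketch of both routes, run against a throwaway in-memory table so it works without any downloads. The table borrows the pop name from the earlier query, but the countries and populations are invented. It assumes pandas is installed.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the real database file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table pop (country text, population real);
insert into pop values ('Aland', 12.5), ('Borduria', 80.0), ('Cascadia', 40.0);
""")

query = """
select country, population
from pop
where population <= 50
order by country
"""

# Route 1: connection plus cursor, stepping through the results yourself.
cur = conn.cursor()
cur.execute(query)
rows = cur.fetchall()  # a list of tuples, with no column names attached

# Route 2: let pandas do the cursor work and hand back a data frame.
df = pd.read_sql_query(query, conn)
conn.close()

print(rows)
print(df)
```

For the real data, swap ":memory:" for the path to data-wb.db and drop the table-creation step.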

The data-wb.db data set you downloaded for the SQL tutorial includes a count of how many pages are in the Lonely Planet tourist guide for a given country. It was referenced on p 106 of the SQL tutorial. The data was provided by Dorothy Gambrell, one of the greats in data visualization, who used Lonely Planet page count divided by population to define “interestingness per capita”. You can determine the country with the highest interestingness per capita via a single query, but to get a feel for Pandas, try writing a script where the query inside your read_sql_query function stops at joining tables, then calculate and sort by interestingness per capita via Python column operations.

  • Scikit-learn provides the models that you would fit with all those data sets. If you read the rest of my 2009 textbook, you saw the two core structures were the apop_data data frame and the apop_model statistical model structure; Scikit provides something like the latter. Its structure is broad, and both statistics-flavored regressions and machine learning-flavored classification algorithms fit into the same framework.

With regard to getting to know it, I recommend just picking a model and trying it. The whole point is that most are operated in a similar manner, regardless of differences in the underlying math. Here’s one of their classifiers; maybe follow through the “Usage” and “Missing values support” sections. There, you’ll see the standard procedure at work: get the (randomly-generated) data set, send it to the fit method, and once you have a model with estimated parameters, use it as desired, such as to predict new values.
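A minimal sketch of that procedure, assuming scikit-learn and numpy are installed. LogisticRegression is my pick for brevity and is not the classifier linked above; the data set is randomly generated, with a labeling rule I made up so the right answers are obvious.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Step 1: get a (randomly generated) data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Made-up rule: label is 1 when the two coordinates sum to more than zero.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Step 2: send it to the fit method to estimate the model's parameters.
model = LogisticRegression()
model.fit(X, y)

# Step 3: use the fitted model, here to predict labels for new points.
new_points = np.array([[2.0, 2.0], [-2.0, -2.0]])
predictions = model.predict(new_points)

print(predictions)
```

Swapping in a different model is usually a matter of changing the import and the line that constructs the model; the fit and predict calls stay the same.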

Our office uses other related tools, like the ASF’s pyarrow format to take the slot numpy occupies in the above stack. At the moment, we do no artificial intelligence work (by which I mean neural network architectures trained on large data sets), but the frameworks for that are a layer on top of the above, and if you have all this down, the next steps will be legible to you.

Ben Klemens

BK served as director of the FSF’s End Software Patents campaign, and is the lead author of Apophenia (http://apophenia.info), a statistics library.