A lot of fellow students seem allergic to shell scripting, or even basic Unix / Linux functions. While I’m no expert, here are some extended fundamentals I think every computer scientist ought to know, to feel like a remote server is just as cozy as your personal laptop.
This guide assumes basic knowledge of how to navigate file systems and general coding intuition. It is also designed to be less of a manual and more of a very basic quickstart for common use cases.
Contents include:
- basic variables: how to define and use them, e.g. why you need to `export` that thing called `CUDA_VISIBLE_DEVICES`,
- viewing and searching through files, e.g. how to count the number of training instances without loading the file into Python, or how to find the files that call your function,
- manipulating strings (and files), e.g. merging files, renaming everything in a directory, etc.
Note: unless otherwise specified, anything surrounded by square brackets should be replaced, dropping the square brackets themselves, e.g. `[user]` becomes `pikachu`.
Variables
You can define variables like this.
Be careful not to add spaces around the =.
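For example (`data.csv` here is just a placeholder filename):

```bash
FILE=data.csv          # no spaces around the =
export FILE=data.csv   # export makes it visible to child processes too
```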
If you omit the word export, the variable FILE won't be available in child processes, i.e. if your Python script requires an environment variable, you must export it.
You can view the value of a variable with echo.
The prefix $ distinguishes between the variable FILE and the string
"FILE".
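Continuing with the `FILE` variable from above:

```bash
echo $FILE    # prints the value: data.csv
echo FILE     # prints the literal word: FILE
```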
You can also format strings and numbers quite intuitively, since printf formatting
is commonly found in many programming languages, including C and Python.
As you can also see, comments are prefixed by #.
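A small sketch, with made-up numbers:

```bash
# %s formats strings, %d integers, %.3f floats to three decimal places
printf "epoch %d: loss %.3f on %s\n" 7 0.421 "$FILE"
# -> epoch 7: loss 0.421 on data.csv
```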
A use case of this would be formatting commands without dumping everything on a single unreadable line.
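For instance, a hypothetical training run (the script name and flags are invented, but the pattern is the point):

```bash
# build up the pieces with variables; semicolons separate commands
DATA=data/train.csv; RUN=3
OUT=$(printf "runs/exp_%02d" "$RUN")   # -> runs/exp_03

# backslashes continue one long command across several lines
python train.py \
    --data "$DATA" \
    --output "$OUT" \
    --epochs 10
```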
A tidbit you might notice: you can separate commands with semicolons (like C and Java), and you can write multi-line commands with \ (like Python).
Accessing files
You probably already know how to view files. In the spirit of providing mathematical definitions for things we’ll use later though:
- `less [file]` scrolls through a file. It loads the file dynamically, so it's (relatively) fast for larger files. You can use vim-style navigation: `j` for down and `k` for up, `/[pattern]` to search for a pattern, `n` to find the next occurrence. Quit via `q`.
- `cat [file1] [file2]` concatenates and prints your file(s).
- `tail -[k] [file]` prints the last k lines of your file. This version seeks to the end and reads back however many blocks it needs.
- `tail -n +[k] [file]` prints all lines from line k to the end (1-indexed). This version reads the whole file.
- `head -[k] [file]` prints the first k lines of your file.
Why would you use this over Excel? Well, it's so much faster, for one…
Pipes and redirection
You can string together commands by “piping” the standard out of one command to
the standard in of another.
For example, suppose hello.txt is a file with a header on line 1, and the
numbers 1 to 10 on the following lines (2-11).
You can combine tail and head to extract the first two lines of actual
content.
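One way to write it:

```bash
tail -n +2 hello.txt | head -2
# 1
# 2
```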
You can also redirect the output of your command to a file via > or append to
existing files with >>.
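For example (`out.txt` is just a placeholder name):

```bash
echo "hello" > out.txt    # create (or overwrite) out.txt
echo "world" >> out.txt   # append to out.txt
```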
To redirect standard error instead, you would use 2> (or 2>> to append).
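A sketch, reusing the made-up training script from earlier:

```bash
python train.py 2> errors.log   # stderr goes to errors.log, stdout still prints
```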
Finally, if the program (that you probably didn’t write) is spitting out useless output that you find annoying, you can redirect everything to the void.
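For example:

```bash
[command] > /dev/null 2>&1
```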
/dev/null is the void that you’re redirecting standard output to, and 2>&1
sends standard error (to standard out) to the void too.
Learning via examples
I find that it’s slow to learn by reading documentation, so here are some useful snippets.
For loops (e.g. rename lots of files)
Suppose we have a directory of files formatted *_b.pdb
and we would like to rename them *_b_COMPLEX.pdb.
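One way to do it, using bash's `${var/pattern/replacement}` substitution:

```bash
for f in *_b.pdb; do
    mv "$f" "${f/_b.pdb/_b_COMPLEX.pdb}"
done
```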
The `${f/_b.pdb/_b_COMPLEX.pdb}` expansion on the `mv` line specifies that we want to replace `_b.pdb` with `_b_COMPLEX.pdb`.
The for loop syntax in shell is for [var] in [iterable]; do [commands]; done.
More generic versions:
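Generic sketches of the same pattern (the globs and ranges are placeholders):

```bash
# loop over files matching a glob
for f in [root]/*.csv; do echo "$f"; done

# loop over a range of numbers
for i in $(seq 1 10); do echo "run $i"; done
```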
Finding (and deleting) files
Find all files in [root] whose filenames match the specified pattern.
Uncomment -delete to delete them. Compared to vanilla rm -rf with a glob, this
searches recursively through subdirectories for matching filenames.
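Something along these lines:

```bash
# e.g. find . -name "*.tmp"
find [root] -name "[pattern]" # -delete
```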
Finding a function (without an IDE)
Search for [pattern] within [root] directory.
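The basic form:

```bash
grep -r "[pattern]" [root]   # -r recurses through subdirectories
```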
For example, grep -r "load_data" . searches for load_data starting from your
current directory.
Counting the number of training examples
Suppose you have a CSV file per data split (train/val/test). You can trivially count the number of lines in each file:
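Assuming one file per split (the filenames are placeholders):

```bash
wc -l train.csv val.csv test.csv   # note: header lines are counted too
```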
Now suppose these are all stored in a single CSV with a column named “split”, with values “train”, “val”, or “test”. You can search for each value and count the number of matching lines.
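Roughly, with `data.csv` standing in for the combined file:

```bash
grep -c "train" data.csv   # number of lines containing "train"
grep -c "val" data.csv
grep -c "test" data.csv
```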
Counting the number of unique identifiers
Suppose you have a CSV where the first column is a 4-digit identifier (looking at you PDB). You want to know how many of these are unique.
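A sketch (`sort` is included because `uniq` only collapses adjacent duplicates):

```bash
cut -c1-4 [file] | sort | uniq | wc -l
```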
Breakdown of commands:
- `cut -c1-4` will take the first four characters of each line (1-indexed, inclusive),
- `sort` puts identical IDs next to each other,
- `uniq` collapses adjacent duplicate lines, leaving one copy of each,
- `wc -l` will count the number of lines.
If you only want the unique lines themselves rather than the count, `sort [file] | uniq` can accomplish that alone.
Search and replace
We have our CSV from the previous example, with IDs in the first column. Suppose you want a list of the IDs to query an online service (hello UniProt ID mapper). Specifically, you’d like to
- strip everything after the first comma,
- for fun, let’s capitalize the IDs too,
- sort the entries for your sanity,
- remove duplicate entries,
- save the results to a file.
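One way to chain it all together (as the breakdown below notes, `FILE` and `out.txt` are the parts you swap out):

```bash
sed 's/,.*//g' FILE | tr '[:lower:]' '[:upper:]' | sort | uniq > out.txt
```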
Breakdown of commands:
- The square brackets here are real square brackets (part of the syntax), so you would replace `FILE` and `out.txt` instead.
- `sed` stands for stream editor. The first argument is a “command,” formatted the same way as vim search and replace. On a high level, `s/[pattern]/[replacement]/g`, where `g` means “replace all.”
- `tr` stands for translate, and we go from lowercase to uppercase.
- `sort` is self-explanatory.
- `uniq` is also self-explanatory.
End
Thank you for reading! Good night~