A lot of fellow students seem allergic to shell scripting, or even basic Unix / Linux functions. While I'm no expert, here are some fundamentals I think every computer scientist ought to know, so that a remote server feels just as cozy as your personal laptop.
This guide assumes basic knowledge of how to navigate file systems and general coding intuition. It is also designed to be less of a manual and more of a very basic quickstart for common use cases.
Contents include:
- basic variables: how to define and use them, e.g. why you need to `export` that thing called `CUDA_VISIBLE_DEVICES`,
- viewing and searching through files, e.g. how to count the number of training instances without loading the file into Python, or how to find the files that call your function,
- manipulating strings (and files), e.g. merging files, renaming everything in a directory, etc.

Note: unless otherwise specified, anything surrounded in square brackets should be overwritten without those square brackets, e.g. `[user]` to `pikachu`.
Variables
You can define variables like this.
Be careful not to add spaces around the `=`.
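For example (`data.csv` here is just a stand-in path):

```shell
export FILE=data.csv   # good: no spaces around =
```

If you wrote `FILE = data.csv` instead, the shell would try to run a command named `FILE` with `=` and `data.csv` as its arguments.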
If you omit the word `export`, the variable `FILE` won't be available in child processes, i.e. if your Python script requires an environment variable, you must export it.

You can view the value of a variable with `echo`. The prefix `$` distinguishes between the variable `FILE` and the string `"FILE"`.
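For example, with a stand-in value:

```shell
FILE=data.csv
echo $FILE   # prints the value: data.csv
echo FILE    # prints the literal string: FILE
```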
You can also format strings and numbers quite intuitively, since `printf` formatting is commonly found in many programming languages, including C and Python. As you can also see, comments are prefixed by `#`.
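A sketch with made-up values; `%s` takes a string and `%.2f` rounds a float to two decimal places:

```shell
# format a name and an accuracy, C-style
NAME=pikachu
ACC=0.987
printf "%s scored %.2f\n" "$NAME" "$ACC"   # pikachu scored 0.99
```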
A use case of this would be formatting commands without dumping everything on a single unreadable line.
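A sketch: `train.py` and its flags are hypothetical, but the pattern of building the command string with `printf` and then running it is the point.

```shell
FILE=data.csv
# build the command string across lines, then inspect it before running
CMD=$(printf "python train.py --data %s --lr %s" \
    "$FILE" 3e-4)
echo "$CMD"     # python train.py --data data.csv --lr 3e-4
# eval "$CMD"   # uncomment to actually run it
```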
A tidbit you might notice: you can separate commands with semicolons (like C and Java), and you can write multi-line commands with `\` (like Python).
Accessing files
You probably already know how to view files. In the spirit of providing mathematical definitions for things we’ll use later though:
- `less [file]`: scroll through a file. Loads the file dynamically, so it's (relatively) fast for larger files. You can use vim-style navigation: `j` for down and `k` for up, `/[pattern]` to search for a pattern, `n` to find the next occurrence. Quit via `q`.
- `cat [file1] [file2]`: concatenates and prints your file(s).
- `tail -[k] [file]`: prints the last k lines of your file. This version seeks to the end and reads back however many blocks.
- `tail -n +[k] [file]`: prints all lines from line k to the end (1-indexed). This version reads the whole file.
- `head -[k] [file]`: prints the first k lines of your file.
Why would you use this over Excel? Well, it's so much faster, for one…
Pipes and redirection
You can string together commands by “piping” the standard out of one command to
the standard in of another.
For example, suppose `hello.txt` is a file with a header on line 1, and the numbers 1 to 10 on the following lines (2-11). You can combine `tail` and `head` to extract the first two lines of actual content.
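A sketch that builds `hello.txt` first so you can try it:

```shell
printf "number\n%s\n" "$(seq 1 10)" > hello.txt   # header, then 1-10
tail -n +2 hello.txt | head -2   # drop the header, keep the first 2 lines
# 1
# 2
```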
You can also redirect the output of your command to a file via `>` or append to existing files with `>>`.
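For example (`log.txt` is a made-up filename):

```shell
echo "epoch 1 done"  > log.txt   # > creates or overwrites log.txt
echo "epoch 2 done" >> log.txt   # >> appends to it
cat log.txt
# epoch 1 done
# epoch 2 done
```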
To redirect the standard error instead, you would use `2>`. (The bash shorthand `&>` redirects both standard out and standard error.)
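A sketch using a stand-in command that writes to both streams:

```shell
# "result" goes to standard out, "warning!" to standard error
{ echo "result"; echo "warning!" >&2; } > out.log 2> err.log
cat out.log   # result
cat err.log   # warning!
```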
Finally, if the program (that you probably didn’t write) is spitting out useless output that you find annoying, you can redirect everything to the void.
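For example, with a stand-in for the chatty program:

```shell
# nothing survives: stdout goes to /dev/null, and stderr follows it
{ echo "progress: 10%"; echo "deprecation warning" >&2; } > /dev/null 2>&1
```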
`/dev/null` is the void that you're redirecting standard output to, and `2>&1` sends standard error (to standard out) to the void too.
Learning via examples
I find that it’s slow to learn by reading documentation, so here are some useful snippets.
For loops (e.g. rename lots of files)
Suppose we have a directory of files formatted `*_b.pdb` and we would like to rename them `*_b_COMPLEX.pdb`.
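A sketch of the loop; line 3 uses bash's `${var/pattern/replacement}` substitution:

```shell
for f in *_b.pdb
do
    mv "$f" "${f/_b.pdb/_b_COMPLEX.pdb}"
done
```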
Line 3 specifies that we want to replace `_b.pdb` with `_b_COMPLEX.pdb`. The `for` loop syntax in shell is `for [var] in [iterable]; do [commands]; done`.
More generic versions:
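For instance, looping over numbers or over a glob (the `.pdb` pattern is arbitrary):

```shell
for i in $(seq 1 3); do echo "run $i"; done   # over a sequence of numbers
for f in *.pdb; do echo "$f"; done            # over files matching a glob
```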
Finding (and deleting) files
Find all files in `[root]` whose filenames match the specified pattern. Uncomment `-delete` to delete them. Compared to vanilla `rm -rf`, this will search recursively.
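A sketch with a made-up pattern (stray checkpoint files):

```shell
find . -name "*.ckpt"            # list every match under the current directory
# find . -name "*.ckpt" -delete  # uncomment -delete to remove them
```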
Finding a function (without an IDE)
Search for `[pattern]` within the `[root]` directory.
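For example:

```shell
grep -r "load_data" .   # every line mentioning load_data, prefixed by filename
```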
For example, `grep -r "load_data" .` searches for `load_data` starting from your current directory.
Counting the number of training examples
Suppose you have a CSV file per data split (train/val/test). You can trivially count the number of lines in each file:
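Assuming your splits live in files named like these (adjust the names to taste); remember the header row counts as a line too:

```shell
wc -l train.csv val.csv test.csv   # one count per file, plus a total
```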
Now suppose these are all stored in a single CSV with a column named "split", with values "train", "val", or "test". You can search for the number of occurrences of each phrase in each file, and count the number of matches (lines).
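A sketch, with `data.csv` as a stand-in filename. Beware that this counts substring matches; `grep -cw` (whole words only) is safer if another column might also contain "train":

```shell
grep -c "train" data.csv   # lines containing "train"
grep -c "val"   data.csv
grep -c "test"  data.csv
```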
Counting the number of unique identifiers
Suppose you have a CSV where the first column is a 4-character identifier (looking at you, PDB). You want to know how many of these are unique.
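A sketch (`data.csv` is a stand-in). The `sort` in the middle matters, since `uniq` only collapses adjacent duplicates:

```shell
cut -c1-4 data.csv | sort | uniq | wc -l
```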
Breakdown of commands:
- `cut -c1-4` will take the first four characters of each line (1-indexed, inclusive),
- `sort` will put duplicate lines next to each other, since `uniq` only collapses adjacent duplicates,
- `uniq` will filter out the duplicate lines,
- `wc -l` will count the number of lines.

If you only want the unique lines themselves, `sort -u [file]` can accomplish that alone.
Search and replace
We have our CSV from the previous example, with IDs in the first column. Suppose you want a list of the IDs to query an online service (hello UniProt ID mapper). Specifically, you’d like to
- strip everything after the first comma,
- for fun, let’s capitalize the IDs too,
- sort the entries for your sanity,
- remove duplicate entries,
- save the results to a file.
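A sketch of the pipeline (replace `FILE` and `out.txt` with your own names):

```shell
sed 's/,.*//g' FILE | tr '[:lower:]' '[:upper:]' | sort | uniq > out.txt
```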
Breakdown of commands:
- The square brackets here are real square brackets (e.g. `tr`'s `[:lower:]` character class), so you would just replace `FILE` and `out.txt` instead.
- `sed` stands for stream editor. The first argument is a "command," formatted the same way as `vim` search and replace. On a high level, `s/[pattern]/[replacement]/g`, where `g` means "replace all."
- `tr` stands for translate, and we go from lowercase to uppercase.
- `sort` is self-explanatory.
- `uniq` is also self-explanatory.
End
Thank you for reading! Good night~