practical unix for ml [1/n]

A lot of fellow students seem allergic to shell scripting, or even basic Unix / Linux tools. While I’m no expert, here are some extended fundamentals I think every computer scientist ought to know, so that a remote server feels just as cozy as your personal laptop.

This guide assumes basic knowledge of how to navigate file systems, plus general coding intuition. It is also designed to be less of a manual and more of a very basic quickstart for common use cases.

Contents include:

  • basic variables: how to define and use them, e.g. why you need to export that thing called CUDA_VISIBLE_DEVICES,
  • viewing and searching through files, e.g. how to count the number of training instances without loading the file into Python; or how to find the files that call your function,
  • manipulating strings (and files), e.g. merging files, renaming everything in a directory, etc.

Note: unless otherwise specified, anything in square brackets is a placeholder; replace it (brackets included) with your own value, e.g. [user] becomes pikachu.

Variables

You can define variables like this. Be careful not to add spaces around the =.

export FILE="hello.txt"

If you omit the word export, the variable FILE won’t be visible to child processes, i.e. if your Python script reads an environment variable (like CUDA_VISIBLE_DEVICES), you must export it.
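
A quick way to see the difference (the variable name here is just an example, and this assumes python3 is on your path):

$  MY_FLAG="on"
$  python3 -c "import os; print(os.environ.get('MY_FLAG'))"
None
$  export MY_FLAG
$  python3 -c "import os; print(os.environ.get('MY_FLAG'))"
on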

You can view the value of a variable with echo. The prefix $ distinguishes between the variable FILE and the string "FILE".

$  echo $FILE
hello.txt
$  echo FILE
FILE

You can also format strings and numbers quite intuitively, since printf-style formatting shows up in many programming languages, including C and Python. As you can see below, comments are prefixed with #.

$  echo "My file is named ${FILE}."  # string
My file is named hello.txt.
$  printf "%05d\n" 123  # number padded to 5 digits
00123
$  printf "%.3f\n" 123  # number with 3 decimal places
123.000

One use case: building readable run names and writing long commands without dumping everything onto a single unreadable line.

$  MODEL="esm2_t30_150M_UR50D"; DATA="stability"; LR=1e-4
$  RUN_NAME="model=${MODEL}-data=${DATA}-lr=${LR}"
$  # looks like "model=esm2_t30_150M_UR50D-data=stability-lr=1e-4"
$  python main.py \
      --model $MODEL \
      --data $DATA \
      --lr $LR \
      --run_name $RUN_NAME
...

A tidbit you might notice: you can separate commands with semicolons (like C and Java), and you can write multi-line commands with \ (like Python).

Accessing files

You probably already know how to view files. In the spirit of providing mathematical definitions for things we’ll use later though:

  • less [file] scroll through a file. Loads file dynamically, so it’s (relatively) fast for larger files. You can use vim-style navigation: j for down and k for up, /[pattern] to search for pattern, n to find the next occurrence. Quit via q.
  • cat [file1] [file2] concatenates and prints your file(s).
  • tail -[k] [file] (equivalently tail -n [k] [file]) prints the last k lines of your file. This version seeks to the end and reads back however many blocks.
  • tail -n +[k] [file] prints all lines from line k to end (1-indexed). This version reads the whole file.
  • head -[k] [file] prints first k lines of your file.
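
A toy demonstration of head and tail, on a file we invent on the spot:

$  printf "id,label\n1abc,0.95\n2xyz,0.12\n" > toy.csv
$  head -2 toy.csv
id,label
1abc,0.95
$  tail -1 toy.csv
2xyz,0.12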

Why would you use this over Excel? Well it’s so much faster for one…

Pipes and redirection

You can string together commands by “piping” the standard out of one command to the standard in of another. For example, suppose hello.txt is a file with a header on line 1, and the numbers 1 to 10 on the following lines (2-11). You can combine tail and head to extract the first two lines of actual content.

$  tail -n +2 hello.txt | head -2
1
2

You can also redirect the output of your command to a file via > or append to existing files with >>.

$  tail -n +2 hello.txt | head -2 > hello2.txt
$  cat hello2.txt
1
2
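
To see >> in action, we can append one more line to hello2.txt from the previous example:

$  echo "3" >> hello2.txt
$  cat hello2.txt
1
2
3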

To capture error messages too, you can use 2> (standard error only) or, in bash, &> (both standard output and standard error).

$  tail --bad hello.txt  # fake flag for tail
tail: unrecognized option '--bad'
usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]
$  tail --bad hello.txt &> error.txt  # redirect both streams (here, just the error) to a file
$  cat error.txt
tail: unrecognized option '--bad'
usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]

Finally, if the program (that you probably didn’t write) is spitting out useless output that you find annoying, you can redirect everything to the void.

$  command > /dev/null 2>&1

/dev/null is the void that you’re redirecting standard output to, and 2>&1 redirects standard error to wherever standard output is pointing, i.e. the void too. Note that the order matters: the > /dev/null must come before the 2>&1.
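
To see why the 2>&1 matters, compare the two commands below, reusing the bogus tail flag from earlier (the exact error message depends on your tail implementation):

$  tail --bad hello.txt > /dev/null        # stdout is discarded, but the error still prints
tail: unrecognized option '--bad'
$  tail --bad hello.txt > /dev/null 2>&1   # silence: stderr follows stdout into the void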

Learning via examples

I find that it’s slow to learn by reading documentation, so here are some useful snippets.

For loops (e.g. rename lots of files)

Suppose we have a directory of files formatted *_b.pdb and we would like to rename them *_b_COMPLEX.pdb.

$  for FILE in directory/*_b.pdb
   do
     mv "$FILE" "${FILE/_b.pdb/_b_COMPLEX.pdb}"
   done

The mv line uses bash’s ${var/pattern/replacement} expansion to replace _b.pdb with _b_COMPLEX.pdb in each filename.
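
You can try that expansion on a single value to see what it produces (the path here is made up):

$  FILE="directory/1abc_b.pdb"
$  echo "${FILE/_b.pdb/_b_COMPLEX.pdb}"
directory/1abc_b_COMPLEX.pdb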

The for loop syntax in shell is for [var] in [iterable]; do [commands]; done. More generic versions:

$  for VAR in {1..4}; do echo $VAR; done
1
2
3
4
$  for STR in a b c; do echo $STR; done
a
b
c
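
A more ML-flavored use of the same pattern, sketched under the assumption that your training script accepts --seed and --run_name flags:

$  for SEED in 0 1 2
   do
     python main.py --seed $SEED --run_name "seed=${SEED}"
   done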

Finding (and deleting) files

Find all files in [root] whose filenames match the specified pattern. Uncomment -delete to delete them. Unlike a plain rm with a glob, this searches subdirectories recursively and lets you see exactly what will be removed before you commit.

$  find [root] -type f -name "[filename]" # -delete
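
For example, to clean up stale checkpoints under a (hypothetical) checkpoints/ directory, you might first list everything untouched for over a week, then add -delete once you trust the list:

$  find checkpoints/ -type f -name "*.ckpt" -mtime +7           # list checkpoints older than a week
$  find checkpoints/ -type f -name "*.ckpt" -mtime +7 -delete   # same, but actually delete them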

Finding a function (without an IDE)

Search for [pattern] within [root] directory.

$  grep -r "[pattern]" [root]

For example, grep -r "load_data" . searches for load_data starting from your current directory.
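
Two flags worth knowing on top of that: -n prefixes each match with its line number, and -l prints only the names of files that contain a match.

$  grep -rn "load_data" .  # each match printed as path:line_number:matching_line
$  grep -rl "load_data" .  # just the file names that contain a match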

Counting the number of training examples

Suppose you have a CSV file per data split (train/val/test). You can trivially count the number of lines in each file:

$  wc -l *.csv
100   train.csv
 10   val.csv
 10   test.csv

Now suppose all splits live in a single CSV with a column named “split”, whose values are “train”, “val”, or “test”. You can grep for each value and count the number of matching lines.

$  grep "train" [file]
line1, "train"
line2, "train"
...
$  grep "train" [file] | wc -l
100
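
grep can also do the counting for you: -c reports the number of matching lines directly.

$  grep -c "train" [file]
100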

Counting the number of unique identifiers

Suppose you have a CSV whose first column is a 4-character identifier (looking at you, PDB). You want to know how many of these are unique.

$  cut -c1-4 [file] | sort | uniq | wc -l

Breakdown of commands:

  • cut -c1-4 will take the first four characters of each line (1-indexed, inclusive),
  • sort will group identical IDs next to each other,
  • uniq will collapse adjacent duplicate lines (which is why we sort first),
  • wc -l will count the number of lines.

If you want to see the unique IDs themselves rather than just count them, drop the wc -l (or use sort -u, which combines the sort and uniq steps).
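
To see why the sort matters, here is a toy example where the duplicate IDs are not adjacent; plain uniq misses the repeat, while sort | uniq catches it:

$  printf "1abc\n2xyz\n1abc\n" | uniq | wc -l
3
$  printf "1abc\n2xyz\n1abc\n" | sort | uniq | wc -l
2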

Search and replace

We have our CSV from the previous example, with IDs in the first column. Suppose you want a list of the IDs to query an online service (hello UniProt ID mapper). Specifically, you’d like to

  1. strip everything after the first comma,
  2. for fun, let’s capitalize the IDs too,
  3. sort the entries for your sanity,
  4. remove duplicate entries,
  5. save the results to a file.
$  export FILE="..."
$  sed "s/,.*//g" $FILE | tr "[:lower:]" "[:upper:]" | sort | uniq > out.txt

Breakdown of commands:

  • Here the placeholders are the value of FILE and the output name out.txt; the square brackets in [:lower:] and [:upper:] are literal (they are tr character classes), so leave them alone.
  • sed stands for stream editor. The first argument is a “command,” formatted the same way as vim search and replace. At a high level, s/[pattern]/[replacement]/g, where the trailing g means “replace all occurrences on each line.”
  • tr stands for translate, and we go from lowercase to uppercase.
  • sort is self-explanatory.
  • uniq is also self-explanatory.
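
A toy run of the whole pipeline, on a file invented on the spot:

$  printf "1abc,0.95\n1abc,0.87\n2xyz,0.10\n" > ids.csv
$  sed "s/,.*//g" ids.csv | tr "[:lower:]" "[:upper:]" | sort | uniq
1ABC
2XYZ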

End

Thank you for reading! Good night~
