
Text manipulation in the shell

Daniel Nachun edited this page Mar 31, 2024 · 6 revisions

Why manipulate text in the shell?

Manipulating text files is a very frequent task that many beginners will attempt to do with a language like Python or R. While most modern general purpose and scientific programming languages have excellent libraries for text manipulation, it is often faster and less memory intensive to do these manipulations from the shell with specialized programs. This is consistent with the general Unix philosophy of "do one thing and do it well". This guide will demonstrate how to do some of the most basic text manipulation tasks from the shell.

Outputs and inputs

In UNIX shells, every process has two output streams, and most processes accept input as well. If not redirected elsewhere, both output streams are simply printed to the screen for the user to see. The "standard output" normally carries the results of a successfully executed command, while the "standard error" normally carries error and diagnostic messages, although not all programs follow this distinction consistently. Most of the time, you will only be interested in redirecting standard output. To redirect standard output to another process, use a pipe:

COMMAND1 | COMMAND2

Pipes are one of the most fundamental and powerful concepts in UNIX shells because they allow you to string together multiple simpler commands to do something more complex, without having to save the output to a file at each step. In order for a pipe to function, the command receiving the standard output must accept "standard input". Not all commands support standard input - in some cases a command will only accept a file path as an argument, from which it will read its input.
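As a small sketch of this idea, the pipeline below chains three single-purpose commands: sort groups duplicate lines together, uniq -c counts each group, and sort -rn ranks the counts (the fruit names are just sample data):

```shell
# Count how often each line occurs, most frequent first:
# sort groups duplicates, uniq -c counts them, sort -rn ranks the counts.
printf 'apple\nbanana\napple\ncherry\napple\n' | sort | uniq -c | sort -rn
```

Each stage only needs to do one small job; the pipe handles moving the data between them.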

Depending on your goals, the final output of the manipulations you wish to do may be something you need to simply print to your screen. However, there are many use cases where you do wish to save the standard output of a command to a file:

COMMAND1 > FILE1

You may also have an existing file that you want to append output to:

COMMAND >> FILE1

Note that if you try to append to a file which does not exist, it is automatically created the first time, so you don't need to worry about checking for the existence of the file.
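To see the difference between the two operators concretely (demo.txt is just an illustrative file name):

```shell
# '>' truncates the file (or creates it); '>>' appends to the end.
echo 'first line'  > demo.txt   # demo.txt contains one line
echo 'second line' >> demo.txt  # now two lines
echo 'replaced'    > demo.txt   # truncated back to a single line
cat demo.txt                    # prints: replaced
rm demo.txt
```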

You can choose to redirect the standard error to a file instead of standard output:

COMMAND 2> FILE1

Or you can combine standard error with standard output and send both to a file:

COMMAND > FILE1 2>&1

Note that the order matters here: the redirection to the file must come first, because 2>&1 points standard error at wherever standard output is pointing at that moment.

This approach also works for sending standard error to pipes:

COMMAND1 2>&1 | COMMAND2
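For example, error messages can only be filtered with grep after they are merged into standard output. Here ls is assumed to be given a path that does not exist, so its message goes to standard error:

```shell
# Without 2>&1 the pipe would carry nothing, because ls writes the
# complaint about the missing path to standard error, not standard output.
ls /no/such/path 2>&1 | grep -c 'no/such/path'   # prints: 1
```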

You can also redirect an existing file into the standard input of a command:

COMMAND < FILE1

The cat command can be used to send the contents of any file to standard output, printing it to the screen:

cat FILE1

or to a pipe - but note that this is just a more verbose way of using < to redirect the file without cat:

cat FILE1 | COMMAND1

If the file you want to view is too long to fit on your screen, you can use a "pager" like less to view it, which allows you to scroll through the file with vim-style movement keys (see the Movement section of the Neovim page in the wiki for details).

less -S FILE1

You usually want to use the -S argument with less because it does not wrap long lines - instead you can scroll horizontally to view them.

Some of the tools in the Pagers and viewers section of this wiki can also view files with colorization or nice formatting for specific file types.

The head and tail commands allow you to send only the beginning or end of a file to standard output:

head FILE1
tail FILE1

You can use the -n argument to change the number of lines from the default of 10:

head -n 5 FILE1
tail -n 5 FILE1

To skip the first N lines of a file, use tail with the -n argument and + in front of the number; tail -n +N starts output at line N, so the first N-1 lines are removed:

tail -n +5 FILE1

and with head you can remove the last N lines by putting - in front of the number (supported by GNU head, but not by all implementations):

head -n -5 FILE1
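The two commands can also be combined through a pipe to extract a range of lines - a sketch using numbered sample input:

```shell
# Print lines 3 through 5: head keeps the first 5 lines, then
# tail -n +3 starts output at line 3 of that result.
printf '1\n2\n3\n4\n5\n6\n7\n' | head -n 5 | tail -n +3   # prints 3, 4, 5
```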

Replacing/substituting text

One of the most common text manipulations is to replace a specific string of text with another (which may be nothing if you want to remove it). There are several tools available in UNIX shells for this purpose.

tr

The tr command replaces (translates) a single character to another character:

tr 'a' 'b' < FILE1

This character can be a whitespace character like \t or \n:

echo ${PATH} | tr ':' '\n'

You can also delete a character:

tr -d 'a' < FILE1

You can squeeze repeated characters into one with the -s argument:

tr -s ' ' < FILE1

This is especially useful for tabular files that have been "pretty printed" with an inconsistent number of spaces between each column!
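For example, squeezing runs of spaces turns a pretty-printed table into one with a single-space delimiter that tools like cut can handle (the sample rows are made up):

```shell
# Collapse the uneven runs of spaces to a single space each:
printf 'alpha   1\nbeta      2\n' | tr -s ' '   # prints "alpha 1" and "beta 2"
```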

sed

sed is a stream editor designed to allow more complex replacements of text than tr. Sed can even use regular expressions for matching the string of characters to replace - see the Regular Expressions section of the wiki for more details on this.

To replace (substitute) one text string with another, you use the s command:

sed -e 's/STRING1/STRING2/' FILE1

sed also accepts standard input:

echo ${VARIABLE} | sed -e 's/STRING1/STRING2/'

If you want to delete a search string entirely, simply replace it with nothing:

sed -e 's/STRING1//' FILE1

The -e tells sed to execute the command that follows. You can pass multiple commands in one invocation by repeating -e:

sed -e 's/STRING1/STRING2/' -e 's/STRING3/STRING4/' FILE1

Note that commands for sed must be quoted. Usually you will want to use single quotes so that your shell does not try to evaluate the command before sending it to sed. However, you will need to use double quotes if you need to substitute shell variables into your sed command:

sed -e "s/STRING1/${VARIABLE}/" FILE1

Most examples you will see of substitution commands for sed use the / character to separate the search string and its replacement. However, this can be a problem if either the search or replacement string contains /, as is often the case for UNIX file paths. There are two ways to handle this. You can use a different separator, provided it does not occur in the search or replacement string:

sed -e 's?DIR1/PATH1?DIR2/PATH2?' FILE1
sed -e 's|DIR1/PATH1|DIR2/PATH2|' FILE1

The ? and | characters are often a good choice here because they rarely occur in file paths.

You can also continue to use the / separator but instead use \ to "escape" the / characters in the search and replacement strings, meaning sed will treat them as literal characters instead of separators:

sed -e 's/DIR1\/PATH1/DIR2\/PATH2/' FILE1

While this approach is valid, it can be very difficult to read!

By default, sed only substitutes the first instance of the search string on each line with its replacement. If your file or standard input contains multiple instances of the search string and you want to replace all of them, you can add the g flag after your substitution command to do a global replacement:

sed -e 's/STRING1/STRING2/g' FILE1

You can also replace only the Nth instance of the search string on each line by using a number as the flag in place of g. For more selective replacements than this - for example, a specific instance across the whole file - you will have to use more advanced tools or manually edit the text.
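As a sketch: POSIX sed accepts a number as the substitution flag, which selects the Nth match on each line:

```shell
# Replace only the second 'a' on the line:
echo 'banana' | sed -e 's/a/X/2'   # prints: banXna
```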

You can also delete the entire line that matches your query with sed using the d command:

sed -e '/STRING1/d' FILE1

The examples shown so far have used the -e option for cases where you want to send the standard output to a file or to the standard input of another command through a pipe. Sometimes you will want to do an in-place substitution or deletion in a file without making a new one. Use this with caution, as it is irreversible! You can make a backup of the original file if you are unsure of your changes or if the file cannot be easily regenerated.

To perform an in-place substitution without making a backup, use -i with nothing else (note that BSD/macOS sed requires an argument to -i, so there you must pass an empty string: sed -i ''):

sed -i 's/STRING1/STRING2/g' FILE1

If you want to back up the original file, add a suffix after -i that sed will append to your backup:

sed -i.bak 's/STRING1/STRING2/g' FILE1

This will create FILE1.bak as a copy of the original before making any changes. .bak is commonly used in examples, but the suffix can be any string that is valid for UNIX file names.

An additional use case for sed beyond substitution and deletion is to extract a specific line number from a file. Although there are other ways of achieving this, using sed is one of the simplest and fastest. This approach uses a combination of the q and d commands:

sed 'LINE_NUMBERq;d' FILE

sed interprets this command as follows: the d command deletes (suppresses) every line it reads, but when LINE_NUMBER is reached, the q command prints that line and quits, so only that line is sent to standard output and the rest of the file is never read.
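A concrete sketch with sample input, extracting line 3:

```shell
# 'd' suppresses every line; on line 3, 'q' prints it and quits.
printf 'one\ntwo\nthree\nfour\n' | sed '3q;d'   # prints: three
```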

Filtering and extracting text

Another very common text manipulation task is to filter or extract text from a file or standard input. Typically you will have a query string similar to using sed but instead of modifying the text, the matching lines are simply sent to standard output to be written to a file or passed in a pipe.

grep

grep is a very powerful tool for extracting lines of text matching a query string from a file or standard input. grep can use regular expressions for the query - see the Regular Expressions section of the wiki for more details on this.

The most basic operation to use with grep is to extract lines matching a string pattern:

grep 'STRING' FILE1

grep also accepts standard input:

COMMAND1 | grep 'STRING'

Search queries with grep must be quoted, and should usually use single quotes (') so that the query is not evaluated by the shell before passing it to grep. However if you need to evaluate a shell variable to construct your query, you will need to use double quotes (") to evaluate the variable first:

grep "${VARIABLE}" FILE1

You may sometimes find that you want to keep lines which do not match a pattern you are searching for. You can "reverse" your query with -v:

grep -v 'STRING1' FILE

This will return all lines in FILE which do not contain the query specified in STRING1.
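A common use of this is stripping comment lines - here, lines beginning with # in made-up config-style input:

```shell
# Keep only the lines that do NOT start with '#':
printf '# comment\nvalue=1\n# another\nvalue=2\n' | grep -v '^#'
```

This prints only the value=1 and value=2 lines.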

grep is also useful for searching for the presence of a string in files. If you filter for a query in a single file as described above and the output is blank, this indicates the string is not present. Sometimes you do not know which files may contain the string you are trying to find. You can use the -R argument to do a recursive search of all the files in the current directory:

grep -R 'STRING'

To search a different directory than the current one, add it after the command:

grep -R 'STRING' DIR

Alternatives

The Text filtering part of the Application Directory in the wiki has other tools with similar functions to grep. One of the most important ones is ripgrep, which provides the rg command that is a much faster drop-in replacement for grep.

Tabular data manipulation

Data stored in a tabular format is widely used in many computational disciplines. A "tabular format" means the data can be visualized as a rectangular table, where each row has the same number of columns, and each column has the same number of rows. The columns are separated from each other in each row by a consistent, single-character delimiter, usually a comma (,) or a tab (represented in most UNIX software as \t), and missing or blank values are usually allowable for incomplete data. The data for all the rows in a single column is usually treated as being of the same "type" (i.e. numeric or text), but the data type of each column can and often does differ. The only exception to this structure is that the first row is often considered a "header" with column names that are normally text strings even if the rest of the column is numeric.

Many tools have been developed to manipulate tabular data given its widespread use. Tabular files can in some sense be thought of as a very simple (but rather inefficient) database. In fact, SQL databases are just very large tables that use special indexing to speed up data retrieval and insertion and are able to handle many parallel queries very rapidly without compromising the integrity of the data. The database perspective on tabular data is why terminology from databases is sometimes used to describe tabular files. "Records" in databases correspond to rows, and "fields" correspond to columns. Databases are an advanced, specialized topic that will not be discussed here - extensive documentation is available for these tools elsewhere. The focus here will be on simple manipulations of tabular data.

Extracting columns

One of the most basic tasks in tabular data manipulation is extraction of specific columns. The cut tool is specifically designed for this:

cut -f 1 FILE1

The -f argument specifies the column (field) number to extract. You can extract multiple adjacent columns using a range:

cut -f 1-2 FILE1

and you can also extract multiple columns which are not adjacent using , to separate the numbers:

cut -f 1,3 FILE1

By default, cut assumes the delimiter for your data is tabs (\t). You can specify an alternative delimiter with the -d argument:

cut -f 1 -d ',' FILE1
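Putting this together on a small comma-separated sample (the names and ages are made up):

```shell
# Extract the second column of a CSV-style input:
printf 'name,age\nalice,30\nbob,25\n' | cut -f 2 -d ','   # prints age, 30, 25
```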

Concatenating columns

The complementary action to extracting columns is to concatenate them from multiple files. The paste command is used for this:

paste FILE1 FILE2

By default, paste concatenates the columns with tabs (\t). You can change this with the -d argument:

paste -d ',' FILE1 FILE2
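A minimal sketch, using two throwaway files (the names col1.txt and col2.txt are illustrative):

```shell
# Join two single-column files side by side with a comma delimiter:
printf 'a\nb\n' > col1.txt
printf '1\n2\n' > col2.txt
paste -d ',' col1.txt col2.txt   # prints: a,1 and b,2
rm col1.txt col2.txt
```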

awk

TODO