Text manipulation in the shell
Manipulating text files is a very frequent task that many beginners will attempt to do with a language like Python or R. While most modern general purpose and scientific programming languages have excellent libraries for text manipulation, it is often faster and less memory intensive to do these manipulations from the shell with specialized programs. This is consistent with the general Unix philosophy of "do one thing and do it well". This guide will demonstrate how to do some of the most basic text manipulation tasks from the shell.
In UNIX shells, every process has several output streams and can accept input as well. If not directed elsewhere, these outputs are simply printed to the screen for the user to see. The "standard output" normally carries the result of a successfully executed command, while the "standard error" normally carries diagnostic and error messages, although not all programs follow this distinction consistently. Most of the time, you will only be interested in redirecting standard output. To redirect standard output to another process, use a pipe:
COMMAND1 | COMMAND2
Pipes are one of the most fundamental and powerful concepts in UNIX shells because they allow you to string together multiple simpler commands to do something more complex, without having to save the output to a file at each step. In order for a pipe to function, the command receiving the standard output must accept "standard input". Not all commands support standard input - in some cases a command will only accept a file path as an argument, from which it will read its input.
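As a sketch of how pipes compose, the following pipeline (using hypothetical inline data generated with `printf`) extracts a field, sorts it, and counts the unique values:

```shell
# Count the login shells in some /etc/passwd-style records.
# The input here is hypothetical data generated inline with printf.
printf 'alice:/bin/bash\nbob:/bin/zsh\ncarol:/bin/bash\n' \
  | cut -d ':' -f 2 \
  | sort \
  | uniq -c
```

Each command does one simple job; the pipe hands its standard output to the next command's standard input, so no intermediate files are needed.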
Depending on your goals, the final output of the manipulations you wish to do may be something you need to simply print to your screen. However, there are many use cases where you will wish to save the standard output of a command to a file:
COMMAND1 > FILE1
You may also have an existing file that you want to append output to:
COMMAND >> FILE1
Note that if you try to append to a file which does not exist, it is automatically created the first time, so you don't need to worry about checking for the existence of the file.
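A minimal sketch of the difference between `>` and `>>`, using a hypothetical file name:

```shell
# '>' truncates (or creates) the file; '>>' appends to it.
echo 'first line' > demo.txt
echo 'second line' >> demo.txt
cat demo.txt   # shows both lines, in order
rm demo.txt    # clean up the demonstration file
```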
You can choose to redirect the standard error to a file instead of standard output:
COMMAND 2> FILE1
Or you can combine standard error with standard output and send both to a file. Note that the redirection to the file must come first, because redirections are processed left to right:
COMMAND > FILE1 2>&1
This approach also works for sending standard error to pipes:
COMMAND1 2>&1 | COMMAND2
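For example, `ls` writes its error messages to standard error; merging the streams with `2>&1` lets `grep` see them. The exact wording of the message varies by system, but it contains the file name:

```shell
# Without 2>&1 the error would bypass the pipe and go straight to the screen.
ls no-such-file-xyz 2>&1 | grep 'no-such-file-xyz'
```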
You can also redirect an existing file into the standard input of a command:
COMMAND < FILE1
The `cat` command can be used to send the contents of any file to standard output, printed to the screen:
cat FILE1
or to a pipe. Note that this is just a more verbose way of using `<` to redirect the file without `cat`:
cat FILE1 | COMMAND1
If the file you want to view is too long to fit on your screen, you can use a "pager" like `less` to view it, which allows you to scroll through the file with `vim`-style movement keys (see the Movement section of the Neovim page in the wiki for details).
less -S FILE1
You usually want to use the `-S` argument with `less` because it does not wrap long lines - instead you can scroll horizontally to view them.
Some of the tools in the Pagers and viewers section of this wiki can also view files with colorization or nice formatting for specific file types.
The `head` and `tail` commands allow you to send only the beginning or end of a file to standard output:
head FILE1
tail FILE1
You can use the `-n` argument to change the number of lines from the default of 10:
head -n 5 FILE1
tail -n 5 FILE1
To remove the first lines of a file, you can use `tail` with the `-n` argument and `+` in front of the number; `tail -n +NUM` prints from line NUM to the end:
tail -n +5 FILE1
This removes the first 4 lines, printing from line 5 onward.
and with `head` you can remove the last N lines by prefixing the number with `-` (this is a GNU coreutils extension):
head -n -5 FILE1
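These slicing options can be sketched on a small numbered file (note that the `head -n -N` form is a GNU coreutils extension):

```shell
seq 1 10 > nums.txt    # a file containing the numbers 1 through 10
head -n 3 nums.txt     # first 3 lines: 1 2 3
tail -n 3 nums.txt     # last 3 lines: 8 9 10
tail -n +4 nums.txt    # from line 4 onward: 4 5 6 7 8 9 10
head -n -3 nums.txt    # all but the last 3 lines (GNU only): 1 through 7
rm nums.txt
```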
One of the most common text manipulations is to replace a specific string of text with another (which may be nothing if you want to remove it). There are several tools available in UNIX shells for this purpose.
The `tr` command replaces (translates) a single character with another character:
tr 'a' 'b' < FILE1
Either character can be a whitespace character like `\t` or `\n`:
echo ${PATH} | tr ':' '\n'
You can also delete a character:
tr -d 'a' < FILE1
You can squeeze repeated characters into one with the `-s` argument:
tr -s ' ' < FILE1
This is especially useful for tabular files that have been "pretty printed" with an inconsistent number of spaces between each column!
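For example, squeezing the spaces in a pretty-printed table down to single characters makes the columns easy to split (the input here is hypothetical):

```shell
# Collapse runs of spaces, then turn the single spaces into tabs.
printf 'name   count\nfoo    3\n' | tr -s ' ' | tr ' ' '\t'
```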
`sed` is a stream editor designed to allow more complex replacements of text than `tr`. `sed` can even use regular expressions for matching the string of characters to replace - see the Regular Expressions section of the wiki for more details on this.
To replace (substitute) one text string with another, you use the `s` command:
sed -e 's/STRING1/STRING2/' FILE1
`sed` also accepts standard input:
echo ${VARIABLE} | sed -e 's/STRING1/STRING2/'
If you want to delete a search string entirely, simply replace it with nothing:
sed -e 's/STRING1//' FILE1
The `-e` argument tells `sed` to execute the command that follows. You can pass multiple commands in one invocation by repeating `-e`, or chain separate `sed` calls with a pipe:
sed -e 's/STRING1/STRING2/' -e 's/STRING3/STRING4/' FILE1
sed -e 's/STRING1/STRING2/' FILE1 | sed -e 's/STRING3/STRING4/'
Note that commands for `sed` must be quoted. Usually you will want to use single quotes so that your shell does not try to evaluate the command before sending it to `sed`. However, you will need to use double quotes if you need to substitute shell variables into your `sed` command:
sed -e "s/STRING1/${VARIABLE}/" FILE1
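A runnable sketch with a hypothetical variable name:

```shell
REPLACEMENT='planet'
# Double quotes let the shell expand ${REPLACEMENT} before sed parses the command.
echo 'hello world' | sed -e "s/world/${REPLACEMENT}/"   # -> hello planet
```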
Most examples you will see of substitution commands for `sed` use the `/` character to separate the search string and its replacement. However, this can be a problem if either the search or replacement string contains `/`, as is often the case for UNIX file paths. There are two ways to handle this. You can use a different separator, provided it does not occur in the search or replacement string:
sed -e 's?DIR1/PATH1?DIR2/PATH2?' FILE1
sed -e 's|DIR1/PATH1|DIR2/PATH2|' FILE1
The `?` and `|` characters are often a good choice here because they rarely occur in file paths.
You can also continue to use the `/` separator but instead use `\` to "escape" the `/` characters in the search and replacement strings, meaning `sed` will treat them as literal characters instead of separators:
sed -e 's/DIR1\/PATH1/DIR2\/PATH2/' FILE1
While this approach is valid, it can be very difficult to read!
By default, `sed` only substitutes the first instance of the search string on each line. If your file or standard input contains multiple instances of the search string and you want to replace all of them, you can add the `g` flag after your substitution command to do a global replacement:
sed -e 's/STRING1/STRING2/g' FILE1
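The difference is easy to see on a line with repeated matches:

```shell
echo 'cat sees cat' | sed -e 's/cat/dog/'    # -> dog sees cat (first match only)
echo 'cat sees cat' | sed -e 's/cat/dog/g'   # -> dog sees dog (all matches)
```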
You can also replace only the Nth instance on each line by using a number in place of `g` (e.g. `s/STRING1/STRING2/2` replaces only the second match on each line). For anything more selective than that, you will have to use more advanced tools or manually edit the text.
You can also delete the entire line that matches your query with `sed` using the `d` command:
sed -e '/STRING1/d' FILE1
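For instance, dropping noisy lines from a stream (hypothetical input):

```shell
printf 'keep me\nDEBUG: noisy line\nkeep me too\n' | sed -e '/DEBUG/d'
# -> keep me
# -> keep me too
```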
The examples shown so far have used the `-e` argument for cases where you want to send the standard output to a file or to the standard input of another command through a pipe. Sometimes you will want to do an in-place substitution or deletion in a file without making a new one. Use this with caution, as it is irreversible! You can make a backup of the original file if you are unsure of your changes or if the file cannot be easily regenerated.
To perform an in-place substitution without making a backup, use the `-i` argument with nothing else:
sed -i 's/STRING1/STRING2/g' FILE1
If you want to back up the original file, add a suffix after `-i` that `sed` will append to your backup:
sed -i.bak 's/STRING1/STRING2/g' FILE1
This will create `FILE1.bak` as a copy of the original before making any changes. `.bak` is commonly used in examples, but the suffix can be any string that is valid for UNIX file names.
An additional use case for `sed` beyond substitution and deletion is extracting a specific line number from a file. Although there are other ways of achieving this, using `sed` is one of the simplest and fastest. This approach uses a combination of the `q` and `d` commands:
sed 'LINE_NUMBERq;d' FILE
`sed` interprets this command as: delete (`d`) every line until `LINE_NUMBER` is reached, then print that line and quit (`q`).
Another very common text manipulation task is to filter or extract text from a file or standard input. Typically you will have a query string, similar to using `sed`, but instead of modifying the text, the matching lines are simply sent to standard output to be written to a file or passed on in a pipe.
`grep` is a very powerful tool for extracting lines of text matching a query string from a file or standard input. `grep` can use regular expressions for the query - see the Regular Expressions section of the wiki for more details on this.
The most basic operation with `grep` is to extract lines matching a string pattern:
grep 'STRING' FILE1
`grep` also accepts standard input:
COMMAND1 | grep 'STRING'
Search queries with `grep` must be quoted, and should usually use single quotes (`'`) so that the query is not evaluated by the shell before passing it to `grep`. However, if you need to evaluate a shell variable to construct your query, you will need to use double quotes (`"`) so the variable is expanded first:
grep "${VARIABLE}" FILE1
You may sometimes find that you want to keep lines which do not match a pattern you are searching for. You can invert your query with `-v`:
grep -v 'STRING1' FILE
This will return all lines in `FILE` which do not contain the query specified in `STRING1`.
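Both directions of filtering can be sketched on a small hypothetical log:

```shell
printf 'ok\nerror: disk full\nok again\n' > log.txt
grep 'error' log.txt      # -> error: disk full
grep -v 'error' log.txt   # -> ok
                          # -> ok again
rm log.txt
```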
`grep` is also useful for searching for the presence of a string across files. If you filter for a query in a single file as described above and the output is blank, the string is not present. Sometimes you do not know which files may contain the string you are trying to find. You can use the `-R` argument to do a recursive search of all the files in the current directory:
grep -R 'STRING'
To search a different directory than the current one, add it after the command:
grep -R 'STRING' DIR
The Text filtering part of the Application Directory in the wiki has other tools with similar functions to `grep`. One of the most important is `ripgrep`, which provides the `rg` command, a much faster drop-in replacement for `grep`.
Data stored in a tabular format is widely used in many computational disciplines. A "tabular format" means the data can be visualized as a rectangular table, where each row has the same number of columns, and each column has the same number of rows. The columns are separated from each other in each row by a consistent, single-character delimiter, usually `,` or a tab (represented in most UNIX software as `\t`), and missing or blank values are usually allowed for incomplete data. The data for all the rows in a single column is usually treated as being of the same "type" (i.e. numeric or text), but the data type of each column can and often does differ. The only exception to this structure is that the first row is often a "header" of column names, which are normally text strings even if the rest of the column is numeric.
Many tools have been developed to manipulate tabular data given its widespread use. Tabular files can in some sense be thought of as a very simple (but rather inefficient) database. In fact, SQL databases are just very large tables that use special indexing to speed up data retrieval and insertion and are able to handle many parallel queries very rapidly without compromising the integrity of the data. The database perspective on tabular data is why terminology from databases is sometimes used to describe tabular files. "Records" in databases correspond to rows, and "fields" correspond to columns. Databases are an advanced, specialized topic that will not be discussed here - extensive documentation is available for these tools elsewhere. The focus here will be on simple manipulations of tabular data.
One of the most basic tasks in tabular data manipulation is extracting specific columns. The `cut` tool is specifically designed for this:
cut -f 1 FILE1
The `-f` argument specifies the column (field) number to extract. You can extract multiple adjacent columns using a range:
cut -f 1-2 FILE1
and you can also extract multiple non-adjacent columns using `,` to separate the numbers:
cut -f 1,3 FILE1
By default, `cut` assumes the delimiter for your data is a tab (`\t`). You can specify an alternative delimiter with the `-d` argument:
cut -f 1 -d ',' FILE1
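Extraction can be sketched on a small hypothetical CSV file:

```shell
printf 'name,age,city\nalice,30,NYC\nbob,25,LA\n' > people.csv
cut -f 1 -d ',' people.csv     # first column: name / alice / bob
cut -f 1,3 -d ',' people.csv   # first and third columns
rm people.csv
```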
The complementary action to extracting columns is concatenating them from multiple files. The `paste` command is used for this:
paste FILE1 FILE2
By default, `paste` joins the columns with tabs (`\t`). You can change this with the `-d` argument:
paste -d ',' FILE1 FILE2
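A sketch with two hypothetical single-column files:

```shell
printf 'alice\nbob\n' > names.txt
printf '30\n25\n' > ages.txt
paste -d ',' names.txt ages.txt   # -> alice,30
                                  # -> bob,25
rm names.txt ages.txt
```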
TODO