January 24, 2017

Finding Interesting Documents with grep

The grep command is a very powerful way to find documents on your computer. You can use grep to see if a file contains a word or use one of many forms of regular expression to search for a pattern instead. Grep can check the file that you specify or can search an entire tree of your filesystem recursively looking for matching files.

One of the most basic ways to use grep is shown below, looking for the lines of a file that match a pattern. I limit the search to a single file, sample.txt, and the -i option makes the search case-insensitive. As you can see, the only match for the string "this" is the line containing the capitalized string "This".

$ cat sample.txt
This is the sample file.
It contains a few lines of text
that we can use to search for things.
Samples of text
and seeking those samples
there can be many matches
but not all of them are fun
so start searching for samples
start looking for text that matches

$ grep -i this sample.txt 
This is the sample file.
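
If you only need to know whether a file contains a word, rather than see the matching lines, you can rely on grep's exit status. The following is a minimal sketch under a few assumptions: the temporary file, its contents, and the variable names are made up for illustration, and -q is used to suppress the normal output.

```shell
# Create a throwaway file so the example is self-contained.
tmpfile=$(mktemp)
printf '%s\n' 'This is the sample file.' 'It contains a few lines of text' > "$tmpfile"

# -q prints nothing; grep exits 0 when a line matches and 1 when none does.
if grep -q -i this "$tmpfile"; then
    found_this=yes
else
    found_this=no
fi

if grep -q banana "$tmpfile"; then
    found_banana=yes
else
    found_banana=no
fi

echo "this: $found_this, banana: $found_banana"
rm -f "$tmpfile"
```

Testing the exit status this way is handy inside scripts, where you usually care about a yes or no answer rather than the matching text.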

The -A, -B, and -C options to grep let you see a little more context than the single line that matched. These options let you specify the number of trailing lines, preceding lines, or both trailing and preceding lines to print, respectively. Groups of matches are separated by a "--" line so you can clearly see the context for each match in the results. Notice that the last example, which uses -C 1 to grab both the preceding and trailing line, shows four lines in its last group. This is because there are two matches (the middle two lines) on adjacent lines, so their contexts are merged into one group.

$ grep -A 2 It sample.txt 
It contains a few lines of text
that we can use to search for things.
Samples of text

$ grep -C 1 -i the sample.txt 
This is the sample file.
It contains a few lines of text
--
and seeking those samples
there can be many matches
but not all of them are fun
so start searching for samples
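
The two commands above cover -A and -C, so here is a small sketch of -B for completeness. It assumes a copy of the same sample file written to a temporary directory (the directory and the variable names are only for illustration) and asks for one line before each match of "fun".

```shell
# Recreate the sample file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
This is the sample file.
It contains a few lines of text
that we can use to search for things.
Samples of text
and seeking those samples
there can be many matches
but not all of them are fun
so start searching for samples
start looking for text that matches
EOF

# -B 1 prints one line of leading context before each matching line.
before=$(grep -B 1 fun "$workdir/sample.txt")
echo "$before"

rm -rf "$workdir"
```

Only one line contains "fun", so the result is that line preceded by the line above it.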

The -n option shows the line number of each line that is printed. Below I grab one line before and one line after the match and see the line numbers, too. Matching lines have a ":" after the line number, while context lines have a "-".

$ grep -n -C 1 tha sample.txt 
2-It contains a few lines of text
3:that we can use to search for things.
4-Samples of text
--
8-so start searching for samples
9:start looking for text that matches

Digging through a bunch of files

You can get grep to recurse into a directory using the -R option. When you use this, the matching file name is shown on the output as well as the match itself. When you combine -R with -n the file name is first shown, then the line number, and then the matching line.

$ grep -R sample .
./subdir/sample3.txt:another sample in a sub directory
./sample.txt:This is the sample file.
./sample.txt:and seeking those samples
./sample.txt:so start searching for samples
./sample2.txt:This is the second sample file

$ grep -n -R sample .
./subdir/sample3.txt:1:another sample in a sub directory
...
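
When a recursive search produces a lot of output, you may only want the names of the files that matched. The sketch below assumes a small scratch tree resembling the one above (the notes.txt file and the variable names are invented for illustration) and uses the -l option, which prints each matching file name once instead of the matching lines. The order in which grep walks a directory is not guaranteed, so the result is sorted.

```shell
# Build a small directory tree to search.
workdir=$(mktemp -d)
mkdir "$workdir/subdir"
echo 'This is the sample file.' > "$workdir/sample.txt"
echo 'This is the second sample file' > "$workdir/sample2.txt"
echo 'another sample in a sub directory' > "$workdir/subdir/sample3.txt"
echo 'nothing to see here' > "$workdir/notes.txt"

# -l lists only the names of files with at least one match.
matches=$(cd "$workdir" && grep -R -l sample . | LC_ALL=C sort)
echo "$matches"

rm -rf "$workdir"
```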

If you have some subdirectories that you don't want searched, the --exclude-dir option tells grep to skip over them. Notice that I have used single quotes around the sub* glob below. The difference can be seen in the last two commands, where I use echo to show the command itself rather than execute it. Notice that in the final command the shell has expanded the unquoted sub* into subdir for me. If you have subdir1 and subdir2 and use the pattern sub*, your shell will expand that glob into the two directory names, which will confuse grep because it expects a single glob. If in doubt, enclose the directory to exclude in single quotes as shown in the first command below.

$ grep -R --exclude-dir 'sub*' sample .
./sample.txt:This is the sample file.
./sample.txt:and seeking those samples
./sample.txt:so start searching for samples
./sample2.txt:This is the second sample file

$ echo grep -R --exclude-dir 'sub*' sample .
grep -R --exclude-dir sub* sample .

$ echo grep -R --exclude-dir sub* sample .
grep -R --exclude-dir subdir sample .
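
To make the subdir1 and subdir2 case concrete, here is a sketch that builds such a tree and compares the two forms. The directory layout, file names, and variable names are assumptions for the sake of the example.

```shell
# Scratch tree with two directories that both match sub*.
workdir=$(mktemp -d)
mkdir "$workdir/subdir1" "$workdir/subdir2"
echo 'a sample at the top level' > "$workdir/top.txt"
echo 'a sample in subdir1' > "$workdir/subdir1/one.txt"
echo 'a sample in subdir2' > "$workdir/subdir2/two.txt"

# Unquoted: the shell replaces sub* with both directory names before
# grep ever runs, so echo shows what grep would really receive.
unquoted=$(cd "$workdir" && echo grep -R --exclude-dir sub* sample .)

# Quoted: grep receives the glob itself and skips both directories.
quoted=$(cd "$workdir" && grep -R --exclude-dir 'sub*' sample .)

echo "$unquoted"
echo "$quoted"

rm -rf "$workdir"
```

In the unquoted form, subdir2 ends up where grep expects the search pattern, which is why the command no longer does what you meant.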

Although the recursion built into grep is handy, you might like to combine the find and grep commands. It can be useful to run the find command by itself first to see which files you will be executing grep on. The find command below uses a shell-style glob pattern (not a regular expression) on the file names to limit the files considered to those with the number 2 or 3 in their name that end in txt. The -type f limits the output to regular files.

$ find . -name '*[23]*txt' -type f
./subdir/sample3.txt
./sample2.txt

You then tell find to execute a command for each file that is found instead of just printing the file name using the -exec option to find. It is convenient to use the -H option to grep to print the filename for each match. You may recall that grep will give you -H by default when run on many files. Using -H can be handy in case find only finds a single file; if that file matches, it is good to know what the file name is as well as the matches.

$ find . -name '*[23]*txt' -type f -exec grep -H sampl {} +
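
The effect of -H is easiest to see when find hands grep exactly one file. The sketch below assumes a scratch directory in which only sample2.txt matches the file name pattern (the directory and variable names are illustrative) and runs the same search with and without -H.

```shell
# Only sample2.txt has a 2 or 3 in its name.
workdir=$(mktemp -d)
echo 'This is the sample file.' > "$workdir/sample.txt"
echo 'This is the second sample file' > "$workdir/sample2.txt"

# With a single file argument, grep omits the file name by default.
without_h=$(cd "$workdir" && find . -name '*[23]*txt' -type f -exec grep sampl {} +)

# -H forces the file name to be printed even for a single file.
with_h=$(cd "$workdir" && find . -name '*[23]*txt' -type f -exec grep -H sampl {} +)

echo "$without_h"
echo "$with_h"

rm -rf "$workdir"
```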

For dealing with common file types, like source code, it might be convenient to define a bash function such as the one below to "Recursively Grep SRC code". The search is limited to C/C++ source using file name matching. Many possible extensions are chained together using the -o argument to find, meaning "OR". The "$1" in the grep command is the first argument given to RGSRC, which is passed on to grep as the pattern. The last command searches for the string "Ferris" in any C/C++ source code in the current directory or any subdirectory.

$ cat ~/.bashrc
...
RGSRC() {
 find . \( -name "*.hh" -o -name "*.cpp" -o -name "*.hpp" -o -name "*.h" -o -name "*.c" \) \
    -exec grep -H "$1" {} +
}
...

$ RGSRC Ferris
...
./Ferris.cpp:using namespace Ferris::RDFCore;
...

Regular Expressions

While I have been searching for a single word using grep so far, you can define what you want using regular expressions. There is support in grep for basic, extended, and Perl-compatible regular expressions. Basic regular expressions are the default.

Regular expressions let you define a pattern for what you are after. For example, the regular expression '[Ss]imple' will match the strings 'simple' and 'Simple'. This is different from using -i to perform a case-insensitive search, because 'sImple' will not be considered a match for the above regular expression. Any one of the characters inside the square brackets can match, so exactly one 'S' or 's' is allowed before the remaining string 'imple'. You can have many characters inside the square brackets and also define an inversion. For example, [^F]oo will match any character other than 'F' followed by two lowercase 'o' characters. If you want to find the '[' character itself, you have to escape its special meaning by preceding it with a backslash.

To match any single character, use the full stop ('.'). If you follow a character or square-bracketed match with '*', it will match zero or more times. To match one or more times, use '+' instead. So '[B]*ar' will match 'ar', 'Bar', 'BBar', 'BBBar', and so on. You can also use {n} to match exactly n times and {n,m} to match at least n times but no more than m times. To use the '+' and {n,m} modifiers as written here, you will have to enable extended regular expressions using the -E option.

These are some of the more fundamental parts of a regular expression. There are more, and you can define some very sophisticated patterns to find exactly what you are after. The first command below will find sek, seek, or seeek in the sample file. The second command will find the strings 'many' or 'matches' in the file.

$ grep -E  's[e]{1,3}k' sample.txt 
and seeking those samples

$ grep -E  'ma(ny|tches)' sample.txt 
there can be many matches
start looking for text that matches
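
The bracket, inversion, and repetition examples described above can be tried out against a small word list. This is only a sketch: the word list and variable names are made up, and -x is added to the repetition examples so that the whole line has to match the pattern.

```shell
# One word per line to test the patterns against.
words=$(mktemp)
printf '%s\n' simple Simple sImple Foo Boo zoo ar Bar BBar > "$words"

# Bracket expression: exactly one of S or s, then "imple".
bracket=$(grep '[Ss]imple' "$words")

# -i is broader: it also accepts sImple.
insensitive=$(grep -i 'simple' "$words")

# Inverted bracket: any character except F, followed by "oo".
not_f=$(grep '[^F]oo' "$words")

# '*' allows zero or more B characters before "ar".
star=$(grep -x 'B*ar' "$words")

# '+' needs -E and requires at least one B.
plus=$(grep -E -x 'B+ar' "$words")

printf '%s\n' "$bracket" "$insensitive" "$not_f" "$star" "$plus"
rm -f "$words"
```

Note how 'ar' is matched by the '*' form but not by the '+' form, and how 'Foo' is rejected by the inverted bracket.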

Looking across lines

The grep command works on a line-by-line basis. This means that if you are looking for two words together, then you will have some trouble matching one word at the end of one line and the second word at the start of the next line. So finding the person 'John Doe' will work unless the Doe happens to be the first word of the next line.

Although there are other tools, such as awk and Perl, that will allow you to search over multiple lines, you might like to use pcregrep to get the job done. On Fedora, you will have to install the pcre-tools package.

The command below will find the string 'text that' with the words separated by any amount of whitespace. In this case, whitespace also includes the newline.

$ pcregrep -M 'text[\s]*that' sample.txt
It contains a few lines of text
that we can use to search for things.
start looking for text that matches
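
If pcregrep is not available, a rough fallback (not the pcregrep approach above, just an assumption about what might be good enough) is to join the lines with tr before handing the text to grep. The sketch below uses a made-up two-line file and the 'John Doe' example from earlier. Because the line structure is thrown away, this only tells you whether the phrase occurs, not where.

```shell
# "John" ends one line and "Doe" starts the next.
notes=$(mktemp)
printf '%s\n' 'Yesterday we met John' 'Doe at the station.' > "$notes"

# Line by line, grep finds nothing; -c reports 0 matching lines.
# grep exits non-zero here, so "|| true" keeps strict scripts going.
per_line=$(grep -c 'John Doe' "$notes" || true)

# Replace newlines with spaces so the phrase sits on a single line.
joined=$(tr '\n' ' ' < "$notes" | grep -c 'John Doe' || true)

echo "per line: $per_line, joined: $joined"
rm -f "$notes"
```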

A few other things

Another grep option that might be handy is -m, which stops searching a file after the given number of matching lines. The -v option inverts the match, so you see only the lines that do not match the pattern you gave. An example of an inverted match is shown below.

$ grep -vi sampl sample.txt 
It contains a few lines of text
that we can use to search for things.
there can be many matches
but not all of them are fun
start looking for text that matches
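
The -m option can be sketched in the same way. The example below assumes a copy of the sample file in a scratch directory (the directory and variable names are illustrative) and asks grep to stop after the first two lines that match "sampl", ignoring case.

```shell
# Recreate the sample file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
This is the sample file.
It contains a few lines of text
that we can use to search for things.
Samples of text
and seeking those samples
there can be many matches
but not all of them are fun
so start searching for samples
start looking for text that matches
EOF

# -m 2 stops reading the file after two matching lines.
first_two=$(grep -m 2 -i sampl "$workdir/sample.txt")
echo "$first_two"

rm -rf "$workdir"
```

Four lines in the file match, but only the first two are reported.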

Final words

Using grep with either -R to directly inspect an area of your filesystem or in combination with a complicated find command will let you search through large amounts of text fairly quickly. You will likely find grep already installed on many machines. The pcregrep command allows you to search across multiple lines fairly easily. Next time, I'll take a look at some other grep-like commands that let you search PDF documents and XML files.
