Using Grep-Like Commands for Non-Text Files

2411

In the previous article, I showed how to use the grep command, which is great at finding text files that contain a string or pattern. The idea of directly searching in a “grep-like” way is so useful that there are additional commands to let you search right into PDF documents and handle XML files more naturally. Things do not stop there, as you could consider raw network traffic a collection of data that you want to “grep” for information, too.

Grep on PDF files

Packages are available in Debian and Fedora Linux for pdfgrep. For testing, I grabbed version 1.2 of the Open Document Format specification. Running the following command found the matches for “ruby” in the specification. Adding the -H option will print the filename for each match (just as the regular grep does). The -n option is slightly different to regular grep. In regular grep, -n prints the line number that matches; in pdfgrep, the -n option will instead show the page number.

$ pdfgrep ruby OpenDocument-v1.2.pdf 
  6.4 <text:ruby
     6.4.2 <text:ruby
     6.4.3 <text:ruby
 17.10 <style:ruby-properties>..................................................................................................
     19.874.30 <text:ruby
     19.874.31 <text:ruby-text>..................................................................................................
 20.303 style:layout-grid-ruby-below......................................................................................... 783
 20.304 style:layout-grid-ruby-height........................................................................................ 783
 20.341 style:ruby
 20.342 style:ruby

$ pdfgrep -Hn ruby OpenDocument-v1.2.pdf 
OpenDocument-v1.2.pdf:10:   6.4 <text:ruby
OpenDocument-v1.2.pdf:10:      6.4.2 <text:ruby
OpenDocument-v1.2.pdf:10:      6.4.3 <text:ruby
OpenDocument-v1.2.pdf:26:  17.10 <style:ruby

Many other command-line options in pdfgrep operate like the ones in regular grep. You can use -i to ignore the case when searching, -c to see just the number of times the pattern was found, -C to show a given amount of context around matches, -r/-R to recursively search, –include and –-exclude to limit recursive searches, and -m to stop searching a file after a given number of matches.

Because PDF files can be encrypted, pdfgrep also has the –password option to allow you to provide decryption keys. You might consider the –unac option to be somewhat logically grouped into the -i (case-insensitive) class of options. With this option, accents and ligatures are removed from both the search pattern and the content as it is considered. So the single character æ will be considered as “ae” instead. This makes it simpler to find things when typing at a console. Another interesting option in pdfgrep is -p, which shows the number of matches on a page.

Grep for XML

On Fedora Linux, you can dnf install xgrep to get access to the xgrep command. The first thing you might like to do with xgrep is search using -x to look for an XPath expression as shown below.

$ cat sample.xml 
<root>
 <sampledata>
   <foo>Foo Text</foo>
   <bar name="barname">Bar text</bar>
 </sampledata>
</root>

$ xgrep -x '//foo[contains(.,"Foo")]' sample.xml 
<!--         Start of node set (XPath: //foo[contains(.,"Foo")])                 -->
<!--         Node   0 in node set               -->

   <foo>Foo Text</foo>

<!--         End of node set                    -->

$ xgrep -x '//bar[@name="barname"]' sample.xml 
<!--         Start of node set (XPath: //bar[@name="barname"])                 -->
<!--         Node   0 in node set               -->

   <bar name="barname">Bar text</bar>

<!--         End of node set                    -->

The xgrep -s option lets you poke around in XML elements looking for a regular expression. This might work slightly differently from what you expect at the start. The format for the pattern is to pick the element you are interested in and then use one or more subelement/regex/ expressions to limit the matches.

The example below will always print an entire sampledata element, and we limit the search to only those with a bar subelement that matches the ‘Bar’ regular expression. I didn’t find a way to pick off just the bar element, so it seems you are always looking for a specific XML element and limiting the results based on matching the subelements.

$ xgrep -s 'sampledata:bar/Bar/' sample.xml 
<!--         Start of node set (Search: sampledata:bar/Bar/)                 -->
<!--         Node   0 in node set               -->

 <sampledata>
   <foo>Foo Text</foo>
   <bar name="barname">Bar text</bar>
 </sampledata>

<!--         End of node set                    -->

As you can see from this example, xgrep is more about finding matching structure in an XML document. As such it doesn’t implement many of the normal grep command line options. There is also no support for file system recursion built into xgrep, so you have to combine with the find command as shown in the previous article if you want to dig around.

grep your network with ngrep

The ngrep project provides many of the features of grep but works directly on network traffic instead of files. Just running ngrep with the pattern you are after will sift through network packets until something matching is seen, and then you will get a message showing the network packet that matched and the hosts and ports that were communicating. Unless you have specially set up network permissions, you will likely have to run ngrep as the root user to get full access to raw network traffic.

# ngrep foobar
interface: eth1 (192.168.10.0/255.255.255.0)
filter: ((ip || ip6) || (vlan && (ip || ip6)))
match: foobar
###########...######
T 192.168.10.2:738 -> 192.168.10.77:2049 [AP]
.......2...........foobar.. 
...
################################################################
464 received, 0 dropped

Note that the pattern is an extended regular expression, not just a string. So you could find many types of foo using for example fooba[rz].

Similar to regular grep, ngrep supports (-i) for case insensitive search, (-w) for whole word matching, (-v) to invert the result, only showing packets that do not match your pattern, and (-n) to match only a given number of packets before exiting.

The ngrep tool also supports options to timestamp the matching packets with -t to print a timestamp when a match occurs, or -T to show the time delta between matches. You can also shut down connections using the -K option to kill TCP connections that match your pattern. If you are looking at low-level packets you might like to use -x to dump the packet as hexadecimal.

A very handy use for ngrep is to ensure that the network communications that you think are secure really have any security to them at all. This can easily be the case with apps on a phone. If you start ngrep with a reasonably rare string like “mysecretsarehere” and then send that same string in the app, you shouldn’t see it being found by ngrep. Just because you can’t see it in ngrep doesn’t mean the app or communication is secure, but at least there is something being done by the app to try to protect your data that is sent over the Internet.

grep your mail

While the name might be a little misleading, mboxgrep can search mbox files and maildir folders. I found that I had to use the -m option to tell mboxgrep that I wanted to look inside a maildir folder instead of a mailbox. The following command will directly search for matching messages in a maildir folder.

$ cd ~/mail/.Software
$ mboxgrep -m maildir "open source program delight" .

The mboxgrep tool has options to search only the header or body of emails, and you can choose between fcntl, flock, or no file locking during the search depending on your needs. mboxgrep can also recurse into your mail, handy if you have many subfolders you would like to search.

Wrap up

The grep tool has been a go-to tool for finding text files for decades. There are now a growing collection of grep-like tools that allow you to use the same syntax to find other matching things. Those things might be specific file formats — like XML and PDF — or you might consider searching network traffic for interesting events.

Learn more about Linux through the free “Introduction to Linux” course from The Linux Foundation and edX.