June 14, 2013

How to Use Awk to Find and Sort Text in Linux, GnuCash

awk is a splendid Unix scripting language for processing text files. The version included in most Linux distros is GNU awk, or gawk for short. I like it for pulling data from ordered data sets, such as text lists and CSV exports from spreadsheets. awk sees each line in a file as a separate record, and each item in a line as a separate field, which makes it possible to slice and dice your files in all kinds of flexible ways. The classic way to illustrate this is with /etc/passwd; this example prints the whole contents:

$ awk '{ print $0 }' /etc/passwd 
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh

print with no options means "print the whole line", and $0 mean "whole line", so omitting $0 gives the same result. Now suppose you want just a list of usernames and UIDs:/etc/passwd is perfect for parsing with awk, because its data fields are delimited with colons. So all you do is count from the left starting with 1 to number the fields you want, and then extract the usernames and UIDs like this:

$ awk -F":" '{ print $1 " " $3 }' /etc/passwd 
root 0
daemon 1
bin 2
sys 3

-F defines your field separator, and " " inserts a space. What if you want the UIDs listed first? Easy peasey, just move the variables around like this:

$ awk -F":" '{ print $3 " " $1 }' /etc/passwd

Now let's look at a real-life example of using awk to migrate a big wad of financial data into GnuCash.

Importing Data into GnuCash

GnuCash is one of the crown jewels of FOSS, a brilliant and powerful accounting program that runs on Linux, Mac, and Windows. You can import QIF and OFX files, but there is only partial support for CSV imports. Which is a shame, because CSV (comma-separated values) makes a good universal format for moving financial data from one program to another, and many financial applications support CSV import and export.

Let's say you have this giant spreadsheet containing years of financial data, and you decide it's time to put it in a proper accounting program. Or you want to move from a different accounting program into GnuCash, and it doesn't support QIF or OFX, but it does support CSV export. You could retype all your data, or you could put your ace Linux scripting skills to work. The workflow is to first create a good clean CSV file, convert that to QIF, and then import the QIF into GnuCash. GnuCash is very picky about having a perfectly clean QIF file with no errors, so we'll make sure to have that.

Make sure you work from a copy of your source file! Don't muck up your original!

I'll keep this example simple and use only the following QIF fields:

  • D - date
  • P - payee
  • M - memo
  • T - amount
  • N - check number, or any notation in the check number field
  • L - category, which corresponds to GnuCash accounts
  • ^ - end of record

We'll also need to specify the account type in the QIF file header, like these examples:

!Type:Bank  
!Type:Cash  
!Type:CCard  
!Type:Invst

The QIF specification supports a lot more items, and you can see the full details here. If this page ever disappears just do a search for "qif spec" as it is abundantly documented.

Spreadsheets are fab for making mass changes, like date and number formats, so take a look at your CSV in a spreadsheet before trying to convert it to QIF. Make sure that your withdrawals have a minus sign, like -33.72, and don't use dollar signs. Deposits can have a plus sign if you prefer, but it's not required. All of your withdrawals and deposits need to be in a single column. This is what my final trimmed-down CSV export looks like:

11/03/2008  Copy Junction  Copy of building codes	-33.72	8732  Supplies	
11/03/2008  Home Depot	Trowel	-17.05	8734  Tools	
11/03/2008  Dewalt Service Center	Charger for Drill  -75.85  8735	Tools	
11/04/2008  Building Supply  Margin trowel  -13.23  8736  Tools	
11/05/2008  Jane Smith   invoice #5843	8,500.00	   dep   income:contracting

If there is even a single error anywhere in the QIF file the GnuCash import will fail. Some errors I've run into are multiple minus signs, like --33.72, extra decimal points, and incorrect date formatting. awk won't care but GnuCash will. Alrighty then, let's convert our nice CSV into a QIF file:

$ ( echo '!Type:Bank'; cat exportfile.csv | awk -F $'\t' '{ print "D" $1; print "P" $2; print "M" $3; print "T" $4; print "N" $5; print "L" $6; print "^"; }' ) > importfile.qif

And the result looks like this, with withdrawals indicated by minus signs and deposits with unsigned values:

!Type:Bank
D03/25/2008
PJane Smith
M invoice #4657
T4000.00
Ndep
Lincome:contracting
^
D04/02/2008
PFirst Bank of Money
MCheck Order
T-21.44
NACH
Lbank fees
^
D05/15/2008
PPretty Designs
MDesign Services
T-500.00
N8922
LContract Services
^

Note how to specify a tab field delimiter by using the normal ASCII escape sequence for horizontal tabs, \t. If all goes well you'll have a good error-free QIF file to import into GnuCash. You can import this into a blank new GnuCash book, or into an existing book, and you'll map the QIF categories to GnuCash accounts.

Find Blocks of Text

awk is wonderful for finding blocks of text, and you might give it a try when grep isn't picking up exactly what you want. For example, you can snag the complete lspci output for a single device:

$ lspci -v | awk '/VGA/,/^$/'
01:00.0 VGA compatible controller: NVIDIA Corporation G98 [GeForce 8400 GS] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. Device 1162
        Flags: bus master, fast devsel, latency 0, IRQ 18
        Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
        I/O ports at dc00 [size=128]
        [virtual] Expansion ROM at fe9e0000 [disabled] [size=128K]
        Capabilities: 
        Kernel driver in use: nvidia
        Kernel modules: nvidia_current, nouveau, nvidiafb

The caret, ^, is a regular expression anchor that matches the start of a string, and $ matches the end, so in this example /^$/ looks for the line breaks at the beginning and end of the text block. This is a great trick for extracting specific blocks from large configuration files with spaces between sections, like this example from sshd_config :

$ awk '/X11Forwarding/, /^$/' /etc/ssh/sshd_config 
X11Forwarding yes
X11DisplayOffset 10
PrintMotd no
PrintLastLog yes
TCPKeepAlive yes
#UseLogin no

Remove Duplicates Without Sorting

The sort and uniq commands are the ones we usually turn to for finding and removing duplicate entries in a file. But maybe you don't want your source file sorted or changed, so this is where trusty awk comes to the rescue by extracting the unique records and storing them in a new file:

$ awk '!x[$0]++' filewithdupes > newfile

Your original file won't be changed, and the new file will have only unique entries all nicely in order.

man awk contains complete documentation of options, and to get the most out of awk (or any Unix/Linux command) you need a good grasp of regular expressions, and for this I recommend the wonderful book Mastering Regular Expressions. If you want to pick the most useful thing of all to learn, learn regular expressions, because most programming languages and Linux/Unix commands rely on regular expressions.