How to Use Awk to Find and Sort Text in Linux, GnuCash

awk is a splendid Unix scripting language for processing text files. The version included in most Linux distros is GNU awk, or gawk for short. I like it for pulling data from ordered data sets, such as text lists and CSV exports from spreadsheets. awk sees each line in a file as a separate record, and each item in a line as a separate field, which makes it possible to slice and dice your files in all kinds of flexible ways. The classic way to illustrate this is with /etc/passwd; this example prints the whole contents:

$ awk '{ print $0 }' /etc/passwd 
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh

print with no arguments means "print the whole line", and $0 also means "the whole line", so omitting the $0 gives the same result. Now suppose you want just a list of usernames and UIDs. /etc/passwd is perfect for parsing with awk, because its data fields are delimited with colons. All you do is count the fields from the left, starting with 1, to find the numbers of the fields you want, and then extract the usernames and UIDs like this:

$ awk -F":" '{ print $1 " " $3 }' /etc/passwd 
root 0
daemon 1
bin 2
sys 3

-F defines your field separator, and " " inserts a space between the two fields. What if you want the UIDs listed first? Easy peasy, just swap the variables around like this:

$ awk -F":" '{ print $3 " " $1 }' /etc/passwd

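If you prefer, awk can insert the separator for you: set the output field separator, OFS, and use commas in the print statement. This is a small extra sketch of my own, not part of the original example; it also puts a pattern in front of the action so that only accounts with a UID of 1000 or higher are printed:

$ awk -F":" -v OFS="\t" '$3 >= 1000 { print $3, $1 }' /etc/passwd

The commas in print put OFS (here a tab) between the fields, and the $3 >= 1000 test skips the system accounts. What it prints depends on the accounts on your system.
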
Now let's look at a real-life example of using awk to migrate a big wad of financial data into GnuCash.

Importing Data into GnuCash

GnuCash is one of the crown jewels of FOSS, a brilliant and powerful accounting program that runs on Linux, Mac, and Windows. You can import QIF and OFX files, but there is only partial support for CSV imports. Which is a shame, because CSV (comma-separated values) makes a good universal format for moving financial data from one program to another, and many financial applications support CSV import and export.

Let's say you have this giant spreadsheet containing years of financial data, and you decide it's time to put it in a proper accounting program. Or you want to move from a different accounting program into GnuCash, and it doesn't support QIF or OFX, but it does support CSV export. You could retype all your data, or you could put your ace Linux scripting skills to work. The workflow is to first create a good clean CSV file, convert that to QIF, and then import the QIF into GnuCash. GnuCash is very picky about having a perfectly clean QIF file with no errors, so we'll make sure to have that.

Make sure you work from a copy of your source file! Don't muck up your original!

I'll keep this example simple and use only the following QIF fields:

  • D - date
  • P - payee
  • M - memo
  • T - amount
  • N - check number, or any notation in the check number field
  • L - category, which corresponds to GnuCash accounts
  • ^ - end of record

We'll also need to specify the account type in the QIF file header, like these examples:

!Type:Bank  
!Type:Cash  
!Type:CCard  
!Type:Invst

The QIF specification supports a lot more items, and the full details are easy to find online: just do a search for "qif spec", as it is abundantly documented.

Spreadsheets are fab for making mass changes, like date and number formats, so take a look at your CSV in a spreadsheet before trying to convert it to QIF. Make sure that your withdrawals have a minus sign, like -33.72, and don't use dollar signs. Deposits can have a plus sign if you prefer, but it's not required. All of your withdrawals and deposits need to be in a single column. This is what my final trimmed-down CSV export looks like:

11/03/2008  Copy Junction          Copy of building codes  -33.72    8732  Supplies
11/03/2008  Home Depot             Trowel                  -17.05    8734  Tools
11/03/2008  Dewalt Service Center  Charger for Drill       -75.85    8735  Tools
11/04/2008  Building Supply        Margin trowel           -13.23    8736  Tools
11/05/2008  Jane Smith             invoice #5843           8,500.00  dep   income:contracting
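
Before converting, you can also let awk do a quick sanity check. This is just a sketch of my own and not part of the original workflow; it assumes the same tab-delimited layout with the amount in the fourth column, and prints any row whose amount contains a dollar sign, a doubled minus sign, or more than one decimal point:

$ awk -F $'\t' '$4 ~ /[$]|--|\..*\./ { print "check line " NR ": " $0 }' exportfile.csv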

If there is even a single error anywhere in the QIF file the GnuCash import will fail. Some errors I've run into are multiple minus signs, like --33.72, extra decimal points, and incorrect date formatting. awk won't care but GnuCash will. Alrighty then, let's convert our nice CSV into a QIF file:

$ ( echo '!Type:Bank'; cat exportfile.csv | awk -F $'\t' '{ print "D" $1; print "P" $2; print "M" $3; print "T" $4; print "N" $5; print "L" $6; print "^"; }' ) > importfile.qif

And the result looks like this, with withdrawals indicated by minus signs and deposits with unsigned values:

!Type:Bank
D03/25/2008
PJane Smith
M invoice #4657
T4000.00
Ndep
Lincome:contracting
^
D04/02/2008
PFirst Bank of Money
MCheck Order
T-21.44
NACH
Lbank fees
^
D05/15/2008
PPretty Designs
MDesign Services
T-500.00
N8922
LContract Services
^

Note how the tab field delimiter is specified with $'\t', which uses the normal ASCII escape sequence for a horizontal tab, \t. If all goes well you'll have a clean, error-free QIF file to import into GnuCash. You can import it into a brand-new GnuCash book or into an existing one, and during the import you'll map the QIF categories to GnuCash accounts.
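
If you expect to run the conversion more than once, the same logic is easier to keep in a small awk script. Here is an equivalent sketch; the file name csv2qif.awk is just my own choice and not something from the original article:

#!/usr/bin/awk -f
# csv2qif.awk - same conversion as the one-liner above, one QIF record per CSV row
BEGIN {
    FS = "\t"            # fields in the export are tab-delimited
    print "!Type:Bank"   # QIF header; change to !Type:Cash, !Type:CCard, etc. as needed
}
{
    print "D" $1         # date
    print "P" $2         # payee
    print "M" $3         # memo
    print "T" $4         # amount
    print "N" $5         # check number
    print "L" $6         # category
    print "^"            # end of record
}

Run it like this:

$ awk -f csv2qif.awk exportfile.csv > importfile.qif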

Find Blocks of Text

awk is wonderful for finding blocks of text, and you might give it a try when grep isn't picking up exactly what you want. For example, you can snag the complete lspci output for a single device:

$ lspci -v | awk '/VGA/,/^$/'
01:00.0 VGA compatible controller: NVIDIA Corporation G98 [GeForce 8400 GS] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. Device 1162
        Flags: bus master, fast devsel, latency 0, IRQ 18
        Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
        I/O ports at dc00 [size=128]
        [virtual] Expansion ROM at fe9e0000 [disabled] [size=128K]
        Capabilities: 
        Kernel driver in use: nvidia
        Kernel modules: nvidia_current, nouveau, nvidiafb

The caret, ^, is a regular expression anchor that matches the start of a line, and $ matches the end, so /^$/ matches an empty line. The range pattern /VGA/,/^$/ therefore prints everything from the first line containing VGA down to the next blank line, which captures the whole block for that device. This is a great trick for extracting specific blocks from large configuration files with blank lines between sections, like this example from sshd_config:

$ awk '/X11Forwarding/, /^$/' /etc/ssh/sshd_config 
X11Forwarding yes
X11DisplayOffset 10
PrintMotd no
PrintLastLog yes
TCPKeepAlive yes
#UseLogin no
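
The range pattern accepts ordinary expressions too, so you can pass the search string in from the shell rather than hard-coding it. Here is a small variation of my own on the sshd_config example, using -v to hand the pattern to awk:

$ awk -v section="X11Forwarding" '$0 ~ section, /^$/' /etc/ssh/sshd_config

Change the value of section to whatever marks the start of the block you want.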

Remove Duplicates Without Sorting

The sort and uniq commands are the ones we usually turn to for finding and removing duplicate entries in a file. But maybe you don't want your source file sorted or changed, so this is where trusty awk comes to the rescue by extracting the unique records and storing them in a new file:

$ awk '!x[$0]++' filewithdupes > newfile

Your original file won't be changed, and the new file will contain only the unique entries, in their original order.
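
If the one-liner looks cryptic, here is a longer but equivalent way of writing it (my expansion, with the array renamed to seen for readability). x[$0] is an associative array keyed on the whole line: the first time a line appears its counter is zero, so !x[$0] is true and the line prints, and the ++ then bumps the counter so later copies are skipped.

$ awk '{ if (!seen[$0]) { print $0; seen[$0] = 1 } }' filewithdupes > newfile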

man awk contains complete documentation of its options. To get the most out of awk (or any Unix/Linux command) you need a good grasp of regular expressions, and for that I recommend the wonderful book Mastering Regular Expressions. If you want to pick the single most useful thing to learn, learn regular expressions, because most programming languages and Linux/Unix commands rely on them.

 

Comments

  • Ken Watson Said:

    Carla, thanks for this tip. I always enjoy your articles - they are informative, but not esoteric, and they seem to strike a balance between providing technical information to experienced users and educating newcomers. And all without bashing, ranting, or hyperbole. They make me (and I hope other readers) want to dig a little deeper to learn more about Linux. I've been using Linux since 2007, and even though I'm very comfortable now with the command line and tweaking things to my satisfaction, I find there's always a lot more to learn. :)

  • Bob Mesibov Said:

    Nice article on a great CLI tool. 'man awk' might be a bit daunting for AWK first-timers, though, so I've compiled a list of online AWK introductions and tips/tricks at the end of my recent Free Software Magazine item on AWK, here: http://fsmsh.com/3828

  • Roland Said:

    Here's a good one I snagged from Usenet in the early '90s: how do you get a shell variable (say $line) inside awk? Say you want to get a field from a particular line. Surround the shell variable with dblquote-singlequote-dblquote: cat data | awk 'NR=="'"$line"'"{print $3}'

  • Carla Schroder Said:

    Thank you Ken, Bob, and Roland for compliments and helpful comments!

  • Tim Chambers Said:

    Perl was greatly influenced by awk syntax, but by mastering Perl I don't have to remember awk, which is much less powerful. As Python is to Perl, so Perl is to awk. I guess it's worthwhile for noobs to dip their toe in the water with awk, but for me there came a point where I threw off the water wings and started swimming at the deep end by leaving awk behind and using Perl on the command line. The Schwartzian Transform is mind-blowing in its compact elegance. I've moved on from Perl to Python for scripting, but when I need to do anything on the command line that shell builtins won't handle, I throw an expression at perl -pe, perl -ne, or (most general) perl -e. I have no need for awk anymore.

  • cyrus modaresi Said:

    Carla, I could be your example! I've converted 45,000 QuickBooks split transactions in CSV into QIF using awk programming. But unfortunately GnuCash, despite its promises, is not the right application. Moneydance is. GnuCash has 4 major problems: 1) It's not fully QIF compatible. Ex: you cannot import !Account\nNCash\nTOth A\n^. It imports all data as BANK. 2) Its investment module is beyond primitive. 3) Once it's loaded with a large amount of data, producing reports is extremely slow. 4) It has very limited reporting capabilities. Yes, I love the fact it's written in C and the source can be downloaded. For people interested in and capable of writing awk programs there is an invaluable tutorial by Bruce Barnett. Just Google "awk tutorial" and it comes up at the top - http://www.grymoire.com/Unix/Awk.html. He also has other important tutorials. Thanks Roland for the example. Here is one for you, input file:

    46693,Transfer,05/28/2013,,,"CGR86LBBBN (Submitted on May 28, 2013 3:19 p.m. ET), and nothing",ALLY BANK 2110532245,,-132.62,Bank
    46693,Transfer,05/28/2013,,,"CGR86LBBBN (Submitted on May 28, 2013 3:19 p.m. ET)",CAPITAL ONE BUSINESS,,132.62,Credit Card

    awk ' BEGIN {FS="\"";OFS=""}; {for (i = 1; i

  • Carla Schroder Said:

    Hi Cyrus, Right you are-- you have to import one account type at a time into Gnucash. In my experience migrating to a new finance program is always painful and horrid, and Gnucash falls on the 'more horrid' side :) I like its reporting; there are tricks and tweaks to get Gnucash reports to slice and dice your data the way you want. Moneydance is very good, and it even looks nice now on Linux instead of all jaggedy GTK pixelated like it used to. But for me Moneydance has a fatal flaw, and that is terrible network performance-- I do a lot of remote work via SSH tunnels and VPN, and for that Moneydance is hopeless. I yearn for the olden days of shoebox accounting-- income goes in, money to pay bills comes out.

  • raajesh Said:

    Hi Carla, please help with this. I have two files, rspfile and myfile, as below: rspfile # WAS_USRID= # GRP_GRPID= # ND_HOST= # BASE_NODE= # BASE_HOST= # APP_NAME_1= #APP_NAME_2= #APP_NAME_3= # CLSTER_NAME_1= #CLSTER_NAME_2= #CLSTER_NAME_3= # INIT_HEAP_1= MAX_HEAP_1= #INIT_HEAP_2= #MAX_HEAP_2= #INIT_HEAP_3= #MAX_HEAP_3= # VIRT_HOST_1= #VIRT_HOST_2= #VIRT_HOST_3= myfile WAS_USERID=400 GRP_GRPID=400 ND_HOST=abcdefgh.was.tb.dsdc BASE_NODE=abcdefgh BASE_HOST=abcdefgh.was.tb.dsdc APP_NAME_1=CASMS01CD APP_NAME_2=CASMS02CD CLSTER_NAME_1=CMS01CD CLSTER_NAME_2=CMS02CD INIT_HEAP_1=512 MAX_HEAP_1=1024 INIT_HEAP_2=512 MAX_HEAP_2=1024 VIRT_HOST_1=cms01cd-host VIRT_HOST_2=cms02cd-host Using awk I'd like to have the result as below: rspfile # WAS_USERID=400 # GRP_GRPID=400 # ND_HOST=abcdefgh.was.tb.dsdc # BASE_NODE=abcdefgh # BASE_HOST=abcdefgh.was.tb.dsdc # APP_NAME_1=CASMS01CD APP_NAME_2=CASMS02CD #APP_NAME_3= # CLSTER_NAME_1=CMS01CD CLSTER_NAME_2=CMS02CD #CLSTER_NAME_3= # INIT_HEAP_1=512 MAX_HEAP_1=1024 INIT_HEAP_2=512 MAX_HEAP_2=1024 #INIT_HEAP_3= #MAX_HEAP_3= # VIRT_HOST_1=cms01cd-host VIRT_HOST_2=cms02cd-host #VIRT_HOST_3= Please suggest, thanks in advance.

