December 1, 2016

Untangling Macros in C

Morse Code made with smoke

 

As programmers, in our daily office/school life, we are expected to write code following best practice, to comment it wisely, so that when need is to re-read it, well someone can do it. To take a break from all those constraints, we can head to the IOCCC the International Obfuscated C Code Contest.

In this post, we are going to focus on the IOCCC 1986 winner in the Worst abuse of the C preprocessor category. The code was written by James Hague.

Starting from the given source, observing its output, we will explain how it works.

The Code

Here it is in all its obfuscated glory:

#define	DIT	(
#define	DAH	)
#define	__DAH	++
#define DITDAH	*
#define	DAHDIT	for
#define	DIT_DAH	malloc
#define DAH_DIT	gets
#define	_DAHDIT	char
_DAHDIT _DAH_[]="ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:"
;main			DIT			DAH{_DAHDIT
DITDAH			_DIT,DITDAH		DAH_,DITDAH DIT_,
DITDAH			_DIT_,DITDAH		DIT_DAH DIT
DAH,DITDAH		DAH_DIT DIT		DAH;DAHDIT
DIT _DIT=DIT_DAH	DIT 81			DAH,DIT_=_DIT
__DAH;_DIT==DAH_DIT	DIT _DIT		DAH;__DIT
DIT'\n'DAH DAH		DAHDIT DIT		DAH_=_DIT;DITDAH
DAH_;__DIT		DIT			DITDAH
_DIT_?_DAH DIT		DITDAH			DIT_ DAH:'?'DAH,__DIT
DIT' 'DAH,DAH_ __DAH	DAH DAHDIT		DIT
DITDAH			DIT_=2,_DIT_=_DAH_;	DITDAH _DIT_&&DIT
DITDAH _DIT_!=DIT	DITDAH DAH_>='a'?	DITDAH
DAH_&223:DITDAH		DAH_ DAH DAH;		DIT
DITDAH			DIT_ DAH __DAH,_DIT_	__DAH DAH
DITDAH DIT_+=		DIT DITDAH _DIT_>='a'?	DITDAH _DIT_-'a':0
DAH;}_DAH DIT DIT_	DAH{			__DIT DIT
DIT_>3?_DAH		DIT			 DIT_>>1 DAH:'\0'DAH;return
DIT_&1?'-':'.';}__DIT DIT			DIT_ DAH _DAHDIT
DIT_;{DIT void DAH write DIT			1,&DIT_,1 DAH;}

Apart from the particular formatting, what jumps to the eye is the number of “unnecessary” macros and the repetitive use of DIT and DAT variations.

The output

If we compile the code at this point we see many warnings. Among them, two for the implicit declaration of __DIT and _DAH. After that step, we can run the code, and as we provide sequences of ascii characters, it spits out sequences of . and _.

$ ./a.out hello, world

.... . .-.. .-.. --- --..-- .-- --- .-. .-.. -..

It looks like Morse code. And indeed, using an online Morse decoder, it is. It reverses back to HELLO, WORLD

De-Obfuscating

Let’s first try to perform the pre-processor job and replace the macros by their values. After a bit of reformatting, this is what we have:

char _DAH_[]=”ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e’b.s;i,d:”;
main()
{
char *_DIT, *DAH_, *DIT_, *_DIT_, *malloc (), *gets();
for (_DIT = malloc(81), DIT_=_DIT++; _DIT == gets(_DIT); __DIT(‘\n’))
   for (DAH_=_DIT; *DAH_; __DIT(*_DIT_ ? _DAH(*DIT_ ) : ‘?’),__DIT(‘ ‘),DAH_++) 
     for (*DIT_ = 2, _DIT_ = _DAH_; *_DIT_ && (*_DIT_ != (*DAH_ >= ‘a’ ? *DAH_&223 : *DAH_ )); (*DIT_ )++,_DIT_++)
         *DIT_+= (*_DIT_>=’a’ ? *_DIT_ — ‘a’ : 0);
}
_DAH(DIT_)
{ 
__DIT(DIT_> 3 ? _DAH(DIT_>>1) : ‘\0’);
return DIT_ & 1 ? ‘-’ : ‘.’;
} 
__DIT(DIT_) char DIT_;
{
(void) write (1,&DIT_,1);
}

Slightly better.

We see the three functions we expected: main, _DAH, and __DIT. We also see an external variable __DAH__ , a long string. __DIT looks like the putchar function from the standard library, printing a char at a time. And what about _DAH ?

Dive into _DAH

It is recursive. As long as the argument is a number that takes more than 2 bits to write, it calls the function again, stripping the number from its last bit. The output will be part of the argument printed as and . masking for 1 and 0 , i.e. the number in binary format, and it will return the second leftmost digit. As an example, if we call _DAH(5) , 5 being 101 in binary, it will call _DAH(2) . That is the base case. it prints \0 (nothing) and return 10 & 1 == 0 so . . Then it will print . and return 101 & 1 == 1 so -. If we want to print that we have to call __DIT(_DAH(5)) which outputs .- which actually corresponds to 3 written in binary. _DAH(n) is a rather obfuscated function, which will not print/return n in binary but n — 2 .

The main function

Again with more explicit variables names.

char code[]=”ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e’b.s;i,d:”;
main ( )
{
char *line, *letter, *value, *code_copy, *malloc ( ),* gets ( );
    for (line = malloc(81), value= line++; line == gets(line); putchar(‘\n’))
     {
     for (letter = line; *letter; putchar(*code_copy ? _DAH(*value) : ‘?’), putchar(‘ ‘), letter++)
         {
         for (*value = 2, code_copy = code; *code_copy && (*code_copy != (*letter >= ‘a’ ? *letter & 2\
23: *letter)); (*value)++, code_copy++)
             {
              *value += (*code_copy >=’a’ ? *code_copy — ‘a’: 0);
             }
         }
     }
}

The outer loop: Each time the user enters a new line, it creates a buffer, and reads a line from the standard input to the buffer. gets does not check for buffer overflow, so 81 means nothing in the code itself, and I have not found what it means for Morse users. The function either returns the buffer it takes as an argument or NULL in this case, the program will return. This loop also assigns an address to value which it will use in the inner loop. This loop will print a new line after completion of the two inner loops.

The middle loop: it will loop through each letter in the string obtained above. As it moves from one letter to another, it will either print it using _DAH seen above or print a ? and add a space. As we looked at _DAH above, we used integers as arguments. It works fine as letters, ASCII characters like *value to be more precise, and are small integers in C.

The inner loop: it sets the value of *value to 2, and look at the value of the letter at this point. ((*letter >= 'a') ? *letter & 223 : *letter) means if the letter is lower case, use its upper case version. The octal 233 or 10010011 serves as a mask to change the one bit that is different for a lowercase letter than an uppercase. Of course 223 is not the most obvious one 137 would more easily come to mind. Knowing this, we can realize this inner loop will iterate as long as the *letter variable does not have a match in the code string. As it iterates, it will increase *value by 1 and move on to the next letter in code . Interestingly, if at one point in this process, the letter in code is lowercase, the loop will increase *value by some special number.

Going from a letter to its Morse code: globally, the inner loop starts with a *value = 2 and increase it by 1 each time the iteration moves on to the next letter in code until the letter in code is the letter in my line. It will then print *value using _DAH . Let’s see some examples:

  •  *letter = 'E' then *value = 2 and since E is the first letter in code it goes back to the middle loop and calls putchar(*code_copy ? _DAH(*value) : '?')` . Value of *code_copy == 'E' here so expression above becomes putchar(_DAH(2)) As seen above _DAH(2) returns/prints the binary value of 0as ‘.’ and ‘-’. The output of this expression will be . This is indeed the Morse code for E.
  •  *letter = 'I' then to start *value = 2 , since I is the third letter in code the inner loop will exit with *value == 4 . putchar(_DAH(4)) will print out .. , the Morse code for I.

We can observe a pattern, as the order chosen for the letters in code is such that the letters have Morse codes that are equal to the binary values of their index in the code string plus 2. Of course, this is not perfect. Remember the inner loop special cases ? If *code == 'a' , the loop will skip that *value or said otherwise, this *value or index does not map any Morse code. If *code == 'b' , the loop will skip that *value and shift by 'b' — 'a' == 1 , which means from the on, the Morse code for the letters in code will map the value of their index in the string plus 3. An so on, next comes a d which will shift that new mapping by 'd' — 'a' == 3 …This is genius and so hard to figure out.

Conclusion

The insane contest that is the IOCCC produces crazy creative code. Behind the formatting, the dubious but nevertheless relevant variable names and over complications lies a genius idea that maps letters with their Morse code inside a single string. I do not know how challenging it was to come up with it in the first place, but it did require some time, doubts and a few ‘ha ha’ moments to unravel the underlying process. I am so glad I did it, and I feel I acquired new skills in reading others’ code in the process.

Post Scriptum

As an annex, here is a ‘de obfuscated’ code, it compiles with one warning due to the use of gets. I tried to stay close to the original, but making it more readable.

#include <stdio.h>
#include <stdlib.h>

int _DAH(int);

char code[]="ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:";

int main(void)
{
        char *line, *letter, *code_copy;
        char *gets(char *);
        char value, upper_case;

for (line = malloc(81); gets(line) != NULL; putchar('\n'))
        {

for (letter = line; *letter; letter++)
                {
                        code_copy = code;
                        upper_case = (*letter >= 'a' ? *letter - 'a' + 'A' : *letter);
                        value = 2;
                        while (*code_copy && (*code_copy != upper_case))
                        {
                                value += (*code_copy >='a' ? *code_copy - 'a': 0);
                                value++;
                                code_copy++;
                        }
                        putchar(*code_copy ? _DAH(value) : '?');
                        putchar(' ');
                }
        }
}

int _DAH(int letter)
{
        putchar(letter > 3 ?_DAH (letter>>1): '\0');
        return (letter & 1 ? '-' : '.');
}

This article was contributed by a student at Holberton School and should be used for educational purposes only.

 

Click Here!