Download and search WikiLeaks


After all that mess about the US Embassy Cables on WikiLeaks, I think this is a good moment to explain to a political analyst or columnist why he or she should use Linux. I will not cover how to download a distro, burn a CD and install Linux; every distro provides those instructions in detail. I will describe how it looks on Ubuntu, though it will look about the same on any other distro.

Get files

We point Firefox to http://wikileaks.ch/cablegate.html and, towards the bottom of the page, locate the link “Click here to download full site in single archive.” Since the link points to a torrent, Firefox will offer to open it with the default application, Transmission. We accept by clicking the OK button. Transmission will then ask whether we want to add the torrent. We click the Add button and give Transmission some time to finish the download. Afterwards we can either help others by continuing to upload, or stop uploads using File->Pause All and quit Transmission. This is the procedure for handling any kind of torrent.

Process the files

After a successful download, the result is in the Downloads folder as a 7z archive. Since Ubuntu doesn’t support 7z out of the box, we need to install it. We open a terminal via Applications->Accessories->Terminal and execute:

sudo apt-get install p7zip

It will prompt for a password and install 7z. Now we can return to the GUI, right-click on the archive, and select Extract Here from the menu. Back in the terminal, we change the directory and do the search:

cd Downloads/cablegate-201012200724/cable/

find . -name "*.html" | xargs grep -l "UFO"

 

which will produce a list of files which contain UFO:

 

./2009/12/09STATE129362.html

./2009/09/09LISBON514.html

./2009/02/09STATE11937.html

./2009/04/09PRISTINA148.html

./2008/03/08PARIS461.html

./2008/09/08LAGOS368.html

./2008/08/08PARIS1501.html

./2008/08/08LISBON2300.html

./2008/02/08MADRID174.html

./2008/02/08ABUJA320.html

./2006/03/06MINSK311.html

./2006/09/06KINSHASA1410.html

./2006/06/06MINSK641.html

./2010/01/10PRISTINA44.html

./2010/02/10PRISTINA84.html

./2010/02/10ADDISABABA288.html

./2010/02/10BAMAKO52.html

./2007/03/07KINSHASA282.html

./2007/07/07ANKARA1842.html

./2007/07/07KINSHASA797.html

./2007/07/07HARARE638.html

 

Unfortunately, those are not real UFOs but other acronyms, such as EUFOR, that happen to contain the same letters. To see what is in a file we can use Gedit or Firefox like this:

 

gedit ./2008/08/08PARIS1501.html

firefox ./2008/08/08PARIS1501.html
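Since the matches above come from acronyms such as EUFOR, one way to narrow the search is to ask grep for whole words only. The following is a minimal sketch, assuming GNU grep as shipped with Ubuntu; the function name search_word is made up for illustration.

```shell
# List HTML files under a directory that contain a term as a whole word.
#   $1 = word to look for, $2 = directory to search (defaults to the current one)
# -r recurses into subdirectories, -l prints only file names,
# -w matches whole words only, so "UFO" no longer matches "EUFOR".
# --include is a GNU grep option that limits the search to matching file names.
search_word() {
    grep -rlw --include="*.html" -- "$1" "${2:-.}"
}
```

Running search_word UFO . inside the cable directory would then list only the files where UFO stands alone. Note that -w also skips forms like "UFOs", so a second search for the plural may be needed.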

 

To do more complicated searches we can either master find and grep or switch to SWISH++. Again, we need to install it:

 

sudo apt-get install swish++

 

Then we create the index file:

 

index++ -v3 -e "text:*.html" .

 

and we can do the search:

 

search++ Chapman

 

The result is:

 

# results: 4

100 ./2009/11/09MAPUTO1291.html 46552 09MAPUTO1291.html

89 ./2009/07/09MAPUTO713.html 51555 09MAPUTO713.html

87 ./2010/01/10MAPUTO86.html 52688 10MAPUTO86.html

87 ./2010/01/10MAPUTO80.html 49769 10MAPUTO80.html

 

What would Anna Chapman do in Maputo? So we do a better search:

 

search++ Anna near Chapman

 

where we get a disappointing 0 files. To learn more about using and, not, or and near, we execute:

 

man search++

 

There are quite a few examples towards the end of the man page.

During indexing we may want to save a log file to see which words index++ will discard:

 

index++ -v3 -e "text:*.html" . > log

 

There are quite a few of them, and they are frequent words like Moscow, Clinton and so on. To search for such words anyway we can do this:

 

search++ "Mosco*"

 

It will skip Moscow in the origin section. Alternatively we can always fall back to find and grep.
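As a rough illustration of the find-and-grep fallback, here is a sketch that lists files containing two words at once. It assumes GNU grep and GNU xargs as shipped with Ubuntu; the function name search_both is made up for illustration. It only checks that both words occur somewhere in the same file, which is weaker than the near operator of search++.

```shell
# Print HTML files that contain both words.
#   $1 = first word, $2 = second word, $3 = directory (defaults to the current one)
# The first grep lists files containing $1; the second grep keeps only those
# that also contain $2.
# -Z and -0 pass file names NUL-separated so unusual names survive the pipe;
# -r tells xargs not to run the second grep when the first one finds nothing.
search_both() {
    grep -rlZ --include="*.html" -- "$1" "${3:-.}" | xargs -0 -r grep -l -- "$2"
}
```

For example, search_both Chapman Anna . would list the cables that mention both names, and an empty result would agree with what search++ reported.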