August 17, 2004

Why India is struggling with localized language computing

Author: Mayank Sharma

While IT is improving the quality of life in many developed nations, the use of technology is still out of reach for people in many other countries. Nowhere is this digital divide as big or as visible as in India. Apart from lack of access to a computer, unfamiliarity with the English language is one of the biggest factors contributing to this problem. In countries like India, where the majority of the population is English-illiterate, computing has to speak a language the locals understand. This is where user-interface localization steps in.

Most applications have an interactive component or UI which includes messages to the user and system commands. Localization means translating these messages and commands into the language of each country or region.

Developers can localize only programs that are internationalized, explains Dr. Nagarjuna G, chairman of FSF-India. Internationalized programs encode their messages and command names in a standard such as Unicode and follow a framework, so that the core program works independently of any natural language.
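The separation Nagarjuna describes can be sketched in a few lines of Python using the standard gettext module. This is only an illustration under stated assumptions: the DictTranslations class, the CATALOGS table, and the build_menu function are hypothetical names, the in-memory dictionary stands in for the compiled message catalogs a real project would ship, and the Hindi strings are sample translations.

```python
import gettext


class DictTranslations(gettext.NullTranslations):
    """In-memory catalog standing in for a compiled .mo file (illustrative only)."""

    def __init__(self, catalog):
        super().__init__()
        self._dict_catalog = catalog

    def gettext(self, message):
        # Fall back to the original English string when no translation exists.
        return self._dict_catalog.get(message, message)


# Hypothetical catalogs keyed by locale name; the Hindi strings are samples.
CATALOGS = {
    "hi_IN": DictTranslations({"Open": "खोलें", "Save": "सहेजें"}),
    "en_US": gettext.NullTranslations(),
}


def build_menu(locale_name):
    """Core program logic: it never hard-codes a natural language."""
    _ = CATALOGS[locale_name].gettext
    return [_("Open"), _("Save"), _("Print")]
```

Calling build_menu("hi_IN") returns the two translated labels and leaves "Print" in English, because the sample catalog has no entry for it. The program code is the same for every language; only the catalog changes.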

Localizing the UI enables non-English-speakers to access computers. Nagarjuna points out that India's literacy rate is close to 65 percent, but most of these people cannot use the available computers because the UI isn't in their mother tongue. With localization, developers can offer computer environments for education, as well as tools that give local and useful information -- such as water resource maps -- without requiring knowledge of English.


A lack of excitement about localization

It doesn't take a rocket scientist to figure out the benefits of localization, and the enthusiasm around it from the various projects is quite evident. But you still don't see any major deployments, at least not in India.

The Indian government and several industry leaders initially did not take localization very seriously, despite the known positive implications of this task, says Nagarjuna. Government, he thinks, has not been advised by the right kind of people. Indian computing researchers, primarily from the Indian Institute of Technology (IIT) and the Center for Development of Advanced Computing (CDAC), concentrated on high-tech problems like machine translation, speech recognition, and optical character recognition, since these are intellectually more challenging and can help them get their papers published in research journals.

There is no reason not to solve these interesting problems, but providing core support should be a priority. Nagarjuna says passing this on to the industry wouldn't help. Industry has, in the past, solved problems by imposing proprietary encoding standards, to the extent that even the fonts for Indian scripts carry proprietary encodings. Each software developer encoded fonts at will.

Nagarjuna said that non-standard font encoding -- an unethical practice -- is often used. As a result, different vendors place the same glyph (character shape) at different locations in the glyph table. Sometimes a glyph is also broken into several shapes that are placed at different locations. The problem is simply that the address where a glyph is placed is not standardized. This means only the applications produced by the same vendor can use those fonts, and others can't. This is vendor lock-in. And if the vendor closes, or if the user intends to migrate to another platform or application, all the data the user created using the non-standard font is of no use. This industrial practice should be prevented by the government by law. "We have never seen this happening for any English font; why should we let this happen for Indian language computing?" asked Nagarjuna.
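A toy example makes the lock-in concrete. The two glyph tables below are hypothetical and do not describe any real vendor's font; they simply place the same two Devanagari letters at swapped addresses, which is enough to show why a document written with one font becomes unreadable with another.

```python
# Hypothetical glyph tables: both fonts contain the same two letters,
# but each vendor placed them at different addresses.
VENDOR_A_TABLE = {0x41: "\u0915", 0x42: "\u0916"}  # 0x41 -> KA, 0x42 -> KHA
VENDOR_B_TABLE = {0x41: "\u0916", 0x42: "\u0915"}  # same glyphs, swapped slots


def render(data: bytes, glyph_table: dict) -> str:
    """Look up each stored byte in the font's glyph table."""
    return "".join(glyph_table[code] for code in data)


# A document saved by vendor A's application: it stores glyph addresses,
# not standardized character codes.
document = bytes([0x41, 0x42])

as_written = render(document, VENDOR_A_TABLE)  # what the author typed
as_misread = render(document, VENDOR_B_TABLE)  # what another vendor's font shows
```

The stored bytes are identical in both cases, yet as_written and as_misread differ, because the meaning of each byte lives in the vendor's private table rather than in a shared standard.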

CDAC also developed its own font-encoding schemes, such as ISFOC. The government failed to impose any control on this. The government's intentions may have been good, since it helped develop the standards, but it failed to impose them. Neither industry nor government required all application documents to be in a standard encoding format.

CDAC developed good usable standards like ISCII, but rather than free its standards in a similar way to the World Wide Web Consortium, CDAC acted like private industry, forgetting that it was running its business with public money. Since its programs are not free (as in freedom) they must compete with commercial software. If CDAC had acted differently, it could have achieved its objectives, and today CDAC's standards would be the Indian computing standards.

Nagarjuna suggests the government should form a consortium, like the W3C, for Indian computing. From time to time the consortium would propose solutions, develop standards for Indian languages, and invite all the stakeholders from industry as well as organizations like the FSF. This consortium can promote the use of Unicode or ISCII, where the positions of the glyphs are fixed. The government should also pass and enforce a law saying that if, for some technical reason, a company intends to use a non-standard method, it must publish the addresses of the glyphs and also provide an export filter in the application so that migration to open standards is possible.
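An export filter of the kind Nagarjuna proposes could be as simple as the following sketch, provided the vendor publishes its glyph addresses. The table contents and the name export_to_unicode are assumptions made for illustration; a real font would need a far larger table and extra handling for glyphs whose storage order differs from Unicode's logical order.

```python
# Hypothetical published glyph table: vendor address -> Unicode text.
# A single vendor glyph may expand to several Unicode code points.
PUBLISHED_GLYPH_TABLE = {
    0xA1: "\u0915",              # DEVANAGARI LETTER KA
    0xA2: "\u092E",              # DEVANAGARI LETTER MA
    0xA3: "\u0915\u094D\u0937",  # conjunct: KA + VIRAMA + SSA
}


def export_to_unicode(data: bytes, table=PUBLISHED_GLYPH_TABLE) -> str:
    """Convert vendor-encoded bytes to a Unicode string.

    Raises ValueError for any address missing from the published table,
    so gaps in the vendor's documentation are noticed instead of
    silently corrupting the exported text.
    """
    pieces = []
    for offset, code in enumerate(data):
        if code not in table:
            raise ValueError(
                f"unpublished glyph address 0x{code:02X} at offset {offset}"
            )
        pieces.append(table[code])
    return "".join(pieces)
```

Once text is in Unicode, any standards-compliant application and font can display it, which is the migration path the proposed law is meant to guarantee.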

CDAC recently ramped up localization efforts, possibly realizing their importance, Nagarjuna says. CDAC's team in Bangalore is localizing OpenOffice.org in Indian languages. Its Mumbai team is working on giving core support for Indian languages at the operating system level by experimenting with the X Window System.


Where is localization applicable?

Computing involves a lot more than just the operating system. So where should the developers working on localizing the GUI start? Nagarjuna believes that applications used by everyone -- browsers, Web sites, email clients, office applications, and file managers -- should be considered first while localizing.

Sankarshan Mukhopadhyay, one of the founders of the Ankur Project, which supports the Bangla language on X, points out that localization can promote educational content in local languages.

The Indian Ministry of IT has been doing some work on localization under the Technology Development for Indian Languages banner. Localized education content on cheap localized Linux machines could promote the spread of IT in schools across the country in no time.

Nirav Mehta, founder of the Utkarsh project, which supports Gujarati, raises the bar a bit higher. He wants to localize in all mass-market areas. For his native language, this could mean software that helps non-resident Indians learn Gujarati, a tool for farmers to communicate with the city markets, a tool for the electoral commission to allow searches on voter lists, or a news channel providing SMS alerts in Gujarati. The possibilities, he says, are endless.


Ingredients for localizing computing

The extent and usefulness of localization for any developing country is immense. So, what does one need to squeeze the benefits out of this medium?

Jitendra Shah is a professor at Veermata Jijabai Technological Institute (VJTI) and project leader of IndicTrans, which supports various Indian languages. He defines the localization experience as "being able to communicate in one's language within a culturally familiar environment. As a sibling of internationalization, it is also a way to bring to India's people the winds of world experiences and world's opportunities, thereby giving the local talent a scope to express at the global platform."

For proper localization we need all the elements of a writing system (input method, editor, fonts, dictionaries, spell checkers, etc.). More importantly, we need applications that give us the advantages of computing and communication. To get to many such resources on the Web, one has to use an operating system and applications that require basic English know-how, which a majority of the population in many countries doesn't have.

Mehta lists the following as necessary ingredients for a localized experience:

  • Fonts
  • Localized user interface -- i.e. locale, translations
  • Localized theme of the operating system -- colors, icons, etc.
  • Dictionaries and thesaurus
  • Help and documentation in local language
  • Keyboards in local language
  • Ability to make printouts in local language
  • A good collection of useful applications in the local language
  • Localized support resources for local language computing -- hardware and software support, developers, etc.
  • Continuous innovation and improvement!

The last bullet is an indication of the complexity of the process. Once the initial system is up and running, it requires constant checks and improvements. While this is true of any system, it is particularly complex in the case of localization, due to the complexities of language.

Building on this, Venkatesh Hariharan, co-founder of the IndLinux project, says that all of the foregoing comprises the first phase of localization. In the next stage, developers need to deploy the software and take feedback from end users. Based on this feedback, they might need to revamp their code, and perhaps even devise cultural user interfaces that are more appropriate to Indian users.

Remember, the current "files and folders" metaphor derives from the work of cognitive scientists who said that the interface of the computer must closely resemble that of the real world. For a western office the current metaphors may be appropriate, but a "desktop" means nothing to a farmer who has never owned a desk!

This phase will require a lot of time and money, and most importantly a sensitive approach to what rural users of IT want. City-based IT professionals will have to put their preconceived notions aside and go in with an open mind.

Wanna get local?

Localization is a complicated process and deserves an article in itself. But just to give you an idea of what it involves, I asked one of the lead developers of the Ankur project, Sayamindu Das Gupta (a.k.a. SDG), to show us the ropes. SDG outlined eight important steps:

Step 1: Decide on the language you want to localize. As SDG points out, a language is not the same thing as a script. For example, Bangla and Assamese are different languages, but both use the same script -- Bengali.

Step 2: Find the script for your language.

Step 3: Find out whether your script has been encoded by the Unicode standard.

Step 4: Find the two-letter ISO code for your language, and look up your country code.

Step 5: Find out whether the locale data for your region exists in the GNU libc. This is the C library used with the Linux kernel.

The data files are named in the format of <languagecode_countrycode>. For example, the file for Indian Bengali is named bn_IN, while Hindi is hi_IN. Look for your locale data in the latest glibc sources from the CVSWeb interface. If a locale data file is not available, search Google and try to find out if someone is working on it. If no one is working on it, write one yourself.

Step 6: After the locale data is ready and has been tested with the "localedef" command, find out whether a font for your script exists. Fonts have to be Unicode-compliant, and if your script has advanced or complex features such as conjunct (juktakshar) formation or character reordering, you'll need an OpenType font.

Step 7: Figure out whether your script is supported by the text drawing and rendering systems of the Linux distribution you use. GNOME and GTK2 applications use the Pango library for rendering text. KDE uses the internal rendering engine of Qt. Join Qt- and Pango-related mailing lists and ask the developers if your script is supported.

Step 8: Start translating. This is a major task, but it is not very difficult. However, you have to be careful and maintain consistency; users won't like it if they see an element as "foo" at one place and as "bar" at another place. Translation mainly consists of trawling through PO files. An introduction to PO files is available at the Ankur Web site.

Once you have translated a set of PO files, decide what to do with them. You might contribute your translation to be incorporated into a distribution, or you could start your own localized distro!
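A few of these steps lend themselves to quick scripted checks. The sketch below, in Python, touches on Step 3 (is the script in Unicode?), Steps 4 and 5 (how the locale name is built), and the consistency concern in Step 8. All three function names are hypothetical. The script check is only a heuristic that reads the first word of a character's Unicode name, since Python's unicodedata module does not expose the script property directly, and the (msgid, msgstr) pairs are assumed to have been extracted from PO files by some other tool.

```python
import unicodedata
from collections import defaultdict


def script_name(char):
    """Step 3 (heuristic): first word of the character's Unicode name,
    or None if the code point has no name in this Unicode database."""
    name = unicodedata.name(char, None)
    return name.split()[0] if name else None


def locale_name(language_code, country_code):
    """Steps 4-5: build a locale name in the languagecode_countrycode format."""
    return f"{language_code.lower()}_{country_code.upper()}"


def inconsistent_terms(entries):
    """Step 8: find source strings that were translated in more than one way.

    `entries` is an iterable of (msgid, msgstr) pairs gathered from one or
    more PO files; untranslated entries (empty msgstr) are ignored.
    """
    seen = defaultdict(set)
    for msgid, msgstr in entries:
        if msgstr:
            seen[msgid].add(msgstr)
    return {
        msgid: sorted(translations)
        for msgid, translations in seen.items()
        if len(translations) > 1
    }
```

For instance, script_name("\u0995") reports BENGALI for the Bengali letter KA, locale_name("bn", "in") yields bn_IN, and inconsistent_terms flags any interface element that appears as "foo" in one file and "bar" in another.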


Where to start?

Where to start is a complex question, and the answer depends on whom you ask. Mehta, whose primary focus is the Gujarati-speaking non-IT city population, believes a complete localized experience should empower individuals to carry on personal and business activities in their own language. It should be able to position computers as simple tools to get the job done. The targets should be ease of use and productivity. Users should be able to create and dispatch documents, work with data and calculations, communicate with peers and associates, gain insight and knowledge about a field of interest, and satisfy their personal and business interests.

As a general progression, teams focus on getting the basic requirements taken care of and then move on to other things. Mehta and his project have completed the locale, fonts, keyboard layout, and core translations for GNOME 2.6 for Gujarati. OpenOffice 1.1 translations are complete too, and the team is building an install set. People have started asking for applications specific to their industries and needs -- accounts, stock market, education, and even software for opticians!

On the other hand, Shah is building localized solutions that cater to the general population. His current project involves enabling voter lists to be searchable in his native language, Marathi.

Sankarshan's Ankur team is working on a localized environment in Bengali, with the aim of assisting the delivery of education.


Building on popular software

While all the groups might be localizing for a different kind of audience, there is one similarity. All these projects are localizing a couple of open source software projects in their respective languages. OpenOffice.org seems to be on the priority list of everyone, closely followed by Mozilla. GNOME seems to be the desktop of choice.

Mehta says the clear advantage of using open source software is in the strong code base and market share. "If we were to build our own Office suite in Gujarati, it would be next to impossible with the resources we have. OpenOffice.org is there, so why reinvent the wheel? And people who may already be using OpenOffice.org in English would find it easier to migrate to and support the same software in Gujarati."

Popular software is also tested over time, which means the chances of bugs halting the localization work are lower.

OpenOffice.org and Mozilla are favorites among localization teams, Mehta confirms, "simply because these are the core applications that a user would need. Evolution will start figuring in the radars of localization teams quite soon. KDE is already there, and many teams have made good progress on KDE already."


Localization projects love LiveCDs

Ankur, Utkarsh, IndLinux, GNUBharaati, and other projects have all decided to release their localized systems on LiveCDs, primarily based on Knoppix or one of its derivatives. The projects have been pushing these LiveCDs as localization solutions to the government and the industry.

Shah says that in an otherwise hostile environment of proprietary software, with governments totally chained by bureaucratic red tape, with no overlap between visionaries and politicians, and with education in the hands of merchants whose interest is not primarily in education, free and open source localized software has to find a solution. A bootable CD provides a non-invasive alternative which can support a fail-safe roadmap for migration.

By contrast, Sankarshan believes that LiveCDs are dead. He believes that LiveCDs by their nature are meant to be used as a technology demonstration platform. To take localization to the government, complete installable distros are the way to go. Unfortunately, the government doesn't see the benefits that a LiveCD provides.

Mayank Sharma is a 21-year-old technology writer/developer from India. He does his bit to highlight and strengthen the localization efforts in India and is working on connecting FOSS with students and the education system.
