Can Node.js Scale? Ask the Team at Alibaba


Alibaba is arguably the world’s biggest online commerce company. It serves millions of users and hosts millions of merchants and businesses. As of August 2016, Alibaba had 434 million users, with 427 million monthly active mobile users. During this year’s Singles Day, held on November 11 and one of the biggest (if not the biggest) online sales events, Alibaba registered $1 billion in sales in the first five minutes.

So when you are talking about scaling a site and its properties to meet user demand, Alibaba tops the list. And how does it scale so quickly? One of the technologies that helps the company is the versatile application platform Node.js.

In advance of Node.js Interactive, to be held Nov. 29 through Dec. 2 in Austin, we talked with Joyee Cheung, a developer at Alibaba, about Alibaba’s instrumentation of Node.js, why the company chose Node.js, and the challenges it faced scaling Node.js on the server side.

How is Alibaba using Node.js? Why did it decide to use Node.js as a technology?

Joyee Cheung: At Alibaba, we use Node.js for both frontend tool chains and backend servers. For frontend work, it was a natural transition, since Node.js has become the de facto platform for frontend tooling. But for backend applications, Node.js has come a long way at Alibaba.

We started adopting Node.js in 2011, using it as a frontend layer in the backend – to serve data or render web pages. At the time, most of Alibaba’s business was still e-commerce, so the applications had to change frequently to meet the demands of sales, marketing, and operations. We used Java for most of our applications; it was stable and rigorous, tailored for the enterprise, but that came at the cost of productivity. Meanwhile, the view layer became deeply coupled with other layers on the server side, which made the code harder and harder to maintain. During that time, Rich Internet Applications and Single-Page Applications were on the rise, but we were still limited if the innovation stayed only on the client side. Many improvements to the user experience could not be made without modifications on both sides.

Then we developed the idea of separating the frontend from the backend: we take the frontend-related responsibilities (routing, rendering, serving data through HTTP APIs, etc.) out of the traditional backend applications and give them to applications dedicated to that work.

The backend applications can keep focusing on business logic and use a more stable and rigorous technology like Java, because they are less subject to change. They provide reliable services via RPCs called by the frontend-backend applications. These frontend-backend applications can then focus on user experience and better adjust to changes in design, product, and operations with a more flexible language.

By giving frontend developers access to our faster and trusted internal network, we can also reduce the overhead of network requests and keep the user state secure behind a set of more restricted HTTP APIs. And nothing is more suitable for this kind of job than Node.js, because it is designed for efficient I/O, is quick to start up and deploy, and uses a flexible language in which most frontend developers are already fluent. The separation of frontend and backend on the server side is really a separation of concerns, where we use different technologies to meet the needs of different expertise and handle different frequencies of change.

It was not an easy ride, however. Many people questioned whether Node.js was mature enough for enterprise applications, because it lacked the tooling and infrastructure we had with Java. And because Alibaba is a huge group with many subsidiaries, each with a slightly different technology stack, we needed to unite efforts throughout the group to make this work. To fit Node.js into our architecture and environment, we’ve developed our own npm registry and client (cnpm), customized web frameworks (Egg and Taobao Midway), a monitoring and profiling solution (alinode, which I work on and which is offered to external customers in the cloud), and numerous middleware that hook into our infrastructure. We also give back to the community by running a Chinese forum for Node.js, having people contribute to Koa.js (which most of our frameworks rely on) and Node.js core (we have three collaborators at the moment), and open sourcing a lot of Node modules (most of them are under node-modules and ali-sdk).

Now Node.js runs on thousands of machines in our clusters, handling a moderate amount of traffic across different subsidiaries of Alibaba. It has proven itself over several Double 11 sales, and we expect it to see wider adoption in the next few years.

What prompted you to analyze the V8 garbage collection logs?

Joyee Cheung: When using Node.js at scale on the server side, garbage collection quickly becomes more important to performance than it is on the client side, since server-side applications tend to run much longer and handle more data than average client-side applications.

When the garbage collector is not working well with the application, CPU usage can climb and hurt responsiveness, memory might not be reclaimed in time, and other processes on the same machine can be affected.

Even when the garbage collector is doing a good job, developers can make mistakes that mislead the garbage collector and result in memory leaks. V8 and Chromium provide tools to analyze the heap and memory allocations, but they don’t reveal the whole picture, especially when it comes to garbage collection.

Luckily, V8 provides garbage collection logs (though they are not documented), and sometimes these logs can shed light on problems that other tools can’t help with. Thanks to the Node.js LTS plan, the format of the garbage collection logs stays very stable throughout an LTS version, so we can now keep GC logs in our box of tricks.

After analysis, what performance problems did you discover? How did you solve the problems you encountered?
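For readers who want to try this, GC logging can be switched on from the command line. The flags below are real V8 options exposed through Node.js, but, as noted, they are undocumented and the exact log line format varies between V8 versions:

```shell
# Print a log line for every garbage collection. The inline script
# allocates many short-lived objects to force a few collections.
node --trace-gc -e '
  let junk = [];
  for (let i = 0; i < 1e6; i++) junk.push({ i: i });
' 2>&1

# Related (also undocumented) flags:
#   --trace-gc-verbose   per-heap-space details for each collection
#   --trace-gc-nvp       name=value pairs, easier for a parser to consume
```

Each emitted line records the collection type (e.g., scavenge vs. mark-sweep), heap sizes before and after, and the pause duration, which is the raw material a GC log analyzer works from.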

Joyee Cheung: We have discovered some common causes of GC-related performance issues: inappropriate caching strategies, excessive deep clones, certain uses of closures, and bugs in templating engines, to name a few.
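To illustrate the first of those causes, here is a generic sketch (not code from Alibaba) of an unbounded in-process cache, which keeps every entry alive forever and makes the old generation grow monotonically, alongside a size-bounded LRU variant that keeps retained memory flat:

```javascript
// Leaky pattern: a cache with no eviction policy. Every distinct key
// ever seen stays reachable, so the GC can never reclaim the entries.
const leakyCache = new Map();
function renderLeaky(key) {
  if (!leakyCache.has(key)) leakyCache.set(key, { key, html: `<p>${key}</p>` });
  return leakyCache.get(key);
}

// Bounded variant: evict the least-recently-used entry once the cache
// exceeds a fixed size. A Map iterates in insertion order, so the first
// key is always the least recently used one.
const MAX_ENTRIES = 100;
const lruCache = new Map();
function renderBounded(key) {
  if (lruCache.has(key)) {
    const value = lruCache.get(key);
    lruCache.delete(key);   // re-insert to mark as recently used
    lruCache.set(key, value);
    return value;
  }
  const value = { key, html: `<p>${key}</p>` };
  lruCache.set(key, value);
  if (lruCache.size > MAX_ENTRIES) {
    lruCache.delete(lruCache.keys().next().value); // evict oldest
  }
  return value;
}

// Simulate 10,000 distinct requests: the leaky cache retains all of
// them, while the bounded cache stays at MAX_ENTRIES.
for (let i = 0; i < 10000; i++) {
  renderLeaky(`page-${i}`);
  renderBounded(`page-${i}`);
}
console.log(leakyCache.size, lruCache.size); // 10000 100
```

In a GC log, the leaky pattern typically shows up as a heap-after-collection size that only ever grows; bounding the cache makes it plateau.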

As a team offering performance management solutions, we usually analyze problems that come from other teams (both inside and outside Alibaba), so we need to work with them to fix the issues. For external clients, we can give a rough direction, since most of the time we cannot access the code base. Things usually get clearer after a few rounds of Q&A.

For internal clients, especially those who work on infrastructure, we can usually access at least part of the codebase and tend to be more familiar with their business logic, so we can give more specific suggestions.

We provide a platform for monitoring the performance of Node.js applications down to the core, including garbage collection. So after the application is modified and redeployed, we usually ask our clients to check whether the statistics they see on our platform go back to normal. If it is hard to tell just by looking at the figures, we ask them to turn on the GC log for a few minutes and visualize it with our solution to see whether the problem’s pattern has gone.

How can people take what Alibaba did and implement it? Are there certain environments that might find this more useful?

Joyee Cheung: We plan to open source our parser (and possibly the visualization) for the garbage collection logs in the near future. We have also posted a few articles about our experiences with them on our blog (we plan to translate them into English as part of the documentation).

These tools are most useful for long-running server-side applications that handle at least a fair amount of traffic, especially those that do a lot of serialization/deserialization and data transformation.

View the full schedule to learn more about this marquee event for Node.js developers, companies that rely on Node.js, and vendors. Or register now for Node.js Interactive.