Managing Production Systems with Kubernetes in Chinese Enterprises


In his talk at KubeCon, Xin Zhang, CEO of Caicloud, will describe his company’s experiences using Kubernetes to manage production systems in large-scale Chinese enterprises.

Kubernetes has rapidly evolved from running production workloads at Google to deployment in an increasing number of global enterprises. Interestingly, US and Chinese enterprises have different expectations when it comes to requirements, platforms, and tools. In his upcoming talk at KubeCon, Xin Zhang, CEO of Caicloud, will describe his company’s experiences using Kubernetes to manage production systems in large-scale Chinese enterprises. We spoke with him to learn more.

Is there anything holding back Kubernetes adoption and/or successful Kubernetes deployments in China?

Xin Zhang: We have encountered several pain points with Kubernetes adoption when deploying it in Chinese enterprises. Some examples are listed below:

  • The most obvious one, which people may stumble onto immediately, is that certain Docker images hosted outside the Chinese network are inaccessible. Some traditional industries even prohibit outbound network access entirely (no traffic leaving the enterprise intranet), so being able to deploy Kubernetes without outside network access is a must.
  • Currently, most mutating operations in Kubernetes require using the command line and writing YAML or JSON files, whereas a considerable number of Chinese enterprise users are more familiar and comfortable with UI operations.
  • Many of the networking and storage plugins for Kubernetes are based on US cloud providers such as AWS, GCE, or Azure, which may not always be available or satisfactory (performance-wise) for Chinese enterprise users.
  • The complexity of Kubernetes (both its concept and its operations manual) may seem a burden to certain users.

Are there certain required features of a production system that are unique to Chinese enterprises?

Xin: When working with our customers, we observed a set of commonly requested features that are missing from, or not yet mature in, the official upstream releases. While these patterns are drawn from our Chinese customers, they may have broader applicability elsewhere. We sketch some of them below:

  • A better logging mechanism is required. The default logging setup requires applications to write their logs to stdout or stderr, where system components like fluentd then collect them. However, Chinese enterprise applications are often written in an older style: they write logs to local files, and some use separate files for fine-grained log classification. Sometimes enterprises even want to send logs to their existing, separate log store and processing pipeline instead of using the EFK plugins.

  • Monitoring: There are several customized monitoring requests complementing the upstream solution:

    • Some customers consider running the somewhat heavyweight monitoring components in the same cluster as their applications a potential risk, and we did observe cases where monitoring components eat up system resources and affect user applications. Hence, being able to run monitoring components separately from the application cluster represents a common request.
    • While Kubernetes monitors applications running in it, a follow-up question is who monitors Kubernetes itself (its system components) and makes sure even the master is highly available.
    • Chinese enterprises tend to have existing monitoring infrastructure and tools (Zabbix is extremely popular!), and they’d like to have a unified monitoring panel that includes both Kubernetes container-level monitoring and existing metrics.
  • Network separation: While the default Kubernetes networking model allows any point-to-point network access within a cluster, complex enterprise usage scenarios require network policies, isolation, access control, or QoS among pods or services. Some enterprises even require Kubernetes to manage or cope with underlying SDN devices such as Huawei SDN controller.

What are the most common pitfalls you’ve seen when running Kubernetes in the wild?

Xin: We did encounter a handful of pitfalls during production usage in large-scale enterprise workloads. Some of them are summarized below:

  • Resource quota and limit: While resource quotas and limits are intended to provide resource isolation and allocation, a good percentage of Chinese enterprise users have little idea of what values are appropriate to set. As a result, users may set an inappropriate min or max resource range for applications, which results in either out-of-memory (OOM) task failures or very low resource utilization.

  • Monitoring instability: In our setting, we found that the default heapster + influxdb monitoring solution is not very stable in large-scale deployments, which can cause missed alerts or instability of the whole system.

  • Running out of disk: As there is little limitation on disk usage in certain scenarios, an application that writes excessive logs may exhaust the local disk quota and cause other tasks to fail.

  • Updating the cluster: We provide commercial distributions of Kubernetes to customers and update our version every three months, roughly aligned with the upstream release schedule. Updating a live Kubernetes cluster is still cumbersome.

What well-known Chinese enterprises run Kubernetes in production today? What are they using it for?

Xin: Our own Kubernetes users include leaders in a variety of industries. Some example customers are:

  • Jinjiang Travel International is one of the five largest OTA (online travel agency) and hotel companies, selling hotel stays, travel packages, and car rentals. They use Kubernetes and containers to cut their software release time from hours to just minutes, and they leverage Kubernetes to increase the scalability and availability of their online workloads.
  • China Mobile is one of the largest carriers in China. They use containers to replace VMs to run various applications on their platform in a lightweight fashion, and they leverage Kubernetes to increase resource utilization.
  • State Power Grid is the state-owned power supply company in China. They use containers and Kubernetes to provide failure resilience and fast recovery.

How can Kubernetes be used more effectively in global environments?

Xin: To us, the following are some pressing needs that, once met, will enable wider Kubernetes adoption globally:

  • Easier deployment across more diverse IaaS settings, in the parts of the world where GCE, AWS, etc. are not the best choices.

  • More performance tuning and optimization: Production systems have stringent performance requirements, hence continuing to push the boundary of Kubernetes performance is of great value.

  • Better documentation and education: We have received customer complaints that the official documentation is still hard to follow and contains too many cross-references. We hope more effort can be devoted to better documentation and to more educational events around the globe (such as training, certification, and technical meetups/conferences).
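
To make the deployability point concrete, consider the first pain point Xin mentions: Docker images hosted outside the Chinese network, or clusters with no outbound access at all. One common workaround is to copy the required images into a registry inside the intranet and rewrite image references to point at it. The Python sketch below is only an illustration of that rewriting step, not Caicloud's implementation. The mirror host name `registry.internal.example:5000` and the path layout (mirror host, then original registry, then original path) are assumptions made for the example, and the function names are hypothetical.

```python
from typing import Tuple

# Registry implied when an image reference does not name a host.
DEFAULT_REGISTRY = "docker.io"


def split_registry(image: str) -> Tuple[str, str]:
    """Split an image reference into (registry, remainder).

    Uses the usual Docker convention: the first path component is a
    registry host only if it contains '.' or ':' or is 'localhost'.
    """
    first, sep, rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    # No explicit registry: the image lives on Docker Hub, and
    # single-name images such as "nginx" sit under "library/".
    remainder = image if sep else "library/" + image
    return DEFAULT_REGISTRY, remainder


def to_mirror(image: str, mirror: str) -> str:
    """Rewrite an image reference to point at an internal mirror.

    Assumed layout (hypothetical): <mirror>/<original registry>/<path>.
    References that already point at the mirror are returned unchanged.
    """
    registry, remainder = split_registry(image)
    if registry == mirror:
        return image
    return mirror + "/" + registry + "/" + remainder


if __name__ == "__main__":
    mirror = "registry.internal.example:5000"  # hypothetical intranet registry
    for ref in ["nginx:1.25", "gcr.io/google_containers/pause:3.0"]:
        print(ref, "->", to_mirror(ref, mirror))
```

A rewrite like this only helps once the images themselves have been copied into the internal registry and the cluster's manifests use the rewritten references. How that copying is done depends on the environment and is outside the scope of this sketch.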

Registration for this event is sold out, but you can still watch the keynotes via livestream and catch the session recordings on CNCF’s YouTube channel. Sign up for the livestream now.
