Monitoring in the entire technical world is terrible and continues to be a giant, confusing mess. How do you monitor? Are you monitoring things the wrong way? Why not hire a monitoring consultant!
Today, we’re talking to monitoring consultant Mike Julian, who is the editor of the Monitoring Weekly newsletter and author of O’Reilly’s Practical Monitoring. He is the voice of monitoring.
Some of the highlights of the show include:
Observability comes from control theory and monitoring is for what we can anticipate
Industry’s lack of interest and focus on monitoring
When there’s an outage, why doesn’t monitoring catch it? Unforeseen things.
Cost and failure of running tools and systems that are obtuse to monitor
Outsource monitoring instead of devoting time, energy, and personnel to it
Outsourcing infrastructure means you give up some control; how you monitor and manage systems changes when on the Cloud
Although Real-Time Linux (RT Linux) has been a staple at Embedded Linux Conferences for years — here’s a story on the RT presentations in 2007 — many developers have viewed the technology as peripheral to their own embedded projects. Yet as RT, enabled via the PREEMPT_RT patch, prepares to be fully integrated into the mainline kernel, a wider circle of developers should pay attention. In particular, Linux device driver authors will need to ensure that their drivers play nice with RT-enabled kernels.
Julia Cartwright speaking at Embedded Linux Conference.
At the recent Embedded Linux Conference in Portland, National Instruments software engineer Julia Cartwright, an acting maintainer on a stable release of the RT patch, gave a well-attended presentation called “What Every Driver Developer Should Know about RT.” Cartwright started with an overview of RT, which helps provide guarantees for user task execution for embedded applications that require a high level of determinism. She then described the classes of driver-related problems that can have a detrimental impact to RT, as well as potential resolutions.
One of the challenges of any real-time operating system is that most target applications have two types of tasks: those with real-time requirements and latency sensitivity, and non-time-critical tasks such as disk monitoring, throughput, or I/O. “The two classes of tasks need to run together and maybe communicate with one another with mixed criticality,” explained Cartwright. “You must resolve two different degrees of time sensitivity.”
One solution is to split the tasks by using two different hardware platforms. “You could have an Arm Cortex-R, FPGA, or PLD based board for super time-critical stuff, and then a Cortex-A series board with Linux,” said Cartwright. “This offers the best isolation, but it raises the per unit costs, and it’s hard to communicate between the domains.”
Another approach is to use the virtualization approach provided by Xenomai, the other major Linux-based solution aside from PREEMPT_RT. “Xenomai follows a hypervisor, co-kernel approach using a hypervisor or AMP solution to separate an RTOS from Linux,” said Cartwright. “However, there’s still a tradeoff in limited communications between the two systems.”
RT Linux’s PREEMPT_RT, meanwhile, enables the two systems to share a kernel, scheduler, device stack, and subsystems. In addition, Linux IPC mechanisms can be used to communicate between the two. “There’s not as much isolation, but much greater communication and usability,” said Cartwright.
One challenge with RT is that “because the drivers are shared between the real-time and non real-time systems, they can misbehave,” said Cartwright. “A lot of the bugs we’re finding in RT come from device drivers.”
In RT, the time between when an event occurs – such as a timer firing an interrupt or an I/O device requesting service – and the time when the real-time task executes is called the delta. “RT systems try to characterize and bound this in some meaningful way,” explained Cartwright. “A cyclic test takes a time stamp, then sleeps for a set time such as 10ms, and then takes a time stamp when the thread wakes up. The difference between the time stamps, which is the amount of time the thread slept, is called the delta.”
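To make that concrete, here is a minimal userspace sketch of the idea behind cyclictest. It is not the real tool (which adds SCHED_FIFO priorities, per-CPU measurement threads, and histograms); it is just the core loop: ask to sleep for a fixed interval and record how much the thread oversleeps.

/* delta.c - minimal cyclictest-style latency sketch (illustrative only) */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define INTERVAL_NS 10000000L   /* ask to sleep 10 ms per iteration */
#define LOOPS       1000

static int64_t ts_diff_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    int64_t max_delta = 0;

    for (int i = 0; i < LOOPS; i++) {
        struct timespec start, wake;
        struct timespec req = { .tv_sec = 0, .tv_nsec = INTERVAL_NS };

        clock_gettime(CLOCK_MONOTONIC, &start);
        clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);  /* requested sleep */
        clock_gettime(CLOCK_MONOTONIC, &wake);            /* actual wakeup   */

        /* How much later than requested did we actually wake up? */
        int64_t delta = ts_diff_ns(start, wake) - INTERVAL_NS;
        if (delta > max_delta)
            max_delta = delta;
    }

    printf("worst-case wakeup latency: %lld ns\n", (long long)max_delta);
    return 0;
}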
This delta can be broken down into two phases. The first is irq_dispatch latency, the time from when the hardware interrupt fires until the scheduler is told that the thread handling it needs to run. The second is scheduling latency, the time between when the scheduler learns that a high priority task needs to run and the moment the CPU actually executes that task.
irq_dispatch latency
When using mainline Linux without RT extensions, irq_dispatch latency can be considerable. “Say you have one thread executing in user mode, and an external interrupt such as a network event fires that you don’t care about in your real time app,” said Cartwright. “But the CPU is going to vector off into hard interrupt context and start executing the handler associated with that network device. If during that interrupt handler duration, a high priority event fires, it’s not able to be scheduled on the CPU until the low priority interrupt is done executing.”
The delta between the internal event firing and the external event “is a direct contributor to irq_dispatch latency,” said Cartwright. “Without an RT patch it would be a mess to define bounds on this because the bound would be the bound of the longest running interrupt handler in the system.”
RT avoids this latency by forcing irq threads. “There’s very little code that we execute in a hard interrupt context – just little shims that wake up the threads that are going to execute your handler,” said Cartwright. “You may have a low priority task running, and perhaps also a medium priority task that the irq fires, but only a small portion of time is spent waking up the associated handlers for the threads.”
RT also provides other guarantees. For example, because interrupt handlers now run in threads, they can be preempted. “If a high priority, real-time critical interrupt fires, that thread can be scheduled immediately, which reduces the irq_dispatch latency,” said Cartwright.
Cartwright said that most drivers require no modification to participate in forced irq threading. In fact, “Threaded irq actually exists in mainline now. You can boot a kernel and pass the threadirqs parameter and it will thread all your interrupts. RT will add a forced enablement.”
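For drivers that want to opt in explicitly, or simply to show what the threaded model looks like, the mainline API is request_threaded_irq(). The sketch below is illustrative: struct my_dev, the handler names, and the register handling are made up; only the request_threaded_irq() call, the IRQF_ONESHOT flag, and the IRQ_* return codes are real kernel API.

#include <linux/interrupt.h>
#include <linux/io.h>

struct my_dev {
    int irq;
    void __iomem *regs;
    /* ... device state ... */
};

/* Hard-irq shim: runs in interrupt context, does the bare minimum
 * (ack/mask the device if needed) and defers the real work. */
static irqreturn_t my_hardirq(int irq, void *dev_id)
{
    return IRQ_WAKE_THREAD;
}

/* Threaded handler: runs in a schedulable, preemptible kernel thread,
 * so it may sleep, take regular locks, allocate memory, and so on. */
static irqreturn_t my_thread_fn(int irq, void *dev_id)
{
    struct my_dev *dev = dev_id;

    /* Process the event using dev->regs here; sleeping is allowed. */
    (void)dev;
    return IRQ_HANDLED;
}

static int my_dev_request_irq(struct my_dev *dev)
{
    return request_threaded_irq(dev->irq, my_hardirq, my_thread_fn,
                                IRQF_ONESHOT, "my_dev", dev);
}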
Yet there are a few cases when this causes problems. “If your drivers are invoked in the process of delivering an interrupt dispatch, you can’t be threaded,” said Cartwright. “This can happen with irqchip implementations, which should not be threaded. Another issue may arise if the driver is invoked by the scheduler.”
Other glitches can emerge when “explicitly disabling interrupts using local_irq_disable or local_irq_save.” Cartwright recommended against using such calls in drivers. As an alternative to local_irq_disable, she suggested using spinlocks or local locks, a new feature that will soon be proposed for mainline. In a separate presentation at ELC 2018, “Maintaining a Real Time Stable Kernel,” Linux kernel developer Steven Rostedt goes into greater depth on local locks.
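As a rough illustration of the kind of change she was describing, the sketch below replaces a blanket local_irq_save() section with a spinlock around the shared data. The lock and the data are hypothetical; the point is that on mainline this still keeps interrupts off locally, while on RT the spinlock turns into a preemptible, priority-inheriting lock, so a high-priority irq thread is not shut out.

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_dev_lock);
static unsigned int my_dev_pending;   /* example shared state */

/* Instead of:
 *     local_irq_save(flags);
 *     my_dev_pending++;
 *     local_irq_restore(flags);
 * protect the data, not the whole CPU: */
static void my_dev_mark_pending(void)
{
    unsigned long flags;

    spin_lock_irqsave(&my_dev_lock, flags);
    my_dev_pending++;
    spin_unlock_irqrestore(&my_dev_lock, flags);
}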
Cartwright finished her discussion of irq_dispatch latency by describing some rare, hardware-related MMIO issues. In one case, accidentally pulling the Ethernet cable during testing caused buffering in the interconnect, which garbled the interrupts and was a pain to track down. “To solve it we had to follow each write by a readback, which prevents write stacking,” she said. The ultimate solution? “Take the drugs away from the hardware people.”
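The write-then-readback trick she mentioned looks roughly like this in driver code; the register offset is invented, but writel()/readl() are the standard MMIO accessors. Because the read cannot complete until the preceding write has reached the device, posted writes can no longer stack up in the interconnect and land in one burst.

#include <linux/io.h>

#define MY_DEV_CTRL_REG 0x10   /* hypothetical register offset */

static void my_dev_write_ctrl(void __iomem *base, u32 val)
{
    writel(val, base + MY_DEV_CTRL_REG);

    /* Read the same register straight back to flush the posted write
     * and prevent write stacking in the interconnect. */
    (void)readl(base + MY_DEV_CTRL_REG);
}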
Scheduling latency
There appear to be fewer driver problems related to the second latency phase – scheduling – the time between when the scheduler is told that a real-time thread needs to run and when that thread actually executes. One example stems from the use of preempt_disable, which prevents a higher priority thread from being scheduled upon return from interrupt because preemption has been disabled.
“The only reason for a device driver to use preempt_disable is if you need to synchronize with the act of scheduling itself, which can happen with cpufreq and cpuidle,” said Cartwright. “Use local locks instead.”
On mainline Linux, spinlock-protected critical sections are implicitly executed with preemption disabled, which can similarly lead to latency problems. “With RT we solve this by making spinlock critical sections preemptible,” said Cartwright. “We turn them into PI-aware mutexes and disable migration. When a spinlock is held by a thread, it can be preempted by a higher priority thread to bring in the outer bound.”
Most drivers require no changes in order to have their spinlock critical sections made preemptible, said Cartwright. However, if a driver is involved in interrupt dispatch or scheduling, it must use raw_spin_lock(), and all such critical sections must be kept minimal and bounded, as in the sketch below.
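For those rare cases, a raw spinlock looks like the following sketch (the irqchip register fiddling is invented). Unlike a normal spinlock, a raw_spinlock_t still truly spins with interrupts off even on RT, which is exactly why the critical section has to stay tiny and bounded.

#include <linux/spinlock.h>
#include <linux/io.h>

static DEFINE_RAW_SPINLOCK(my_irqchip_lock);

/* Only for code on the interrupt-dispatch or scheduling path,
 * such as an irqchip implementation. */
static void my_irqchip_mask(void __iomem *mask_reg, u32 bit)
{
    unsigned long flags;

    raw_spin_lock_irqsave(&my_irqchip_lock, flags);
    writel(readl(mask_reg) | bit, mask_reg);   /* short, bounded work only */
    raw_spin_unlock_irqrestore(&my_irqchip_lock, flags);
}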
One major downside for developers is that Kubernetes has no native functionality to manage its own clusters or worker nodes. As a consequence, operations must get involved every time a new cluster or worker node is needed. Of course, several ways exist to create a cluster or a new worker node, but all of them require specific domain knowledge.
Cluster API
The Cluster API is a new working group under the umbrella of sig-cluster-lifecycle. The group’s objective is to design a simple, declarative API for creating clusters and machines. The working group is in the very early stages of defining all the API types, but an example for GCP already exists.
The whole Cluster API currently consists of four components: one API server and three controllers.
API server
The Cluster API in its current state brings an extension-API server which is responsible for CRUD operations on the API resources.
Controllers
Currently there are three controllers planned:
MachineController
MachineSet
MachineDeployment
The MachineController is meant to be provider specific, as each provider has its own way of managing machines. MachineSet and MachineDeployment, on the other hand, will be generic controllers which simply generate Machine or, respectively, MachineSet resources. This approach is very similar to a Deployment, which manages ReplicaSets, while a ReplicaSet manages Pods.
But, of course, a provider could also implement a MachineSet and MachineDeployment controller, which could make sense, looking at AutoScaling groups on AWS or NodePools on GKE.
API types
The Cluster API introduces four new types:
Cluster
A Cluster represents a Kubernetes cluster and holds the configuration of the entire control plane, except for node specifics.
Machine
A Machine represents an instance at a provider, which can be any kind of server: an AWS EC2 instance, a PXE-booted bare-metal server, or a Raspberry Pi.
MachineSet – You see where this is going 😉
A MachineSet is similar to a ReplicaSet: a definition for a set of same machines. The MachineSet controller will create machines based on the defined replicas and the machine template.
MachineDeployment
A MachineDeployment is similar to a Deployment: a definition for a well-managed set of machines. The MachineDeployment controller, though, will not directly manage machines but MachineSets. For each change to a MachineDeployment, the controller will create and scale up a new MachineSet to replace the old one.
ProviderConfig
Each Machine and Cluster type has a field called providerConfig within its spec. The providerConfig field is loosely defined: it allows arbitrary data to be stored in it, so each API implementation can keep provider-specific configuration there.
Outlook
Possible ways to utilize this new API would be:
Autoscaling
Integration with the Cluster Registry API
Automatically add a new cluster to a registry, support tooling that works across multiple clusters using a registry, delete a cluster from a registry
Streamlining Kubernetes installers by implementing the Cluster API
Declarative Kubernetes upgrades for the control plane and kubelets
Maintaining consistency of control plane and machine configuration across different clusters / clouds
Cloud adoption / lift and shift / liberation
Henrik Schmidt
Henrik Schmidt is a Senior Developer at Loodse. He is passionate about the potential of Kubernetes and cloud native technologies and has been a major contributor to the Open Source projects nodeset and kube-machine.
While SREs are hotshots in the industry, their role in a microservices environment is not simply a natural fit that goes hand in hand, like peanut butter and jelly. Although SREs and microservices evolved in parallel inside the world’s software companies, the latter actually makes life far more difficult for the former.
That’s because SREs live and die by their full stack view of the entire system they are maintaining and optimizing. The role combines the skills of a developer with those of an admin, producing an employee capable of debugging applications in production environments when things go completely sideways.
As Google engineers essentially invented the role, the company offers a great deal of insight into how they manage systems that handle up to 100 billion requests a day. They boil down reliability into an essential element, every bit as desirable as velocity and innovation.
“The initial step is taking seriously that reliability and manageability are important. People I talk to are spending a lot of time thinking about features and velocity, but they don’t spend time thinking about reliability as a feature,” said Todd Underwood, an SRE director at Google.
Attend Automotive Linux Summit and Open Source Summit Japan in Tokyo, June 20 – 22, for three days of open source education and collaboration.
Automotive Linux Summit connects those driving innovation in automotive Linux from the developer community, with the vendors and users providing and using the code, in order to propel the future of embedded devices in the automotive arena.
Session highlights for Automotive Linux Summit:
Enabling Hardware Configuration Flexibility Keeping a Unified Software – Dominig ar Foll, Intel
Beyond the AGL Virtualization Architecture – AGL Virtualization Expert Group (EG-VIRT) – Michele Paolino, Virtual Open Systems
High-level API for Smartphone Connectivity on AGL – Takeshi Kanemoto, RealVNC Ltd.
AGL Development Tools – What’s New in FF? – Stephane Desneux, IoT.bzh
Amazon lost control of a small number of its cloud services IP addresses for two hours on Tuesday morning when hackers exploited a known Internet-protocol weakness that let them redirect traffic to rogue destinations. By subverting Amazon’s domain-resolution service, the attackers masqueraded as cryptocurrency website MyEtherWallet.com and stole about $150,000 in digital coins from unwitting end users. They may have targeted other Amazon customers as well.
The incident, which started around 6 AM California time, hijacked roughly 1,300 IP addresses, Oracle-owned Internet Intelligence said on Twitter. … The 1,300 addresses belonged to Route 53, Amazon’s domain name system service.
The highly suspicious event is the latest to involve Border Gateway Protocol, the technical specification that network operators use to exchange large chunks of Internet traffic. Despite its crucial function in directing wholesale amounts of data, BGP still largely relies on the Internet-equivalent of word of mouth from participants who are presumed to be trustworthy. Organizations such as Amazon whose traffic is hijacked currently have no effective technical means to prevent such attacks.
Only 20 years after the label “Open Source” was coined, the entire tech ecosystem has embraced its values of sharing, collaboration and freedom. Although Open Source Software is pervasive in our everyday lives, does everyone, especially the younger generation, realize how to leverage it?
Last summer, over the course of 3 weeks, High School students with no prior experience in Computer Science (CS) joined Holberton School’s first Immersion Coding Camp to learn how to code and build their own website.
“Best things that the tech industry could give us” – Julia (senior at Bishop O’Dowd High School)
Students who had never heard of open source before the camp got excited about the concept and ended up completely hooked on Git, Atom or Bootstrap.
When asked to talk about their experience and their understanding of OSS, students have a lot to say! Here is a summary of their feedback.

“Over the course of 3 weeks, not only have the students experienced the school approach — project-based and peer-learning — but they also have had the opportunity to be exposed to the tech industry. They visited tech companies such as Scality, Salesforce, Twitch and the VC firm Trinity Ventures. They made connections with engineers, designers, investors and learned how to use the open-source software that professionals use on a daily basis.”
“A great way to learn is by looking at other people’s work and study what they did” – Mickael
All the students realized that open source is a great way to explore or learn more about technology on their own, especially when they don’t have computer science classes at school. As Mickael (sophomore at Redwood High School) says, “A great way to learn is by looking at other people’s work and study what they did”. GitHub is a considerable source of information and knowledge; “there’s always a ‘Read me’ file” to help you understand projects.
“It’s really cool that your work can help so many other people because you might have created what people had been looking for” shares Tyler (junior at KIPP King Collegiate).
Students embraced the fact that OSS allows collaboration and mutual support: people are able to share valuable information, and anyone can contribute to a project or a product so that “it can even become better than what the person made originally”. When coding their navigation bar, for example, students used the code already available in Bootstrap instead of coding it from scratch. It allowed them to express their creativity without being limited by their coding expertise.
“It can even become better than what the person made originally” – Tyler
Another reason why students adopted OSS so easily is that they realized the potential and opportunity it conveys. When using OSS, students feel they earn valuable skills and are given an equal opportunity to succeed: “Anybody can get into the tech industry,” says Jonathan (senior at Castro Valley High School). Learning how to use the software that big tech companies use equips you for the real world when you are looking for a job, regardless of your social background!
“Anybody can make their own website and create animations for free and nobody’s held back” – Jonathan
As they experienced the benefits of sharing and collaboration (which are not always encouraged in their classrooms), students really embraced OSS and its values. They realized they had come a long way over the course of the summer camp, and they are looking forward to discovering more software to help them learn about all aspects of coding and programming.
With no prior knowledge of CS, by the end of camp the high school students were able to build their own websites from scratch, implement very sophisticated layouts and create animations. They learned a lot, expressed their creativity and increased their confidence in their technical abilities. This coding camp would not have been so rewarding for them without OSS.
The Network Time Protocol (NTP) is a protocol used to synchronize a computer’s system clock automatically over a network. With NTP, the machine can keep its system clock in Coordinated Universal Time (UTC) rather than local time.
The most common method to sync system time over a network on Linux desktops or servers is to run the ntpdate command, which sets the system time from an NTP time server. In this case, the ntpd daemon must be stopped on the machine where the ntpdate command is issued.
In most Linux systems, the ntpdate command is not installed by default. To install it, execute the below command:
Interest in hiring open source professionals is on the rise, with more companies than ever looking for full-time hires with open source skills and experience. To gather more information about the changing landscape and opportunities for developers, administrators, managers, and other open source professionals, Dice and The Linux Foundation have partnered to produce two open source jobs surveys — designed specifically for hiring managers and industry professionals.
Please take a few minutes to complete the short survey and share it with your friends and colleagues. Your participation can help the industry better understand the state of open source jobs and the nature of recruiting and retaining open source talent. This is your chance to let companies, HR and hiring managers and industry organizations know what motivates you as an open source professional.
The survey results will be compiled into the 2018 Open Source Jobs Report. This annual report from Dice and The Linux Foundation presents the current state of the job market for open source professionals and examines what hiring managers are looking for when recruiting open source talent. You can download the 2017 Open Source Jobs Report for free.
As a token of our appreciation, $2000 in Amazon gift cards will be awarded to survey respondents selected at random after the closing date. Complete the survey for a chance to win one of four $500 gift cards.
Serverless computing or Function as a Service (FaaS) is a new buzzword created by an industry that loves to coin new terms as market dynamics change and technologies evolve. But what exactly does it mean? What is serverless computing?
Before getting into the definition, let’s take a brief history lesson from Sirish Raghuram, CEO and co-founder of Platform9, to understand the evolution of serverless computing.
“In the 90s, we used to build applications and run them on hardware. Then came virtual machines that allowed users to run multiple applications on the same hardware. But you were still running the full-fledged OS for each application. The arrival of containers got rid of OS duplication and offered process-level isolation, which made it lightweight and agile,” said Raghuram.
Serverless, specifically Function as a Service (FaaS), takes it to the next level: users can now write individual functions and run them, without the build, ship, and run cycle that containers require. There is no complexity of underlying machinery needed to run those functions, and no need to worry about spinning up containers with Kubernetes. Everything is hidden behind the scenes.
“That’s what is driving a lot of interest in function as a service,” said Raghuram.
What exactly is serverless?
There is no single definition of the term, but to build some consensus around the idea, the Cloud Native Computing Foundation (CNCF) Serverless Working Group wrote a white paper to define serverless computing.
According to the white paper, “Serverless computing refers to the concept of building and running applications that do not require server management. It describes a finer-grained deployment model where applications, bundled as one or more functions, are uploaded to a platform and then executed, scaled, and billed in response to the exact demand needed at the moment.”
Ken Owens, a member of the Technical Oversight Committee at CNCF, said that the primary goal of serverless computing is to help users build and run their applications without having to worry about the cost and complexity of servers in terms of provisioning, management and scaling.
“Serverless is a natural evolution of cloud-native computing. The CNCF is advancing serverless adoption through collaboration and community-driven initiatives that will enable interoperability,” said Chris Aniszczyk, COO, CNCF.
It’s not without servers
First things first, don’t get fooled by the term “serverless.” There are still servers in serverless computing. Remember what Raghuram said: all the machinery is hidden; it’s not gone.
The clear benefit here is that developers need not concern themselves with tasks that don’t add any value to their deliverables. Instead of worrying about managing the function, they can dedicate their time to adding features and building apps that add business value. Time is money, and every minute saved on management goes toward innovation. Developers don’t have to worry about scaling for peaks and valleys; it’s automated. And because cloud providers charge only for the duration that functions run, developers cut costs by not having to pay for blinking lights.
But… someone still has to do the work behind the scenes. There are still servers offering FaaS platforms.
In the case of public cloud offerings like Google Cloud Platform, AWS, and Microsoft Azure, these companies manage the servers and charge customers for running those functions. In the case of private cloud or datacenters, where developers don’t have to worry about provisioning or interacting with such servers, there are other teams who do.
The CNCF white paper identifies two groups of professionals that are involved in the serverless movement: developers and providers. We have already talked about developers. But, there are also providers that offer serverless platforms; they deal with all the work involved in keeping that server running.
That’s why many companies, like SUSE, refrain from using the term “serverless” and prefer the term function as a service, because they offer products that run those “serverless” servers. But what kind of functions are these? Is it the ultimate future of app delivery?
Event-driven computing
Many see serverless computing as an umbrella that offers FaaS among many other potential services. According to CNCF, FaaS provides event-driven computing where functions are triggered by events or HTTP requests. “Developers run and manage application code with functions that are triggered by events or HTTP requests. Developers deploy small units of code to the FaaS, which are executed as needed as discrete actions, scaling without the need to manage servers or any other underlying infrastructure,” said the white paper.
Does that mean FaaS is the silver bullet that solves all problems for developing and deploying applications? Not really. At least not at the moment. FaaS does solve problems in several use cases and its scope is expanding. A good use case of FaaS could be the functions that an application needs to run when an event takes place.
Let’s take an example: a user takes a picture with a phone and uploads it to the cloud. Many things happen when the picture is uploaded – it’s scanned (EXIF data is read), a thumbnail is created, the content of the image is analyzed using deep learning/machine learning, and information about the image is stored in the database. That one event of uploading the picture triggers all those functions. Those functions die once the event is over. That’s what FaaS does. It runs code quickly to perform all those tasks and then disappears.
That’s just one example. Another could be an IoT device where a motion sensor triggers an event that instructs the camera to start recording and sends the clip to the designated contact. Your thermostat may trigger the fan when the sensor detects a change in temperature. These are some of the many use cases where function as a service makes more sense than the traditional approach. It also means that not all applications (at least at the moment, though that will change as more organizations embrace serverless platforms) can run as functions as a service.
According to CNCF, serverless computing should be considered if you have these kinds of workloads:
Asynchronous, concurrent, easy to parallelize into independent units of work
Infrequent or has sporadic demand, with large, unpredictable variance in scaling requirements
Stateless, ephemeral, without a major need for instantaneous cold start time
Highly dynamic in terms of changing business requirements that drive a need for accelerated developer velocity
Why should you care?
Serverless is a very new technology and paradigm. Just as VMs and containers transformed app development and delivery models, FaaS may also bring dramatic changes. We are still in the early days of serverless computing; as the market matures, consensus forms, and new technologies emerge, FaaS may grow beyond the workloads and use cases mentioned here.
What is becoming quite clear is that companies embarking on their cloud native journey must have serverless computing as part of their strategy. The only way to stay ahead of competitors is by keeping up with the latest technologies and trends.
It’s about time to put serverless into servers.
For more information, check out the CNCF Working Group’s serverless whitepaper here. And, you can learn more at KubeCon + CloudNativeCon Europe, coming up May 2-4 in Copenhagen, Denmark.