
This Week in Open Source: OSS as New Normal in Data, New Linux Foundation Kubernetes MOOC

This week in open source and Linux news, Hortonworks CTO considers why open source is the new normal in analytics, a new Linux Foundation edX MOOC is called a “no-brainer,” and more! Read on for the top headlines of the week.

1) Hortonworks CTO unpacks how open source data architectures are “now considered mainstream in the IT environments and are widely deployed in live production in several industries.”

Open Source Is The New Normal In Data and Analytics – Forbes

2) Steven J. Vaughan-Nichols calls new Linux Foundation Kubernetes MOOC a “no-brainer.”

Linux Foundation Offers Free Introduction to Kubernetes Class – ZDNet

3) “Lyft’s move is part of a greater trend among tech companies to open-source their internal tools for performing machine learning work.”

Lyft to Open-Source some of its AI Algorithm Testing Tools – VentureBeat

4) The Linux Foundation has become a catalyst for the shift toward network functions virtualization (NFV) and software-defined networking (SDN).

How is The Linux Foundation Shaping Telecom? – RCRWireless News

5) You can now download a flavor of the popular Linux distribution to run inside Windows 10.

Ubuntu Linux is Available in the Windows Store – engadget

How Open Source Took Over the World

GOING WAY BACK, pretty much all software was effectively open source. That’s because it was the preserve of a small number of scientists and engineers who shared and adapted each other’s code (or punch cards) to suit their particular area of research. Later, when computing left the lab for business, commercial powerhouses such as IBM, DEC and Hewlett-Packard sought to lock in their IP by making software proprietary and charging a hefty license fee for its use.

The precedent was set and up until five years ago, generally speaking, that was the way things went. Proprietary software ruled the roost, and even in the enlightened environs of the INQUIRER office, mention of open source was invariably accompanied by jibes about sandals and stripy tanktops, basement-dwelling geeks and hairy hippies. But now the hippies are wearing suits, open source is the default choice of business and even the arch-nemesis Microsoft has declared its undying love for collaborative coding.

But how did we get here from there? Join INQ as we take a trip along the open source timeline, stopping off at points of interest on the way, and consulting a few folks whose lives or careers were changed by open source software.

Read more at The Inquirer

Analyzing GitHub, How Developers Change Programming Languages over Time

Have you ever struggled with yet another obscure project, thinking: “I could do the job with this language, but why not switch to another one that would be more enjoyable to work with”? In his awesome blog post, “The eigenvector of ‘Why we moved from language X to language Y,’” Erik Bernhardsson generated an N*N contingency table of all Google queries related to changing languages. However, when I read it, I couldn’t help wondering what proportion of people actually switched. So I decided to dig deeper into this idea and see how the popularity of languages changes among GitHub users.

Dataset available

Thanks to our data retrieval pipeline, source{d} has opened a dataset containing the yearly number of bytes coded by each GitHub user in each programming language. In a few figures, that means:

  • 4.5 Million GitHub users
  • 393 different languages
  • 10 TB of source code in total
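
To make this concrete, here is a hypothetical sketch (in Python, using pandas) of how a dataset like this could be turned into a language-transition table. The file name and column names (user, year, language, bytes) are assumptions for illustration only, not the actual schema that source{d} published.

# Hypothetical sketch: build a language-transition table from yearly
# per-user, per-language byte counts. File and column names are assumed.
import pandas as pd

df = pd.read_csv("bytes_by_user_year_language.csv")  # hypothetical file

# Dominant language per user per year = the language with the most bytes.
dominant = (df.sort_values("bytes", ascending=False)
              .drop_duplicates(["user", "year"])
              .loc[:, ["user", "year", "language"]])

# Pair each user-year with the same user's following year.
following = dominant.assign(year=dominant["year"] - 1)
pairs = dominant.merge(following, on=["user", "year"],
                       suffixes=("_from", "_to"))

# Contingency table: rows = language in year Y, columns = language in year Y+1.
transitions = pd.crosstab(pairs["language_from"], pairs["language_to"])
print(transitions)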

Read more at source{d}

Decentralizing Your Microservices Organization

Adaptability — the ability to quickly and easily change — has become a primary goal for modern businesses and has put pressure on technology teams to build platforms that are easier and less costly to change.  Working in such environments, these teams have been attracted more and more to the microservices style of software architecture.  What attracts them is the promise of a method for expediting changes to software, without introducing unnecessary danger to the business.

The microservices way of doing things is made possible in large part by favoring decentralization of software components and data — more specifically, by breaking up “monolithic” elements into smaller, easier-to-change pieces, and deploying those pieces on the network. Making this architecture work well requires a change to the way work is done and how work is governed. The organization that adopts microservices is one that “gets out of the developer’s way,” and provides the freedom and autonomy to make the magic happen.

Read more at The New Stack

Introducing Facade: An Easy Way to Track Git Repo Contributions

One of the great things about open source is that (most of the time), source code repositories are easily accessible. They can be great sources of diagnostic data, enabling you to understand who is contributing and committing code to your critical upstream projects. However, acquiring this data can be a labor-intensive process when monitoring a bunch of repos at once. This is particularly true if you want to monitor how contributions to a project change over time.

My background is in math, and I love digging into numbers to understand how and why things are happening in a certain way. Over the past few years I realized that generating summary statistics is the most time-consuming part of analyzing contributor stats. There are tools out there which can generate excellent summaries for single repos (and particularly for the kernel; gitdm is fantastic at this).

However, I regularly found myself doing substantial post-processing to generate consolidated views of the data. Unfortunately, this meant hours in Excel monkeying around with pivot tables. And, if you discover you got something wrong or need to map a domain name to a corporate affiliation, it’s back to square one… and when you want to see how things have changed, it happens all over again.

This is not a good way to keep yourself sane. So, in the spirit of “scratch your own itch,” I wrote a tool to analyze groups of git repos, aggregate the stats, and produce summary reports.

The FOSS Anomaly Detector, aka “Facade”

I call the project Facade (Figure 1). The meaning is twofold: First, this was originally conceived as a “FOSS Anomaly Detector” (F.O.S.S.A.D.), which would allow you to see how contribution patterns changed over time. Second, it allows you to see behind your speculations about the project, and get an informed view of who is doing the development work, based upon real data.

Figure 1: Facade.

Facade is built around the idea of “projects,” which are groups of repositories (Figure 2). Statistics are aggregated by project and for individual repos. Most interactions with it are web-based, and it can run more or less unattended and continuously. Creating reports requires little more than copying and pasting, but if you really want to dive in, it can produce a CSV of raw contributor data. If you’re handy with Python and MySQL, you can also create customized summary reports as Excel files each time the dataset is updated.

Figure 2: Facade projects.

Facade gets its data by mining the git log. First, it calculates the list of parent commits for HEAD and figures out which ones it hasn’t yet parsed. For each commit, it stores the author info, the committer info, the files that were changed, and stats about the patch itself. Then, once the analysis is complete, it summarizes these stats by repo or project (a rough sketch of this kind of log mining follows the list below):

  • Lines of code added (minus whitespace changes)

  • Lines of code removed

  • Number of whitespace changes (changes to indentation, blank lines, etc.)

  • Number of patches

  • Number of unique contributors
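
If you want to experiment with this kind of git-log mining yourself, here is a minimal, hypothetical sketch. To be clear, this is not Facade’s actual code: it only tallies patches and added/removed lines per author email using git log --numstat, and it does not break out whitespace-only changes the way Facade does.

# Minimal git-log mining sketch (NOT Facade's actual implementation).
# Tallies patches and added/removed lines per author email; whitespace
# changes are not separated out here.
import subprocess
from collections import defaultdict

LOG_FORMAT = "--pretty=format:--%H|%an|%ae|%aI"  # assumes names contain no "|"

def mine_repo(path):
    out = subprocess.run(
        ["git", "-C", path, "log", LOG_FORMAT, "--numstat"],
        capture_output=True, text=True, check=True).stdout

    stats = defaultdict(lambda: {"added": 0, "removed": 0, "patches": 0})
    author = None
    for line in out.splitlines():
        if line.startswith("--"):                 # commit header line
            commit_hash, name, email, date = line[2:].split("|")
            author = email
            stats[author]["patches"] += 1
        elif line.strip() and author:             # numstat line: added, removed, file
            added, removed, filename = line.split("\t")
            if added != "-":                      # "-" marks a binary file
                stats[author]["added"] += int(added)
                stats[author]["removed"] += int(removed)
    return stats

if __name__ == "__main__":
    for email, totals in mine_repo(".").items():
        print(email, totals)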

Facade attempts to associate email addresses with companies (Figure 3). These mappings can be added in the web interface, so you can gradually increase the accuracy of Facade over time.

Figure 3: Facade summary.

Facade also includes the ability to tag email addresses, for identifying teams of contributors within the data.
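
As a toy illustration of the mapping idea only (Facade itself stores these mappings in its database and manages them through the web interface), associating an email address with a company can be as simple as a domain lookup. The domains and company names below are made up.

# Toy domain-to-affiliation lookup (not Facade's actual code).
# Domains and company names here are hypothetical examples.
AFFILIATIONS = {
    "example.com": "Example Corp",
    "example.org": "Example Foundation",
}

def affiliation(email):
    domain = email.rsplit("@", 1)[-1].lower()
    return AFFILIATIONS.get(domain, "(Unknown)")

print(affiliation("dev@example.com"))    # Example Corp
print(affiliation("someone@gmail.com"))  # (Unknown)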

All of the info is stored in a database, so if you want to get really granular (for example, “who authored the most non-whitespace changes to a certain file between 2011 and 2014?”), you have the raw data you need. I designed it to store pretty much everything, so for every commit it records the following (a hypothetical table sketch follows the list):

  • The commit hash

  • The author’s name

  • The author’s email

  • The author’s canonical email, if they used an alias

  • The author’s affiliation (if known)

  • The date the patch was generated

  • The committer’s name

  • The committer’s email

  • The committer’s canonical email, if they used an alias

  • The committer’s affiliation (if known)

  • The date the patch was committed

  • The filename

  • Lines of code added

  • Lines of code removed

  • Number of whitespace changes
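
As a rough sketch of what such a table could look like, the example below uses sqlite3 so it runs standalone; this is not Facade’s actual MySQL schema, and the table, column, and file names are hypothetical. It also runs the kind of “really granular” query mentioned above.

# Hypothetical per-commit table (not Facade's real MySQL schema), using
# sqlite3 so the sketch is self-contained and runnable.
import sqlite3

conn = sqlite3.connect("facade_sketch.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS commit_records (
    commit_hash           TEXT,
    author_name           TEXT,
    author_email          TEXT,
    author_canonical      TEXT,
    author_affiliation    TEXT,
    author_date           TEXT,
    committer_name        TEXT,
    committer_email       TEXT,
    committer_canonical   TEXT,
    committer_affiliation TEXT,
    committer_date        TEXT,
    filename              TEXT,
    added                 INTEGER,
    removed               INTEGER,
    whitespace            INTEGER
)""")

# Example of a granular query: non-whitespace lines added to one file
# between 2011 and 2014, grouped by author. The filename is made up.
rows = conn.execute("""
    SELECT author_email, SUM(added) AS total_added
    FROM commit_records
    WHERE filename = ?
      AND author_date BETWEEN '2011-01-01' AND '2014-12-31'
    GROUP BY author_email
    ORDER BY total_added DESC
""", ("src/main.c",)).fetchall()
print(rows)
conn.close()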

Getting started

The Facade source code can be found at https://github.com/brianwarner/facade.

The best way to get started is to clone Facade into your web root, and then follow the README. You will probably need to resolve a few dependencies.

Next, you’ll run python utilities/setup.py. By and large, it should do everything for you unless you want to customize things. If you already have a MySQL database that supports utf8mb4, Facade can use that. Or you can just mash the Enter key a bunch of times, enter the root database password, and Facade will create the database, user, and tables for you. Once you set up your username and password for the website, you’re ready to go.

The first thing to do is log into the website, using the “manage” link at the bottom. This will allow you to configure projects, add repos, create tags, update aliases and affiliations, and set global configurations.

Once you’ve added a few repos, it’s back to the command line. Run utilities/facade-worker.py, and when it’s complete, project and repo stats will appear on the website.

Chances are pretty good that almost all contributions will be categorized with (Unknown) affiliations. Don’t panic; that’s expected. Go to the People tab, fill in a few domain mappings, and re-run facade-worker.py. The results should make a bit more sense.

Facade is intended to be run on a regular basis, so I recommend setting up a daily cron job. Just remember, if you make changes using the web interface, run the facade-worker.py script to see their effects.

So how does this compare with gitdm?

Gitdm is a fantastic tool, and it’s used for different things than Facade. In particular, it’s really well designed for gathering Linux kernel statistics, and it enables much finer-grained control over the range of commits. It’s also a little easier to get up and running, as it doesn’t require a database or web server. Gitdm also works on a single repository, and produces a single aggregate report.

On the other hand, Facade is meant to be run continuously, and data is stored so it doesn’t have to be recalculated each time. The statistics are grouped by date, which allows different views of the data. Facade will also yield slightly different results because it attempts to break out whitespace contributions separately.

So while both tools do gather summary stats, there are different (and very good) reasons to use one or the other.

“Dammit Jim, I’m a manager, not an engineer!”

I’ll just close with a preemptive apology — I write code for fun, not for a living, and am the first to admit I have lots more to learn. There may be rough edges, corner cases, and things which can be improved. So if you look at Facade and something about it makes you cringe, I would love to see your patches. Or if you’d like to make it do something new and cool, I would also love to see your patches. I am maintaining a list of things I’d like to add, and welcome both ideas and contributors.

You can find the code on GitHub, and me on Twitter.

Observability for Cloud Native

Integrating Honeycomb into your Kubernetes cluster with ksonnet.

Although JSON/YAML Kubernetes manifests are straightforward to read and write, they are not always the best way to manage applications on your cluster. If you have a complex, production system and want to modify its deployment with existing approaches, you may experience significant operational costs.

In that case, what do you do if you want to add a feature like observability to your Kubernetes cluster? Even if you have a solution that (1) provides insight into the intrinsically dynamic workloads of a cloud native platform, it also needs to be (2) easy to embed and (3) easily extensible.

Recently we teamed up with the folks at Honeycomb, who had prior domain experience from Facebook, to address these points. Fortunately for us, they have an existing observability agent that handles all of (1). What we bring to the table is ksonnet, an open-source Jsonnet library and a powerful, composable approach to writing Kubernetes manifests. Our resulting collaboration, a Honeycomb library of ksonnet mixins, accomplishes all the aforementioned goals.

Read more at Heptio

Which Spark Machine Learning API Should You Use?

A brief introduction to Spark MLlib’s APIs for basic statistics, classification, clustering, and collaborative filtering, and what they can do for you.

But what can machine learning do for you? And how will you find out? There’s a good place to start close to home, if you’re already using Apache Spark for batch and stream processing. Along with Spark SQL and Spark Streaming, which you’re probably already using, Spark provides MLlib, which is, among other things, a library of machine learning and statistical algorithms in API form.

Here is a brief guide to four of the most essential MLlib APIs, what they do, and how you might use them.  

Basic statistics

Mainly you’ll use these APIs for A-B testing or A-B-C testing. Frequently in business we assume that if two averages are the same, then the two things are roughly equivalent. That isn’t necessarily true. Consider a car manufacturer that replaces the seat in a car and surveys customers on how comfortable it is. At one end, shorter customers may say the seat is much more comfortable. At the other end, taller customers will say it is so uncomfortable that they wouldn’t buy the car, and the people in the middle balance out the difference.
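
A tiny, made-up example makes the point concrete. This is plain Python rather than Spark code, purely to illustrate why comparing averages alone can mislead; the ratings are invented numbers on a 1-10 scale.

# Two sets of made-up comfort ratings with the same average but very
# different spreads: the averages alone would hide the disagreement.
from statistics import mean, stdev

old_seat = [6, 6, 6, 6, 6, 6, 6, 6]    # everyone finds it acceptable
new_seat = [10, 10, 9, 6, 6, 3, 2, 2]  # short riders love it, tall riders hate it

print(mean(old_seat), mean(new_seat))    # 6 vs 6: identical averages
print(stdev(old_seat), stdev(new_seat))  # 0 vs ~3.4: very different spread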

Read more at InfoWorld

Why You Should Care About Net Neutrality

This image does a good job illustrating what the net neutrality discussion is all about (thanks to Software Engineering Daily).

When folks discuss the idea of net neutrality, a lot of terms come up around legislation, like “Title I” and “Title II,” and regulatory bodies, like the FCC and FTC. I’ve linked to articles that dig into this in detail below. While those are interesting pieces of information, I’d like to spend time on why this is a matter of philosophy and principle and why this discussion is very important.

Here are 4 ideas we’ll spend time on today (the “executive summary” if you will).

* Freedom of expression isn’t a function of the values of a place but the structure of the information infrastructure.

Read more at Medium

Actor Joseph Gordon-Levitt to Speak on Art and the Internet at Open Source Summit North America

Actor and online entrepreneur Joseph Gordon-Levitt will be speaking at Open Source Summit North America — Sept. 11-14 in Los Angeles, CA — about his experiences with collaborative technologies.

Gordon-Levitt, the founder and director of HITRECORD — an online production company that makes art collaboratively with more than half a million artists of all kinds — will share his views on the evolution of the Internet as a collaborative medium and offer some key technological lessons learned since the company’s launch.

Other new additions to the keynote lineup are:

  • Wim Coekaerts, Senior Vice President, Linux and Virtualization Engineering, Oracle

  • Chris Wright, Vice President & Chief Technologist, Office of Technology at Red Hat

And, previously announced speakers include:

  • Linus Torvalds, Creator of Linux and Git, in conversation with Jim Zemlin, Executive Director of The Linux Foundation

  • Tanmay Bakshi, a 13-year-old Algorithm-ist and Cognitive Developer, Author and TEDx Speaker

  • Bindi Belanger, Executive Program Director, Ticketmaster

  • Christine Corbett Moran, NSF Astronomy and Astrophysics Postdoctoral Fellow, Caltech

  • Dan Lyons, FORTUNE Columnist and Bestselling Author of “Disrupted: My Misadventure in the Startup Bubble”

  • Jono Bacon, Community Manager, Author, Podcaster

  • Nir Eyal, Behavioral Designer and Bestselling Author of “Hooked: How to Build Habit Forming Products”

  • Ross Mauri, General Manager, IBM z Systems & LinuxONE, IBM

  • Zeynep Tufekci, Professor, New York Times Writer, Author and Technosociologist

The full exciting lineup of Open Source Summit North America speakers and 200+ sessions can be viewed here.

Register by July 30th and save $150! Linux.com readers receive a special discount. Use LINUXRD5 to save an additional $47.

DNS Spoofing with Dnsmasq

DNS spoofing is a nasty business, and wise Linux admins know at least the basics of how it works. We’re going to learn the basics by doing some simple spoofing with Dnsmasq.

Dnsmasq has long been my first choice for LAN name services. It provides DHCP, DNS, and DHCP6, and it also provides a PXE/TFTP boot server. It performs router advertisement for IPv6 hosts, and can act as an authoritative name server. (See Dnsmasq For Easy LAN Name Services to learn the basics.)

DNS Spoofing Bad

DNS spoofing is a bad thing. A couple of legitimate uses I can think of are easier testing of locked smartphones, which otherwise need to be jailbroken to edit their hosts files, and playing “funny” pranks on the people who use your Dnsmasq server. DNS spoofing is forgery; it’s faking a DNS entry to hijack site traffic. Some governments and businesses do this to control their people’s Internet activities. It is an effective monkey-in-the-middle trick for eavesdropping and altering packets. HTTP sessions are sent in the clear, so an eavesdropper sees everything. HTTPS sessions are also vulnerable; packet headers are not encrypted (they can’t be, as intermediate routers need to read them), and there are tools like sslstrip that break SSL.

The good news is that DNS spoofing is self-limiting, because it only works on DNS servers that you control, and savvy users can find other servers to use.

Conquering Network Manager on Ubuntu

Network Manager is nice when you’re running your machine as a client, especially for auto-configuring wireless interfaces, but it has its quirks when you want to do anything your way, like run Dnsmasq. The easy way is to disable Network Manager and manually configure your network interface; then you can play with Dnsmasq without fighting Network Manager.

Another way is to make them play nice together. On Ubuntu, Network Manager uses dnsmasq-base, which is not the complete Dnsmasq. Follow these steps to get real Dnsmasq up and running:

  • sudo apt-get install dnsmasq resolvconf
  • Comment out dns=dnsmasq in /etc/NetworkManager/NetworkManager.conf
  • Stop Dnsmasq with sudo killall -9 dnsmasq
  • After configuring Dnsmasq, restart Network Manager with sudo service network-manager restart

Then configure and start Dnsmasq as shown in the following steps.

Simple Dnsmasq Spoofing

Install Dnsmasq and then create a new, empty /etc/dnsmasq.conf. Save the originally installed file as a reference, and you should also have dnsmasq.conf.example somewhere, depending on where your particular Linux flavor puts it.

Add these lines to /etc/dnsmasq.conf. Replace 192.168.1.10 with your own IP address:

server=208.67.222.222
server=208.67.220.220
listen-address=127.0.0.1
listen-address=192.168.1.10
no-dhcp-interface=
no-hosts
addn-hosts=/etc/dnsmasq.d/spoof.hosts

The server lines configure which DNS servers handle your Internet DNS requests. This example uses the free OpenDNS servers.

listen-address tells Dnsmasq which addresses to listen on. You must enter 127.0.0.1, and then also the IP address of your machine.

no-dhcp-interface= disables the built-in DHCP server, for fewer complications.

no-hosts disables reading /etc/hosts, again, to keep our testing as simple as possible.

addn-hosts names the file that you are going to enter your DNS spoofs in. It uses the same format as /etc/hosts. For testing purposes you can use fake IP addresses, like this:

192.168.25.101 www.example.com example.com
192.168.25.100 www.example2.com example2.com

Replace example.com with a real site name. Now start Dnsmasq from the command line:

$ sudo dnsmasq --no-daemon --log-queries
dnsmasq: started, version 2.75 cachesize 150
dnsmasq: compile time options: IPv6 GNU-getopt 
DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack 
ipset auth DNSSEC loop-detect inotify
dnsmasq: using nameserver 208.67.220.220#53
dnsmasq: using nameserver 208.67.222.222#53
dnsmasq: reading /etc/resolv.conf
dnsmasq: using nameserver 208.67.220.220#53
dnsmasq: using nameserver 208.67.222.222#53
dnsmasq: ignoring nameserver 127.0.0.1 - local interface
dnsmasq: read /etc/dnsmasq.d/spoof.hosts - 2 addresses

Ctrl+c stops it. This shows that Dnsmasq sees my upstream DNS servers and my spoof file. Test your spoof with the dig command:

$ dig +short @192.168.1.10 example2.com
192.168.25.100

You should see this in your Dnsmasq command output:

dnsmasq: query[A] example2.com from 192.168.1.10
dnsmasq: /etc/dnsmasq.d/spoof.hosts example2.com is 192.168.25.100

Fake Sites

If you successfully trick someone into using your spoof server, you can capture and examine their traffic at your leisure. This highlights the importance of SSL everywhere, and of using SSH and OpenVPN. Even these can be vulnerable, though breaking them takes considerably more expertise than eavesdropping on unencrypted traffic.

Your spoofed IP addresses will not resolve to real websites; requests will merely hang if you try to access the sites in a web browser. If you really want to act like a mad phisher, the next step is to build a fake web site to fool site visitors.

Detect Spoofing

The distributed nature of DNS means that DNS spoofing is impossible to implement on a large scale. The simplest test is to use dig to query multiple DNS servers and compare the results. This example queries an OpenDNS server:

$ dig +short @208.67.220.220 example2.com
10.11.12.13

It is good manners to use only public DNS servers and to not abuse private servers. Google’s public DNS is 8.8.8.8 and 8.8.4.4, and you can find many more with a quick web search.
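
If you want to automate that comparison, here is a quick sketch (assuming dig is installed) that loops over a handful of public resolvers and prints what each one returns for a name; a resolver that disagrees with the rest deserves a closer look. The addresses are the OpenDNS and Google public servers, and example2.com is the test name spoofed earlier in this article.

# Compare one name's answers across several public resolvers by shelling
# out to dig. A lone disagreeing answer suggests spoofing or a stale cache.
import subprocess

RESOLVERS = ["208.67.220.220", "208.67.222.222", "8.8.8.8", "8.8.4.4"]

def resolve(name, server):
    out = subprocess.run(["dig", "+short", "@" + server, name],
                         capture_output=True, text=True).stdout
    return out.split() or ["(no answer)"]

for server in RESOLVERS:
    print(server, resolve("example2.com", server))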

Feed the Kitty

Dnsmasq has been chugging along for years now, while other name servers have come and gone. If you use it, send the maintainer, Simon Kelley, a few bucks because nothing says “thank you” for this excellent software like cash money.

Learn more about Linux through the free “Introduction to Linux” course from The Linux Foundation and edX.