
Hadoop MapReduce

Hadoop MapReduce Introduction

MapReduce is the processing layer of Hadoop. It is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You write your business logic in the form MapReduce expects, and the framework takes care of the rest: the complete job submitted by the user to the master is divided into small tasks and assigned to the slave nodes.

MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. A MapReduce job takes a list as input and converts it into another list as output. MapReduce is the heart of Hadoop; much of Hadoop's power and efficiency comes from the parallel processing it enables.
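This list-in, list-out style can be sketched in plain Python. The snippet below is a simplified, single-machine illustration of the word-count pattern, not the actual Hadoop API: the map phase turns each input record into (key, value) pairs, and the reduce phase combines all values for the same key.

```python
from collections import Counter
from itertools import chain

# Map phase: each input record (a line of text) becomes a list of
# (key, value) pairs -- here, (word, 1) for every word on the line.
def map_phase(line):
    return [(word.lower(), 1) for word in line.split()]

# Reduce phase: all values for the same key are combined -- here, summed.
def reduce_phase(pairs):
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# In Hadoop, each line could be mapped on a different node in parallel;
# the framework then shuffles the pairs by key before the reduce step.
mapped = chain.from_iterable(map_phase(line) for line in lines)
result = reduce_phase(mapped)
print(result["the"])  # 3
```

The shuffle step between map and reduce, which groups pairs by key across machines, is what the Hadoop framework handles for you.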

MapReduce – High-level Understanding

Map-Reduce divides the work into small parts, each of which can be done in parallel on the cluster of servers. A problem is divided into a large number of smaller problems each of which is processed independently to give individual outputs. These individual outputs are further processed to give final output.

Hadoop Map-Reduce is highly scalable and can be used across many computers. Many small machines can be used to process jobs that normally could not be processed by a large machine.

Read More At Data Flair

The Case for Open Source Software at Work

Open source has entered the limelight at work. Not only is it frequently being used in businesses – but it’s helping people build their professional reputations, according to the recently released 2017 GitHub Open Source Survey.

Notably, half of the 5,500 GitHub contributors surveyed say that their open source work was somewhat or very important in getting their current role.

The survey found nearly all (94 percent) employed respondents use open source at least sometimes professionally (81 percent use it frequently), and 65 percent of those who contribute back do so as part of their work duties.

Also striking is the fact that most respondents say their employers accept – or encourage – the use of open source applications (82 percent) and dependencies (84 percent) in their code base, although some said their employers’ policies on use of open source are unclear (applications: 13 percent; dependencies: 11 percent).

The survey also found that nearly half (47 percent) of respondents’ employers have an IP policy that allows them to contribute to open source without permission, while another 12 percent can do so with permission. There is also a grey area here: 28 percent say their employer’s IP policy is unclear and another 9 percent aren’t sure how a company’s IP agreement handles open source contributions.

Large Companies Are On Board

The attention open source is receiving is no doubt helped by the fact that it is, well, open, allowing anyone to participate, regardless of the company they work at, enabling a variety of different perspectives. Some of the world’s largest companies – including Walmart, ExxonMobil, and Wells Fargo — are using the software as well as open sourcing their own code. The government has taken notice, too. In 2016, the Obama administration released its first official federal source code policy, which stipulates that “new custom-developed Federal source code be made broadly available for reuse across the Federal Government.”

TechCrunch recently released an index of the top 40 open source projects occurring in enterprise IT. They include IT operations; data and analytics, including tools for artificial intelligence and machine learning as well as databases; and DevOps, which includes projects involving containerization.

Some of the attributes of open source software clearly contribute to its increasing use on the job; the survey revealed stability and user experience are “extremely important” to 88 percent of respondents and “important” to 75 percent. Yet, these same attributes don’t make open source a superior option — only 36 percent said the user experience is better, and 30 percent find it more stable than proprietary options. However, open source software remains the preferred option for 72 percent of respondents who say they gravitate toward it when evaluating new tools.

Access to the software code makes developing with open source a “no-brainer,” writes Jack Wallen in a TechRepublic article on the 10 best uses for open source software in the business world. Among the other compelling use cases he cites for open source: big data, cloud, collaboration, workflow, multimedia, and e-commerce projects.

Typically, when an issue is discovered in open source, it can be reviewed and addressed quickly by either internal or third-party software developers. Contrast that with using proprietary software, where you are beholden to the software vendor or partner to provide software updates, and their timing may be different from yours. Additionally, if a bug is found, it is more likely to be identified and resolved quickly when the source code is readily available, avoiding the delays that occur with closed, proprietary systems.

The GitHub survey’s 5,500 respondents were randomly sampled and sourced from more than 3,800 open source repositories on GitHub.com, and more than 500 responses were from a non-random sample of communities who work on other platforms.

Connect with the open source development community at Open Source Summit NA, Sept. 11-14 in Los Angeles. Linux.com readers save on registration with discount code LINUXRD5. Register now!

OS Summit Europe Keynotes Feature Jono Bacon, Keila Banks, and a Q&A with Linus Torvalds

Open Source Summit Europe is not far away! This year’s event — held Oct. 23-26 in Prague, Czech Republic — will feature a wide array of speakers, including open source community expert Jono Bacon, 11-year-old hacker Reuben Paul, and Linux creator Linus Torvalds.

At OS Summit Europe, you will have the opportunity to collaborate, share, learn, and connect with 2,000 technologists and community members, through keynote presentations, technical talks, and many other event activities.  

Confirmed keynote speakers for OS Summit Europe include:

  • Jono Bacon, Community/Developer Strategy Consultant and Author

  • Keila Banks, 15-year-old Programmer, Web Designer and Technologist, with her father Phillip Banks

  • Mitchell Hashimoto, Founder of HashiCorp and Creator of Vagrant, Packer, Serf, Consul, Terraform, Vault, and Nomad

  • Neha Narkhede, Co-founder & CTO, Confluent

  • Sarah Novotny, Program Manager, Kubernetes Community, Google

  • Reuben Paul, 11-year-old Hacker, CyberShaolin Founder and Cyber Security Ambassador

  • Imad Sousou, VP, Software Services Group & GM, Open Source Technology Center, Intel Corporation

  • Linus Torvalds, Creator of Linux and Git in conversation with Dirk Hohndel, VP, Chief Open Source Officer, VMware

  • Jim Zemlin, Executive Director, The Linux Foundation

The full schedule will be published in the next few weeks, and applications are now being accepted for diversity and needs-based scholarships.

Registration is discounted to $800 through August 27, and academic and hobbyist rates are also available. Linux.com readers receive an additional $40 off with code LINUXRD5. Register Now!

Future Proof Your SysAdmin Career: New Networking Essentials

In this series, we’re looking at some important considerations for sysadmins who want to expand their skills and advance their careers. The previous article provided an introduction to the concepts we’ll be covering, and this article focuses on one of the fundamental skills that every sysadmin needs to master: networking.

Networking is a complicated but essential core competency for sysadmins. A good sysadmin understands:

  • How users connect to a network, including managing remote connections via a Virtual Private Network (VPN)

  • How users authenticate to a network, ranging from standard two-factor authentication, to custom authentication requirements

  • How switching, routing and internetworking work

  • Software-Defined Networking (SDN)

  • End-to-end protocols

  • Network security

Fundamentals

TCP/IP (Transmission Control Protocol/Internet Protocol) forms the basis of how devices connect to and interface with the Internet. Sysadmins understand how TCP/IP packets address, route, and deliver data across a network.
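The addressing side of this can be illustrated with Python's standard ipaddress module, which performs the same subnet-membership test that underlies routing decisions. The addresses below are arbitrary private-range examples chosen for illustration:

```python
import ipaddress

# A /24 network: 256 addresses sharing the first three octets.
network = ipaddress.ip_network("192.168.1.0/24")

host = ipaddress.ip_address("192.168.1.42")
outsider = ipaddress.ip_address("10.0.0.7")

# Routers make forwarding decisions with exactly this kind of
# prefix-membership test on a packet's destination address.
print(host in network)        # True
print(outsider in network)    # False
print(network.num_addresses)  # 256
```

Being comfortable with CIDR notation and prefix lengths like the /24 above pays off daily when reading routing tables and firewall rules.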


A good sysadmin also knows how the Domain Name System (DNS) and resource records work, including understanding nameservers. They are typically fluent with DNS query tools such as dig and nslookup, as well as topics such as Sender Policy Framework (SPF) and NOTIFY.
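Besides command-line tools like dig, name resolution can be exercised programmatically. A small sketch using only Python's standard socket module ("localhost" is used here because it resolves from the local hosts file even without network access; swap in a real hostname to exercise the configured DNS resolver):

```python
import socket

# Forward lookup: name -> IPv4 address.
address = socket.gethostbyname("localhost")
print(address)  # typically 127.0.0.1

# getaddrinfo is the more general call: it returns every
# (family, type, protocol, canonical name, socket address)
# tuple for a name, covering both IPv4 and IPv6.
for family, _, _, _, sockaddr in socket.getaddrinfo("localhost", 80):
    print(family.name, sockaddr)
```

Note that these calls go through the system resolver (hosts file, then configured nameservers), whereas dig queries a DNS server directly, which is why their answers can differ during troubleshooting.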

With large-scale security threats continuing to emerge, there is now a premium on experience with network security tools and practices. That means understanding everything from the Open Systems Interconnection (OSI) model to devices and protocols that facilitate communication across a network. Locking down security also means understanding the infrastructure of a network. Securing a network requires competency with routers, firewalls, VPNs, end-user systems, server security, and virtual machines.
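A basic building block of that infrastructure awareness is knowing which TCP ports are actually reachable on a host. The following is a minimal sketch using only Python's standard library, a toy stand-in for real auditing tools such as nmap; the demo opens its own local listener so it is self-contained:

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising.
        return sock.connect_ex((host, port)) == 0

# Demo against a listener we control: bind an ephemeral port
# locally, then verify the check sees it as open.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 = let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

print(is_port_open("127.0.0.1", port))  # True
listener.close()
```

Only probe hosts you are authorized to audit; unsolicited port scanning of other people's systems is generally unwelcome and often against policy.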

Additionally, knowledge of a platform like OpenStack can effectively expand any sysadmin’s networking clout, because OpenStack, CloudStack, and other cloud platforms essentially expand the perimeter of what we think of as “the network.”

Likewise, the basics of software-defined networking (SDN) are increasingly important for sysadmins to understand. SDN permits admins to programmatically initialize, control, and manage network behavior dynamically through open interfaces and abstractions of lower-level functionality. This, too, is a category where familiarity with leading open source tools can be a big differentiator for sysadmins. OpenDaylight, a project at The Linux Foundation, is an open, programmable, software-defined networking platform worth studying, and OpenContrail and ONOS are also on the rise in this space.

Additionally, many smart sysadmins are working with open configuration management tools such as Chef and Puppet. Julian Dunn, a product manager at Chef, writes: “System administrators have got to stop thinking of servers/disk/memory/whatever as ‘their resources’ that ‘they manage.’ DevOps isn’t just some buzzword concept that someone has thought up to make sysadmins’ lives hell. It’s the natural evolution of both professions.” See our list of  relevant, open configuration management tools here.

Training courses

For sysadmins who want to learn more about networking, the good news is that training in this area is very accessible and, in some cases, free. Furthermore, excellent free and open source administration and configuration tools are available to help boost any sysadmin’s networking efficiency.

Training options for Linux-focused sysadmins include a variety of networking courses. For sysadmins, CompTIA Linux+ offers solid training options, as does the Linux Professional Institute. The Linux Foundation Certified System Administrator (LFCS) course is another good choice. The Linux Foundation offers the LFS201 basic course and LFCS exam. Many vendors in the Linux arena also offer networking-focused training and certification for sysadmins, including Red Hat.

It’s also worth checking out O’Reilly’s Networking for Sysadmins video training options. These videos cover TCP/IP basics, OSI, and all the essential components within a network’s infrastructure, ranging from firewalls to VPNs to routers and virtual machines. The information is comprehensive, with some of the individual videos requiring a full day to complete and digest. Additionally, the curriculum is available on demand, so it can be used as reference material for networking essentials.

Additionally, Lynda.com offers an array of online network administration courses taught by experts. Sysadmins can quickly get through courses such as Linux File Sharing Services and Identity and Access Management.

Even as sysadmins focus on moving up the technology stack with their skillsets, networking basics remain essential. Fortunately, training and education are more accessible than ever. Next time, we’ll look at important security requirements to consider when advancing your sysadmin career.

Learn more about essential sysadmin skills: Download the Future Proof Your SysAdmin Career ebook now.

 

Read more:

Future Proof Your SysAdmin Career: An Introduction to Essential Skills 

Future Proof Your SysAdmin Career: New Networking Essentials

Future Proof Your SysAdmin Career: Locking Down Security

Future Proof Your SysAdmin Career: Looking to the Cloud

Future Proof Your SysAdmin Career: Configuration and Automation

Future Proof Your SysAdmin Career: Embracing DevOps

Future Proof Your SysAdmin Career: Getting Certified

Future Proof Your SysAdmin Career: Communication and Collaboration

Future Proof Your SysAdmin Career: Advancing with Open Source

The Roadmap for Successfully Managing Open Source Software Vulnerabilities and Licensing

By Jeff Luszcz, Vice President of Product Management at Flexera Software

If Heartbleed has taught us anything, it’s that third-party security and compliance risks are dangerously threatening the integrity of the software supply chain.

As you may know, the majority of organizations use more open source code in their products than code they’ve written themselves, which expedites the product creation process. The problem is that many companies using open source software (OSS) do so with no regard for the licenses associated with the code. Because OSS is free to use, many companies fail to realize that they still need to respect the legal obligations associated with its licensing, such as passing along a copyright statement or a copy of the license text, or providing the entire source code for the company’s product.

Often, enterprises are largely unaware of the percentage of OSS their products depend on, and it’s impossible for them to be aware of the legal responsibilities associated with code they don’t know they’re using. Software vulnerabilities could also negatively affect your product, and lacking awareness about what open source code you’re using will put you behind the curve on upgrades or patches for known software bugs, as anyone impacted by the recent WannaCry attack can attest.

That said, there’s amazing value in OSS. You just need to know how to comply.

Discovering, Managing and Complying in Five Actions

Once you know the associated risks, you need to figure out how to get a handle on the open source components your company is using. There are five key actions that can help you understand what your company is doing and set up a process for discovering, managing and complying with the OSS it uses.

1. Understand how OSS enters your company

There are many ways that open source enters your company. The classic case is that a developer decides to use an open source component, downloads the source code and incorporates that source code into their product. This is still a very common case, but there are many other ways that open source ends up being used in an organization. Often, developers will use what is known as a repository manager, which allows them to specify the components they want to use and download the source code, or compiled binary file, as well as any dependencies that the component may have. Repository managers typically store the open source components in a separate repository outside of your classic source code management system. Some common repository managers are Maven, NuGet and npm.

Another way that open source comes into an organization is as a subcomponent of a commercial or larger open source component. It is very common to have multiple open source subcomponents or dependencies for a single, top-level component. These subcomponents are often not disclosed or managed.

Additionally, open source is often used as runtime infrastructure, such as Web servers, operating systems or databases.

All of these components may be pulled in by developers, graphic designers, procurement, professional services, IT administrators and many others. This is no longer only a developer-based process.

2. Start looking for OSS

Once you know the ways that open source is selected and used, you can start performing an assessment of what components you depend on and how they are being utilized or distributed. This is typically known as building your Bill of Materials (BOM), essentially your open source disclosure list. This list is used to follow the obligations, modify OSS policy, and react to published vulnerabilities. It is common to find open source packages whose licensing obligations your organization isn’t able to follow. This puts you out of compliance with the license. In these cases, the open source component needs to be removed and the functionality replaced, either through the use of another OSS component or by writing equivalent functionality.
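A first pass at building that BOM can be as simple as walking the source tree for well-known package-manager manifests. The sketch below is illustrative only; real scanning tools go much deeper, detecting copied-in source files, compiled binaries and transitive dependencies, and the manifest list here is a small arbitrary sample:

```python
import json
import os

# Filenames that signal a package manager is in use. A real scanner
# would also handle pom.xml, *.csproj, requirements.txt, and so on.
MANIFESTS = {"package.json", "Gemfile", "go.mod"}

def find_manifests(root):
    """Yield paths of known package-manager manifests under root."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name in MANIFESTS:
                yield os.path.join(dirpath, name)

def npm_dependencies(path):
    """Extract direct dependency names from a package.json file."""
    with open(path) as f:
        data = json.load(f)
    return sorted(data.get("dependencies", {}))

# Usage sketch: each discovered manifest contributes entries to the BOM.
# bom = {p: npm_dependencies(p) for p in find_manifests(".")
#        if p.endswith("package.json")}
```

Even a crude inventory like this gives you a starting list to check against published vulnerabilities and license obligations; the gaps it leaves are exactly what the interviews and scanning tools mentioned below are for.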

Codebase reviews, interviews and scanning tools can be helpful during this process.

3. Question your development team

As projects get larger, more complex and more distributed, it has become harder to discover all of the pieces that are in use. This makes it important to have periodic conversations with the developers, DevOps and IT personnel involved with the creation, deployment and running of the project in question. Asking targeted questions such as “What database do we use?” or “What encryption libraries do we use?” can be helpful in discovering other modules that may have been missed the first time.

Simply asking “What open source are we using?” rarely creates a complete list, for a few reasons: knowledge of which OSS components were selected has often disappeared from memory or left the company, and there is also a misunderstanding or lack of knowledge among many developers about what is considered open source.

4. Understand how incoming OSS is managed

There should be a consistent and enforced process for managing your third-party usage. This allows your organization to properly comply with open source license obligations, as well as react to new vulnerabilities. It is common to have teams with varying levels of completeness and compliance. Some organizations currently track only components that have been “requested” by developers. These companies find that they are often only tracking the larger pieces of open source, or have some developers who are better than others at following the process.

Other companies use scanning technology to help discover and track OSS. Depending on the scanner used, or the level of analysis performed, varying degrees of discovery will be achieved. Some tools only discover license text, not actual OSS components. Others can find components managed by package managers but can’t find anything else. It’s important to understand the level of analysis performed, and what should be expected to be found. It’s common to come into compliance in phases as more components are discovered and managed.

5. Look for evidence of OSS compliance

Once you create policies, start looking for and managing OSS, and require compliance with OSS obligations, it is important to confirm that this compliance is visible. Do you see the required attributions or copyright notices in the product or documentation? Do you see the license text as required? Is there a written offer for source code, or the actual source code distributions, for any copyleft-style licensed content you may be using? These are all visible indicators of an effective open source management process.

By going through these five actions, in addition to educating your company on how to use OSS correctly and encouraging other enterprises to do the same, you’ll be able to integrate OSS into your applications while respecting licensing agreements. Not only will this keep you more informed and aware of the code in your products; it also creates a more secure product for your customers.

Everyone Is Not Ops

Yesterday was Sysadmin Appreciation Day. There was a lot of chatter about what the future of Operations will look like, a recurrent theme being that in this day and age, Operations is “everyone’s job” or that “everyone is Ops”.

While I think people who believe this have their hearts in the right place, it’s a somewhat simplistic, or opportunistic, view. The reality on the ground happens to be more nuanced, and the problems facing most organizations are highly unlikely to be solved by idealistic chants of “everyone is Ops”.

Operations is a shared responsibility

We build systems. We write code to this end. Systems aren’t, however, defined by code alone.

Code is only ever a small component of the system. Systems are primarily conceived to fulfill a business requirement and are defined by, among others, the following characteristics:

Read more at Medium

How to Get the Next Generation Coding Early

You’ve probably heard the claim that coding, or computer programming, is as crucial a skill in the 21st century as reading and math were in the previous century. I’ll go one step further: Teaching a young person to code could be the single most life-changing skill you can give them. And it’s not just a career-enhancer. Coding is about problem-solving, it’s about creativity, and more importantly, it’s about empowerment.

Empowerment over computers, the devices that maintain our schedules, enable our communications, run our utilities, and improve our daily lives.

But learning to code is also personally empowering. The very first time a child writes a program and makes a computer do something, there’s an immediate sense of “I can do this!” And it transforms more than just a student’s attitude toward computers. Being able to solve a problem by planning, executing, testing, and improving a computer program carries over to other areas of life, as well. What parts of our lives wouldn’t be made better with thoughtful planning, doing, evaluating, and adjusting?

Read more at OpenSource.com

Database Updates Across Two Databases (part 1)

Every time a product owner says “We should pull in XYZ data to the mobile app” or “we need a new app to address this healthcare fail” an engineer needs to make two crucial decisions:

  1. Where and how should the new code acquire data the company currently has?
  2. Where and how should the new code record the data that will be newly created?

Sadly, the most expedient answer to both questions is “just use one of our existing databases”. The temptation to do so is high when an engineer need only add a database migration or write a query for a familiar database. The alternative might involve working with the organization’s infrastructure team to plan out changes to the operational footprint, then potentially making updates to the developer laptop setup.

Decisions expedient for today aren’t necessarily the best decisions for Rally’s long-term delivery velocity. We recognized database reuse and sharing was fairly common at Rally, so we tried to stop the practice in Spring 2017. We were concerned the company’s development speed and agility would eventually grind to a halt.

Read more at Rally Engineering

Open Source AI Solutions Evolve through Community Development

Tech titans ranging from Google to Facebook have been steadily open sourcing powerful artificial intelligence and deep learning tools, and now Microsoft is out with version 2.0 of the Microsoft Cognitive Toolkit. It’s an open source software framework previously dubbed CNTK, and it competes with tools such as TensorFlow (created by Google) and Caffe (created at the Berkeley Vision and Learning Center). Cognitive Toolkit works with both Windows and Linux on 64-bit platforms. It was originally launched into beta in October 2016 and has been evolving ever since.

“Cognitive Toolkit enables enterprise-ready, production-grade AI by allowing users to create, train, and evaluate their own neural networks that can then scale efficiently across multiple GPUs and multiple machines on massive data sets,” reports the Cognitive Toolkit Team. The team has also compiled a set of reasons why data scientists and developers who are using other frameworks now should try Cognitive Toolkit.

For example, Microsoft has tuned its software framework for peak performance, as detailed here. “Hundreds of new features, performance improvements and fixes have been added since beta was introduced,” the Cognitive Toolkit team notes. “The performance of Cognitive Toolkit was recently independently measured, and on a single GPU it performed best among similar platforms.”

The other open source platforms in this space are making surprising advancements as well. H2O.ai, formerly known as Oxdata, has carved out a unique niche in the machine learning and artificial intelligence arena because its primary tools are free and open source.  You can get the main H2O platform and Sparkling Water — a package that works with Apache Spark — just by downloading them. You can also find many tutorials for H2O.ai’s AI and machine learning tools here. As an example of how the H2O platform is working in the field, Cisco uses it to analyze its huge data sets that track when customers have bought particular products — such as routers — and when they might logically be due for an upgrade or checkup.

Google has open sourced a program called TensorFlow that it has spent years developing to support its AI software and other predictive and analytics programs. You can find out more about TensorFlow at its site; it is the engine behind several Google tools you may already use, including Google Photos and the speech recognition found in the Google app. According to MIT Technology Review: “[TensorFlow] underpins many future ambitions of Google and its parent company, Alphabet…Once you’ve built something with TensorFlow, you can run it anywhere but it’s especially easy to transfer it to Google’s cloud platform. The software’s popularity is helping Google fight for a bigger share of the roughly $40 billion (and growing) cloud infrastructure market, where the company lies a distant third behind Amazon and Microsoft.”

Indeed, both Google and Microsoft are drawing benefits from open sourcing their artificial intelligence tools, as community development makes the tools stronger. “Our goal is to democratize AI to empower every person and every organization to achieve more,” Microsoft CEO Satya Nadella has said.

Yahoo! has also released its key artificial intelligence (AI) software under an open source license. Its CaffeOnSpark tool is based on deep learning, a branch of artificial intelligence particularly useful in helping machines recognize human speech, or the contents of a photo or video.

If you are interested in experimenting with Microsoft Cognitive Toolkit, you can learn more here, and assorted code samples and tutorials are found here.

To learn more about the promise of machine learning and artificial intelligence, watch a video featuring David Meyer, Chairman of the Board at OpenDaylight.

Connect with the open source development community at Open Source Summit NA, Sept. 11-14 in Los Angeles. Linux.com readers save on registration with discount code LINUXRD5. Register now!

Let Us Know How You Are Using R and Data Science Tools Today

The R Consortium exists to promote the R language, environment, and community. The R community has seen significant growth — with more than 2 million users worldwide — and a broad range of organizations have adopted the R language as a data science platform.

Now, to help us understand the changing needs of our community, we have put together a short survey.

Take the R Consortium Survey Now

We want to hear: How do you use R? What do you think about the way R is developing? What issues should we be addressing? What does the big picture look like?

We want to know how you use R, and we would like to hear from the entire R community. We don’t have any particular hypothesis or point of view but would like to reach everyone who is interested in participating.

Please take a few minutes to respond to the survey and help us understand your perspective. The survey will adapt depending on your answers and will take about 10 minutes to complete.

Take the R Consortium survey now and please share with others who might be interested.