
Enabling Easier Collaboration on Open Data for AI and ML with CDLA-Permissive-2.0

The Linux Foundation is pleased to announce the release of the CDLA-Permissive-2.0 license agreement, which is now available on the CDLA website at https://cdla.dev/permissive-2-0/. We believe that CDLA-Permissive-2.0 will meet a genuine need for a short, simple, and broadly permissive license agreement to enable wider sharing and usage of open data, particularly to bring clarity to the use of open data for artificial intelligence and machine learning models. 

We’re happy to announce that IBM and Microsoft are making data sets available today using CDLA-Permissive-2.0.

In this blog post, we’ll share some background about the original versions of the Community Data License Agreement (CDLA), why we worked with the community to develop the new CDLA-Permissive-2.0 agreement, and why we think it will benefit producers, users, and redistributors of open data sets.

Background: Why would you need an open data license agreement?

Licenses and license agreements are legal documents that define how content can be used, modified, and shared. They operate within the legal frameworks for copyrights, patents, and other rights that are established by laws and regulations around the world. These laws and regulations are not always clear and are not always in sync with one another.

Decades of practice have established a collection of open source software licenses and open content licenses that are widely used. These licenses typically work within the frameworks established by laws and regulations mentioned above to permit broad use, modification, and sharing of software and other copyrightable content in exchange for following the license requirements.

Open data is different. Various laws and regulations treat data differently from software or other creative content. Depending on what the data is and which country’s laws you’re looking at, the data often may not be subject to copyright protection, or it might be subject to different laws specific to databases, i.e., sui generis database rights in the European Union. 

Additionally, data may be consumed, transformed, and incorporated into Artificial Intelligence (AI) and Machine Learning (ML) models in ways that are different from how software and other creative content are used. Because of all of this, assumptions made in commonly-used licenses for software and creative content might not apply in expected ways to open data.

Choice is often a good thing, but too many choices can be problematic. To be clear, other licenses are already in use for open data. In particular, licenses and instruments from Creative Commons (such as CC-BY-4.0 and CC0-1.0) are widely used to share data sets and creative content, and it was important in drafting the CDLA agreements to enable collaboration with data shared under similar licenses. The CDLA agreements are in no way meant as a criticism of those alternatives; rather, they focus on newer concerns born of AI and ML use cases. AI and ML models generated from open data are the primary use case organizations have struggled with, and the CDLA was designed to address those concerns. Our goal was to strike a balance between offering updated choices and presenting too many options.

First steps: CDLA version 1.0

Several years ago, in conversations with the Linux Foundation Member Counsel community, we began collaborating on a license agreement that would clearly enable the use, modification, and sharing of open data, with a particular eye to AI and ML applications.

In October 2017, The Linux Foundation launched version 1.0 of the CDLA. The CDLA was intended to provide clear and explicit rights for recipients of data under CDLA to use, share and modify the data for any purpose. Importantly, it also explicitly permitted using the results from analyzed data to create AI and ML models, without any of the obligations that apply under the CDLA to sharing the data itself. It was launched with two initial types: a Permissive variant, with attribution-style obligations, and a Sharing variant, with a “copyleft”-style reciprocal commitment when resharing the raw data.

The CDLA-Permissive-1.0 agreement saw some uptake and use. However, subsequent feedback revealed that some potential licensors and users of data found the agreement overly complex for non-lawyers. Many of its provisions addressed specific and nuanced considerations for open data under various legal frameworks. While these considerations were worthwhile, we saw that communities may weigh that specificity against the value of a concise set of terms easily comprehensible to lawyers and non-lawyers alike.

Partly in response to this, in 2019, Microsoft launched the Open Use of Data Agreement (O-UDA-1.0) to provide a more concise and simplified set of terms around the sharing and use of data for similar purposes. Microsoft graciously contributed stewardship of the O-UDA-1.0 to the CDLA effort. Given the overlapping scope of the O-UDA-1.0 and the CDLA-Permissive-1.0, we saw an opportunity to converge on a new draft for a CDLA-Permissive-2.0. 

Moving to version 2.0: Simplifying, clarifying, and making it easier

Following conversations with various stakeholders, and after a review and feedback period with the Linux Foundation Member Counsel community, we have prepared and released CDLA-Permissive-2.0.

In response to perceptions of CDLA-Permissive-1.0 as overly complex, CDLA-Permissive-2.0 is short and uses plain language to express the grant of permissions and requirements. Like version 1.0, the version 2.0 agreement maintains the clear rights to use, share and modify the data, as well as to use without restriction any “Results” generated through computational analysis of the data.

Unlike version 1.0, the new CDLA-Permissive-2.0 is less than a page in length.

The only obligation it imposes when sharing data is to “make available the text of this agreement with the shared Data,” including the disclaimer of warranties and liability. 

In a sense, you might compare its general “character” to that of the simpler permissive open source licenses, such as the MIT or BSD-2-Clause licenses, albeit specific to data (and with even more limited obligations).

One key point of feedback from users of the license, and from lawyers at organizations involved in open data, was the challenge of associating attribution information with data (or with versions of data sets).

Although “attribution-style” provisions are common in permissive open source software licenses, the feedback was that, as data technologies continue to evolve beyond what the CDLA drafters can anticipate today, it is unclear whether the typical ways of sharing attributions for open source software will fit well with open data sharing.

Removing attribution as a mandated requirement was therefore seen as preferable.

Recipients of Data under CDLA-Permissive-2.0 may still choose to provide attribution for their data sources. Attribution will often matter for community norms, and understanding a data set's origin is often a key part of its value. CDLA-Permissive-2.0 simply does not make attribution a condition of sharing the data.

CDLA-Permissive-2.0 also removes some of the more confusing terms that we learned were simply unnecessary or not useful in the context of an open data collaboration. Removing these terms allows CDLA-Permissive-2.0 to present its terms in a concise, easy-to-read format that we believe will be appreciated by data scientists, AI/ML users, lawyers, and users around the world for whom English is not a first language.

We hope and anticipate that open data communities will find it easy to adopt for releases of their own data sets.

Voices from the Community

“The open source licensing and collaboration model has made AI accessible to everyone, and formalized a two-way street for organizations to use and contribute to projects with others helping accelerate applied AI research. CDLA-Permissive-2.0 is a major milestone in achieving that type of success in the Data domain, providing an open source license specific to data that enables access, sharing and using data among individuals and organizations. The LF AI & Data community appreciates the clarity and simplicity CDLA-Permissive-2.0 provides.” Dr. Ibrahim Haddad, Executive Director of LF AI & Data 

“We appreciate the simplicity of the CDLA-Permissive-2.0, and we appreciate the community ensuring compatibility with Creative Commons licensed data sets.” Catherine Stihler, CEO of Creative Commons

“IBM has been at the forefront of innovation in open data sets for some time and as a founding member of the Community Data License Agreement. We have created a rich collection of open data sets on our Data Asset eXchange that will now utilize the new CDLAv2, including the recent addition of CodeNet – a 14-million-sample dataset to develop machine learning models that can help in programming tasks.” Ruchir Puri, IBM Fellow, Chief Scientist, IBM Research

“Sharing and collaborating with open data should be painless – and sharing agreements should be easy to understand and apply. We applaud the clear and understandable approach in the new CDLA-Permissive-2.0 agreement.” Jennifer Yokoyama, Vice President and Chief IP Counsel, Microsoft

“It’s exciting to see communities of legal and AI/ML experts come together to work on cross-organizational challenges to develop a framework to support data collaboration and sharing.” Nithya Ruff, Chair of the Board, The Linux Foundation and Executive Director, Open Source Program Office, Comcast

“Data is an essential component of how companies build their operations today, particularly around Open Data sets that are available for public use. At OpenUK, we welcome the CDLA-Permissive-2.0 license as a tool to make Open Data more available and more manageable over time, which will be key to addressing the challenges that organisations have coming up. This new approach will make it easier to collaborate around Open Data and we hope to use it in our upcoming work in this space.” Amanda Brock, CEO of OpenUK

“Verizon supports community efforts to develop clear and scalable solutions to legal issues around building artificial intelligence and machine learning, and we welcome the CDLA-Permissive-2.0 as a mechanism for data providers and software developers to work together in building new technology.” Meghna Sinha, VP – AI Center, Verizon

“Sony believes that the spread of clear and simple Open Data licenses like CDLA-2.0 activates Open Data ecosystem and contributes to innovation with AI. We support CDLA’s effort and hope CDLA will be used widely.” Hisashi Tamai, SVP, Sony Group Corporation

Data Sets Available under CDLA-Permissive-2.0

With today’s release of CDLA-Permissive-2.0, we are also pleased to announce several data sets that are now available under the new agreement. 

The IBM Center for Open Source Data and AI Technologies (CODAIT) will begin to re-license its public data sets under CDLA-Permissive-2.0, starting with Project CodeNet, a large-scale dataset with 14 million code samples developed to drive algorithmic innovations in AI for code tasks like code translation, code similarity, code classification, and code search.

Microsoft Research is announcing that the following data sets are now being made available under CDLA-Permissive-2.0:

The Hippocorpus dataset, which comprises diary-like short stories about recalled and imagined events to help examine the cognitive processes of remembering and imagining and their traces in language;
The Public Perception of Artificial Intelligence data set, comprising analyses of text corpora over time to reveal trends in beliefs, interest, and sentiment about a topic;
The Xbox Avatars Descriptions data set, a corpus of descriptions of Xbox avatars created by actual gamers;
A Dual Word Embeddings data set, trained on Bing queries, to facilitate information retrieval about documents; and
A GPS Trajectory data set, containing 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours.

Next Steps and Resources

If you’re interested in learning more, please check out the following resources:

The text of CDLA-Permissive-2.0 on the CDLA website
The updated CDLA FAQ
The CDLA repositories on GitHub


Linux Foundation Introduces Open Voice Network to Prioritize Trust and Interoperability in a Voice-Based Digital Future

Target, Schwarz Gruppe, Wegmans, Microsoft, Veritone and Deutsche Telekom lead standards effort to advance voice assistance

SAN FRANCISCO, June 22, 2021 – The Linux Foundation, the nonprofit organization enabling mass innovation through open source, today announced the Open Voice Network, an open source association dedicated to advancing open standards that support the adoption of AI-enabled voice assistance systems. Founding members include Target, Schwarz Gruppe, Wegmans Food Markets, Microsoft, Veritone, and Deutsche Telekom.

Organizations are beginning to develop, design, and manage their own voice assistant systems, independent of today's general-purpose voice platforms. This transition is driven by the desire to manage the entirety of the user experience, from the sound of the voice, the sonic branding, and the content to the integration of voice assistance into multiple business processes and brand environments, from the call center to the branch office and the store. Perhaps most importantly, organizations know they must protect the consumer and the proprietary data that flows through voice. The Open Voice Network will support this evolution by delivering standards and usage guidelines for voice assistant systems that are trustworthy, inclusive, and open.

“At Target, we’re continuously exploring and embracing new technologies that can help provide joyful, easy and convenient experiences for our guests. We look forward to working with the Open Voice Network community to create global standards and share best practices that will enable businesses to accelerate innovation in this space and better serve consumers,” said Joel Crabb, Vice President, Architecture, Target Corporation. “The Linux Foundation, with its role in advancing open source for all, is the perfect home for this initiative.”

Voice is expected to be a primary digital interface going forward, resulting in a hybrid ecosystem of general-purpose platforms and independent voice assistants that will demand interoperability between conversational agents across platforms. The Open Voice Network is dedicated to supporting this transformation with industry guidance on the voice-specific protection of user privacy and data security.

“Voice is expected to be a primary interface to the digital world, connecting users to billions of sites, smart environments and AI bots. It is already increasingly being used beyond smart speakers to include applications in automobiles, smartphones and home electronics devices of all types. Key to enabling enterprise adoption of these capabilities and consumer comfort and familiarity is the implementation of open standards,” said Mike Dolan, senior vice president and general manager of Projects at the Linux Foundation. “The potential impact of voice on industries including commerce, transportation, healthcare and entertainment is staggering and we’re excited to bring it under the open governance model of the Linux Foundation to grow the community and pave a way forward.”

“To speak is human, and voice is rapidly becoming the primary interaction modality between users and their devices and services at home and work. The more devices and services can interact openly and safely with one another, the more value we unlock for consumers and businesses across a wide spectrum of use cases, such as Conversational AI for customer service and commerce,” said Ali Dalloul, General Manager, Microsoft Azure AI, Strategy & Commercialization.

Much as open standards in the earliest days of the Internet brought a uniform way to exchange information and connect with any site anywhere, the Open Voice Network will bring the same standardized ease of development and use to voice assistant systems and conversational agents, leading to huge growth and value for businesses and consumers alike.  Voice assistance depends upon technologies like Automatic Speech Recognition (ASR), Natural Language Processing (NLP), Advanced Dialog Management (ADM) and Machine Learning (ML).  The Open Voice Network will initially be focused on the following areas:

● Standards development: research and recommendations toward the global standards that will enable user choice, inclusivity, and trust.

● Industry value and awareness: identification and sharing of conversational AI best practices that are both horizontal and specific to vertical industries, serving as the source of insight and value for voice assistance.

● Advocacy: working with and through existing industry associations on relevant regulatory and legislative issues, including those of data privacy.

“Voice is transforming the relationships between brands and consumers,” said Rolf Schumann, Chief Digital Officer, Schwarz Gruppe. “Voice is changing the way we are interacting with our digital devices. For instance, when shopping through our smart home appliances. However, voice includes more information than a fingerprint and can entail data about the emotional state or mental health of a user. Therefore, it is of utmost importance to put data protection standards in place to protect the user’s privacy. This is the only way we will contribute to the future of voice.”

“Self-regulation of synthetic voice content creation and use, to protect the voice owner as well as establishing trust with the consumer is foundational,” said Ryan Steelberg, president and cofounder of Veritone. “Having an open network through Open Voice Network for education and global standards is the only way to keep pace with the rate of innovation and demand for influencer marketing. Veritone’s MARVEL.ai, a Voice as a Service solution, is proud to partner with Open Voice Network on building the best practices to protect the voice brands we work with across sports, media and entertainment.”

Membership in the Open Voice Network includes a commitment of resources in support of its research, awareness, and advocacy activities, as well as active participation in its symposia and workshops. The Linux Foundation's open governance model will allow for community-wide contributions that will accelerate the rollout and adoption of conversational AI standards.

For more information, please visit: https://openvoicenetwork.org/

About the Linux Foundation

Founded in 2000, the Linux Foundation is supported by more than 1,000 members and is the world’s leading home for collaboration on open source software, open standards, open data, and open hardware. The Linux Foundation’s projects are critical to the world’s infrastructure, including Linux, Kubernetes, Node.js, and more. The Linux Foundation’s methodology focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org.

The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see its trademark usage page: www.linuxfoundation.org/trademark-usage. Linux is a registered trademark of Linus Torvalds.

###

Media Contact

Jennifer Cloer
for Linux Foundation and Open Voice Network
jennifer@storychangesculture.com
503-867-2304


A study of the Linux kernel PCI subsystem with QEMU

Using QEMU to analyze the Linux kernel PCI subsystem.
Read More at Oracle Linux Kernel Development

What Is OpenIDL, the Open Insurance Data Link platform?

OpenIDL is an open source project created by the American Association of Insurance Services (AAIS) to reduce the cost of regulatory reporting for insurance carriers, provide a standardized data repository for analytics, and offer a connection point for third parties to deliver new applications to members. To learn more about the project, we sat down with Brian Behlendorf, General Manager for Blockchain, Healthcare and Identity at the Linux Foundation; Joan Zerkovich, Senior Vice President, Operations at AAIS; and Truman Esmond, Vice President, Membership & Solutions at AAIS.

17 Linux commands every sysadmin should know

Get out your notepad: here is a huge list of commands that every Linux sysadmin needs to know.

Posted June 30, 2021 by Tyler Carrigan (Red Hat)

Read More at Enable Sysadmin

3 tips for Linux process performance improvement with priority and affinity

A quick look at how to pull more performance out of your system.
Read More at Enable Sysadmin

What is a sysadmin?

Who is a sysadmin? Why is a sysadmin? Long-time Linux geek and sysadmin David Both explains what being a sysadmin means to him.
David Both, June 18, 2021

What does the term “sysadmin” or “system administrator” mean to you? What does a sysadmin do? How do you know you are one?

Existential questions like this seldom have concrete answers, and the answers that do present themselves can be pretty fluid. The answers are also significantly different for different people and depend upon their work experience.

Are you a sysadmin?

Read More at Enable Sysadmin

Determining the Source of Truth for Software Components

Abstract: Having access to a list of software components and their respective metadata is critical to performing various DevOps tasks successfully. After considering the varying requirements of the different tasks, we determined that representing a software component as a “collection of files” provided an optimal representation. By contrast, when file-level information is missing, most tasks become more costly or outright impossible to complete.

Introduction

Having access to the list of software components that comprise a software solution, sometimes referred to as the Software Bill of Materials (SBOM), is a requirement for the successful execution of the following DevOps tasks:

  • Open Source and Third-party license compliance
  • Security Vulnerability Management
  • Malware Protection
  • Export Compliance
  • Functionally Safe Certification

A community effort, led by the National Telecommunications and Information Administration (NTIA) [1], is underway to create an SBOM exchange format driven mainly by the Security Vulnerability Management task. The holy grail of an effective SBOM design is twofold:

  1. Define a commonly agreed-upon data structure that best represents a software component; and
  2. Devise a method that uniquely and effectively identifies each software component.

A component must represent a broad spectrum of software types including (but not limited to): a single source file, a library, an application executable, a container, a Linux runtime, or a more complex system composed of some combination of these types. For instance, one component might be a collection of source files (e.g., busybox 1.27.2), while another might be a collection of three containers together with a half dozen scripts and documentation.

Because we must handle an eclectic range of component types, finding the right level of granularity is critical. If it is too large, we will not be able to represent all the component types and the corresponding meta-information required to support the various DevOps tasks. If it is too small, it may add unnecessary complexity, cost, and friction to adoption.

Traditionally, components have been represented at the software package or archive level, where the name and version are the primary means of identifying the component. This has several challenges, with the two biggest ones being:

  1. The fact that two different software components can have the same name yet be different, and
  2. Conversely, two copies of software with different names could be identical.

Another traditional method is to rely on the hash of the software using one of several methods – e.g., SHA1, SHA256, or MD5. This works well when your software component represents a single file, but it presents a problem when describing more complex components composed of multiple files. For example, the same collection of source files (e.g., busybox-1.27.2 [2]) could be packaged using different archive methods (e.g., .zip, .gz, .bz2), resulting in the same set of files having different hashes due to the different archive methods used. 
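
A small sketch makes the problem concrete. In the snippet below (Python, with hypothetical file names and contents), the same two files are packaged as both a gzipped tar archive and a zip archive; the archive-level digests differ even though the underlying file contents are identical.

    import hashlib, tarfile, zipfile

    # Hypothetical files standing in for a component's contents.
    files = {"math.c": b"int add(int a, int b) { return a + b; }\n",
             "util.c": b"/* helpers */\n"}
    for name, data in files.items():
        with open(name, "wb") as f:
            f.write(data)

    # Package the same files using two different archive methods.
    with tarfile.open("component.tar.gz", "w:gz") as t:
        for name in files:
            t.add(name)
    with zipfile.ZipFile("component.zip", "w") as z:
        for name in files:
            z.write(name)

    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    # Different archive formats, identical contents: the hashes differ.
    print(digest("component.tar.gz") == digest("component.zip"))  # False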

After considering the different requirements for the various DevOps tasks listed above, and given the broad range of software component types, we concluded that representing a software component as a “collection of files” where the “file” serves as the atomic unit provides an optimal representation. 

This granular level enables access to metadata at the file level, leading to a higher quality outcome when performing the various DevOps tasks (e.g., file-level licensing for license compliance, file-level vulnerability data for security management, and file-level cryptography info for export compliance).  To compute the unique identifier for a given component we recommend taking the “hash” of all the “file hashes” of the files that comprise a component. This enables unique identification independent of how the files are packaged. We discuss this approach in more detail in the sections that follow.

Why the File Level Matters

To obtain the most accurate information to support the various DevOps tasks sufficiently, one would need access to metadata at the atomic file level. This should not be surprising given that files serve as the building blocks from which software is built. If we represented software at any higher level (e.g., just name and version), pertinent information would be lost. 

License Compliance

If you want to understand all the licenses that impose obligations and restrictions on a program or library, you will need to know the licenses of all the files from which it was built (derived). Although an open source component's top-level license may be declared as a single license, it is common to find a half dozen or more other licenses within the codebase, which typically impose additional obligations. Popular open source projects usually borrow from other projects with different licenses. The open sharing of code is the force behind the success of the open source movement. For this reason, we must accept license diversity as the rule rather than the exception.

This means that a project is often subject to the obligations of multiple licenses. Consider the impact of this on the use of busybox, which provides a lot of latitude regarding the features included in a build. How one configures busybox will determine which files are used. Knowing which files are used is the only way to know which licenses are applicable. For instance, although the top-level license is GPL-2.0, the source file math.c [3] has three licenses governing it (GPL-2.0, MIT, and BSD) because it was derived from three different projects.  

If one distributes a solution that includes an instance of busybox built from math.c and provides a written offer for source code, one would need to reproduce the respective license notices in the documentation to comply with the MIT and BSD licenses. Furthermore, we have recently seen an open source component with Apache as the top-level license, yet deep within the bowels of the source code lies a set of proprietary files. These examples illustrate why having file-level information is mission-critical.

Security Vulnerability Management

The Heartbleed vulnerability was identified within the OpenSSL component in 2014. Many web servers used OpenSSL to provide secure communication between a browser and a website. If left unpatched, it would allow attackers unprecedented access to sensitive information such as login and password credentials [4]. This vulnerability could be isolated to a single line within a single file. Therefore, the easiest and most definitive way to understand whether one was exposed was to determine whether their instance of OpenSSL was built using that file.

The Amnesia:33 vulnerability announcement [5], reported in November 2020, suggested that any software solution that included the FNET component was affected. With only the name and version of the FNET component to go on, one would have incorrectly concluded that the Zephyr LTS 1.14 operating system was vulnerable. However, by examining the file-level source, one could quickly determine that the impacted files were not part of the Zephyr build, making it definitively clear that Zephyr was not in fact vulnerable [6]. Having to conduct a product recall when a product is not affected would be highly unproductive and costly. In the absence of file-level information, however, that analysis would not have been possible, likely causing unnecessary worry, work, and cost. These examples further illustrate why access to file-level information is mission-critical.

Export Compliance

The output quality of an export compliance program also depends on having access to file-level data. Although different governments have different rules and requirements concerning software export license compliance, most policies center on the use of cryptography methods and algorithms. To understand which cryptography libraries and algorithms are implemented, one needs to inspect the file-level source code. Depending on how a given software solution is built and which cryptography-based files are used (or not used), one can classify the software with respect to the policies of different jurisdictions. Having access to file-level data also enables one to determine the classification for any given jurisdiction dynamically. The requirements of the export compliance task therefore also make file-level knowledge mission-critical.

Functional Safety

The objective of the functional safety software certification process is to mitigate the unacceptable risk of physical injury or of damage to the health of people and/or property. The standards that govern functional safety (e.g., ISO/IEC 61508, ISO 26262, …) require that the full system context be known in order to assess and mitigate risk successfully. Full system transparency therefore requires verification and validation at the source file level, which includes understanding all the source code and build tools used and how they were configured. The inclusion of components of unknown content and provenance would increase risk and prohibit most certifications. Thus, functionally safe certification represents yet another task where file-level information is mission-critical.

Component Identification

One of the biggest challenges in managing software components is the ability to identify each one uniquely. A high-confidence method ensures that two copies of a component are recognized as the same when their content is identical, and as different when it is not. Furthermore, we want to avoid creating a dependency on a central component registry as a requirement for determining a component's identifier. An additional requirement, therefore, is the ability to compute a unique identifier simply by examining the component's contents.

Understanding a component’s file-level composition can play a critical role in designing such a method. Recall that our goal is to allow a software component to represent a wide spectrum of component types ranging from a single source file to a collection of containers and other files. Each component could therefore be broken down into a collection of files. This representation enables the construction of a method that can uniquely identify any given component. 

File hash methods such as SHA1, SHA256, and MD5 are effective at uniquely identifying a single file. However, when representing a component as a collection of files, we can uniquely represent it by creating a meta-hash, i.e., by taking the hash of “all file hashes” of the files that comprise a component. That is: i) generate a hash for each file (e.g., using SHA256), ii) sort the list of hashes, and iii) take the hash of the sorted list. The meta-hash approach thus enables us to uniquely identify a component based solely on its content; no registry or repository of truth is required.
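
To make the three steps concrete, here is a minimal sketch of the meta-hash computation in Python. The function names are illustrative, and the choice of SHA256 is an assumption for the example rather than a normative part of the approach.

    import hashlib
    from pathlib import Path

    def file_hash(path: Path) -> str:
        # Return the SHA256 hex digest of a single file's contents.
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def meta_hash(component_dir: str) -> str:
        # Step i: hash every file; step ii: sort the digests;
        # step iii: hash the sorted list. The result is independent
        # of file ordering and of how the files were packaged.
        digests = sorted(
            file_hash(p) for p in Path(component_dir).rglob("*") if p.is_file()
        )
        h = hashlib.sha256()
        for d in digests:
            h.update(d.encode("ascii"))
        return h.hexdigest()

    # Two directories unpacked from different archives (.zip, .tar.bz2)
    # of the same sources yield the same meta-hash:
    # print(meta_hash("busybox-1.27.2"))

Because the identifier is derived entirely from content, any party can recompute and verify it without consulting a central registry.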

Conclusion

Having access to software components and their respective metadata is mission-critical to executing various DevOps tasks. It is therefore vital to establish the right level of granularity to ensure we can capture all the required data, a challenge further complicated by the need to handle an eclectic range of component types. If the representation is too large, we cannot capture all the component types and the meta-information needed to support the DevOps functions. If it is too small, we add unnecessary complexity, cost, and friction to adoption. We have determined that a file-level representation is optimal for representing the various component types, capturing all the necessary information, and providing an effective method to identify components uniquely.

References

[1] NTIA: Software Bill of Materials web page, https://www.ntia.gov/SBOM

[2] Busybox Project, https://busybox.net/

[3] Software Heritage archive: math.c, https://archive.softwareheritage.org/api/1/content/sha1:695d7abcac1da03e484bcb0defbee53d4652c347/raw/ 

[4] Wikipedia: Heartbleed, https://en.wikipedia.org/wiki/Heartbleed

[5] “AMNESIA:33: Researchers Disclose 33 Vulnerabilities Across Four Open Source TCP/IP Libraries”, https://www.tenable.com/blog/amnesia33-researchers-disclose-33-vulnerabilities-tcpip-libraries-uip-fnet-picotcp-nutnet

[6] Zephyr Security Update on Amnesia:33, https://www.zephyrproject.org/zephyr-security-update-on-amnesia33/

Linux Foundation Announces Software Bill of Materials (SBOM) Industry Standard, Research, Training, and Tools to Improve Cybersecurity Practices

The Linux Foundation responds to increasing demand for SBOMs that can improve supply chain security

SAN FRANCISCO, June 17, 2021 – The Linux Foundation, the nonprofit organization enabling mass innovation through open source, today announced new industry research, training, and tools – backed by the SPDX industry standard – to accelerate the use of a Software Bill of Materials (SBOM) in secure software development.

The Linux Foundation is accelerating the adoption of SBOM practices to secure software supply chains with:

SBOM standard: stewarding SPDX, the de-facto standard for requirements and data sharing
SBOM survey: highlighting the current state of industry practices to establish benchmarks and best practices
SBOM training: delivering a new course on Generating a Software Bill of Materials to accelerate adoption
SBOM tools: enabling development teams to create SBOMs for their applications

“As the architects of today’s digital infrastructure, the open source community is in a position to advance the understanding and adoption of SBOMs across the public and private sectors,” said Mike Dolan, Senior Vice President and General Manager of Linux Foundation Projects. “The rise in cybersecurity threats is driving a necessity that the open source community anticipated many years ago to standardize on how we share what is in our software. The time has never been more pressing to surface new data and offer additional resources that help increase understanding about how to adopt and generate SBOMs, and then act on the information.”

Ninety percent (90%) of a modern application is assembled from open source software components. An SBOM accounts for the open source components contained in an application, detailing their quality, license, and security attributes. SBOMs are used to ensure developers understand what components are flowing throughout their software supply chains, to proactively identify issues and risks, and to establish a starting point for remediation.

The recent presidential Executive Order on Improving the Nation’s Cybersecurity referenced the importance of SBOMs in protecting and securing the software supply chain. The National Telecommunications and Information Administration (NTIA) followed the issuance of this order by asking for wide-ranging feedback to define a minimum SBOM. The Linux Foundation has published responses to both the NTIA’s SBOM inquiry and the presidential Executive Order.

SPDX: The De-Facto SBOM Open Industry Standard

SPDX, a Linux Foundation project, is the de-facto open standard for communicating SBOM information, including open source software components, licenses, and known security vulnerabilities. SPDX has evolved organically over the last ten years through collaboration with hundreds of companies, including the leading Software Composition Analysis (SCA) vendors, making it the most robust, mature, and adopted SBOM standard in the market.
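
For a sense of what SPDX looks like in practice, here is a minimal illustrative fragment in the SPDX tag-value format describing a single package; the document name, namespace, tool name, and field values below are hypothetical examples, not output from any real tool.

    SPDXVersion: SPDX-2.2
    DataLicense: CC0-1.0
    SPDXID: SPDXRef-DOCUMENT
    DocumentName: example-app-sbom
    DocumentNamespace: https://example.com/spdx/example-app-1.0
    Creator: Tool: example-sbom-tool
    Created: 2021-06-17T00:00:00Z

    PackageName: busybox
    SPDXID: SPDXRef-Package-busybox
    PackageVersion: 1.27.2
    PackageDownloadLocation: https://busybox.net/downloads/busybox-1.27.2.tar.bz2
    FilesAnalyzed: false
    PackageLicenseConcluded: GPL-2.0-only
    PackageLicenseDeclared: GPL-2.0-only
    PackageCopyrightText: NOASSERTION

A consumer can parse such a document to enumerate components, check license obligations, and cross-reference known vulnerabilities.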

SBOM Readiness Survey

Linux Foundation Research is conducting the SBOM Readiness Survey. Deploying next week, the survey will examine the obstacles to SBOM adoption and the future actions required to overcome them in securing software supply chains. The recent US Executive Order on Cybersecurity emphasizes SBOMs, and this survey will help identify industry gaps in SBOM application. Survey questions address tooling, security measures, and the industries leading in producing and consuming SBOMs, among other topics.

New Course: Generating a Software Bill of Materials

The Linux Foundation is also announcing a free, online training course, Generating a Software Bill of Materials (LFC192). This course provides foundational knowledge about the options and the tools available for generating SBOMs and how to use them to improve the ability to respond to cybersecurity needs. It is designed for directors, product managers, open source program office staff, security professionals, and developers in organizations building software. Participants will walk away with the ability to identify the minimum elements for an SBOM, how they can be assembled, and an understanding of some of the open source tooling available to support the generation and consumption of an SBOM. 

New Tools: SBOM Generator

Also announced today is the availability of the SPDX SBOM generator, which uses a command-line interface (CLI) to generate SBOM information, including components, licenses, copyrights, and security references of your application, using the SPDX v2.2 specification and aligning with the current known minimum elements from the NTIA. Currently, the CLI supports GoMod (Go), Cargo (Rust), Composer (PHP), DotNet (.NET), Maven (Java), NPM (Node.js), Yarn (Node.js), PIP (Python), Pipenv (Python), and Gems (Ruby). It is easily embeddable in automated processes such as continuous integration (CI) pipelines and is available for Windows, macOS, and Linux.
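
As a sketch of what embedding the generator in a pipeline might look like, the Python snippet below shells out to the tool from a CI step. The binary name and the -p (project path) and -o (output directory) flags are assumptions for illustration; consult the spdx-sbom-generator documentation for the actual interface.

    import subprocess
    import sys

    def generate_sbom(project_path: str, output_dir: str) -> None:
        # Invoke the generator; binary name and flags here are assumed,
        # not normative -- check the project's documentation.
        result = subprocess.run(
            ["spdx-sbom-generator", "-p", project_path, "-o", output_dir],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            print(result.stderr, file=sys.stderr)
            sys.exit("SBOM generation failed; failing this CI step")
        print(f"SBOM written to {output_dir}")

    if __name__ == "__main__":
        generate_sbom(".", "./sbom-output")

Failing the pipeline when generation fails keeps an up-to-date SBOM as a required build artifact rather than an afterthought.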

Additional Resources

What is an SBOM?
Build an SBOM training course
Free SBOM tool and APIs

About the Linux Foundation

Founded in 2000, the Linux Foundation is supported by more than 1,000 members and is the world’s leading home for collaboration on open source software, open standards, open data, and open hardware. Linux Foundation’s projects are critical to the world’s infrastructure, including Linux, Kubernetes, Node.js, and more.  The Linux Foundation’s methodology focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org.

The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see its trademark usage page: www.linuxfoundation.org/trademark-usage. Linux is a registered trademark of Linus Torvalds.

###

Media Contacts

Jennifer Cloer

for Linux Foundation

jennifer@storychangesculture.com

503-867-2304


Free Training Course Explores Software Bill of Materials

At the most basic level, a Software Bill of Materials (SBOM) is a list of the components contained in a piece of software. It can be used to support the systematic review and approval of each component's license terms, clarifying the obligations and restrictions that apply to the distribution of the supplied software. This is important for reducing risk in organizations building software that uses open source components.

There is often confusion concerning the minimum data elements required for an SBOM and the reasoning behind including those elements. Understanding how components interact in a product is key to supporting security processes, compliance processes, and other software supply chain use cases.

This is why The Linux Foundation has created a free, online training course, Generating a Software Bill of Materials (LFC192). The course provides foundational knowledge about the options and tools available for generating SBOMs, the benefits of adopting SBOMs, and how to use them to improve the ability to respond to cybersecurity needs. It is designed for directors, product managers, open source program office staff, security professionals, and developers in organizations building software. Participants will walk away with the ability to identify the minimum elements for an SBOM, an understanding of how those elements can be assembled, and familiarity with some of the open source tooling available to support the generation and consumption of an SBOM.

The course takes around 90 minutes to complete. It features video content from Kate Stewart, VP, Dependable Embedded Systems at The Linux Foundation, who works with the safety, security, and license compliance communities to advance the adoption of best practices into embedded open source projects. A quiz is included to help confirm learnings.

Enroll today to start improving your development practices.
