
StormCrawler: An Open Source SDK for Building Web Crawlers with Apache Storm

StormCrawler is an open source collection of reusable resources, mostly implemented in Java, for building low-latency, scalable web crawlers on Apache Storm. In his upcoming talk at ApacheCon, Julien Nioche, Director of DigitalPebble Ltd, will compare StormCrawler with similar projects, such as Apache Nutch, and present some real-life use cases.

We spoke with Nioche to learn more about StormCrawler and its capabilities.

Julien Nioche, Director of DigitalPebble Ltd.

Linux.com: What is StormCrawler and what does it do? Briefly, how does it work?

Julien Nioche: StormCrawler (SC) is an open source SDK for building distributed web crawlers with Apache Storm. The project is under Apache license v2 and consists of a collection of reusable resources and components, written mostly in Java. It is used for scraping data from web pages, indexing with search engines or archiving, and can run on a single machine or an entire Storm cluster with exactly the same code and a minimal number of resources to implement.

The code can be built with Maven and includes a Maven archetype, which helps users bootstrap a fully working crawler project that can be used as a starting point.

Apache Storm handles the distribution of work across the cluster, error handling, and monitoring and logging capabilities, whereas StormCrawler focuses on the resources specific to crawling the web. The project aims to be as flexible and modular as possible and provides code for commonly used third-party tools, such as Apache Solr or Elasticsearch.
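To give a concrete sense of the "minimal number of resources" involved, a StormCrawler project is typically driven by a small YAML configuration file. The sketch below is illustrative only; the key names and values are assumptions for the sake of the example, not copied from the project's documentation.

```yaml
# Illustrative crawler configuration sketch. Key names and values here are
# assumptions for illustration, not verbatim from StormCrawler's docs.
config:
  http.agent.name: "my-crawler"    # identify the crawler to web servers
  fetcher.threads.number: 50       # concurrent fetch threads per worker
  topology.workers: 2              # number of Storm workers for the topology
```

The same file works whether the topology runs in local mode on a laptop or on a full Storm cluster, which is the portability Nioche describes.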

Linux.com: Why did you choose Apache Storm for this project?

Julien: I knew from my experience with batch-driven crawlers like Apache Nutch that stream-processing frameworks were probably the way I wanted to go for a new crawler. There were not so many resources around when I started working on StormCrawler two to three years ago (or at least less than now — there seems to be a new one cropping up every month), but luckily Storm was in incubation at Apache. I remember finding that its concepts were both simple and elegant, the community was already active, and I managed to leverage it pretty quickly to get some stuff up and running. That convinced me that this was a good platform to build on, and I am glad I chose it because the project has developed very nicely since, going from strength to strength with each release.

Linux.com: Can you describe some use cases? Who should be using StormCrawler?

Julien: There is a variety of web crawlers based on SC, which is possible thanks to its flexible and modular nature. The “Powered By” page on the wiki lists some of them.  A very natural fit is for processing URLs coming as a stream (e.g., pages visited by some users). This is difficult to implement elegantly with batch-driven web crawlers, whereas StormCrawler can be both efficient and elegant for such cases.

There are also users who do traditional recursive crawls with it, for instance with Elasticsearch as a back end, or crawls that are more vertical in nature and aim at data scraping. A number of users are also using it on finite lists of URLs, without any link discovery.

StormCrawler comes out of the box with a set of resources that help build web crawlers with minimal effort. What I really like is that with just a few files and a single Java class, you can build a web crawler that can be deployed on a large Storm cluster. Users also seem to find the flexibility of StormCrawler very attractive, and some of them have added custom components to a basic crawl topology relatively easily. Others love its performance and the fact that with continuous processing, they always get the most from their hardware: StormCrawler uses the CPU, bandwidth, memory, and disk constantly.

Linux.com: Are there additional features or capabilities you would like to add? If so, what?

Julien: StormCrawler is constantly improving and, as the number of users and contributors grows, we get more and more bugfixes and new functionalities. One thing in particular that has been planned for some time is to have a Selenium-based protocol implementation to handle AJAX-based websites. Hopefully, we’ll get that added in the not-too-distant future. There are also external components, such as the WARC resources, that might be moved to the main repository.

Hear from leading open source technologists from Cloudera, Hortonworks, Uber, Red Hat, and more at Apache: Big Data and ApacheCon Europe on November 14-18 in Seville, Spain.

How to Stay Relevant in the DevOps Era: A SysAdmin’s Survival Guide

The merging of development and operations to speed product delivery, or DevOps, is all about agility, automation, and information sharing. In DevOps, servers are often treated like “cattle” that can be easily replaced, rather than individual “pets” to be nurtured.

System administrators who built their careers configuring and troubleshooting individual servers still have a role to play in this new world. But they must learn to apply their skills to entire IT infrastructures described and managed by code. They must learn to manage cloud services and use automated deployment tools and code repositories—and to share their expertise with others.

A system administrator’s debugging skills still matter in DevOps, says Nathen Harvey, vice president of community development at Chef.  

Read more at Tech Beacon

Carriers Embrace Trial & Error Approach as NFV Becomes Real

THE HAGUE, Netherlands — Telcos kicked off the SDN World Congress here with boasts about how un-telco-like they’ve become, influenced by software-defined infrastructure and the world of virtualization. Specifically, they’re starting to adopt software’s “agile” philosophy by being willing to proceed in small steps, rather than waiting for technology to be fully baked.

“The time of proofs-of-concept is over. You need to take it to production,” said Deutsche Telekom Vice President Axel Clauberg during his opening keynote here at the SDN World Congress.

That’s particularly true in network functions virtualization (NFV), where telcos might be tempted to wait for management and network orchestration (MANO) to get sorted out. Multiple proposals are in development, particularly from the Open-O and Open Source MANO (OSM) open source projects.

Read more at SDx Central

The Open Source SDN Distro That Keeps Microsoft’s WiFi Secure

In case you didn’t know, “Microsoft IT is big,” according to Brent Hermanson, who leads the Network Infrastructure Services group for Microsoft IT. In a keynote presentation at OpenDaylight Summit in September, Hermanson noted that Microsoft IT has users and locations all over the globe. Until recently, they took a legacy approach to the corporate network, but now they want to modernize it. The need for corporate networks in buildings has evaporated, and 70 percent of wired network ports are now unused.

Microsoft has adopted a “Wireless First” approach along with an “Internet First” approach to their IT investments. The Wireless First approach centers on WiFi, with a driving concern for ensuring that users are more secure in a Microsoft building than at the local coffee shop, given that their workloads are in the cloud. The Internet First approach leads to the key question of how to maintain QoS and ensure security. With more and more workloads in the cloud, their new default is that everything goes to the Internet. Their corporate intranet is used for applications, such as Skype, that require QoS and security. This approach, Hermanson noted, has produced an estimated cost savings of 50 percent.

Yet, Hermanson continued, it’s a “huge cultural change” for the processes they’ve built up to secure their data, manage identity, and control data loss prevention. The way they view it, the corporate network becomes the IT data center, and office locations are just on the Internet. As they aggressively move workloads to Azure, they need to aggressively move users to an Internet-optimized path.

Gert Vanderstraeten, Network Architect at Microsoft, did note that “not all traffic gets dumped on the Internet.” He said that Skype for Business requires QoS, and High Business Intelligence (HBI) information requires security, neither of which you will get on the Internet. Thus, the default is Internet first, with these noted exceptions, which go to the corporate WAN, where they have a better chance of QoS and security.

There are many ways to mark traffic, Vanderstraeten said. A method they tried was marking based on known UDP port numbers. This worked great until employees figured out how to spoof the port number, making their traffic always a high priority. Next, they added Deep Packet Inspection (DPI). This worked even better — about 75 percent of the time — but the move to encrypting everything dampened this approach.
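The weakness of port-based marking is easy to see in a sketch. The snippet below is an illustrative toy classifier, not Microsoft's implementation, and the "known" port range is hypothetical: any application that sends to a prioritized port gets the same treatment as a real call.

```python
# Toy illustration of why port-based traffic marking is spoofable.
# The port range below is hypothetical, chosen only for the example.
PRIORITY_UDP_PORTS = range(50000, 50020)  # assumed "known call traffic" ports

def classify(packet):
    """Assign a priority class from a 5-tuple-style dict."""
    if packet["proto"] == "udp" and packet["dst_port"] in PRIORITY_UDP_PORTS:
        return "high"          # receives QoS treatment on the network
    return "best-effort"       # everything else

# A legitimate call is prioritized...
call = {"proto": "udp", "src_port": 50705, "dst_port": 50010}
# ...but so is a bulk transfer that simply spoofs the destination port:
spoofed = {"proto": "udp", "src_port": 12345, "dst_port": 50010}

print(classify(call))     # high
print(classify(spoofed))  # high -- the classifier cannot tell them apart
```

DPI improves on this by inspecting payloads rather than headers, which is why it worked better, and also why pervasive encryption undermines it.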

Dr. Bithika Khargharia, a principal solutions architect at Extreme Networks and director of product and community management at the Open Networking Foundation (ONF), then elaborated on the new approach by discussing a project called Atrium Enterprise. Atrium Enterprise is an open source SDN distribution that is ODL-based and has an integrated unified communications and collaboration application. It runs on Atrium partner hardware, according to Khargharia.

“In phase 2, what they are essentially providing is a VNF, a virtual network function,” she said. The Skype 5-tuple information is communicated to the ODL SDN controller, which then tells this VNF that this is Skype traffic and what to do with it. This function sits behind the building’s router, in what Vanderstraeten also refers to as the “Decision Point.”

Khargharia noted they are looking at the use case of Unified Communications (Skype) in the cloud serving enterprises, with one or more service providers (SPs) providing connectivity between them. They are interested in an end-to-end solution where Skype, for example, communicates its requirements to both the enterprise cloud and to the SP’s cloud. In her example, the enterprise could be ODL-based and the SP could be ONOS-based. The requisite APIs would be SDN controller independent to allow this end-to-end signaling.

Cloud Native Computing Foundation Adds OpenTracing Project

The Cloud Native Computing Foundation (CNCF) today officially announced that the open-source OpenTracing project has been accepted as a hosted project.

CNCF got started in July 2015 as a Linux Foundation Collaborative Project. The inaugural project behind the CNCF is Google’s Kubernetes, which was recently updated to version 1.4.

In May 2016, CNCF welcomed its second project, the Prometheus monitoring project. Now with OpenTracing there is another key tool being added to the CNCF portfolio.

Read more at Internet News

IRC 3: The Original Online Chat Program Gets Updated

Long before there was WhatsApp, Slack, or Snapchat, IRC was the program for online chatting. And it’s not dead yet.

Internet Relay Chat (IRC) was born in 1988 to help people message each other over the pre-web internet. While many other programs have become more popular since then, such as WhatsApp, Google Allo, and Slack, IRC lives on, primarily in developer communities. Now, IRC developers are updating the venerable protocol to revitalize it for the 21st century.

Read more at ZDNet

Red Hat Announces Open Source Release of Ansible Galaxy with Aim of Advancing Automation

Red Hat on Tuesday launched its Ansible Galaxy project with the full availability of Ansible Galaxy’s open source code repository. Ansible Galaxy is Ansible’s official community hub for sharing Ansible Roles. By open-sourcing Ansible Galaxy, Red Hat further demonstrates its commitment to community-powered innovation and advancing the best in open source automation technology.

Ansible Tower by Red Hat offers a visual dashboard, role-based access control, job scheduling, and graphical inventory management, along with real-time job status updates.

Read more at Computer Technology Review

Fear Makes The Wolf Look Bigger

The biggest impediment to the DevOps “Revolution” may be the language used to describe it. Many proponents focus on the automation aspect of DevOps. At its core, automation implies giving up control, and that’s a scary prospect. This tech-centric focus does a disservice to what DevOps is really about.

Automation is just one aspect of DevOps and, at the risk of committing heresy, the least interesting. Please, before you make a run on pitchforks, hear me out.

DevOps is based on three key pillars: People, Process, and Automation. I believe their importance to a business should be considered in that order.

Read more at Chris Scharff’s Blog


Blockchain Adoption Faster Than Expected

A study released last week by IBM indicates that blockchain adoption by financial institutions is on the rise and beating expectations. This is good news for IBM, which is betting big on the database technology that was brought to prominence by Bitcoin. Yesterday, Big Blue announced that it has made its Watson-powered blockchain service available to enterprise customers.

For its study, IBM’s Institute for Business Value teamed with the Economist Intelligence Unit to survey 200 banks across 16 countries about “their experience and expectations with blockchains.” The study found that 15 percent of the banks surveyed plan to implement commercial blockchain solutions in 2017.

Read more at IT Pro

Keynote: Apache Milagro (incubating) – Brian Spector, CEO & Co-Founder, MIRACL

https://www.youtube.com/watch?v=bIaA7-Eady0&list=PLGeM09tlguZTvqV5g7KwFhxDlWi4njK6n

In this keynote, Brian Spector provides an introduction to Apache Milagro, which enables a post-PKI Internet that provides stronger IoT and Mobile security while offering independence from monolithic third-party trust authorities.