
DeepSPADE (alias DeepSmokey): A Machine-Learning System That Collects Spam from the Internet

This blog post is about a deep learning system I’ve created, called DeepSPADE (alias DeepSmokey), and how it’s being used to build better Internet communities.

To begin, what is DeepSPADE, and what does it do?

DeepSPADE stands for Deep Spam Detection, and its purpose is to use machine learning for a natural-language classification task: differentiating between spam and non-spam posts on public community forums.

One such website is Stack Exchange (SE), a network of 169 web forums covering everything from programming, to artificial intelligence, to personal finance, to Linux, and much more!

Stack Overflow (SO), a community forum within SE that’s dedicated to general programming, is the world’s most popular forum site for coders. With over 14,500,000 questions asked during the seven years it’s been up, and 6,500,000 of those questions answered, you can see how popular it truly is.

However, like any public website, Stack Overflow is cluttered with garbage. While most members of this community are legitimately interested in sharing their knowledge or getting help from others, there are some who seek to spam the website. In fact, SO receives more than 30 spam posts every day, on average.

To combat this, the SmokeDetector system was designed and developed by a group of programmers called Charcoal SE. SmokeDetector uses a massive set of regular expressions (regexes) to find spam messages based on their content.

Once I, a big supporter of Machine Learning, found out they used RegEx for their spam classification, I immediately shouted “Why not Deep Learning?!?” This idea was welcomed by the Charcoal Community; in fact, the reason they hadn’t incorporated it earlier was that they didn’t have anybody who worked with machine learning. I joined the Charcoal Community and began developing DeepSPADE to contribute towards their mission.

The DeepSPADE Model

DeepSPADE uses a combination of Convolutional Neural Networks (CNNs) and Gated Recurrent Units (GRUs) to run this classification task. The word-vectors it uses to actually understand the natural language that it’s given are word2vec vectors trained before the actual model’s training starts. However, during model training, the vectors are fine-tuned to achieve optimal performance.
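To make that embedding setup concrete, here is a minimal NumPy sketch of how pretrained word2vec vectors are assembled into the matrix that seeds a trainable embedding layer. The toy vocabulary, vectors, and dimensions are purely illustrative, not DeepSPADE’s actual values, and the real model does this with a Keras `Embedding` layer rather than raw NumPy.

```python
import numpy as np

# Toy pretrained word2vec vectors (in practice these would come from a
# word2vec model trained on the forum corpus before model training starts).
pretrained = {
    "free":  np.array([0.9, 0.1, 0.0, 0.2]),
    "click": np.array([0.8, 0.0, 0.1, 0.3]),
    "code":  np.array([0.1, 0.7, 0.6, 0.0]),
}
embedding_dim = 4
# Reserve id 0 for padding / out-of-vocabulary words.
vocab = {word: idx for idx, word in enumerate(sorted(pretrained), start=1)}

# Build the matrix that seeds the embedding layer.
embedding_matrix = np.zeros((len(vocab) + 1, embedding_dim))
for word, idx in vocab.items():
    embedding_matrix[idx] = pretrained[word]

def embed(tokens):
    """Look up token ids; unknown words map to the all-zero row 0."""
    ids = [vocab.get(t, 0) for t in tokens]
    return embedding_matrix[ids]

sentence = embed(["click", "free", "unknownword"])
# During training, embedding_matrix becomes a trainable parameter,
# so gradients fine-tune the pretrained vectors for the spam task.
```

In Keras, this matrix can be passed to the layer (e.g. `Embedding(len(vocab) + 1, embedding_dim, weights=[embedding_matrix], trainable=True)`); keeping the layer trainable is what allows the fine-tuning described above.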

The Neural Network (NN) is designed in Keras with a Tensorflow (TF) back end (TF provides significant performance gains over Theano), and Figure 1 shows a very long diagram of the model itself:

Figure 1: The DeepSPADE model architecture.

As you can see, the model I’ve designed is very deep. In fact, not only is it deep, it’s a parallel model.

Let’s start off with a question that a lot of people have: Why are you using CNNs and GRUs? Why not just either one of those layers?

The answer lies deep within the actual working of these two layers. Let’s break them down:

CNNs understand patterns in data that aren’t time-bound. The CNN doesn’t look at the natural language in any specific order; it treats the text like an unordered array of data. This is helpful when a very specific word is almost always associated with spam or non-spam.

GRUs (or RNNs – Recurrent Neural Networks – in general) understand patterns in data that are arranged as a time series. This means the RNN understands the order of words, which is helpful because some words convey entirely different concepts depending on the order in which they appear.

When these two layers are combined in a specific way to highlight their advantages, the real magic happens!

In fact, to explain why the combination is so powerful, take a look at the following “evolution” of the accuracy of the DeepSPADE system on 16,000 testing rows:

  • 65% – Baseline accuracy with Convolutional Neural Networks

  • 69% – With deeper Convolutional Neural Networks

  • 75% – With introduction of higher quantity & quality of data

  • 79% – With small improvements to model

  • 85% – With LSTMs introduced along with CNN model (no parallelism)

  • 89% – With higher embedding size, deeper CNN and LSTM

  • 96% – With GRUs instead of LSTMs, more Dropout, more Pooling, and higher embedding size

  • 98.76% – With Parallel model & higher embedding size

The answer, again, lies in how the CNN itself works: it has a very strong ability to filter out noise and focus on the signal in some content – plus, training and inference are much faster than with an RNN.

So, the three Conv1D+Dropout+MaxPool groups at the beginning act as filters. They create many representations of the data, each portraying a different angle of it. They also shrink the data while preserving the signal.

After that, the result of those groups feeds into two parallel branches:

  • A Conv1D+Flatten+Dense branch.

  • A stack of three GRU+Dropout groups, followed by a Flatten+Dense.

Why the parallelism? Because, again, the two branches look for different kinds of patterns: the GRU finds order-dependent patterns, while the CNN finds patterns “in general”.

Once the opinions of both neural nets are collected, they are concatenated and fed through another Dense layer, which learns when each network’s opinion matters more. This dynamic weighting then feeds into a final Dense layer, which produces the model’s output.
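To make the dataflow concrete, here is a deliberately tiny NumPy sketch that mirrors the architecture just described (a Conv1D+MaxPool filter stage feeding a convolutional branch and a recurrent branch, whose outputs are concatenated and combined by a dense layer). This is not the actual Keras/TensorFlow implementation; every shape and weight below is an illustrative random value, and the recurrent step is a simplified, ungated stand-in for a GRU.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 8, 4
x = rng.standard_normal((seq_len, dim))  # one embedded post, toy size

def conv1d(x, kernels):
    """Valid cross-correlation with ReLU. kernels: (n_filters, k, dim)."""
    k = kernels.shape[1]
    out = np.stack([[(x[i:i + k] * f).sum() for f in kernels]
                    for i in range(len(x) - k + 1)])
    return np.maximum(out, 0)

def maxpool(x, size=2):
    """Non-overlapping max pooling along the time axis."""
    trimmed = x[: len(x) // size * size]
    return trimmed.reshape(-1, size, x.shape[1]).max(axis=1)

# Shared front end: Conv1D + MaxPool filter stage (shrinks the sequence).
front = maxpool(conv1d(x, rng.standard_normal((6, 3, dim))))

# Branch A: another Conv1D, then flatten (order-agnostic pattern finder).
a = conv1d(front, rng.standard_normal((4, 2, 6))).ravel()

# Branch B: simplified recurrent scan (order-sensitive, GRU stand-in).
h = np.zeros(5)
Wx, Wh = rng.standard_normal((5, 6)), rng.standard_normal((5, 5))
for t in front:
    h = np.tanh(Wx @ t + Wh @ h)
b = h

# Concatenate both "opinions" and combine with a dense layer.
merged = np.concatenate([a, b])
logit = rng.standard_normal(merged.shape[0]) @ merged
prob = 1 / (1 + np.exp(-logit))  # spam probability
```

In the real model each of these stages is a Keras layer, and the weights of the final Dense combiner are learned, which is what lets it dynamically decide when the convolutional branch or the recurrent branch should dominate.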

Finally, this system can now be added to SmokeDetector, and its automatic weighting systems can begin incorporating the results of Deep Learning!

Plus, this system is trained, tested, and used entirely on Linux servers! Of course, Linux is an amazing platform for such software, because the hardware constraints are practically nil, and because most great development software is supported primarily on Linux (Tensorflow, Theano, MXNet, Chainer, CUDA, etc.).

I love open source software – doesn’t everyone? And, although this project isn’t open source just yet, there is a great surprise awaiting all of you soon!

Tanmay Bakshi, 13, is an Algorithm-ist & Cognitive Developer, Author and TEDx Speaker. He will be presenting a keynote talk called “Open-Sourced Inspiration – The Present and Future of Tech and AI” at Open Source Summit in Los Angeles. He will also present a BoF session discussing DeepSPADE.

Check out the full schedule for Open Source Summit here. Linux.com readers save on registration with discount code LINUXRD5. Register now!

Creating Better Disaster Recovery Plans

Five questions for Tanya Reilly: How service interdependencies make recovery harder and why it’s a good idea to deliberately and preemptively manage dependencies.

I recently asked Tanya Reilly, Site Reliability Engineer at Google, to share her thoughts on how to make better disaster recovery plans. Tanya is presenting a session titled Have you tried turning it off and turning it on again? at the O’Reilly Velocity Conference, taking place Oct. 1-4 in New York.

1. What are the most common mistakes people make when planning their backup systems strategy?

The classic line is “you don’t need a backup strategy, you need a restore strategy.” If you have backups, but you haven’t tested restoring them, you don’t really have backups. Testing doesn’t just mean knowing you can get the data back; it means knowing how to put it back into the database, how to handle incremental changes, how to reinstall the whole thing if you need to. It means being sure that your recovery path doesn’t rely on some system that could be lost at the same time as the data.
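As a toy illustration of that point (my own sketch, not anything from the interview), the following Python snippet backs up a SQLite database, replays the dump into a fresh database, and then verifies the data actually came back, which is the step that turns a backup strategy into a restore strategy:

```python
import sqlite3

# A stand-in "production" database with some data.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("linus",)])
src.commit()

# Backup: dump the database to SQL text (stand-in for a real backup job).
dump = "\n".join(src.iterdump())

# Restore drill: replay the dump into a brand-new database...
restored = sqlite3.connect(":memory:")
restored.executescript(dump)

# ...and verify the data survived the round trip, the step people skip.
original = src.execute("SELECT name FROM users ORDER BY id").fetchall()
recovered = restored.execute("SELECT name FROM users ORDER BY id").fetchall()
assert recovered == original, "restore drill failed"
```

A real drill would also cover incremental changes and rebuilding the host from scratch, as Reilly notes, but even a check this small catches backups that were silently empty or unreadable.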

Read more at O’Reilly

Your Serverless Raspberry Pi Cluster with Docker

This blog post will show you how to create your own Serverless Raspberry Pi cluster with Docker and the OpenFaaS framework. People often ask me what they should do with their cluster and this application is perfect for the credit-card sized device – want more compute power? Scale by adding more RPis.

“Serverless” is a design pattern for event-driven architectures; like “bridge”, “facade”, “factory” and “cloud”, it’s an abstract concept. …

We’ll be using OpenFaaS, which lets you turn any single host or cluster into a back-end to run serverless functions. Any binary, script or programming language that can be deployed with Docker will work on OpenFaaS, and you can choose where to sit on the scale between speed and flexibility. The good news is that a UI and metrics are also built in.

Read more at Alex Ellis Blog

Trending Developer Skills, Based on My Analysis of “Ask HN: Who’s Hiring?”

A few years ago, I became curious about identifying and predicting emerging technologies. So I created Hacker News Hiring Trends, or HN Hiring Trends for short. Hacker News is one of the most popular discussion boards for programmers. It is also one of the best places to discover new technologies. Every month Hacker News hosts a thread called “Ask HN: Who is Hiring?” Users post job opportunities at their companies in this thread.

The fact that these job opportunities are posted monthly, and that most are from start-ups (new technologies are usually created or used in start-ups), makes this the ideal environment for capturing data that can be used to discover trends.

Let’s dig into the latest trends.

Read more at freeCodeCamp

Spyware Backdoor Prompts Google to Pull 500 Apps with >100m Downloads

At least 500 apps collectively downloaded more than 100 million times from Google’s official Play Market contained a secret backdoor that allowed developers to install a range of spyware at any time, researchers said Monday.

The apps contained a software development kit called Igexin, which makes it easier for apps to connect to ad networks and deliver ads that are targeted to the specific interests of end users. Once an app using a malicious version of Igexin was installed on a phone, the developer kit could update the app to include spyware at any time, with no warning.

Read more at Ars Technica

 

OpenShift on OpenStack: Delivering Applications Better Together

Have you ever asked yourself, where should I run OpenShift? The answer is anywhere—it runs great on bare metal, on virtual machines, in a private cloud or in the public cloud. But there are reasons why people are moving to private and public clouds, related to automation around full-stack exposition and consumption of resources. A traditional operating system has always been about exposing and consuming hardware resources—hardware provides resources, applications consume them, and the operating system has always been the traffic cop. But a traditional operating system has always been confined to a single machine.

Well, in the cloud-native world, this now means expanding this concept to include multiple operating system instances. That’s where OpenStack and OpenShift come in. In a cloud-native world, virtual machines, storage volumes and network segments all become dynamically provisioned building blocks. 

Read more at OpenShift

Future Proof Your SysAdmin Career: Configuration and Automation

System administrators looking to differentiate themselves from the pack are increasingly getting cloud computing certification or picking up skills with configuration management tools. From Puppet to Chef to Ansible, powerful configuration management tools can arm sysadmins with new skills such as cloud provisioning, application monitoring and management, and countless types of automation.

Configuration management platforms and tools have converged directly with the world of open source. In fact, several of the best tools are fully free and open source. From server orchestration to securely delivering high-availability applications, open source tools such as Chef and Puppet can bring organizations enormous efficiency boosts.


The prevalence of cloud computing, and the open platforms that facilitate it, have contributed to the benefits organizations can reap from configuration management tools. Cloud platforms allow teams to deploy and maintain applications serving thousands of users, and the leading open source configuration management tools have integrated ways to automate all relevant processes.

When many people envision a sysadmin in action, they imagine an interaction with an end user. However, as organizations move to the cloud and heterogeneous technology infrastructure environments, many sysadmins need to expand their skills. Today, automation of tasks and application delivery are big themes. Among other benefits, automated provisioning and configuration can result in time savings and reduce human error.

Tools for the task

Puppet and Chef are both open configuration management tools that can automate many common tasks. As noted in an UpGuard blog post, “It is frequently stated that Puppet is a tool that was built with sysadmins in mind. The learning curve is less imposing due to Puppet being primarily model driven. Getting your head around json data structures in Puppet manifests is far less daunting to a sysadmin who has spent their life at the command line than ruby syntax is.”

Puppet can automate many sysadmin tasks, including deploying new machines, pushing changes out to existing systems, and performing verification checks. Chef, however, is noted for providing a great deal of power and flexibility. It automates the management of systems in the cloud, on-premises, or in a hybrid environment.

So, how can sysadmins gain familiarity with these tools? Puppet and Chef have commercial enterprises behind them, and flexible training options are available. For example, if you just want to take Puppet for a test drive within a virtual machine, you can do so here; instructor-led and online training options are detailed there as well. You can chart a learning roadmap for Puppet here.  

Red Hat and other vendors also offer training options for Puppet as used in a standard operational environment or in a cloud environment. Red Hat also offers training for Ansible, and the curriculum is specifically geared toward sysadmins who need to automate, configure, and manage systems and processes. In-person or online training options for Chef can be found here, and you can sample some of the online tutorials here.

The Linux Foundation’s “Guide to the Open Cloud: Current Trends and Open Source Projects” includes a comprehensive section on configuration management tools, and you can find out more and visit some relevant open source project repositories here.

Sysadmins who add cloud and configuration management skills to their toolkits are keeping pace with rapidly changing technology environments. These aren’t the only ways to expand your skills, though. In the next article, we will look more closely at the importance of DevOps.

Learn more about essential sysadmin skills: Download the Future Proof Your SysAdmin Career ebook now.

Read more:

Future Proof Your SysAdmin Career: An Introduction to Essential Skills 

Future Proof Your SysAdmin Career: New Networking Essentials

Future Proof Your SysAdmin Career: Locking Down Security

Future Proof Your SysAdmin Career: Looking to the Cloud

Future Proof Your SysAdmin Career: Configuration and Automation

Future Proof Your SysAdmin Career: Embracing DevOps

Future Proof Your SysAdmin Career: Getting Certified

Future Proof Your SysAdmin Career: Communication and Collaboration

Future Proof Your SysAdmin Career: Advancing with Open Source

 

Docker Enterprise Now Runs Windows and Linux in One Cluster

With the newest Docker Enterprise Edition, you can now have Docker clusters composed of nodes running different operating systems.

Three of the key OSes supported by Docker—Windows, Linux, and IBM System Z—can run applications side by side in the same cluster, all orchestrated by a common mechanism.

Clustering apps across multiple OSes in Docker requires that you build per-OS images for each app. But those apps, when running on both Windows and Linux, can be linked to run in concert via Docker’s overlay networking.

Read more at InfoWorld

This Week in Numbers: Serverless Adoption on Par with Containers

Serverless technologies like functions as a service (FaaS) are in use by 43 percent of enterprises that both have a significant number of strategic workloads running in the public cloud and have the ability to dynamically manage them.

Without those qualifications, it is easy to misinterpret the findings from New Relic’s survey-based ebook “Achieving Serverless Success with Dynamic Cloud and DevOps.” After digging in, we found that the survey says 70 percent of enterprises have migrated a significant number of workloads to the public cloud. Among this group, 39 percent are using serverless, 40 percent are using containers and 34 percent are using container orchestration.

At least superficially, adoption of serverless technologies now matches that of containers.

Read more at The New Stack

Here Are All the Git Commands I Used Last Week, and What They Do

Like most newbies, I started out searching StackOverflow for Git commands, then copy-pasting answers, without really understanding what they did.


Image credit: XKCD

Well, here I am years later to compile such a list and lay out some best practices that even intermediate and advanced developers should find useful.

To keep things practical, I’m basing this list off of the actual Git commands I used over the past week.

Almost every developer uses Git, and most likely GitHub. But the average developer probably only uses these three commands 99% of the time:

Read more at freeCodeCamp