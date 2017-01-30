All these newfangled container and microservices technologies inspire all manner of ingenious experiments, and running IBM's Watson on Apache Mesos has to be one of the most -- maybe it's not fair to say crazy -- but certainly ambitious. Jason Adelman of IBM tells us the story of this novel endeavor at MesosCon Asia 2016.

If you're not familiar with Watson, it is IBM's cognitive computing platform. Watson became a Jeopardy champion in 2011, beating human contestants. Watson is a mighty beast, so how do you make it run on Mesos? And why? Adelman answers the why: "IBM looked at how they could commercialize this. Turn this into something that customers could use, first in healthcare, then financial services and then broader industries."

Now we'll look at the how. Want to play with Watson on Mesos? That's what Bluemix is for. Adelman says, "This is IBM's developer portal, our developer cloud. There's a lot of services in Bluemix that developers can use to run to get applications up and running quickly on the web. This is where you'll find Watson as well. Under Watson you'll see the services...There are 16 services there for now but there's a lot of things coming all the time. It's been developing very rapidly. A lot of these services are currently running on Mesos, and we are working on trying to get everything running on one platform there...It's running on a mixture of containers managed by Mesos, Marathon, and Netflix OSS."

The Watson Developer Cloud also uses Eureka, Zuul, Ansible, ZooKeeper, and Solr. Solr presented some special challenges. Adelman's team concluded that they needed local storage for Solr to work effectively. But, as happens in so many similar projects, when you need a stateful service (Solr) in a stateless environment you have an interesting conundrum. Adelman's team elected to use SolrCloud, which provides a highly available cluster of Solr servers.
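
To make the local-storage problem concrete, here is a minimal sketch of one common way to keep a stateful container next to its data on Marathon: a `hostname` `CLUSTER` constraint plus a host-path volume in the app definition. The talk summary does not say this is how IBM's team did it; the helper name, app id, image name, hostname, and paths are all hypothetical placeholders.

```python
import json


def pinned_solr_app(app_id, hostname, host_data_dir):
    """Build a Marathon-style app definition pinned to a single agent host.

    Every concrete value here (image, container path, sizes) is a
    hypothetical placeholder chosen for illustration only.
    """
    return {
        "id": app_id,
        "instances": 1,
        "cpus": 2,
        "mem": 4096,
        # The CLUSTER operator on the hostname field restricts placement
        # to the named agent, so a restarted task lands beside its data.
        "constraints": [["hostname", "CLUSTER", hostname]],
        "container": {
            "type": "DOCKER",
            "docker": {"image": "example/solr:latest"},  # placeholder image
            "volumes": [
                {
                    "containerPath": "/solr-data",  # placeholder path
                    "hostPath": host_data_dir,
                    "mode": "RW",
                }
            ],
        },
    }


if __name__ == "__main__":
    app = pinned_solr_app(
        "/search/solr-node-1", "agent-1.example.com", "/data/solr-node-1"
    )
    print(json.dumps(app, indent=2))
```

The trade-off is the one the article describes: pinning gives Solr stable local disks, but the scheduler can no longer move that instance freely when the host fails.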

Growing pains followed: network problems and Marathon limitations caused lapses in communication between the various components. Adelman says, "We had some outages where Marathon and Mesos were not talking to each other and connection was lost for a significant amount of time. After that...the connection was re-established, but when Marathon reconnected with Mesos, Mesos thought it was a new Marathon, gave it a new framework ID."

"So now we have a Marathon running with a new framework ID, we have all these original containers still running with the old ID, so they can no longer communicate with Marathon. This is the problem with stateful services...To get out of this we had to do a bunch of manual work." This included developing pinning functionality and building additional infrastructure on top of Mesos and Marathon.

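The failure Adelman describes can be illustrated with a toy model. This is a sketch under stated assumptions, not Mesos or Marathon code: `ToyMaster` and its methods are invented for illustration, and the real registration protocol is considerably more involved. It only shows why tasks launched under one framework identifier become invisible to a scheduler that comes back under a new one.

```python
import itertools


class ToyMaster:
    """A toy stand-in for a cluster master; NOT the Mesos API."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.known_ids = set()  # framework identifiers the master still honors
        self.tasks = {}         # task name -> identifier of the launching framework

    def register(self, previous_id=None):
        # A scheduler presenting an identifier the master still honors keeps it;
        # otherwise it is treated as brand new and receives a fresh identifier.
        if previous_id in self.known_ids:
            return previous_id
        new_id = f"framework-{next(self._counter)}"
        self.known_ids.add(new_id)
        return new_id

    def forget(self, framework_id):
        # Models a long outage after which the old registration is not honored.
        self.known_ids.discard(framework_id)

    def launch(self, framework_id, task):
        self.tasks[task] = framework_id

    def tasks_visible_to(self, framework_id):
        return sorted(t for t, owner in self.tasks.items() if owner == framework_id)

    def orphaned_tasks(self):
        return sorted(t for t, owner in self.tasks.items() if owner not in self.known_ids)


if __name__ == "__main__":
    master = ToyMaster()
    old_id = master.register()
    master.launch(old_id, "solr-1")
    master.launch(old_id, "solr-2")

    master.forget(old_id)                         # connection lost for too long
    new_id = master.register(previous_id=old_id)  # scheduler reconnects

    print(new_id != old_id)                 # True
    print(master.tasks_visible_to(new_id))  # []
    print(master.orphaned_tasks())          # ['solr-1', 'solr-2']
```

In this model the running containers are still there, but nothing owns them any more, which is why recovery needed the manual work described above.
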
Adelman discusses not only the difficulties but also valuable lessons about how to make everything work reliably. Watch the full presentation to learn how the team set up networking, scheduling, and auto-scaling, and how they use chaos testing to keep everything operating smoothly.

