Amith Nambiar's blog: September 2016

One of the companies I worked for provided insights to their customers on online user behaviour. To that end, we had to analyse millions of weblogs daily. The websites were categorised based on a list of pre-defined categories (e.g. Travel, Sports, Health and Medical, Government, Shopping and Classifieds etc.), making it easier for analysts to gain insights. One of the challenges here was to categorise web links into these predefined categories, as new websites (hundreds of them) were popping up on a daily basis.

The process being followed was:

A dedicated team was in charge of visiting the URL of each new website appearing in the newly landed data.
A team member would browse through the different links and then, based on what they thought of the site, categorise it into one of the categories.
The categorised URL would feed into the build pipeline and would show up in the reports and analytics provided to the clients.

Though effective, this process was neither efficient nor scalable. I thought this was an interesting problem to solve using Machine Learning. I have used a supervised learning technique called Naive Bayes, which is considered to be an effective solution for text categorisation problems. The results are promising.

This post does not dwell on how the categorisation problem itself can be solved, as I believe there are multiple approaches to solving it (some better than what I have here). What I'm trying to achieve is to architect a solution so that the Data Science aspect integrates seamlessly with the rest of the application and becomes part of the whole user experience.

Here is the app in action predicting the category for www.bmw.com and www.ivanhoeschool.org.

The model has been trained on only 2,550 websites and suffers from class imbalance: a few categories are under-represented in the training data. This can be fixed by adding more training data for the categories that are lagging behind. (You could help here by adding more websites for training the model.)
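As a rough illustration of how under-represented categories could be spotted before retraining, here is a minimal Python sketch. The function name, the 50% threshold and the label counts are all made up for the example; they are not taken from the WebCat training data.

```python
from collections import Counter


def under_represented(labels, ratio=0.5):
    """Return categories whose count is below `ratio` times the mean category count.

    `ratio` is an arbitrary threshold chosen for this illustration.
    """
    counts = Counter(labels)
    mean = sum(counts.values()) / len(counts)
    return sorted(category for category, n in counts.items() if n < ratio * mean)


# Illustrative label distribution (not the real training set).
labels = ["Travel"] * 40 + ["Sports"] * 35 + ["Government"] * 5
print(under_represented(labels))  # → ['Government']
```

Categories flagged this way are the ones that would benefit most from extra training websites.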

As an aside, I have seen some projects in the past where Data Science never fits into the deployment model for the rest of the project and often involves manual intervention: build the model in one environment, then ship it so that it can be 'productionised' in a completely different environment. Much of the problem stems from the fact that these models expose no interface to the outside world and are often built and worked upon in isolation.

This problem has been addressed in the following blog: http://engineering.pivotal.io/post/api-first-for-data-science/ . An "API first" approach to Data Science makes these models far more relevant to the business and makes integrating them with the rest of the business processes much easier. APIs need not necessarily be HTTP APIs; any well-defined interface will do, be it AMQP, SOAP or whatever makes sense. In fact, in my experience it really depends on what the needs of the consuming application(s) are and what is practically possible for the service to deliver.
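To make the "any well-defined interface" point a little more concrete, here is a small Python sketch of a model hidden behind a transport-agnostic contract. Every name in it (`Categoriser`, `CategoryPrediction`, `FixedCategoriser`, `handle_request`) is hypothetical and invented for this example; none of it is the actual WebCat code.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class CategoryPrediction:
    url: str
    category: str


class Categoriser(Protocol):
    """The contract consumers depend on, whatever model sits behind it."""

    def categorise(self, url: str) -> CategoryPrediction: ...


class FixedCategoriser:
    """Stand-in model that always answers with one category (for illustration only)."""

    def __init__(self, category: str) -> None:
        self._category = category

    def categorise(self, url: str) -> CategoryPrediction:
        return CategoryPrediction(url=url, category=self._category)


def handle_request(model: Categoriser, payload: dict) -> dict:
    """Transport-agnostic handler: an HTTP route or an AMQP consumer could both call this."""
    prediction = model.categorise(payload["url"])
    return {"url": prediction.url, "category": prediction.category}


print(handle_request(FixedCategoriser("Travel"), {"url": "travel.example"}))
```

The idea is that the model can be retrained or swapped out without the consuming applications noticing, as long as the contract stays the same.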

The whole app is deployed on PWS, which runs on PCF (Pivotal Cloud Foundry) infrastructure (Disclosure: I work for Pivotal).
PCF is a PaaS (Platform as a Service) like Heroku - the difference being that it can be deployed on-premise or on a cloud such as AWS or Google Cloud Platform - with language support for Java/Scala, .NET, Python, PHP, Ruby etc.

High Level Architecture of WebCat.

WebCat app architecture

The whole project has a microservices-based architecture and is a polyglot app with 5 services (written in 3 different languages):

An App serving as a frontend and also interacting with other services (Spring Boot/Java)
A Link Collection Service (Python)
A Link Crawler Service (Python)
A Categorisation Service (Scala)
A Stemming Service (Python)

The Categorisation app delegates all of the work to the other services.
Link Collection Service returns a bunch of URLs (like http://www.bmw.com/com/en/general/corporate_direct_sales/index.html) given a URL (www.bmw.com).
Link Crawler Service crawls each link and then returns the text for the link.
Stemming Service stems the words/text returned by the Link Crawler Service.
Categorisation Service loads the training data from the database on startup and then computes TF-IDF over the corpus; the resulting features are the input for building the model. It builds the model and predicts categories using MLlib, Apache Spark's scalable machine learning library. I'm currently working on using Apache MADlib to do the same.
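For readers who have not seen TF-IDF and Naive Bayes together, the sketch below shows the general idea in plain Python on a toy corpus of already-stemmed tokens. It is not the Categorisation Service's code (that service is written in Scala and relies on Spark MLlib); the function names, the smoothing constant `alpha` and the sample documents are assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((n + 1) / (count + 1)) for term, count in df.items()}


def tf_idf(doc, idf):
    """Map each known term in `doc` to term frequency * IDF; unseen terms are dropped."""
    return {t: tf * idf[t] for t, tf in Counter(doc).items() if t in idf}


def train_naive_bayes(vectors, labels, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    vocab = {t for v in vectors for t in v}
    weight = defaultdict(lambda: defaultdict(float))  # label -> term -> summed weight
    for vector, label in zip(vectors, labels):
        for term, w in vector.items():
            weight[label][term] += w
    model = {}
    for label, n_docs in Counter(labels).items():
        total = sum(weight[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_lik = {t: math.log((weight[label][t] + alpha) / total) for t in vocab}
        model[label] = (log_prior, log_lik)
    return model


def predict(model, vector):
    """Return the label with the highest log-posterior score for `vector`."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(w * log_lik[t] for t, w in vector.items() if t in log_lik)
    return max(model, key=score)


# Toy, already-stemmed corpus (illustrative only).
docs = [["match", "team", "score"], ["team", "leagu", "player"],
        ["flight", "hotel", "book"], ["hotel", "beach", "tour"]]
labels = ["Sports", "Sports", "Travel", "Travel"]

idf = fit_idf(docs)
model = train_naive_bayes([tf_idf(d, idf) for d in docs], labels)
print(predict(model, tf_idf(["team", "score"], idf)))
```

In the real service the same two steps (feature extraction, then model fitting) are handled by Spark rather than by hand-written code like this.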

I'm currently working on:

1) Building a feedback loop so that when the app predicts the category for an unseen website, the user gets an opportunity to validate whether the prediction was correct. If the user thinks it is incorrect, the user can change the category and send the feedback to the app. This feedback becomes input to the next model, so that the model gets better over time (hopefully). Finally, there might be a day when it is 100% accurate :-)!

2) Adding user-defined categories (like Weather, Real Estate etc.) to the categories. For example, the weather.com website is categorised as "News and Media"; the user could be given an option to create a new category for this website, i.e. Weather (of course, this needs moderation).
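Since this part is still work in progress, the following is only a sketch of how user feedback might be folded into the next round of training data. The `Feedback` record and `apply_feedback` function are hypothetical names, `sports.example` is a placeholder URL, and moderation of new categories is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Feedback:
    url: str
    predicted: str
    corrected: Optional[str] = None  # None means the user accepted the prediction


def apply_feedback(training: Dict[str, str], feedback: List[Feedback]) -> Dict[str, str]:
    """Return a new url -> category mapping where user corrections win over predictions."""
    updated = dict(training)
    for item in feedback:
        updated[item.url] = item.corrected or item.predicted
    return updated


training = {"weather.com": "News and Media"}
feedback = [
    # The user disagrees and supplies a (possibly user-defined) category.
    Feedback("weather.com", predicted="News and Media", corrected="Weather"),
    # The user confirms the prediction for a previously unseen site.
    Feedback("sports.example", predicted="Sports"),
]
next_training = apply_feedback(training, feedback)
print(next_training)
```

Confirmed predictions are kept as well, so each round of feedback grows the training set rather than only correcting it.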

I will update this post when it is done. If you have any ideas to make this better please email me at amith.nmbr@gmail.com.


Saturday, 24 September 2016

Predicting website categories using Supervised Learning
