Got back to writing some C code after nearly 12 years. I'm doing some work with Kafka's C library, https://github.com/edenhill/librdkafka , and wanted to see which libraries were linked to my executable.
I tried ldd and then realised that Mac OS X does not have ldd. Instead it has otool; use it with the -L option.
One of the companies I worked for provided insights to their customers on online user behaviour. To that end, we had to analyse millions of weblogs daily. The websites were categorised based on a list of pre-defined categories (e.g. Travel, Sports, Health and Medical, Government, Shopping and Classifieds etc.), making it easier for analysts to gain insights. One of the challenges here was to categorise web links into these predefined categories, as new websites (hundreds of them) were popping up on a daily basis.
The process being followed was:
A dedicated team was in charge of visiting the URLs of the new websites appearing in the newly landed data.
A team member would browse through the different links and then, based on what they thought of the site, categorise it into one of the categories.
The categorised URL would feed into the build pipeline and show up in the reports and analytics provided to the clients.
Though effective, this process was neither efficient nor scalable. I thought this was an interesting problem to solve using Machine Learning. I have used a supervised learning technique called Naive Bayes, which is considered to be an effective solution for text categorisation problems. The results are promising.
This blog does not touch on how the problem can be solved, as I believe there are multiple approaches (some better than what I have here) to solving it. What I'm trying to achieve is to architect a solution so that the Data Science aspect integrates seamlessly with the rest of the application and becomes part of the whole user experience.
Here is the app in action predicting the category for www.bmw.com and www.ivanhoeschool.org.
The model has been trained on only 2550 websites and suffers from class imbalance: a few categories are under-represented in the training data. This can be fixed by adding more training data for the categories which are lagging behind. (You could help here by adding more websites for training the model.)
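One simple way to spot the categories that are lagging behind is to count the training examples per category. The following is a minimal Python sketch, not WebCat's code: the helper name under_represented, the sample labels and the "less than half of the mean count" threshold are all assumptions made for illustration.

```python
from collections import Counter


def under_represented(labels, ratio=0.5):
    """Return the categories whose example count is below `ratio` times the mean count."""
    counts = Counter(labels)
    mean = sum(counts.values()) / len(counts)
    return sorted(category for category, n in counts.items() if n < ratio * mean)


# Hypothetical label column of a training set (one label per website).
labels = (
    ["Travel"] * 40
    + ["Sports"] * 35
    + ["Shopping and Classifieds"] * 30
    + ["Government"] * 5
)

print(under_represented(labels))
```

Categories flagged this way are the ones where adding more training websites is likely to help most.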
As an aside, I have seen some projects in the past where Data Science never fits into the deployment model for the rest of the project and often involves manual intervention - build the model in one environment, then ship the model so that it can be 'productionised' in a completely different environment. Much of the problem stems from the fact that these models do not have a way to interface with the outside world and are often built and worked upon in isolation.
This problem has been addressed in the following blog: http://engineering.pivotal.io/post/api-first-for-data-science/ . "API first" for Data Science makes these models more relevant to the business and makes integrating them with the rest of the business processes much easier. APIs need not necessarily mean HTTP APIs; it could be any well-defined interface such as AMQP or SOAP, whatever makes sense. In fact, in my experience it really depends on what the needs of the consuming application(s) are and what is practically possible for the service to deliver.
The whole app is deployed on PWS using PCF (Pivotal Cloud Foundry) infrastructure (Disclosure: I work for Pivotal).
PCF is a PaaS (Platform as a Service) like Heroku - the difference being it can be deployed On-Premise or on the cloud like AWS, Google Cloud Platform etc - with language support for Java/ Scala, .NET, Python, PHP, Ruby etc.
High Level Architecture of WebCat.
WebCat app architecture
The whole project has a microservices-based architecture and is a polyglot app with 5 services (written in 3 different languages):
An App serving as a frontend and also interacting with other services (Spring Boot/Java)
A Link Collection Service (Python)
A Link Crawler Service (Python)
A Categorisation Service (Scala)
A Stemming Service (Python)
The categorisation app (the frontend) delegates all the work to the other services.
Link Collection Service returns a bunch of URLs (like http://www.bmw.com/com/en/general/corporate_direct_sales/index.html) given a URL (www.bmw.com).
Link Crawler Service crawls each link and then returns the text for the link.
Stemming Service stems the words/text returned by the Link Crawler Service.
Categorisation Service loads the training data from the database on startup and then computes TF-IDF on the corpus, which is the input to building the model. The Categorisation Service builds the model using Apache Spark and predicts categories using Apache Spark's scalable machine learning library, MLlib. I'm currently working on using Apache MADlib to do the same.
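To make the TF-IDF plus Naive Bayes step concrete, here is a small pure-Python sketch of the idea. It is not the Categorisation Service itself (which is written in Scala on Spark MLlib). The tiny training set, the already-stemmed tokens, the function names and the smoothing choices are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((n + 1) / (count + 1)) for term, count in df.items()}


def tf_idf(doc, idf):
    """Weight raw term counts by IDF; terms unseen during training are dropped."""
    return {term: tf * idf[term] for term, tf in Counter(doc).items() if term in idf}


class NaiveBayes:
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, vectors, labels):
        self.class_counts = Counter(labels)
        self.vocab = {term for vec in vectors for term in vec}
        self.weights = defaultdict(Counter)  # category -> summed weight per term
        for vec, label in zip(vectors, labels):
            for term, weight in vec.items():
                self.weights[label][term] += weight
        return self

    def predict(self, vec):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            denom = sum(self.weights[label].values()) + self.alpha * len(self.vocab)
            score = math.log(count / total_docs)  # log prior of the category
            for term, weight in vec.items():
                likelihood = (self.weights[label][term] + self.alpha) / denom
                score += weight * math.log(likelihood)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical, already-stemmed training texts (the real data comes from the database).
training = [
    (["flight", "hotel", "book", "holiday"], "Travel"),
    (["hotel", "beach", "holiday", "tour"], "Travel"),
    (["footbal", "match", "score", "leagu"], "Sports"),
    (["cricket", "match", "team", "score"], "Sports"),
]
docs = [tokens for tokens, _ in training]
labels = [label for _, label in training]
idf = fit_idf(docs)
model = NaiveBayes().fit([tf_idf(doc, idf) for doc in docs], labels)


def categorise(tokens):
    """Predict a category for the stemmed tokens of a crawled website."""
    return model.predict(tf_idf(tokens, idf))


print(categorise(["cheap", "hotel", "holiday"]))
print(categorise(["leagu", "match"]))
```

In this sketch the stemmed tokens play the role of the output of the Stemming Service, and categorise stands in for the prediction call the frontend would make.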
I'm currently working on:
1) Building a feedback loop so that when the app predicts the category for an unseen website, the user gets an opportunity to validate whether the prediction was correct. If the user thinks it is incorrect, the user can change the category and send the feedback to the app. This becomes the input to the next model, so that the model gets better over a period of time (hopefully). Finally, there might be a day when it is 100% accurate :-)!
2) Adding user-defined categories (like Weather, Real Estate etc.) to the categories. For example, the weather.com website is categorised as "News and Media"; the user could be given an option to create a new category for this website, i.e. Weather (of course, this needs moderation).
I will update this post when it is done. If you have any ideas to make this better please email me at amith.nmbr@gmail.com.
This is a blog showing how Java applications can query data on HDFS using HAWQ. HAWQ is a SQL-based MPP engine on Hadoop. Having a SQL interface to query data sitting on a Hadoop cluster opens up a lot of possibilities and interesting use cases for analytics and visualization of structured and unstructured data.
Pivotal Greenplum database and HAWQ are quite easy to integrate with Java applications. The first barrier to entry while moving from an RDBMS to an MPP/Hadoop-based datastore is the work involved in changing the application code to work with the new datasources and environment.
Postgres is a popular RDBMS used by several companies and if you have ever used Postgres you are in luck!
Both Greenplum and HAWQ work with the same JDBC drivers used with Postgres.
And hence, if you want to scale your app to an MPP (Massively Parallel Processing) database you just need to point it to a Greenplum or a HAWQ cluster.
Below is a link to some PoC code to show how to query data on HDFS (Hadoop Distributed Filesystem) from Java applications using HAWQ. Though this is a use case specific to HDFS, the point I want to drive home is that the Postgres JDBC driver works seamlessly with both Greenplum and HAWQ.
I hope this blog encourages you to consider Greenplum or HAWQ as an option when you want to scale out your Postgres or other RDBMS backed applications.
Internet of Pi's is a blog about my experiments with building an IoT (Internet of Things) device using MQTT as the protocol between the device and the cloud platform Heroku. The device is an Internet-connected Lamp, built using a Raspberry Pi, which can be controlled from anywhere at any time. I have been calling it the Internet of Pi (IoP) for the pun. Below is a short video of the Internet of Pi in action. The Internet of Pi app takes a snapshot of the Lamp after switching it on or off in response to user action and then broadcasts the picture to all watching clients.
Picture of the lamp when switched on.
Here you can see many subscribed clients interacting with the Internet of Pi app.
Here is the NFC tag in the living room used to turn the lamp on/off in action.
A Relay - to switch the lamp on/off via the GPIO pins on the Raspberry Pi
A Camera - connected to the Pi to take photos of the Lamp, which are then uploaded to Amazon S3
MQTT broker - for receiving and sending MQTT encoded messages. The protocol used between the device and Internet of Pi service running on Heroku is MQTT.
Websockets for publishing updates in near real-time to subscribed users
NFC tag for toggling the Lamp (on/off) when in proximity to an NFC-enabled phone
Amazon S3 - used by the Pi for uploading photos of the lamp
Heroku - A cloud PaaS provider
Before getting into the details of the Internet of Pi, some thoughts on IoT. My thoughts are based on my own experience, gut feeling and what I have read about IoT. As with any new wave of innovation, there are new buzzwords and predictions, plus use cases describing what it can do and how it is going to transform the world we live in. It is the same with IoT. I came across use cases such as the toaster going on when you walk into a room and the refrigerator ordering more eggs when it runs out of them. These are not very convincing use cases for IoT.
In my opinion, the real value in the IoT revolution lies elsewhere. The advent of Big Data, i.e. the possibility to store and process terabytes or petabytes of data, made possible by technological advancements and the availability of open source software in the area of distributed computing, has made business intelligence and analytics at scale possible. The combination of Big Data and Machine Learning (at scale) used in an IoT context is the real game changer. Using predictive models to analyse sensor data and making decisions based on the measurements is something which is definitely going to shape how some industries use IoT. I have read about some very interesting use cases for IoT in the Health and Medicine, Agriculture, Mining and Transport industries. Predictive maintenance is one interesting use of IoT.
Switching on a Lamp at your home from a mobile device on the go: is it a good use case for IoT? This is just a fun experiment which was a learning experience for me on how to use a Raspberry Pi as an IoT device, reading up on IoT protocols like MQTT, figuring out how to run services reliably on the "thing" etc. I have definitely learned a few "things" while building the Internet of Pi. The app is deployed on Heroku, which is a good PaaS provider for quick prototyping.
I showed this project to a few of my friends and colleagues, and some of my colleagues asked: is the Lamp really going on/off at your home? That kicked off another round of weekend hacking on the Pi to add a "Pi Photo Service" that takes a photo with a timestamp after the Lamp goes on/off. The Pi runs two services:
1) the "Lamp Control Service" to switch the lamp on/off.
2) the "Pi Photo Service" to take photos with the camera connected to the Raspberry Pi.
The "Pi Photo Service" is a standalone service and the co-ordination happens at the server running on the Heroku cloud platform. There is an If This Then That kind of logic which triggers the "Take photo" event when the lamp is switched on/off.
Microservices on the Raspberry Pi
Developing the "Pi Photo Service" as a separate (micro) service meant, first of all, that I could reuse this service elsewhere and also update and release it on the Pi independently of the "Lamp Control Service". Secondly, there is no tight coupling between the 2 events ("Lamp state change" and "Take photo"): the logic of what to do when is managed/configured elsewhere (i.e. on the app server).
Below are notes on how the Internet of Pi application works. The entire application, including the services running on the Pi, is written in node.js - a JavaScript-everywhere app! There are 2 components that make up the entire Internet of Pi app:
1) The Internet of Lamps app running on Heroku, which has a Websocket module and an MQTT module to connect to clients (browsers/mobile apps) and to the MQTT broker respectively. This app is written in node.js.
2) Two services running on the Raspberry Pi.
A high level architecture of the Internet of Pi app.
Below is the sequence of events that happen when a user switches on the lamp.
There is an HTTP encoded message going from the browser/mobile app to the Internet of Lamps application on Heroku with the action indicating there is a request to switch the lamp ON.
The Internet of Lamps app on Heroku now publishes an MQTT message to the MQTT broker. I have used https://www.cloudmqtt.com/ as the MQTT broker for this app.
The Pi only talks MQTT - a protocol used in IoT applications. The Pi is subscribed to topics on the MQTT broker.
The Pi turns the GPIO pin ON and publishes a "lamp on" message to the broker, which is picked up by the Internet of Lamps app running on Heroku.
The Internet of Lamps app on Heroku first notifies the client (browser) that the Lamp is now ON. It then runs the event through the ITTT (If This Then That) engine and decides that it needs to take a photo of the Lamp now!
The Internet of Lamps app now sends a message back to the Pi over MQTT to take a photo, which is picked up by the "Pi Photo Service".
The Pi Photo Service takes the photo of the lamp and then uploads it to Amazon S3. It then publishes a message back on the broker that there is a new photo available for viewing.
The Internet of Lamps app now broadcasts a message over Websockets to all connected clients that a new photo is available for display.
The client (browser) fetches the image from S3. (This can be avoided by the app fetching the image from S3 and serving it.)
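The sequence above can be sketched with an in-memory publish/subscribe broker standing in for MQTT. This is an illustration only, written in Python rather than the node.js the real services use. The topic names, the RULES table, the photo file names and all function names are hypothetical, and a list stands in for both Amazon S3 and the Websocket broadcasts.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for the MQTT broker: topic -> list of callbacks."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, payload):
        # Deliver synchronously to every subscriber of the topic.
        for callback in list(self.subscribers[topic]):
            callback(payload)


broker = Broker()
photos = []  # stand-in for the S3 bucket
log = []     # stand-in for Websocket broadcasts to connected clients


# --- Services running on the Pi ---
def lamp_control_service(action):
    # The real service would switch the GPIO pin here.
    broker.publish("lamp/state", action)


def pi_photo_service(_trigger):
    name = f"photo-{len(photos) + 1}.jpg"
    photos.append(name)  # "upload" the photo
    broker.publish("photo/ready", name)


# --- App running on Heroku ---
RULES = {"lamp/state": "photo/take"}  # if this event happens, then publish that topic


def on_lamp_state(state):
    log.append(("lamp", state))                  # notify clients first
    broker.publish(RULES["lamp/state"], state)   # then apply the rule


def on_photo_ready(name):
    log.append(("photo", name))                  # tell clients a new photo is available


def handle_http_request(action):
    """Entry point for the browser/mobile app request."""
    broker.publish("lamp/set", action)


broker.subscribe("lamp/set", lamp_control_service)
broker.subscribe("photo/take", pi_photo_service)
broker.subscribe("lamp/state", on_lamp_state)
broker.subscribe("photo/ready", on_photo_ready)

handle_http_request("on")
print(log)
```

Because the two Pi services only share topics on the broker, either one can be replaced or redeployed without touching the other, which is the decoupling described earlier.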
I will be open sourcing the application and the code will be on GitHub soon. I will build Docker images for the services running on the Raspberry Pi so that they can be run without much hassle. I will also add some notes on how to put this all together to get a working Internet of Pi.
Picture of the NFC tag in the living room
Picture of the Lamp when switched on.
A picture of the Pi, Relay and the Lamp
Feel free to leave any comments/ideas for improvement.
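To implicitly convert Java collections to Scala collections you need the following import. It is very handy when you use Java libraries from Scala.
import scala.collection.JavaConversions._
This does the implicit conversion from a Java List<T> to a Scala mutable Buffer[T]. (JavaConversions is deprecated since Scala 2.12 and removed in 2.13, where scala.jdk.CollectionConverters with an explicit .asScala is used instead.)
// ArrayList has no varargs constructor, so build it from Arrays.asList
val computerScientists = new java.util.ArrayList[String](java.util.Arrays.asList("Linus", "Doug Cutting", "Jon Postel"))
// foreach is a function on Scala collections; the import above makes it available on the Java list
computerScientists.foreach(println)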