Tracking IoT Security with NLP
This article is based on the research of Dr. Giacomo Bernardi, who comes from the Internet service provider and networking world, so this is also the story of a multidisciplinary piece of research he did with his colleagues, the "kind of thing that everyone should look at in machine learning and AI today", as he puts it. He came up with this concept while looking at dwell time statistics. Dwell time is basically the measure of how much time passes before you realize that your network has been compromised. Reports usually put it at around 100 days, roughly three months, which is crazy: it means you could operate a corporate network for three months without realizing that it has been compromised and that an attacker has access to it.
IoT devices are vastly popular these days, and many companies produce small, limited-purpose IoT devices built on open-source technology such as the Raspberry Pi, which, as you may have heard, is basically a small computer that can be used for many things. In this example, let's say there is a temperature sensor and a humidity sensor attached to a Raspberry Pi. It periodically sends data back to the cloud, but it is also doing other things, for example synchronizing the clock, sending statistics, or updating the firmware.
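To make the scenario concrete, here is a minimal sketch of what such a device's main job might look like; the endpoint URL and the sensor driver are purely hypothetical stand-ins:

```python
import time
import requests  # assumes the requests library is installed on the Pi

TELEMETRY_URL = "https://cloud.example.com/telemetry"  # hypothetical endpoint

def read_sensor():
    # Placeholder for a real driver, e.g. a DHT22 wired to the Pi's GPIO pins.
    return {"temperature_c": 21.4, "humidity_pct": 48.0}

while True:
    # The device's "main" job: periodically ship a small reading to the cloud.
    requests.post(TELEMETRY_URL, json=read_sensor(), timeout=5)
    # Meanwhile the OS is also talking NTP, pushing statistics, and
    # occasionally fetching firmware updates -- all of it extra traffic.
    time.sleep(60)
```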
If you have a small deployment, you can follow the traditional approach of setting prescriptive firewall rules that determine what these devices can do and how much access they have. But things tend to expand: after a while you find yourself with a larger deployment with a lot of devices, some of them at remote locations, and some from a vendor you bought from several years ago that no longer ships updates. What you might not realize is that one of them has been compromised and is now an attacker's insertion point into your network. This is typically very hard to identify with traditional firewall rules, because the compromised device may generate very little traffic, perhaps just a keep-alive to a remote server controlled by the attacker.
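As a toy illustration of why prescriptive rules scale badly, here is a sketch of an allow-list check in Python; the device names, hosts, and ports are made up, and a real deployment would of course use actual firewall rules rather than application code:

```python
# A toy allow-list in the spirit of prescriptive firewall rules: each device
# may only talk to a fixed set of (host, port, protocol) destinations.
ALLOWED = {
    "temp-sensor-01": {
        ("telemetry.example.com", 443, "tcp"),  # hypothetical cloud endpoint
        ("pool.ntp.org", 123, "udp"),           # clock synchronization
    },
}

def is_allowed(device, dst_host, dst_port, proto):
    return (dst_host, dst_port, proto) in ALLOWED.get(device, set())

# Manageable for ten devices; hopeless for thousands of devices from vendors
# that never documented what their firmware talks to.
print(is_allowed("temp-sensor-01", "pool.ntp.org", 123, "udp"))  # True
```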
So they came up with a different approach: instead of trying to control what kind of traffic these devices can generate, allow every type of traffic, learn the typical behavior of each individual device, and watch whether that behavior changes over time. They used the concept of online learning: they tell their customers that the data is looked at only once and then thrown away, so nothing needs to be stored at all, which besides everything else is much cheaper.
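A minimal sketch of what "look at the data only once" means in practice: a single-pass (online) estimator that folds each observation into a running summary and then discards it. The smoothing factor and the scalar setting here are illustrative assumptions, not the authors' actual estimator:

```python
def make_online_mean(alpha=0.05):
    """Single-pass estimate of 'typical' behavior: each observation is folded
    into a running average and then thrown away -- nothing is ever stored."""
    state = {"mean": None}

    def update(x):
        if state["mean"] is None:
            state["mean"] = x
        else:
            state["mean"] = (1 - alpha) * state["mean"] + alpha * x
        return state["mean"]

    return update

update = make_online_mean()
for reading in [1.0, 1.1, 0.9, 5.0]:  # each value is seen exactly once
    typical = update(reading)
```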
The inspiration for this work comes from natural language processing, particularly the line of work that began around 2014-15 with word2vec and has continued through the recent advancements. At the core of modern NLP is the idea of word embeddings: the meaning of a word can be determined by looking at the contexts in which it appears most of the time.
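As a quick refresher on the embedding idea, here is a tiny word2vec example using the gensim library; the corpus is obviously made up, and the point is just that words appearing in similar contexts end up with nearby vectors:

```python
from gensim.models import Word2Vec  # assumes gensim >= 4 is installed

# Tiny corpus: "temperature" and "humidity" occur in identical contexts.
sentences = [
    ["the", "sensor", "sends", "temperature", "data"],
    ["the", "sensor", "sends", "humidity", "data"],
    ["the", "device", "updates", "its", "firmware"],
]
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1, seed=0)

# Words sharing a context get similar vectors, so this similarity is high.
print(model.wv.similarity("temperature", "humidity"))
```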
We are concerned with networks, so our context here is not a spoken language like English, Hindi, or Spanish; it's networking. Can we use these recent advances in NLP to model a made-up language, the synthetic language of networking? What if we say that a network flow is a word, and that all the words generated by an individual device form a document? Then we would be able to determine the typical topic of conversation of that particular temperature sensor and detect whether it changes over time.
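The exact flow encoding isn't spelled out here, but a naive version of the flow-as-word, device-as-document analogy might look like the following sketch, where the flow field names are hypothetical:

```python
from collections import defaultdict

def flow_to_word(flow):
    # Naive tokenization: one "word" per flow, built from its raw attributes.
    return f"{flow['proto']}_{flow['dst_port']}_{flow['bytes']}"

# Each device accumulates a "document": the sequence of flow-words it emitted.
documents = defaultdict(list)
for flow in [
    {"device": "temp-sensor-01", "proto": "tcp", "dst_port": 443, "bytes": 512},
    {"device": "temp-sensor-01", "proto": "udp", "dst_port": 123, "bytes": 76},
]:
    documents[flow["device"]].append(flow_to_word(flow))

print(documents["temp-sensor-01"])  # ['tcp_443_512', 'udp_123_76']
```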
Because we are not using English or any spoken language, the size of the vocabulary is not naturally limited, so we apply a few tricks to bound its maximum size; otherwise it would grow without limit. The result is a vector representation of the meaning of individual networking concepts, such as a packet being TCP rather than UDP.
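One plausible version of such vocabulary-bounding tricks, building on the naive tokenizer above: quantize the unbounded fields, for instance byte counts into log-scale buckets and ephemeral ports into a single token, so that the set of possible words stays finite. The specific bucketing choices are illustrative assumptions:

```python
import math

def bucket_bytes(n):
    # Log-scale buckets: 0-9, 10-99, 100-999, ... so byte counts collapse to a
    # handful of tokens instead of one token per exact value.
    return f"b{int(math.log10(max(n, 1)))}"

def bucket_port(p):
    # Well-known ports keep their identity; ephemeral ports become one token.
    return str(p) if p < 1024 else "ephemeral"

def flow_to_bounded_word(flow):
    return f"{flow['proto']}_{bucket_port(flow['dst_port'])}_{bucket_bytes(flow['bytes'])}"

print(flow_to_bounded_word({"proto": "tcp", "dst_port": 443, "bytes": 512}))  # tcp_443_b2
```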
Let's imagine device A is a temperature sensor, and say that over the last few minutes we saw four individual flows. We pass them through our modified word2vec approach and get a vector representation for each flow as output; in practice these vectors will be very similar. We also keep a tiny state for each individual device describing its typical behavior: the clusters of flows that the device has generated over the recent past, where each cluster is basically a moving average of flows that are similar to one another.
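Here is a sketch of what that tiny per-device state could look like, with each cluster kept as an exponential moving average of the flow vectors it has absorbed; the smoothing factor and the cluster cap are assumed values, not taken from the research:

```python
import numpy as np

class DeviceState:
    """Tiny per-device state: a few cluster centroids, each a moving average
    of the embedding vectors of similar recent flows."""

    def __init__(self, alpha=0.1, max_clusters=16):
        self.centroids = []          # list of np.ndarray centroids
        self.alpha = alpha           # moving-average weight for a new flow
        self.max_clusters = max_clusters

    def absorb(self, vec, matched_idx=None):
        vec = np.asarray(vec, dtype=float)
        if matched_idx is None:      # no similar cluster yet: start a new one
            if len(self.centroids) < self.max_clusters:
                self.centroids.append(vec.copy())
        else:                        # fold the flow into the matched cluster
            c = self.centroids[matched_idx]
            self.centroids[matched_idx] = (1 - self.alpha) * c + self.alpha * vec
```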
Then we look at the similarity between every pair of flows and group together those that are more similar than a given threshold, which basically says: all the flows that are almost identical, let's consider them as one.
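A straightforward way to implement that grouping, assuming cosine similarity over the flow vectors and a greedy first-match strategy (both assumptions on my part):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_similar(vectors, threshold=0.95):
    """Greedy grouping: flows whose vectors are more similar than the
    threshold are treated as one, represented by the group's first vector."""
    groups = []
    for v in vectors:
        for g in groups:
            if cosine(v, g[0]) >= threshold:
                g.append(v)
                break
        else:
            groups.append([v])
    return groups
```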
For this particular temperature sensor, we then take the maximum similarity for each flow and feed those maxima into a contour sketch algorithm.
What is being done is basically taking the last row of the comparison table, which is the maximum similarity between the new flows and the existing clusters. If any previous cluster is more similar than the threshold, then this is not an anomaly: we have seen behavior similar to this one before, so we are fine. But if there isn't, this is a new behavior from this temperature sensor, and it generates an alert.
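Putting the decision step together, reusing cosine() from the grouping sketch and DeviceState from earlier: take the maximum similarity between a new flow vector and the stored clusters, and alert only if even the best match falls below the threshold. The threshold value is an assumption:

```python
import numpy as np

def check_flow(vec, state, threshold=0.9):
    """Return (matched_cluster_index, is_anomaly) for one new flow vector."""
    if not state.centroids:
        return None, True                        # first flow ever: nothing to compare
    sims = [cosine(vec, c) for c in state.centroids]
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return best, False                       # seen similar behavior before: fine
    return None, True                            # brand-new behavior: raise an alert

# matched, alert = check_flow(new_vec, device_state)
# if alert: raise_alert(device_id)               # raise_alert is hypothetical
# else:     device_state.absorb(new_vec, matched)
```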
Applying this to multiple devices looks something like the figure below, where you can see a few clusters, each denoting one type of sensor activity, while the points far away from them denote anomalies.