IP Insights Model ‘De-simplify’ — Part III Model Training

Data Centric Mind
3 min read · Jan 8, 2022

This blog explains in detail how the AWS IP Insights model trains the entity embedding vectors. If you want to learn about the background of this project, please check out this post in the IP Insights series.

Alright. Let’s get started!

Entity Embedding

The IP Insights algorithm uses a neural network to learn the latent vector representations for entities (users) and IP addresses. Entities are first hashed to a large but fixed hash space and then encoded by a simple embedding layer. Character strings such as user names or account IDs can be fed directly into IP Insights as they appear in log files. You don’t need to preprocess the data for entity identifiers.
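As a minimal illustration of this idea (my own sketch, not AWS's actual implementation), hashing a raw entity string into a fixed index space and looking it up in an embedding table could look like the PyTorch snippet below; the hash-space size, vector dimension, and hash function are made-up choices:

```python
import hashlib

import torch
import torch.nn as nn

NUM_ENTITY_VECTORS = 100_000  # size of the fixed hash space (illustrative)
VECTOR_DIM = 128              # embedding dimension (illustrative)

# One embedding table for entities; IP Insights learns a similar one for IPs.
entity_embedding = nn.Embedding(NUM_ENTITY_VECTORS, VECTOR_DIM)

def entity_to_index(entity_id: str) -> int:
    """Hash a raw entity string (e.g. a user name) into the fixed hash space."""
    digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_ENTITY_VECTORS

# Raw strings from log files can be used directly, no preprocessing needed.
idx = torch.tensor([entity_to_index("alice@example.com")])
vector = entity_embedding(idx)  # shape: (1, VECTOR_DIM)
```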

Below is an illustrative sample of the training data. All you need as input is a userID and IP pair per record. The idea is adopted from NLP word embeddings, where words that occur close together or in the same context/sentence end up semantically closer. Similarly, at its core, the IP Insights algorithm iteratively pushes the points representing IP addresses and resources together if they are associated with each other in the training data, and pulls them away from each other if they are not. To be more specific, it adopts a negative sampling strategy during training: IP Insights automatically generates negative samples by randomly pairing entities and IP addresses. These negative samples represent data that is less likely to occur in reality.
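The algorithm takes a headerless, two-column CSV of (entity, IP) pairs; a made-up snippet might look like this:

```
user_alice,10.0.0.1
user_bob,192.168.1.7
user_carol,172.16.4.9
```

And here is a minimal sketch of the random-pairing idea behind negative sampling (again my own illustration, not the algorithm's internal code):

```python
import random

# Made-up observed (entity, IP) pairs from the training data.
positive_pairs = [
    ("user_alice", "10.0.0.1"),
    ("user_bob", "192.168.1.7"),
    ("user_carol", "172.16.4.9"),
]

def random_negative_samples(pairs, num_samples, seed=0):
    """Generate unlikely (entity, IP) pairs by randomly re-pairing observed ones."""
    rng = random.Random(seed)
    entities = [entity for entity, _ in pairs]
    ips = [ip for _, ip in pairs]
    observed = set(pairs)
    negatives = []
    while len(negatives) < num_samples:
        candidate = (rng.choice(entities), rng.choice(ips))
        if candidate not in observed:  # skip pairs that actually occurred
            negatives.append(candidate)
    return negatives

print(random_negative_samples(positive_pairs, num_samples=2))
```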

We split the simulated data from the last step into training, validation, and testing sets. We use the training data to train the initial entity embeddings, and we use the validation data together with the hyperparameter auto-tuning tool to fine-tune the hyperparameters. After the fine-tuning, we have our final entity embeddings.

IP Insights model pipeline
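A sketch of what this tuning step can look like with the SageMaker Python SDK is below. The bucket paths and role ARN are placeholders, the tunable ranges are illustrative, and `validation:discriminator_auc` is the tuning metric the built-in algorithm exposes, as far as I know; check the IP Insights docs for the full list of tunable hyperparameters.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder IAM role

# Container image for the built-in IP Insights algorithm in this region.
image_uri = image_uris.retrieve("ipinsights", session.boto_region_name)

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/ipinsights/output",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(num_entity_vectors=100000, epochs=10)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:discriminator_auc",
    hyperparameter_ranges={
        "vector_dim": IntegerParameter(64, 256),
        "learning_rate": ContinuousParameter(1e-4, 1e-1),
    },
    max_jobs=8,
    max_parallel_jobs=2,
)

tuner.fit({
    "train": TrainingInput("s3://my-bucket/ipinsights/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/ipinsights/validation.csv", content_type="text/csv"),
})
```

The model artifacts of the best training job then contain the final embeddings.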

Select a Threshold to Separate Good and Bad Traffic

Once we obtain the embeddings, we can run inference to get the representations of a userID and an IP. We decide whether a login is rare/malicious by calculating how close the two entities are using cosine similarity. However, we need a cutoff on the similarity score: if the two entities have a larger (closer) score, the login is treated as benign, while a smaller score marks it as rare and potentially malicious.
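For example, scoring one (user, IP) pair could look like the sketch below; the vectors and cutoff are made up. (The hosted IP Insights endpoint actually returns an unnormalized dot-product score, but the idea is the same.)

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between an entity vector and an IP vector."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical learned vectors for a user and the IP it logged in from.
user_vec = np.array([0.12, -0.40, 0.33, 0.05])
ip_vec = np.array([0.10, -0.35, 0.30, 0.01])

score = cosine_similarity(user_vec, ip_vec)
THRESHOLD = 0.5  # illustrative cutoff; chosen from data in the next step
print("benign" if score >= THRESHOLD else "rare/malicious")
```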

We further simulated malicious traffic and ran embedding inference on it to find a threshold that separates most of the good traffic from the bad. At this point, we have everything needed to deploy the model and generate real-time alerts.
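One simple way to pick such a cutoff, sketched below with made-up scores, is to choose the threshold that maximizes the separation on an ROC curve (Youden's J statistic); the post doesn't show the author's actual scores or method, so treat this as one reasonable option.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical similarity scores from inference on labeled traffic.
benign_scores = np.array([0.82, 0.75, 0.91, 0.68, 0.79])     # simulated good logins
malicious_scores = np.array([0.12, 0.33, 0.25, 0.41, 0.05])  # simulated bad logins

labels = np.concatenate([np.ones_like(benign_scores), np.zeros_like(malicious_scores)])
scores = np.concatenate([benign_scores, malicious_scores])

# Pick the threshold that maximizes TPR - FPR (Youden's J).
fpr, tpr, thresholds = roc_curve(labels, scores)
best = thresholds[np.argmax(tpr - fpr)]
print(f"chosen threshold: {best:.2f}")
```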

