Thursday, 26 March 2020

AI and Cybersecurity. Part 4 - Clustering URLs


In Part 3, we applied feature scaling and dimensionality reduction to the dataset of phishing and benign URLs. As a result, we could clearly see how the URLs are distributed between the two classes based on four attributes: registrar, country, lifetime, and protocol.

But what if we don't have labels (phishing and benign) for the links to begin with? Will ML still work to detect phishing attacks? In this case, we can turn to unsupervised learning, in particular clustering. Clustering groups objects of unknown classes according to common features, so we do not need labeled data for a training set.
Here, we're going to use a popular clustering method called K-means. K-means seeks to minimize the total squared deviation of the points in each cluster from the center of that cluster. In this example, we use two clusters.
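To make the K-means objective concrete, here is a minimal pure-Python sketch of the standard iterative procedure (assign each point to its nearest center, then move each center to the mean of its points). It is an illustration only, not the code used for this post: the function names and the toy two-dimensional points are assumptions, and a real experiment would more likely rely on a library implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k=2, n_iter=100, seed=0):
    """Minimal K-means sketch: returns (assignment, centers, inertia)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the procedure has converged
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Inertia is the total squared deviation that K-means tries to minimize.
    inertia = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return assignment, centers, inertia


# Toy data (hypothetical): two well-separated groups of points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels, centers, inertia = kmeans(points, k=2)
print(labels, inertia)
```

Note that the cluster numbers returned (0 or 1) are arbitrary: the algorithm only says which points belong together, not what each group means.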

Projecting the clustered data onto a two-dimensional plane with the t-SNE dimensionality reduction technique, we obtain the following visualization (Figure 1).


Figure 1. Visualization of the URLs in two-dimensional space after clustering with K-means (phishing links - orange dots, benign links - blue dots).


The resulting clusters can be compared with the labeled data to get an idea of the accuracy of the cluster analysis.
Figure 2. The training dataset with the labeled data (left) and the K-means clustering results (right).
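Comparing clusters with labels requires deciding which cluster number stands for "phishing", since K-means assigns cluster IDs arbitrarily. The sketch below tries both assignments and keeps the one that agrees with the labels more often. The helper names, the label convention (1 = phishing, 0 = benign), and the tiny example data are assumptions made for illustration.

```python
def confusion_counts(true_labels, cluster_ids, phishing_cluster):
    """Count TP, TN, FP, FN, treating `phishing_cluster` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, cluster in zip(true_labels, cluster_ids):
        predicted_phishing = cluster == phishing_cluster
        if truth == 1 and predicted_phishing:
            tp += 1
        elif truth == 0 and not predicted_phishing:
            tn += 1
        elif truth == 0 and predicted_phishing:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def best_alignment(true_labels, cluster_ids):
    """Try both cluster-to-class mappings and keep the more accurate one."""
    candidates = []
    for phishing_cluster in (0, 1):
        tp, tn, fp, fn = confusion_counts(true_labels, cluster_ids, phishing_cluster)
        candidates.append((tp + tn, phishing_cluster, (tp, tn, fp, fn)))
    _, phishing_cluster, counts = max(candidates)
    return phishing_cluster, counts


# Hypothetical example: 1 = phishing, 0 = benign.
true_labels = [1, 1, 1, 0, 0, 1]
cluster_ids = [0, 0, 0, 1, 1, 1]
phishing_cluster, counts = best_alignment(true_labels, cluster_ids)
print(phishing_cluster, counts)
```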

A more precise assessment of the clustering quality can be obtained using the metrics described in Part 2: TPR, TNR, FPR, FNR, PPV, NPV, F-measure.

  • TP = 110, TN = 41, FP = 7, FN = 44;
  • TPR (Recall) = 71%;
  • TNR = 85%;
  • FPR = 15%;
  • FNR = 29%;
  • PPV (Precision) = 0.94;
  • NPV = 0.48;
  • Accuracy = 0.75;
  • F-measure = 0.73.
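The rates above follow directly from the four counts. The sketch below recomputes them using the usual textbook definitions; the function name is an assumption. One caveat: the F1 score defined as the harmonic mean of precision and recall evaluates to roughly 0.81 for these counts, which differs from the 0.73 listed above, so the listed F-measure may follow a different definition.

```python
def rates(tp, tn, fp, fn):
    """Compute common confusion-matrix metrics from raw counts."""

    def ratio(numerator, denominator):
        # Guard against empty denominators (for example, no predicted positives).
        return numerator / denominator if denominator else 0.0

    tpr = ratio(tp, tp + fn)   # recall / sensitivity
    tnr = ratio(tn, tn + fp)   # specificity
    ppv = ratio(tp, tp + fp)   # precision
    npv = ratio(tn, tn + fn)
    return {
        "TPR": tpr,
        "TNR": tnr,
        "FPR": 1.0 - tnr,
        "FNR": 1.0 - tpr,
        "PPV": ppv,
        "NPV": npv,
        "Accuracy": ratio(tp + tn, tp + tn + fp + fn),
        # Standard F1: harmonic mean of precision and recall.
        "F1": ratio(2 * ppv * tpr, ppv + tpr),
    }


# Counts from the K-means run without feature scaling.
metrics = rates(tp=110, tn=41, fp=7, fn=44)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```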

To improve the accuracy of the method, we can apply feature scaling, which we discussed in Part 3.
Figure 3. Visualization of the URLs in two-dimensional space after clustering with K-means and min-max normalization (phishing links - orange dots, benign links - blue dots).
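Scaling matters for K-means because the algorithm relies on Euclidean distances, so a feature with a large numeric range (such as lifetime in days) can dominate features with a small range (such as a 0/1 protocol flag). Below is a minimal sketch of min-max normalization, which rescales every feature to the interval [0, 1]. The function name and the sample rows are hypothetical.

```python
def min_max_scale(rows):
    """Rescale each column of `rows` to [0, 1] using its own min and max."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for row in rows:
        scaled.append(tuple(
            # A constant column carries no information, so map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ))
    return scaled


# Hypothetical rows: (domain lifetime in days, protocol flag).
rows = [(365.0, 1.0), (30.0, 0.0), (3650.0, 1.0)]
scaled = min_max_scale(rows)
print(scaled)
```

After scaling, both features contribute on a comparable scale to the distances that K-means computes.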


Figure 4. The training dataset with the labeled data (left) and the K-means clustering results with min-max normalization (right).
The "classifier's" evaluation:
  • TP = 148, TN = 48, FP = 0, FN = 6
  • TPR (Recall) = 96%
  • TNR = 100% 
  • FPR = 0%
  • FNR = 3.9%
  • PPV (Precision) = 1 
  • NPV = 0.89 
  • Accuracy = 0.97
  • F-measure = 0.965
As you can see, feature scaling can dramatically improve the quality of clustering (classification), here from 75% to 97% in terms of accuracy. So even unsupervised learning can be used to detect phishing URLs, provided that we can identify which cluster contains which kind of URL. One option is to scan URLs from each cluster on VirusTotal, or look them up on PhishTank, after clustering to identify the class of the cluster: phishing or benign. According to the metrics calculated above, we need to gather verdicts for more than 2*3.9% (2*max(FNR, FPR)) of phishing URLs to be able to label the whole cluster as the one that contains phishing links.
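The cluster-labeling idea can be sketched as a sampled majority vote. This is only one possible reading of the rule above: the default fraction of 0.078 (2*3.9%) is applied to the size of the cluster, and `lookup_verdict` is a hypothetical callable standing in for a VirusTotal or PhishTank lookup, not a real API of either service.

```python
import math
import random


def label_cluster(urls, lookup_verdict, sample_fraction=0.078, seed=0):
    """Label a whole cluster from verdicts on a random sample of its URLs.

    `lookup_verdict(url)` is a hypothetical function that returns True
    when an external service flags the URL as phishing.
    """
    rng = random.Random(seed)
    sample_size = max(1, math.ceil(sample_fraction * len(urls)))
    sample = rng.sample(urls, sample_size)
    phishing_votes = sum(1 for url in sample if lookup_verdict(url))
    # Simple majority vote over the sampled verdicts.
    label = "phishing" if phishing_votes * 2 > sample_size else "benign"
    return label, phishing_votes, sample_size


# Hypothetical cluster of 148 URLs, with a stub lookup that flags everything.
cluster_urls = [f"http://example.test/{i}" for i in range(148)]
label, votes, sample_size = label_cluster(cluster_urls, lambda url: True)
print(label, votes, sample_size)
```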

In this post, we considered a hypothetical case of URL clustering that, under some conditions, can be turned into a classification problem. In practice, clustering can also help identify groups of unknown malware samples submitted to a malware lab that were missed by a trained classifier (FN) and/or a signature scanner, and that may belong to the same malware family or have been written by the same threat actor (the attribution problem). Further, the clustered data that we classified post factum can be used in supervised learning to train a classifier.

If you want to play with the model, the source code is available on GitHub.

To be continued...
