Monday 16 March 2020

AI and Cybersecurity. Part 2 - Detecting Phishing URLs with ML

In Part 1, we got acquainted with AI paradigms and the main ML approaches: supervised, unsupervised, and reinforcement learning. Unsupervised learning looks more attractive because you do not need to label the training data in advance, but once we have enough labeled data, supervised learning can be a more precise instrument for detecting malicious objects such as phishing URLs.

Supervised Detection of Phishing

The first good thing about phishing is that we can use a database of validated phishing links, where antivirus companies and individual researchers upload new phishing links that are then validated. Benign URLs are also easy to find with a service that keeps track of popular websites.

Second, the variety of techniques for launching phishing webpages is limited by the capabilities of hosting providers and domain name registrars, and little has changed over the past 15–20 years, except that certificates have come into wider use. Attackers still register domain names that resemble the name of the attacked service, and for the shortest possible time, typically one year. The webpage is hosted locally or on a hacked server. Most attacks use plain HTTP without TLS and a certificate. Thus, it is enough to take into consideration the following attributes: the domain name, WHOIS data, geolocation, the type of network protocol (HTTP or HTTPS), and, optionally, the presence of a certificate and its parameters.

We will need a binary classifier to detect phishing URLs. The complex machinery of deep neural networks (DNN), which requires a large amount of data and computing power for high-quality training, is not necessarily needed for such a basic task. Traditional classification methods are enough: discriminant analysis, support vector machines (SVM), decision trees, random forests, or even regression analysis, which predicts a threat score if you want to end up with a threat scoring function.

Consider the task of detecting phishing links based on the following attributes: domain registrar, domain registration period, geolocation of the hosting server, and the presence of a secure connection with a valid certificate. I give this problem to my students as a lab assignment. First, they need to create a training dataset, using a resource with classified phishing links as the source.

Table 1. An example of dataset with URL attributes and class labels.
# | URL | IP address | Registrar | Lifetime | Country | Class
5 | [https://] | | Direkt | 14 | SE | benign
6 | [https://]wikipedia.com | 208.80.154.238 | MARKMONITOR INC. | 16 | US | benign

Then, we perform an exploratory data analysis.

Figure 1. The charts showing the frequency of occurrence of the values of three attributes: a) registrar name (Registrar), b) domain registration period (Lifetime), c) geolocation of the hosting server (Countries) for phishing and benign URLs.

After a preliminary data analysis, we see that the Registrar and Country attributes will be difficult to use for classification, since the most common values largely overlap between the two classes. Differences in these two attributes are more subtle and may only be revealed in larger datasets. The Lifetime attribute, however, which is the period for which the domain is registered, shows that phishing domains tend to be registered for a shorter period. For example, 37% of phishing domains were registered for 1 year, in contrast to benign domains, which are usually registered for a longer period - 10 years or more.
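A per-class frequency count like the one behind these percentages can be sketched in a few lines of plain Python. The records below are made up for illustration (they are not the dataset from this post), and the helper name `lifetime_shares` is hypothetical.

```python
from collections import Counter

# Hypothetical toy records: (class label, registration period in years).
# These values are invented for illustration, not taken from the post's dataset.
records = [
    ("phishing", 1), ("phishing", 1), ("phishing", 2), ("phishing", 5),
    ("benign", 10), ("benign", 16), ("benign", 14), ("benign", 1),
]


def lifetime_shares(rows, label):
    """Return {lifetime: share of rows with that lifetime} for one class."""
    lifetimes = [years for cls, years in rows if cls == label]
    counts = Counter(lifetimes)
    return {years: n / len(lifetimes) for years, n in counts.items()}


phishing_shares = lifetime_shares(records, "phishing")
benign_shares = lifetime_shares(records, "benign")
print("phishing:", phishing_shares)
print("benign:  ", benign_shares)
```

Comparing the two dictionaries side by side gives the same kind of picture as the Lifetime chart: the share of 1-year registrations in each class.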

As for TLS, the overwhelming majority of phishing links (96%) do not use the HTTPS protocol. Therefore, this is one of the essential criteria that can be used to detect phishing attacks. For HTTPS URLs, you can additionally analyze the certificate attributes: for instance, which certificate authority (CA) issued the certificate, to whom, and the registration address of the company to which it was issued. Hackers can register certificates for fake companies, as happened with the LockerGoga ransomware, whose files were signed with digital certificates issued by Sectigo RSA Code Signing CA to the fake companies Alina Ltd, Kitty’s Ltd., Mikl Limited and AB Simba Limited. I recommend reading the Chronicle study on this matter.

When preparing a dataset, we need to translate all string values into categorical ones and encode them with numerical values. The simplest way is to use an ordinal number for the category to represent its values in a numerical space. For example, for the categorical attribute Country, we create a new numerical attribute Country_code with the ordinal number of a country found in the dataset (1 - Australia, 2 - Bangladesh, 3 - Canada, ..., 28 - USA).
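The ordinal encoding described above can be sketched as follows. This is a minimal illustration that assumes alphabetical ordering of the categories; the helper names `ordinal_codes` and `protocol_code` and the sample country list are hypothetical, so the resulting numbers differ from the 28-country dataset in this post.

```python
from urllib.parse import urlsplit


def ordinal_codes(values):
    """Map each distinct string to a 1-based ordinal number (alphabetical order)."""
    return {value: i for i, value in enumerate(sorted(set(values)), start=1)}


def protocol_code(url):
    """Return 1 for https URLs and 0 otherwise (0 - http, 1 - https)."""
    return 1 if urlsplit(url).scheme.lower() == "https" else 0


# Hypothetical sample of the Country attribute.
countries = ["USA", "Canada", "Australia", "USA", "Bangladesh"]
country_code = ordinal_codes(countries)
encoded = [country_code[c] for c in countries]

print(country_code)
print(encoded)
print(protocol_code("https://example.com/login"), protocol_code("http://example.com"))
```

Keep in mind that ordinal numbers impose an artificial order on the categories, so a distance-based method will treat countries with nearby codes as "closer" to each other.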

Figure 2. Distribution of phishing (orange dots) and benign (blue dots) links depending on the registration period (lifetime), geolocation (Country_code) and network protocol type (Protocol_code: {0 - http, 1 - https}).

The figure shows several trends in the distribution of data:
  • A large share of IP addresses (47% of phishing and 53% of benign) are located in the US (Country_code = 28).
  • The vast majority of phishing links (96%) use HTTP, while all benign links use encrypted communication (HTTPS).
  • The registration period for phishing domains is generally substantially shorter than for benign ones. In a larger dataset, this trend should be more pronounced, with phishing registration periods tending toward one year.
Now we’ll apply the k-nearest neighbors (k-NN) algorithm to classify the links. The algorithm assigns to an object the class that is most common among its k nearest neighbors whose classes are already known.

But first, we need to train our model and then draw the decision boundary of the classifier. If a new link falls into the blue zone, it will be classified as benign; if it falls into the orange zone, as phishing. The more labeled links the training dataset contains, the higher the accuracy of the classifier. In addition, you can experiment with model parameters, such as the number of nearest neighbors, the weight function, the algorithm for finding nearest neighbors, and others.
Figure 3. The binary classifier based on the k-NN algorithm. Parameters: the number of neighbors k = 5; weight functions: on the left - uniform (all points in each neighborhood have the same weight) and on the right - distance (closer neighbors of the query point will have a greater impact than neighbors that are farther).
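To make the two weight functions concrete, here is a minimal pure-Python sketch of k-NN voting. It is not the code used to produce the figure: the function `knn_predict`, the toy training points, and the feature order (Lifetime, Country_code, Protocol_code) are assumptions made for illustration, and the features are left unscaled for simplicity.

```python
import math
from collections import defaultdict


def knn_predict(train, query, k=5, weights="uniform"):
    """Classify `query` by a (weighted) vote among its k nearest training points.

    train: list of (feature_vector, label) pairs; query: feature vector.
    """
    # Sort training points by Euclidean distance to the query, keep the k closest.
    neighbors = sorted(
        ((math.dist(x, query), label) for x, label in train),
        key=lambda pair: pair[0],
    )[:k]

    votes = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0
        elif weights == "distance":
            # Closer neighbors count more; an exact match gets a very large weight.
            votes[label] += 1.0 / max(distance, 1e-9)
        else:
            raise ValueError("weights must be 'uniform' or 'distance'")
    return max(votes, key=votes.get)


# Hypothetical toy data: (Lifetime, Country_code, Protocol_code) -> class.
train = [
    ((1, 28, 0), "phishing"), ((1, 28, 0), "phishing"),
    ((1, 12, 0), "phishing"), ((2, 28, 0), "phishing"),
    ((10, 28, 1), "benign"), ((16, 28, 1), "benign"),
    ((14, 20, 1), "benign"), ((12, 28, 1), "benign"),
]

short_lived = knn_predict(train, (1, 28, 1), k=5, weights="uniform")
long_lived = knn_predict(train, (12, 28, 1), k=5, weights="distance")
print(short_lived, long_lived)
```

With real data it is usually worth scaling the features first, since otherwise the attribute with the largest numeric range (here Country_code) dominates the distance.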

Then, we can use the trained classifier to detect phishing links. For example, a URL (source: PhishTank) leading to a fake eBay webpage with the following attributes:
  • the domain is registered for 1 year (27.01.2019 - 27.01.2020),
  • the server is located in the USA, HTTPS is enabled,
  • the registrar is NameCheap, Inc.
will be detected by our classifier as phishing. But if the domain had been registered for 3 years instead of 1, our classifier would classify it as benign.

To evaluate the classifier, we need to test it on a dataset with known classes, for example by cross-validation. The test result will show us classification errors such as false positives and false negatives. In particular, the following basic metrics can be used to evaluate the classifier:
  • True positive (TP) - the number of phishing links identified as phishing;
  • True negative (TN) - the number of benign links identified as benign;
  • False positive (FP) - the number of benign links identified as phishing;
  • False negative (FN) - the number of phishing links identified as benign.
And the metrics derived from them:
  • True positive rate (TPR) or Recall = TP / (TP + FN);
  • True negative rate (TNR) = TN / (TN + FP);
  • False positive rate (FPR) = FP / (FP + TN);
  • False negative rate (FNR) = FN / (FN + TP);
  • Precision or Positive predictive value (PPV) = TP / (TP + FP);
  • Negative predictive value (NPV) = TN / (TN + FN);
  • Accuracy = (TP + TN) / (P + N) = (TP + TN) / (TP + TN + FP + FN);
  • F-measure - harmonic mean between Recall (TPR) and Precision (PPV) = 2 * Precision * Recall / (Precision + Recall).
In the context of detecting cyber attacks, the most important metric is FN / FNR, which quantifies the missed attacks.
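The formulas above translate directly into code. The sketch below treats "phishing" as the positive class; the function names and the toy label lists are hypothetical and serve only to show the arithmetic.

```python
def confusion_counts(y_true, y_pred, positive="phishing"):
    """Count TP, TN, FP, FN, treating `positive` as the attack class."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def safe_ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def derived_metrics(tp, tn, fp, fn):
    """Compute the derived metrics listed above from the four basic counts."""
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    return {
        "TPR": recall,
        "TNR": safe_ratio(tn, tn + fp),
        "FPR": safe_ratio(fp, fp + tn),
        "FNR": safe_ratio(fn, fn + tp),
        "PPV": precision,
        "NPV": safe_ratio(tn, tn + fn),
        "Accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "F-measure": safe_ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical test labels: 4 phishing and 6 benign links.
y_true = ["phishing"] * 4 + ["benign"] * 6
y_pred = ["phishing"] * 3 + ["benign"] + ["benign"] * 5 + ["phishing"]

counts = confusion_counts(y_true, y_pred)
result = derived_metrics(*counts)
print(counts)
print(result)
```

In this toy example one phishing link out of four is missed, so FNR = 0.25 even though the overall accuracy is 0.8, which is why FNR deserves separate attention.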

If you want to play with the model, the source code is available on GitHub.

In the next post, we'll consider the visualization problem and dimensionality reduction.
To be continued...
