Saturday 14 March 2020

AI and Cybersecurity. Part 1 - Intro

[Author: Alexander Adamov]

I have spent almost all my professional life in the antivirus industry, detecting and analyzing malware. Around ten years ago, when the flow of malware had grown so large that my colleagues and I no longer had the resources to analyze every sample, we started thinking about automating our efforts. How do you build a machine that autonomously detects and analyzes malware and phishing URLs day and night, and writes and publishes reports? In the end, we managed to create such a robot (what we now call a 'malware sandbox') from scratch, automating most of the processes in the malware laboratory with the help of His Majesty Artificial Intelligence (AI). Since then, we have accumulated a set of use cases for cyberattack detection, malware analysis, and security testing with ML that can be useful for cybersecurity professionals who decide to leverage ML for cyberdefense. I'm going to share this knowledge in a series of blog posts that will eventually become part of a new university course, 'ML in Cybersecurity', which I plan to make open-source. I also welcome cybersecurity experts and data scientists to contribute and help universities adopt the course.

Today, we start with the main paradigms of AI; in the next post, we'll look at an example of detecting phishing URLs using classification methods and cluster analysis. But first, let me invite you on a brief tour of AI history to understand its basic concepts.

Diverse AI: a history tour
The term "Artificial Intelligence" (AI) was first used by John McCarthy (Dartmouth College), Marvin Minsky (Harvard University), Nathaniel Rochester (IBM), and Claude Shannon (Bell Telephone Laboratories) in 1956. Since then, the definition of AI has repeatedly changed with the level of development of information technology, and it is usually interpreted depending on the context and the maturity of the technology. Such definitions are called AI paradigms, and the process of changing them is called an AI paradigm shift. For example, David Auerbach identifies five AI paradigms:
  • Speculative (until 1940).
  • Cybernetic (1940–1955).
  • Symbolic AI (1955–1985), AI winter (1974–80).
  • Subsymbolic AI (1985–2010), 2nd AI winter (1987–1993).
  • Deep Learning (2010–present).
To see this diversity for yourself, just search for the terms Artificial Intelligence (AI), Machine Learning (ML), and Data Mining: you will find plenty of Euler diagrams showing the relationships between these fields, all of them different :)

Figure 1. A variety of existing relationships between Artificial Intelligence, Machine Learning, Data Mining, and Statistics shown with Euler diagrams. Image sources: 1, 2, 3

Ron Schmelzer (Managing Partner & Principal Analyst at AI Focused Research and Advisory firm Cognilytica), in his Forbes article, argues that there are two polar opinions.

According to one, only Artificial General Intelligence (AGI: general, full, or strong AI), which has the same cognitive abilities as a human, can be called real AI; and only those ML algorithms that serve the purpose of AGI can honorably be called AI. This position is characteristic of the paradigms of the mid-twentieth century.

In 1950, Alan Turing, in his famous work Computing Machinery and Intelligence, proposed a test to answer the question "can machines think?". The essence of the test, which was called the Imitation Game, was to ask questions of a human and a machine in turn and to figure out from their answers which one you are communicating with at any given moment. His AI criterion was: "If a machine acts as reasonably as a human, then it is as intelligent as a human."

Steve Wozniak subsequently proposed his own version of the test, called the Coffee Test: a machine (robot) entering an unfamiliar American house should be able to find all the ingredients and appliances needed to make a cup of coffee :)

Seven years later, in 1957, Frank Rosenblatt created the first Perceptron (the simplest artificial neural network), giving the U.S. military the hope of creating a machine that can walk, talk, see, write, reproduce itself and be aware of its existence.

Today, these concepts read more like scripts for the series "Black Mirror" and are still far from reality.

Another, more practical position is taken by scientists engaged in applied research in the field of AI (Applied AI). Within this paradigm, ML is a subset of AI. AI can be defined as an algorithm that finds a pattern in the input data or evaluates it; the algorithm then uses these results to make decisions, as if it were a human. We will consider the problem of detecting and blocking cyberattacks from the perspective of Applied AI.

Figure 2. The Euler diagrams illustrating the difference in the paradigms of General AI (left) and Applied AI (right).

Cyberattacks: well-known and not so well-known

Two approaches are traditionally distinguished for identifying cyberattacks: deterministic and probabilistic. The first typically relies on signatures: unique byte sequences that describe malicious objects (files, processes, network connections, Windows registry keys, kernel synchronization objects) and allow a defender to uniquely identify known cyberattacks automatically.

The second approach is mainly used to block unknown or zero-day threats within targeted attacks, when the indicators of compromise are not known in advance. As the name implies, this approach identifies new cyberattacks with some probability, leaving the last word to a user or a cybersecurity expert. The probabilistic approach opens up a wide field for the use of AI with ML.
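To make the contrast concrete, here is a minimal sketch of how the two approaches differ: a signature check gives a hard yes/no on known threats, while a probabilistic detector scores an object and leaves borderline cases to an analyst. The signature entry and thresholds below are hypothetical, chosen purely for illustration.

```python
import hashlib

# Hypothetical signature database; real ones hold millions of entries.
KNOWN_BAD_SHA256 = {
    hashlib.sha256(b"malicious-payload-example").hexdigest(),
}

def deterministic_verdict(sample: bytes) -> bool:
    """Signature check: flag a sample only if its hash matches a known-bad entry."""
    return hashlib.sha256(sample).hexdigest() in KNOWN_BAD_SHA256

def probabilistic_verdict(score: float, threshold: float = 0.8) -> str:
    """Probabilistic check: compare an ML model's maliciousness score (0..1)
    to tunable thresholds, escalating borderline cases to a human expert."""
    if score >= threshold:
        return "malicious"
    if score >= 0.5:
        return "suspicious: escalate to analyst"
    return "benign"
```

Note that the signature matches only the exact known bytes: changing a single byte of the sample evades it, which is precisely the gap the probabilistic approach is meant to fill.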

Machine learning triad

Commonly, ML has three main approaches:
  1. Supervised learning, which can be used for:
    • Classification and recognition;
    • Pattern recognition;
    • Supervised anomaly detection;
    • Forecasting (regression analysis).
  2. Unsupervised learning:
    • Clustering;
    • Pattern recognition;
    • Unsupervised anomaly detection.
  3. Reinforcement learning:
    • Robot control;
    • Game playing: algorithms for Go (AlphaGo), chess (AlphaZero), checkers, and online strategy games. The AI initially knows only the rules of the game and develops strategies while playing against other instances of itself.
There are two more hybrid approaches:
  1. Semi-supervised learning: training on input data where only some samples carry a class label.
  2. Self-supervised learning: the class labels are already implicit in the input data. For example, from an existing set of images we generate new ones by rotating each image by 90, 180, or 270 degrees; the task of the model is to learn how to turn the rotated images back to their original position. Another example is reassembling a jigsaw puzzle after cutting an image into pieces. In both tasks, we know the desired result and teach the model to find it.
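The rotation example above can be sketched in a few lines: the labels come for free from the transformation we applied, so no manual labeling is needed. A minimal sketch using NumPy, where images are just 2-D arrays:

```python
import numpy as np

def make_rotation_dataset(images):
    """Self-supervised labeling: rotate each image by 0, 90, 180, and 270 degrees
    and use the number of quarter-turns (0..3) as a free class label."""
    samples, labels = [], []
    for img in images:
        for k in range(4):
            samples.append(np.rot90(img, k))  # k counter-clockwise quarter-turns
            labels.append(k)
    return samples, labels
```

A model trained to predict `k` has to learn what "upright" content looks like; rotating a sample back by `-k` quarter-turns recovers the original image.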

Supervised attack detection
The most popular task in the field of cybersecurity is binary classification, i.e. dividing objects into malicious and benign in order to detect and block cyberattacks. This implies that, in addition to the classes themselves, we already know the criteria by which we assign objects to them. To develop these criteria, we must have knowledge of the threat. In other words, classification methods let you identify known types of threats and attack vectors but do not let you identify a new cyberattack that avoids traditional techniques and does not contain indicators known to us.

In addition, the creation of a binary classifier is fraught with a number of problems such as the lack of labeled data and adequate feature extraction.
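For intuition, here is a minimal sketch of such a binary classifier: a hand-rolled logistic regression over a few toy URL features. The features and training URLs below are hypothetical, chosen purely for illustration; a real detector would use far richer features and much more data.

```python
import numpy as np

def url_features(url: str) -> np.ndarray:
    """Toy feature vector for a URL (hypothetical features, for illustration)."""
    return np.array([
        len(url) / 100.0,                # phishing URLs tend to be long
        url.count(".") / 10.0,           # many subdomains look suspicious
        float("@" in url),               # '@' can disguise the real host
        float(url.startswith("https")),  # missing HTTPS as a weak signal
    ])

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Minimal logistic regression trained with batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def classify(url, w, b):
    p = 1.0 / (1.0 + np.exp(-(url_features(url) @ w + b)))
    return "malicious" if p >= 0.5 else "benign"
```

Even this toy model illustrates both problems above: it is only as good as the labeled URLs it was trained on, and its four hand-picked features decide what it can and cannot see.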

The lack of labeled data
The first problem is the lack of a large amount of classified data that already has a label (e.g. an object is malicious or benign, it is an indicator of attack or normal activity) that could be suitable for supervised training.

For example, to train a neural network to distinguish cats from dogs, you must first provide labeled images. Such labeling is usually done manually, which is difficult on large amounts of data. Fortunately, for cats and dogs we can use existing Kaggle datasets containing 25,000 labeled images. You can also use CAPTCHA so that users categorize images while passing a test that tells humans apart from Internet crawlers.

"To prove that you are not a robot, select the pictures that show the shelters in which you would hide during the rise of machines"

However, neither of these options works when it comes to cyberattacks. First, cyberattack techniques evolve faster than cats and dogs, so the model must be regularly retrained on new attack patterns. Second, ordinary users cannot be involved in classifying cyber threats because the task is far from trivial.

In the next post, we'll create a basic classifier to detect phishing URLs.
