It’s difficult to talk about security analytics without considering machine learning. Machine learning is used to detect malicious websites, flow anomalies, infected files, infected endpoints and user behavior anomalies. It’s applied to big data repositories to glean information and insights that might otherwise go undetected.

Multiple industries are using machine learning to better automate security screening, border entry, college applicant selection, loan analytics and health care. Behind the scenes, almost every industry that affects our daily lives involves some type of machine learning.

Training the System

Machine learning builds statistical models from existing data and applies what it has learned to new data sets. In the case of college applicants, admission analysts train the system by feeding it transcripts, financial information, demographic information, high school information, SAT scores and any other data that may seem relevant to accepting an applicant. In the case of network security, security analysts train the system with web browsing tendencies, entry/exit data, email tendencies, login authentication data and any other available user behavioral analytics.

The goal is to identify and classify anomalous situations that serve to train the system. This sounds great, doesn’t it? Not so fast: in its infancy, machine learning can produce errant results.

Machine Learning Flunks Its First Tests

For example, my son was recently denied a home mortgage loan. He had good credit and a steady job, and he met the minimum standards for a down payment. His application was simply denied with little explanation other than that the computer had determined he was a high risk. After weeks of significant pressure on the bank, we discovered that his job was classified as a high-risk occupation for long-term employment.

One of my colleagues was also recently identified as a high-risk user based upon his browsing habits and voice-over-IP (VoIP) usage. Full disclosure identified his geolocated communications as the culprit, coupled with his web browsing habits. The web browsing habits could not be identified or fully disclosed since they were simply “anomalous.” I suspect it was because he often communicated with family in his home country, and the geolocation pointed to a country associated with cybercrime.

Accuracy and Classification

Machine learning makes complex statistical decisions on data based solely on the accuracy of classification. It recursively quantifies and correlates millions of potential decision trees until it arrives at the most accurate classification. In human terms, it does not understand why these decisions make sense, only which decisions are most accurate for the classifications it was given. This is a real problem.

The above diagram is a very simple decision tree derived using machine learning classifiers. The ellipses are different data sets used by the classifiers. If this were a classifier for loan applications, would it make sense to a human? Machine learning makes decisions based on best-guess algorithms, but more importantly, it makes decisions that have no apparent human explanation.
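To make the point concrete, here is a minimal sketch of accuracy-driven rule selection. It is not the classifier behind the diagram: the loan records, the feature names and the single-split "stump" are all assumptions invented for illustration. The search tries every feature and threshold and keeps whichever rule scores best on the training data, with no notion of why that rule works.

```python
from typing import List, Tuple

# Hypothetical loan records, invented for illustration:
# (credit_score, years_in_job, occupation_code) -> 1 = repaid, 0 = defaulted.
FEATURES = ["credit_score", "years_in_job", "occupation_code"]
ROWS: List[Tuple[Tuple[float, ...], int]] = [
    ((720, 5, 3), 1),
    ((680, 2, 7), 0),
    ((750, 8, 2), 1),
    ((640, 1, 7), 0),
    ((700, 6, 7), 0),
    ((690, 3, 1), 1),
]

def accuracy(feature: int, threshold: float, rows) -> float:
    """Fraction of rows the rule 'value <= threshold means approve' gets right."""
    hits = sum(1 for x, y in rows if (1 if x[feature] <= threshold else 0) == y)
    return hits / len(rows)

def best_stump(rows):
    """Try every feature/threshold pair and keep the highest-scoring one."""
    best = (0.0, None, None)
    for f in range(len(FEATURES)):
        for threshold in sorted({x[f] for x, _ in rows}):
            score = accuracy(f, threshold, rows)
            if score > best[0]:
                best = (score, f, threshold)
    return best

score, feature, threshold = best_stump(ROWS)
print(f"approve if {FEATURES[feature]} <= {threshold} (training accuracy {score:.2f})")
# prints: approve if occupation_code <= 3 (training accuracy 1.00)
```

On this toy data the search settles on the occupation code because it separates the labels perfectly. Nothing in the procedure asks whether that is a sensible reason to deny a loan.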

In fact, the main benefit of machine learning, the ability to make decisions that are not humanly evident, is also its potential danger. Imagine machine learning inaccurately identifying a website as malicious and then blocking access to it. The owner wants an explanation and remediation, but the classifier cannot explain its decision.

Human Touch

Machine learning is gradually touching every part of our lives and making decisions of which we may not be fully aware. There is a significant need to disclose both the underlying data and the classification schemes of these processes. IBM is currently working with machine learning analytics to determine domain or website maliciousness. With this comes the ethical responsibility to disclose the information and decision analytics that determine benign or malicious intent. If we deny access to a website, we must then provide the human explanation and core data that drove this action.
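One way to picture this kind of disclosure is a classifier that returns its reasons and the underlying values along with the verdict. The sketch below is an assumption-laden illustration, not a description of IBM's analytics: a small hand-written rule set stands in for a trained model, and the field names, thresholds and the two-reason cutoff are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical signals and thresholds, invented for illustration.
RULES = [
    ("domain_age_days", lambda v: v < 30, "domain registered less than 30 days ago"),
    ("dns_ip_changes_per_day", lambda v: v > 10, "DNS records change more than 10 times a day"),
    ("on_blocklist", lambda v: bool(v), "domain appears on a blocklist"),
]

def classify_with_explanation(domain_data: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return a verdict plus the plain-language reasons and raw values behind it."""
    reasons = []
    for field, test, description in RULES:
        value = domain_data[field]
        if test(value):
            # Keep the core data next to the human-readable reason.
            reasons.append(f"{description} ({field}={value})")
    # Assumed policy: two or more triggered signals means "malicious".
    verdict = "malicious" if len(reasons) >= 2 else "benign"
    return verdict, reasons

verdict, reasons = classify_with_explanation(
    {"domain_age_days": 12, "dns_ip_changes_per_day": 40, "on_blocklist": 0}
)
print(verdict)
for reason in reasons:
    print(" -", reason)
```

Because each reason carries the value that triggered it, a website owner who disputes the verdict has something specific to challenge.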

In security, this level of responsibility also has a legal dimension. What is the damage to the website owner if access is inappropriately denied? IBM is building both traceability and disclosure into our Domain Name System (DNS) security analytics and believes it will be a significant differentiator. It also carries interesting side effects that involve human interaction to reclassify incorrect data: explain the decision in human terms and then allow someone to educate the classifier with new data. Maybe we add a “like” or “don’t like” button for misclassified data.
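The reclassification idea can be sketched as a small feedback loop. This is only an illustration under stated assumptions: a one-nearest-neighbor classifier stands in for the real model, the two numeric features are unnamed placeholders, and `record_feedback` is a hypothetical helper playing the role of the “don’t like” button.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Sample = Tuple[Point, str]  # ((feature_a, feature_b), label)

def predict(training: List[Sample], point: Point) -> str:
    """1-nearest-neighbor: return the label of the closest training sample."""
    def dist2(sample: Sample) -> float:
        (a, b), _ = sample
        return (a - point[0]) ** 2 + (b - point[1]) ** 2
    return min(training, key=dist2)[1]

def record_feedback(training: List[Sample], point: Point, corrected_label: str) -> None:
    """Store a human correction so later predictions take it into account."""
    training.append((point, corrected_label))

# Invented training data: one known-malicious and one known-benign example.
training: List[Sample] = [((0.9, 0.8), "malicious"), ((0.1, 0.2), "benign")]
site = (0.7, 0.6)

before = predict(training, site)
record_feedback(training, site, "benign")  # a reviewer marks the verdict as wrong
after = predict(training, site)
print(before, "->", after)
# prints: malicious -> benign
```

A production system would need to vet such feedback before trusting it, since an attacker could use the same button to teach the classifier that a malicious site is benign.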

IBM prides itself on business ethics as one of its core foundations to help build trust with consumers. Therefore, we’re building transparency into our machine learning analytics and striving to be right more often than we’re wrong. I would encourage the decision-makers who evaluate machine learning products to challenge the transparency of the offering and demand humanly interpretable audits of outcomes.
