Continuous Monitoring of OT Key Risk Indicators (KRIs)

Sep 05, 2021 | Radiflow team

The problem with “guesstimating” the probability of a threat

I usually start the process of assessing risk by creating a risk register, filling in the risk statement, description, details on the loss scenario and so on, eventually working my way to the risk analysis part of the register.

The classic formula for risk is:

Risk = Probability x Loss

And in the popular modern Open FAIR™ Risk methodology:

$ Value of Cyber Risk = SUM (LEF X $ ML)

Where:

ML is the Magnitude of Loss for the Asset at Risk
LEF is the Loss Event Frequency which is derived from the combination of:
- Threat Event Frequency (TEF): the number of times over the next 12 months the threat is likely to materialize, and
- Vulnerability: the percentage of threat events that are likely to result in loss events, based on Threat Capability and on Resistance Strength given the security controls installed
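
To make the arithmetic concrete, here is a minimal sketch using single point values. The function names and the numbers are hypothetical and chosen only for illustration; real Open FAIR tooling works with ranges rather than point values.

```python
def loss_event_frequency(tef: float, vulnerability: float) -> float:
    """LEF = TEF x Vulnerability, in loss events per year."""
    if not 0.0 <= vulnerability <= 1.0:
        raise ValueError("vulnerability must be a fraction between 0 and 1")
    return tef * vulnerability


def annual_risk(scenarios: list[tuple[float, float]]) -> float:
    """Sum LEF x ML over (lef, loss_magnitude) pairs, one pair per scenario."""
    return sum(lef * ml for lef, ml in scenarios)


# Illustrative values only (not taken from any real assessment):
# 2 threat events per year, a quarter of which become loss events.
lef = loss_event_frequency(tef=2.0, vulnerability=0.25)
risk = annual_risk([(lef, 400_000.0)])  # one scenario with a $400K loss magnitude
print(f"LEF = {lef} loss events/year, risk = ${risk:,.0f}/year")
```

The split matters because TEF and Vulnerability are estimated from different evidence, so an error in either one propagates directly into LEF.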


Better input for better results

Many of these parameters are similar and confusing:

“Confusing a loss event with a threat event in an analysis will lead to inaccurate results. Remember, Loss Event Frequency is how often the organization actually suffers a loss and the damaging event materializes.” (Source: the FAIR Institute)

Furthermore, since many of these parameters are not available as accurate figures, the calculation uses ranges and a statistical simulation to estimate the probable range of the risk value.
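
As a rough illustration of what such a statistical simulation does, the sketch below draws LEF and ML from (min, most likely, max) ranges and reports the spread of the resulting annual loss. It is a simplification under stated assumptions: it uses triangular distributions from the Python standard library, which is my stand-in and not necessarily what any FAIR tool uses, and all ranges are invented.

```python
import random
import statistics


def simulate_annual_loss(lef_range, ml_range, runs=10_000, seed=1):
    """Monte Carlo sketch: each range is (min, most_likely, max).

    Returns the simulated annual losses sorted in ascending order.
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(runs):
        # random.triangular takes (low, high, mode), in that order
        lef = rng.triangular(lef_range[0], lef_range[2], lef_range[1])
        ml = rng.triangular(ml_range[0], ml_range[2], ml_range[1])
        losses.append(lef * ml)
    return sorted(losses)


# Hypothetical input ranges, for illustration only
losses = simulate_annual_loss(
    lef_range=(0.1, 0.5, 2.0),               # loss events per year
    ml_range=(50_000, 200_000, 1_000_000),   # dollars per loss event
)
p10 = losses[len(losses) // 10]
p90 = losses[len(losses) * 9 // 10]
print(f"mean={statistics.mean(losses):,.0f}  p10={p10:,.0f}  p90={p90:,.0f}")
```

The wider the input ranges, the wider the gap between the low and high percentiles of the output, which is exactly the problem the rest of this post tries to address.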

For the ML section, filling in the boxes (for potential loss) isn’t too bad, as they can be defined based on known business parameters such as revenue loss, cost of response and so on.

But then we come to the “head-scratcher” boxes of LEF, TEF and Vulnerability. What values do you enter in those?

Am I to hunt through the latest cyber reports and look at my geolocation and industry breach history? Does past history predict future scenarios? Are all food manufacturers the same? And where am I in this benchmarking cauldron?

So, as often happens with too many vague parameters, you reach a state of GIGO (Garbage In, Garbage Out): either your calculations are wrong, or the range of the resulting risk score is too wide to be useful for decision-making.

Applying a Data-Driven approach

In this article I propose that in order to answer the above questions, we need to use a data-driven approach which combines OT breach attack simulation (OT-BAS) with statistical simulation techniques.

By using virtual OT-BAS we are able to obtain data points on system vulnerability, i.e., Threat Capability and Resistance Strength, for the specific production system under consideration (SUC), not just generic information. Combining breach attack simulation with statistical simulation means that the values entered into the statistical simulation tool come from data rather than from “guesstimating”.

Using the above approach, we can reduce our input variance on threat actor capabilities and resistance strength and thus narrow down the value range of risk we get as an output.

For example, I’ll use the FAIR-U tool on a Phishing database breach scenario supplied with the tool.

For the purpose of this post, I won’t change the Loss Magnitude (ML) side, and only concentrate on the left side, Loss Event Frequency (LEF).

Our starting point: using common sense (I hope my sense is common), with no “prior” information on what values to enter, i.e., the “guesstimate” methodology.

The initial values I entered are: a fixed 50% for threat capability, a fixed 50% for probability of action, 40-60-80% (min, most likely, max) for resistance strength, and 1-2-4 for contact frequency.

As you can see in the chart below, the starting-point LEF looks good, with values in the tolerable risk area.

Next, I’ll run the same statistical simulation with values from an OT-BAS simulation.
After entering the resistance strength and threat capability derived from the OT-BAS, which take into account the security level achieved (SLA) at the site, the digital image, and the relevant threat intelligence (TI), the resulting risk values look very different, and not for the better.

Line	Parameter	Min	Most likely	Max	Notes
	Start:
1	Resistance strength (%)	40	60	80	Vulnerability
2	Threat Capability (%)	50	50	50	Vulnerability
3	Contact Frequency	1	2	4	TEF
4	Probability of Action (%)	50	50	50	TEF
5	Risk	$0	Avg $138K	$8.1M	Within tolerable risk

6	BAS:
7	Resistance strength (%)	30	40	50	Site SLA using BAS
8	Threat Capability (%)	60	70	80	TI, MITRE, BAS
9	Contact Frequency	1	2	4	No change
10	Probability of Action (%)	60	70	80	Insight from (7)(8)
11	Risk	$0	Avg $4.3M	$22.7M	Exceeding tolerable risk
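
To see why the two input sets in the table diverge so sharply, the following sketch simulates only the LEF side for each. It rests on assumptions of mine: standard FAIR relationships (TEF = Contact Frequency x Probability of Action, and a threat event becomes a loss event when Threat Capability exceeds Resistance Strength), triangular distributions, and hypothetical helper names. It does not reproduce FAIR-U's internals or the dollar figures above, since the loss magnitude inputs are not part of this post.

```python
import random


def tri(rng, low, mode, high):
    """Sample a triangular distribution from (min, most likely, max)."""
    if low == high:  # fixed value such as 50/50/50
        return float(low)
    return rng.triangular(low, high, mode)


def simulate_lef(inputs, runs=20_000, seed=7):
    """Return (vulnerability, mean LEF) for one set of LEF-side inputs.

    inputs maps each factor to (min, most_likely, max); percentages are 0-100.
    """
    rng = random.Random(seed)
    loss_events = 0
    lef_total = 0.0
    for _ in range(runs):
        rs = tri(rng, *inputs["resistance_strength"])
        tc = tri(rng, *inputs["threat_capability"])
        cf = tri(rng, *inputs["contact_frequency"])
        poa = tri(rng, *inputs["probability_of_action"]) / 100.0
        tef = cf * poa          # threat events per year in this draw
        if tc > rs:             # the attacker out-matches the controls
            loss_events += 1
            lef_total += tef
    return loss_events / runs, lef_total / runs


# Input rows 1-4 ("Start") and 7-10 ("BAS") from the table
start = {
    "resistance_strength": (40, 60, 80),
    "threat_capability": (50, 50, 50),
    "contact_frequency": (1, 2, 4),
    "probability_of_action": (50, 50, 50),
}
bas = {
    "resistance_strength": (30, 40, 50),
    "threat_capability": (60, 70, 80),
    "contact_frequency": (1, 2, 4),
    "probability_of_action": (60, 70, 80),
}

start_vuln, start_lef = simulate_lef(start)
bas_vuln, bas_lef = simulate_lef(bas)
print(f"Start: vulnerability={start_vuln:.2f}, LEF={start_lef:.2f}/year")
print(f"BAS:   vulnerability={bas_vuln:.2f}, LEF={bas_lef:.2f}/year")
```

With the BAS inputs, the threat capability range (60-80) sits entirely above the resistance strength range (30-50), so in this simplified model every threat event becomes a loss event. With the starting inputs, only a small share of threat events do.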

Continuous Risk Monitoring

In today’s ever-changing environment, an annual risk assessment is no longer sufficient. To keep LEF current as the threat landscape and vulnerabilities change, we need to continuously monitor key risk indicators (KRIs) that alert us to changes.

The Institute of Operational Risk (IOR) defines KRIs as metrics that provide information on the level of exposure to a given operational risk which the organization has at a particular point in time.

KRIs are an early warning system of changes in our threat landscape and system vulnerabilities, which provide the needed time to proactively address changes in our risk posture.

Using the changes in these KRIs, we re-run our OT-BAS and enter the new values from the OT-BAS in our risk register, using a statistical simulation tool for the probability ranges.

In order to continuously track changes in LEF, we recommend assigning KRIs to TEF and Vulnerability, in particular to probability of action, threat capability and resistance strength.

Example KRIs for LEF:

TEF KRIs: “How many times will the asset face a threat action?”

Geolocation
Industrial sector
Active adversaries
Adversary capability and ATT

Vulnerability KRIs: “What percentage of threat events are likely to result in loss events?”

Production functionality and topology
Distribution of security controls
Possible threat scenarios causing a loss
Connectivity, interdependencies
Escalation and propagation of a loss scenario
Vulnerabilities in system and procedures

Use the OT-BAS to determine and prioritize “Key” Indicators

It’s important to note that KRI scores are dynamic. They may not change as frequently as events per second (EPS) arrive at a SIEM, but KRI alerts need to be timely enough to identify the shift in risk posture and give us the time needed to adjust our defenses.

Changes to a KRI signal a change in the level of risk exposure associated with specific processes and activities. Thus, KRIs are proactive metrics used by organizations to provide an early signal of increasing risk exposures in various areas of the enterprise.

We recommend the following workflow:

Understand TEF
Understand Vulnerability
Understand whether a threat event scenario is a loss scenario
Address only scenarios that cause loss
Use KRIs for each loss threat category – no more than 5 KRIs.
Monitor changes in the KRIs and re-evaluate risk for each such change
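
The last two steps of this workflow can be sketched as a simple comparison of KRI snapshots. Everything here is hypothetical (the `KRI` class, the field names, and the example values); it only illustrates the idea that a changed KRI should trigger a re-run of the OT-BAS and a re-evaluation of the affected risk.

```python
from dataclasses import dataclass

MAX_KRIS_PER_CATEGORY = 5  # "no more than 5 KRIs" per loss threat category


@dataclass(frozen=True)
class KRI:
    name: str
    factor: str   # which LEF input it informs: "TEF" or "Vulnerability"
    value: str    # last observed state of the indicator


def changed_kris(baseline, current):
    """Return the KRIs in `current` that are new or differ from `baseline`."""
    if len(current) > MAX_KRIS_PER_CATEGORY:
        raise ValueError("too many KRIs for one loss threat category")
    previous = {k.name: k.value for k in baseline}
    return [k for k in current if previous.get(k.name) != k.value]


# Hypothetical snapshots for one loss scenario
baseline = [
    KRI("Active adversaries", "TEF", "2 tracked groups"),
    KRI("Connectivity to asset", "Vulnerability", "isolated segment"),
]
current = [
    KRI("Active adversaries", "TEF", "2 tracked groups"),
    KRI("Connectivity to asset", "Vulnerability", "remote access enabled"),
]

changes = changed_kris(baseline, current)
for kri in changes:
    print(f"KRI changed: {kri.name} ({kri.factor}) -> re-run OT-BAS and re-evaluate risk")
```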

Example of a risk register with KRIs

KRIs are added to the risk register to proactively recommend mitigation controls that would reduce the risk before a loss event happens.

Below is an example of the extended risk register with quantitative values to address LEF.

Risk category:	OT operational
Risk Program:	Cyber Origin / network connectivity
Risk Title (scenario):	Loss of control on heat level – boiler tank A12; remote safe shut-down not possible
Frequency of scenario as security target LEF (times per year):	1<N<=5
Current values after BAS
Current Threat likelihood of risk title (simulated):	80% (High)
Resistance strength	40%
Threat Capability	70%
Estimated current LEF	1<N<=10
BAS simulation to SLT3
Threat likelihood of risk mitigated to IEC 62443 SLT3 (simulated):	45% (Medium)
Resistance strength	85%
Threat Capability	70%
Estimated simulated mitigated LEF at SLT3	1<N<=5
Overall Impact rating:	High (the preceding impact calculation stages are omitted for simplicity)
Overall risk rating:	High (80%)
Risk tolerance:	Medium (45%)
Risk response:	Mitigate: bring the overall risk rating down to the tolerance level by reducing threat likelihood (target SLA = SLT3)
KRIs for risk title:	New ATT and cyber tools; Change of asset vendor; New common vulnerabilities; Change in connectivity to asset; Change of project on logic controller
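
For readers who keep their register in code or export it to other tools, the entry above could be represented roughly as follows. This is a sketch under assumptions: the class, its field names, and the "accept" and "escalate" outcomes are hypothetical; only the percentages and the "mitigate" response come from the example register.

```python
from dataclasses import dataclass, field


@dataclass
class RiskRegisterEntry:
    title: str
    current_likelihood: float     # simulated threat likelihood today
    mitigated_likelihood: float   # simulated likelihood at the target security level
    tolerance: float              # risk tolerance for this scenario
    kris: list[str] = field(default_factory=list)

    def response(self) -> str:
        """Pick a response by comparing likelihoods with the tolerance."""
        if self.current_likelihood <= self.tolerance:
            return "accept"       # hypothetical label: already within tolerance
        if self.mitigated_likelihood <= self.tolerance:
            return "mitigate"     # planned controls bring the risk within tolerance
        return "escalate"         # hypothetical label: target level is not enough


entry = RiskRegisterEntry(
    title="Loss of control on heat level, boiler tank A12",
    current_likelihood=0.80,      # 80% (High) after BAS
    mitigated_likelihood=0.45,    # 45% (Medium) simulated at SLT3
    tolerance=0.45,               # Medium (45%)
    kris=[
        "Change of asset vendor",
        "New common vulnerabilities",
        "Change in connectivity to asset",
        "Change of project on logic controller",
    ],
)
print(entry.response())
```

Keeping the KRIs on the same record as the likelihood values makes it clear which entries need re-evaluation when an indicator changes.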

So the next time we are challenged by the “head-scratcher” Loss Event Frequency, we recommend a data-driven approach: use statistical tools such as FAIR-U, and add data points derived from breach attack simulation on your specific production environment, thus reducing the ranges of inputs for the statistical calculation.
