The first post in this three-part series covers two sections: "The usual" and "Architectural".
Have you ever watched the clip where Charlie Chaplin works in a factory? If not, watch it; it is called “Modern Times-Factory work”. In short, Chaplin has to work on pieces of metal very fast, without missing a single one, before each piece reaches the next person, who does something else to it, and the chain continues until the product reaches the main machinery. I mention it because in some SOC environments an analyst (Chaplin) has to work on alerts (the parts) FAST, WITHOUT A MISS, and PASS THEM ON TO OTHERS.
According to a study, an analyst typically spends an average of 15 minutes triaging an alert, which includes looking at the alert, scanning the logs, forming a hypothesis, and finding user-attributable information. That is a lot of work in a very short time, but it is normal in many SOC environments. Many things can get in the way of hitting this metric, or even coming close to it, at scale. A SOC is like a revolving building: it has a lot of moving parts. At a minimum, we have log shippers, log aggregators, log brokers, and alert engines on the technology side. On the people and process side, there are policies, procedures, knowledge sets, metrics, analyst hierarchy, and playbooks. For an analyst to shorten the time spent on an alert while staying effective and efficient, all of these parts need to work together coherently.
To give you some perspective on why this is important, according to Infosecurity Magazine, a typical SOC receives around 174,000 alerts per week.
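To put the two figures quoted so far side by side, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes every alert is triaged manually at the 15-minute average, and the 40-hour analyst week is my own assumption, not a number from either source.

```python
# Figures quoted in this post.
ALERTS_PER_WEEK = 174_000        # Infosecurity Magazine figure
TRIAGE_MINUTES_PER_ALERT = 15    # average triage time from the study

# Assumption for illustration only: one analyst works 40 hours a week
# and does nothing but triage.
ANALYST_HOURS_PER_WEEK = 40

alerts_per_day = ALERTS_PER_WEEK / 7
triage_hours_per_week = ALERTS_PER_WEEK * TRIAGE_MINUTES_PER_ALERT / 60
analysts_needed = triage_hours_per_week / ANALYST_HOURS_PER_WEEK

print(f"Alerts per day: {alerts_per_day:,.0f}")
print(f"Triage hours per week: {triage_hours_per_week:,.0f}")
print(f"Full-time analysts needed: {analysts_needed:,.1f}")
```

No SOC triages every alert by hand, so treat the result as an illustration of scale rather than a staffing estimate.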
We will take a look at some of the issues we (the SOC) face in the moving parts mentioned above, which may keep an analyst from being efficient and consistent in their triage process. More importantly, we will also discuss what an analyst can do to overcome these issues. Remember, many things are above the analyst’s pay grade, so we will only discuss what we as analysts have some control over.
Let’s get to it.
The usual:
As we discussed before, the number of alerts some SOCs have to deal with is just shy of 25,000/day. Two issues arise from this: analyst burnout and tribal knowledge. Analyst burnout can have many causes, but one of the main ones is the sheer number of alerts. When the number of alerts increases, an analyst's already heavy workload spikes. Now add “less efficient/skilled staff” to the mix. Together, these two factors eat into the resources (time, energy, etc.) of an already drained analyst. This problem has created another problem: tribal knowledge.
Because veteran analysts know their environment well, they end up in a marathon to close or escalate these alerts, which creates a distance between them (the know-alls) and new analysts. Tribal knowledge is great, and we do need people who possess it, but it does not scale across the organization, and it is lost when that person leaves. This will hinder the SOC team if the knowledge is not distributed to others through playbooks, knowledge bits, training sessions, and similar means. If you think this is not a real problem, read the book “The Phoenix Project” and you will understand its effect at the team and organizational level.
Architectural:
One of the tools that we see in almost every SOC is a SIEM. Its features may differ from organization to organization depending on maturity, comfort level, budget, staffing, and other factors, but the main purpose of a SIEM is to bring together logs from point products, applications (both commercial and custom), and other places such as the cloud. If there is one place where an analyst spends their time other than with family, friends, and pets, it is the SIEM.
A SIEM provides huge value to an organization if it is planned and designed right. There are many moving parts in a SIEM, which means an analyst's work can be hindered at many levels. As we saw before, the important components of a SIEM are the data/log shippers, log aggregators, storage, and the alert engine. Each one poses a different issue, and if these issues are not accounted for, the SIEM will fail and/or the analyst’s workflow will suffer.
Let us drive this home with an example of how each component can hinder an analyst’s work. An analyst (Chaplin) wants to work on an alert, but he has neither the rule that the alert triggered on nor all the context he needs around it. So he takes whatever information the alert gives him and searches the SIEM’s storage using the SIEM’s search engine. The information he wants does not come back right away, because there is too much data in the SIEM’s storage and the SIEM was not planned or designed to handle significant amounts of writing, indexing, and querying. So Chaplin trims his search and concentrates on one particular data set. After the search crawls for a while, he sees some results related to the alert. The documents or logs returned are not enriched, they have no common naming convention that would let him correlate them with other log sources, and on top of that, the results are skewed because of time format or time zone issues.
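The last two problems in Chaplin's story, inconsistent field names and mixed time zones, are usually handled by normalizing events before they are indexed. The snippet below is a minimal sketch of that idea and not how any particular SIEM does it: the field names, the alias table, and the sample events are all made up for illustration, and timestamps without an offset are assumed to be in UTC unless you say otherwise.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical alias table: source-specific field names -> one common name.
FIELD_ALIASES = {
    "src": "source_ip",
    "SourceAddress": "source_ip",
    "UserName": "username",
    "ts": "timestamp",
    "EventTime": "timestamp",
}


def to_utc(value: str, assumed_offset_hours: int = 0) -> str:
    """Parse an ISO 8601 timestamp and return it as an ISO string in UTC.

    Timestamps with no offset are assumed to be `assumed_offset_hours`
    away from UTC (an assumption the log source owner has to confirm).
    """
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(
            tzinfo=timezone(timedelta(hours=assumed_offset_hours))
        )
    return parsed.astimezone(timezone.utc).isoformat()


def normalize(event: dict, assumed_offset_hours: int = 0) -> dict:
    """Rename known fields to the common schema and convert the time to UTC."""
    normalized = {FIELD_ALIASES.get(key, key): value for key, value in event.items()}
    if "timestamp" in normalized:
        normalized["timestamp"] = to_utc(
            normalized["timestamp"], assumed_offset_hours
        )
    return normalized


# Two made-up events describing the same moment from different log sources.
firewall_event = {"src": "10.0.0.5", "ts": "2021-03-01T09:15:00-05:00"}
proxy_event = {
    "SourceAddress": "10.0.0.5",
    "UserName": "jdoe",
    "EventTime": "2021-03-01T14:15:00",
}

for raw_event in (firewall_event, proxy_event):
    print(normalize(raw_event))
```

After normalization both events share the `source_ip` field and the same UTC timestamp, so a single search can line them up instead of Chaplin doing it by eye.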
Thank you for reading this far. We will discuss the remaining topics in the next two posts. You can check them here and here.