So, you want to build a SOC?

Lounge Fly
Aug 31, 2020

Having worked in cyber incident response, detection, and intelligence for the last seven years, and having helped stand up detection capabilities across organizations, I hope to share some key components of setting up successful security operations.

Setting up a SOC can be a complicated beast, and this is by no means a comprehensive how-to guide that aims to cover it all. The goal here is to highlight some concepts I wish I had known back then, after some hard lessons learned.

Here is a list of some key considerations.

Assess your Maturity

Building a SOC or detection capabilities can be a massive undertaking, and not all SOCs are created equal. Larger organizations that essentially print money can have a ton of resources at their disposal, and the sky is the limit. Smaller organizations might have to take a more subtle approach and focus on what’s important. An organization should understand its gaps, current constraints (whether budgetary or resource-related), scalability (do you expect to grow significantly in the coming years?), governance, threat model, etc.

For some organizations it might be more cost effective to outsource the detection capabilities to a managed security service provider (MSSP). Arguments can be made on the pros and cons of such services. Personally, I tend to prefer these sensitive functions be kept in-house, but it’s worth pointing out that other options exist.

People, Process, Technology is a methodology popular in the IT world aimed at highlighting the important relationship between these three key elements to drive business function. Take all of these elements into consideration when determining where you stand. Do you have enough resources? Are they trained properly or do you have access to the SMEs required to achieve your goals? Do you have an effective governance model in place? How long do you have to retain the data and are you sending it somewhere else for retention? Does your technology support your initiatives?

If people don’t understand the technology or it doesn’t support your goals, you may run into the next problem.

Garbage In, Garbage Out

It’s pretty easy to get data into your SIEM. Most of them offer a number of options, whether it be syslog, file monitoring, APIs, or custom add-ons and applications to facilitate integration. With that said, take a step back and consider what you want to do with the data. All too often, organizations jam data in without really considering why they are doing it. Maybe you have a regulatory requirement you need to fulfill. If that’s the case, I would also think about why the regulatory requirement exists in the first place. For example, let’s say we need authentication data for network devices. What’s the underlying security implication of not having this data, and what do we hope this data achieves outside of checking a box for regulators?

The problem becomes even more complex when we need to understand how authentication works across the enterprise. Are all devices using LDAP? Do we implement multi-factor authentication anywhere? Are any devices using local authentication? Are logging levels configured properly so we can actually see this data before we start sending logs to the SIEM? These types of questions apply to any data source we decide to ingest.

Another example that comes to mind is potentially noisy security tools, which is generally a result of improper configuration within the tool itself. Consider application whitelisting software that, due to poor policy configuration, generates logs any time anything occurs on a system. Spin up an instance of Sysinternals Process Monitor to see how noisy Windows can be. Sending all of this to a SIEM can create more harm than good when analysts are inundated with logs they don’t understand and that carry little context. I always like to tune things upstream at the security tools whenever possible. It alleviates pressure on the SIEM and the analysts and helps provide more actionable data downstream. There is a balancing act, however: you want to be sure you aren’t filtering out data you may need, so always consider the end goal.
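To make the idea concrete, here’s a rough Python sketch of upstream filtering. The event structure and the noisy-process list are made up for illustration; in practice this tuning lives in the tool’s own policy or the forwarding layer, not a script.

```python
# Toy illustration of upstream filtering: drop well-understood, known-noisy
# informational events before they ever reach the SIEM. The event structure
# and the noisy-process list below are hypothetical examples.
NOISY_PROCESSES = {
    r"C:\Windows\System32\svchost.exe",                 # example only; tune to your policy
    r"C:\Program Files\UpdaterService\updater.exe",     # hypothetical noisy updater
}

def should_forward(event: dict) -> bool:
    """Return True if the event should be sent downstream to the SIEM."""
    # Never filter anything your alerting or investigation use cases depend on.
    if event.get("severity", "informational") != "informational":
        return True
    return event.get("process_path") not in NOISY_PROCESSES

events = [
    {"process_path": r"C:\Windows\System32\svchost.exe", "severity": "informational"},
    {"process_path": r"C:\Users\bob\AppData\evil.exe", "severity": "high"},
]
forwarded = [e for e in events if should_forward(e)]
print(f"Forwarding {len(forwarded)} of {len(events)} events")
```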

We mentioned regulatory requirements, but the main goal is improved security and detection capabilities. A good threat model can help you prioritize log data. When building it, be sure to consult other teams such as Red Teams or Intel. These folks can often provide valuable insight into the overall threat landscape and the current techniques used by attackers. Generally, I like to prioritize security tools when it comes to adding value, but arguments can be made for other log sources as well which we will cover in a second.

Common security tools include anti-malware, endpoint detection and response (EDR), data loss prevention (DLP), web proxies, web application firewalls, intrusion detection, and more. I didn’t include basic network firewalls here because, in my experience, they are one of the noisier log sources and don’t always provide a ton of value. I see firewall logs used more during the course of investigations and for context, as opposed to near real-time alerting.

Along with these tools, some of the most important logs are Windows event logs and Sysmon. Having proper auditing in place or a solid Sysmon configuration can yield valuable information and help “Find Evil” within the environment. SwiftOnSecurity has great material on configuring Sysmon in an environment. Sysmon helps bridge the gap on command-line-level logging and other visibility that may not be available at older domain functional levels.
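As a toy example of the kind of visibility this buys you, here’s a rough Python sketch that combs Sysmon Event ID 1 (process creation) events for Office applications spawning command interpreters. The JSON Lines export and the field names are assumptions and will vary by environment and pipeline.

```python
import json

# Minimal sketch: flag Office applications spawning command interpreters,
# a classic "find evil" pattern visible in Sysmon Event ID 1 (process creation).
# The export file and field names are assumptions; adjust to your pipeline.
OFFICE_PARENTS = ("winword.exe", "excel.exe", "powerpnt.exe", "outlook.exe")
SHELLS = ("cmd.exe", "powershell.exe", "wscript.exe", "cscript.exe")

def suspicious(event: dict) -> bool:
    parent = event.get("ParentImage", "").lower()
    child = event.get("Image", "").lower()
    return parent.endswith(OFFICE_PARENTS) and child.endswith(SHELLS)

with open("sysmon_eid1.jsonl") as fh:            # hypothetical JSON Lines export
    for line in fh:
        event = json.loads(line)
        if suspicious(event):
            print(event.get("ParentImage"), "->", event.get("Image"),
                  "|", event.get("CommandLine", ""))
```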

The last log source I’ll mention is DNS. You will find a bit of overlap between DNS and web proxy logs, but DNS bridges a gap in circumstances where the DNS server doesn’t resolve the host, the domain is sinkholed, or the traffic otherwise never makes it to the web proxy. This does happen, and DNS logs can provide useful information on potential attack attempts that may not be visible from the proxy.
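Here’s a minimal sketch of the kind of question DNS logs let you ask, assuming a hypothetical CSV export with client_ip, query, and rcode columns; real DNS log formats vary widely.

```python
import csv
from collections import Counter

# Rough sketch: count NXDOMAIN responses per client to surface hosts that may be
# reaching for sinkholed or non-existent domains (e.g., DGA-style malware).
# The CSV layout (client_ip, query, rcode) is an assumption.
nxdomain_counts = Counter()

with open("dns_logs.csv") as fh:                 # hypothetical export
    for row in csv.DictReader(fh):
        if row.get("rcode") == "NXDOMAIN":
            nxdomain_counts[row["client_ip"]] += 1

THRESHOLD = 50  # arbitrary starting point; baseline against your own environment
for client, count in nxdomain_counts.most_common():
    if count >= THRESHOLD:
        print(f"{client} generated {count} NXDOMAIN responses")
```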

Use Case Development

One of the ways we avoid the garbage in, garbage out scenario is to have a good handle on what use cases we want derived from our threat model. Early in my adventures, I often took it upon myself in immature organizations to start building near real-time alerts for whatever I knew to be bad, using whatever data was available to me. This was a terrible approach; when we develop alerts, or the use cases that drive them, we really want to use a structured approach.

To be clear, when I say near real-time alerts, I’m referring to alerts that trigger when an event occurs in your log data and generally result in some action, whether that’s a ticket being generated or triage by an analyst. These usually have some level of delay depending on how often a search runs. Determining the frequency of a saved search or alert depends on a few factors, most notably the severity of the alert along with the resource constraints of the SIEM. For example, combing through millions of firewall logs over longer periods can stress the environment, resulting in that search or other searches timing out if resources aren’t up to par.

What does a structured development process look like? Well, I can tell you that if you just start building things without considering even small details such as repeatable naming conventions, search optimization, and whether associated runbooks and processes exist, you will find you’re creating more work for yourself in the future (see the sketch after the list below). Other questions you might ask are:

  • Do you have supporting log sources?
  • Who is validating the data and responsible for proper parsing?
  • If you change technology or include other products in your use case, does it conform to some type of common information model?
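One lightweight way to keep development repeatable is to track every use case as structured metadata rather than tribal knowledge. The fields and values below are only a suggestion, not a standard:

```python
from dataclasses import dataclass, field

# Suggested (not prescriptive) metadata to capture for every detection use case,
# so naming, ownership, data dependencies, and runbooks stay consistent.
@dataclass
class UseCase:
    name: str                        # follows the team's naming convention
    description: str
    log_sources: list = field(default_factory=list)
    mitre_techniques: list = field(default_factory=list)
    runbook_url: str = ""            # link to the triage/response runbook
    owner: str = ""
    review_interval_days: int = 180  # periodic review cadence

uc = UseCase(
    name="UC-WIN-0001_Office_Spawning_Shell",
    description="Office application spawning a command interpreter",
    log_sources=["Sysmon EID 1"],
    mitre_techniques=["T1566.001"],
    runbook_url="https://wiki.example.local/runbooks/uc-win-0001",  # hypothetical URL
    owner="Detection Engineering",
)
print(uc.name, "->", uc.mitre_techniques)
```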

These are all things that need consideration before we begin development. Even if you don’t have a ton of people to consult or a threat model in place there are a few places you can visit to get some general ideas.

Sigma is one project that aims to provide a generic signature format for SIEM detection cases, similar in spirit to YARA rules. You can adopt Sigma in your environment and translate the rules of your choice to whatever SIEM you have, but the rule sets also provide good material for a brainstorming session when developing your own detection cases.
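To be clear, real Sigma rules are written in YAML and translated to your SIEM’s query language by the Sigma tooling; the toy Python sketch below is only meant to illustrate the underlying idea of a portable detection definition evaluated against normalized events.

```python
# Toy illustration only -- the real Sigma project defines rules in YAML and
# translates them into SIEM query languages. This sketch just shows the idea
# of a portable detection definition applied to a normalized event.
rule = {
    "title": "Suspicious whoami execution",
    "selection": {"EventID": 1, "Image|endswith": "\\whoami.exe"},
}

def matches(rule: dict, event: dict) -> bool:
    for key, expected in rule["selection"].items():
        field_name, _, modifier = key.partition("|")
        value = str(event.get(field_name, ""))
        if modifier == "endswith":
            if not value.lower().endswith(str(expected).lower()):
                return False
        elif value != str(expected):
            return False
    return True

event = {"EventID": 1, "Image": r"C:\Windows\System32\whoami.exe"}
print(matches(rule, event))  # True
```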

Another avenue to develop good use cases comes from Threat Hunting.

Threat Hunting

Threat hunting isn’t really a new concept, but it is one of those security buzzwords that has been tossed around a lot more over the past couple of years. What is threat hunting? I’m sure there are a few definitions depending on who you ask, but in my opinion the key words are ‘human-driven’: any human-driven activity to search for the bad stuff in your environment that is not yet covered by automated alerts.

I prefer the following life cycle as an approach to threat hunting.

1. Create a hypothesis — A hunt starts by essentially asking a question or making an assumption about some type of activity going on in the environment

2. Investigate — Follow up on the hypothesis by searching through data sets to discover malicious patterns (see the sketch after this list)

3. Uncover — Prove or disprove the hypothesis by uncovering these patterns or anomalies

4. Inform and Enrich — The end goal here outside of finding something bad is to be able to create new alerts from a successful or unsuccessful hunt if the SIEM and associated data support it. Even so-called unsuccessful threat hunts can identify otherwise unknown gaps in log data.
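As a concrete example of the Investigate step, here’s a minimal “stack counting” sketch against web proxy logs: tally every user-agent string and have a human review the rarest ones. Rare doesn’t mean malicious; it just gives the hunter a short, reviewable list. The CSV layout is an assumption.

```python
import csv
from collections import Counter

# Hunt sketch ("stack counting"): tally user-agent strings from proxy logs and
# manually review the rarest entries. The CSV column name is an assumption.
agents = Counter()
with open("proxy_logs.csv") as fh:               # hypothetical export
    for row in csv.DictReader(fh):
        agents[row.get("user_agent", "<missing>")] += 1

print("Least common user agents:")
for agent, count in agents.most_common()[:-11:-1]:   # bottom 10, rarest first
    print(f"{count:6d}  {agent}")
```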

Creating a hypothesis sounds easy, but many analysts simply aren’t thinking about these things on a regular basis. Feel free to set up regular working sessions with your analysts and work through some potential scenarios to get people thinking outside the box. Here are four potential inputs to help drive the threat hunting process:

  • External Intelligence — Information sharing channels, vulnerability analysis, reports, various OSINT, etc.
  • Intrusion Analysis — Analyze your own data to uncover TTPs or actor behaviors and ask yourself whether they are currently being detected.
  • Authorized Knowledge — Work closely with other teams in your organization, such as Red Teams, who may have better knowledge of attack methods. Review the results of pentest reports as well.
  • Continuous Monitoring — This one is a bit of a crap shoot and speaks more to poorly managed security tools, but combing through raw event data from your tools may also uncover some interesting things.

If you work in DFIR or Intel, you may have heard of the Pyramid of Pain. The general concept is that not all indicators are created equal. It’s easy for attackers to change hashes with polymorphic malware but changing behavior is more difficult. This concept applies to threat hunting by illustrating the advantages of searching for behavior rather than static IOCs.

Lastly, there needs to be some sort of a team structure around this. You can have dedicated teams if resources permit, hybrid teams or outsource the function entirely. Whichever model you choose, make sure it’s properly communicated and defined so you’re not missing key events in your environment.

Test Your Alerts

It’s easy to craft an alert and assume it works as expected, but sometimes that’s not the case. It’s advantageous to validate it early on instead of when you’re face to face with a team of auditors.

Some alerts are easier to test than others. You can download EICAR or WICAR test files for generic signature-based malware detection. Certain vendors like Bluecoat/Symantec/Broadcom (who can keep track anymore?) provide test pages for web proxy activity.
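For example, a harmless way to exercise anti-malware detection is simply writing the standard EICAR test string to disk, something like this quick sketch:

```python
# Write the standard EICAR anti-malware test string to disk. It is harmless,
# but any working endpoint protection should flag or quarantine the file the
# moment it is written -- which is exactly the alert you want to see fire.
EICAR = r"X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"

with open("eicar_test.txt", "w") as fh:
    fh.write(EICAR)
print("Wrote eicar_test.txt -- confirm the corresponding alert fired in the SIEM")
```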

Bytes in/out activity can be simulated with large file uploads or speed tests. Encoded PowerShell and seemingly malicious documents are easy enough to recreate, as is scanning activity using tools like Nmap. Malicious user agents can also easily be mimicked.
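As a sketch of the encoded PowerShell case: -EncodedCommand expects Base64 of the UTF-16LE command text, so generating a benign test command looks like this.

```python
import base64

# Reproduce what "encoded PowerShell" looks like for detection testing.
# The command itself is deliberately benign; the point is to exercise the
# alert logic that looks for -EncodedCommand on the command line.
command = 'Write-Output "detection test - benign"'
encoded = base64.b64encode(command.encode("utf-16-le")).decode()

print(f"powershell.exe -NoProfile -EncodedCommand {encoded}")
# Run the printed line on a test host and confirm the command-line or
# Sysmon-based alert fires as expected.
```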

More complex use cases might require a little more effort, so be sure to make this a consideration during the use case development process. For those, both paid and free tools exist that test not only use cases but also gaps in detection capabilities. Without naming any companies, if you Google “Breach and Attack Simulation” you’ll get a general idea.

On the open source front, one tool I’m aware of is Caldera. I haven’t personally used it, so take this with a grain of salt, but I could see it being leveraged to test certain detection cases.

Fun with Metrics

Ok…so metrics aren’t actually much fun unless you like making PowerPoint slides for management, but they really can provide a ton of useful information. I’ve seen a lot of useless metrics and a lot of good ones. What makes a useless metric? In my opinion, one that results in no action.

A metric could also be a good one, but if you don’t have the resources to take action you aren’t getting much benefit from it. If we report on the number of alerts that result in false positives but don’t do anything with the data, it doesn’t really help us much. If we use that data to better tune alerts and security tool configurations, saving analysts time and effort and therefore saving the company money… well, that’s pretty good stuff.

I’m also a big fan of the ‘mean-time-to-whatever’ metrics. Examples are Mean Time to Detect (MTTD), Mean Time to Respond (MTTR), Mean Time to Escalate (MTTE), and Mean Time to Close (MTTC). There’s also dwell time, which, depending on who you ask, could be either the time from when something occurs to when it’s detected or the time from occurrence to closure.
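As a simple illustration, here’s a sketch that computes MTTD and MTTR from incident records; the field names and sample data are made up, and in practice the timestamps would come from your ticketing or case management system.

```python
from datetime import datetime
from statistics import mean

# Sketch of the "mean time to X" family: given incident records with timestamps,
# compute MTTD and MTTR in hours. Field names and sample data are made up.
incidents = [
    {"occurred": "2020-08-01T02:15", "detected": "2020-08-01T06:40", "resolved": "2020-08-02T11:00"},
    {"occurred": "2020-08-10T14:05", "detected": "2020-08-10T14:30", "resolved": "2020-08-10T18:45"},
]

def hours_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 3600

mttd = mean(hours_between(i["occurred"], i["detected"]) for i in incidents)
mttr = mean(hours_between(i["detected"], i["resolved"]) for i in incidents)
print(f"MTTD: {mttd:.1f}h  MTTR: {mttr:.1f}h")
```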

The last set of good metrics I want to mention is anything surrounding alert volume or trends based on categories. If you are able to map alerts or incidents to popular frameworks such as MITRE ATT&CK or the Cyber Kill Chain, you can glean helpful information on possible gaps in controls or shifts in attack techniques.

Train your Analysts

This one goes without saying. In my opinion, every analyst should have a basic understanding of the SIEM and any use cases they are responsible for supporting. They should probably have some level of understanding of the network architecture. It’s necessary to know what log sources are available and key fields in each log source. If you have one person developing everything in the SIEM in a small org and that person leaves, good luck.

Training doesn’t necessarily have to be costly, either. There are plenty of great resources depending on the SIEM you use. Splunk offers the Splunk Fundamentals I course for free.

It’s been a while since I used IBM’s QRadar, but I remember a great YouTube channel by Jose Bravo that includes tons of amazing content.

Don’t Forget About Data Availability!

Never assume things are working as you expect. The number of times I’ve seen log sources stop sending to the SIEM, or gaps in data persist for days or even weeks, is pretty remarkable. It seems silly not to consider this, but it happens all too often. In cases where a log source doesn’t support use cases that trigger regularly, it’s even easier for an outage to go unnoticed.

That said, make sure you also have operational use cases to validate the availability of the data. If something stops sending, especially an important security tool, it should be treated as a high priority.
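A bare-bones sketch of such an operational check might look like this, where the allowed lag per source and the “last seen” values are made-up examples; in practice you’d pull these from a scheduled SIEM search.

```python
from datetime import datetime, timedelta

# Operational use case sketch: flag log sources whose most recent event is
# older than an allowed lag. The sources, lags, and timestamps are examples.
MAX_LAG = {"edr": timedelta(minutes=15), "dns": timedelta(hours=1), "firewall": timedelta(hours=4)}

last_seen = {
    "edr": datetime(2020, 8, 31, 9, 55),
    "dns": datetime(2020, 8, 31, 6, 0),
    "firewall": datetime(2020, 8, 30, 22, 0),
}

now = datetime(2020, 8, 31, 10, 0)   # stand-in for datetime.utcnow()
for source, seen in last_seen.items():
    lag = now - seen
    if lag > MAX_LAG[source]:
        print(f"ALERT: {source} silent for {lag} (allowed {MAX_LAG[source]})")
```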

Process Documents

As much as it pains me to say it…what’s even more fun than metrics? You guessed it. Process documents! In all seriousness, for me this is the most necessary evil within the whole process. You need processes for your processes and processes to process them. Let’s not get too carried away, but there are some things that are just plain necessary at a bare minimum.

Response processes — When an incident is declared, the IR team should have some sort of documented approach for each scenario. They can be high level, low level, or both. You might also want to consider separate response processes for IR and SOC analysts since they are likely to be operating at different stages of the event or incident. Somewhere it should also be documented when that hand-off should occur. Formally define what your organization considers an incident, agree on it, communicate it and set your hand-off point.

Use case development process — We talked a lot about use case development. There need to be processes in place not only for how we create, test, and validate use cases, but also for some sort of periodic review. Associated with that, we need processes for log source on-boarding and use case tuning.

Training process — There’s a whole section above on the importance of training, and equally important is a training process. You want to make sure everyone has the same opportunities and that training is regularly reviewed and updated accordingly.
