Why Panther Chose to Open Up Its Security Data Lake

Giving organizations and security engineers total control over their data

At its core, Panther is a data-driven platform that needs to store, manipulate, and query huge amounts of data to help answer important security questions.

In this blog, I’ll explain how we chose our datastore and why we took the bold step of fully opening it to customers. I’ll also share how Panther’s open data infrastructure integrates with third-party applications and services. You’ll discover how this leads to improved security and transparency, and why we believe that data should never be trapped in a security platform.

How Panther Processes Data

One of Panther’s core components is a log processing subsystem that takes raw logs and converts them into structured events that can be analyzed with Python rules. This system parses, normalizes, and extracts indicators (like IP addresses) from a variety of security data sources such as CloudTrail, Osquery, Okta, Suricata, and more. When a rule matches suspicious activity, an alert is generated and dispatched to destinations like Slack or PagerDuty.
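
To make this concrete, here’s a minimal sketch of what a rule can look like: a Python function that receives a parsed event and returns True to raise an alert. The field names follow the CloudTrail example later in this post; treat the exact helper API as version-dependent rather than a fixed contract.

def rule(event):
    # Flag any use of the AWS root account, a classic CloudTrail detection.
    return event.get("userIdentity", {}).get("type") == "Root"


def title(event):
    # Optionally give the alert a human-readable title.
    return f"Root account activity in {event.get('awsRegion')}"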

The processed log data and all events related to an alert are stored and made searchable via SQL in a serverless Data Lake. This Data Lake is fully open and available to users of our system for post-processing, visualization, training machine learning models, or anything else they need.
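
As a sketch of what that access can look like with the Athena backend, the snippet below starts a SQL query against the data lake with boto3. The database name, table name, and S3 output location are illustrative, not a fixed Panther schema:

import boto3

athena = boto3.client("athena")

# Start a SQL query against the data lake; results land in S3.
response = athena.start_query_execution(
    QueryString="""
        SELECT eventName, sourceIPAddress, COUNT(*) AS hits
        FROM aws_cloudtrail
        GROUP BY 1, 2
        ORDER BY hits DESC
        LIMIT 25
    """,
    QueryExecutionContext={"Database": "panther_logs"},  # illustrative name
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
print(response["QueryExecutionId"])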

A Modern Database Architecture

To keep up with cloud-scale data demands, small security teams need to scale very quickly. Panther is built on a fully serverless architecture that enables greater scale, more flexibility, and quicker time-to-value than traditional on-premises applications. The choice of database should be consistent with this architecture.

Characteristics of our Database:

  • Database as a service
  • Storage decoupled from compute
  • Support for complex objects

Database as a Service

We designed Panther to be fundamentally serverless to offer customers automatic scaling, lower operational load, lower cost, better security, and easier deployment. Being serverless also shaped our database decision, because we do not want to manage any database servers! Therefore, our database must be exposed as a managed service.

Storage Decoupled from Compute

Panther is a security data processing platform, which means huge volumes of data are expected to be processed and retained for long periods of time. For example, it’s not uncommon to consume 2-5TB of data per day and store that data for up to 3 years. To handle this heavy workload, Panther’s database must be extremely scalable.

Decoupling storage from compute is the only way to achieve the scale needed to operate in a cloud-first world where data volumes continue to grow. This means data storage is not local to the compute nodes but remote on the network. Remote storage adds latency to queries, but it allows storage to grow independently of the compute nodes, which is incredibly important for security applications that need to retain petabytes of data spanning years.

In contrast, if your database storage is local to your compute nodes, then adding storage capacity means adding compute nodes. This is expensive and wasteful when the workload is primarily historical data, because nodes get added purely to increase storage capacity, not query speed. In addition, every compute node added increases the operational burden of patching and replacing failing nodes.

One way organizations work around this pain is by limiting the volume of security data they ingest, either by shortening retention or by filtering down to only the most valuable data. These difficult tradeoffs significantly hinder the security team’s ability to effectively detect, investigate, and remediate security incidents.

You might ask: “You already decided that you want to use a database as a service. If the vendor manages the database, why do you care whether storage is decoupled from compute?” The reason is that while the managed service removes the operational burden, the underlying problems have not been resolved. Users of the service still suffer from excessive costs and data capacity issues (just ask anyone who’s had to resize a big Redshift cluster). Once storage is decoupled from compute, these issues simply disappear. In addition, new capabilities become possible, like running multiple query clusters concurrently and letting other tools access the storage.

Support for Complex Objects

Security data is complex. The days of simple rows and columns with scalar values are gone. Today’s security data often originates as complex objects that express rich relationships necessary for answering critical security questions. Our database must support complex objects such as arrays, maps, and structs. A well-known example is AWS CloudTrail data, which is composed of many complex elements:

{
  "Records": [
    {
      "eventVersion": "1.0",
      "userIdentity": {
        "type": "IAMUser",
        "principalId": "EX_PRINCIPAL_ID",
        "arn": "arn:aws:iam::123456789012:user/Alice",
        "accessKeyId": "EXAMPLE_KEY_ID",
        "accountId": "123456789012",
        "userName": "Alice"
      },
      "eventTime": "2014-03-06T21:22:54Z",
      "eventSource": "ec2.amazonaws.com",
      "eventName": "StartInstances",
      "awsRegion": "us-east-2",
      "sourceIPAddress": "205.251.233.176",
      "userAgent": "ec2-api-tools 1.6.12.2"
    }
  ]
}

In particular, take note of the complex userIdentity attribute, which is very useful for answering security questions concerning “who” was responsible for an activity. A security application’s database needs to be able to represent such data and allow efficient searching and element extraction.
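
For instance, Presto-based engines such as Athena let a query reach into structs with dot notation (Snowflake offers a similar path syntax for its semi-structured types). A sketch, with an illustrative table name:

# Extract nested userIdentity fields with struct dot notation.
# The table name is illustrative, not a fixed Panther schema.
QUERY = """
SELECT
    eventName,
    userIdentity.type     AS identity_type,
    userIdentity.userName AS user_name,
    userIdentity.arn      AS user_arn
FROM aws_cloudtrail
WHERE userIdentity.type = 'IAMUser'
LIMIT 100
"""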

Making a Decision

Given the important qualities we require for a database, Panther chose to initially support two options that may be used interchangeably in the backend: AWS Athena and Snowflake.

Security products commonly keep their choice of database engine proprietary and hidden from users (but if you ask, most use Elasticsearch, which is not serverless and does not separate storage from compute). Customer interaction with data happens only through the product’s UI or APIs, which allows vendors to:

  • Change datastores and internal representations without impacting customers.
  • Abstract the underlying technology which might be proprietary and offer competitive advantages.

Panther decided not to follow this path.

Because Panther is an open security product, it was important to us that customers always have the ability to customize our platform to fit their needs and integrate it with other systems. Let’s dig deeper into why opening up our data infrastructure has such a valuable impact on the usability of our system.

Enriching Security Data with Business Context

Allowing customer access to Panther’s database improves overall security.

Providing relevant context for security analytics is the key to successfully utilizing security data. For example, the normal behavior of a salesperson is very different from that of a system administrator: a salesperson logging into production servers is very suspicious, while the same activity from a system administrator is routine.

A customer typically has internal mappings (e.g., Active Directory data) that label a user ID as belonging to a salesperson or a system administrator. By exposing the core database, customers can join their internal data (which has the user’s job role) to the security data (which likely only has the user’s ID) to provide the context needed to disposition events.
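
A sketch of such a join is below. Every table and column name is hypothetical; the point is simply that the security data and the business data live where a single SQL statement can reach both:

# Hypothetical join: surface production logins by employees whose HR
# role is Sales. All table and column names are illustrative.
CONTEXT_QUERY = """
SELECT l.eventTime, l.userName, l.hostname, hr.job_role
FROM ssh_logins AS l
JOIN employee_roles AS hr
    ON l.userName = hr.user_name
WHERE hr.job_role = 'Sales'
  AND l.hostname LIKE 'prod-%'
"""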

While contextualizing security data with customer business data makes the data more relevant, the reverse is also true: processed security data can add value to core business analytics by supplying information about user activity and cloud infrastructure to improve efficiency and reduce costs.

Open Integrations and Partner Ecosystems

Panther has a beautiful user interface that makes it easy for security teams to customize detections, triage alerts, baseline behaviors, and investigate incidents. However, our customers often need to answer very particular questions about their security posture and infrastructure that are outside the scope of what we offer in our UI today. Exposing the datastore allows customers to use their existing business intelligence (BI) tools without any special integrations. This allows customers to create their own rich graphs, metrics, and reports to communicate the state of the organization’s security posture to stakeholders.

In addition, some teams have significant investments in existing data processing technologies. Rather than forcing them to replace a datastore, a shared datastore enables easy integration. Notebooks are particularly popular with security engineers for customized threat hunting; for example, see “Threat Hunting with Jupyter Notebooks — Part 1: Your First Notebook”. Customers can integrate Jupyter and Zeppelin notebooks, as well as Spark systems, with Panther data by using the AWS Glue catalog, and there are ways to easily connect Snowflake as well.
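
As one example of the notebook path, the AWS Data Wrangler library (awswrangler) resolves tables through the Glue catalog and returns Athena results as a pandas DataFrame, which drops straight into a Jupyter workflow. The database and table names below are illustrative assumptions:

# In a notebook cell: pull aggregated events into a DataFrame.
import awswrangler as wr

df = wr.athena.read_sql_query(
    sql="""
        SELECT sourceIPAddress, COUNT(*) AS hits
        FROM aws_cloudtrail
        GROUP BY 1
        ORDER BY hits DESC
        LIMIT 20
    """,
    database="panther_logs",  # illustrative Glue database name
)
df.head()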

It is also very common to have in-house security workflows that are very particular to the customer’s business and data. These must (by definition) be implemented by the customer. Having easy access to the common datastore allows these workflows to query back into the database to retrieve critical data. An example is a mitigation workflow triggered by a Panther alert of suspicious activity: the workflow reaches back into the datastore to collect data identifying the employee and the circumstances, drawing on both the processed security data and the business-related data. Using this information, the mitigation then restricts access accordingly (e.g., using the SSO API to revoke an access token) and notifies the employee’s manager and the security organization. While SOAR tools may help with some of this orchestration, the point remains that the common datastore is the key to making such automated responses possible.
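
A skeletal sketch of that flow is below. The context query, the SSO call, and the notification are all stand-ins for whatever your organization actually runs; none of this is a Panther or vendor API:

# Hypothetical mitigation workflow triggered by a Panther alert.

def fetch_context(user_name: str) -> dict:
    # Stand-in: query the shared datastore (Athena, Snowflake, etc.)
    # for the employee's role and manager.
    return {"job_role": "Sales", "manager": "manager@example.com"}

def revoke_access(user_name: str) -> None:
    # Stand-in for your SSO provider's token-revocation API.
    print(f"revoking tokens for {user_name}")

def notify(address: str, message: str) -> None:
    # Stand-in for an email/Slack/PagerDuty notification.
    print(f"notify {address}: {message}")

def handle_alert(alert: dict) -> None:
    user = alert["userName"]
    context = fetch_context(user)
    if context.get("job_role") != "SysAdmin":
        revoke_access(user)
        notify(context["manager"], f"Access revoked for {user}")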

Zero Copy

Many security products do not allow direct access to their datastore, which means to consume the data, customers must use it within the tool or export it elsewhere. However, when data is in a common place, there’s no need for copying since it can be joined and queried directly.

Besides being inconvenient, forcing customers to copy data has several serious issues:

  • Expensive: Exporting and storing data in multiple places is inefficient and expensive.
  • Slow and error-prone: Moving data can be slow, which can negatively impact SLAs. In addition, failures in the copy process must be addressed by the data movement software, which can be difficult to get right.
  • Inconsistent: Data may change in the security system but not appear downstream, producing inconsistent analytic results.

Transparency

Since customers have direct access to Panther’s datastore, they always know how much data they hold and the associated cost of storing that data. This enables informed decisions on data retention and expense forecasting.

For customers using Panther powered by AWS Athena, the data is stored using industry-standard JSON and Parquet formats, both of which are broadly supported by popular big data tools.
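
Because the underlying files are plain Parquet on S3, any Parquet-aware tool can read them without going through Panther at all. A minimal sketch, assuming an illustrative bucket path and the pyarrow and s3fs packages:

import pandas as pd

# Read the Parquet output directly from S3; the path is illustrative.
df = pd.read_parquet("s3://my-panther-data/logs/aws_cloudtrail/")
print(df.dtypes)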

Ownership and Control

Unlike other security products, where the database and application layer are bundled as a single entity, customer data is not locked into Panther. All data processed by Panther is owned and controlled by the customer and can be used with or without Panther as needed. This lets customers control access, auditing, and backups.

Wrapping up

At Panther, we’re building a platform that provides immense security value without sacrificing transparency and flexibility. Opening up our datastore might give away some competitive advantage compared to other security products, but empowering security teams with their own data is simply the right thing to do. Security is hard enough, and we want to alleviate some of the pain!

What’s Next?

Our team of experienced security engineers is working hard to bring new features to Panther, and the design decisions discussed in this blog allow us to rapidly iterate and improve our platform.

One of our short-term goals is to make analysis simpler and faster, and we’ll soon ship an Indicator Search feature that lets you paste a set of indicators (IPs, domains, hashes, and so on) and batch search ALL of your data at once. Alert Summaries are also coming soon, letting users specify data attributes that are automatically summarized on alerts to add relevant context.
