Sensitive data discovery is the process of locating all of your company’s sensitive data, such as financial information, customer details, trade secrets, and more. You can think of sensitive data discovery as the key to safeguarding your business, employees, and clients from bad actors who could exploit critical internal information.
Effective sensitive data discovery requires classification and categorization of the data discovered. The two concepts are closely connected and must be aligned in order to secure your business’s most crucial information. For a robust sensitive data discovery strategy that protects your organization, you’ll need data classification that’s comprehensive and accurate.
Types of Sensitive Data Important to Discover and Classify
In order to correctly identify sensitive data, you’ll need to categorize what you’re looking to protect. Here are the most common kinds of sensitive data that you should categorize as important to secure.
Personally Identifiable Information (PII)
Any data that can be used to identify an individual person falls under the category of PII. This could include details such as a person’s first and last names, birthdate, passport or driver's license number, address, and social security number. This information is one of the core data sets that organizations must cover when conducting sensitive data discovery.
PII can be used by bad actors for illegal activity such as identity theft or credit card fraud. Numerous legal requirements obligate organizations to safeguard this information from cybercriminals and prohibit them from exposing it to third parties without the express consent of the person in question.
Protected Health Information (PHI)
Companies operating in the healthcare industry must be particularly careful when it comes to ensuring that PHI is properly safeguarded. This data includes sensitive information about an individual’s health history, including lab test results, medical conditions, insurance claims, and more.
In the United States, PHI is protected under HIPAA, and companies that are found to violate the strict regulations around safeguarding this data may be subject to financial penalties and other regulatory consequences.
Financial Data
Anything related to the financial dealings of your company should be kept on a strictly need-to-know basis and categorized as sensitive data. This could include information regarding payroll, like employee salaries and benefits, your company’s annual expenditures, and more. A good rule of thumb is to simply treat all financial data as sensitive, and secure that information accordingly.
Intellectual Property
Trade secrets, in-depth explanations of your product design or features, and everything else that’s related to your organization’s intellectual property should be treated as an incredibly valuable asset.
This insider information must be strictly safeguarded; otherwise, your competition could gain critical insights into your products, leading them to exploit your weaknesses or even to improve upon your offering and present it as their own.
Confidential and Strategic Business Information
Information regarding your company’s internal workings, future plans, and overall business strategy should be classified as sensitive data. A leak of this information could be disastrous, damaging your company’s financial and market standing.
For example, a go-to-market strategy that hasn’t been implemented yet could become useless should it be obtained by a bad actor and released publicly. When conducting sensitive data discovery, be sure that your confidential and strategic business information is included.
Reasons Why Your Business Needs Sensitive Data Discovery
There are numerous reasons why your company should engage in sensitive data discovery.
Regulatory compliance
As awareness around consumer privacy grows, companies are now subject to more laws than ever before requiring them to keep customers’ information secure.
To comply with regulations like GDPR, companies must allow individuals to request the deletion of their personal information from the organization’s databases. To honor such a request, the organization must be able to locate and identify all instances of that person’s personal data throughout its data environment.
GDPR compliance often also calls for anonymizing or masking sensitive data, which in turn requires the ability to quickly locate every instance of that data and apply those policies to it.
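As a rough illustration of what masking might look like once discovery has located the relevant fields, here is a minimal Python sketch. The function names (mask_email, pseudonymize), the sample record, and the hard-coded key are all assumptions made for this example. Whether a given technique satisfies a particular regulation is a legal question that the sketch does not answer.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real deployment would load it
# from a secrets manager rather than from source code.
SECRET_KEY = b"example-key-for-illustration-only"


def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "***"  # not a recognizable address, so hide it entirely
    return f"{local[0]}{'*' * (len(local) - 1)}@{domain}"


def pseudonymize(value: str) -> str:
    """Replace a value with a keyed hash: stable across records, but unreadable."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


record = {"name": "Jane Doe", "email": "jane.doe@example.com", "plan": "pro"}
masked = {
    "name": pseudonymize(record["name"]),
    "email": mask_email(record["email"]),
    "plan": record["plan"],  # non-sensitive field left unchanged
}
print(masked["email"])  # j*******@example.com
```

Masking keeps a value partly recognizable to humans, while the keyed hash keeps records joinable without revealing the original value. Neither can be applied to data that discovery has not found first.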
Companies operating in highly regulated industries like finance or healthcare must adhere to a slew of additional data privacy regulations, and sensitive data discovery is a critical part of that process.
Data access governance
Understanding the sensitivity level of a specific asset, along with the type of data contained within it, is critical for providing the right level of access to users and third parties.
It’s clear that users at your organization (and occasionally, external contractors) will need to access sensitive data as part of their day-to-day workflows, but that data still should remain protected on a need-to-know basis.
Remember, sensitive data isn’t a one-size-fits-all category: even within the same company, there are different types of confidential information. Access to sensitive data granted to employees will vary according to their roles.
For example, a healthcare company may need to grant some users access to a client’s protected health information (PHI). Other teams at the company might need to access the organization’s financial projections.
While both PHI and financial information fall under the umbrella of sensitive data, you should distinguish between these two distinct types of information when granting access to the appropriate parties.
Each user or team should be granted access solely to the information that’s relevant for their work, rather than receiving a blanket pass to view all sensitive data within the company.
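The idea of granting access per data category, rather than one blanket pass, can be sketched in a few lines of Python. The roles, categories, and the can_access helper below are hypothetical names chosen for illustration; a real system would pull grants from an identity provider or policy engine.

```python
from enum import Enum


class Category(Enum):
    PHI = "phi"
    FINANCIAL = "financial"
    PII = "pii"


# Hypothetical mapping of roles to the sensitive categories they may view.
ROLE_GRANTS = {
    "clinician": {Category.PHI, Category.PII},
    "finance_analyst": {Category.FINANCIAL},
    "support_agent": {Category.PII},
}


def can_access(role: str, asset_categories: set[Category]) -> bool:
    """Allow access only if the role is granted every category found in the asset.

    An asset with no sensitive categories (an empty set) is open to any role.
    """
    granted = ROLE_GRANTS.get(role, set())
    return asset_categories <= granted


print(can_access("clinician", {Category.PHI}))        # True
print(can_access("finance_analyst", {Category.PHI}))  # False
```

The check depends entirely on the asset's category labels, which is why classification has to come before access decisions.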
Data security
Strong data security requires a full, comprehensive accounting of all your company’s critical information. If you don’t know what you have, you can’t protect it – that’s why sensitive data discovery is so critical.
If you’re aware of the location and type of all your instances of sensitive data, you can implement practices such as encryption and data masking to keep it safe.
Sensitive data discovery is also important for effective risk evaluation and more targeted, informed security alerts. For example, a user who downloads an abnormally large number of sensitive files can trigger a notification to your security team. To establish the benchmark that separates normal from abnormal, you’ll first need to define which files are considered sensitive. And to do that, you’ll need sensitive data discovery.
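One simple way to picture such a benchmark is a per-user baseline of daily sensitive-file downloads. The sketch below assumes a basic rule (flag anything more than three standard deviations above the user's historical mean). The is_abnormal function, the threshold, and the sample counts are illustrative assumptions, not a recommended detection model.

```python
from statistics import mean, stdev


def is_abnormal(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's count if it sits more than `threshold` standard deviations
    above the historical mean of sensitive-file downloads."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any increase stands out.
        return today > baseline
    return (today - baseline) / spread > threshold


# Hypothetical daily counts of sensitive files downloaded by one user.
daily_sensitive_downloads = [4, 6, 5, 7, 5, 6, 4]

print(is_abnormal(daily_sensitive_downloads, 7))   # False
print(is_abnormal(daily_sensitive_downloads, 40))  # True
```

The counts only mean something if each downloaded file has already been labeled sensitive or not, which is the part that discovery provides.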
Top Methods for Sensitive Data Discovery
- Regular expressions (regex) - Patterns that describe sequences of characters. They can be used to find and categorize strings of text that follow a known format, such as credit card or social security numbers.
- Data fingerprinting and exact data matching - This technique is based on comparing data within fields or documents against a preset list, in order to check if an exact match for that information exists.
- Natural Language Processing (NLP) - NLP can identify entities, extract features from unstructured text, and comprehend context and sentiment, making it an effective tool for detecting sensitive data.
- Asset metadata scans - Asset metadata provides information about the asset, such as the file name, the time last modified, etc. Sensitive data discovery tools can utilize this to quickly surmise whether an asset contains sensitive information and its sensitivity level.
- Data lineage - In data environments with structured data and multiple processes, data lineage can be a major help. If you can identify sensitive data in one database field, data lineage can track that data’s journey through your systems and classify as sensitive all instances leading to or derived from that field.
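The data lineage approach in the last item above can be sketched as a graph traversal: starting from one field known to be sensitive, collect everything upstream ("leading to") and downstream ("derived from") of it. The field names and the LINEAGE mapping below are invented for illustration; real lineage graphs usually come from a catalog or query-log analysis.

```python
from collections import deque

# Hypothetical lineage edges: source field -> fields derived from it.
LINEAGE = {
    "crm.contacts.ssn": ["warehouse.customers.ssn"],
    "warehouse.customers.ssn": ["reports.kyc.ssn_last4", "ml.features.ssn_hash"],
    "crm.contacts.signup_date": ["warehouse.customers.signup_date"],
}


def _reachable(start: str, adjacency: dict[str, list[str]]) -> set[str]:
    """Breadth-first search returning every node reachable from `start`."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def propagate_sensitivity(seed: str, edges: dict[str, list[str]]) -> set[str]:
    """Return the seed plus all fields leading to it or derived from it."""
    # Invert the edges so the same search can walk upstream.
    reverse: dict[str, list[str]] = {}
    for source, targets in edges.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return {seed} | _reachable(seed, edges) | _reachable(seed, reverse)


sensitive = propagate_sensitivity("warehouse.customers.ssn", LINEAGE)
print(sorted(sensitive))
```

In this sample graph the signup_date fields stay unlabeled, because no lineage path connects them to the seed field.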
Challenges in Sensitive Data Discovery
None of the main sensitive data discovery methods is sufficient on its own.
Regex frequently produces false positives because it matches patterns based solely on form, ignoring content and context. With no understanding of meaning or variation, regex may mistakenly flag similar but irrelevant data as matches, particularly when patterns are broad or ambiguously defined.
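To make the false-positive problem concrete, the Python sketch below uses a deliberately broad pattern for payment card numbers and then applies the Luhn checksum as a second filter. The pattern, the sample text, and the helper names are assumptions for this example; a checksum is only one of several possible ways to add validation on top of a regex.

```python
import re

# Broad pattern: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return regex matches that also pass the checksum."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]


text = "Order 1234567890123456 paid with card 4111 1111 1111 1111."

print(CARD_PATTERN.findall(text))  # both numbers match on form alone
print(find_card_numbers(text))     # ['4111 1111 1111 1111']
```

The order number has the right shape but fails the checksum, so it is dropped. A number that happens to pass the checksum without being a card would still slip through, which is why context matters as well.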
Exact data matching is unlikely to produce false positives, but it is more prone to false negatives due to its inflexibility.
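A minimal sketch of exact data matching, assuming the reference list is stored as SHA-256 fingerprints of normalized values. The normalize step, the sample values, and the function names are illustrative assumptions. Plain unsalted hashes are used only to keep the example short, and would be weak protection for low-entropy values in practice.

```python
import hashlib


def normalize(value: str) -> str:
    """Lowercase and drop non-alphanumeric characters so that formatting
    differences alone do not prevent a match."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def fingerprint(value: str) -> str:
    """Hash the normalized value so the reference list holds no raw data."""
    return hashlib.sha256(normalize(value).encode("utf-8")).hexdigest()


# Hypothetical reference list of known sensitive values, kept as hashes.
KNOWN_FINGERPRINTS = {
    fingerprint(value) for value in ["123-45-6789", "jane.doe@example.com"]
}


def is_known_sensitive(candidate: str) -> bool:
    """Return True only if the candidate exactly matches a known value."""
    return fingerprint(candidate) in KNOWN_FINGERPRINTS


print(is_known_sensitive("123 45 6789"))  # True: same value, different format
print(is_known_sensitive("123-45-6780"))  # False: one digit differs
```

Normalization recovers some matches that differ only in formatting, but any value missing from the reference list, or altered by even one character, goes undetected. That is the false-negative risk described above.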
With NLP, false positives can occur when the model is simplistic, such as with earlier versions of Named Entity Recognition (NER), or when the training data is too limited.
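Asset metadata scans can produce both false positives and false negatives, because a file name or modification time is only an indirect hint about what an asset actually contains.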
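Both failure modes show up even in a toy metadata scan. The sketch below scores an asset on two signals: keywords in the file name and the owning team. The AssetMetadata fields, the keyword list, and the team names are assumptions for illustration; owner_team in particular is a hypothetical field, not one named in this article.

```python
import re
from dataclasses import dataclass


@dataclass
class AssetMetadata:
    name: str
    owner_team: str  # hypothetical field assumed for this example


# Hypothetical hints that an asset may hold sensitive data.
NAME_HINTS = re.compile(r"payroll|salary|ssn|passport|patient|invoice", re.IGNORECASE)
SENSITIVE_TEAMS = {"hr", "finance", "legal"}


def metadata_score(asset: AssetMetadata) -> int:
    """Return a rough 0-2 score: one point per metadata signal that fires."""
    score = 0
    if NAME_HINTS.search(asset.name):
        score += 1
    if asset.owner_team.lower() in SENSITIVE_TEAMS:
        score += 1
    return score


assets = [
    AssetMetadata("payroll_2024.xlsx", "HR"),
    AssetMetadata("patient_parking_map.pdf", "facilities"),
    AssetMetadata("q3_export_final.csv", "marketing"),
]
for asset in assets:
    print(asset.name, metadata_score(asset))
```

The parking map scores a point on its name alone even though it likely holds nothing sensitive (a false positive), while the generically named export scores zero no matter what its columns contain (a potential false negative).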
The most effective strategy for sensitive data discovery combines all of the methods at your disposal, including third-party tools and other solutions specifically geared towards detecting and categorizing each instance of critical information.