10 Ways to Improve Automated Data Classification

You’ve paid a lot of money for data classification software or spent countless hours developing a similar solution in house. But what good is it if the results aren’t precise? Today I’ll explain 10 technical and workflow-related things that you need to be doing to eliminate as many false positives and negatives as possible.

Use advanced regular expressions

All automated data discovery and classification solutions are based on regular expressions. I don’t have space to dive into the depths of RegEx theory right now; if you need a refresher, head over to this post.

My point here is that there’s no one RegEx to rule them all. For instance, even if multiple credit card numbers are matched by a regular expression that you got from your vendor or wrote yourself, that doesn’t mean that it’s a good RegEx. Allow me to demonstrate.

The RegEx \d{16} looks for a sequence of 16 digits, which is what most credit card numbers are. But that doesn’t mean that you should be using this RegEx. There are tons of intricacies that it doesn’t take into account. On the one hand, you’ll probably get false positives, because not every sequence of sixteen digits is a credit card number. If you’re looking for Mastercard numbers, specify that the number starts with 51-55 or 2221-2720. On the other hand, you’ll also miss legitimate credit card numbers that are written with dashes or spaces in between each set of 4 digits. You can avoid these false negatives by including optional whitespace and special characters in your RegEx.
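To make that concrete, here is a minimal Python sketch of a tighter Mastercard pattern. It assumes the prefixes given above (51-55 and 2221-2720), 16 digits in total, and at most one space or dash between groups of four. The names MASTERCARD and find_mastercards are mine, not part of any product.

```python
import re

# Illustrative pattern only: Mastercard prefixes 51-55 or 2221-2720,
# 16 digits total, optionally grouped in fours by a single space or dash.
MASTERCARD = re.compile(
    r"""
    (?<!\d)                              # not preceded by another digit
    (?:5[1-5]\d{2}                       # 5100-5599
      |222[1-9]|22[3-9]\d|2[3-6]\d{2}    # 2221-2699
      |27[01]\d|2720)                    # 2700-2720
    (?:[ -]?\d{4}){3}                    # three more groups of four digits
    (?!\d)                               # not followed by another digit
    """,
    re.VERBOSE,
)


def find_mastercards(text):
    """Return every substring of text that looks like a Mastercard number."""
    return [m.group(0) for m in MASTERCARD.finditer(text)]


print(find_mastercards("Paid with 5500 0000 0000 0004, order 4111111111111111"))
```

The two lookarounds keep the pattern from firing inside a longer run of digits, which is one cheap way to cut false positives without making the expression so strict that it starts missing real numbers.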

The more general point is this: Always try to improve your RegExes. Make them specific, but don’t overdo it, because too many criteria can omit relevant results. Balance is the key.

Complement your RegExes with keywords and keyword phrases

Keywords are often used in data classification, but not in the way I think they should be. You see, the fact that a document mentions the phrase “credit card number” or “social security number” doesn’t mean that protected information is actually there. However, this is where most data classification solutions stop: If a document matches a keyword or phrase, it gets classified. The end.

But wouldn’t it make sense for keywords to act as a validation algorithm for regular expressions? That is, each relevant keyword or key phrase inside a document makes it more and more probable that the file actually contains a piece of sensitive data.

As I said earlier, not every 16-digit number is a credit card number. But, if the phrase “credit card” or “card number” appears very close to the number, doesn’t that make it more likely that that is indeed a credit card number? I think it does.
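Here is a rough sketch of that idea in Python, assuming a simple character window around each match. The window size of 40 characters, the keyword list and the function names are all placeholders I picked for illustration; a real solution might measure distance in words or sentences instead.

```python
import re

# Any 16-digit sequence, optionally grouped in fours (deliberately loose).
CARD_LIKE = re.compile(r"(?<!\d)\d{4}(?:[ -]?\d{4}){3}(?!\d)")
KEYWORDS = ("credit card", "card number")
WINDOW = 40  # characters checked on each side of a match (assumed value)


def keyword_near(text, match, keywords=KEYWORDS, window=WINDOW):
    """True if any keyword appears within `window` characters of the match."""
    start = max(0, match.start() - window)
    end = min(len(text), match.end() + window)
    context = text[start:end].lower()
    return any(keyword in context for keyword in keywords)


def likely_card_numbers(text):
    """Keep only the 16-digit matches that have a supporting keyword nearby."""
    return [m.group(0) for m in CARD_LIKE.finditer(text) if keyword_near(text, m)]


print(likely_card_numbers("Credit card: 4111 1111 1111 1111"))
print(likely_card_numbers("Order ID 1234567890123456 shipped"))
```

The first call keeps the number because "credit card" sits right next to it; the second drops the order ID because nothing around it suggests a card.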

Take into account keyword and key phrases variations and synonyms

Here is a bane of keyword matching — there are a million ways to write a single keyword. Automated techniques like stemming can help you with plural and singular forms and other word inflections so you don’t have to type them in manually: passport/passports, license/licenses/licensing/licensed and so on.

But the situation quickly gets complicated with key phrases. Identifying meaningful concepts from phrases is no simple task, and very few data classification solutions are capable of doing it. Handling inflections of the constituent words of the phrase is just the start. Parts of the phrase might be split up between other words. Can your solution recognize that “Passports of UK citizens” reflects the same concept as its keyword “UK Passport”? If it can’t, the only way you can circumvent this issue is by splitting up key phrases into individual keywords. But remember that this will increase the number of false positives. A geographical essay on river “banks” is probably not something that you want to find when you’re looking for “bank” card data.

And then there’s the issue of partial or fuzzy matching, which is especially relevant if your solution supports optical character recognition and can classify images. If images are of low quality, OCR can produce a lot of cryptic symbols where perfectly meaningful words used to be. Unless your data classification solution has a built-in capability for that sort of thing, there isn’t much you can do by playing with RegExes. Your best bet is to include as many key phrases and keywords as possible in your dictionary and hope that at least one of them will be recognized exactly as you wrote it.
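If you are building in house, or your tool lets you plug in custom matchers, approximate matching is one option worth trying. The sketch below uses difflib from the Python standard library to compare each run of words against a key phrase. The 0.85 similarity threshold and the name fuzzy_contains are assumptions of mine, and the threshold would need tuning against your own OCR output.

```python
from difflib import SequenceMatcher


def fuzzy_contains(text, phrase, threshold=0.85):
    """True if some run of words in text is at least `threshold` similar to phrase."""
    words = text.lower().split()
    target = " ".join(phrase.lower().split())
    size = len(target.split())
    # Slide a window with as many words as the phrase across the text.
    for i in range(len(words) - size + 1):
        candidate = " ".join(words[i:i + size])
        if SequenceMatcher(None, candidate, target).ratio() >= threshold:
            return True
    return False


# OCR misread the "i" as a "1", yet the phrase is still recognized.
print(fuzzy_contains("Customer cred1t card on file", "credit card"))
```

Lowering the threshold catches more damaged text but also lets in more unrelated phrases, so it is the same precision trade-off as everywhere else in this post.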

Here are a few other things to keep in mind:

Capitalization — Are keywords in your solution case sensitive by default?
Word order — Will “card VISA” be a match if you use the key phrase “VISA card”?
Special symbols — Will “bank_card” be a match if you use the key phrase “bank card”?

If your data classification solution can’t take into account all of these intricacies, it is up to you to design a keyword and key phrase dictionary that will suit your needs. It’s not going to be pretty, but if you want to up the accuracy, it has to be done.
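If you end up handling these cases yourself, normalizing both the text and the key phrase before comparing them covers all three questions above. This is a minimal sketch with made-up function names; it only understands ASCII letters and digits, so treat it as a starting point rather than a finished matcher.

```python
import re


def normalize(text):
    """Lower-case and turn every run of non-alphanumerics (including "_") into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def phrase_matches(text, phrase, any_order=True):
    """True if the words of phrase appear next to each other in text."""
    tokens = normalize(text).split()
    wanted = normalize(phrase).split()
    size = len(wanted)
    for i in range(len(tokens) - size + 1):
        window = tokens[i:i + size]
        if window == wanted:
            return True
        # Optionally accept the same adjacent words in a different order.
        if any_order and sorted(window) == sorted(wanted):
            return True
    return False


print(phrase_matches("Pay by card VISA today", "VISA card"))  # word order
print(phrase_matches("bank_card", "bank card"))               # special symbols
print(phrase_matches("BANK CARD", "bank card"))               # capitalization
```

Because the comparison is on whole words, "river banks" does not match "bank card" here, which keeps the false positives mentioned earlier out of the results.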

Apply algorithmic validation

This one’s real simple. You can use certain algorithms to validate a variety of identification numbers. For instance, the famous Luhn algorithm can be used to check that credit card numbers, IMEI numbers or Canadian Social Insurance Numbers have a valid check digit. That weeds out typos and random digit strings, although it can’t tell you whether a number was ever actually issued. Many data classification tools support algorithmic validation and it is, undoubtedly, useful.

There’s just one catch: Not all sensitive data can be validated with Luhn or other algorithms, in part because not all sensitive data contains digits. That’s why I’m more inclined to use keywords for RegEx verification. It’s just a much more universal and versatile method.
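For reference, the Luhn check itself fits in a few lines. This Python sketch ignores spaces and dashes and works on the remaining digits; the function name luhn_valid is my own.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if "0" <= c <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # widely used Visa test number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A sensible way to use it is as a filter on top of a RegEx match: the pattern finds candidates, and the checksum throws out the ones that cannot be valid.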

Implement classification scores and thresholds

This point is monumental in value. If we’re talking about real precision, it’s not enough to just split data into “classified” and “not classified”; you, my friend, need a scale. Each rule that matches against a document adds to the score for that document; only after it hits a certain threshold is the file classified. This system will enable you to create extremely precise classification rules that leave simple RegExes or keyword matching in the dust. The next three tips explain how this works.

Adjust the weights of different classification rules

A well-designed RegEx for a credit card number should contribute a bigger score to a document than a match to the simple key phrase “credit card,” right? That insight provides another tool for improving the precision of your data classification: Rank all RegExes and keywords by how relevant they are to a particular category and assign scores appropriately.

Note that two RegExes can have different scores. If you’re looking for Visa cards, you know that there are modern ones with 16 digits and older ones with 13 digits. A RegEx for the 13-digit cards should have a smaller score because it is much more likely that a 16-digit sequence is actually a VISA card number than a 13-digit sequence.
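Here is how weighted rules and a threshold might fit together in Python. Every number in it (60, 35, 20 and the threshold of 75) is invented for the example, as are the rule names; the only point is that the 16-digit pattern outweighs the 13-digit one, and both outweigh plain keywords. Each rule counts once per document, however many times it matches.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    score: int


# Hypothetical weights: stronger evidence contributes a larger score.
RULES = [
    Rule("visa_16", re.compile(r"(?<!\d)4\d{3}(?:[ -]?\d{4}){3}(?!\d)"), 60),
    Rule("visa_13", re.compile(r"(?<!\d)4\d{12}(?!\d)"), 35),
    Rule("kw_credit_card", re.compile(r"credit card", re.IGNORECASE), 20),
    Rule("kw_visa", re.compile(r"\bvisa\b", re.IGNORECASE), 20),
]
THRESHOLD = 75  # assumed value


def score_document(text, rules=RULES):
    """Add up the score of every rule that matches the text at least once."""
    return sum(rule.score for rule in rules if rule.pattern.search(text))


def is_classified(text, threshold=THRESHOLD):
    return score_document(text) >= threshold


print(score_document("Visa credit card 4111 1111 1111 1111"))
print(is_classified("We accept credit card payments"))
```

With these weights, a 16-digit match needs one supporting keyword to cross the threshold, a 13-digit match needs two, and keywords alone never get there.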

Build logic around your classification rules

Here’s the last piece of tech-related advice: When you’re building these complicated, uber-precise sets of classification rules, don’t forget to interconnect them.

For instance, suppose you created a very precise VISA card number RegEx and came up with tens of relevant keywords for it (good for you!). Since each keyword contributes a certain value to the classification score, a given file might be classified simply because of the sheer volume of keywords inside it. This, of course, is not ideal. In most cases, what you’re looking for is a RegEx plus one keyword. So let’s create a single “master” keyword, which is a set of all keywords that you listed, and assign it a score that, when added to RegEx score, will push the file over the threshold. This master keyword will match against a document only when a sufficient number of individual keywords match against that document.
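A sketch of the master keyword idea in Python follows. The keyword list, the scores and the minimum of two keyword hits are assumptions chosen for the example. The important property is that the whole keyword set contributes one fixed score, so piling up keywords can never classify a file on its own.

```python
import re

VISA = re.compile(r"(?<!\d)4\d{3}(?:[ -]?\d{4}){3}(?!\d)")
KEYWORDS = {"visa", "credit card", "card number", "expiry", "cvv", "cardholder"}

# All values below are illustrative.
MIN_KEYWORD_HITS = 2
REGEX_SCORE = 60
MASTER_KEYWORD_SCORE = 40
THRESHOLD = 100


def master_keyword_matches(text, keywords=KEYWORDS, min_hits=MIN_KEYWORD_HITS):
    """The master keyword matches once enough individual keywords are present."""
    lowered = text.lower()
    hits = sum(1 for keyword in keywords if keyword in lowered)
    return hits >= min_hits


def score(text):
    total = 0
    if VISA.search(text):
        total += REGEX_SCORE
    if master_keyword_matches(text):
        total += MASTER_KEYWORD_SCORE  # added once, however many keywords hit
    return total


print(score("Visa credit card, card number, expiry, CVV and cardholder policy"))
print(score("Visa cardholder 4111 1111 1111 1111"))
```

The first document is full of keywords but has no card number, so it stays at 40. The second has the number plus two keywords and lands exactly on the threshold.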

Here’s another example of how you can build logic relations between different classification rules to improve precision. Most of you know what CVV and CVC are: the 3 digits on the back of your card. But did you know that CVV is used only by VISA and CVC is used only by MasterCard? I sure didn’t. And I bet you that most people use these terms interchangeably. So if someone mistakenly puts a VISA card number with CVC in a document, the data classification process can miss it. Therefore, it makes sense to bind VISA and MasterCard keywords together, so we mitigate the risk of false negatives due to improper data input. VISA keywords will contribute scores to MasterCard classification rules and vice versa. But of course, in this case it would make sense to reduce their scores.

Analyze near misses

Let’s talk about more process- and workflow-related things you can do to improve the precision of data classification. If you managed to implement classification thresholds and weights into your solution, you have given yourself much more leverage to improve precision. Now your data is not split into just two categories, classified and not classified. Instead, you can analyze “how much” the document was classified.

Why would you want to do that? You have probably already guessed, based on the heading of this section. Analyzing near misses (files that are below the classification threshold by 10% or less) can be quite useful. Sure, it’s manual work, but you’re working with data that has a high chance of being relevant, so why not? By reviewing these documents, you will be able to spot some that should have been classified, and then create new keywords to include in your classification rules. Not only will the new keywords push near misses that should be classified over the classification threshold (reducing false negatives), they will also often improve precision (reducing false positives). On the flip side, if you confirm that some of the near misses are in fact not relevant, you can determine how they got their high scores and delete excessive keywords from your classification rules, or even make them negative keywords, which prevent a file containing them from being classified by the rule.
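Pulling the near misses out for review is straightforward once every file has a score. This Python sketch assumes you can export scores as a mapping of file name to score; the file names, the threshold of 75 and the function name are invented for the example.

```python
def near_misses(scores, threshold, margin=0.10):
    """Return (name, score) pairs that fall short of threshold by at most `margin`."""
    lower_bound = threshold * (1 - margin)
    hits = [
        (name, score)
        for name, score in scores.items()
        if lower_bound <= score < threshold
    ]
    # Highest scores first, so the closest misses get reviewed first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


scores = {"a.docx": 100, "b.txt": 74, "c.pdf": 68, "d.txt": 40, "e.txt": 67}
print(near_misses(scores, threshold=75))
```

With a threshold of 75 and a 10% margin, only files scoring from 67.5 up to just under 75 make the review list, so b.txt and c.pdf are returned and the rest are left alone.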

Use the content of the classified files to your advantage

When you’re creating data classification rules, you’re thinking about what content might be in a sensitive file and defining keywords and RegExes to spot it. Now it’s time to be more empirical. For example, suppose you’re looking for documents containing intellectual property. The rules that you created worked to some extent — you found some documents and verified that they contain IP. But you can’t be sure that there are no false negatives.

Why not take a look at the content of documents you found and see if there are additional keywords they have in common? Then you can use those keywords to find even more documents containing IP. It can be hard to implement, but it makes a lot of sense, if you think about it. Try it. Plus, by doing this on a regular basis, you’ll maintain precise and up-to-date classification rules that will find your IP whenever and wherever it surfaces.
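One simple way to start is to count which words show up in most of the documents you have already confirmed. The Python sketch below does exactly that and nothing smarter: the stopword list is tiny, the 60% share is an arbitrary cut-off, and the sample documents are made up. Treat its output as suggestions to review by hand, not as ready-made keywords.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "for", "with", "that", "this", "from"}


def candidate_keywords(documents, known_keywords=(), min_share=0.6):
    """Words (3+ letters) found in at least `min_share` of the confirmed documents."""
    doc_count = Counter()
    skip = STOPWORDS | {k.lower() for k in known_keywords}
    for text in documents:
        # Count each word once per document, not once per occurrence.
        words = set(re.findall(r"[a-z]{3,}", text.lower()))
        doc_count.update(words - skip)
    needed = min_share * len(documents)
    return sorted(word for word, count in doc_count.items() if count >= needed)


confirmed = [
    "Patent draft for the turbine blade prototype",
    "Prototype test results, patent pending",
    "Confidential: turbine prototype schematics",
]
print(candidate_keywords(confirmed, known_keywords={"confidential"}))
```

Words that are common across your whole file share, not just the confirmed documents, will surface too, so comparing against a sample of unclassified files is a natural next refinement.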

Rinse and repeat steps 1 through 9

Did you scroll all the way down here to peek ahead at the tenth mystery bullet? If so, go back and read the first nine!

Although this point may seem obvious, it’s the most important one. Data classification is an ongoing effort. These solutions very rarely work the way you want them to out of the box. And while I realize that a dedicated data classification specialist is overkill for most organizations, that doesn’t mean that data classification should be expected to work forever with a predefined set of rules and parameters. Your data is yours and no one else’s. Nobody but you knows how best to classify your data. The process should be continuous. Try to strike the perfect balance between precision and the effort required to keep it high. At some point, more precision is simply not worth it.

The iterative nature of data classification goes beyond improving precision. As both the data you store and your business objectives evolve, so should your data classification strategy. Tomorrow there might be a new type of sensitive data, and you’ll need to be able to identify and secure it. Or an R&D team will need help organizing their documents for better efficiency, and you’ll need to work with them to come up with the most useful categories and classification rules. To adapt and accommodate these new business challenges, your data classification solution should not only be precise but flexible.

Ilia Sotnikov

Ilia Sotnikov is Security Strategist & Vice President of User Experience at Netwrix. He has over 20 years of experience in cybersecurity as well as IT management experience during his time at Netwrix, Quest Software, and Dell. In his current role, Ilia is responsible for technical enablement, UX design, and product vision across the entire product portfolio. Ilia’s main areas of expertise are data security and risk management. He works closely with analysts from firms such as Gartner, Forrester, and KuppingerCole to gain a deeper understanding of market trends, technology developments, and changes in the cybersecurity landscape. In addition, Ilia is a regular contributor at Forbes Tech Council where he shares his knowledge and insights regarding cyber threats and security best practices with the broader IT and business community.