Thoughts on working together, and supporting organizations
The need to classify the kinds of data we hold throughout the company is driven by the risk audits (security, compliance, etc.) that we have recently been exposed to.
We are probably not yet at a risky scale, but that will change with Series B and broader press. While we currently have ad-hoc data classifications, any audit that requires more than attestation will want to see more than we have today.
Conceivably, the purpose of this kind of process is to catch PII leaking. In that case, we would want faster alerting so that we could try to prevent or halt a breach, and to support forensics afterward.
We can probably say there is a broader goal: understanding when data from one kind of activity appears in other contexts - apps, teams, workflows, vendors, etc.
Sourcing
The ideal solution would handle data from all of the major places where we have it:
It’s likely that this list will morph as business practices change.
When thinking about classification, we must keep in mind the purposes that developing one serves:
We need to define ways to identify the data that belongs in each classification.
There may be more than one way to query for something in a given classification. This could result in query groups, or in building more complex compound queries.
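As a minimal sketch of the query-group idea (the classification names and regexes below are illustrative, not our actual rules):

```python
import re

# Illustrative only: each classification maps to a group of patterns,
# since one class of data may be matched by several different queries.
QUERY_GROUPS = {
    "pii": [
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),       # US SSN-like pattern
        re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),    # email address
    ],
    "payment": [
        re.compile(r"\b(?:\d[ -]*?){13,16}\b"),     # loose card-number pattern
    ],
}

def classify(text: str) -> set[str]:
    """Return every classification whose query group matches the text."""
    return {
        name
        for name, patterns in QUERY_GROUPS.items()
        if any(p.search(text) for p in patterns)
    }

# Example: classify("Contact jane@example.com") -> {"pii"}
```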
By taking a centralized approach, we have the opportunity to consolidate our internal data-classification needs under one roof. If we do not, we will need to maintain different queries in different locations/tools.
Monitoring will principally consist of scheduling how often those queries run and alerting on their results.
Regardless of which tool we choose, the consolidation pattern would be to either a) sync data from the other sources into the place we monitor, or b) talk to the APIs of each source and run queries against them for the classified-data patterns. That is a lot of tooling we can probably rent instead of build and maintain.
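A rough sketch of pattern (b), assuming each source exposes some search API we can wrap (the Source protocol, alert hook, and scheduling are hypothetical placeholders):

```python
from typing import Callable, Iterable, Protocol

class Source(Protocol):
    """Hypothetical wrapper around one data store's search API."""
    name: str
    def search(self, pattern: str) -> Iterable[dict]: ...

def run_monitoring_cycle(
    sources: list[Source],
    patterns: dict[str, str],
    alert: Callable[[str], None],
) -> None:
    """Run every classified-data pattern against every source and alert on hits.

    Intended to be invoked on a schedule (cron, Lambda, etc.); the run
    frequency and alert routing are the 'monitoring' knobs described above.
    """
    for source in sources:
        for classification, pattern in patterns.items():
            hits = list(source.search(pattern))
            if hits:
                alert(
                    f"{len(hits)} records matching '{classification}' "
                    f"found in {source.name}"
                )
```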
Alternatively, we could split data search and monitoring: AWS via Macie, G Suite via the data-protection features Google offers at certain service tiers and configurations, and the remaining data/doc repos with their own tools. The downside of that alternative is that we would need to keep the rules for each in sync.
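If we did take the split route, the usual mitigation is a single source of truth that each tool's rules are generated from. A sketch under that assumption (the output shapes are illustrative only, not the exact Macie or Google DLP payload formats):

```python
# One shared rule registry; per-tool configs are generated from it so the
# rules never drift between Macie, Google-side DLP, and other repos.
SHARED_RULES = {
    "us_ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "email":  r"[\w.+-]+@[\w-]+\.[\w.-]+",
}

def to_macie_identifiers(rules: dict[str, str]) -> list[dict]:
    # Illustrative shape only; the real Macie custom data identifier
    # payload should be taken from the AWS documentation.
    return [{"name": name, "regex": regex} for name, regex in rules.items()]

def to_google_dlp_infotypes(rules: dict[str, str]) -> list[dict]:
    # Illustrative shape only; the real custom infoType structure
    # should be taken from Google's documentation.
    return [
        {"infoType": {"name": name.upper()}, "regex": {"pattern": regex}}
        for name, regex in rules.items()
    ]
```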
Prevalent models for how data flows based on its classification include the following (a configuration sketch follows this list):
Removal: the data matches metadata or per-record patterns that identify it as within ROT scope (Redundant, Obsolete, Trivial), and it should be removed, with soft- or hard-delete options.
Retention in place: based on the source’s metadata or a native search, a policy defines that the data needs to remain where it is, whether for compliance or for further processing within workflows native to that system.
Chain of custody: the data falls within a scope that requires a defined chain of actions to be taken on it.
Source-level flag: a source is marked as containing this kind of data at all, even without precise information about which records. The primary value of this is for access control and entitlements auditing.
Scan and label: a source is scanned ‘natively’ for data, and data-level labels are then attached. Note that the range of these labels/features can vary widely, depending on whether the data classification tool downloads/extracts the information from the source.
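One way these models could be expressed is as a single policy table, so each classification carries its flow action (the action names and classifications here are illustrative):

```python
from enum import Enum

class FlowAction(Enum):
    REMOVE = "remove"                # ROT data: soft- or hard-delete
    RETAIN_IN_PLACE = "retain"       # must stay where it is (compliance/workflows)
    CHAIN_OF_CUSTODY = "custody"     # requires a defined chain of actions
    FLAG_SOURCE = "flag_source"      # mark the source for access/entitlement review
    SCAN_AND_LABEL = "label"         # native scan attaches data-level labels

# Illustrative policy table mapping classifications to flow models.
POLICY = {
    "rot":        {"action": FlowAction.REMOVE, "delete_mode": "soft"},
    "financial":  {"action": FlowAction.RETAIN_IN_PLACE, "reason": "compliance"},
    "legal_hold": {"action": FlowAction.CHAIN_OF_CUSTODY},
    "pii":        {"action": FlowAction.SCAN_AND_LABEL},
}
```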
As you may already see, the shape of this kind of project is principally driven by the amount of data that needs to be searched, and the amount that needs to be moved.
For all data stores, inventory the size of the data and its rate of growth. These factors will determine whether you need to search in place, and therefore the kind of solution or provider you work with.
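As one concrete example for an S3 store (bucket names below are placeholders; other stores would need their own inventory calls), a minimal sketch using boto3:

```python
import boto3

def bucket_size_bytes(bucket: str) -> int:
    """Sum object sizes in one bucket; fine for small buckets, but prefer
    S3 Inventory or CloudWatch storage metrics at larger scale."""
    s3 = boto3.client("s3")
    total = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total

# Recording this periodically (e.g. weekly) gives the growth rate that
# drives the search-in-place vs. copy-and-index decision.
for bucket in ["example-app-data", "example-analytics"]:  # placeholder names
    print(bucket, bucket_size_bytes(bucket))
```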
As one example, remember that backing up API-sourced data will always incur the cost of the data’s framing. E.g., if the data is in JSON format, the JSON structure and metadata will add to storage and search costs. You can compress such data reliably, but compression adds complexity to interfacing with the data, potentially requiring it to be parceled into chunks.
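A minimal sketch of that trade-off, compressing API-sourced records as newline-delimited JSON in fixed-size parcels (the record source, file naming, and chunk size are illustrative):

```python
import gzip
import json
from itertools import islice
from typing import Iterable, Iterator

def parcels(records: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Split a record stream into fixed-size parcels."""
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk

def write_parcels(records: Iterable[dict], prefix: str, size: int = 10_000) -> None:
    # Each parcel becomes one gzipped NDJSON file; the JSON framing is what
    # inflates storage, and gzip claws most of that back at the cost of
    # having to decompress before searching.
    for i, chunk in enumerate(parcels(records, size)):
        with gzip.open(f"{prefix}-{i:05d}.ndjson.gz", "wt", encoding="utf-8") as f:
            for record in chunk:
                f.write(json.dumps(record) + "\n")
```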
Could be things like