Sherlock AI: sensitive data discovery in sources
Sherlock AI is a Striim AI-powered feature that discovers sensitive data that may flow into your Striim applications from their configured sources. You can launch Sherlock AI from the flow designer to discover sensitive data in the sources connected to that application, or from the Striim AI menu to discover sensitive data in the sources connected to one or more applications.
The Striim admin can run Sherlock AI on any application that contains a source, whether or not that application is running. You do not have to modify your application to run Sherlock. Typically, you run Sherlock AI at design time and use the Sherlock report to design your Striim application so that it adheres to your organization's data governance policies.
How does Sherlock AI work?
Sherlock AI queries a random sample of data from the datasets that you have configured for your Striim application. It uses a combination of methods, such as classification algorithms, field names, pattern matching, and data analysis, to discover sensitive information in the sampled source data. Sherlock AI does not query source data that is not configured in your applications, does not share your data with other users or customers, and uses only the AI models that you have configured.
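As a rough illustration of how field names and pattern matching can be combined on sampled values, consider the following minimal sketch. It is not Sherlock's implementation: the hint table, the regular expressions, the labels, and the min_match_ratio threshold are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical hints and patterns for illustration only; these are not
# the identifiers or rules that Sherlock AI actually uses.
FIELD_NAME_HINTS = {"email": "EMAIL", "ssn": "US_SSN"}
VALUE_PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_field(name, sampled_values, min_match_ratio=0.8):
    """Return the set of labels suggested by the field name or by its values."""
    findings = set()

    # Signal 1: the field name contains a known hint.
    lowered = name.lower()
    for hint, label in FIELD_NAME_HINTS.items():
        if hint in lowered:
            findings.add(label)

    # Signal 2: most sampled, non-null values match a known pattern.
    values = [str(v) for v in sampled_values if v is not None]
    if values:
        for label, pattern in VALUE_PATTERNS.items():
            matched = sum(1 for v in values if pattern.match(v))
            if matched / len(values) >= min_match_ratio:
                findings.add(label)

    return findings


# The field name gives no hint here, but the sampled values look like emails.
print(classify_field("contact", ["a@example.com", "b@example.org"]))
```

Using two independent signals means a field can be flagged even when its name is uninformative, or when the sampled values happen to be empty.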
Sherlock employs different sampling methods based on the types of readers in the application. For readers that support initial load, Sherlock reads a specific number of events from every configured source entity (such as a table, collection, directory, topic-partition, or other logical grouping of data objects in the source), and takes a random sample set from those events. For readers that support continuous replication, such as CDC readers, Incremental Batch Reader, or Kafka reader, Sherlock reads up to a predefined number of events from the reader or continues reading up to a predefined timeout, whichever comes first, and then takes a sample set from the events read so far.
The current sampling policies, which are not user configurable, are as follows:
For readers that support Initial Load:
count=100;samplecount=10;sampletype=Random
This policy fetches 100 events from every configured table, collection or other source entity, and takes 10 random sample events from those 100 events for analysis.
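The initial-load policy above can be sketched as follows. This is only an illustration of the described behavior, not Striim code: parse_policy and sample_entity are hypothetical helper names, and the events are stand-in integers.

```python
import random
from itertools import islice


def parse_policy(policy):
    """Parse a 'key=value;key=value' policy string into a dict."""
    fields = dict(item.split("=", 1) for item in policy.split(";") if item)
    return {
        "count": int(fields["count"]),
        "samplecount": int(fields["samplecount"]),
        "sampletype": fields["sampletype"],
    }


def sample_entity(events, policy, rng=None):
    """Fetch the first `count` events of one entity, then sample randomly."""
    rng = rng or random.Random()
    fetched = list(islice(events, policy["count"]))
    # An entity may hold fewer events than the requested sample size.
    k = min(policy["samplecount"], len(fetched))
    return rng.sample(fetched, k)


policy = parse_policy("count=100;samplecount=10;sampletype=Random")
# Stand-in for one table with 500 rows: only the first 100 are fetched,
# and 10 of those are picked at random for analysis.
picked = sample_entity(iter(range(500)), policy)
print(len(picked))
```

In this sketch the same steps would be repeated once per configured source entity, so the total number of analyzed events grows with the number of entities.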
For readers that support continuous replication:
count=1000;samplecount=100;sampletype=Random
This policy fetches up to 1,000 events from the source or waits up to 3 minutes for data, whichever comes first. Sherlock then takes up to 100 random samples for analysis from the events that it has read from the source.
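The "whichever comes first" behavior described above can be sketched as a read loop bounded by both an event count and a deadline. This is a simplified illustration, not Striim code: read_until, sample_events, and the poll callable (which returns an event, or None when nothing is available) are assumptions made for this example.

```python
import random
import time


def read_until(poll, max_events, timeout_s, clock=time.monotonic):
    """Collect events until max_events are read or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    events = []
    while len(events) < max_events and clock() < deadline:
        event = poll()  # hypothetical non-blocking read; None means no data yet
        if event is not None:
            events.append(event)
    return events


def sample_events(events, sample_count, rng=None):
    """Take up to sample_count random events from whatever was read."""
    rng = rng or random.Random()
    return rng.sample(events, min(sample_count, len(events)))


# A busy stand-in source reaches the event limit long before the timeout.
stream = iter(range(5000))
events = read_until(lambda: next(stream), max_events=1000, timeout_s=180)
samples = sample_events(events, sample_count=100)
print(len(events), len(samples))
```

If the source is quiet, the loop ends at the deadline with fewer events, and the sample is drawn from whatever arrived in that window. A real reader would block or sleep between polls rather than spin.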
Sherlock's sampling-based discovery method is lightweight compared to tools that scan the entire source database for sensitive data. Sherlock does not place a heavy load on the source database server, and can still provide relatively accurate results.
The accuracy of Sherlock's discovery depends on the data that it has sampled from the source and on the AI engines used. If the sampled data is not representative of the source dataset, Sherlock may report incorrect or incomplete results. For example, assume that Sherlock is reading from an Oracle database using the Oracle CDC reader, and has read 1,000 events from the 50 source tables that published updates during Sherlock's sampling period, while 150 other source tables showed no activity during the same period. In this case, Sherlock can only analyze and report on some or all of the 50 active tables, based on the 100 random samples that it chose for analysis. It will remain silent about the 150 tables that did not publish updates during the sampling period.
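The arithmetic behind this example can be checked with a short sketch. The function names are hypothetical, and the simulation assumes that events are spread uniformly across the active tables, which a real workload may not do.

```python
import random


def coverage_upper_bound(active_tables, idle_tables, sample_count):
    """Best case: how many tables the samples can cover, and what share of all tables."""
    total_tables = active_tables + idle_tables
    # Idle tables publish no events, and each sample comes from one table.
    reachable = min(active_tables, sample_count)
    return reachable, reachable / total_tables


def simulate_tables_seen(active_tables, events_read, sample_count, seed=None):
    """Count the distinct active tables that appear in one random sample."""
    rng = random.Random(seed)
    # Assumption: each event comes from a uniformly chosen active table.
    events = [rng.randrange(active_tables) for _ in range(events_read)]
    sampled = rng.sample(events, min(sample_count, len(events)))
    return len(set(sampled))


print(coverage_upper_bound(active_tables=50, idle_tables=150, sample_count=100))
# → (50, 0.25)
print(simulate_tables_seen(active_tables=50, events_read=1000, sample_count=100))
```

Even in the best case, the report covers at most 50 of the 200 tables, or 25 percent. The simulation shows that a given run may see fewer than 50 tables, because several samples can come from the same table.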
While Sherlock is reading data from the source, you may see a temporary drop in performance in your Striim applications. Sherlock is designed such that it will not impact any checkpoints that you or your application may have set for operations or recovery.
For a list of predefined sensitive data types that Sherlock can discover, see Sensitive Data Identifiers.
For more information on using Sherlock AI to discover sensitive data in a single app or multiple apps, see:
Limitations
Sherlock AI is currently available to the Striim admin user only.
Sherlock AI’s accuracy depends on the AI engines used and its sampling of the source data. AI features are not always accurate or error-free, and you acknowledge and agree that Striim AI (including Sherlock AI) may not properly detect, classify or encrypt, mask or otherwise protect all sensitive and other targeted information.
Sherlock AI does not currently support the detection of sensitive information in free-form text where only a certain part or parts of the textual content are sensitive.
Sherlock AI does not support binary data sources such as image, movie, audio, PDF or application files, or binary data types such as BLOB (Binary Large Object).