
Sherlock AI: sensitive data discovery in sources

Sherlock AI is a Striim AI-powered feature that enables you to discover sensitive data that may flow into your applications from the configured sources in your Striim applications. You can launch Sherlock AI from the flow designer to discover sensitive data in the sources connected to that application, or you can launch Sherlock from the Striim AI menu to discover sensitive data in the sources connected to one or more applications.

The Striim admin can run Sherlock AI on any application that contains a source, regardless of whether that application is running. You do not have to modify your application to run Sherlock. Typically, you run Sherlock AI at design time and use the Sherlock report to design your Striim application so that it adheres to your organization's data governance policies.

How does Sherlock AI work?

Sherlock AI queries a random sample of data from the datasets that you have configured for your Striim application and uses a combination of methods, such as classification algorithms, field names, pattern matching, and data analysis, to discover sensitive information in the sampled source data. Sherlock AI does not query source data that is not configured in your applications, does not share your data with other users or customers, and uses only the AI models that you have configured.

Sherlock employs different sampling methods based on the types of readers in the application. For readers that support initial load, Sherlock reads a specific number of events from every configured source entity (such as a table, collection, directory, topic-partition, or other logical grouping of data objects in the source) and takes a random sample set from those events. For readers that support continuous replication, such as CDC readers, Incremental Batch Reader, or Kafka Reader, Sherlock reads up to a predefined number of events from the reader or continues reading until a predefined timeout, whichever comes first, and then takes a sample set from the events read so far.

The current sampling policies, which are not user configurable, are as follows:

  • For readers that support Initial Load: count=100;samplecount=10;sampletype=Random

    This policy fetches 100 events from every configured table, collection or other source entity, and takes 10 random sample events from those 100 events for analysis.

  • For readers that support continuous replication: count=1000;samplecount=10;sampletype=Random

    This policy fetches up to 1,000 events from the source or waits up to 3 minutes to read data, whichever comes first. Sherlock then takes up to 100 random samples from the events that it has read from the source for analysis.
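The fetch-then-sample behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Striim's implementation: the function names `parse_policy` and `sample_events` are hypothetical, the policy string is assumed to be a simple `key=value;key=value` list, and the 3-minute timeout for continuous readers is not modeled.

```python
import random
from itertools import islice


def parse_policy(policy: str) -> dict:
    """Parse a 'key=value;key=value' policy string (assumed format)."""
    pairs = (item.split("=", 1) for item in policy.split(";") if item)
    parsed = {key.strip().lower(): value.strip() for key, value in pairs}
    return {
        "count": int(parsed["count"]),
        "samplecount": int(parsed["samplecount"]),
        "sampletype": parsed["sampletype"],
    }


def sample_events(events, policy: str, seed=None):
    """Read up to `count` events, then pick up to `samplecount` at random."""
    cfg = parse_policy(policy)
    if cfg["sampletype"].lower() != "random":
        raise ValueError(f"unsupported sampletype: {cfg['sampletype']}")
    # Stop reading once `count` events have been fetched.
    fetched = list(islice(events, cfg["count"]))
    # A source entity may hold fewer events than the requested sample size.
    k = min(cfg["samplecount"], len(fetched))
    return random.Random(seed).sample(fetched, k)


# Hypothetical source entity with 500 events; only the first 100 are read.
initial_load_policy = "count=100;samplecount=10;sampletype=Random"
sample = sample_events(iter(range(500)), initial_load_policy, seed=1)
```

With the initial load policy, `sample` holds 10 distinct events, all drawn from the first 100 events read. Events beyond the first 100 can never be selected, which is why the sample may not be representative of a large table.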

Sherlock's sampling-based discovery method is lightweight compared to tools that scan the entire source database for sensitive data. Sherlock does not place a heavy load on the source database server and can provide relatively accurate results.

The accuracy of Sherlock's discovery depends on the data that it has sampled from the source and on the AI engines used. If the sampled data is not representative of the source dataset, then Sherlock may suggest incorrect results. For example, assume that Sherlock is reading from an Oracle database using the Oracle CDC reader, and has read 1,000 events from 50 source tables that were active during Sherlock's sampling period and published updates that Sherlock read, while 150 other source tables did not show any activity during the same sampling period. In this case, Sherlock can only analyze and report on some or all of the 50 active tables based on the 100 random samples that it chose for analysis, while it will remain silent about the 150 tables that did not publish updates during Sherlock's sampling period.
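The coverage gap in this example can be made concrete with a small simulation. The sketch below is hypothetical: the table names, the even spread of 1,000 events across the 50 active tables, and the `tables_covered` helper are assumptions made for illustration only.

```python
import random
from collections import Counter


def tables_covered(events, sample_size, seed=None):
    """Return a Counter of table name -> number of sampled events."""
    k = min(sample_size, len(events))
    sample = random.Random(seed).sample(events, k)
    return Counter(table for table, _ in sample)


# Hypothetical source: 50 tables with activity, 150 tables without.
active = [f"ACTIVE_{i:02d}" for i in range(50)]
inactive = [f"IDLE_{i:03d}" for i in range(150)]

# 1,000 change events, spread evenly over the active tables (assumption).
events = [(active[i % 50], i) for i in range(1000)]

covered = tables_covered(events, sample_size=100, seed=42)
unreported = [t for t in active + inactive if t not in covered]
```

Every table in `inactive` ends up in `unreported` regardless of the seed, because a table that published no events during the sampling period cannot appear in the sample. Some active tables may also be missing from `covered` if none of their events were chosen.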

While Sherlock is reading data from the source, you may see a temporary drop in performance in your Striim applications. Sherlock is designed not to affect any checkpoints that you or your application may have set for operations or recovery.

For a list of predefined sensitive data types that Sherlock can discover, see Sensitive Data Identifiers.

For more information on using Sherlock AI to discover sensitive data in a single app or multiple apps, see:

Free-form text detection

Sherlock AI uses PII DeepScan to detect sensitive data in both structured fields and free-form text fields. In structured fields, the sensitive information spans the entire content of the column or field. In free-form text fields, sensitive information can appear anywhere within unstructured text data written in natural language.

Sherlock AI can identify multiple different sensitive data identifiers within a single free-form text field. For example, a "Doctor's Notes" column might contain Social Security Numbers, phone numbers, and bank account numbers embedded within paragraphs of medical observations.

Example: Structured fields vs. free-form text

The following table illustrates the difference between structured fields and free-form text fields.

| Patient Name | Age | Phone Number | Credit Card Number | Doctor's Notes (Free-form text) |
|---|---|---|---|---|
| John Smith | 32 | 650-789-7654 | 5674123567895432 | Patient visited the clinic on 10/16/2023 and presented with symptoms... has a social security number 234-56-7897... scheduled to visit next week. |
| Alice Mason | 45 | 652-678-1234 | 9876543213452678 | Patient presented with conditions... having bank account number 2345678924 on file for billing purposes. |

In this example, the first four columns (Patient Name, Age, Phone Number, and Credit Card Number) are structured fields. The entire cell content is detected as a single sensitive data identifier and can be masked or encrypted according to the configured settings.

The fifth column (Doctor's Notes) contains free-form text. Each cell contains sensitive information, but the entire cell content is not sensitive. A free-form text field may contain zero, one, or more than one occurrence of sensitive data, and may contain multiple types of sensitive data identifiers.

Example: Free-form text with multiple SDI types

The following paragraph demonstrates free-form text containing multiple types of sensitive data identifiers.

"John Doe, a resident of 123 Maple Street, Springfield, IL 62704, recently contacted customer support from his work number, (312) 555-7890. His email, john.doe82@gmail.com, was flagged when a large transaction from his Bank of America account ending in 6789 was processed. He received a text from (800) 123-4567 claiming to be from the fraud department. The suspicious transfer totaling $5,000 was sent to an account with IBAN DE89370400440532013000."

In this example, Sherlock AI can detect and identify multiple sensitive data identifiers including addresses, phone numbers, email addresses, bank account numbers, and IBANs, all within a single free-form text field.
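To illustrate how several identifier types can be located inside one block of text, the following sketch applies simple regular expressions to the paragraph above. It is not how PII DeepScan works internally. The patterns, the labels `EMAIL`, `PHONE`, and `IBAN`, and the `find_identifiers` helper are assumptions for illustration, and they cover only three of the identifier types mentioned (addresses and partial account numbers are not handled).

```python
import re

# Simplified, illustrative patterns; real detectors need validation and context.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}-)\d{3}-\d{4}\b"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def find_identifiers(text: str) -> dict:
    """Return every match for each pattern, in order of appearance."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


notes = (
    "John Doe, a resident of 123 Maple Street, Springfield, IL 62704, "
    "recently contacted customer support from his work number, (312) 555-7890. "
    "His email, john.doe82@gmail.com, was flagged when a large transaction "
    "from his Bank of America account ending in 6789 was processed. "
    "He received a text from (800) 123-4567 claiming to be from the fraud "
    "department. The suspicious transfer totaling $5,000 was sent to an "
    "account with IBAN DE89370400440532013000."
)

found = find_identifiers(notes)
```

Here `found["PHONE"]` contains two matches from the same field, which shows how a single free-form cell can yield more than one occurrence of the same identifier type.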

Interpreting Sherlock reports for free-form text

When Sherlock AI analyzes free-form text fields, the report displays the count of each sensitive data identifier (SDI) type detected within that field. Unlike structured fields, where at most one SDI type is detected per row, free-form text fields can contain multiple SDI types, and the same SDI type can appear multiple times within a single field.

The SDI counts in the report represent the number of times each SDI type was found across all sampled rows for that field. These counts do not represent confidence scores or percentages. For free-form text columns, the sum of SDI counts can exceed the number of rows sampled because a single field can contain multiple sensitive data identifiers.
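The counting rule can be shown with a short tally. In this sketch the SDI labels and the per-row detection lists are hypothetical inputs, and `sdi_counts` is an illustrative helper rather than part of Striim.

```python
from collections import Counter


def sdi_counts(detections_per_row):
    """Total occurrences of each SDI type across all sampled rows."""
    totals = Counter()
    for row in detections_per_row:
        totals.update(row)
    return totals


# Hypothetical detections for one free-form column across 3 sampled rows.
rows = [
    ["SSN", "PHONE_NUMBER", "PHONE_NUMBER"],  # three occurrences in one cell
    ["BANK_ACCOUNT"],
    [],  # no sensitive data found in this cell
]

counts = sdi_counts(rows)
total_occurrences = sum(counts.values())
```

In this sample, `total_occurrences` is 4 while only 3 rows were sampled, so the counts are occurrence totals and cannot be read as a percentage of rows.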

Limitations

  1. Sherlock AI is currently available to the Striim admin user only.

  2. Sherlock AI's accuracy depends on the AI engines used and its sampling of the source data. AI features are not always accurate or error-free, and you acknowledge and agree that Striim AI (including Sherlock AI) may not properly detect, classify or encrypt, mask or otherwise protect all sensitive and other targeted information.

  3. Sherlock AI does not support binary data sources such as image, movie, audio, PDF or application files, or binary data types such as BLOB (Binary Large Object).

  4. Sherlock AI can process input records up to 500,000 characters per request. This limit applies to the combined character count across all rows and fields in a single request. Records exceeding this limit cause the application to halt with a character limit exceeded error.
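As a rough way to reason about the character limit, the sketch below estimates the combined character count of a batch of rows before it is submitted. How Sherlock AI counts characters internally (for example, whether field names are included or how non-string values are serialized) is not specified here, so the helper names and the counting rule are assumptions.

```python
MAX_CHARS_PER_REQUEST = 500_000  # limit stated in the limitations list


def request_char_count(rows):
    """Sum the length of every non-null field value across all rows.

    Assumption: only field values count, each converted with str().
    """
    return sum(
        len(str(value))
        for row in rows
        for value in row.values()
        if value is not None
    )


def within_limit(rows, limit=MAX_CHARS_PER_REQUEST):
    """Return True if the batch fits within the character limit."""
    return request_char_count(rows) <= limit


# Hypothetical batch: one short row and one oversized free-form field.
small_batch = [{"name": "John Smith", "notes": "x" * 100}]
large_batch = [{"notes": "x" * 500_001}]
```

A check like this is useful mainly for very large free-form text fields, since a single oversized value can push a whole request over the limit.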