Commit

#176 Updating filter docs. Using fuzzy property instead of sensitivity level to apply checks.
jzonthemtn committed Dec 23, 2024
1 parent 649d402 commit 6c1503f
Showing 15 changed files with 93 additions and 22 deletions.
2 changes: 1 addition & 1 deletion docs/docs/filter_policies/filters.md
@@ -21,7 +21,7 @@ Phileas uses several methods to identify phEyeFilter's names.
| [First Names](filters/persons_names/first-names.md) | Identifies common first names |
| [Surnames](filters/persons_names/surnames.md) | Identifies common surnames |
| [Person's Names (NER)](filters/persons_names/ph-eye) | Identifies full names using natural language processing analysis |
| [Physician's Names (NER)](filters/persons_names/physician-names-ner.md) | Identifies physician names using natural language processing analysis |
| [Physician's Names (NER)](filters/persons_names/physician-names) | Identifies physician names using natural language processing analysis |

### Other Filters

@@ -16,12 +16,12 @@ At least one of `terms` or `files` must be provided.
### Optional Parameters

| Parameter | Description | Default Value |
| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------- |
| ---------------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| --------------------- |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `fuzzy` | When set to true, the dictionary will employ fuzzy comparisons. Use the `sensitivity` parameter to control the level of fuzziness. Setting this value to false will disable fuzziness and provide a higher level of performance. | `false` |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `off` meaning only exact matches, `low`, `medium`, and `high`. Only applies when `fuzzy` is set to `true`. | `medium` |
| `classification` | Used to apply an arbitrary label to the identifier, such as "patient-id", or "account-number." | `"custom-identifier"` |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. Only applies when `fuzzy` is set to `true`. | `medium` |
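
The interaction between `fuzzy` and `sensitivity` described in the table can be sketched as follows. This is a minimal illustration of the documented semantics, not the Phileas implementation: `DictionaryMatchMode`, `MatchMode`, and `resolve` are hypothetical names, and falling back to `medium` when no sensitivity is given is an assumption taken from the documented default.

```java
import java.util.Locale;

// Hypothetical helper illustrating the documented semantics; not part of Phileas.
final class DictionaryMatchMode {

    enum MatchMode { EXACT, FUZZY_LOW, FUZZY_MEDIUM, FUZZY_HIGH }

    // `sensitivity` is only consulted when `fuzzy` is true.
    static MatchMode resolve(boolean fuzzy, String sensitivity) {
        if (!fuzzy) {
            return MatchMode.EXACT;
        }
        // Assumption: a missing value falls back to the documented default, `medium`.
        final String level = sensitivity == null ? "medium" : sensitivity.toLowerCase(Locale.ROOT);
        switch (level) {
            case "low":
                return MatchMode.FUZZY_LOW;
            case "medium":
                return MatchMode.FUZZY_MEDIUM;
            case "high":
                return MatchMode.FUZZY_HIGH;
            default:
                throw new IllegalArgumentException("Unknown sensitivity: " + sensitivity);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(false, "high")); // fuzzy is off, so sensitivity is ignored
        System.out.println(resolve(true, null));    // fuzzy is on, default sensitivity applies
    }
}
```

Under these assumptions, a dictionary with `fuzzy` left at its default of `false` is matched exactly regardless of what `sensitivity` says.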

### Filter Strategies

1 change: 1 addition & 0 deletions docs/docs/filter_policies/filters/locations/cities.md
@@ -14,6 +14,7 @@ This filter has no required parameters.
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| `cityFilterStrategies` | A list of filter strategies. | None |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |
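
The `capitalized` option added to these filters can be read as a gate applied before a term is redacted. A minimal sketch, assuming the check looks only at the first character of the matched text; `CapitalizationGate` and `allows` are hypothetical names, not Phileas APIs:

```java
// Hypothetical illustration of the `capitalized` option; not part of Phileas.
final class CapitalizationGate {

    // Returns true when the matched text may be redacted under the given setting.
    static boolean allows(String matchedText, boolean capitalized) {
        if (!capitalized) {
            // Capitalization is not required, so every match passes.
            return true;
        }
        return !matchedText.isEmpty() && Character.isUpperCase(matchedText.charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(allows("fayette", false));
        System.out.println(allows("fayette", true));
        System.out.println(allows("Fayette", true));
    }
}
```

With the documented default of `false`, lowercase occurrences remain eligible for redaction; setting the option to `true` narrows matches to terms that begin with an uppercase letter.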

### Filter Strategies

3 changes: 2 additions & 1 deletion docs/docs/filter_policies/filters/locations/counties.md
@@ -11,9 +11,10 @@ This filter has no required parameters.
### Optional Parameters

| Parameter | Description | Default Value |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
|--------------------------|---------------------------------------------------------------------------------------------------------------------------------------|---------------|
| `countyFilterStrategies` | A list of filter strategies. | None |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

@@ -14,6 +14,7 @@ This filter has no required parameters.
| -------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| `hospitalAbbreviationFilterStrategies` | A list of filter strategies. | None |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

1 change: 1 addition & 0 deletions docs/docs/filter_policies/filters/locations/hospitals.md
@@ -14,6 +14,7 @@ This filter has no required parameters.
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| `hospitalFilterStrategies` | A list of filter strategies. | None |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

@@ -15,6 +15,7 @@ This filter has no required parameters.
| `stateAbbreviationsFilterStrategies` | A list of filter strategies. | None |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

1 change: 1 addition & 0 deletions docs/docs/filter_policies/filters/locations/states.md
@@ -15,6 +15,7 @@ This filter has no required parameters.
| `stateFilterStrategies` | A list of filter strategies. | None |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

@@ -16,6 +16,7 @@ This filter has no required parameters.
| `firstNameFilterStrategies` | A list of filter strategies. | None |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

@@ -15,6 +15,7 @@ This filter has no required parameters.
| `physicianNameFilterStrategies` | A list of filter strategies. | None |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

@@ -16,6 +16,7 @@ This filter has no required parameters.
| `surnameFilterStrategies` | A list of filter strategies. | None |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

@@ -801,7 +801,7 @@ public List<Filter> getFiltersForPolicy(final Policy policy, final Map<String, M
final boolean capitalized = customDictionary.isCapitalized();
LOGGER.info("Custom dictionary contains {} terms.", terms.size());

if(!SensitivityLevel.OFF.getName().equalsIgnoreCase(customDictionary.getSensitivity())) {
if(customDictionary.isFuzzy()) {

final SensitivityLevel sensitivityLevel = SensitivityLevel.fromName(customDictionary.getSensitivity());
enabledFilters.add(new FuzzyDictionaryFilter(FilterType.CUSTOM_DICTIONARY, filterConfiguration, sensitivityLevel, terms, capitalized));
@@ -100,4 +100,46 @@ public void filterCountiesHigh() throws Exception {

}

@Test
public void filterCountiesOffWithExactMatch() throws Exception {

final FilterConfiguration filterConfiguration = new FilterConfiguration.FilterConfigurationBuilder()
.withStrategies(List.of(new CountyFilterStrategy()))
.withAlertService(alertService)
.withAnonymizationService(new CountyAnonymizationService(new LocalAnonymizationCacheService()))
.withWindowSize(windowSize)
.build();

final FuzzyDictionaryFilter filter = new FuzzyDictionaryFilter(FilterType.LOCATION_COUNTY, filterConfiguration, SensitivityLevel.OFF, true);

FilterResult filterResult = filter.filter(getPolicy(), "context", "documentid", PIECE, "Lived in Fayette", attributes);

showSpans(filterResult.getSpans());

Assertions.assertEquals(1, filterResult.getSpans().size());
Assertions.assertTrue(checkSpan(filterResult.getSpans().get(0), 9, 16, FilterType.LOCATION_COUNTY));
Assertions.assertEquals("Fayette", filterResult.getSpans().get(0).getText());

}

@Test
public void filterCountiesOffNoExactMatch() throws Exception {

final FilterConfiguration filterConfiguration = new FilterConfiguration.FilterConfigurationBuilder()
.withStrategies(List.of(new CountyFilterStrategy()))
.withAlertService(alertService)
.withAnonymizationService(new CountyAnonymizationService(new LocalAnonymizationCacheService()))
.withWindowSize(windowSize)
.build();

final FuzzyDictionaryFilter filter = new FuzzyDictionaryFilter(FilterType.LOCATION_COUNTY, filterConfiguration, SensitivityLevel.OFF, true);

FilterResult filterResult = filter.filter(getPolicy(), "context", "documentid", PIECE, "Lived in Fyette", attributes);

showSpans(filterResult.getSpans());

Assertions.assertEquals(0, filterResult.getSpans().size());

}

}
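
The two tests above exercise exact matching only: with `SensitivityLevel.OFF`, "Fayette" yields one span at offsets 9 to 16, and the misspelled "Fyette" yields none. A self-contained sketch of that exact-match step follows. The pattern construction is not shown in this diff, so the word-bounded regular expression is an assumption, and `ExactMatchSketch` and `findSpan` are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the real filter builds its pattern elsewhere.
final class ExactMatchSketch {

    // Returns {start, end} of the first whole-word occurrence of entry, or null if absent.
    static int[] findSpan(String input, String entry) {
        // Assumption: entries are matched literally and on word boundaries.
        final Pattern pattern = Pattern.compile("\\b" + Pattern.quote(entry) + "\\b");
        final Matcher matcher = pattern.matcher(input);
        if (matcher.find()) {
            final int start = matcher.start();
            // The end offset is derived from the entry length, as in the filter code.
            return new int[] { start, start + entry.length() };
        }
        return null;
    }

    public static void main(String[] args) {
        final int[] span = findSpan("Lived in Fayette", "Fayette");
        System.out.println(span[0] + "-" + span[1]); // prints "9-16"
        System.out.println(findSpan("Lived in Fyette", "Fayette") == null); // prints "true"
    }
}
```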
@@ -89,34 +89,42 @@ public FilterResult filter(Policy policy, String context, String documentId, int

// Exact matches.
if (matcher.find()) {

final int startPosition = matcher.start();
spans.add(createSpan(input, startPosition, startPosition + entry.length(), 1.0, context, documentId, entry, policy, attributes));

} else {

// Fuzzy matches.
final int spacesInEntry = StringUtils.countMatches(entry, " ");
// Only when the sensitivity level is not "off".
if(sensitivityLevel != SensitivityLevel.OFF) {

// Fuzzy matches.
final int spacesInEntry = StringUtils.countMatches(entry, " ");

for (final Position position : ngrams.get(spacesInEntry).keySet()) {

for(final Position position : ngrams.get(spacesInEntry).keySet()) {
// Compare string distance between word and ngrams.
final String ngram = ngrams.get(spacesInEntry).get(position);

// Compare string distance between word and ngrams.
final String ngram = ngrams.get(spacesInEntry).get(position);
if (ngram.length() > 2) {

if(ngram.length() > 2) {
if (requireCapitalization && Character.isUpperCase(ngram.charAt(0))) {

if (requireCapitalization && Character.isUpperCase(ngram.charAt(0))) {
final int start = position.getStart();
final int end = position.getEnd();

final int start = position.getStart();
final int end = position.getEnd();
// TODO: Should this be customizable in the dictionary's properties in the filter policy?
final LevenshteinDistance levenshteinDistance = LevenshteinDistance.getDefaultInstance();
final int distance = levenshteinDistance.apply(entry, ngram);

final LevenshteinDistance levenshteinDistance = LevenshteinDistance.getDefaultInstance();
final int distance = levenshteinDistance.apply(entry, ngram);
if (sensitivityLevel == SensitivityLevel.HIGH && distance < 1) {
spans.add(createSpan(input, start, end, 0.9, context, documentId, entry, policy, attributes));
} else if (sensitivityLevel == SensitivityLevel.MEDIUM && distance <= 2) {
spans.add(createSpan(input, start, end, 0.7, context, documentId, entry, policy, attributes));
} else if (sensitivityLevel == SensitivityLevel.LOW && distance < 3) {
spans.add(createSpan(input, start, end, 0.5, context, documentId, entry, policy, attributes));
}

if (sensitivityLevel == SensitivityLevel.HIGH && distance < 1) {
spans.add(createSpan(input, start, end, 0.9, context, documentId, entry, policy, attributes));
} else if (sensitivityLevel == SensitivityLevel.MEDIUM && distance <= 2) {
spans.add(createSpan(input, start, end, 0.7, context, documentId, entry, policy, attributes));
} else if (sensitivityLevel == SensitivityLevel.LOW && distance < 3) {
spans.add(createSpan(input, start, end, 0.5, context, documentId, entry, policy, attributes));
}

}
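
The fuzzy branch above compares the Levenshtein distance between a dictionary entry and each n-gram against a per-level threshold. The sketch below mirrors the thresholds and confidences shown in the diff (`HIGH`: distance < 1 with 0.9, `MEDIUM`: distance <= 2 with 0.7, `LOW`: distance < 3 with 0.5). It uses a hand-written distance function in place of Apache Commons Text's `LevenshteinDistance` so that it runs without dependencies; `FuzzyThresholdSketch`, `Level`, and `confidence` are hypothetical names, and returning -1.0 for "no match" is a convention of this sketch only.

```java
// Illustrative only; mirrors the thresholds in the diff without the surrounding filter.
final class FuzzyThresholdSketch {

    enum Level { OFF, LOW, MEDIUM, HIGH }

    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                final int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            final int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the span confidence for a fuzzy match, or -1.0 when there is none.
    static double confidence(Level level, String entry, String ngram) {
        if (level == Level.OFF) {
            return -1.0; // fuzzy comparison is skipped entirely
        }
        final int distance = levenshtein(entry, ngram);
        if (level == Level.HIGH && distance < 1) {
            return 0.9;
        } else if (level == Level.MEDIUM && distance <= 2) {
            return 0.7;
        } else if (level == Level.LOW && distance < 3) {
            return 0.5;
        }
        return -1.0;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Fayette", "Fyette"));
        System.out.println(confidence(Level.MEDIUM, "Fayette", "Fyette"));
        System.out.println(confidence(Level.OFF, "Fayette", "Fyette"));
    }
}
```

Read this way, the `HIGH` threshold accepts only a distance of zero, so the misspelling "Fyette" (distance 1 from "Fayette") would match at `MEDIUM` and `LOW` but not at `HIGH` or `OFF`.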
@@ -36,6 +36,10 @@ public class CustomDictionary extends AbstractFilter {
@Expose
private List<String> files;

@SerializedName("fuzzy")
@Expose
private boolean fuzzy = false;

@SerializedName("sensitivity")
@Expose
private String sensitivity = SensitivityLevel.OFF.getName();
@@ -96,4 +100,12 @@ public void setCapitalized(boolean capitalized) {
this.capitalized = capitalized;
}

public boolean isFuzzy() {
return fuzzy;
}

public void setFuzzy(boolean fuzzy) {
this.fuzzy = fuzzy;
}

}
