Practical Supervised Machine Learning Classification of Highly Imbalanced Text

May 22, 2025 • White Paper

By
Austin Whisnant

In this paper, Austin Whisnant describes a machine learning model used to build a corpus of insider threat data to support insider threat research.

Publisher

Software Engineering Institute

DOI (Digital Object Identifier)

10.1184/R1/29120552

Abstract

As the insider threat problem grows and becomes more widely understood, software vendors have started offering more solutions for detecting, preventing, and evaluating the risks posed by insiders. It is important that these and future solutions are founded on reliable data and evidence-based research. This paper describes research into how to efficiently collect and classify United States Attorneys’ Office (USAO) press releases to determine which ones describe an insider threat. The goal is an automated process that collects as many insider threat court cases as possible and builds a repository of those cases to support ongoing research. SEI researchers used a machine learning model that gathered and encoded data from USAO press releases. They applied this model to a corpus of over 200,000 press releases and classified over 24,000 of them, dating back to 2013, as discussing insider threat. They will continue to use the model to collect new cases for the SEI insider threat repository.
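Because the classification task in the title is highly imbalanced, plain accuracy can look strong even when a model finds none of the rare positive cases. The sketch below illustrates that point with precision, recall, and F1 on toy data. It is only an illustration: the abstract does not say how the SEI model was evaluated, and the label counts, the `score` helper, and the `Scores` type are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    accuracy: float
    precision: float
    recall: float
    f1: float


def score(y_true, y_pred):
    """Compute binary metrics, treating label 1 as the rare positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Scores(accuracy, precision, recall, f1)


# Hypothetical toy labels: 12 positive documents among 100.
y_true = [1] * 12 + [0] * 88

# Baseline that labels every document as negative.
all_negative = [0] * 100

# Hypothetical classifier: finds 9 of the 12 positives, raises 5 false alarms.
classifier = [1] * 9 + [0] * 3 + [1] * 5 + [0] * 83

baseline_scores = score(y_true, all_negative)
classifier_scores = score(y_true, classifier)
print(baseline_scores)
print(classifier_scores)
```

On this toy data the all-negative baseline reaches 88% accuracy with zero recall, while the hypothetical classifier reaches 92% accuracy with 75% recall. The small accuracy gap hides a large difference in usefulness, which is why per-class metrics are usually reported for imbalanced problems.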