Back in 2015, when he was still FBI Director, James Comey had this to say about the unavailability of reliable and comprehensive data on the use of deadly force by police officers in the United States:
It is unacceptable that The Washington Post and the Guardian newspaper from the U.K. are becoming the lead source of information about violent encounters between police and civilians… You can get online today and figure out how many tickets were sold to ‘The Martian,’ which I saw this weekend… The CDC can do the same with the flu… It’s ridiculous—it’s embarrassing and ridiculous—that we can’t talk about crime in the same way, especially in the high-stakes incidents when your officers have to use force.
Using the Guardian data as a benchmark, Franklin Zimring has estimated that the two major official data sources on deaths at the hands of law enforcement officials (the National Vital Statistics System and the Supplementary Homicide Reports) have historically undercounted police homicides by about half. Researchers have accordingly turned to crowdsourced data from traditional and non-traditional sources, including the incident lists compiled by the teams at Mapping Police Violence (MPV) and Fatal Encounters.
Over the past couple of years I have been working with Pepe Montiel Olea and Dan O’Flaherty on trying to understand the staggering variation across agencies in the incidence of lethal force. Dan and I had previously documented the variation across states (and a few large cities) in our book, but state-level data is too highly aggregated to be of much use in understanding this phenomenon. It is at the level of agencies that selection, training, leadership, and organizational culture reside, and we felt that this is where the focus needed to be.
The first paper to emerge from our project has now been posted here, with scripts and data made available in an accompanying repository.
We assembled the data in stages, losing coverage (years, agencies, population served, and fatalities) at each stage while gaining variables (officers, population demographics, murders, gun prevalence, and so on). Each stage is described in detail below, and the data emerging from each stage has been posted here. The table below provides a summary of coverage at each stage:
In this post I’ll describe the steps we took to construct the data set, the difficulties encountered along the way, and the problems that remain. The goal is to provide a guide to anyone who wishes to use the data, and to allow us to get comments so that we can improve it.
First Stage
We began with the 2012 Law Enforcement Agency Identifiers Crosswalk (LEAIC), which contains information on more than 28,000 agencies, each associated with a unique nine-digit Originating Agency Identifier (ORI) code.
We attached to each of these agencies the total number of on-duty police killings for each of the years 2013-2020, using MPV as our source. We chose MPV because its data has an ORI field, allowing us to match most agencies to homicides mechanically. But there were a few exceptions: agencies whose ORI code in the MPV data does not correspond to a positive-population agency in the Crosswalk, and for which a manual substitution had to be made. The two most important manual matches involved the primary agencies serving Indianapolis and Jacksonville.
There were 8,529 total victims of deadly force in the MPV data over these eight years, of which 7,706 involved a single agency with known ORI code. The rest were associated with multiple agencies acting together (close to 800 deaths) or federal agencies such as the FBI or DEA (about 50 deaths). Of the 7,706 single agency on-duty homicides with known ORI code, we were able to match all but one to an agency in the Crosswalk.
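A minimal sketch of this matching step, in Python. It is not taken from our posted scripts: the ORI codes, the contents of `MANUAL_ORI_FIXES`, and the function name are all hypothetical placeholders, and the incident list is assumed to have been reduced to a year and a list of ORI codes per death.

```python
from collections import Counter

# Hypothetical manual substitutions: ORI as recorded in the incident list,
# mapped to the ORI of the positive-population agency in the crosswalk.
MANUAL_ORI_FIXES = {"XX0000100": "XX0000200"}


def count_killings_by_agency(incidents, crosswalk_oris, fixes=MANUAL_ORI_FIXES):
    """Count single-agency incidents per (ORI, year).

    `incidents` is an iterable of (year, [ori, ...]) pairs. Incidents with
    no ORI or more than one ORI are set aside rather than matched.
    Returns (counts, unmatched).
    """
    counts = Counter()
    unmatched = []
    for year, oris in incidents:
        if len(oris) != 1:
            unmatched.append((year, oris))  # multi-agency or unknown ORI
            continue
        ori = fixes.get(oris[0], oris[0])  # apply a manual substitution if any
        if ori in crosswalk_oris:
            counts[(ori, year)] += 1
        else:
            unmatched.append((year, oris))  # ORI not found in the crosswalk
    return counts, unmatched


crosswalk = {"XX0000200", "XX0000300"}
incidents = [
    (2013, ["XX0000100"]),               # resolved by manual substitution
    (2013, ["XX0000300"]),               # matched mechanically
    (2014, ["XX0000300", "XX0000200"]),  # multi-agency: set aside
    (2015, ["XX0009999"]),               # not in the crosswalk
]
counts, unmatched = count_killings_by_agency(incidents, crosswalk)
```

Keeping the unmatched incidents in a separate list, rather than silently dropping them, makes it easy to audit which deaths fall outside the matched set.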
Among the Crosswalk agencies are 13,397 local police departments and 2,869 sheriff’s offices together serving about 314 million people. We focused attention on these 16,266 agencies, since they are associated with well-defined, stable, and mutually exclusive residential populations. They accounted for a total of 7,115 deaths over this period, with the remainder associated with state agencies and special jurisdictions such as transit systems, campuses, courts, and corrections.
About 90% of those killed by officers attached to these police departments and sheriff’s offices are identified in MPV as having known race-ethnicity. Among these, 47% are identified as White, 29% as Black, 20% as Hispanic, 2% as Asian, and 1% each as Native American and Pacific Islander.
This set of 16,266 agencies, serving a population of 314 million, and associated with 7,115 homicides, constitutes our first stage data set.
Second Stage
Next, we added the number and gender composition of sworn officers from the LEOKA database, using the 2012 wave to match the Crosswalk data.
The LEOKA data also identifies about fifty felony murders of law enforcement officers nationwide every year, but even large agencies have no such murders in most years so this is not a variable we can use as an explanatory agency characteristic. Assaults on police officers have much higher frequency, but this data is incomplete and unreliable, as discussed further below.
Dropping agencies with zero officers in the LEOKA data, we arrive at 14,881 agencies, serving about 305 million people, employing 606,690 officers, and associated with 7,092 homicides. This constitutes our second stage data set.
Third Stage
Next, we added characteristics of the population served by the agencies, using the 2018 American Community Survey (ACS) from the Census Bureau—the total population, number below poverty, distribution by race-ethnicity, and land area.
At this stage we consider only local police departments, which (unlike sheriff’s offices) can be reliably associated with mutually exclusive civilian populations in the census data. This reduces our sample to 12,012 agencies.
However, even local police departments don’t serve areas that match perfectly to census geographies. To match civilian populations to agencies, we began by seeing if the agency corresponded to a census place. If so, the poverty rate, population demographics, and land area of the place were matched to the agency. If not, we repeated the process with county subdivisions. Finally, for county agencies, we matched census county data to the agency, after first subtracting the population and land area that had already been assigned to other agencies within the county.
In principle, one could treat sheriff’s offices like county police departments and apply the same procedure of subtraction, but this does not yield accurate or even meaningful results for many agencies. For instance, the population served by the Los Angeles County Sheriff’s Office is about 1.1 million in LEAIC and LEOKA, but the residual population of Los Angeles County (after subtracting that assigned to other agencies) is more than 2.8 million. Meanwhile some sheriff’s offices in counties that contain large agencies that span multiple counties end up with negative imputed populations. This is the case, for instance, with the Jackson County Sheriff’s Office in Missouri, and the Dallas County Sheriff’s Office in Texas.
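The matching cascade just described (place, then county subdivision, then county residual) can be sketched roughly as follows. This is an illustration under simplifying assumptions, not our actual code: the dictionary keys (`name`, `geo`, `county`, `is_county_agency`), the function name, and all the figures are hypothetical, and only population and land area are carried along.

```python
def match_agencies(agencies, places, subdivisions, counties):
    """Match agencies to (population, land_area_km2, source) tuples."""
    matched = {}
    assigned = {}  # county -> [population, land area] already allocated

    # Pass 1: try census places first, then county subdivisions.
    for a in agencies:
        for table, source in ((places, "place"),
                              (subdivisions, "county subdivision")):
            if a["geo"] in table:
                pop, area = table[a["geo"]]
                matched[a["name"]] = (pop, area, source)
                totals = assigned.setdefault(a["county"], [0, 0.0])
                totals[0] += pop
                totals[1] += area
                break

    # Pass 2: county agencies get whatever the county has left over.
    for a in agencies:
        if a["name"] not in matched and a["is_county_agency"]:
            pop, area = counties[a["county"]]
            used_pop, used_area = assigned.get(a["county"], (0, 0.0))
            matched[a["name"]] = (pop - used_pop, area - used_area,
                                  "county residual")
    return matched


# Made-up census figures: name -> (population, land area in sq km).
places = {"Springfield": (60000, 80.0)}
subdivisions = {"Oak Township": (15000, 40.0)}
counties = {"Example County": (100000, 500.0)}
agencies = [
    {"name": "Springfield PD", "geo": "Springfield",
     "county": "Example County", "is_county_agency": False},
    {"name": "Oak Township PD", "geo": "Oak Township",
     "county": "Example County", "is_county_agency": False},
    {"name": "Example County PD", "geo": "Example County",
     "county": "Example County", "is_county_agency": True},
]
matched = match_agencies(agencies, places, subdivisions, counties)
```

Nothing in this sketch prevents the residual from coming out negative or zero, so any real implementation needs a check on the sign of the result before the agency is kept.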
Accordingly, we drop all sheriff’s offices from the data at this stage. After also dropping agencies that cannot be matched to positive census populations or land area, we arrive at 11,920 police departments, serving about 219 million people, employing about 445,000 officers, and associated with 5,065 homicides. This constitutes our third stage data set.
Fourth Stage
All the variables added up to this stage, with the exception of police homicides, are time-invariant: population served and officers are from 2012, and census variables from 2018. We next add time-varying measures of reported crimes using Jacob Kaplan’s concatenated files for Offenses Known to Law Enforcement (OKLE). At this stage we restrict attention to the years 2013-2019.
The data on offenses known is notoriously incomplete, with many agencies reporting partial counts or not reporting at all. For each agency-year we add a variable for months missing, as well as total counts for murder (and non-negligent manslaughter), violent crime, and property crime. We also add the population served for each year.
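One way to build the agency-year variables from monthly reports is sketched below. The input layout (one tuple per agency-month actually reported) and the function name are assumptions made for illustration; the concatenated files themselves are organized differently.

```python
from collections import defaultdict


def summarize_agency_years(monthly_reports):
    """Collapse monthly reports into one record per (agency, year).

    Each report is a tuple (ori, year, month, murder, violent, property).
    Months with no report are simply absent from the input.
    """
    months = defaultdict(set)
    totals = defaultdict(lambda: [0, 0, 0])
    for ori, year, month, murder, violent, prop in monthly_reports:
        key = (ori, year)
        months[key].add(month)
        totals[key][0] += murder
        totals[key][1] += violent
        totals[key][2] += prop
    return {
        key: {
            "months_missing": 12 - len(months[key]),
            "murder": t[0],
            "violent": t[1],
            "property": t[2],
        }
        for key, t in totals.items()
    }


# A hypothetical agency that reported only ten months in 2013.
reports = [("XX0000300", 2013, m, 1, 10, 50) for m in range(1, 11)]
summary = summarize_agency_years(reports)
```

Note that an agency-year with no reports at all never appears in the output of this sketch, so it would have to be added back with twelve months missing before merging with the agency list.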
We lose a few agencies at this stage, arriving at 11,906 police departments, serving a population of about 219 million and employing about 445,000 officers. The number of homicides drops to 4,460 with one fewer year covered, and the number of murders reported over the seven year period is 88,606, or about 12,660 per year. This constitutes our fourth stage data set.
Fifth Stage
Finally, we add state, county, and regional variables.
We use mortality data from the Centers for Disease Control and Prevention (CDC) to attach to each agency the number of gun deaths per 100,000 population in the county in which it is located. We use data from 1999-2016 and include all deaths (including suicides) resulting from firearm discharges, except for those attributed to law enforcement actions. This variable is a measure of gun prevalence at the county level.
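The rate itself is simple arithmetic. In the sketch below, the function name and the use of `None` to represent a county whose count is unavailable are assumptions, and the death count is taken to already exclude deaths attributed to law enforcement actions. Both arguments are pooled over the same years.

```python
def gun_death_rate(deaths, population):
    """Deaths per 100,000 population, or None if the count is unavailable."""
    if deaths is None or population <= 0:
        return None
    return 100_000 * deaths / population


# Hypothetical county: 450 firearm deaths over 3,000,000 person-years.
rate = gun_death_rate(450, 3_000_000)
missing = gun_death_rate(None, 3_000_000)
```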
Under the current federal civil standard, established in Tennessee v. Garner, “a police officer may use deadly force to prevent the escape of a fleeing suspect only if the officer has a good-faith belief that the suspect poses a significant threat of death or serious physical injury to the officer or others.” Some states use this standard as part of criminal law, while others continue to give police greater latitude in using deadly force. Flanders and Welling (2015) have classified states into those that have adopted Garner or Garner-like principles in criminal matters and those that have not done so. We use this to create a dummy variable assigned to each agency, based on the state in which it is located.
To capture the power of police unions in protecting officers from allegations of misconduct, we use Rushin (2016) to assign to each agency a variable indicating whether it is in a jurisdiction with a Law Enforcement Officer Bill of Rights (LEOBR).
Finally, we add variables indicating the census region and division in which the agency is located.
For privacy reasons, the CDC suppresses data for counties with ten or fewer gun deaths, and we drop all agencies in such counties. The result is 11,694 police departments, serving a population of about 219 million, employing about 444,000 officers, reporting 88,584 murders, and associated with 4,458 police homicides. This constitutes our fifth stage data set.
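For convenience, the agency and homicide counts quoted at the end of each stage above can be collected in one place. The sketch below uses only those figures; the per-agency-year rate is a derived illustration (the first three stages cover the eight years 2013-2020, the last two the seven years 2013-2019) and does not appear in the paper.

```python
STAGES = [
    # (stage, agencies, police homicides, years covered)
    ("first", 16_266, 7_115, 8),
    ("second", 14_881, 7_092, 8),
    ("third", 11_920, 5_065, 8),
    ("fourth", 11_906, 4_460, 7),
    ("fifth", 11_694, 4_458, 7),
]


def homicides_per_agency_year(agencies, homicides, years):
    """Average number of police homicides per agency per year."""
    return homicides / (agencies * years)


first_stage_agencies = STAGES[0][1]
for name, agencies, homicides, years in STAGES:
    rate = homicides_per_agency_year(agencies, homicides, years)
    share = agencies / first_stage_agencies
    print(f"{name:>6}: {agencies:>6,} agencies ({share:.0%} of first stage), "
          f"{homicides:>5,} homicides, {rate:.3f} per agency-year")
```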
Final Thoughts
Users of this data who include the crime variables may wish to restrict attention only to those agencies with no months of crime data missing in any of the seven years. Doing so would result in 7,664 police departments, serving about 195 million people, employing about 398,000 officers, associated with 4,081 homicides, and reporting 83,260 murders over seven years.
In addition, those who use the land area variable (to construct measures of population density, for example) may wish to drop the agencies in Alaska and Hawaii, since many of these have jurisdiction over vast amounts of unpopulated land. The five largest agencies by land area are in these two states. The largest of them, the North Slope Borough Police Department, is the only agency in a county spanning over 229,000 square kilometers, vastly greater even than the second largest agency by land area, which is the Hawaii County Police Department at 10,000 square kilometers. Measures of population density for such agencies (and many others in these states) are not very meaningful.
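Both restrictions are easy to apply once the data is loaded. A rough sketch follows, with hypothetical field names (`state`, `months_missing`, `population`, `land_area_km2`) and made-up records rather than the column names in our posted files.

```python
EXCLUDED_STATES = {"AK", "HI"}


def has_complete_crime_data(agency):
    """True if no months of crime data are missing in any year."""
    return all(m == 0 for m in agency["months_missing"].values())


def restrict_sample(agencies, require_complete=True,
                    drop_states=EXCLUDED_STATES):
    """Keep agencies outside the excluded states with complete reporting."""
    kept = []
    for a in agencies:
        if a["state"] in drop_states:
            continue
        if require_complete and not has_complete_crime_data(a):
            continue
        kept.append(a)
    return kept


def population_density(agency):
    """Residents per square kilometer of land area."""
    return agency["population"] / agency["land_area_km2"]


# Made-up records for illustration only.
agencies = [
    {"name": "A", "state": "OH", "population": 50000, "land_area_km2": 100.0,
     "months_missing": {2013: 0, 2014: 0}},
    {"name": "B", "state": "OH", "population": 20000, "land_area_km2": 40.0,
     "months_missing": {2013: 0, 2014: 3}},
    {"name": "C", "state": "AK", "population": 9000, "land_area_km2": 200000.0,
     "months_missing": {2013: 0, 2014: 0}},
]
sample = restrict_sample(agencies)
densities = {a["name"]: population_density(a) for a in sample}
```

Passing `require_complete=False` keeps agencies with partial crime reporting, which may be preferable for questions that do not involve the crime variables.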
The data used for estimation in our paper is tailored to the particular model used there, and differs from our fifth stage data set in a few respects. We drop Alaska and Hawaii, and drop agencies that do not report crime data for all months in all years. We also cover only the six year period 2013-2018. We add assaults on police officers as a time-varying variable using LEOKA as source, and add a variable indicating whether the assault data are complete.
These assault data are incomplete and unreliable, even if one restricts attention to those reported as having submitted complete data. For instance, the New York City and Chicago police departments report no assaults in any of the years 2013-2018, even though information is said to be complete for 2018. The use of assault data as a robustness check in some cases might make sense, but one ought not to put too much reliance on any estimates that make use of it.
As noted above, the coverage of the data decreases as we move through the various stages, and the number of variables increases. We have posted data for all stages, since the trade-off between agency coverage and variable richness will vary depending on the question at hand, and some researchers may find the intermediate stage data sets to be most useful.
This is a first pass at constructing an agency-level data set on police use of deadly force. The set of agency characteristics is limited for the moment, but we hope to expand, refine, and improve the data over time, and would welcome guidance on how best to do so.
So intrigued by the numbers as you move into the third stage data set. Dropping about 2900 agencies you lose a whopping 2000 deaths, or ~.7 per agency. That seems per agency much higher than the <.5 deaths per agency remaining. Understand that you can’t get much from those agencies because of overlaps etc. but it does seem that there is a story there worth exploring.
This is very helpful and comprehensive! Thank you.