Seattle 911 EDA

This web report includes descriptive statistics of the Seattle 911 CAD data. The report starts with an overall summary of the structure of the dataset and then steps through each variable in the dataset.

Dataset Description

Let’s start by identifying the dimensions in the dataset.

## [1] 480811     17

There are 752,421 events and 17 variables in the CAD data export. For this analysis, we narrow the focus to 911 and other telephone call types only. The reduced dataset contains 480,811 events.
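As a rough illustration of that narrowing step, the sketch below keeps only rows whose `Call_Type_Desc` is one of the two retained values. The sample rows and the `ONVIEW` value are made up for the example, and `filter_phone_calls` is a hypothetical helper, not the code used to build this report.

```python
import csv
import io

# The two call types retained in this report.
KEEP_CALL_TYPES = {"911", "TELEPHONE OTHER, NOT 911"}


def filter_phone_calls(rows):
    """Keep only events that arrived via 911 or another telephone line."""
    return [row for row in rows if row["Call_Type_Desc"] in KEEP_CALL_TYPES]


# Tiny made-up sample standing in for the CAD export (not real records).
sample_csv = io.StringIO(
    "CAD_Event_ID,Call_Type_Desc\n"
    "1,911\n"
    "2,ONVIEW\n"
    '3,"TELEPHONE OTHER, NOT 911"\n'
)
rows = list(csv.DictReader(sample_csv))
phone_rows = filter_phone_calls(rows)

# Report the dimensions of the reduced sample: rows kept, columns per row.
print(len(phone_rows), len(phone_rows[0]))
```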

The variable names from the CAD export are listed below.

##  [1] "CAD_Event_ID"                        "Dispatch_ID"                        
##  [3] "Event_First_Dispatch_Time_ATTR"      "Call_Priority_Code"                 
##  [5] "Call_Type_Desc"                      "Case_Type_Final_Desc"               
##  [7] "Case_Type_Initial_Desc"              "Clear_By_Desc"                      
##  [9] "Dispatch_Address"                    "Officer_Serial_Num"                 
## [11] "Precinct"                            "Sector"                             
## [13] "Squad_Desc"                          "Dispatch_Blurred_Latitude"          
## [15] "Dispatch_Blurred_Longitude"          "CAD_Event_Response_Time_Seconds_SUM"
## [17] "Total_Service_Time_Seconds_SUM"

Now, let’s find the number of categories in the categorical variables. In subsequent sections, I will step through each variable and summarize the distributions in greater detail.

# of Categories for Categorical Variables
Dispatch ID	Priority Code	Call Type	Case Type Final Desc	Case Type Initial Desc	Clear By Desc	Precinct	Sector
474,571	8	2	320	231	23	6	18
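Counts like these can be produced by collecting the distinct values in each column. The sketch below uses made-up rows and a hypothetical helper, `count_categories`. It also shows that a placeholder such as UNKNOWN, or a missing value, is counted as its own category, which is worth keeping in mind when reading the precinct and sector counts.

```python
def count_categories(rows, columns):
    """Return the number of distinct values observed in each column."""
    return {col: len({row[col] for row in rows}) for col in columns}


# Made-up rows for illustration; the column names follow the CAD export.
rows = [
    {"Precinct": "NORTH", "Sector": "BOY"},
    {"Precinct": "NORTH", "Sector": "UNION"},
    {"Precinct": "WEST", "Sector": "KING"},
    {"Precinct": "UNKNOWN", "Sector": None},  # missing sector
]

n_categories = count_categories(rows, ["Precinct", "Sector"])
print(n_categories)
```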

Dispatch ID - This is some sort of identifier. Interestingly, it is not unique to each event: there are 474,571 distinct dispatch IDs across 480,811 events. What does the dispatch ID identify? Is this identifier going to be relevant for our analysis?

Call priority code and call type description have a manageable number of categories - 8 and 2, respectively. Call type has 8 possible categories in the full export, but we are only focusing on two of them - 911 and other telephone (not 911). After taking a deeper dive into the univariate statistics in the sections below and understanding what these categories mean, we can decide whether any of these categories should be aggregated.

Case Type Final and Case Type Initial Descriptions - These two variables have the greatest number of categories, with 320 and 231 categories, respectively. We will want to parse out the categories and see how to regroup them into a smaller, more manageable set for analysis. After looking over the categories, we can figure out some strategies for aggregating them.

Clear by description - There are 23 categories in this variable. After further review below, we can look to see if any aggregation is necessary.

Precinct is a categorical spatial indicator. It looks like the city is divided into 5 regional precincts; the sixth category is “unknown”.

Sector - There are 17 named sectors; the count of 18 in the table above includes events with no sector recorded. This variable appears to be another spatial category related to precinct. This will be described in the sector section below.

Before diving into the distributions of the categorical variables in greater detail, let’s take advantage of the fact that the data are time-stamped and get a sense of the frequency of events throughout the year.

Event Dates & Times

The data are time stamped to the minute. In the graph below, I have displayed the frequency of events per day. Hover your mouse over the line graph to see the number of events that occurred on a given day.

The date with the highest number of events was July 5th, with 1,667 events recorded. In general, the summer months appear to have higher frequencies than the rest of the year.

November 14th, 2019 is the date with the most marked decrease in events. There were only 61 events recorded on November 14th. This is far below other days with fewer events than normal, as shown in Table 1 below. It raises the possibility of a glitch in the reporting system for that day.

Table 1: Dates with Highest and Fewest Calls
Date	# Calls	Rank
2019-07-05	1,667	1
2019-06-12	1,637	2
2019-06-14	1,623	3
2019-07-13	1,617	4
2019-05-30	1,603	5
2019-06-10	1,590	6
2019-05-02	1,572	7
2019-07-04	1,572	7
2019-05-31	1,561	9
2019-06-01	1,556	10
2019-12-22	1,080	356
2019-03-03	1,068	357
2019-12-24	1,054	358
2019-11-28	1,038	359
2019-12-25	1,037	360
2019-02-10	1,030	361
2019-11-26	1,024	362
2019-02-03	999	363
2019-11-13	366	364
2019-11-14	61	365

On average, there were 1,317 calls for service per day in 2019. With the exception of the 61-event day on November 14th, the distribution is not strongly skewed.

Table 2: Calls Over Time Summary
Daily Avg	Std. Dev	Median
1,317.29	147.2586	1,324
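To put the unusually quiet days in context, the sketch below expresses three of the lowest daily counts from Table 1 as z-scores, using the mean and standard deviation reported in Table 2. This is only a quick screening device. The mean and standard deviation include the outlying days themselves, so a median-based measure may be preferable in practice.

```python
# Summary statistics as reported in Table 2.
DAILY_MEAN = 1317.29
DAILY_SD = 147.2586

# Three of the lowest daily counts from Table 1.
low_days = {"2019-02-03": 999, "2019-11-13": 366, "2019-11-14": 61}


def z_score(count, mean=DAILY_MEAN, sd=DAILY_SD):
    """Number of standard deviations a daily count lies from the daily mean."""
    return (count - mean) / sd


for day, n_calls in low_days.items():
    print(day, n_calls, round(z_score(n_calls), 2))
```

Under these figures, November 13th and 14th sit more than 6 and 8 standard deviations below the mean, respectively. February 3rd is only about 2 below it, which supports treating the two November dates as a likely reporting problem rather than ordinary variation.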

Call Priority Codes

Code 2 is the most common priority code recorded, with a total of 189,788 events. According to Table 3, Code 2 accounts for about 40% of the events in 2019. Just over 96% of the calls for service are categorized as priority codes 1 through 3.

Codes 6 and 7 were very rare. They do not show up as clearly in the graph, but in Table 3 they total to 44 and 75 calls, respectively.

One other point to note is that there is not a code 8; the codes skip from 7 to 9.

Table 3: Call Priority Codes
Code	# Calls	%
1	154,849	32.21
2	189,788	39.47
3	118,628	24.67
4	8,971	1.87
5	6,565	1.37
6	44	0.01
7	75	0.02
9	1,891	0.39
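The shares quoted above follow directly from the counts in Table 3. A short check, using only the numbers in the table:

```python
# Call counts per priority code, copied from Table 3.
priority_counts = {
    1: 154849, 2: 189788, 3: 118628, 4: 8971,
    5: 6565, 6: 44, 7: 75, 9: 1891,
}

total = sum(priority_counts.values())
shares = {code: 100 * n / total for code, n in priority_counts.items()}

# Combined share of priority codes 1 through 3.
top3_share = sum(shares[code] for code in (1, 2, 3))

print(total)
print({code: round(pct, 2) for code, pct in shares.items()})
print(round(top3_share, 2))
```

The total matches the 480,811 events in the reduced dataset, and codes 1 through 3 together come to 96.35%.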

Call Type Description

Table 4: Call Type Description
Type	# Events	%
911	325,008	67.6
TELEPHONE OTHER, NOT 911	155,803	32.4

We have retained only calls for service that came in via 911 or other telephone calls (not via 911). 911 calls are about 68% of the calls and other telephone source makes up the remaining 32%.

For reference, prior to reducing the dataset, 911 calls were about 43% of the events and other telephone was 21%.

Case Type Final Description

Flip through the pages in the table to view the number of events with each type of case final description. Recall that this variable has 320 different descriptions.

Some of these descriptions have a general description followed by a more specific description that follows a dash. We could parse on the general description and then aggregate to get a smaller set of categories. I demonstrate this in the table below.

This aggregation strategy reduced the number of categories by a little over half, to 140. Disturbance cases are the most common, followed by suspicious circumstances and traffic. If you flip through the pages, some categories appear to be similar to these top 3. For instance, traffic stop is listed on page 6, which seems like it could also fit under traffic. Also on page 6 is the category “Dist”, which is an abbreviation for disturbance. All of the descriptions and frequencies for the final case type descriptions are listed in the exported Excel file (shared over email and on the github page).

Other Comments

* Need to make sure to catch abbreviations using regular expressions (e.g., burg -> burglary).
* Similarly, use regular expressions for categories that look alike but differ in spacing or wording (e.g., Arson, Bombs, Explo; Abandoned car & Abandoned vehicle).
* “#NAME?” looks like it might be the classification for events that were not classified. There are 977 events with this classification, which is about 0.2% of events.
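The parsing and clean-up ideas above can be sketched as follows. The abbreviation map, the set of missing labels, and the helper `general_category` are assumptions made for illustration. The sketch also assumes the general and specific parts are separated by a dash with whitespace on both sides, so hyphenated words are left intact.

```python
import re
from collections import Counter

# Hypothetical abbreviation map; extend after reviewing the full category list.
ABBREVIATIONS = {"DIST": "DISTURBANCE", "BURG": "BURGLARY"}

# Labels treated as "not classified" (an assumption based on the notes above).
MISSING_LABELS = {"#NAME?", "-", ""}


def general_category(description):
    """Return the general part of a description (the text before ' - ')."""
    # Collapse repeated whitespace and standardize case before comparing.
    text = re.sub(r"\s+", " ", description.strip()).upper()
    if text in MISSING_LABELS:
        return "UNKNOWN"
    head = re.split(r"\s+-\s+", text, maxsplit=1)[0]
    return ABBREVIATIONS.get(head, head)


# A few descriptions in the style of the export, plus made-up variants.
sample = [
    "DISTURBANCE - OTHER",
    "DIST - OTHER",
    "DISTURBANCE - OTHER",
    "TRAFFIC - PARKING VIOL (EXCEPT ABANDONED CAR)",
    "CRISIS COMPLAINT - GENERAL",
    "#NAME?",
]
counts = Counter(general_category(d) for d in sample)
print(counts.most_common())
```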

Case Type Initial Descriptions

The top two initial case type descriptions are similar to the final case description types.

One note on structure: fewer of these descriptions follow the pattern seen in the final descriptions, that is, a general description followed by a more specific detail, with the two separated by a dash “-”. Below, I have parsed out the general description as I did with the final case descriptions; however, it may be a less useful approach for this variable.

Other Comments/Questions

* Need to make sure to catch abbreviations using regular expressions (e.g., HAZ -> HAZARD).
* “#NAME?” shows up again in this set of descriptions, though not as frequently as it did in the final descriptions (n=528).
* Would it be useful to compare final and initial descriptions? We could use some fuzzy matching and regular expressions if this is something important. If final descriptions are missing (meaning that they are coded as #NAME?) and initial descriptions are not missing, should the initial description be applied?
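One way to explore the last question is sketched below. It falls back to the initial description when the final one is coded as missing, and uses a simple string similarity score as a stand-in for fuzzy matching. The helper names and the choice of `difflib.SequenceMatcher` are assumptions for illustration, and whether the fallback is appropriate remains an open question.

```python
from difflib import SequenceMatcher

# Values treated as missing (assumption: "#NAME?" marks an unclassified event).
MISSING = {"#NAME?", "", None}


def best_description(final_desc, initial_desc):
    """Prefer the final description; fall back to the initial one if missing."""
    if final_desc in MISSING:
        return None if initial_desc in MISSING else initial_desc
    return final_desc


def similarity(a, b):
    """Case-insensitive similarity ratio between two descriptions (0 to 1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(best_description("#NAME?", "DISTURBANCE - OTHER"))
print(round(similarity("Abandoned car", "Abandoned vehicle"), 2))
print(round(similarity("Abandoned car", "Arson"), 2))
```

Pairs that score above some chosen threshold could then be reviewed by hand before they are merged.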

Aggregating reduced the number of descriptions down to 123. The top four descriptions remain the same, but the rest of the top 10 have shifted ranks (e.g., assault, trespass).

NOTE: Unknown is pretty substantial here (n=13,907, 2.89%). The #NAME? description is less frequent (n=528), but appears to also signify unknown case descriptions.

Clear by Description

Table 4: Clear By Descriptions
Description	# Events	%
ASSISTANCE RENDERED	167,457.0	34.83
REPORT WRITTEN (NO ARREST)	148,495.0	30.88
UNABLE TO LOCATE INCIDENT OR COMPLAINANT	55,366.0	11.52
PHYSICAL ARREST MADE	41,140.0	8.56
NO POLICE ACTION POSSIBLE OR NECESSARY	19,290.0	4.01
CITATION ISSUED (CRIMINAL OR NON-CRIMINAL)	11,705.0	2.43
RESPONDING UNIT(S) CANCELLED BY RADIO	10,169.0	2.11
ORAL WARNING GIVEN	5,772.0	1.20
FOLLOW-UP REPORT MADE	4,178.0	0.87
DUPLICATED OR CANCELLED BY RADIO	4,137.0	0.86
OTHER REPORT MADE	2,968.0	0.62
FALSE COMPLAINT/UNFOUNDED	2,872.0	0.60
STREET CHECK WRITTEN	2,477.0	0.52
-	1,878.0	0.39
INCIDENT LOCATED, PUBLIC ORDER RESTORED	1,839.0	0.38
RADIO BROADCAST AND CLEAR	429.0	0.09
PROBLEM SOLVING PROJECT	330.0	0.07
TRANSPORTATION OR ESCORT PROVIDED	143.0	0.03
NON-CRIMINAL REFERRAL	80.0	0.02
EXTRA UNIT	35.0	0.01
SERVICE OF DVPA ORDER	21.0	0.00
(NOT CURRENTLY USED) ALARM NO RESPONSE	16.0	0.00
NO SUCH ADDRESS OR LOCATION	14.0	0.00

In about 35% of the service calls (n=167,457), assistance was rendered. The next most common clear by type was report written (no arrest), which was applied to about 31% of the calls (n=148,495).

The next most common clear by type was unable to locate incident or complainant. It was applied to about 11.5% of calls (n=55,366). 14 calls were marked as no such address or location (not sure if it is reasonable to consider this as similar to unable to locate incident).

A physical arrest was made in about 8.6% of calls (n=41,140). No police action was possible or necessary in 4% of calls (n=19,290).

OTHER NOTES:

* It looks like a dash “-” represents a missing clear by description (n=1,878).
* There are some descriptions whose meaning, or difference from other descriptions, is unclear to me. For instance, how do “responding unit(s) cancelled by radio” and “duplicated or cancelled by radio” differ?
* Unable to locate incident or complainant accounts for about 11.5% of the events.

Precinct & Sector

Table 5: Calls per Precinct
Precinct	# Calls	%
NORTH	141,513	29.43
WEST	134,448	27.96
SOUTH	76,186	15.85
EAST	75,810	15.77
SOUTHWEST	51,572	10.73
UNKNOWN	1,282	0.27

The north and west precincts had the most calls, with about 29% and 28% of all calls, respectively. The south and east precincts had similar shares of calls at about 16% each. The southwest precinct had the fewest calls recorded - 51,572 (11%).

For 1,282 calls, the precinct is listed as unknown. We may be able to identify a precinct for these events if they have valid latitude and longitude coordinates. Let’s look to see if they do have lat and long:

Table 6: Unknown Precinct Coordinate Status
Coordinate Status	# Calls
Not valid coords	720
Valid coords	562

About 44% of the calls with an unknown precinct have coordinates within the geographic extent of Seattle. We can use these 562 events with unknown precincts and assign them a precinct. When I create a spatial object from the coordinates, as shown a few sections below, I will be able to plot these. For some it may be obvious what the precinct is based on the precinct labels given to neighboring events. If the precinct classification is not obvious, the best thing to do would be to obtain a shapefile of the polygons for each of the five precincts, overlay it on the events, and give each point the name of the precinct polygon that it falls within or is nearest to. Seattle’s Open Data website has such a shapefile that I will call on and use in the spatial geoprocessing section below.
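A coordinate check of this kind can be sketched as a simple bounding-box test. The latitude and longitude limits below are rough, assumed values for Seattle's extent, not the limits used to produce Table 6, and `has_valid_coords` is a hypothetical helper.

```python
# Approximate bounding box for Seattle. These limits are an assumption
# for illustration and should be replaced with the city's actual extent.
SEATTLE_BOUNDS = {"lat": (47.48, 47.74), "lon": (-122.46, -122.22)}


def has_valid_coords(lat, lon, bounds=SEATTLE_BOUNDS):
    """True if lat/lon parse as numbers and fall inside the bounding box."""
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return False  # missing or non-numeric coordinates
    lat_min, lat_max = bounds["lat"]
    lon_min, lon_max = bounds["lon"]
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


print(has_valid_coords(47.60, -122.33))  # a point in central Seattle
print(has_valid_coords(0, 0))            # placeholder coordinates
print(has_valid_coords(None, None))      # missing coordinates

# Share of unknown-precinct calls with valid coordinates, from Table 6.
valid, not_valid = 562, 720
print(round(100 * valid / (valid + not_valid), 1))
```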

There are some interesting bivariate analyses that could be explored. For example, call priority codes and precincts. View the interactive stacked bar chart below.

A few things stand out in the stacked bar graph of call priority codes and precincts.

* The breakdown of precincts within codes 1 and 2 is very similar. The north and west precincts have very similar shares in these two codes.
* Most of the unknown precinct calls were classified as code 9.
* The south precinct had no code 7 cases.
* Over half of the calls in code 9 were in the west precinct.

Let’s turn to focus on the sectors. There are 17 distinct sector names. 1,282 calls were not given a sector. These calls are identical to those missing a precinct classification.

Table 7: Calls by Precinct-Sector
Precinct	Sector	# Calls	Percent
SOUTH	ROBERT	30,327	39.81
SOUTH	SAM	24,427	32.06
SOUTH	OCEAN	21,432	28.13
EAST	EDWARD	33,631	44.36
EAST	GEORGE	21,241	28.02
EAST	CHARLIE	20,938	27.62
SOUTHWEST	WILLIAM	25,852	50.13
SOUTHWEST	FRANK	25,720	49.87
WEST	KING	45,650	33.95
WEST	DAVID	32,159	23.92
WEST	MARY	29,461	21.91
WEST	QUEEN	27,178	20.21
NORTH	BOY	33,684	23.80
NORTH	UNION	32,389	22.89
NORTH	NORA	27,150	19.19
NORTH	LINCOLN	27,065	19.13
NORTH	JOHN	21,225	15.00
UNKNOWN	NA	1,282	100.00

Sectors are unique to precincts. We can think of a sector as a subdivision of the precinct.
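Both the nesting claim and the within-precinct percentages can be checked from precinct-sector counts. The sketch below uses the east and southwest rows of the precinct-sector table. The variable names are illustrative.

```python
from collections import defaultdict

# (precinct, sector, calls) rows for two precincts, copied from the table above.
sector_calls = [
    ("EAST", "EDWARD", 33631),
    ("EAST", "GEORGE", 21241),
    ("EAST", "CHARLIE", 20938),
    ("SOUTHWEST", "WILLIAM", 25852),
    ("SOUTHWEST", "FRANK", 25720),
]

precincts_by_sector = defaultdict(set)
precinct_totals = defaultdict(int)
for precinct, sector, n_calls in sector_calls:
    precincts_by_sector[sector].add(precinct)
    precinct_totals[precinct] += n_calls

# Sectors are nested in precincts if every sector maps to exactly one precinct.
sectors_nested = all(len(p) == 1 for p in precincts_by_sector.values())

# Each sector's share of the calls in its own precinct.
sector_share = {
    sector: round(100 * n_calls / precinct_totals[precinct], 2)
    for precinct, sector, n_calls in sector_calls
}

print(sectors_nested)
print(sector_share)
```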

King sector in the west precinct leads in the number of calls with 45,650 calls. This is about 34% of all calls in the west precinct. The other three sectors in the west precinct - David, Mary, and Queen - have shares that are roughly 10 to 14 percentage points lower than King's.

Boy in the north precinct and Edward in the east precinct are the sectors with the next highest frequency of calls, each with over 33,000 calls. The share of calls in Boy is not substantially greater than in other sectors in the north. However, Edward clearly has the largest share of calls in the east precinct, amounting to about 44% of all calls in the precinct.

The two sectors of the southwest precinct - William and Frank - have a 50-50 split of the calls.

NOTE: The Seattle Open Data website does not appear to have a boundary shapefile or API for sector. This may be something to inquire about if we want to do point-in-polygon analyses at the sector level.

Squad Description

This is one of the variables with an unmanageable number of categories. Only 1,487 events are missing a squad description. If you flip through the pages of the table, you can see that the squad groups are named in various ways. Some are based on the field/area they work in (e.g., forensics, Arson/Bomb) and others are based on locations (e.g., precinct + sector). NOTE: If this variable is considered important, we would need to approach the aggregation as we would for the case type descriptions: use the first descriptor before the dash, regular expressions, and fuzzy matching to get broad categories and to handle abbreviations, misspellings, and differences in word order.

Officer Identifier

## [1] 1262

There are 1,262 officers in this dataset.

Response Time

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0      246      574     2121     1886 14773646

Response time for each event is reported in seconds. The summary statistics suggest that there are some very long response times that are outliers. The longest response time is 14,773,646 seconds, which is more than 170 days. Let’s parse the seconds into larger units of time.

With the times parsed into periods and sorted from longest to shortest, we can see that the longest time was 170 days and the case was a test call. This is probably a candidate for exclusion. For completeness, the data below are also displayed sorted from shortest to longest, so that it is easier to see what the short response times are.
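The parsing step amounts to repeated integer division. A minimal sketch, with a hypothetical helper `parse_seconds` and an output format chosen to resemble the parsed values shown in this report:

```python
def parse_seconds(total_seconds):
    """Break a duration in seconds into days, hours, minutes, and seconds."""
    sign = "-" if total_seconds < 0 else ""
    days, rem = divmod(abs(int(total_seconds)), 86400)  # seconds per day
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{sign}{days}d {hours}H {minutes}M {seconds}S"


# Maximum and median response times from the summary above.
print(parse_seconds(14773646))
print(parse_seconds(574))
```

For the maximum response time this gives 170 days, 23 hours, 47 minutes, and 26 seconds; the median of 574 seconds is 9 minutes and 34 seconds.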

Total Service Time

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -2604     598    1455    2936    3664  107771     221

The distribution for total service time on events is strange. There are 221 events missing a total service time. Additionally, there is at least one event that had a negative total service time recorded. First, let’s see how many negative values we have.

Table 8: Total Service Time
Date	Service Time (Seconds)	Response Time (Parsed)	Case Type Final
2019-11-03	-2,604	3H 41M 19S	TRAFFIC - PARKING VIOL (EXCEPT ABANDONED CAR)
2019-11-03	-1,829	1H 22M 0S	CRISIS COMPLAINT - GENERAL
2019-11-03	-1,829	1H 22M 0S	CRISIS COMPLAINT - GENERAL
2019-11-03	-1,091	9M 1S	DISTURBANCE - OTHER
2019-11-03	-998	4M 50S	ASSAULTS, OTHER
2019-11-03	-699	2M 9S	DISTURBANCE - OTHER
2019-11-03	-699	2M 9S	DISTURBANCE - OTHER

There are only 7 events in the dataset with negative values. When we include information like the event date, parsed response time, and case description type, we notice that two of these rows duplicate other rows. The other thing that stands out is that these events were all recorded on the same date, November 3rd. It is possible that the negative values were a recording error that occurred that day. We could also check the average service time on other events of a similar type to see if the absolute value of total service time is reasonable.
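The negative-value and duplicate checks can be sketched with the seven rows from the table above. Treating rows as duplicates when date, service time, parsed response time, and case type all match is an assumption. The real check would probably also compare event identifiers.

```python
from collections import Counter

# (date, service time in seconds, parsed response time, final case type)
events = [
    ("2019-11-03", -2604, "3H 41M 19S", "TRAFFIC - PARKING VIOL (EXCEPT ABANDONED CAR)"),
    ("2019-11-03", -1829, "1H 22M 0S", "CRISIS COMPLAINT - GENERAL"),
    ("2019-11-03", -1829, "1H 22M 0S", "CRISIS COMPLAINT - GENERAL"),
    ("2019-11-03", -1091, "9M 1S", "DISTURBANCE - OTHER"),
    ("2019-11-03", -998, "4M 50S", "ASSAULTS, OTHER"),
    ("2019-11-03", -699, "2M 9S", "DISTURBANCE - OTHER"),
    ("2019-11-03", -699, "2M 9S", "DISTURBANCE - OTHER"),
]

# Events with a negative total service time.
negatives = [event for event in events if event[1] < 0]

# Rows that appear more than once with identical values in every field.
row_counts = Counter(negatives)
duplicated_rows = {row: n for row, n in row_counts.items() if n > 1}

# Distinct dates on which negative values occur.
negative_dates = {event[0] for event in negatives}

print(len(negatives), len(duplicated_rows), sorted(negative_dates))
```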

Now, let’s look at the NA values.

The events with missing values vary in case type. There appear to be some duplicates, e.g., the assault-DV case on January 13th. Again, it seems like event date and response time would be useful for identifying duplicates in this dataset.

For the sake of consistency, I parsed the total service time into time periods as I did with the response time. See some of the output below.

With service time parsed into periods, we see that the upper end of the service time distribution is just over 1 day (107,771 seconds, or about 30 hours).

Spatial Object

Before transforming the dataframe into a spatial object, the calls with missing or invalid coordinates need to be removed. After filtering those events out, the transformed spatial object contains 459,132 calls with locations. There were 21,679 calls for service that do not have valid coordinates. Mapping all of these calls as points results in over-plotting as shown below.

There are other approaches for visualizations that would be more informative. One approach is to create a point density map to show where the highest and lowest number of events per area occurred in the city. Another approach is to aggregate the points to meaningful geographic units like zipcodes or neighborhoods. The following sections demonstrate these approaches.
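As a rough idea of what a point density summary involves, the sketch below bins points into a regular latitude/longitude grid and counts the events in each cell. The cell size and the sample coordinates are made up for illustration, and this is a simpler technique than the interactive clustering and smoothed density maps shown in this report.

```python
import math
from collections import Counter


def grid_counts(points, cell_size=0.01):
    """Count points per grid cell; cells are cell_size degrees on each side."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        counts[cell] += 1
    return counts


# Made-up coordinates: two nearby points and one farther north.
points = [
    (47.6051, -122.3349),
    (47.6057, -122.3342),
    (47.6612, -122.3131),
]

density = grid_counts(points)
busiest_cell, busiest_count = density.most_common(1)[0]
print(len(density), busiest_cell, busiest_count)
```

Cells with the highest counts correspond to the densest areas; dividing each count by the cell area would give events per unit area.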

Point Density Mapping

This interactive map clusters points that are close to one another. Zoom into different parts of the city to see where clusters tend to occur.

Smoothed Point Density Map

The interactive map below shows the areas with high density of calls. Only areas with statistically significant densities are mapped. Highest density areas are in yellow and lowest are red.

Precinct Aggregation & Mapping

Now, let’s turn to visualizing the frequency of events in different geographic regions of Seattle. In one of the prior sections, I showed the frequency of events per precinct. However, 1,282 of the calls did not have a precinct listed. Now that the dataframe has been transformed to a spatial object, I can identify a precinct for those locations based on which precinct each coordinate pair lies within.

Service Calls per precinct, spatial overlay version
Precinct	# Calls
NORTH	135,485
WEST	124,569
EAST	76,677
SOUTH	72,369
SOUTHWEST	48,901
NA	1,131

One thing stands out from using the spatial overlay approach to assign precincts: 1,131 points are not assigned to a precinct. These points are unassigned because they lie outside of the precinct boundaries (see the map below). To make use of all of these points, the best thing to do would be to keep the precincts that were provided in the original dataset, then merge in the spatial overlay precincts for the subset of events whose precinct was listed as unknown. Finally, if there are still points missing precinct assignments, assign them to the precinct that they are nearest to. Let’s do that and then visualize the results.
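The assignment logic described above can be sketched in three steps: keep a recorded precinct when one exists, otherwise look for a containing polygon, and otherwise fall back to the nearest polygon. Everything below is illustrative. The two square polygons "A" and "B" are toy shapes rather than real precinct boundaries, and "nearest" is approximated by distance to the average of each polygon's vertices, which is cruder than a true nearest-boundary calculation.

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def vertex_mean(polygon):
    """Average of the vertices, used here as a rough polygon center."""
    xs, ys = zip(*polygon)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def assign_precinct(x, y, polygons, recorded=None):
    """Recorded precinct if known, else containing polygon, else nearest."""
    if recorded and recorded != "UNKNOWN":
        return recorded
    for name, polygon in polygons.items():
        if point_in_polygon(x, y, polygon):
            return name
    return min(polygons, key=lambda name: math.dist((x, y), vertex_mean(polygons[name])))


# Toy polygons (unit squares), not real precinct boundaries.
polygons = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

print(assign_precinct(0.5, 0.5, polygons))                     # inside A
print(assign_precinct(2.5, 0.5, polygons))                     # outside both
print(assign_precinct(0.5, 0.5, polygons, recorded="NORTH"))   # keeps recorded value
```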

The map shows not only the events per precinct, but also those events that are outside of the precinct boundaries. NOTE: I assigned the “outlying” points to the nearest precinct for the precinct layer. I included them in the visual just to show that some of the locations do lie outside of the city limits.

Zipcode Aggregation & Mapping

Another aggregation we can perform and visualize is at the zipcode level. Zipcode boundaries were pulled from Seattle’s Open Data website.

The table below lists the count of events per zipcode. There is a sizable range in calls per zipcode from 7 to 46,396. The map below shows the counts per zipcode.

Calls for Service per zipcode
Zipcode	# Calls
98104	46,396.0
98101	35,380.0
98122	30,939.0
98118	28,848.0
98103	26,205.0
98144	24,904.0
98105	23,154.0
98125	22,243.0
98109	22,224.0
98108	18,709.0
98107	18,199.0
98133	17,998.0
98121	17,270.0
98106	15,464.0
98134	15,150.0
98115	14,375.0
98126	12,678.0
98102	12,366.0
98116	10,678.0
98117	10,288.0
98112	9,839.0
98119	9,588.0
98136	4,773.0
98199	4,488.0
98178	2,561.0
98177	1,848.0
98146	1,008.0
98195	839.0
98155	549.0
98168	89.0
98188	11.0
98166	7.0

Zipcodes in the core of the city tend to have the highest counts. The zipcodes along the southeastern edge of the city also have relatively high counts, especially compared to the zipcodes along the southwestern side of the city.

Neighborhood Aggregation & Mapping

The Seattle Open Data website also makes neighborhood boundaries available. In the table below, the events were aggregated to the neighborhoods. This should allow us to drill down to smaller units than the zipcodes. The neighborhoods and their counts are also featured in the map below. We see that the neighborhoods in the city’s core, like the CBD, Broadway, and Pioneer Square, had the most calls. Just to the south of the city’s core, the Industrial District also had a relatively high number of calls. In the northern half of the city, the University District is the neighborhood with the highest number of calls.

Events per neighborhood
Neighborhood	# Calls
Central Business District	25,898.0
Broadway	24,422.0
Pioneer Square	23,567.0
Belltown	21,659.0
University District	18,528.0
Industrial District	15,100.0
First Hill	12,397.0
Greenwood	10,966.0
International District	10,836.0
North Beacon Hill	10,026.0
Adams	9,820.0
South Lake Union	9,762.0
Lower Queen Anne	9,678.0
Columbia City	9,217.0
Haller Lake	7,960.0
Fremont	7,726.0
Yesler Terrace	7,615.0
Atlantic	7,584.0
Georgetown	6,940.0
Minor	6,748.0
Dunlap	6,532.0
West Woodland	6,523.0
Wallingford	6,116.0
Pinehurst	5,917.0
Stevens	5,807.0
Bitter Lake	5,768.0
North College Park	5,586.0
South Delridge	5,407.0
Pike-Market	5,315.0
Brighton	5,267.0
Mid-Beacon Hill	5,165.0
Olympic Hills	5,127.0
Mount Baker	4,941.0
Cedar Park	4,772.0
Green Lake	4,651.0
High Point	4,549.0
Genesee	4,419.0
South Park	4,270.0
Maple Leaf	4,036.0
Roosevelt	4,011.0
Highland Park	3,958.0
North Admiral	3,957.0
Roxhill	3,740.0
East Queen Anne	3,606.0
North Delridge	3,488.0
Ravenna	3,471.0
Fairmount Park	3,177.0
Holly Park	3,047.0
North Queen Anne	2,960.0
Phinney Ridge	2,827.0
Alki	2,625.0
South Beacon Hill	2,613.0
Victory Heights	2,553.0
Interbay	2,550.0
Rainier Beach	2,431.0
Mann	2,412.0
Riverview	2,275.0
West Queen Anne	2,188.0
Broadview	2,068.0
Loyal Heights	2,061.0
Seward Park	1,996.0
Whittier Heights	1,992.0
Lawton Park	1,975.0
Eastlake	1,966.0
Crown Hill	1,951.0
Leschi	1,827.0
Westlake	1,761.0
Montlake	1,667.0
Wedgwood	1,610.0
Fauntleroy	1,509.0
Gatewood	1,471.0
Seaview	1,462.0
Sunset Hill	1,457.0
Rainier View	1,428.0
Matthews Beach	1,313.0
Madrona	1,275.0
Bryant	1,251.0
Meadowbrook	1,217.0
Southeast Magnolia	1,205.0
Arbor Heights	1,205.0
Sand Point	1,129.0
Madison Park	959.0
Laurelhurst	909.0
North Beach/Blue Ridge	781.0
Harrison/Denny-Blaine	739.0
Briarcliff	656.0
View Ridge	552.0
Harbor Island	468.0
Windermere	450.0
Portage Bay	421.0

Potential Next steps for mapping/spatial analysis:

Faceted maps of call locations subset by: 1) case type, 2) call priority, 3) clear by, 4) call type.
Density maps by any of the above categories.
Space-time slice maps/diagrams for cases of interest.
Bring in census block groups and integrate ACS demographics.