This web report includes descriptive statistics of the Seattle 911 CAD data. The report starts with an overall summary of the structure of the dataset and then steps through each variable in the dataset.
Let’s start by identifying the dimensions in the dataset.
## [1] 480811 17
There are 752,421 events and 17 variables in the CAD data export. For this analysis, we narrow the focus to 911 and other telephone call types only. The reduced dataset contains 480,811 events.
The variable names from the CAD export are listed below.
## [1] "CAD_Event_ID" "Dispatch_ID"
## [3] "Event_First_Dispatch_Time_ATTR" "Call_Priority_Code"
## [5] "Call_Type_Desc" "Case_Type_Final_Desc"
## [7] "Case_Type_Initial_Desc" "Clear_By_Desc"
## [9] "Dispatch_Address" "Officer_Serial_Num"
## [11] "Precinct" "Sector"
## [13] "Squad_Desc" "Dispatch_Blurred_Latitude"
## [15] "Dispatch_Blurred_Longitude" "CAD_Event_Response_Time_Seconds_SUM"
## [17] "Total_Service_Time_Seconds_SUM"
Now, let’s find the number of categories in the categorical variables. In subsequent sections, I will step through each variable and summarize the distributions in greater detail.
Dispatch ID | Priority Code | Call Type | Case Type Final Desc | Case Type Initial Desc | Clear By Desc | Precinct | Sector |
---|---|---|---|---|---|---|---|
474,571 | 8 | 2 | 320 | 231 | 23 | 6 | 18 |
Dispatch ID - This is some sort of identifier. It’s interesting that the identifiers are not unique to each event. What does the dispatch ID identify? Is this identifier going to be relevant for our analysis?
Call priority code and call type description have a manageable number of categories - 8 and 2, respectively. Call type has 8 possible categories, but we are only focusing on two categories - 911 and other telephone (not 911). After taking a deeper dive into the univariate statistics in the sections below and understanding what these categories mean, we can decide whether any of these categories should be aggregated.
Case Type Final and Case Type Initial Descriptions - These two variables have the greatest number of categories with 320 and 231 categories, respectively. We will want to parse out the categories and see how to regroup them into a smaller, more manageable set for analysis. After looking over the categories we can figure out some strategies for aggregating them.
Clear by description - There are 23 categories in this variable. After further review below, we can look to see if any aggregation is necessary.
Precinct is a categorical spatial indicator. It looks like the city is divided into 5 regional precincts; the sixth category is UNKNOWN.
Sector - There are 17 named sectors; the 18th category counted in the table above represents calls with no sector assigned. This variable appears to be another spatial category related to precinct. This will be described in the sector section below.
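The category counts discussed above can be reproduced by collecting the distinct values in each column. The following is a minimal sketch rather than the code behind this report: the sample rows are hypothetical, only the column names come from the CAD export, and a missing value is deliberately counted as its own category (which is why a count can be one higher than the number of named categories).

```python
from collections import defaultdict

# Hypothetical sample rows; the real export has 17 columns and 480,811 rows.
rows = [
    {"Call_Priority_Code": "1", "Precinct": "NORTH", "Sector": "BOY"},
    {"Call_Priority_Code": "2", "Precinct": "WEST", "Sector": "KING"},
    {"Call_Priority_Code": "2", "Precinct": "WEST", "Sector": "KING"},
    {"Call_Priority_Code": "3", "Precinct": "UNKNOWN", "Sector": None},
]


def count_categories(records, columns):
    """Return the number of distinct values per column.

    A missing value (None) is counted as one category of its own.
    """
    seen = defaultdict(set)
    for record in records:
        for column in columns:
            seen[column].add(record.get(column))
    return {column: len(values) for column, values in seen.items()}


category_counts = count_categories(
    rows, ["Call_Priority_Code", "Precinct", "Sector"]
)
print(category_counts)
```

Dropping `None` before counting would give the number of named categories instead, which is the distinction between 17 and 18 for sector.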
Before diving into the distributions of the categorical variables in greater detail, let’s take advantage of the fact that the data are time-stamped and get a sense of the frequency of events throughout the year.
The data are time stamped to the minute. In the graph below, I have displayed the frequency of events per day. Hover your mouse over the line graph to see the number of events that occurred on a given day.
The date with the highest number of events was July 5th, with 1,667 events recorded. In general, the summer months appear to have higher frequencies than the rest of the year.
November 14th, 2019 is the date with the most marked decrease in events. There were only 61 events recorded on November 14th. This is far below other days with fewer events than normal, as shown in Table 1 below. It raises the possibility of a glitch in the reporting system for that day.
Date | # Calls | Rank |
---|---|---|
2019-07-05 | 1,667 | 1 |
2019-06-12 | 1,637 | 2 |
2019-06-14 | 1,623 | 3 |
2019-07-13 | 1,617 | 4 |
2019-05-30 | 1,603 | 5 |
2019-06-10 | 1,590 | 6 |
2019-05-02 | 1,572 | 7 |
2019-07-04 | 1,572 | 7 |
2019-05-31 | 1,561 | 9 |
2019-06-01 | 1,556 | 10 |
2019-12-22 | 1,080 | 356 |
2019-03-03 | 1,068 | 357 |
2019-12-24 | 1,054 | 358 |
2019-11-28 | 1,038 | 359 |
2019-12-25 | 1,037 | 360 |
2019-02-10 | 1,030 | 361 |
2019-11-26 | 1,024 | 362 |
2019-02-03 | 999 | 363 |
2019-11-13 | 366 | 364 |
2019-11-14 | 61 | 365 |
On average, there were 1,317 calls for service per day in 2019. With the exception of the 61-event day on November 14th, there is not much skew in the distribution.
Daily Avg | Std. Dev | Median |
---|---|---|
1,317.29 | 147.2586 | 1,324 |
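One way to put a number on how unusual the low-volume days are is a z-score against the daily mean and standard deviation reported above. This is a rough sketch: the three counts are taken from Table 1, the threshold of 3 standard deviations is an assumption chosen for illustration, and the mean and standard deviation already include the outlying days, so the scores are slightly conservative.

```python
# Summary statistics from the table above.
daily_mean = 1317.29
daily_sd = 147.2586

# A few of the lowest-volume days from Table 1.
low_days = {"2019-02-03": 999, "2019-11-13": 366, "2019-11-14": 61}

# Assumed cutoff for illustration only; not a value used in the report.
Z_THRESHOLD = 3.0


def z_score(count, mean, sd):
    """Number of standard deviations a daily count sits from the mean."""
    return (count - mean) / sd


z_scores = {day: z_score(n, daily_mean, daily_sd) for day, n in low_days.items()}
flagged = sorted(day for day, z in z_scores.items() if abs(z) > Z_THRESHOLD)

for day, z in sorted(z_scores.items()):
    print(f"{day}: z = {z:.2f}")
print("Flagged:", flagged)
```

Under these assumptions both November 13th and November 14th are flagged, while February 3rd, the lowest of the otherwise ordinary days, is not.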
Code 2 is the most common priority code recorded with a total of 189,788 events. According to Table 3, Code 2 is about 40% of the events in 2019. Just over 96% of the calls for service are categorized as priority codes 1 through 3.
Codes 6 and 7 were very rare. They do not show up as clearly in the graph, but in Table 3 they total to 44 and 75 calls, respectively.
One other point to note is that there is not a code 8; the codes skip from 7 to 9.
Code | # Calls | % |
---|---|---|
1 | 154,849 | 32.21 |
2 | 189,788 | 39.47 |
3 | 118,628 | 24.67 |
4 | 8,971 | 1.87 |
5 | 6,565 | 1.37 |
6 | 44 | 0.01 |
7 | 75 | 0.02 |
9 | 1,891 | 0.39 |
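The percentages in Table 3, the combined share of codes 1 through 3, and the gap in the code sequence can all be checked directly from the counts. The sketch below simply re-derives them from the table; the variable names are my own.

```python
# Counts per priority code, copied from Table 3.
priority_counts = {
    1: 154849, 2: 189788, 3: 118628, 4: 8971,
    5: 6565, 6: 44, 7: 75, 9: 1891,
}

total_calls = sum(priority_counts.values())

# Percentage share of each code.
shares = {code: 100 * n / total_calls for code, n in priority_counts.items()}

# Combined share of codes 1 through 3.
top_three_share = sum(shares[code] for code in (1, 2, 3))

# Codes missing from the otherwise consecutive sequence.
all_codes = range(min(priority_counts), max(priority_counts) + 1)
missing_codes = [code for code in all_codes if code not in priority_counts]

print(total_calls, round(top_three_share, 2), missing_codes)
```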
Type | # Events | % |
---|---|---|
911 | 325,008 | 67.6 |
TELEPHONE OTHER, NOT 911 | 155,803 | 32.4 |
We have retained only calls for service that came in via 911 or other telephone calls (not via 911). 911 calls are about 68% of the calls and other telephone source makes up the remaining 32%.
For reference, prior to reducing the dataset, 911 calls were about 43% of the events and other telephone was 21%.
Flip through the pages in the table to view the number of events with each type of case final description. Recall that this variable has 320 different descriptions.
Some of these descriptions have a general description followed by a more specific description that follows a dash. We could parse on the general description and then aggregate to get a smaller set of categories. I demonstrate this in the table below.
This aggregation strategy reduced the number of categories by a little over half to 140. Disturbance cases are the most common followed by suspicious circumstances and traffic. If you flip through the pages, there are some categories that also appear to be similar to these top 3. For instance, traffic stop is listed on page 6, which seems like it could also fit under traffic. Also on page 6 is the category “Dist”, which is an abbreviation for disturbance. All of the descriptions and frequencies for the final case type descriptions are listed in the exported Excel file (shared over email and on the github page).
Other Comments
* Need to make sure to catch abbreviations using regular expressions (e.g., burg -> burglary).
* Similarly, use regular expressions for categories that look alike but differ in spacing or wording (e.g., Arson, Bombs, Explo; Abandoned car & Abandoned vehicle).
* “#NAME?” looks like it might be the classification for events that were not classified. There are 977 events with this classification, which is about 0.2% of events.
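The parse-and-aggregate strategy described above, together with the abbreviation handling, could look roughly like the following. This is an illustrative sketch, not the code used for the report: the abbreviation map is a hypothetical starting point, and the sample value "DIST - DV" is invented to show the abbreviation being expanded.

```python
import re
from collections import Counter

# Hypothetical abbreviation map; it would need to be built from the real data.
ABBREVIATIONS = {
    r"^DIST\b": "DISTURBANCE",
    r"^BURG\b": "BURGLARY",
}


def general_category(description):
    """Keep the text before the first ' - ' and expand known abbreviations."""
    head = re.split(r"\s+-\s+", description.strip(), maxsplit=1)[0]
    head = re.sub(r"\s+", " ", head).upper()  # normalize spacing and case
    for pattern, replacement in ABBREVIATIONS.items():
        head = re.sub(pattern, replacement, head)
    return head


sample_descriptions = [
    "DISTURBANCE - OTHER",
    "DISTURBANCE - OTHER",
    "DIST - DV",  # invented example of an abbreviated category
    "TRAFFIC - PARKING VIOL (EXCEPT ABANDONED CAR)",
    "CRISIS COMPLAINT - GENERAL",
    "ASSAULTS, OTHER",
]

category_totals = Counter(general_category(d) for d in sample_descriptions)
print(category_totals.most_common())
```

Splitting on a dash surrounded by whitespace, rather than on any dash, keeps hyphenated words intact. The word boundary in each pattern stops `DIST` from matching inside `DISTURBANCE`.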
The top two initial case type descriptions are similar to the final case description types.
One note on structure: fewer of these descriptions follow the pattern noted in the final descriptions, that is, a general description followed by a more specific detail, with the two separated by a dash “-”. Below, I have parsed the descriptions as I did with the final case descriptions; however, this may be a less useful approach for this variable.
Other Comments/Questions
* Need to make sure to catch abbreviations using regular expressions (e.g., HAZ -> HAZARD).
* “#NAME?” shows up again in this set of descriptions, though not as frequently as it did in the final descriptions (n=12,132).
* Would it be useful to compare final and initial descriptions? We could use some fuzzy matching and regular expressions if this is something important. If final descriptions are missing (meaning that they are coded as #NAME?) and initial descriptions are not missing, should the initial description be applied?
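If we did decide to fall back on the initial description when the final one is missing, the rule could be sketched as below. This is only one possible answer to the question raised above; the function names are hypothetical, and treating “#NAME?” and empty strings as missing is an assumption based on how that value appears in the data.

```python
# Values assumed to signify a missing description.
MISSING_MARKERS = {"", "#NAME?"}


def is_missing(description):
    """True when a description is absent or coded with a missing marker."""
    return description is None or description.strip() in MISSING_MARKERS


def resolve_case_type(final_desc, initial_desc):
    """Prefer the final description; fall back to the initial one.

    Returns the chosen description and a label recording its source, so
    that imputed values can be identified later.
    """
    if not is_missing(final_desc):
        return final_desc, "final"
    if not is_missing(initial_desc):
        return initial_desc, "initial"
    return None, "missing"


print(resolve_case_type("#NAME?", "DISTURBANCE - OTHER"))
print(resolve_case_type("TRAFFIC - PARKING VIOL", "#NAME?"))
```

Keeping the source label makes it possible to run the analysis with and without the imputed descriptions.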
Aggregating reduced the number of descriptions down to 123. The top four descriptions remain the same, but the rest of the top 10 have shifted ranks (e.g., assault, trespass).
NOTE: Unknown is pretty substantial here (n=13,907, 2.89%). The #NAME? description is less frequent (n=528), but appears to also signify unknown case descriptions.
Description | # Events | % |
---|---|---|
ASSISTANCE RENDERED | 167,457 | 34.83 |
REPORT WRITTEN (NO ARREST) | 148,495 | 30.88 |
UNABLE TO LOCATE INCIDENT OR COMPLAINANT | 55,366 | 11.52 |
PHYSICAL ARREST MADE | 41,140 | 8.56 |
NO POLICE ACTION POSSIBLE OR NECESSARY | 19,290 | 4.01 |
CITATION ISSUED (CRIMINAL OR NON-CRIMINAL) | 11,705 | 2.43 |
RESPONDING UNIT(S) CANCELLED BY RADIO | 10,169 | 2.11 |
ORAL WARNING GIVEN | 5,772 | 1.20 |
FOLLOW-UP REPORT MADE | 4,178 | 0.87 |
DUPLICATED OR CANCELLED BY RADIO | 4,137 | 0.86 |
OTHER REPORT MADE | 2,968 | 0.62 |
FALSE COMPLAINT/UNFOUNDED | 2,872 | 0.60 |
STREET CHECK WRITTEN | 2,477 | 0.52 |
- | 1,878 | 0.39 |
INCIDENT LOCATED, PUBLIC ORDER RESTORED | 1,839 | 0.38 |
RADIO BROADCAST AND CLEAR | 429 | 0.09 |
PROBLEM SOLVING PROJECT | 330 | 0.07 |
TRANSPORTATION OR ESCORT PROVIDED | 143 | 0.03 |
NON-CRIMINAL REFERRAL | 80 | 0.02 |
EXTRA UNIT | 35 | 0.01 |
SERVICE OF DVPA ORDER | 21 | 0.00 |
(NOT CURRENTLY USED) ALARM NO RESPONSE | 16 | 0.00 |
NO SUCH ADDRESS OR LOCATION | 14 | 0.00 |
In about 35% of the service calls (n=167,457), assistance was rendered. The next most common clear by type was report written (no arrest), which was applied to about 31% of the calls (n=148,495).
The next most common clear by type was unable to locate incident or complainant. It was applied to about 11.5% of calls (n=55,366). 14 calls were marked as no such address or location (not sure if it is reasonable to consider this as similar to unable to locate incident).
A physical arrest was made in about 8.6% of calls (n=41,140). No police action was possible or necessary in 4% of calls (n=19,290).
OTHER NOTES:
* It looks like a dash “-” represents a missing clear by description (n=1,878).
* There are some descriptions whose meaning is unclear to me, or that I cannot distinguish from other descriptions. For instance, how do responding units cancelled by radio and duplicated or cancelled by radio differ?
* Unable to locate incident or complainant is about 11.5% of the events.
Precinct | # Calls | % |
---|---|---|
NORTH | 141,513 | 29.43 |
WEST | 134,448 | 27.96 |
SOUTH | 76,186 | 15.85 |
EAST | 75,810 | 15.77 |
SOUTHWEST | 51,572 | 10.73 |
UNKNOWN | 1,282 | 0.27 |
The north and west precincts had the most calls with about 29% and 28% of all calls, respectively. The south and east precincts had similar shares of calls at about 16% each. The southwest precinct had the fewest calls recorded - 51,572 (11%).
For 1,282 calls, the precinct is listed as unknown. We may be able to identify a precinct for these events if they have valid latitude and longitude coordinates. Let’s look to see if they do have lat and long:
Coordinate Status | # Calls |
---|---|
Not valid coords | 720 |
Valid coords | 562 |
About 44% of the calls with an unknown precinct (562 of 1,282) have coordinates within the geographic extent of Seattle. We can assign a precinct to these 562 events. When I create a spatial object from the coordinates, as shown a few sections below, I will be able to plot them. For some it may be obvious what the precinct is based on the precinct labels given to neighboring events. If the precinct classification is not obvious, the best thing to do would be to obtain a shapefile of the polygons for each of the five precincts, overlay it on the events, and give each point the name of the precinct polygon that it falls within or is nearest to. Seattle’s Open Data website has such a shapefile that I will call on and use in the spatial geoprocessing section below.
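Both steps, checking that coordinates fall within the city's extent and assigning a precinct by point-in-polygon, can be sketched without any spatial libraries. Everything numeric here is an assumption made for illustration: the bounding box is a rough, generous approximation of Seattle's extent, and the square "precinct" polygon is invented. A real analysis would use the precinct shapefile from Seattle's Open Data site.

```python
# Rough bounding box for Seattle (assumed values, not from the report).
LAT_MIN, LAT_MAX = 47.4, 47.8
LON_MIN, LON_MAX = -122.5, -122.2


def is_valid_coordinate(lat, lon):
    """True when both values are present and inside the bounding box."""
    if lat is None or lon is None:
        return False
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Invented square standing in for a precinct boundary.
precinct_polygons = {
    "EXAMPLE": [(-122.4, 47.6), (-122.3, 47.6), (-122.3, 47.7), (-122.4, 47.7)],
}


def assign_precinct(lat, lon):
    """Return the first precinct whose polygon contains the point, if any."""
    if not is_valid_coordinate(lat, lon):
        return None
    for name, polygon in precinct_polygons.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


print(assign_precinct(47.65, -122.35))
print(assign_precinct(0.0, 0.0))
```

This sketch returns nothing for points that fall outside every polygon. The nearest-polygon fallback mentioned above would need a distance calculation that is not shown here.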
There are some interesting bivariate analyses that could be explored. For example, call priority codes and precincts. View the interactive stacked bar chart below.
A few things stand out in the stacked bar graph of call priority codes and precincts.
* The breakdown of precincts within codes 1 and 2 is very similar. The north and west precincts have very similar shares in these two codes.
* Most of the unknown precinct calls were classified as code 9.
* The south precinct had no code 7 cases.
* Over half of the calls in code 9 were in the west precinct.
Let’s turn to focus on the sectors. There are 17 distinct sector names. 1,282 calls were not given a sector. These calls are identical to those missing a precinct classification.
Precinct | Sector | # Calls | Percent |
---|---|---|---|
SOUTH | ROBERT | 30,327 | 39.81 |
SOUTH | SAM | 24,427 | 32.06 |
SOUTH | OCEAN | 21,432 | 28.13 |
EAST | EDWARD | 33,631 | 44.36 |
EAST | GEORGE | 21,241 | 28.02 |
EAST | CHARLIE | 20,938 | 27.62 |
SOUTHWEST | WILLIAM | 25,852 | 50.13 |
SOUTHWEST | FRANK | 25,720 | 49.87 |
WEST | KING | 45,650 | 33.95 |
WEST | DAVID | 32,159 | 23.92 |
WEST | MARY | 29,461 | 21.91 |
WEST | QUEEN | 27,178 | 20.21 |
NORTH | BOY | 33,684 | 23.80 |
NORTH | UNION | 32,389 | 22.89 |
NORTH | NORA | 27,150 | 19.19 |
NORTH | LINCOLN | 27,065 | 19.13 |
NORTH | JOHN | 21,225 | 15.00 |
UNKNOWN | NA | 1,282 | 100.00 |
Sectors are unique to precincts. We can think of a sector as a subdivision of the precinct.
King sector in the west precinct leads in the number of calls with 45,650 calls. This is about 34% of all calls in the west precinct. The shares of the other three sectors in the west precinct - David, Mary, and Queen - are about 10 to 14 percentage points lower than King's.
Boy in the north precinct and Edward in the east precinct are the sectors with the next highest frequency of calls with over 33,000 calls each. The share of calls in Boy is not substantially greater than other sectors in the north. However, Edward clearly has the largest share of calls in the east precinct, amounting to about 44% of all calls in the precinct.
The two sectors of the southwest precinct - William and Frank - have a 50-50 split of the calls.
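The within-precinct percentages in the sector table are each sector's count divided by its precinct total. As a worked check, the sketch below recomputes the west precinct's shares from the table and the gap, in percentage points, between King and each of the other sectors. The variable names are my own.

```python
# Sector counts for the west precinct, copied from the table above.
west_sectors = {"KING": 45650, "DAVID": 32159, "MARY": 29461, "QUEEN": 27178}

west_total = sum(west_sectors.values())

# Share of precinct calls handled in each sector, in percent.
west_shares = {name: 100 * n / west_total for name, n in west_sectors.items()}

# Gap between King and every other sector, in percentage points.
gaps_vs_king = {
    name: west_shares["KING"] - share
    for name, share in west_shares.items()
    if name != "KING"
}

for name, share in west_shares.items():
    print(f"{name}: {share:.2f}%")
print({name: round(gap, 2) for name, gap in gaps_vs_king.items()})
```

The precinct total recovered this way matches the 134,448 calls reported for the west precinct in the precinct table.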
NOTE: The Seattle Open Data website does not appear to have a boundary shapefile or API for sector. This may be something to inquire about if we want to do point-in-polygon analyses at the sector level.
This is one of the variables with an unmanageable number of categories. Only 1,487 events are missing a squad description. If you flip through the pages of the table you can see that the squad groups are named in various ways. Some are based on the field/area they work in (e.g., forensics, Arson/Bomb) and others are based on locations (i.e., precinct + sector). NOTE: If this variable is considered important, we would need to approach the aggregation as we would for the case type descriptions, using the first descriptor before the dash, regular expressions, and fuzzy matching to get broad categories and to handle abbreviations, misspellings, and differences in word order.
## [1] 1262
There are 1,262 officers in this dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 246 574 2121 1886 14773646
Response time for each event is reported in seconds. The summary statistics suggest that there are some very long response times that are outliers. The longest response time is 14,773,646 seconds, which would be many, many days long. Let’s parse the seconds into higher levels of time.
With the times parsed into periods and sorted from longest to shortest, we can see that the longest time was 170 days and the case was a test call. This is probably a candidate for exclusion. For completeness, the data are also displayed below sorted from shortest to longest, so that it is easier to see what the short response times are.
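Parsing seconds into higher units of time is repeated integer division. The sketch below is an illustration rather than the function used for this report; the name `format_period` is hypothetical, and the output style loosely mimics the "3H 41M 19S" periods shown elsewhere in the report.

```python
def format_period(total_seconds):
    """Convert a number of seconds into a 'd H M S' string.

    Leading units that are zero are dropped, and negative inputs keep
    their sign so that recording errors remain visible.
    """
    sign = "-" if total_seconds < 0 else ""
    minutes, seconds = divmod(abs(int(total_seconds)), 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)

    parts = [(days, "d"), (hours, "H"), (minutes, "M")]
    while parts and parts[0][0] == 0:
        parts.pop(0)  # drop leading zero units
    parts.append((seconds, "S"))
    return sign + " ".join(f"{value}{unit}" for value, unit in parts)


print(format_period(14773646))  # longest response time in the summary
print(format_period(574))       # median response time in the summary
```

Applied to the maximum of 14,773,646 seconds, this gives 170 days, 23 hours, 47 minutes and 26 seconds.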
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -2604 598 1455 2936 3664 107771 221
The distribution for total service time on events is strange. There are 221 events missing a total service time. Additionally, there is at least one event that had a negative total service time recorded. First, let’s see how many negative values we have.
Date | Service Time (Seconds) | Response Time (Parsed) | Case Type Final |
---|---|---|---|
2019-11-03 | -2,604 | 3H 41M 19S | TRAFFIC - PARKING VIOL (EXCEPT ABANDONED CAR) |
2019-11-03 | -1,829 | 1H 22M 0S | CRISIS COMPLAINT - GENERAL |
2019-11-03 | -1,829 | 1H 22M 0S | CRISIS COMPLAINT - GENERAL |
2019-11-03 | -1,091 | 9M 1S | DISTURBANCE - OTHER |
2019-11-03 | -998 | 4M 50S | ASSAULTS, OTHER |
2019-11-03 | -699 | 2M 9S | DISTURBANCE - OTHER |
2019-11-03 | -699 | 2M 9S | DISTURBANCE - OTHER |
There are only 7 events in the dataset with negative values. When we include information like the event date, parsed response time, and case description type, we notice that two pairs of these rows appear to be duplicates. The other thing that stands out is that these events were all recorded on the same date, November 3rd. It is possible that the negative values were a recording error that occurred that day. One possible explanation is that daylight saving time ended in the United States on November 3rd, 2019, and all of the negative values are less than one hour in magnitude. We could also check the average service time on other events of a similar type to see if the absolute value of total service time is reasonable.
Now, let’s look at the NA values. The events with missing values vary in case type. There appear to be some duplicates, e.g., the assault-DV case on January 13th. Again, it seems like event date and response time would be useful for identifying duplicates in this dataset.
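The duplicate check can be expressed by counting identical rows across the displayed fields. The rows below are copied from the table above. Treating rows that match on these four fields as duplicates is an assumption, since they could still be distinct dispatch records that differ on a field not shown here.

```python
from collections import Counter

# (date, service time in seconds, parsed response time, final case type)
negative_rows = [
    ("2019-11-03", -2604, "3H 41M 19S", "TRAFFIC - PARKING VIOL (EXCEPT ABANDONED CAR)"),
    ("2019-11-03", -1829, "1H 22M 0S", "CRISIS COMPLAINT - GENERAL"),
    ("2019-11-03", -1829, "1H 22M 0S", "CRISIS COMPLAINT - GENERAL"),
    ("2019-11-03", -1091, "9M 1S", "DISTURBANCE - OTHER"),
    ("2019-11-03", -998, "4M 50S", "ASSAULTS, OTHER"),
    ("2019-11-03", -699, "2M 9S", "DISTURBANCE - OTHER"),
    ("2019-11-03", -699, "2M 9S", "DISTURBANCE - OTHER"),
]

row_counts = Counter(negative_rows)

# Rows that appear more than once across the displayed fields.
duplicated_rows = {row: n for row, n in row_counts.items() if n > 1}

# One copy of each distinct row, in first-seen order.
distinct_rows = list(row_counts)

# Are all negative values less than one hour in magnitude?
within_one_hour = all(-3600 < row[1] < 0 for row in negative_rows)

print(len(duplicated_rows), len(distinct_rows), within_one_hour)
```

The same approach could be applied to the events with missing service times, using event date and response time as the matching fields.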
For the sake of consistency, I parsed the total service time into time periods as I did with the response time. See some of the output below.
With the service time parsed into periods, we see that the upper end of the service time distribution is just over 1 day (107,771 seconds, or roughly 30 hours).
Before transforming the dataframe into a spatial object, the calls with missing or invalid coordinates need to be removed. After filtering those events out, the transformed spatial object contains 459,132 calls with locations. There were 21,679 calls for service that do not have valid coordinates. Mapping all of these calls as points results in over-plotting as shown below.
There are other approaches for visualizations that would be more informative. One approach is to create a point density map to show where the highest and lowest number of events per area occurred in the city. Another approach is to aggregate the points to meaningful geographic units like zipcodes or neighborhoods. The following sections demonstrate these approaches.
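The idea behind a density map or an aggregation to geographic units is to replace individual points with counts per area. As a minimal stand-in for those approaches, the sketch below bins points into a regular latitude/longitude grid and counts the events in each cell. The cell size and the sample points are invented for illustration; real geographic units such as zip codes or neighborhoods would require boundary polygons instead of a grid.

```python
import math
from collections import Counter

# Assumed grid resolution in degrees (roughly 1 km north to south).
CELL_SIZE_DEG = 0.01


def grid_cell(lat, lon, cell_size=CELL_SIZE_DEG):
    """Return the integer (row, column) index of the cell containing a point."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))


# Invented sample points as (latitude, longitude) pairs.
points = [
    (47.605, -122.335),
    (47.609, -122.331),
    (47.655, -122.305),
]

cell_counts = Counter(grid_cell(lat, lon) for lat, lon in points)
busiest_cell, busiest_count = cell_counts.most_common(1)[0]

print(cell_counts)
print("Busiest cell:", busiest_cell, "with", busiest_count, "events")
```

Plotting one value per cell avoids the over-plotting that results from drawing every call as its own point.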
This interactive map clusters points that are close to one another. Zoom into different parts of the city to see where clusters tend to occur.