Exploring fear of crime in England: the role of area deprivation and crime victimisation

SC385 | L5. Introduction to survey methods | 2025/26

Author

Pablo CA

Introduction

Fear of crime usually refers to the fear of personally becoming a victim of particular types of crime. Victim surveys typically measure it with questions that ask respondents, for example, how likely they think it is that they will be burgled in the coming year. Fear of crime is more common than victimisation and is important because it can lead to stress and behavioural precautions that impinge on quality of life (Lane, 2015). To what extent do area deprivation and crime victimisation explain fear of crime? There is an ongoing discussion about the role of neighbourhood characteristics in individual outcomes. Some studies in the US have shown that deprived areas tend to be racially segregated and to have higher crime rates and lower-quality public services (e.g., Sampson et al., 1997).

The objective of this lab will be to explore how the level of fear of crime is related to area deprivation and crime victimisation in England. The analysis will also control for other individual characteristics, like personal income or education.

R packages and options

Before introducing the survey we will be using for the analysis, let’s load the packages and set the options for the analysis. If you do not have the packages installed, you can start by installing them using the code below.

# install.packages(c("tidyverse", "survey", "labelled", "sjmisc", "naniar", "modelsummary"))

library(tidyverse)
library(survey)

The dataset

We will use the Crime Survey for England and Wales (CSEW) 2017/18 teaching dataset for this lab. The CSEW is a large-scale, household-based survey designed to collect data on experiences of crime, perceptions of crime, and attitudes toward the criminal justice system. The survey is conducted throughout the year and collects data about the experiences of crime in the last 12 months. The interviews for the 2017/18 edition were conducted face-to-face, with sensitive topics addressed through a self-completion questionnaire. The survey is designed to be representative of the population living in private households in England and Wales.

The Crime Survey for England and Wales (CSEW)

The CSEW aims to provide insights into crime trends over time, including those not reported to the police, and public perceptions of crime and safety.
The survey covers England and Wales and includes individuals aged 16 and over, with a separate module for children aged 10-15. One adult is randomly selected for interview in each sampled household.
The survey uses a stratified, multi-stage random probability sample. A sample of postcode sectors (the primary sampling units, PSUs) is selected from the Postcode Address File (PAF). Then, addresses are selected within postcode sectors.
The 2017/18 CSEW interviewed approximately 35,000 adults and 3,000 children aged 10-15. The dataset you will use for the exercise is a subsample of the full dataset.

The data can be downloaded through the UK Data Service.

First, let’s read in the data, which you will find on Moodle in RDS format. The dataset contains the responses of 7,916 adults to the individual interview and 19 variables. You can use the functions head() or glimpse() to have a first look at the dataset.

df <- read_rds("data/cse_2017-18_teaching.RDS")

head(df)

serial	worryx	wburgl	wmugged	wattack	wraceatt	carmot	cardamag	yrdeface	delibvio	imd_quint	income	sex	ageg	edu	bornuk	strata	psu	weight
423023	-0.2484040	Fairly worried	Not very worried	Not very worried	Not at all worried	Yes	No	No	No	Q5 (Less deprived areas)	15,000-29,999	Male	55-64	Apprenticeship or A/AS level	Born in UK	30011	11172802	0.6150459
423119	-0.8156970	Not very worried	Not at all worried	Not very worried	Not at all worried	No	NA	No	No	Q4	Under 15,000	Male	16-24	Apprenticeship or A/AS level	Born in UK	30011	11172418	1.6044675
423211	-0.5336742	Not very worried	Not very worried	Not at all worried	Not very worried	Yes	No	No	No	Q2	Under 15,000	Male	55-64	None	Born in UK	30011	11172294	1.1845964
423227	0.9187671	Not very worried	Fairly worried	Fairly worried	Not very worried	Yes	No	No	No	Q3	15,000-29,999	Female	35-44	Degree or diploma	Born in UK	30011	11172730	0.5956991
423231	-1.3814231	Not at all worried	Not at all worried	Not at all worried	Not at all worried	Yes	No	No	No	Q4	15,000-29,999	Male	55-64	None	Born in UK	30011	11172918	1.7307217
423323	-0.8141301	Not very worried	Not very worried	Not at all worried	Not at all worried	Yes	No	No	No	Q5 (Less deprived areas)	30,000-49,999	Male	35-44	Degree or diploma	Born in UK	30011	11172970	1.3163648

Exploring the dataset

The function var_label of the package labelled can extract the label of each variable in the data frame. The code below is a simple way to generate a data frame with two columns: the variable name and the label.

labelled::var_label(df) %>% 
  as_tibble() %>% # transform the output of var_label into a tibble object
  pivot_longer(everything(), names_to = "variable_name", values_to = "variable_label") # reshape the data frame from a wide format to long format (more convenient for exploring the data)

variable_name	variable_label
serial	Serial number (6 digits)
worryx	Worry about being a victim of crime (high score = high level of worry)
wburgl	How worried about having your home broken into
wmugged	How worried about being mugged and robbed
wattack	How worried about being physically attacked by strangers
wraceatt	How worried about being attacked because of skin colour, ethnic origin or religi
carmot	If has a car ot motorcycle
cardamag	If vehicle tampered with or damaged
yrdeface	If anything was damaged outside current residence
delibvio	If anyone has deliberately used force/violence on adult respondent
imd_quint	English Index of Multiple Deprivation 2015 (quintiles)
income	What is your personal (and partners) gross income
sex	Adult number 1 (respondent): Sex
ageg	Age group (7 bands)
edu	Respondent education (5 categories)
bornuk	Person was born in the UK
strata	Stratum (2015 definition)
psu	PSU (2015 definition)
weight	Individual level weight (mean=1)

Analysis objective

The objective of this lab is to analyse whether the level of fear of crime is related to area deprivation and crime victimisation controlling for other demographic factors: sex, age, education, personal income, and whether the person was born in the UK.

To measure fear of crime we will use the variable worryx. Higher scores in the variable indicate a higher level of worry about suffering a crime. This variable is derived from survey items that measure how worried the respondent was about:
- having their home broken into (wburgl)
- being mugged and robbed (wmugged)
- being physically attacked by strangers (wattack)
- being attacked because of skin colour, ethnic origin or religion (wraceatt)
To measure crime victimisation we will use three variables that identify whether the respondent has been a victim of at least one type of crime in the last 12 months:
- If vehicle tampered with or damaged (cardamag)
- If anything was damaged outside current residence (yrdeface)
- If anyone has deliberately used force/violence on adult respondent (delibvio)
To measure the deprivation of the area where the respondent lives, we will use the Index of Multiple Deprivation (IMD) in quintiles (imd_quint). The IMD ranks the neighbourhoods in England from the most deprived to the least deprived. It is a composite measure that includes income, employment, health, education, crime, and living environment.

Index of Multiple Deprivation (IMD)

The Index of Multiple Deprivation (IMD) measures relative deprivation in small areas across each of the countries of the United Kingdom.
Areas are ranked from the most deprived area (rank 1) to the least deprived area.
Each country measures deprivation in a slightly different way but the broad themes include income, employment, education, health, crime, barriers to housing and services, and the living environment.
The statistical unit areas used to provide indices of relative deprivation across the country are Lower layer Super Output Areas (LSOAs) [England, Wales], Data Zones [Scotland] and Super Output Areas or Wards [Northern Ireland].
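As a rough illustration of how a ranking can be turned into quintiles like those stored in imd_quint, the sketch below splits 20 hypothetical area ranks into five equally sized groups. The ranks and the object names (area_rank, imd_toy) are assumptions made for illustration only; the quintiles in the dataset come from the published national ranking.

```r
# Hypothetical ranks for 20 areas: 1 = most deprived, 20 = least deprived
area_rank <- 1:20

# Assign each area to one of five equally sized groups based on its rank
quintile <- ceiling(5 * area_rank / max(area_rank))

# Label the groups in the same direction as imd_quint (Q1 = most deprived)
imd_toy <- factor(quintile, levels = 1:5, labels = paste0("Q", 1:5))

# Each quintile contains the same number of areas
table(imd_toy)
```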

More information about the IMD.

The analysis will be conducted in two steps:

1. Preparing variables for the analysis: dealing with missing values. We will explore the data and identify the missing values in the variables and will make decisions about how to deal with the missing values in these variables.
2. Exploring the factors related to fear of crime. We will estimate the relationship between the level of fear of crime, area deprivation and crime victimisation using a linear regression model. This will be an opportunity to learn how to estimate the model using the survey package in R, which takes into account some information about the complex sample design and weights to adjust for non-response.
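To give a flavour of the second step, here is a minimal sketch of how the design variables (strata, psu and weight) could be declared with the survey package before fitting a model. The data frame toy and all its values are made up so that the code runs on its own; only the column names mirror the teaching dataset, and the model is deliberately simpler than the one used in the lab.

```r
library(survey)

# Toy data standing in for the CSEW teaching dataset (all values are made up)
set.seed(385)
toy <- data.frame(
  strata    = rep(1:2, each = 40),   # 2 strata
  psu       = rep(1:8, each = 10),   # 4 PSUs within each stratum
  weight    = runif(80, 0.5, 1.5),   # individual weights
  imd_quint = factor(rep(c("Q1", "Q3", "Q5"), length.out = 80)),
  worryx    = rnorm(80)
)

# Declare the complex sample design: clusters, strata and weights
design <- svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = toy)

# Design-based mean and linear regression
svymean(~worryx, design)
fit <- svyglm(worryx ~ imd_quint, design = design)
summary(fit)
```

The standard errors reported by svymean() and svyglm() take the clustering and stratification into account, which a plain lm() call would ignore.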

1. Preparing the variables for the analysis: dealing with missing values

Types of missing data

Planned missingness. The researcher plans that some questions are not going to be presented to some respondents:
- A long questionnaire might be split into a common set of questions that are asked to all respondents and a different set of questions that are asked to a random subset.
- Some questions might be asked only to those who meet certain criteria. For example, questions about the experience of using public transport might be asked only to those who use public transport.
Unplanned missingness. Sometimes a respondent refuses to answer a question or, in self-administered mode, skips it, either deliberately or by mistake. The result is that we do not have information about this question for this respondent.
Non-substantive responses. There are some responses that might be taken as substantive or missing depending on the research question and the objectives of the analysis. For example, a don’t know category is relevant if you intend to measure knowledge.
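In R, these types of missing data do not always look the same. The sketch below uses a small hypothetical factor to show that is.na() only detects system-missing values (NA), while non-substantive answers such as "Refused" are ordinary levels that have to be counted explicitly.

```r
# Hypothetical answers: NA is a system-missing value,
# "Refused" and "Don't know" are ordinary factor levels
x <- factor(c("Yes", "No", "Refused", NA, "No", "Don't know"))

# Counts only the system-missing value
n_system <- sum(is.na(x))
n_system   # → 1

# Counts system-missing values plus non-substantive answers
n_all <- sum(is.na(x) | x %in% c("Refused", "Don't know"))
n_all      # → 3
```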

For a comprehensive guide on missing data in surveys, see Reid and Allum (2019).

Exercise 1. Identifying and dealing with missing values

The first step in the analysis is to explore the data and identify the missing values. Before starting with the analysis, the analyst should have a clear answer to the following questions: What type and how many missing values are in the data? How can they affect the analysis? Which decisions, if any, do I need to make before proceeding with the analysis? This exercise will guide you through the process of identifying the missing values in the dataset.

E1.1 | What is the proportion of missing values in the dependent and independent variables?

Produce descriptives to examine the variables that will be included in the analysis: worryx, sex, ageg, edu, income, bornuk, cardamag, yrdeface, delibvio and imd_quint.

What proportion of missing values do you observe for each variable? Which variables would need some preparation before fitting the linear regression model?

a) Produce frequency tables of the categorical and ordinal variables (all except worryx). You can use the function select to select these variables and then use the function sjmisc::frq() to produce the frequency tables. The frequency tables will allow us to explore the variables and identify possible missing values. What proportion of missing values do you observe for each variable?

Selecting multiple variables: selector helpers

R has some helper functions to select multiple variables at once. The colon operator : can be used to select a range of variables. For example, var1:var3 will select all variables from var1 to var3. The c() function can be used to select multiple variables. For example, c(var1, var2, var3) will select the variables var1, var2, and var3.

If you use tidyverse (i.e., dplyr package) for data management some additional helpers are available:

starts_with("prefix") selects all variables that start with “prefix”.
ends_with("suffix") selects all variables that end with “suffix”.
contains("pattern") selects all variables that contain “pattern”.
everything() selects all variables.

More information (and helper functions) are available at tidy select.
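The helpers above can be tried on a small hypothetical data frame. The column names below are borrowed from the teaching dataset, but the values are made up; names() is used only to show which columns each call keeps.

```r
library(dplyr)

# Toy data frame with a few column names borrowed from the dataset
toy <- data.frame(
  serial  = 1:2,
  wburgl  = c("Fairly worried", "Not very worried"),
  wmugged = c("Not very worried", "Not at all worried"),
  wattack = c("Not very worried", "Not very worried"),
  sex     = c("Male", "Female")
)

names(select(toy, wburgl:wattack))      # range of adjacent columns
names(select(toy, c(serial, sex)))      # explicit list of columns
names(select(toy, starts_with("w")))    # columns starting with "w"
names(select(toy, ends_with("ed")))     # columns ending with "ed"
names(select(toy, contains("att")))     # columns containing "att"
```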

💡 Solution E1.1 a)

As you can see below, some variables such as sex, ageg and imd_quint have no missing values. Other variables have a small number (<1%) of user missing values (i.e., NA, “Refused” or “Don’t Know”).

However, the variable cardamag has a high number of system-missing values (<NA>): about 20% of cases. Later in the exercise we will determine what is causing these missing values.

The variable income also has a high number of missing values: about 13% of respondents did not know or refused to answer the question. The variable income is a key variable in our analysis, so we need to decide how to deal with the missing values in this variable.

The frequency tables are also useful to identify the levels (or categories) that need to be set as missing for the analysis. For example, the variables income, cardamag, yrdeface, delibvio and bornuk have the categories “Refused” and “Don’t know”, which should be set as missing (NA).

df %>% select(cardamag:bornuk) %>% sjmisc::frq()

If vehicle tampered with or damaged (cardamag) <categorical> 
# total N=7916 valid N=6350 mean=1.95 sd=0.23

Value      |    N | Raw % | Valid % | Cum. %
--------------------------------------------
Yes        |  318 |  4.02 |    5.01 |   5.01
No         | 6023 | 76.09 |   94.85 |  99.86
Refused    |    1 |  0.01 |    0.02 |  99.87
Don't know |    8 |  0.10 |    0.13 | 100.00
<NA>       | 1566 | 19.78 |    <NA> |   <NA>

If anything was damaged outside current residence (yrdeface) <categorical> 
# total N=7916 valid N=7916 mean=1.99 sd=0.13

Value      |    N | Raw % | Valid % | Cum. %
--------------------------------------------
Yes        |  115 |  1.45 |    1.45 |   1.45
No         | 7794 | 98.46 |   98.46 |  99.91
Refused    |    2 |  0.03 |    0.03 |  99.94
Don't know |    5 |  0.06 |    0.06 | 100.00
<NA>       |    0 |  0.00 |    <NA> |   <NA>

If anyone has deliberately used force/violence on adult respondent (delibvio) <categorical> 
# total N=7916 valid N=7916 mean=1.99 sd=0.12

Value      |    N | Raw % | Valid % | Cum. %
--------------------------------------------
Yes        |  110 |  1.39 |    1.39 |   1.39
No         | 7802 | 98.56 |   98.56 |  99.95
Refused    |    2 |  0.03 |    0.03 |  99.97
Don't know |    2 |  0.03 |    0.03 | 100.00
<NA>       |    0 |  0.00 |    <NA> |   <NA>

English Index of Multiple Deprivation 2015 (quintiles) (imd_quint) <categorical> 
# total N=7916 valid N=7916 mean=3.03 sd=1.40

Value                    |    N | Raw % | Valid % | Cum. %
----------------------------------------------------------
Q1 (Most deprived areas) | 1509 | 19.06 |   19.06 |  19.06
Q2                       | 1548 | 19.56 |   19.56 |  38.62
Q3                       | 1632 | 20.62 |   20.62 |  59.23
Q4                       | 1626 | 20.54 |   20.54 |  79.78
Q5 (Less deprived areas) | 1601 | 20.22 |   20.22 | 100.00
<NA>                     |    0 |  0.00 |    <NA> |   <NA>

What is your personal (and partners) gross income (income) <categorical> 
# total N=7916 valid N=7916 mean=2.77 sd=1.44

Value          |    N | Raw % | Valid % | Cum. %
------------------------------------------------
Under 15,000   | 1860 | 23.50 |   23.50 |  23.50
15,000-29,999  | 1998 | 25.24 |   25.24 |  48.74
30,000-49,999  | 1604 | 20.26 |   20.26 |  69.00
50,000 or over | 1453 | 18.36 |   18.36 |  87.35
Refused        |  590 |  7.45 |    7.45 |  94.81
Don't Know     |  411 |  5.19 |    5.19 | 100.00
<NA>           |    0 |  0.00 |    <NA> |   <NA>

Adult number 1 (respondent): Sex (sex) <categorical> 
# total N=7916 valid N=7916 mean=1.53 sd=0.50

Value  |    N | Raw % | Valid % | Cum. %
----------------------------------------
Male   | 3691 | 46.63 |   46.63 |  46.63
Female | 4225 | 53.37 |   53.37 | 100.00
<NA>   |    0 |  0.00 |    <NA> |   <NA>

Age group (7 bands) (ageg) <categorical> 
# total N=7916 valid N=7916 mean=4.19 sd=1.81

Value |    N | Raw % | Valid % | Cum. %
---------------------------------------
16-24 |  529 |  6.68 |    6.68 |   6.68
25-34 | 1187 | 14.99 |   14.99 |  21.68
35-44 | 1312 | 16.57 |   16.57 |  38.25
45-54 | 1359 | 17.17 |   17.17 |  55.42
55-64 | 1266 | 15.99 |   15.99 |  71.41
65-74 | 1252 | 15.82 |   15.82 |  87.23
75+   | 1011 | 12.77 |   12.77 | 100.00
<NA>  |    0 |  0.00 |    <NA> |   <NA>

Respondent education (5 categories) (edu) <categorical> 
# total N=7916 valid N=7880 mean=2.93 sd=1.24

Value                        |    N | Raw % | Valid % | Cum. %
--------------------------------------------------------------
None                         | 1563 | 19.74 |   19.84 |  19.84
O level/GCSE                 | 1353 | 17.09 |   17.17 |  37.01
Apprenticeship or A/AS level | 1381 | 17.45 |   17.53 |  54.53
Degree or diploma            | 3242 | 40.96 |   41.14 |  95.67
Other                        |  341 |  4.31 |    4.33 | 100.00
<NA>                         |   36 |  0.45 |    <NA> |   <NA>

Person was born in the UK (bornuk) <categorical> 
# total N=7916 valid N=7916 mean=1.84 sd=0.37

Value       |    N | Raw % | Valid % | Cum. %
---------------------------------------------
Born abroad | 1264 | 15.97 |   15.97 |  15.97
Born in UK  | 6640 | 83.88 |   83.88 |  99.85
Refused     |   11 |  0.14 |    0.14 |  99.99
Don't Know  |    1 |  0.01 |    0.01 | 100.00
<NA>        |    0 |  0.00 |    <NA> |   <NA>

b) Produce a descriptive table of the continuous variable worryx (dependent variable). You can use the functions sjmisc::descr() or summary() to obtain basic descriptives and the number (or percentage) of missing values. What proportion of missing values do you observe for worryx?

Note that higher values of worryx indicate a higher level of worry about suffering a crime, while lower values indicate a lower level of worry.

💡 Solution E1.1 b)

As you can see below, only 174 cases (2.2% of all observations) are missing for the variable worryx.

summary(df$worryx)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
-1.38142 -0.77283 -0.22938 -0.01316  0.36176  2.88169      174
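summary() reports the number of NA values but not the percentage. One simple way to obtain it, sketched here in base R on a hypothetical data frame, is to take the column means of is.na(), because the mean of a logical vector is the proportion of TRUE values.

```r
# Toy data frame with missing values (all values are made up)
toy <- data.frame(
  worryx = c(-0.25, NA, 0.92, -1.38),
  income = c("Under 15,000", NA, NA, "15,000-29,999")
)

# Percentage of system-missing values (NA) in each column
pct_missing <- round(100 * colMeans(is.na(toy)), 1)
pct_missing
```

In this toy example, one of the four worryx values and two of the four income values are NA, so the result is 25 and 50 respectively.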

E1.2 | Why does the variable (`cardamag`) have ~20% missing values (`<NA>`)?

a) The variable cardamag is about the respondents’ cars or motorcycles. It is time to look at the data frame (or questionnaire) to see whether its missing values are planned or unplanned missingness. Identify the variable that measures whether the respondent has a car or motorcycle and produce a cross-tab using the function table(). Are the <NA> of cardamag planned or unplanned missingness? How would you treat these missing values in the analysis?

💡 Solution E1.2 a)

The cross-tab below shows that cardamag is only asked of those who own a car or motorcycle: everyone who answered "No" at carmot (1,565 cases), plus the one respondent who refused to answer carmot, is missing at cardamag. This is planned missingness. For the purpose of our analysis, we can treat these respondents as not having had a vehicle damaged and recode the <NA> as "No" at cardamag.

table(df$carmot, df$cardamag, useNA = "always")

         
           Yes   No Refused Don't know <NA>
  Yes      318 6023       1          8    0
  No         0    0       0          0 1565
  Refused    0    0       0          0    1
  <NA>       0    0       0          0    0

b) Recode the variable cardamag so that the <NA> values are recoded to "No". One difficulty is that <NA> is not a level of the factor, so it cannot be recoded directly. The function fct_na_value_to_level() can be used to turn the missing values into a level of the factor. See the example below to learn how it works.

Dealing with factors in R: the forcats package

The forcats package provides a set of functions to work with factors in R.

fct_recode() can be used to recode the levels of a factor.
fct_collapse() can be used to collapse the levels of a factor into new levels.

Sometimes we need to manipulate NA values as a level of the factor. The function fct_na_value_to_level() can be used to set the missing values as a level of the factor (see example below).

The forcats package is part of tidyverse and all its functions start with fct_. The forcats functions used to transform variables can be called within mutate(). More information is available at forcats.

# EXAMPLE: recode the variable education (edu) to a factor with three levels:
# "Degree", "No degree" and "Missing". The missing category should include those
# with NA and those responding "Other" to the education question. This is easier
# to do with the fct_collapse() function.

# 1. Set the missing values as a level of the factor
table(df$edu, useNA = "always")


                        None                 O level/GCSE 
                        1563                         1353 
Apprenticeship or A/AS level            Degree or diploma 
                        1381                         3242 
                       Other                         <NA> 
                         341                           36

levels(df$edu) # the NA is not a level of the factor

[1] "None"                         "O level/GCSE"                
[3] "Apprenticeship or A/AS level" "Degree or diploma"           
[5] "Other"

# we need to set NA as a level of the factor to manipulate it. 
# The function fct_na_value_to_level() can do this.
df <- df %>% 
  mutate(edu = fct_na_value_to_level(edu)) # set the missing values as a level

# check that now NA is a level so we can manipulate it.
levels(df$edu)

[1] "None"                         "O level/GCSE"                
[3] "Apprenticeship or A/AS level" "Degree or diploma"           
[5] "Other"                        NA

# 2. Use fct_collapse() to recode the variable.
df <- df %>% 
  mutate(edu3 = fct_collapse(edu,
                             Degree = "Degree or diploma",
                             Missing = c(NA, "Other"), 
                             other_level = "No degree"
                             ))

table(df$edu, df$edu3, useNA = "always")

                              
                               Degree Missing No degree <NA>
  None                              0       0      1563    0
  O level/GCSE                      0       0      1353    0
  Apprenticeship or A/AS level      0       0      1381    0
  Degree or diploma              3242       0         0    0
  Other                             0     341         0    0
  <NA>                              0      36         0    0
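The example above relies on fct_collapse(). For completeness, here is a minimal sketch of fct_recode(), which renames individual levels one at a time. The factor answers and the new level names are hypothetical and chosen only for illustration.

```r
library(forcats)

# Hypothetical answers to a yes/no question
answers <- factor(c("Yes", "No", "Refused", "No"))

# fct_recode() takes new_name = "old_name" pairs;
# levels that are not mentioned (here "Refused") are kept unchanged
recoded <- fct_recode(answers, Victim = "Yes", `Not a victim` = "No")

levels(recoded)
table(answers, recoded)
```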

💡 Solution E1.2 b)

df <- df %>% 
  mutate(cardamag = fct_na_value_to_level(cardamag), ## set the missing values NA as a level so we can recode it
         cardamag = fct_collapse(cardamag,
                                 No = c("No", NA)))
                                     
## check the variable
table(df$cardamag, useNA = "always")


       Yes         No    Refused Don't know       <NA> 
       318       7589          1          8          0

E1.3 | Focusing on the variable personal income (`income`), what are the sociodemographic characteristics of the sample members who did not provide a substantive answer to this question compared to those who did?

a) Generate the variable income_miss that takes “Missing” if the income variable is <NA>, Refused or Don't Know and “Non-missing” otherwise. Use the functions mutate and fct_collapse to create the new variable. You can use the function fct_na_value_to_level() from the package forcats to set the missing values as a level of the factor.

💡 Solution E1.3 a)

df <- df %>% 
  mutate(income = fct_na_value_to_level(income), ## set the missing values NA as a level
         income_miss = fct_collapse(income,
                                    Missing = c("Refused", 
                                                "Don't Know", 
                                                NA),
                                    other_level = "Non-missing"))
                                     
## check the variable
table(df$income, df$income_miss, useNA = "always")

                
                 Missing Non-missing <NA>
  Under 15,000         0        1860    0
  15,000-29,999        0        1998    0
  30,000-49,999        0        1604    0
  50,000 or over       0        1453    0
  Refused            590           0    0
  Don't Know         411           0    0
  <NA>                 0           0    0

b) Produce a cross-tab of the new variable income_miss with the variables sex, ageg, edu, bornuk and imd_quint. What are the sociodemographic characteristics of the sample members who did not provide a substantive answer to the question about their personal income? Could these differences affect the results of the analysis (e.g., regression coefficients)? You can use the sjmisc::flat_table() function to produce the cross-tab with row or column percentages.

Using sjmisc::flat_table() to produce cross-tabs

The function sjmisc::flat_table() can be used to produce cross-tabs. The argument margin controls whether the table shows counts ("counts") or row ("row"), column ("col") or cell ("cell") percentages, and the argument digits sets the number of decimal places. This function is an alternative to the table() and prop.table() functions in base R.

More information about sjmisc.

# EXAMPLE: Use flat_table() to produce a cross-tab of two variables 
# from the data frame df.
# margin can be 'counts', 'row', 'col' or 'cell'.

sjmisc::flat_table(df, edu, sex, margin = "col")

                             sex  Male Female
edu                                          
None                             18.28  21.19
O level/GCSE                     14.66  19.36
Apprenticeship or A/AS level     21.41  14.13
Degree or diploma                40.94  41.32
Other                             4.71   4.00
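For comparison, the base R route with table() and prop.table() might look like the sketch below. The data frame toy is hypothetical; in prop.table(), margin = 1 gives row percentages and margin = 2 gives column percentages.

```r
# Toy data (made up) with two categorical variables
toy <- data.frame(
  sex = c("Male", "Male", "Female", "Female", "Female", "Male"),
  income_miss = c("Missing", "Non-missing", "Non-missing",
                  "Non-missing", "Missing", "Non-missing")
)

# Cross-tab of counts
tab <- table(toy$sex, toy$income_miss)

# Row percentages: each row adds up to 100
row_pct <- round(100 * prop.table(tab, margin = 1), 2)
row_pct
```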

💡 Solution E1.3 b)

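The youngest (16-24) and oldest (75+) respondents, women, those with no qualifications and those born abroad are more likely to have a missing value on income. Missingness on income is therefore related to respondents’ characteristics, i.e. the values are not missing completely at random (MCAR).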

sjmisc::flat_table(df, sex, income_miss, margin = "row")

       income_miss Missing Non-missing
sex                                   
Male                 11.35       88.65
Female               13.78       86.22

sjmisc::flat_table(df, ageg, income_miss, margin = "row")

      income_miss Missing Non-missing
ageg                                 
16-24               15.88       84.12
25-34                9.27       90.73
35-44               10.14       89.86
45-54               10.60       89.40
55-64               13.19       86.81
65-74               13.34       86.66
75+                 19.39       80.61

sjmisc::flat_table(df, edu, income_miss, margin = "row")

                             income_miss Missing Non-missing
edu                                                         
None                                       19.32       80.68
O level/GCSE                               11.46       88.54
Apprenticeship or A/AS level               10.79       89.21
Degree or diploma                           9.78       90.22
Other                                      14.66       85.34

sjmisc::flat_table(df, bornuk, income_miss, margin = "row")

            income_miss Missing Non-missing
bornuk                                     
Born abroad               16.77       83.23
Born in UK                11.73       88.27
Refused                   90.91        9.09
Don't Know                 0.00      100.00

c) Produce a box plot of the variable worryx by income_miss. Do those who did not provide information about their income have different levels of worry about suffering a crime? Could these differences affect the results of the analysis (e.g., regression coefficients)? You can use ggplot2 functions to produce the boxplot (see example below).

Using ggplot2 to produce boxplots

The ggplot2 package provides a set of functions to produce data visualisations in R. It is based on the Grammar of Graphics (Wilkinson, 2005). The basic idea is to build a plot layer by layer. The first layer is the data, the second layer is the aesthetic mapping, and the third layer is the geometric object.

The function ggplot(data, mapping) is used to create a ggplot object.
The function geom_boxplot() tells ggplot2 that you want the information displayed with a boxplot.
The functions are combined using the + operator.

ggplot2 package information is available at ggplot2.

How to interpret a boxplot?

Center: The median line inside the box shows the center of the data.
Spread: The length of the box shows the interquartile range (IQR), which is the spread of the middle 50% of the data. The total length of the plot (max - min) shows the overall range.
Skewness: The position of the median within the box and the relative lengths of the whiskers can indicate the symmetry or skew of the data.
Outliers: Any data points that fall outside the whiskers are often shown as individual points, indicating potential outliers.
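To connect these elements to actual numbers, the sketch below uses base R's boxplot.stats() on a hypothetical vector. It returns the five values a boxplot is drawn from (lower whisker, lower hinge, median, upper hinge, upper whisker) and the points flagged as outliers. The hinges are close to, but not always identical to, the quartiles returned by quantile().

```r
# Hypothetical scores with one unusually high value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

bs <- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# Length of the box (spread of the middle 50% of the data)
bs$stats[4] - bs$stats[2]

# Points beyond 1.5 box lengths from the box are flagged as outliers
bs$out   # → 30
```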

# EXAMPLE: boxplot of worryx by edu.

ggplot(df, aes(x = edu, y = worryx)) +
  geom_boxplot()

Warning: Removed 174 rows containing non-finite outside the scale range
(`stat_boxplot()`).

💡 Solution E1.3 c)

Those not providing information about their income show a similar distribution of the variable worryx compared to those who provided information.

ggplot(df, aes(x = income_miss, y = worryx)) +
  geom_boxplot()

Warning: Removed 174 rows containing non-finite outside the scale range
(`stat_boxplot()`).

E1.4 | How to deal with missing values? Alternatives available for the analyst

Dealing with missing values in survey data analysis

Listwise deletion or complete case analysis. The analyst excludes all cases with a missing value in any of the variables involved in the analysis. For example, if you are producing cross-tabs using sex, age and income, all cases that have a missing value for any of the three variables will be excluded from every cross-tab.
Pairwise deletion. The analyst excludes cases only from the computations that involve a variable in which they have a missing value. For example, in the scenario mentioned above, only the cases with missing values at sex or age would be excluded from the sex~age cross-tab, whilst only the cases with missing values at age or income would be excluded from the age~income cross-tab. As a result, each of the cross-tabs will involve a different number of cases.

Both deletion approaches can lead to biased estimates if the missing values are not missing completely at random (MCAR), i.e. if those excluded due to missingness are different from those used in the analysis.

Imputation. There is a third, more complex set of alternatives that involve statistical methods to estimate the likely response of the respondents who did not provide a valid answer. This is called imputation. There are several methods to impute missing values, such as mean imputation, regression imputation, multiple imputation, etc. You can learn more about them in van Buuren (2018).

Since our objective is to estimate a regression model, we will use a complete case analysis (listwise deletion). This means that we will exclude all cases with missing values in any of the variables involved in the analysis: worryx, sex, ageg, edu, bornuk, income, cardamag, yrdeface, delibvio and imd_quint.
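The difference between listwise and pairwise deletion can be seen by counting the usable cases in a small hypothetical data frame. The sketch uses base R's complete.cases(), which returns TRUE for rows without any NA in the selected columns.

```r
# Toy data (made up): each of the first three rows has one missing value
toy <- data.frame(
  sex    = c("M", "F", NA, "F", "M"),
  age    = c(30, NA, 45, 52, 61),
  income = c(NA, 20, 35, 15, 40)
)

# Listwise deletion: keep rows that are complete on all three variables
n_listwise <- sum(complete.cases(toy))
n_listwise     # → 2

# Pairwise deletion: each cross-tab keeps rows complete on its own variables
n_sex_age <- sum(complete.cases(toy[, c("sex", "age")]))
n_sex_age      # → 3
n_age_income <- sum(complete.cases(toy[, c("age", "income")]))
n_age_income   # → 3
```

Pairwise deletion retains more cases per table, but each table is then based on a different set of respondents.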

a) We need to set as missing (<NA>) the “Refused” and “Don’t know” levels of some variables. The function fct_na_level_to_value() can be used to convert one or more levels of a factor to <NA>.

Setting values as missing <NA> in R

There are some alternatives to set values as missing in R:

You can use R base to identify the cases that you want to set as missing and then use the assignment operator <- to set the value as missing. For example, df$income[df$income == "Refused"] <- NA will set the value “Refused” as missing in the variable income.
The function na_if() can be used to set a value as missing. For example, na_if(df$income, "Refused") will set the value “Refused” as missing in the variable income. The function na_if() can be used within mutate().
The forcats package provides the function fct_na_level_to_value() to set one or more levels of a factor as missing values. For example, fct_na_level_to_value(df$income, "Refused") will set the value “Refused” as missing in the variable income. The function can be used within mutate().
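The three alternatives can be compared side by side on a small hypothetical factor (the values are made up for illustration). One practical difference worth noticing is that the base R assignment leaves “Refused” behind as an empty level unless droplevels() is called afterwards.

```r
library(dplyr)
library(forcats)

# Hypothetical factor with a non-valid "Refused" level
income <- factor(c("Under 15,000", "Refused", "15,000-29,999", "Refused", "Under 15,000"))

# 1) Base R: assign NA to the matching cases
inc_base <- income
inc_base[inc_base == "Refused"] <- NA
levels(inc_base)                  # "Refused" is still listed as an (empty) level
inc_base <- droplevels(inc_base)  # drop the now unused level

# 2) dplyr: na_if() returns a copy with the matching value set to NA
inc_naif <- na_if(income, "Refused")

# 3) forcats: turn one or more levels into NA values
inc_fct <- fct_na_level_to_value(income, "Refused")

# All three approaches flag the same two cases as missing
c(base = sum(is.na(inc_base)), na_if = sum(is.na(inc_naif)), forcats = sum(is.na(inc_fct)))
```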

💡 Solution E1.4 a)

In the next chunk of code we set the missing values in the variables income, cardamag, yrdeface, bornuk and delibvio as <NA>.

## Set the non-valid factor levels as NA for income, cardamag, yrdeface, delibvio and bornuk
df <- df %>% 
  mutate(income = fct_na_level_to_value(income, c("Refused", "Don't Know")),
         cardamag = fct_na_level_to_value(cardamag, c("Refused", "Don't know")),
         yrdeface = fct_na_level_to_value(yrdeface, c("Refused", "Don't know")),
         delibvio = fct_na_level_to_value(delibvio, c("Refused", "Don't know")),
         bornuk = fct_na_level_to_value(bornuk, c("Refused", "Don't Know"))
         )

## check that the non-valid levels are set as NA
df %>% 
  select(income, cardamag, yrdeface, delibvio, bornuk) %>%
  sjmisc::frq()

What is your personal (and partners) gross income (income) <categorical> 
# total N=7916 valid N=6915 mean=2.38 sd=1.09

Value          |    N | Raw % | Valid % | Cum. %
------------------------------------------------
Under 15,000   | 1860 | 23.50 |   26.90 |  26.90
15,000-29,999  | 1998 | 25.24 |   28.89 |  55.79
30,000-49,999  | 1604 | 20.26 |   23.20 |  78.99
50,000 or over | 1453 | 18.36 |   21.01 | 100.00
<NA>           | 1001 | 12.65 |    <NA> |   <NA>

If vehicle tampered with or damaged (cardamag) <categorical> 
# total N=7916 valid N=7907 mean=1.96 sd=0.20

Value |    N | Raw % | Valid % | Cum. %
---------------------------------------
Yes   |  318 |  4.02 |    4.02 |   4.02
No    | 7589 | 95.87 |   95.98 | 100.00
<NA>  |    9 |  0.11 |    <NA> |   <NA>

If anything was damaged outside current residence (yrdeface) <categorical> 
# total N=7916 valid N=7909 mean=1.99 sd=0.12

Value |    N | Raw % | Valid % | Cum. %
---------------------------------------
Yes   |  115 |  1.45 |    1.45 |   1.45
No    | 7794 | 98.46 |   98.55 | 100.00
<NA>  |    7 |  0.09 |    <NA> |   <NA>

If anyone has deliberately used force/violence on adult respondent (delibvio) <categorical> 
# total N=7916 valid N=7912 mean=1.99 sd=0.12

Value |    N | Raw % | Valid % | Cum. %
---------------------------------------
Yes   |  110 |  1.39 |    1.39 |   1.39
No    | 7802 | 98.56 |   98.61 | 100.00
<NA>  |    4 |  0.05 |    <NA> |   <NA>

Person was born in the UK (bornuk) <categorical> 
# total N=7916 valid N=7904 mean=1.84 sd=0.37

Value       |    N | Raw % | Valid % | Cum. %
---------------------------------------------
Born abroad | 1264 | 15.97 |   15.99 |  15.99
Born in UK  | 6640 | 83.88 |   84.01 | 100.00
<NA>        |   12 |  0.15 |    <NA> |   <NA>

b) We will create an indicator variable that will identify the cases that are complete. The function add_any_miss() from the package naniar can be used to create this indicator. The indicator will be used to exclude the cases with missing values in the analysis. How many cases will be excluded from the analysis?

naniar package: add_any_miss()

The naniar package provides a set of functions to explore and visualise missing data in R. The package can be used to identify the patterns of missingness in the data. The function add_any_miss() can be used to create a flag variable that identifies the cases with missing values. The function bind_shadow() can be used to create an indicator variable for each variable in the data frame.

Here is an example of how to use the function add_any_miss():

# This function will produce a new column in the dataset called {label}_all or {label}_vars that will be "complete" if the case has valid levels for all the variables in the list, and "missing" otherwise.

# add_any_miss(
#   data,
#   ...,
#   label = "any_miss",
#   missing = "missing",
#   complete = "complete"
# )

df %>% 
  naniar::add_any_miss(., c(worryx, imd_quint:bornuk), label = "complete_cases")

# A tibble: 7,916 × 22
   serial worryx wburgl        wmugged wattack wraceatt carmot cardamag yrdeface
    <dbl>  <dbl> <fct>         <fct>   <fct>   <fct>    <fct>  <fct>    <fct>   
 1 423023 -0.248 Fairly worri… Not ve… Not ve… Not at … Yes    No       No      
 2 423119 -0.816 Not very wor… Not at… Not ve… Not at … No     No       No      
 3 423211 -0.534 Not very wor… Not ve… Not at… Not ver… Yes    No       No      
 4 423227  0.919 Not very wor… Fairly… Fairly… Not ver… Yes    No       No      
 5 423231 -1.38  Not at all w… Not at… Not at… Not at … Yes    No       No      
 6 423323 -0.814 Not very wor… Not ve… Not at… Not at … Yes    No       No      
 7 423619  1.15  Fairly worri… Not ve… Fairly… Fairly … Yes    No       No      
 8 423707 -1.38  Not at all w… Not at… Not at… Not at … Yes    No       No      
 9 423927 NA     Don't know    Don't … Don't … Don't k… No     No       <NA>    
10 424219  0.904 Very worried  Fairly… Fairly… Not ver… No     No       No      
# ℹ 7,906 more rows
# ℹ 13 more variables: delibvio <fct>, imd_quint <fct>, income <fct>,
#   sex <fct>, ageg <fct>, edu <fct>, bornuk <fct>, strata <dbl>, psu <dbl>,
#   weight <dbl>, edu3 <fct>, income_miss <fct>, complete_cases_vars <chr>

More information about the package is available at naniar.

💡 Solution E1.4 b)

A total of 1,142 cases (14.4% of the 7,916 respondents) will be excluded from the analysis due to missing values in any of the variables involved in the analysis.

## Complete case analysis indicator
df <- df %>% 
  naniar::add_any_miss(., c(worryx, cardamag:bornuk), label = "complete_cases") 

## Number of cases excluded from the analysis
table(df$complete_cases_vars)


complete  missing 
    6774     1142
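As a cross-check, the same kind of flag can be built with base R's complete.cases(). The sketch below uses a small hypothetical data frame rather than the CSEW data, and the column name complete_flag is only illustrative.

```r
# Hypothetical data frame with missing values in two of the three variables
toy <- data.frame(
  worryx = c(-0.2, 0.9, NA, 1.1),
  income = factor(c("Under 15,000", NA, "15,000-29,999", "50,000 or over")),
  sex    = factor(c("Male", "Female", "Female", "Male"))
)

# Variables that will enter the analysis
vars <- c("worryx", "income", "sex")

# Flag rows with valid values on every analysis variable
toy$complete_flag <- ifelse(complete.cases(toy[, vars]), "complete", "missing")

# Counts and shares of complete and incomplete cases
table(toy$complete_flag)
round(100 * prop.table(table(toy$complete_flag)), 1)
```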

2. Exploring the factors related to fear of crime

Complex sample design and weights

Survey samples are designed in different and complex ways. Some important design features are:

Clustering: Households or individuals are sometimes clustered in geographical units (e.g., postcodes or local authorities). Sampling clusters rather than individuals spread across the country decreases the costs of the survey. The first set of clusters selected are called primary sampling units (PSUs).
Stratification: The population is divided into mutually exclusive groups (strata) that are internally more homogeneous than the population as a whole. A sample is then selected from each stratum.
Weights: These are used to adjust for unequal selection probabilities and non-response, so that the sample is representative of the population.

All these features can impact the estimates (e.g., mean, proportion, regression coefficients) and their standard errors (i.e., the precision of the estimate). The weights can impact both the point estimates and the standard errors, whilst the clustering and stratification can impact the standard errors only.
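These effects can be sketched with the api example data that ships with the survey package (Californian schools, not the CSEW). The design specifications below are simplified assumptions made for illustration: the finite population correction is ignored.

```r
library(survey)
data(api)  # example data bundled with the survey package

# Clustering: same weights, with and without the cluster identifier (dnum)
d_clus   <- svydesign(ids = ~dnum, weights = ~pw, data = apiclus1)
d_noclus <- svydesign(ids = ~1,    weights = ~pw, data = apiclus1)

m_clus   <- svymean(~api00, d_clus)
m_noclus <- svymean(~api00, d_noclus)

# The point estimates agree, but the standard errors do not
c(clustered = unname(coef(m_clus)), ignored = unname(coef(m_noclus)))
c(clustered = as.numeric(SE(m_clus)), ignored = as.numeric(SE(m_noclus)))

# Weights: in a stratified sample with unequal weights,
# the weighted mean differs from the unweighted mean
d_strat <- svydesign(ids = ~1, strata = ~stype, weights = ~pw, data = apistrat)
m_strat <- svymean(~api00, d_strat)
c(weighted = unname(coef(m_strat)), unweighted = mean(apistrat$api00))
```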

As an analyst, you will need to read the survey documentation to understand the sample design and the weights used in the survey. The survey package in R provides functions to estimate statistics taking into account the complex sample design and weights.

Exercise 2. Using the `survey` package: bivariate analysis and fitting a linear regression model

The second step in the analysis is to explore the bivariate and multivariate relationships among the variables. The objective is to produce a series of mean-tabs and a linear regression model that allow us to understand the relationship between fear of crime, area deprivation and the other factors. In this exercise, you will use the survey package to estimate means and a linear regression model. The box below provides some basic information for getting started with the survey package.

Getting started with the survey package

The survey package in R provides functions to estimate statistics taking into account the complex sample design and weights. When using the survey package the first step is to create a survey object using the function svydesign(). These are some of the main arguments of the function:

id: the variable that identifies the primary sampling unit (PSU) or cluster.
data: the data frame with the survey data.
strata: the variable that identifies the strata.
weights: the variable that contains the weights.

Note this package is formula-based, so you will use ~ to specify the variables (See example below).

# EXAMPLE: Set the complex sample design using the survey package
    ## The survey documentation can provide you with the name of the variables 
    ## relevant to specify the survey design. 
    ## Note this package is formula based, so you will use "~" quite a 
    ## few times, for example, to specify the variables.
    ## Let's assume the variable indicating the:
    ##        - PSU/clustering is `cluster` (if the sample is unclustered use "~1")
    ##        - Strata is `strat`
    ##        - Weights is `wt`
    ##        - Data frame is `survey_df`

df_svy <- svydesign(id = ~cluster, 
                    data = survey_df, 
                    strata = ~strat,
                    weights = ~wt)

Then, this object will be the base to calculate all the statistics. Throughout the rest of the handout I will walk you through different functions of the package.

# EXAMPLE: Calculate the mean (and standard error) of a variable using the survey package

svymean(~var1, 
        df_svy,
        na.rm = TRUE) # remove missing values, as in mean()

More information about the package is available at survey.

E2.1 | Set the complex sample design using the `survey` package.

Look for the variables that identify the different elements of the complex sample design (i.e., clustering, stratification and weights) in the data frame and create a new survey design object df_svy using the function svydesign() (see box above).

💡 Solution E2.1

In this dataset, the names of the variables are helpful to identify them. The variable psu corresponds to the primary sampling units. The variable strata identifies the strata used to select the sample. Finally, the variable weight identifies the non-response weight.

df_svy <- svydesign(id = ~psu, 
                    data = df, 
                    strata = ~strata,
                    weights = ~weight)

df_svy

Stratified 1 - level Cluster Sampling design (with replacement)
With (3272) clusters.
svydesign(id = ~psu, data = df, strata = ~strata, weights = ~weight)

E2.2 | How is fear of crime related to the other variables? Bivariate analysis

Produce a bivariate analysis of the relationship between the dependent variable worryx and the independent variables sex, ageg, edu, bornuk, income, cardamag, yrdeface, delibvio and imd_quint. Use the function svyby() to calculate the mean and the standard error for each subgroup.

Mean-tabs with the survey package

The function svyby() can be used to produce means for the groups defined by a factor accounting for the survey sample design.

# EXAMPLE: svyby to produce means of var1 by var2
    ## svyby(~var1, ~var2, design, svymean)
    ## var1: the variable for which we want to calculate the mean
    ## var2: the variable (factor) that defines the groups
    ## design: the survey object created with svydesign()
    ## svymean: the function to calculate the mean

      tab_var1_var2 <- svyby(~var1, ~var2, 
                                    df_svy, svymean)
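A runnable version of this pattern is sketched below using the api example data bundled with the survey package (not the CSEW). The design specification is a simplified assumption for illustration. The call to confint() adds confidence intervals to the group means.

```r
library(survey)
data(api)  # example data bundled with the survey package

# Cluster sample of schools: dnum identifies the clusters and pw holds the weights
d_clus <- svydesign(ids = ~dnum, weights = ~pw, data = apiclus1)

# Mean of api00 (and its standard error) for each school type
tab_api_stype <- svyby(~api00, ~stype, d_clus, svymean)
tab_api_stype

# 95% confidence intervals for the group means
confint(tab_api_stype)
```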

Earlier we decided to use a complete case approach, so the base for this analysis will be only those cases without missing values in any of the variables involved. We computed the variable complete_cases_vars, which flags the complete cases (complete_cases_vars == "complete"). You can use this variable to subset the observations in the survey design object before producing the mean tables.

Filtering observations in a survey design object: subset()

Sometimes you will need to filter the observations in the survey object before producing the statistics. The function subset() can be used to filter the observations in the survey object.

# EXAMPLE: subset cases IF var1 == "Yes" AND var2 == "Married"
    ## subset(design, subset)

  df_svy_sub <- subset(df_svy, var1 == "Yes" & var2 == "Married")

# EXAMPLE: subset() can also be nested inside another function call
  svymean(~var3, subset(df_svy, var1 == "Yes" & var2 == "Married"))
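The sketch below applies subset() to a design object built from the api example data bundled with the survey package (not the CSEW), under a simplified design specification. It also checks that the mean for the subset matches the corresponding row of a svyby() table.

```r
library(survey)
data(api)  # example data bundled with the survey package

d_clus <- svydesign(ids = ~dnum, weights = ~pw, data = apiclus1)

# Filter the design object (not the data frame) to elementary schools
d_elem <- subset(d_clus, stype == "E")
m_elem <- svymean(~api00, d_elem)
m_elem

# The same mean appears in the row for "E" of the svyby() table
tab <- svyby(~api00, ~stype, d_clus, svymean)
tab[tab$stype == "E", ]
```

Filtering the design object, rather than the data frame before calling svydesign(), keeps the design information attached to the subset.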

💡 Solution E2.2

Citizens living in more deprived areas, those who have suffered a crime in the last 12 months, those on lower income, females, those born abroad and with lower levels of education perceive a higher level of insecurity.

# For each independent variable (categorical/ordinal) 
#   produce a means table with the dependent variable worryx 
#   using only the subset of complete cases (complete_cases_vars == "complete") 

# e.g. for imd_quint, 
# tab_imd_worryx <- svyby(~worryx, ~imd_quint, subset(df_svy, complete_cases_vars == "complete"), svymean)


## a substantially less verbose way to produce these tables and stats

    ## extract colnames of the indep. vars
    colnam <- df %>% 
            select(cardamag:bornuk) %>%
            colnames()

    ## loop over the colnames to produce a means table for each variable
    ## function(x): [1] builds the grouping formula from the column name
    ##              [2] returns the svyby table of means and standard errors
    lapply(colnam, function(x) { 
      svyby(~worryx, as.formula(paste0("~", x)), 
            subset(df_svy, complete_cases_vars == "complete"), svymean)
    })

[[1]]
    cardamag       worryx         se
Yes      Yes  0.344382189 0.07919318
No        No -0.008290981 0.01452959

[[2]]
    yrdeface     worryx        se
Yes      Yes 0.44180067 0.1255776
No        No 0.00124792 0.0145256

[[3]]
    delibvio      worryx         se
Yes      Yes 0.258249484 0.12859454
No        No 0.004258977 0.01458716

[[4]]
                                        imd_quint      worryx         se
Q1 (Most deprived areas) Q1 (Most deprived areas)  0.21690710 0.03698742
Q2                                             Q2  0.07364203 0.03542777
Q3                                             Q3 -0.05843809 0.02877482
Q4                                             Q4 -0.07593683 0.02996434
Q5 (Less deprived areas) Q5 (Less deprived areas) -0.09655166 0.02836542

[[5]]
                       income      worryx         se
Under 15,000     Under 15,000  0.11875058 0.03066763
15,000-29,999   15,000-29,999  0.07614139 0.02967978
30,000-49,999   30,000-49,999 -0.03556548 0.02619288
50,000 or over 50,000 or over -0.14914114 0.02632088

[[6]]
          sex     worryx         se
Male     Male -0.2028886 0.01958098
Female Female  0.2109731 0.02016750

[[7]]
       ageg       worryx         se
16-24 16-24 -0.015883071 0.05736638
25-34 25-34  0.051714691 0.03245941
35-44 35-44  0.067194770 0.03456990
45-54 45-54  0.030743556 0.03209527
55-64 55-64  0.005021828 0.03434807
65-74 65-74 -0.043102679 0.03133250
75+     75+ -0.140299431 0.03831185

[[8]]
                                                      edu      worryx
None                                                 None  0.11077648
O level/GCSE                                 O level/GCSE  0.07338347
Apprenticeship or A/AS level Apprenticeship or A/AS level -0.06038539
Degree or diploma                       Degree or diploma -0.03665160
Other                                               Other  0.12527457
                                     se
None                         0.03921926
O level/GCSE                 0.03536066
Apprenticeship or A/AS level 0.03021677
Degree or diploma            0.02103345
Other                        0.08246562

[[9]]
                 bornuk      worryx         se
Born abroad Born abroad  0.40835812 0.04309572
Born in UK   Born in UK -0.07756756 0.01452327

E2.3 | Which characteristics are related to higher levels of fear of crime?

Linear regression with the survey package: svyglm()

The function svyglm() can be used to fit a linear regression model (and generalised linear models more broadly) accounting for the survey sample design and weights. By default it uses the Gaussian family, which corresponds to a linear regression. The function has several arguments to control the output of the model. The function can be wrapped in summary() to produce a summary of the model that includes the coefficients and their standard errors.

# EXAMPLE: 
    ## svyglm(formula, design)
    ## formula: y ~ var1 + var2 (dependent variable on the left-hand side)
    ## design: the survey object created with svydesign()

  model <- svyglm(y ~ var1 + var2, 
                  design = df_svy)
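A runnable sketch of a design-based linear regression is shown below, again using the api example data bundled with the survey package rather than the CSEW. The design specification and the choice of predictors (ell and meals) are assumptions made only for illustration. Note that the formula has the dependent variable on the left-hand side of ~.

```r
library(survey)
data(api)  # example data bundled with the survey package

d_clus <- svydesign(ids = ~dnum, weights = ~pw, data = apiclus1)

# Linear regression (Gaussian family by default) of api00 on two predictors
model_api <- svyglm(api00 ~ ell + meals, design = d_clus)

summary(model_api)   # coefficients with design-based standard errors
confint(model_api)   # confidence intervals for the coefficients
```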

💡 Solution E2.3

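People living in more deprived areas exhibit higher levels of fear of crime, net of the other characteristics in the model. This is also true for those who suffered a crime in the last 12 months, although the coefficient for having experienced violence (delibvio) is only significant at the 10% level.

The other groups that have higher levels of fear of crime are those on lower incomes, females, middle-aged people and those born abroad.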

model_csd <- svyglm(worryx ~ imd_quint + cardamag + yrdeface + delibvio + income + sex + ageg + edu + bornuk, 
                    design = df_svy)

summary(model_csd)


Call:
svyglm(formula = worryx ~ imd_quint + cardamag + yrdeface + delibvio + 
    income + sex + ageg + edu + bornuk, design = df_svy)

Survey design:
svydesign(id = ~psu, data = df, strata = ~strata, weights = ~weight)

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        1.179795   0.194157   6.077 1.40e-09 ***
imd_quintQ2                       -0.082916   0.048644  -1.705 0.088391 .  
imd_quintQ3                       -0.176367   0.045892  -3.843 0.000124 ***
imd_quintQ4                       -0.156214   0.047266  -3.305 0.000962 ***
imd_quintQ5 (Less deprived areas) -0.137685   0.047514  -2.898 0.003788 ** 
cardamagNo                        -0.311400   0.075513  -4.124 3.84e-05 ***
yrdefaceNo                        -0.354290   0.108459  -3.267 0.001102 ** 
delibvioNo                        -0.215923   0.113941  -1.895 0.058190 .  
income15,000-29,999               -0.022808   0.043342  -0.526 0.598769    
income30,000-49,999               -0.120307   0.042197  -2.851 0.004389 ** 
income50,000 or over              -0.181400   0.046196  -3.927 8.82e-05 ***
sexFemale                          0.413343   0.027298  15.142  < 2e-16 ***
ageg25-34                          0.115748   0.068771   1.683 0.092469 .  
ageg35-44                          0.141705   0.071344   1.986 0.047105 *  
ageg45-54                          0.170911   0.068327   2.501 0.012428 *  
ageg55-64                          0.152820   0.069337   2.204 0.027604 *  
ageg65-74                          0.082057   0.066458   1.235 0.217038    
ageg75+                           -0.062922   0.070464  -0.893 0.371950    
eduO level/GCSE                    0.007468   0.052357   0.143 0.886592    
eduApprenticeship or A/AS level   -0.055792   0.051270  -1.088 0.276605    
eduDegree or diploma              -0.089645   0.048580  -1.845 0.065100 .  
eduOther                          -0.009974   0.088069  -0.113 0.909838    
bornukBorn in UK                  -0.469621   0.045851 -10.242  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.8574211)

Number of Fisher Scoring iterations: 2

E2.4 | How does the complex sample design of the survey affect your coefficient estimates compared to a linear regression model that ignores it?

In this final question you are asked to reproduce the model above but using a new survey object df_nosvy that does not take into account the complex sample design. Compare the coefficient estimates of the model with and without the complex sample design.

💡 Solution E2.4

df_nosvy <- svydesign(id = ~1, 
                    data = df)

Warning in svydesign.default(id = ~1, data = df): No weights or probabilities
supplied, assuming equal probability

model_nocsd <- svyglm(worryx ~ imd_quint + cardamag + yrdeface + delibvio + income + sex + ageg + edu + bornuk, 
                     design = df_nosvy)

modelsummary::modelsummary(list(CSD = model_csd, NO_CSD = model_nocsd), stars = TRUE)

Warning in logLik.svyglm(x): svyglm not fitted by maximum likelihood.

Warning in logLik.svyglm(x): svyglm not fitted by maximum likelihood.

                                  |        CSD |     NO_CSD
-----------------------------------------------------------
(Intercept)                       |   1.180*** |   0.981***
                                  |    (0.194) |    (0.161)
imd_quintQ2                       |    -0.083+ |   -0.122**
                                  |    (0.049) |    (0.039)
imd_quintQ3                       |  -0.176*** |  -0.197***
                                  |    (0.046) |    (0.039)
imd_quintQ4                       |  -0.156*** |  -0.168***
                                  |    (0.047) |    (0.039)
imd_quintQ5 (Less deprived areas) |   -0.138** |  -0.145***
                                  |    (0.048) |    (0.040)
cardamagNo                        |  -0.311*** |  -0.258***
                                  |    (0.076) |    (0.059)
yrdefaceNo                        |   -0.354** |   -0.319**
                                  |    (0.108) |    (0.105)
delibvioNo                        |    -0.216+ |    -0.170+
                                  |    (0.114) |    (0.102)
income15,000-29,999               |     -0.023 |     -0.033
                                  |    (0.043) |    (0.034)
income30,000-49,999               |   -0.120** |  -0.121***
                                  |    (0.042) |    (0.037)
income50,000 or over              |  -0.181*** |  -0.192***
                                  |    (0.046) |    (0.039)
sexFemale                         |   0.413*** |   0.419***
                                  |    (0.027) |    (0.023)
ageg25-34                         |     0.116+ |   0.194***
                                  |    (0.069) |    (0.055)
ageg35-44                         |     0.142* |   0.199***
                                  |    (0.071) |    (0.055)
ageg45-54                         |     0.171* |   0.237***
                                  |    (0.068) |    (0.054)
ageg55-64                         |     0.153* |   0.206***
                                  |    (0.069) |    (0.054)
ageg65-74                         |      0.082 |     0.127*
                                  |    (0.066) |    (0.053)
ageg75+                           |     -0.063 |     -0.008
                                  |    (0.070) |    (0.058)
eduO level/GCSE                   |      0.007 |      0.014
                                  |    (0.052) |    (0.044)
eduApprenticeship or A/AS level   |     -0.056 |     -0.024
                                  |    (0.051) |    (0.043)
eduDegree or diploma              |    -0.090+ |   -0.106**
                                  |    (0.049) |    (0.039)
eduOther                          |     -0.010 |     -0.040
                                  |    (0.088) |    (0.073)
bornukBorn in UK                  |  -0.470*** |  -0.454***
                                  |    (0.046) |    (0.038)
-----------------------------------------------------------
Num.Obs.                          |       6774 |       6774
R2                                |      0.107 |      0.105
R2 Adj.                           |     -1.179 |      0.102
AIC                               |    18081.0 |    18294.4
BIC                               |    32336.0 |    18535.4
Log.Lik.                          | -16062.152 |  -9161.865
RMSE                              |       0.93 |       0.93

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 (standard errors in parentheses)
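One way to summarise such a comparison is to put the estimates and standard errors of the two models side by side and compute the ratio of the standard errors. The sketch below does this with the api example data bundled with the survey package (not the CSEW). The object names, the simplified design specification and the predictors are assumptions made for illustration.

```r
library(survey)
data(api)  # example data bundled with the survey package

# Design that uses the clusters and weights vs. a design that ignores both
d_csd   <- svydesign(ids = ~dnum, weights = ~pw, data = apiclus1)
d_nocsd <- suppressWarnings(svydesign(ids = ~1, data = apiclus1))

m_csd   <- svyglm(api00 ~ ell + meals, design = d_csd)
m_nocsd <- svyglm(api00 ~ ell + meals, design = d_nocsd)

# Side-by-side estimates and standard errors
comparison <- data.frame(
  est_csd   = coef(m_csd),
  est_nocsd = coef(m_nocsd),
  se_csd    = sqrt(diag(vcov(m_csd))),
  se_nocsd  = sqrt(diag(vcov(m_nocsd)))
)

# Ratios above 1 mean that ignoring the design understates the uncertainty
comparison$se_ratio <- comparison$se_csd / comparison$se_nocsd
round(comparison, 3)
```

The same data frame could be built from model_csd and model_nocsd to see which coefficients lose precision once the complex sample design is taken into account.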