# install.packages(c("tidyverse", "survey", "labelled", "sjmisc", "naniar"))
library(tidyverse)
library(survey)
Are area deprivation and crime victimisation related? An analysis using the Crime Survey for England and Wales
SC385 | L10. Introduction to survey methods | 2024/25
Introduction
What is the role of area deprivation in crime victimisation? There is an ongoing discussion about the role of neighbourhood characteristics in individual outcomes. Studies in the US have shown that deprived areas tend to be racially segregated, have higher crime rates, and have lower-quality public services (e.g., Sampson et al., 1997).
The objective of this lab is to examine the relationship between area deprivation and crime victimisation in England. We will explore whether citizens living in more deprived areas suffer more or less crime than those in wealthier areas. The analysis will also control for other individual characteristics, such as personal income and education.
R packages and options
Before introducing the survey we will be using, let's load the packages and set the options for the analysis. If you do not have the packages installed, you can install them by uncommenting and running the install.packages() line in the setup code at the top of this handout.
The dataset
We will use the Crime Survey for England and Wales (CSEW) 2017/18 teaching dataset for this lab. The CSEW is a large-scale, household-based survey designed to collect data on experiences of crime, perceptions of crime, and attitudes toward the criminal justice system. The survey is conducted throughout the year and collects data about experiences of crime in the last 12 months. The interviews for the 2017/18 edition were carried out face-to-face, and sensitive topics were asked about using a self-completion questionnaire. The survey is designed to be representative of the population living in private households in England and Wales.
The CSEW aims to provide insights into crime trends over time, including those not reported to the police, and public perceptions of crime and safety.
The survey covers England and Wales and includes individuals aged 16 and over, with a separate module for children aged 10-15. All household members are invited to take part in the survey.
The survey uses a stratified, multi-stage random probability sample. A sample of postcode sectors (the primary sampling units, PSUs) is selected from the Postcode Address File. Then, addresses are selected within the sampled postcode sectors.
The 2017/18 CSEW interviewed approximately 35,000 adults and 3,000 children aged 10-15.
The data can be downloaded through the UK Data Service.
First, let's read in the data, which you will find on Moodle in RDS format. The dataset contains the responses of 32,101 adults to the individual interview and 22 variables. You can use the functions head()
or glimpse()
to have a first look at the dataset.
df <- read_rds("data/cse_2017-18_teaching.RDS")
head(df)
serial | carmot | mottheft | motstole | cardamag | homethef | yrhotry | yrhostol | yrdeface | persthef | delibdam | delibvio | imd_quint | income | sex | ageg | edu | employ | bornuk | strata | psu | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
423002 | No | NA | NA | NA | NA | No | No | No | No | No | No | Q4 | 5,000-9,999 | Male | 45-54 | Degree or diploma | Economically inactive | Born in UK | 30011 | 11172700 | 0.6713125 |
423008 | Yes | No | No | No | NA | No | No | No | No | No | No | Q2 | 70,000 or over | Male | 35-44 | Degree or diploma | Employed | Born abroad | 30011 | 11172666 | 1.3990165 |
423023 | Yes | No | No | No | NA | No | No | No | No | No | No | Q5 (Less deprived areas) | 25,000-29,999 | Male | 55-64 | Apprenticeship or A/AS level | Employed | Born in UK | 30011 | 11172802 | 0.6150459 |
423106 | No | NA | NA | NA | NA | No | No | No | No | No | No | Q1 (Most deprived areas) | 20,000-24,999 | Male | 16-24 | None | Employed | Born abroad | 30011 | 11172774 | 1.7655339 |
423119 | No | NA | NA | NA | NA | No | No | No | No | No | No | Q4 | 10,000-14,999 | Male | 16-24 | Apprenticeship or A/AS level | Employed | Born in UK | 30011 | 11172418 | 1.6044675 |
423201 | Yes | No | No | No | NA | No | No | No | No | No | No | Q5 (Less deprived areas) | 30,000-34,999 | Female | 45-54 | Degree or diploma | Employed | Born in UK | 30011 | 11172054 | 1.1689697 |
Variables
The function var_label
of the package labelled
can extract the label of each variable in the data frame. The code below is a simple way to generate a data frame with two columns: the variable name and the label.
labelled::var_label(df) %>%
  as_tibble() %>% # transform the output of var_label into a tibble object
  pivot_longer(everything(), names_to = "variable_name", values_to = "variable_label") # reshape the data frame from a wide format to long format (more convenient for exploring the data)
variable_name | variable_label |
---|---|
serial | Serial number (6 digits) |
carmot | If has a car or motorcycle |
mottheft | If vehicle stolen or driven away without permission |
motstole | If something stolen off or out of vehicle |
cardamag | If vehicle tampered with or damaged |
homethef | If anyone got into current residence to steal/try to steal |
yrhotry | If anyone tried to get into current residence to steal/cause damage |
yrhostol | If anything was stolen out of current residence |
yrdeface | If anything was damaged outside current residence |
persthef | If anything was stolen out of hands, pockets or bag |
delibdam | If personal items have been deliberately damaged |
delibvio | If anyone has deliberately used force/violence on adult respondent |
imd_quint | English Index of Multiple Deprivation 2015 (quintiles) |
income | What is your personal (and partners) gross income |
sex | Adult number 1 (respondent): Sex |
ageg | Age group (7 bands) |
edu | Respondent education (5 categories) |
employ | Respondent employment status |
bornuk | Person was born in the UK |
strata | Stratum (2015 definition) |
psu | PSU (2015 definition) |
weight | Individual level weight (mean=1) |
Analysis objective
The objective of this lab is to estimate the relationship between area deprivation and crime victimisation controlling for other demographic factors: sex, age, education, employment status, whether the person was born in the UK and personal income.
To measure crime victimisation, we will create a new variable that identifies whether the respondent has been a victim of at least one type of crime in the last 12 months.
To measure the deprivation of the area where the respondent lives, we will use the Index of Multiple Deprivation (IMD), an index that ranks neighbourhoods in England from the most deprived to the least deprived. The IMD is a composite measure that includes income, employment, health, education, crime, and living environment.
The Index of Multiple Deprivation (IMD) measures relative deprivation in small areas across each of the countries of the United Kingdom.
Areas are ranked from the most deprived area (rank 1) to the least deprived area.
Each country measures deprivation in a slightly different way but the broad themes include income, employment, education, health, crime, barriers to housing and services, and the living environment.
The statistical unit areas used to provide indices of relative deprivation across the country are Lower layer Super Output Areas (LSOAs) [England, Wales], Data Zones [Scotland] and Super Output Areas or Wards [Northern Ireland].
More information about the IMD.
The analysis will be conducted in two steps:
1. Preparing variables for the analysis. We will compute the dependent variable crimvict using a set of variables from the survey. Before computing the variable, we will explore the data and identify the missing values in these variables. Then, we will also look into the missingness of the independent variables. Finally, we will make decisions about how to deal with the missing values in these variables.
2. Exploring the relationship between area deprivation and crime victimisation. We will estimate the relationship between area deprivation and crime victimisation using a logistic regression model. This will be an opportunity to learn how to estimate the model using the survey package in R, which takes into account some information about the complex sample design and weights to adjust for non-response.
1. Preparing the variables for the analysis: dealing with missing values
Planned missingness. The researcher plans that some questions are not going to be presented to some respondents. For example, a long questionnaire might be split into a common set of questions that are asked to all respondents and different sets of questions that are asked to random subsets. Also, some questions might be asked only to those who meet certain criteria. For example, questions about the experience of using public transport might be asked only to those who use public transport.
Unplanned missingness. Sometimes a respondent refuses to answer a question or, in self-administered mode, skips the questions. This can also happen by mistake. The result is that we do not have information about this question for this respondent.
Non-substantive responses. Some responses might be treated as substantive or as missing depending on the research question and the objectives of the analysis. For example, a respondent might answer “Don’t know” to a question about their income. This might be considered a substantive response if the objective is to understand respondents’ knowledge of their own income. However, if the objective is to estimate the average income of the population, this response might be treated as missing.
For a comprehensive guide on missing data in surveys, see Reid and Allum (2019).
Exercise 1. Identifying and dealing with missing values: the case of the CSEW
The first step in the analysis is to explore the data and identify the missing values. Before starting the analysis, the analyst should have a clear answer to the following questions: What types of missing values are in the data, and how many? How can they affect my analysis? Which decisions, if any, do I need to make before proceeding with the analysis? This exercise will guide you through the process of identifying the missing values in the dataset.
Q1.1 | What percentage of the sample members have been victims of a crime in the last 12 months? Computing the dependent variable
a) Produce frequency tables of the variables that will be used to compute the dependent variable `crimvict`. The variables mottheft:delibvio
contain the information about whether the respondent has been a victim of different crimes. You can use the function select
to select these variables and then use the function sjmisc::frq()
to produce the frequency tables. The frequency tables will allow us to explore the variables and identify possible missing values. What proportion of missing values do you observe for each variable?
R has some helper functions to select multiple variables at once. The colon operator :
can be used to select a range of variables. For example, var1:var3
will select all variables from var1
to var3
. The c()
function can be used to select multiple variables. For example, c(var1, var2, var3)
will select the variables var1
, var2
, and var3
.
If you use tidyverse (i.e., dplyr package) for data management some additional helpers are available:
starts_with("prefix")
selects all variables that start with “prefix”.ends_with("suffix")
selects all variables that end with “suffix”.contains("pattern")
selects all variables that contain “pattern”.everything()
selects all variables.
More information (and helper functions) are available at tidy select.
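For instance, here is a small sketch (using the victimisation columns shown above; none of these calls is required for the exercise) of different ways of selecting variables with select():

# EXAMPLE: different ways of selecting the victimisation variables
df %>% select(mottheft:delibvio)                # a range, using the colon operator
df %>% select(c(mottheft, motstole, cardamag))  # an explicit set with c()
df %>% select(starts_with("mot"))               # tidyselect helper: mottheft and motstole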
💡 Solution
As you can see below, the variables have a small number (<1%) of user missing values (i.e., “Refused” or “Don’t Know”). However, the variables mottheft
, motstole
, cardamag
and homethef present a high number of system missing values (<NA>). The next step is to determine what is causing these missing values.
df %>%
  select(mottheft:delibvio) %>%
  sjmisc::frq()
If vehicle stolen or driven away without permission (mottheft) <categorical>
# total N=32101 valid N=25812 mean=2.00 sd=0.07
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 128 | 0.40 | 0.50 | 0.50
No | 25680 | 80.00 | 99.49 | 99.98
Refused | 1 | 0.00 | 0.00 | 99.99
Don't know | 3 | 0.01 | 0.01 | 100.00
<NA> | 6289 | 19.59 | <NA> | <NA>
If something stolen off or out of vehicle (motstole) <categorical>
# total N=32101 valid N=25812 mean=1.97 sd=0.17
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 784 | 2.44 | 3.04 | 3.04
No | 25020 | 77.94 | 96.93 | 99.97
Refused | 1 | 0.00 | 0.00 | 99.97
Don't know | 7 | 0.02 | 0.03 | 100.00
<NA> | 6289 | 19.59 | <NA> | <NA>
If vehicle tampered with or damaged (cardamag) <categorical>
# total N=32101 valid N=25812 mean=1.96 sd=0.23
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 1222 | 3.81 | 4.73 | 4.73
No | 24553 | 76.49 | 95.12 | 99.86
Refused | 1 | 0.00 | 0.00 | 99.86
Don't know | 36 | 0.11 | 0.14 | 100.00
<NA> | 6289 | 19.59 | <NA> | <NA>
If anyone got into current residence to steal/try to steal (homethef) <categorical>
# total N=32101 valid N=2897 mean=2.00 sd=0.10
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 13 | 0.04 | 0.45 | 0.45
No | 2880 | 8.97 | 99.41 | 99.86
Refused | 0 | 0.00 | 0.00 | 99.86
Don't know | 4 | 0.01 | 0.14 | 100.00
<NA> | 29204 | 90.98 | <NA> | <NA>
If anyone tried to get into current residence to steal/cause damage (yrhotry) <categorical>
# total N=32101 valid N=32101 mean=1.99 sd=0.12
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 309 | 0.96 | 0.96 | 0.96
No | 31758 | 98.93 | 98.93 | 99.89
Refused | 5 | 0.02 | 0.02 | 99.91
Don't know | 29 | 0.09 | 0.09 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
If anything was stolen out of current residence (yrhostol) <categorical>
# total N=32101 valid N=32101 mean=2.00 sd=0.06
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 85 | 0.26 | 0.26 | 0.26
No | 32007 | 99.71 | 99.71 | 99.97
Refused | 5 | 0.02 | 0.02 | 99.99
Don't know | 4 | 0.01 | 0.01 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
If anything was damaged outside current residence (yrdeface) <categorical>
# total N=32101 valid N=32101 mean=1.99 sd=0.12
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 437 | 1.36 | 1.36 | 1.36
No | 31644 | 98.58 | 98.58 | 99.94
Refused | 5 | 0.02 | 0.02 | 99.95
Don't know | 15 | 0.05 | 0.05 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
If anything was stolen out of hands, pockets or bag (persthef) <categorical>
# total N=32101 valid N=32101 mean=1.99 sd=0.10
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 304 | 0.95 | 0.95 | 0.95
No | 31784 | 99.01 | 99.01 | 99.96
Refused | 5 | 0.02 | 0.02 | 99.98
Don't know | 8 | 0.02 | 0.02 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
If personal items have been deliberately damaged (delibdam) <categorical>
# total N=32101 valid N=32101 mean=2.00 sd=0.07
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 120 | 0.37 | 0.37 | 0.37
No | 31970 | 99.59 | 99.59 | 99.97
Refused | 5 | 0.02 | 0.02 | 99.98
Don't know | 6 | 0.02 | 0.02 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
If anyone has deliberately used force/violence on adult respondent (delibvio) <categorical>
# total N=32101 valid N=32101 mean=1.99 sd=0.12
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 446 | 1.39 | 1.39 | 1.39
No | 31640 | 98.56 | 98.56 | 99.95
Refused | 11 | 0.03 | 0.03 | 99.99
Don't know | 4 | 0.01 | 0.01 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
b) The variables mottheft
, motstole
, cardamag
are about the respondents’ cars or motorcycles. It is time to look at the data frame (or questionnaire) to see whether these variables are planned or unplanned missingness. Identify the variable that measures whether the respondent has a car or motorcycle and produce cross-tabs using the function table()
. Are the <NA>
of these variables planned or unplanned missingness? How would you treat these missing values when computing the variable crimvict
?
💡 Solution
The cross-tabs below show that the variables mottheft
, motstole
and cardamag
are only asked to those who own a car or motorcycle. Note that those who do not have a car or motorcycle (carmot == "No") or refused to answer that question are coded as missing at mottheft
, motstole
and cardamag
. For the purpose of our analysis we can ignore the missing values in these variables (<NA>
).
table(df$carmot, df$mottheft, useNA = "always")
Yes No Refused Don't know <NA>
Yes 128 25680 1 3 0
No 0 0 0 0 6282
Refused 0 0 0 0 7
<NA> 0 0 0 0 0
table(df$carmot, df$motstole, useNA = "always")
Yes No Refused Don't know <NA>
Yes 784 25020 1 7 0
No 0 0 0 0 6282
Refused 0 0 0 0 7
<NA> 0 0 0 0 0
table(df$carmot, df$cardamag, useNA = "always")
Yes No Refused Don't know <NA>
Yes 1222 24553 1 36 0
No 0 0 0 0 6282
Refused 0 0 0 0 7
<NA> 0 0 0 0 0
c) We now know that some respondents refused or did not know the answer to some of the questions. In addition, some questions were only asked to the subset of respondents with a certain characteristic (e.g., having a car or motorbike). It is now time to create a dummy variable that identifies those who have suffered at least one type of crime in the last 12 months.
Compute this new variable called crimvict
, a factor with two levels: “Victim” and “Not a victim”. Use the variables mottheft:delibvio
to generate the new variable. You can use the function mutate
to create the new variable and the function case_when
to define the conditions. The function fct_relevel
can be used to reorder the levels of the factor.
💡 Solution
df <- df %>%
  mutate(crimvict = fct_relevel(
    as_factor(
      case_when(mottheft == "Yes" |
                  motstole == "Yes" |
                  cardamag == "Yes" |
                  homethef == "Yes" |
                  yrhotry == "Yes" |
                  yrhostol == "Yes" |
                  yrdeface == "Yes" |
                  persthef == "Yes" |
                  delibdam == "Yes" |
                  delibvio == "Yes" ~ "Victim",
                TRUE ~ "Not a victim")), sort))
# Check the new variable: those who responded "Yes" for each of the individual items should be "Victim" in the new variable crimvict
table(df$crimvict, df$mottheft, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 22838 1 1 5866
Victim 128 2842 0 2 423
<NA> 0 0 0 0 0
table(df$crimvict, df$motstole, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 22833 1 6 5866
Victim 784 2187 0 1 423
<NA> 0 0 0 0 0
table(df$crimvict, df$cardamag, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 22810 1 29 5866
Victim 1222 1743 0 7 423
<NA> 0 0 0 0 0
table(df$crimvict, df$homethef, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 2563 0 4 26139
Victim 13 317 0 0 3065
<NA> 0 0 0 0 0
table(df$crimvict, df$yrhotry, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 28680 5 21 0
Victim 309 3078 0 8 0
<NA> 0 0 0 0 0
table(df$crimvict, df$yrhostol, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 28699 5 2 0
Victim 85 3308 0 2 0
<NA> 0 0 0 0 0
table(df$crimvict, df$yrdeface, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 28688 5 13 0
Victim 437 2956 0 2 0
<NA> 0 0 0 0 0
table(df$crimvict, df$persthef, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 28694 5 7 0
Victim 304 3090 0 1 0
<NA> 0 0 0 0 0
table(df$crimvict, df$delibdam, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 28696 5 5 0
Victim 120 3274 0 1 0
<NA> 0 0 0 0 0
## A less verbose alternative to compute the variable
# df <- df %>%
# rowwise() %>%
# mutate(victcount = sum(c_across(mottheft:delibvio) == "Yes", na.rm = T),
# crimvict = fct_relevel(
# as_factor(
# if_else(victcount > 0, "Victim", "Not a victim"))
# ))
#
#
# table(df$victcount, df$crimvict)
d) Produce a frequency table to answer Q1.1: What percentage of the sample members have been victims of a crime in the last 12 months?
From this frequency table, can we infer that 10.6% of people living in England were victims of a crime in 2017/18?
💡 Solution
sjmisc::frq(df$crimvict)
x <categorical>
# total N=32101 valid N=32101 mean=1.11 sd=0.31
Value | N | Raw % | Valid % | Cum. %
-----------------------------------------------
Not a victim | 28706 | 89.42 | 89.42 | 89.42
Victim | 3395 | 10.58 | 10.58 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
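As a side note, the 10.58% above is the raw (unweighted) sample percentage, so it should not be read directly as a population estimate. A minimal sketch of a weighted version is shown below; the proper design-based approach with the survey package is introduced in part 2.

# Optional: compare the raw percentage with a weighted one using the individual weight
weighted.mean(df$crimvict == "Victim", w = df$weight) * 100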
Q1.2 | What is the proportion of missing values in the independent variables?
Produce frequency tables to examine the independent variables: sex
, ageg
, edu
, employ
, bornuk
, income
and imd_quint
. What proportion of missing values do you observe for each variable? Which variables would need some preparation before fitting the logistic regression model?
💡 Solution
Some variables such as sex
, ageg
and imd_quint
have no missing values. The variables edu
, employ
and bornuk
have a small number of missing values.
The variable income
has a higher number of missing values. Some people did not know or refused to answer the question. The variable income
is a key variable in our analysis, so we need to decide how to deal with the missing values in this variable.
The frequency tables are also useful to identify the levels (or categories) that need to be set as missing for the analysis. For example, the variable bornuk has the categories "Refused" and "Don't Know", which should be set as missing.
df %>%
  select(imd_quint:bornuk) %>%
  sjmisc::frq()
English Index of Multiple Deprivation 2015 (quintiles) (imd_quint) <categorical>
# total N=32101 valid N=32101 mean=3.04 sd=1.40
Value | N | Raw % | Valid % | Cum. %
----------------------------------------------------------
Q1 (Most deprived areas) | 6066 | 18.90 | 18.90 | 18.90
Q2 | 6282 | 19.57 | 19.57 | 38.47
Q3 | 6697 | 20.86 | 20.86 | 59.33
Q4 | 6549 | 20.40 | 20.40 | 79.73
Q5 (Less deprived areas) | 6507 | 20.27 | 20.27 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
What is your personal (and partners) gross income (income) <categorical>
# total N=32101 valid N=31944 mean=7.47 sd=4.34
Value | N | Raw % | Valid % | Cum. %
------------------------------------------------
Under 5,000 | 1042 | 3.25 | 3.26 | 3.26
5,000-9,999 | 2938 | 9.15 | 9.20 | 12.46
10,000-14,999 | 3606 | 11.23 | 11.29 | 23.75
15,000-19,999 | 3177 | 9.90 | 9.95 | 33.69
20,000-24,999 | 2763 | 8.61 | 8.65 | 42.34
25,000-29,999 | 2336 | 7.28 | 7.31 | 49.66
30,000-34,999 | 1880 | 5.86 | 5.89 | 55.54
35,000-39,999 | 1772 | 5.52 | 5.55 | 61.09
40,000-44,999 | 1368 | 4.26 | 4.28 | 65.37
45,000-49,999 | 1345 | 4.19 | 4.21 | 69.58
50,000-59,999 | 1539 | 4.79 | 4.82 | 74.40
60,000-69,999 | 1201 | 3.74 | 3.76 | 78.16
70,000 or over | 3131 | 9.75 | 9.80 | 87.96
Refused | 2375 | 7.40 | 7.43 | 95.40
Don't Know | 1471 | 4.58 | 4.60 | 100.00
<NA> | 157 | 0.49 | <NA> | <NA>
Adult number 1 (respondent): Sex (sex) <categorical>
# total N=32101 valid N=32101 mean=1.53 sd=0.50
Value | N | Raw % | Valid % | Cum. %
-----------------------------------------
Male | 14971 | 46.64 | 46.64 | 46.64
Female | 17130 | 53.36 | 53.36 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
Age group (7 bands) (ageg) <categorical>
# total N=32101 valid N=32101 mean=4.20 sd=1.81
Value | N | Raw % | Valid % | Cum. %
---------------------------------------
16-24 | 2144 | 6.68 | 6.68 | 6.68
25-34 | 4805 | 14.97 | 14.97 | 21.65
35-44 | 5213 | 16.24 | 16.24 | 37.89
45-54 | 5508 | 17.16 | 17.16 | 55.05
55-64 | 5237 | 16.31 | 16.31 | 71.36
65-74 | 5123 | 15.96 | 15.96 | 87.32
75+ | 4071 | 12.68 | 12.68 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
Respondent education (5 categories) (edu) <categorical>
# total N=32101 valid N=31977 mean=2.91 sd=1.24
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------------------------
None | 6379 | 19.87 | 19.95 | 19.95
O level/GCSE | 5713 | 17.80 | 17.87 | 37.81
Apprenticeship or A/AS level | 5584 | 17.40 | 17.46 | 55.28
Degree or diploma | 12948 | 40.34 | 40.49 | 95.77
Other | 1353 | 4.21 | 4.23 | 100.00
<NA> | 124 | 0.39 | <NA> | <NA>
Respondent employment status (employ) <categorical>
# total N=32101 valid N=32067 mean=1.82 sd=0.97
Value | N | Raw % | Valid % | Cum. %
--------------------------------------------------------
Employed | 18578 | 57.87 | 57.93 | 57.93
Unemployed | 605 | 1.88 | 1.89 | 59.82
Economically inactive | 12884 | 40.14 | 40.18 | 100.00
<NA> | 34 | 0.11 | <NA> | <NA>
Person was born in the UK (bornuk) <categorical>
# total N=32101 valid N=32095 mean=1.85 sd=0.36
Value | N | Raw % | Valid % | Cum. %
----------------------------------------------
Born abroad | 4960 | 15.45 | 15.45 | 15.45
Born in UK | 27096 | 84.41 | 84.42 | 99.88
Refused | 38 | 0.12 | 0.12 | 100.00
Don't Know | 1 | 0.00 | 0.00 | 100.00
<NA> | 6 | 0.02 | <NA> | <NA>
Q1.3 | Focusing on the variable personal income (income
), what are the sociodemographic characteristics of the sample members who did not provide a substantive answer to this question, compared with those who did?
a) Generate the variable income_miss
that takes “Missing” if the income variable is <NA>
, Refused
or Don't know
and “Non-missing” otherwise. Use the function mutate
and fct_collapse
to create the new variable. One issue with this recode is how to refer to the cases that are <NA>, since NA is not itself a level of the factor. The function fct_na_value_to_level
can be used to set the missing values as a level of the factor. See the example below to learn how it works.
forcats
package
The forcats
package provides a set of functions to work with factors in R.
- fct_recode() can be used to recode the levels of a factor.
- fct_collapse() can be used to collapse the levels of a factor into new levels.
Sometimes we need to manipulate NA
values as a level of the factor. The function fct_na_value_to_level()
can be used to set the missing values as a level of the factor (see example below).
The forcats
package is part of tidyverse and all its functions start with fct_
. The forcats functions used to transform variables can be called within mutate()
. More information is available at forcats.
# EXAMPLE: recode the variable education (edu) to a factor with three levels:
# "Degree", "No degree" and "Missing". The misisng category shoul include those
# with NA and those responding "Other" to the education question. This is easier
# to do with fct_collapse() function.
# 1. Set the missing values as a level of the factor
table(df$edu, useNA = "always")
None O level/GCSE
6379 5713
Apprenticeship or A/AS level Degree or diploma
5584 12948
Other <NA>
1353 124
levels(df$edu) # the NA is not a level of the factor
[1] "None" "O level/GCSE"
[3] "Apprenticeship or A/AS level" "Degree or diploma"
[5] "Other"
# we need to set NA as a level of the factor to manipulate it.
# The function fct_na_value_to_level() can do this.
df <- df %>%
  mutate(edu = fct_na_value_to_level(edu)) # set the missing values as a level
# check that now NA is a level so we can manipulate it.
levels(df$edu)
[1] "None" "O level/GCSE"
[3] "Apprenticeship or A/AS level" "Degree or diploma"
[5] "Other" NA
# 2. Use fct_collapse() to recode the variable.
df <- df %>%
  mutate(edu3 = fct_collapse(edu,
                             Degree = "Degree or diploma",
                             Missing = c(NA, "Other"),
                             other_level = "No degree"
                             ))
table(df$edu, df$edu3, useNA = "always")
Degree Missing No degree <NA>
None 0 0 6379 0
O level/GCSE 0 0 5713 0
Apprenticeship or A/AS level 0 0 5584 0
Degree or diploma 12948 0 0 0
Other 0 1353 0 0
<NA> 0 124 0 0
💡 Solution
df <- df %>%
  mutate(income = fct_na_value_to_level(income), ## set the missing values NA as a level
         income_miss = fct_collapse(income,
                                    Missing = c("Refused",
                                                "Don't Know",
                                                NA),
                                    other_level = "Non-missing"))
## check the variable
table(df$income, df$income_miss, useNA = "always")
Missing Non-missing <NA>
Under 5,000 0 1042 0
5,000-9,999 0 2938 0
10,000-14,999 0 3606 0
15,000-19,999 0 3177 0
20,000-24,999 0 2763 0
25,000-29,999 0 2336 0
30,000-34,999 0 1880 0
35,000-39,999 0 1772 0
40,000-44,999 0 1368 0
45,000-49,999 0 1345 0
50,000-59,999 0 1539 0
60,000-69,999 0 1201 0
70,000 or over 0 3131 0
Refused 2375 0 0
Don't Know 1471 0 0
<NA> 157 0 0
b) Produce a cross-tab of the new variable income_miss
with the variables crimvict
, sex
, ageg
, edu
, employ
, bornuk
and imd_quint
. What are the sociodemographic characteristics of the sample members who did not provide a substantive answer to the question about their personal income? Could these differences affect the results of the analysis (e.g., regression coefficients)? You can use the sjmisc::flat_table()
function to produce the cross-tab with row or column percentages.
sjmisc::flat_table()
to produce cross-tabs
The function sjmisc::flat_table()
can be used to produce cross-tabs. The function has several arguments to control the output of the table. The argument margin
can be used to produce row or column percentages. The argument show_na
can be used to show the missing values in the table. The argument show_total
can be used to show the total of the table. This function is an alternative to table()
and prop.table()
functions in base R.
More information about sjmisc.
# EXAMPLE: Use flat_table() to produce a cross-tab of two variables
# from the data frame df.
# margin can be 'counts', 'row', 'col' or 'cell'.
sjmisc::flat_table(df, edu, sex, margin = "col")
sex Male Female
edu
None 18.27 21.42
O level/GCSE 15.57 19.87
Apprenticeship or A/AS level 21.47 13.96
Degree or diploma 40.31 40.65
Other 4.38 4.10
💡 Solution
Those not providing information about their income are more likely to be older, to have a lower level of education, to be economically inactive, and to have been born outside the UK. In other words, income is not missing completely at random.
sjmisc::flat_table(df, crimvict, income_miss, margin = "row")
income_miss Missing Non-missing
crimvict
Not a victim 12.70 87.30
Victim 10.52 89.48
sjmisc::flat_table(df, sex, income_miss, margin = "row")
income_miss Missing Non-missing
sex
Male 11.27 88.73
Female 13.52 86.48
sjmisc::flat_table(df, ageg, income_miss, margin = "row")
income_miss Missing Non-missing
ageg
16-24 14.88 85.12
25-34 9.26 90.74
35-44 9.15 90.85
45-54 10.28 89.72
55-64 12.41 87.59
65-74 13.66 86.34
75+ 20.78 79.22
sjmisc::flat_table(df, edu, income_miss, margin = "row")
income_miss Missing Non-missing
edu
None 19.34 80.66
O level/GCSE 11.04 88.96
Apprenticeship or A/AS level 11.10 88.90
Degree or diploma 9.49 90.51
Other 14.49 85.51
sjmisc::flat_table(df, employ, income_miss, margin = "row")
income_miss Missing Non-missing
employ
Employed 8.90 91.10
Unemployed 10.91 89.09
Economically inactive 17.47 82.53
sjmisc::flat_table(df, bornuk, income_miss, margin = "row")
income_miss Missing Non-missing
bornuk
Born abroad 16.49 83.51
Born in UK 11.61 88.39
Refused 86.84 13.16
Don't Know 0.00 100.00
Q1.4 | How to deal with missing values? Alternatives available for the analyst
- Listwise deletion or complete case analysis. The analyst can decide to exclude all cases with missing values in any of the variables involved in the analysis. For example, if you are producing cross-tabs using sex, age and income, all cases that have a missing value for any of the three variables will be excluded from the analysis (see the sketch after this list).
- Pairwise deletion. The analyst can decide to exclude cases with missing values only in the variables involved in each specific analysis. In the scenario above, only the cases with missing values at sex or age would be excluded from the sex~age cross-tab, whilst only the cases with missing values at age or income would be excluded from the age~income cross-tab. As a result, each of the cross-tabs will involve a different number of cases. Both deletion approaches can lead to biased estimates if the missing values are not missing completely at random (MCAR), i.e. if those excluded due to missingness are different from those retained in the analysis.
- Imputation. There is a third, more complex set of alternatives that use statistical methods to estimate the likely response of the respondents who did not provide a valid answer. This is called imputation. There are several methods to impute missing values, such as mean imputation, regression imputation, multiple imputation, etc. You can learn more about them in van Buuren (2018).
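As a quick illustration, here is a minimal sketch (using hypothetical variables x, y and z rather than columns of our dataset) of how the two deletion approaches play out in R:

# Listwise deletion: keep only the rows that are complete on ALL analysis variables
complete_df <- df %>% drop_na(x, y, z)

# Pairwise deletion: each table drops only the rows that are missing on the
# variables it uses, so the number of cases can differ from one table to the next
table(df$x, df$y)   # table() excludes NA values by default
table(df$y, df$z)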
Since our objective is to estimate a model, we will use a complete case (listwise) analysis. This means that we will exclude all cases with missing values in any of the variables involved in the analysis: crimvict
, sex
, ageg
, edu
, employ
, bornuk
, income
and imd_quint.
First, we need to set to missing (<NA>) the "Refused" and "Don't know" levels of some of these variables. The function fct_na_level_to_value() can be used to convert one or more levels of a factor into missing values.
<NA>
in R
There are some alternatives to set values as missing in R:
- You can use base R to identify the cases that you want to set as missing and then use the assignment operator <- to set the value as missing. For example, df$income[df$income == "Refused"] <- NA will set the value "Refused" as missing in the variable income.
- The function na_if() can be used to set a value as missing. For example, na_if(df$income, "Refused") will set the value "Refused" as missing in the variable income. The function na_if() can be used within mutate().
- The forcats package provides the function fct_na_level_to_value() to set one or more levels of a factor as missing values. For example, fct_na_level_to_value(df$income, "Refused") will set the value "Refused" as missing in the variable income. The function can be used within mutate().
Second, we will create an indicator variable that will identify the cases that are complete. The function add_any_miss()
from the package naniar
can be used to create this indicator. The indicator will be used to exclude the cases with missing values in the analysis.
add_any_miss()
The naniar package provides a set of functions to explore and visualise missing data in R. The package can be used to identify the patterns of missingness in the data. The function add_any_miss()
can be used to create an indicator variable that identifies the cases with missing values. The function add_miss_ind()
can be used to create an indicator variable for each variable in the data frame.
Here is an example of how to use the function add_any_miss()
:
# This function will produce a new column in the dataset called {label}_all or {label}_vars that will be "complete" if the case has valid levels for all the variables in the list, and "missing" otherwise.
# add_any_miss(
# data,
# ...,
# label = "any_miss",
# missing = "missing",
# complete = "complete"
# )
df %>%
  select(imd_quint:crimvict) %>%
  naniar::add_any_miss(., imd_quint:crimvict, label = "listwise")
# A tibble: 32,101 × 12
imd_quint income sex ageg edu employ bornuk strata psu weight
<fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl>
1 Q4 5,000… Male 45-54 Degr… Econo… Born … 30011 1.12e7 0.671
2 Q2 70,00… Male 35-44 Degr… Emplo… Born … 30011 1.12e7 1.40
3 Q5 (Less deprive… 25,00… Male 55-64 Appr… Emplo… Born … 30011 1.12e7 0.615
4 Q1 (Most deprive… 20,00… Male 16-24 None Emplo… Born … 30011 1.12e7 1.77
5 Q4 10,00… Male 16-24 Appr… Emplo… Born … 30011 1.12e7 1.60
6 Q5 (Less deprive… 30,00… Fema… 45-54 Degr… Emplo… Born … 30011 1.12e7 1.17
7 Q2 40,00… Male 35-44 Degr… Emplo… Born … 30011 1.12e7 2.80
8 Q2 10,00… Male 55-64 None Emplo… Born … 30011 1.12e7 1.18
9 Q3 15,00… Fema… 35-44 Degr… Emplo… Born … 30011 1.12e7 0.596
10 Q4 15,00… Male 55-64 None Emplo… Born … 30011 1.12e7 1.73
# ℹ 32,091 more rows
# ℹ 2 more variables: crimvict <fct>, listwise_vars <chr>
More information about the package is available at naniar.
💡 Solution
In the next chunk of code we set the non-substantive levels ("Refused" and "Don't Know") of the variables income
and bornuk
as <NA>
.
## Set factor levels as NA for income and bornuk
df <- df %>%
  mutate(income = fct_na_level_to_value(income, c("Refused", "Don't Know")),
         bornuk = fct_na_level_to_value(bornuk, c("Refused", "Don't Know")))
## check that the non-valid levels are set as NA
df %>%
  select(income, bornuk) %>%
  sjmisc::frq()
What is your personal (and partners) gross income (income) <categorical>
# total N=32101 valid N=28098 mean=6.53 sd=3.73
Value | N | Raw % | Valid % | Cum. %
------------------------------------------------
Under 5,000 | 1042 | 3.25 | 3.71 | 3.71
5,000-9,999 | 2938 | 9.15 | 10.46 | 14.16
10,000-14,999 | 3606 | 11.23 | 12.83 | 27.00
15,000-19,999 | 3177 | 9.90 | 11.31 | 38.31
20,000-24,999 | 2763 | 8.61 | 9.83 | 48.14
25,000-29,999 | 2336 | 7.28 | 8.31 | 56.45
30,000-34,999 | 1880 | 5.86 | 6.69 | 63.14
35,000-39,999 | 1772 | 5.52 | 6.31 | 69.45
40,000-44,999 | 1368 | 4.26 | 4.87 | 74.32
45,000-49,999 | 1345 | 4.19 | 4.79 | 79.11
50,000-59,999 | 1539 | 4.79 | 5.48 | 84.58
60,000-69,999 | 1201 | 3.74 | 4.27 | 88.86
70,000 or over | 3131 | 9.75 | 11.14 | 100.00
<NA> | 4003 | 12.47 | <NA> | <NA>
Person was born in the UK (bornuk) <categorical>
# total N=32101 valid N=32056 mean=1.85 sd=0.36
Value | N | Raw % | Valid % | Cum. %
----------------------------------------------
Born abroad | 4960 | 15.45 | 15.47 | 15.47
Born in UK | 27096 | 84.41 | 84.53 | 100.00
<NA> | 45 | 0.14 | <NA> | <NA>
## Complete case analysis indicator
df <- df %>%
  naniar::add_any_miss(., imd_quint:crimvict, label = "listwise")
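A quick optional check of the new indicator: how many cases are complete and will therefore enter the complete case analysis?

# Check the distribution of the complete case indicator created above
table(df$listwise_vars)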
2. Exploring the relationship between area deprivation and crime victimisation
Survey samples are designed in different and complex ways. Some important design features are:
- Clustering: households or individuals are sometimes clustered in geographical units (e.g., postcodes or local authorities). Clusters can be sampled to decrease the costs of the survey. The first set of clusters that are selected are called primary sampling units (PSUs).
- Stratification: the sampling frame is divided into mutually exclusive groups (strata) that are internally more homogeneous than the population as a whole. The sample is then selected from each stratum.
- Weights: weights are used to adjust for non-response and to make the sample representative of the population.
All these features can impact the estimates (e.g., mean, proportion, regression coefficients) and their standard errors (i.e., the precision of the estimate). The weights can impact both the point estimates and the standard errors, whilst the clustering and stratification can impact the standard errors only.
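As a quick preview of the survey package functions introduced below, here is a minimal sketch (with hypothetical objects survey_df, cluster, strat, wt and var1) of where each design feature enters:

# Clustering and stratification only change the standard error;
# the weights change both the point estimate and the standard error.
svy_unweighted <- svydesign(id = ~cluster, strata = ~strat, data = survey_df) # warns that equal probabilities are assumed
svy_full <- svydesign(id = ~cluster, strata = ~strat, weights = ~wt, data = survey_df)
svymean(~var1, svy_unweighted, na.rm = TRUE) # unweighted estimate, design-adjusted SE
svymean(~var1, svy_full, na.rm = TRUE)       # weighted estimate, design-adjusted SE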
As an analyst, you will need to read the survey documentation to understand the sample design and the weights used in the survey. The survey
package in R provides functions to produce estimates that take the complex sample design and weights into account.
Exercise 2. Using the survey
package: bivariate analysis and fitting a logistic regression model
The second step in the analysis is to explore the bivariate and multivariate relationships among the variables. The objective is to produce a series of cross-tabs and a logistic regression model that allow us to understand the relationship between area deprivation and crime victimisation. In this exercise, you will use the survey package to estimate cross-tabs and fit a logistic regression model. The box below provides some basic information for getting started with the survey
package.
survey
package
The survey
package in R provides functions to estimate statistics taking into account the complex sample design and weights. When using the survey package the first step is to create a survey object using the function svydesign()
. These are some of the main arguments of the function:
- id: the variable that identifies the primary sampling unit (PSU) or cluster.
- data: the data frame with the survey data.
- strata: the variable that identifies the strata.
- weights: the variable that contains the weights.
Note this package is formula-based, so you will use ~
to specify the variables (See example below).
# EXAMPLE: Set the complex sample design using the survey package
## The survey documentation can provide you with the name of the variables
## relevant to specify the survey design.
## Note this package is formula based, so you will use "~" quite a
## few times, for example, to specify the variables.
## Let's assume the variable indicating the:
## - PSU/clustering is `cluster` (if the sample is unclustered use "~1")
## - Strata is `strat`
## - Weights is `wt`
## - Data frame is `survey_df`
df_svy <- svydesign(id = ~cluster,
                    data = survey_df,
                    strata = ~strat,
                    weights = ~wt)
Then, this object will be the basis for calculating all the statistics. Throughout the rest of the handout I will walk you through different functions of the package.
# EXAMPLE: Calculate the mean (and standard error) of a variable using the survey package
svymean(~var1,
        df_svy,
        na.rm = TRUE) # remove missing values, as in the mean() function
More information about the package is available at survey.
Q2.1 | Set the complex sample design using the survey
package.
Look for the variables that identify the different elements of the complex sample design (i.e., clustering, stratification and weights) in the data frame and create a new survey design object df_svy
using the function svydesign()
(see box above).
💡 Solution
In this dataset, the names of the variables were helpful to identify them. The variable psu
corresponds to the primary sampling units. The variable strata
identifies the strata used to select the sample design. Finally, the variable weight
identifies the non-response weight.
df_svy <- svydesign(id = ~psu,
                    data = df,
                    strata = ~strata,
                    weights = ~weight)
df_svy
Stratified 1 - level Cluster Sampling design (with replacement)
With (6130) clusters.
svydesign(id = ~psu, data = df, strata = ~strata, weights = ~weight)
Q2.2 | What are the demographic characteristics of those who have been victims of a crime in the last 12 months?
Produce a bivariate analysis of the relationship between the dependent variable crimvict
and the independent variables sex
, ageg
, edu
, employ
, bornuk
, income
and imd_quint
. Use the function svytable()
to produce the cross-tabs.
survey
package
The function svytable()
can be used to produce cross-tabs accounting for the survey sample design. The function has several arguments to control the output of the table. The function can be wrapped in summary()
to produce a summary of the table that includes an adjusted chi-square test.
# EXAMPLE: svytable object to produce a cross-tab of var1 and var2
## svytable(formula, design)
## formula: ~var1+var2
## design: the survey object created with svydesign()
tab_var1_var2 <- svytable(~var1+var2,
                          df_svy)
## This will print an adjusted chi-square test result
summary(tab_var1_var2)
The function prop.table()
can be used to produce the table with row or column percentages.
# EXAMPLE: produce a cross-tab with column percentages
## use the svytable object tab_var1_var2
## margin = 1 shows row percentages; margin = 2 shows column percentages;
prop.table(tab_var1_var2, margin = 2)*100
Earlier we decided to use a listwise (complete cases) approach, so the base for this analysis will only be those who did not have missing values in any of the variables involved. We also computed the variable listwise_vars, which identifies the complete cases (listwise_vars == "complete"). You can use this variable to subset the observations in the survey design object before producing the cross-tabs.
survey
design object: subset()
Sometimes you will need to filter the observations in the survey object before producing the statistics. The function subset()
can be used to filter the observations in the survey object.
# EXAMPLE: subset cases IF var1 == "Yes" AND var2 == "Married"
## subset(design, subset)
df_svy_sub <- subset(df_svy, var1 == "Yes" & var2 == "Married")
# EXAMPLE: subset can also be wrapped in a function
svymean(~var3, subset(df_svy, var1 == "Yes" & var2 == "Married"))
💡 Solution
Citizens living in more deprived areas, younger respondents, those with higher levels of education, and the unemployed are more likely to be victims of a crime and to report it in the survey.
# For each independent variable:
# 1) produce a cross-tab with the dependent variable crimvict
# using only the subset of complete cases listwise_vars == "complete"
# 2) produce the percentage table prop.table()*100
# 3) produce the summary of the table that includes an adjusted chi-square test
# see below for a much less verbose alternative using lapply()
# tab_imd_crimvict <- svytable(~imd_quint+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_imd_crimvict, margin = 1)*100
# summary(tab_imd_crimvict)
#
# tab_income_crimvict <- svytable(~income + crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_income_crimvict, margin = 1)*100
# summary(tab_income_crimvict)
#
# tab_sex_crimvict <- svytable(~sex+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_sex_crimvict, margin = 1)*100
# summary(tab_sex_crimvict)
#
# tab_ageg_crimvict <- svytable(~ageg+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_ageg_crimvict, margin = 1)*100
# summary(tab_ageg_crimvict)
#
# tab_edu_crimvict <- svytable(~edu+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_edu_crimvict, margin = 1)*100
# summary(tab_edu_crimvict)
#
# tab_employ_crimvict <- svytable(~employ+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_employ_crimvict, margin = 1)*100
# summary(tab_employ_crimvict)
#
# tab_bornuk_crimvict <- svytable(~bornuk+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_bornuk_crimvict, margin = 1)*100
# summary(tab_bornuk_crimvict)
## a substantially less verbose way to produce these tables and stats
## extract colnames of the indep. vars
colnam <- df %>%
  select(imd_quint:bornuk) %>%
  colnames()
## loop over the colnames to extract the percentage tables and stats
## function(x): [1] creates the svytable object tab
## [2] stores in a list the percentage table and the stats (chi-square test)
## [3] returns the list
lapply(colnam, function(x) {
  tab <- svytable(as.formula(paste0("~", x, "+crimvict")), subset(df_svy, listwise_vars == "complete"))

  output <- list(tab = prop.table(tab, margin = 1)*100,
                 stats = summary(tab)[[2]])
  output
})
[[1]]
[[1]]$tab
crimvict
imd_quint Not a victim Victim
Q1 (Most deprived areas) 85.593535 14.406465
Q2 86.236368 13.763632
Q3 88.207875 11.792125
Q4 90.061430 9.938570
Q5 (Less deprived areas) 91.056566 8.943434
[[1]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 18.346, ndf = 3.9802, ddf = 23423.2805, p-value = 5.337e-15
[[2]]
[[2]]$tab
crimvict
income Not a victim Victim
Under 5,000 86.33730 13.66270
5,000-9,999 88.62860 11.37140
10,000-14,999 89.88413 10.11587
15,000-19,999 88.70193 11.29807
20,000-24,999 89.43096 10.56904
25,000-29,999 88.49583 11.50417
30,000-34,999 87.43539 12.56461
35,000-39,999 87.67279 12.32721
40,000-44,999 87.60582 12.39418
45,000-49,999 89.44856 10.55144
50,000-59,999 86.86598 13.13402
60,000-69,999 87.15076 12.84924
70,000 or over 87.49570 12.50430
[[2]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 1.6421, ndf = 11.579, ddf = 68145.147, p-value = 0.07587
[[3]]
[[3]]$tab
crimvict
sex Not a victim Victim
Male 87.98230 12.01770
Female 88.54305 11.45695
[[3]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 1.4481, ndf = 1, ddf = 5885, p-value = 0.2289
[[4]]
[[4]]$tab
crimvict
ageg Not a victim Victim
16-24 83.964330 16.035670
25-34 86.323446 13.676554
35-44 87.018210 12.981790
45-54 87.012241 12.987759
55-64 89.148940 10.851060
65-74 92.979486 7.020514
75+ 95.464105 4.535895
[[4]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 34.951, ndf = 5.3278, ddf = 31354.2051, p-value < 2.2e-16
[[5]]
[[5]]$tab
crimvict
edu Not a victim Victim
None 91.337345 8.662655
O level/GCSE 88.311435 11.688565
Apprenticeship or A/AS level 87.102063 12.897937
Degree or diploma 87.424120 12.575880
Other 90.080638 9.919362
[[5]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 10.175, ndf = 3.9746, ddf = 23390.4897, p-value = 3.446e-08
[[6]]
[[6]]$tab
crimvict
employ Not a victim Victim
Employed 86.97212 13.02788
Unemployed 84.44346 15.55654
Economically inactive 91.11154 8.88846
[[6]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 34.37, ndf = 1.9489, ddf = 11469.2804, p-value = 2.826e-15
[[7]]
[[7]]$tab
crimvict
bornuk Not a victim Victim
Born abroad 87.79636 12.20364
Born in UK 88.35912 11.64088
[[7]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 0.78986, ndf = 1, ddf = 5885, p-value = 0.3742
Q2.3 | What is the relationship between area deprivation and crime victimisation? What are the other significant predictors of crime victimisation?
Fit a logistic regression model to estimate the relationship between area deprivation and crime victimisation. The model should include the independent variables sex
, ageg
, edu
, employ
, bornuk
, income
and imd_quint
. Use the function svyglm()
to fit the model.
survey
package: svyglm()
The function svyglm()
can be used to fit a logistic regression model accounting for the survey sample design and weights. The function has several arguments to control the output of the model. The function can be wrapped in summary()
to produce a summary of the model that includes the coefficients and their standard errors.
# EXAMPLE:
## svyglm(formula, design, family)
## formula: depvar ~ var1 + var2 (the dependent variable goes on the left-hand side)
## design: the survey object created with svydesign()
## family: the family of the model, e.g., quasibinomial for logistic regression
model <- svyglm(depvar ~ var1 + var2,
                df_svy,
                family = quasibinomial)
💡 Solution
People living in more deprived areas are more likely to suffer and report a crime. The other significant predictors of crime victimisation are age (younger people are more likely to be victims of crime) and education (people with higher levels of education are more likely to be victims than those with no qualifications).
model_csd <- svyglm(crimvict ~ imd_quint + income + sex + ageg + edu + employ + bornuk,
                    design = df_svy,
                    family = quasibinomial)
summary(model_csd)
Call:
svyglm(formula = crimvict ~ imd_quint + income + sex + ageg +
edu + employ + bornuk, design = df_svy, family = quasibinomial)
Survey design:
svydesign(id = ~psu, data = df, strata = ~strata, weights = ~weight)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.69874 0.15382 -11.044 < 2e-16 ***
imd_quintQ2 -0.07694 0.07201 -1.068 0.28535
imd_quintQ3 -0.24239 0.07635 -3.175 0.00151 **
imd_quintQ4 -0.44850 0.07987 -5.615 2.06e-08 ***
imd_quintQ5 (Less deprived areas) -0.56583 0.08676 -6.521 7.55e-11 ***
income5,000-9,999 0.06251 0.13754 0.454 0.64951
income10,000-14,999 -0.01975 0.13797 -0.143 0.88616
income15,000-19,999 0.06903 0.14023 0.492 0.62256
income20,000-24,999 -0.03687 0.14352 -0.257 0.79729
income25,000-29,999 0.06564 0.14647 0.448 0.65407
income30,000-34,999 0.15036 0.15268 0.985 0.32475
income35,000-39,999 0.12574 0.15016 0.837 0.40241
income40,000-44,999 0.13242 0.15222 0.870 0.38440
income45,000-49,999 -0.06427 0.15925 -0.404 0.68653
income50,000-59,999 0.19541 0.15240 1.282 0.19982
income60,000-69,999 0.16845 0.16090 1.047 0.29516
income70,000 or over 0.17134 0.14169 1.209 0.22660
sexFemale -0.03049 0.04617 -0.660 0.50907
ageg25-34 -0.24696 0.09302 -2.655 0.00795 **
ageg35-44 -0.27293 0.09149 -2.983 0.00286 **
ageg45-54 -0.25342 0.09412 -2.692 0.00711 **
ageg55-64 -0.42682 0.09815 -4.349 1.39e-05 ***
ageg65-74 -0.83235 0.11187 -7.440 1.15e-13 ***
ageg75+ -1.24144 0.13934 -8.910 < 2e-16 ***
eduO level/GCSE 0.09452 0.08396 1.126 0.26032
eduApprenticeship or A/AS level 0.22773 0.08338 2.731 0.00633 **
eduDegree or diploma 0.23164 0.08012 2.891 0.00385 **
eduOther 0.01968 0.14149 0.139 0.88936
employUnemployed 0.11993 0.15848 0.757 0.44925
employEconomically inactive -0.02075 0.06653 -0.312 0.75513
bornukBorn in UK 0.08278 0.06479 1.278 0.20141
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 0.9907107)
Number of Fisher Scoring iterations: 5
exp(coef(model_csd))
(Intercept) imd_quintQ2
0.1829146 0.9259447
imd_quintQ3 imd_quintQ4
0.7847516 0.6385847
imd_quintQ5 (Less deprived areas) income5,000-9,999
0.5678905 1.0645035
income10,000-14,999 income15,000-19,999
0.9804409 1.0714655
income20,000-24,999 income25,000-29,999
0.9638048 1.0678405
income30,000-34,999 income35,000-39,999
1.1622578 1.1339872
income40,000-44,999 income45,000-49,999
1.1415842 0.9377524
income50,000-59,999 income60,000-69,999
1.2158042 1.1834728
income70,000 or over sexFemale
1.1868982 0.9699723
ageg25-34 ageg35-44
0.7811681 0.7611436
ageg45-54 ageg55-64
0.7761380 0.6525815
ageg65-74 ageg75+
0.4350267 0.2889675
eduO level/GCSE eduApprenticeship or A/AS level
1.0991306 1.2557478
eduDegree or diploma eduOther
1.2606679 1.0198787
employUnemployed employEconomically inactive
1.1274132 0.9794646
bornukBorn in UK
1.0863016
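If you also want a measure of uncertainty around these odds ratios, a small optional extension (not required by the exercise) is to exponentiate the design-based confidence intervals:

# Optional: 95% confidence intervals for the odds ratios
exp(confint(model_csd))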
Q2.4 | How does the complex sample design of the survey affect your coefficient estimates compared to using a simple logistic regression model?
In this final question you are asked to reproduce the model above but using a new survey object df_nosvy
that does not take into account the complex sample design. Compare the coefficient estimates of the model with and without the complex sample design.
💡 Solution
df_nosvy <- svydesign(id = ~1,
                      data = df)
Warning in svydesign.default(id = ~1, data = df): No weights or probabilities
supplied, assuming equal probability
model_nocsd <- svyglm(crimvict ~ imd_quint + income + sex + ageg + edu + employ + bornuk,
                      design = df_nosvy,
                      family = quasibinomial)
exp(cbind(OR = coef(model_csd), coef(model_nocsd)))
OR
(Intercept) 0.1829146 0.1964295
imd_quintQ2 0.9259447 0.9424983
imd_quintQ3 0.7847516 0.7336915
imd_quintQ4 0.6385847 0.6420192
imd_quintQ5 (Less deprived areas) 0.5678905 0.5738531
income5,000-9,999 1.0645035 1.0154513
income10,000-14,999 0.9804409 0.9521272
income15,000-19,999 1.0714655 0.9691949
income20,000-24,999 0.9638048 0.8746230
income25,000-29,999 1.0678405 0.9913687
income30,000-34,999 1.1622578 0.9827621
income35,000-39,999 1.1339872 1.1054911
income40,000-44,999 1.1415842 1.1101780
income45,000-49,999 0.9377524 0.8895463
income50,000-59,999 1.2158042 1.0981366
income60,000-69,999 1.1834728 1.0227561
income70,000 or over 1.1868982 1.0864087
sexFemale 0.9699723 0.9699128
ageg25-34 0.7811681 0.8026176
ageg35-44 0.7611436 0.7450115
ageg45-54 0.7761380 0.7500114
ageg55-64 0.6525815 0.6087448
ageg65-74 0.4350267 0.4367209
ageg75+ 0.2889675 0.2740949
eduO level/GCSE 1.0991306 1.0912705
eduApprenticeship or A/AS level 1.2557478 1.2429151
eduDegree or diploma 1.2606679 1.2974064
eduOther 1.0198787 0.9180863
employUnemployed 1.1274132 0.9551244
employEconomically inactive 0.9794646 0.9914116
bornukBorn in UK 1.0863016 1.0903409
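Since the question above focuses on the coefficients, a possible follow-up (a short sketch reusing the two models fitted above) is to also compare the standard errors, which is where the clustering and stratification mainly show up:

# Optional: compare the standard errors of the design-based and naive models
round(cbind(SE_design = sqrt(diag(vcov(model_csd))),
            SE_naive = sqrt(diag(vcov(model_nocsd)))), 3)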