# install.packages(c("tidyverse", "survey", "labelled", "sjmisc", "naniar"))
library(tidyverse)
library(survey)
Are area deprivation and crime victimisation related? An analysis using the Crime Survey for England and Wales
SC385 | L10. Introduction to survey methods | 2024/25
Introduction
What is the role of area deprivation in crime victimisation? There is an ongoing discussion about the role of neighbourhood characteristics in individual outcomes. Studies in the US have shown that deprived areas tend to be racially segregated, have higher crime rates, and have lower-quality public services (e.g., Sampson et al., 1997).
The objective of this lab is to examine the relationship between area deprivation and crime victimisation in England. We will explore whether citizens living in more deprived areas suffer more or less crime than those in wealthier areas. The analysis will also control for other individual characteristics, such as personal income and education.
R packages and options
Before introducing the survey we will be using, let's load the packages and set the options for the analysis. If you do not have the packages installed, you can install them by uncommenting and running the install.packages() line in the setup code at the top of this handout.
The dataset
We will use the Crime Survey for England and Wales (CSEW) 2017/18 teaching dataset for this lab. The CSEW is a large-scale, household-based survey designed to collect data on experiences of crime, perceptions of crime, and attitudes toward the criminal justice system. The survey is conducted throughout the year and collects data about experiences of crime in the last 12 months. The interviews for the 2017/18 edition were carried out face-to-face, and sensitive topics were asked about using a self-completion questionnaire. The survey is designed to be representative of the population living in private households in England and Wales.
The CSEW aims to provide insights into crime trends over time, including those not reported to the police, and public perceptions of crime and safety.
The survey covers England and Wales and includes individuals aged 16 and over, with a separate module for children aged 10-15. All household members are invited to take part in the survey.
The survey uses a stratified, multi-stage random probability sample. A sample of postcode sectors (the primary sampling units, PSUs) is selected from the Postcode Address File. Then, addresses are selected within the sampled postcode sectors.
The 2017/18 CSEW interviewed approximately 35,000 adults and 3,000 children aged 10-15.
The data can be downloaded through the UK Data Service.
First, let's read in the data, which you will find on Moodle in RDS format. The dataset contains the responses of 32,101 adults to the individual interview and 22 variables. You can use the functions head()
or glimpse()
to have a first look at the dataset.
df <- read_rds("data/cse_2017-18_teaching.RDS")
head(df)
serial | carmot | mottheft | motstole | cardamag | homethef | yrhotry | yrhostol | yrdeface | persthef | delibdam | delibvio | imd_quint | income | sex | ageg | edu | employ | bornuk | strata | psu | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
423002 | No | NA | NA | NA | NA | No | No | No | No | No | No | Q4 | 5,000-9,999 | Male | 45-54 | Degree or diploma | Economically inactive | Born in UK | 30011 | 11172700 | 0.6713125 |
423008 | Yes | No | No | No | NA | No | No | No | No | No | No | Q2 | 70,000 or over | Male | 35-44 | Degree or diploma | Employed | Born abroad | 30011 | 11172666 | 1.3990165 |
423023 | Yes | No | No | No | NA | No | No | No | No | No | No | Q5 (Less deprived areas) | 25,000-29,999 | Male | 55-64 | Apprenticeship or A/AS level | Employed | Born in UK | 30011 | 11172802 | 0.6150459 |
423106 | No | NA | NA | NA | NA | No | No | No | No | No | No | Q1 (Most deprived areas) | 20,000-24,999 | Male | 16-24 | None | Employed | Born abroad | 30011 | 11172774 | 1.7655339 |
423119 | No | NA | NA | NA | NA | No | No | No | No | No | No | Q4 | 10,000-14,999 | Male | 16-24 | Apprenticeship or A/AS level | Employed | Born in UK | 30011 | 11172418 | 1.6044675 |
423201 | Yes | No | No | No | NA | No | No | No | No | No | No | Q5 (Less deprived areas) | 30,000-34,999 | Female | 45-54 | Degree or diploma | Employed | Born in UK | 30011 | 11172054 | 1.1689697 |
Variables
The function var_label
of the package labelled
can extract the label of each variable in the data frame. The code below is a simple way to generate a data frame with two columns: the variable name and the label.
labelled::var_label(df) %>%
  as_tibble() %>% # transform the output of var_label into a tibble object
  pivot_longer(everything(), names_to = "variable_name", values_to = "variable_label") # reshape the data frame from a wide format to long format (more convenient for exploring the data)
variable_name | variable_label |
---|---|
serial | Serial number (6 digits) |
carmot | If has a car or motorcycle |
mottheft | If vehicle stolen or driven away without permission |
motstole | If something stolen off or out of vehicle |
cardamag | If vehicle tampered with or damaged |
homethef | If anyone got into current residence to steal/try to steal |
yrhotry | If anyone tried to get into current residence to steal/cause damage |
yrhostol | If anything was stolen out of current residence |
yrdeface | If anything was damaged outside current residence |
persthef | If anything was stolen out of hands, pockets or bag |
delibdam | If personal items have been deliberately damaged |
delibvio | If anyone has deliberately used force/violence on adult respondent |
imd_quint | English Index of Multiple Deprivation 2015 (quintiles) |
income | What is your personal (and partners) gross income |
sex | Adult number 1 (respondent): Sex |
ageg | Age group (7 bands) |
edu | Respondent education (5 categories) |
employ | Respondent employment status |
bornuk | Person was born in the UK |
strata | Stratum (2015 definition) |
psu | PSU (2015 definition) |
weight | Individual level weight (mean=1) |
Analysis objective
The objective of this lab is to estimate the relationship between area deprivation and crime victimisation controlling for other demographic factors: sex, age, education, employment status, whether the person was born in the UK and personal income.
To measure crime victimisation, we will create a new variable that identifies whether the respondent has been a victim of at least one type of crime in the last 12 months.
To measure the deprivation of the area where the respondent lives, we will use the Index of Multiple Deprivation (IMD), an index that ranks neighbourhoods in England from the most deprived to the least deprived. The IMD is a composite measure that includes income, employment, health, education, crime, and living environment.
The Index of Multiple Deprivation (IMD) measures relative deprivation in small areas across each of the countries of the United Kingdom.
Areas are ranked from the most deprived area (rank 1) to the least deprived area.
Each country measures deprivation in a slightly different way but the broad themes include income, employment, education, health, crime, barriers to housing and services, and the living environment.
The statistical unit areas used to provide indices of relative deprivation across the country are Lower layer Super Output Areas (LSOAs) [England, Wales], Data Zones [Scotland] and Super Output Areas or Wards [Northern Ireland].
More information about the IMD.
The analysis will be conducted in two steps:
1. Preparing variables for the analysis. We will compute the dependent variable crimvict using a set of variables from the survey. Before computing the variable, we will explore the data and identify the missing values in these variables. Then, we will also look into the missingness of the independent variables. Finally, we will make decisions about how to deal with the missing values in these variables.
2. Exploring the relationship between area deprivation and crime victimisation. We will estimate the relationship between area deprivation and crime victimisation using a logistic regression model. This will be an opportunity to learn how to estimate the model using the survey package in R, which takes into account some information about the complex sample design and weights to adjust for non-response.
1. Preparing the variables for the analysis: dealing with missing values
Planned missingness. The researcher plans that some questions are not going to be presented to some respondents. For example, a long questionnaire might be split into a common set of questions that are asked to all respondents and different sets of questions that are asked to random subsets. Also, some questions might be asked only to those who meet certain criteria. For example, questions about the experience of using public transport might be asked only to those who use public transport.
Unplanned missingness. Sometimes a respondent refuses to answer a question or, in self-administered mode, skips the questions. This can also happen by mistake. The result is that we do not have information about this question for this respondent.
Non-substantive responses. Some responses might be treated as substantive or as missing depending on the research question and the objectives of the analysis. For example, a respondent might answer “Don’t know” to a question about their income. This might be considered a substantive response if the objective is to understand respondents’ knowledge of their own income. However, if the objective is to estimate the average income of the population, this response might be treated as missing.
For a comprehensive guide on missing data in surveys, see Reid and Allum (2019).
Exercise 1. Identifying and dealing with missing values: the case of the CSEW
The first step in the analysis is to explore the data and identify the missing values. Before starting the analysis, the analyst should have a clear answer to the following questions: What types of missing values are in the data, and how many? How can they affect my analysis? Which decisions, if any, do I need to make before proceeding with the analysis? This exercise will guide you through the process of identifying the missing values in the dataset.
Q1.1 | What percentage of the sample members have been victims of a crime in the last 12 months? Computing the dependent variable
a) Produce frequency tables of the variables that will be used to compute the dependent variable `crimvict`. The variables mottheft:delibvio
contain the information about whether the respondent has been a victim of different crimes. You can use the function select
to select these variables and then use the function sjmisc::frq()
to produce the frequency tables. The frequency tables will allow us to explore the variables and identify possible missing values. What proportion of missing values do you observe for each variable?
R has some helper functions to select multiple variables at once. The colon operator :
can be used to select a range of variables. For example, var1:var3
will select all variables from var1
to var3
. The c()
function can be used to select multiple variables. For example, c(var1, var2, var3)
will select the variables var1
, var2
, and var3
.
If you use tidyverse (i.e., dplyr package) for data management some additional helpers are available:
starts_with("prefix")
selects all variables that start with “prefix”.ends_with("suffix")
selects all variables that end with “suffix”.contains("pattern")
selects all variables that contain “pattern”.everything()
selects all variables.
More information (and helper functions) are available at tidy select.
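For instance, here is a small sketch (using the victimisation columns shown above; none of these calls is required for the exercise) of different ways of selecting variables with select():

# EXAMPLE: different ways of selecting the victimisation variables
df %>% select(mottheft:delibvio)                # a range, using the colon operator
df %>% select(c(mottheft, motstole, cardamag))  # an explicit set with c()
df %>% select(starts_with("mot"))               # tidyselect helper: mottheft and motstole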
💡 Solution
As you can see below, the variables have a small number (<1%) of user missing values (i.e., “Refused” or “Don’t Know”). However, the variables mottheft
, motstole
, cardamag
and homethef present a high number of system missing values (<NA>). The next step is to determine what is causing these missing values.
df %>%
  select(mottheft:delibvio) %>%
  sjmisc::frq()
If vehicle stolen or driven away without permission (mottheft) <categorical>
# total N=32101 valid N=25812 mean=2.00 sd=0.07
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 128 | 0.40 | 0.50 | 0.50
No | 25680 | 80.00 | 99.49 | 99.98
Refused | 1 | 0.00 | 0.00 | 99.99
Don't know | 3 | 0.01 | 0.01 | 100.00
<NA> | 6289 | 19.59 | <NA> | <NA>
If something stolen off or out of vehicle (motstole) <categorical>
# total N=32101 valid N=25812 mean=1.97 sd=0.17
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 784 | 2.44 | 3.04 | 3.04
No | 25020 | 77.94 | 96.93 | 99.97
Refused | 1 | 0.00 | 0.00 | 99.97
Don't know | 7 | 0.02 | 0.03 | 100.00
<NA> | 6289 | 19.59 | <NA> | <NA>
If vehicle tampered with or damaged (cardamag) <categorical>
# total N=32101 valid N=25812 mean=1.96 sd=0.23
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 1222 | 3.81 | 4.73 | 4.73
No | 24553 | 76.49 | 95.12 | 99.86
Refused | 1 | 0.00 | 0.00 | 99.86
Don't know | 36 | 0.11 | 0.14 | 100.00
<NA> | 6289 | 19.59 | <NA> | <NA>
If anyone got into current residence to steal/try to steal (homethef) <categorical>
# total N=32101 valid N=2897 mean=2.00 sd=0.10
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 13 | 0.04 | 0.45 | 0.45
No | 2880 | 8.97 | 99.41 | 99.86
Refused | 0 | 0.00 | 0.00 | 99.86
Don't know | 4 | 0.01 | 0.14 | 100.00
<NA> | 29204 | 90.98 | <NA> | <NA>
If anyone tried to get into current residence to steal/cause damage (yrhotry) <categorical>
# total N=32101 valid N=32101 mean=1.99 sd=0.12
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 309 | 0.96 | 0.96 | 0.96
No | 31758 | 98.93 | 98.93 | 99.89
Refused | 5 | 0.02 | 0.02 | 99.91
Don't know | 29 | 0.09 | 0.09 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
If anything was stolen out of current residence (yrhostol) <categorical>
# total N=32101 valid N=32101 mean=2.00 sd=0.06
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 85 | 0.26 | 0.26 | 0.26
No | 32007 | 99.71 | 99.71 | 99.97
Refused | 5 | 0.02 | 0.02 | 99.99
Don't know | 4 | 0.01 | 0.01 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
If anything was damaged outside current residence (yrdeface) <categorical>
# total N=32101 valid N=32101 mean=1.99 sd=0.12
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 437 | 1.36 | 1.36 | 1.36
No | 31644 | 98.58 | 98.58 | 99.94
Refused | 5 | 0.02 | 0.02 | 99.95
Don't know | 15 | 0.05 | 0.05 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
If anything was stolen out of hands, pockets or bag (persthef) <categorical>
# total N=32101 valid N=32101 mean=1.99 sd=0.10
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 304 | 0.95 | 0.95 | 0.95
No | 31784 | 99.01 | 99.01 | 99.96
Refused | 5 | 0.02 | 0.02 | 99.98
Don't know | 8 | 0.02 | 0.02 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
If personal items have been deliberately damaged (delibdam) <categorical>
# total N=32101 valid N=32101 mean=2.00 sd=0.07
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 120 | 0.37 | 0.37 | 0.37
No | 31970 | 99.59 | 99.59 | 99.97
Refused | 5 | 0.02 | 0.02 | 99.98
Don't know | 6 | 0.02 | 0.02 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
If anyone has deliberately used force/violence on adult respondent (delibvio) <categorical>
# total N=32101 valid N=32101 mean=1.99 sd=0.12
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------
Yes | 446 | 1.39 | 1.39 | 1.39
No | 31640 | 98.56 | 98.56 | 99.95
Refused | 11 | 0.03 | 0.03 | 99.99
Don't know | 4 | 0.01 | 0.01 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
b) The variables mottheft
, motstole
, cardamag
are about the respondents’ cars or motorcycles. It is time to look at the data frame (or questionnaire) to see whether these variables are planned or unplanned missingness. Identify the variable that measures whether the respondent has a car or motorcycle and produce cross-tabs using the function table()
. Are the <NA>
of these variables planned or unplanned missingness? How would you treat these missing values when computing the variable crimvict
?
💡 Solution
The cross-tabs below show that the variables mottheft
, motstole
and cardamag
are only asked to those who own a car or motorcycle. Note that those who do not have a car or motorcycle (carmot == "No") or refused to answer that question are coded as missing at mottheft
, motstole
and cardamag
. For the purpose of our analysis we can ignore the missing values in these variables (<NA>
).
table(df$carmot, df$mottheft, useNA = "always")
Yes No Refused Don't know <NA>
Yes 128 25680 1 3 0
No 0 0 0 0 6282
Refused 0 0 0 0 7
<NA> 0 0 0 0 0
table(df$carmot, df$motstole, useNA = "always")
Yes No Refused Don't know <NA>
Yes 784 25020 1 7 0
No 0 0 0 0 6282
Refused 0 0 0 0 7
<NA> 0 0 0 0 0
table(df$carmot, df$cardamag, useNA = "always")
Yes No Refused Don't know <NA>
Yes 1222 24553 1 36 0
No 0 0 0 0 6282
Refused 0 0 0 0 7
<NA> 0 0 0 0 0
c) We now know that some respondents refused or did not know the answer to some of the questions. In addition, some questions were only asked to the subset of respondents with a certain characteristic (e.g., having a car or motorbike). It is now time to create a dummy variable that identifies those who have suffered at least one type of crime in the last 12 months.
Compute this new variable called crimvict
, a factor with two levels: “Victim” and “Not a victim”. Use the variables mottheft:delibvio
to generate the new variable. You can use the function mutate
to create the new variable and the function case_when
to define the conditions. The function fct_relevel
can be used to reorder the levels of the factor.
💡 Solution
df <- df %>%
  mutate(crimvict = fct_relevel(
    as_factor(
      case_when(mottheft == "Yes" |
                  motstole == "Yes" |
                  cardamag == "Yes" |
                  homethef == "Yes" |
                  yrhotry == "Yes" |
                  yrhostol == "Yes" |
                  yrdeface == "Yes" |
                  persthef == "Yes" |
                  delibdam == "Yes" |
                  delibvio == "Yes" ~ "Victim",
                TRUE ~ "Not a victim")), sort))
# Check the new variable: those who responded "Yes" for each of the individual items should be "Victim" in the new variable crimvict
table(df$crimvict, df$mottheft, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 22838 1 1 5866
Victim 128 2842 0 2 423
<NA> 0 0 0 0 0
table(df$crimvict, df$motstole, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 22833 1 6 5866
Victim 784 2187 0 1 423
<NA> 0 0 0 0 0
table(df$crimvict, df$cardamag, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 22810 1 29 5866
Victim 1222 1743 0 7 423
<NA> 0 0 0 0 0
table(df$crimvict, df$homethef, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 2563 0 4 26139
Victim 13 317 0 0 3065
<NA> 0 0 0 0 0
table(df$crimvict, df$yrhotry, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 28680 5 21 0
Victim 309 3078 0 8 0
<NA> 0 0 0 0 0
table(df$crimvict, df$yrhostol, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 28699 5 2 0
Victim 85 3308 0 2 0
<NA> 0 0 0 0 0
table(df$crimvict, df$yrdeface, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 28688 5 13 0
Victim 437 2956 0 2 0
<NA> 0 0 0 0 0
table(df$crimvict, df$persthef, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 28694 5 7 0
Victim 304 3090 0 1 0
<NA> 0 0 0 0 0
table(df$crimvict, df$delibdam, useNA = "always")
Yes No Refused Don't know <NA>
Not a victim 0 28696 5 5 0
Victim 120 3274 0 1 0
<NA> 0 0 0 0 0
## A less verbose alternative to compute the variable
# df <- df %>%
# rowwise() %>%
# mutate(victcount = sum(c_across(mottheft:delibvio) == "Yes", na.rm = T),
# crimvict = fct_relevel(
# as_factor(
# if_else(victcount > 0, "Victim", "Not a victim"))
# ))
#
#
# table(df$victcount, df$crimvict)
d) Produce a frequency table to answer Q1.1: What percentage of the sample members have been victims of a crime in the last 12 months?
From this frequency table, can we infer that 10.6% of people living in England were victims of a crime in 2017/18?
💡 Solution
sjmisc::frq(df$crimvict)
x <categorical>
# total N=32101 valid N=32101 mean=1.11 sd=0.31
Value | N | Raw % | Valid % | Cum. %
-----------------------------------------------
Not a victim | 28706 | 89.42 | 89.42 | 89.42
Victim | 3395 | 10.58 | 10.58 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
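As a side note, the 10.58% above is the raw (unweighted) sample percentage, so it should not be read directly as a population estimate. A minimal sketch of a weighted version is shown below; the proper design-based approach with the survey package is introduced in part 2.

# Optional: compare the raw percentage with a weighted one using the individual weight
weighted.mean(df$crimvict == "Victim", w = df$weight) * 100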
Q1.2 | What is the proportion of missing values in the independent variables?
Produce frequency tables to examine the independent variables: sex
, ageg
, edu
, employ
, bornuk
, income
and imd_quint
. What proportion of missing values do you observe for each variable? Which variables would need some preparation before fitting the logistic regression model?
💡 Solution
Some variables such as sex
, ageg
and imd_quint
have no missing values. The variables edu
, employ
and bornuk
have a small number of missing values.
The variable income
has a higher number of missing values. Some people did not know or refused to answer the question. The variable income
is a key variable in our analysis, so we need to decide how to deal with the missing values in this variable.
The frequency tables are also useful to identify the levels (or categories) that need to be set as missing for the analysis. For example, the variable bornuk has the categories "Refused" and "Don't Know", which should be set as missing.
df %>%
  select(imd_quint:bornuk) %>%
  sjmisc::frq()
English Index of Multiple Deprivation 2015 (quintiles) (imd_quint) <categorical>
# total N=32101 valid N=32101 mean=3.04 sd=1.40
Value | N | Raw % | Valid % | Cum. %
----------------------------------------------------------
Q1 (Most deprived areas) | 6066 | 18.90 | 18.90 | 18.90
Q2 | 6282 | 19.57 | 19.57 | 38.47
Q3 | 6697 | 20.86 | 20.86 | 59.33
Q4 | 6549 | 20.40 | 20.40 | 79.73
Q5 (Less deprived areas) | 6507 | 20.27 | 20.27 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
What is your personal (and partners) gross income (income) <categorical>
# total N=32101 valid N=31944 mean=7.47 sd=4.34
Value | N | Raw % | Valid % | Cum. %
------------------------------------------------
Under 5,000 | 1042 | 3.25 | 3.26 | 3.26
5,000-9,999 | 2938 | 9.15 | 9.20 | 12.46
10,000-14,999 | 3606 | 11.23 | 11.29 | 23.75
15,000-19,999 | 3177 | 9.90 | 9.95 | 33.69
20,000-24,999 | 2763 | 8.61 | 8.65 | 42.34
25,000-29,999 | 2336 | 7.28 | 7.31 | 49.66
30,000-34,999 | 1880 | 5.86 | 5.89 | 55.54
35,000-39,999 | 1772 | 5.52 | 5.55 | 61.09
40,000-44,999 | 1368 | 4.26 | 4.28 | 65.37
45,000-49,999 | 1345 | 4.19 | 4.21 | 69.58
50,000-59,999 | 1539 | 4.79 | 4.82 | 74.40
60,000-69,999 | 1201 | 3.74 | 3.76 | 78.16
70,000 or over | 3131 | 9.75 | 9.80 | 87.96
Refused | 2375 | 7.40 | 7.43 | 95.40
Don't Know | 1471 | 4.58 | 4.60 | 100.00
<NA> | 157 | 0.49 | <NA> | <NA>
Adult number 1 (respondent): Sex (sex) <categorical>
# total N=32101 valid N=32101 mean=1.53 sd=0.50
Value | N | Raw % | Valid % | Cum. %
-----------------------------------------
Male | 14971 | 46.64 | 46.64 | 46.64
Female | 17130 | 53.36 | 53.36 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
Age group (7 bands) (ageg) <categorical>
# total N=32101 valid N=32101 mean=4.20 sd=1.81
Value | N | Raw % | Valid % | Cum. %
---------------------------------------
16-24 | 2144 | 6.68 | 6.68 | 6.68
25-34 | 4805 | 14.97 | 14.97 | 21.65
35-44 | 5213 | 16.24 | 16.24 | 37.89
45-54 | 5508 | 17.16 | 17.16 | 55.05
55-64 | 5237 | 16.31 | 16.31 | 71.36
65-74 | 5123 | 15.96 | 15.96 | 87.32
75+ | 4071 | 12.68 | 12.68 | 100.00
<NA> | 0 | 0.00 | <NA> | <NA>
Respondent education (5 categories) (edu) <categorical>
# total N=32101 valid N=31977 mean=2.91 sd=1.24
Value | N | Raw % | Valid % | Cum. %
---------------------------------------------------------------
None | 6379 | 19.87 | 19.95 | 19.95
O level/GCSE | 5713 | 17.80 | 17.87 | 37.81
Apprenticeship or A/AS level | 5584 | 17.40 | 17.46 | 55.28
Degree or diploma | 12948 | 40.34 | 40.49 | 95.77
Other | 1353 | 4.21 | 4.23 | 100.00
<NA> | 124 | 0.39 | <NA> | <NA>
Respondent employment status (employ) <categorical>
# total N=32101 valid N=32067 mean=1.82 sd=0.97
Value | N | Raw % | Valid % | Cum. %
--------------------------------------------------------
Employed | 18578 | 57.87 | 57.93 | 57.93
Unemployed | 605 | 1.88 | 1.89 | 59.82
Economically inactive | 12884 | 40.14 | 40.18 | 100.00
<NA> | 34 | 0.11 | <NA> | <NA>
Person was born in the UK (bornuk) <categorical>
# total N=32101 valid N=32095 mean=1.85 sd=0.36
Value | N | Raw % | Valid % | Cum. %
----------------------------------------------
Born abroad | 4960 | 15.45 | 15.45 | 15.45
Born in UK | 27096 | 84.41 | 84.42 | 99.88
Refused | 38 | 0.12 | 0.12 | 100.00
Don't Know | 1 | 0.00 | 0.00 | 100.00
<NA> | 6 | 0.02 | <NA> | <NA>
Q1.3 | Focusing on the variable personal income (income
), what are the sociodemographic characteristics of the sample members who did not provide a substantive answer to this question, compared with those who did?
a) Generate the variable income_miss
that takes “Missing” if the income variable is <NA>
, Refused
or Don't know
and “Non-missing” otherwise. Use the function mutate
and fct_collapse
to create the new variable. One issue with this recode is how to refer to the cases that are <NA>, since NA is not itself a level of the factor. The function fct_na_value_to_level
can be used to set the missing values as a level of the factor. See the example below to learn how it works.
forcats
package
The forcats
package provides a set of functions to work with factors in R.
- fct_recode() can be used to recode the levels of a factor.
- fct_collapse() can be used to collapse the levels of a factor into new levels.
Sometimes we need to manipulate NA
values as a level of the factor. The function fct_na_value_to_level()
can be used to set the missing values as a level of the factor (see example below).
The forcats
package is part of tidyverse and all its functions start with fct_
. The forcats functions used to transform variables can be called within mutate()
. More information is available at forcats.
# EXAMPLE: recode the variable education (edu) to a factor with three levels:
# "Degree", "No degree" and "Missing". The misisng category shoul include those
# with NA and those responding "Other" to the education question. This is easier
# to do with fct_collapse() function.
# 1. Set the missing values as a level of the factor
table(df$edu, useNA = "always")
None O level/GCSE
6379 5713
Apprenticeship or A/AS level Degree or diploma
5584 12948
Other <NA>
1353 124
levels(df$edu) # the NA is not a level of the factor
[1] "None" "O level/GCSE"
[3] "Apprenticeship or A/AS level" "Degree or diploma"
[5] "Other"
# we need to set NA as a level of the factor to manipulate it.
# The function fct_na_value_to_level() can do this.
df <- df %>%
  mutate(edu = fct_na_value_to_level(edu)) # set the missing values as a level
# check that now NA is a level so we can manipulate it.
levels(df$edu)
[1] "None" "O level/GCSE"
[3] "Apprenticeship or A/AS level" "Degree or diploma"
[5] "Other" NA
# 2. Use fct_collapse() to recode the variable.
df <- df %>%
  mutate(edu3 = fct_collapse(edu,
                             Degree = "Degree or diploma",
                             Missing = c(NA, "Other"),
                             other_level = "No degree"
                             ))
table(df$edu, df$edu3, useNA = "always")
Degree Missing No degree <NA>
None 0 0 6379 0
O level/GCSE 0 0 5713 0
Apprenticeship or A/AS level 0 0 5584 0
Degree or diploma 12948 0 0 0
Other 0 1353 0 0
<NA> 0 124 0 0
💡 Solution
df <- df %>%
  mutate(income = fct_na_value_to_level(income), ## set the missing values NA as a level
         income_miss = fct_collapse(income,
                                    Missing = c("Refused",
                                                "Don't Know",
                                                NA),
                                    other_level = "Non-missing"))
## check the variable
table(df$income, df$income_miss, useNA = "always")
Missing Non-missing <NA>
Under 5,000 0 1042 0
5,000-9,999 0 2938 0
10,000-14,999 0 3606 0
15,000-19,999 0 3177 0
20,000-24,999 0 2763 0
25,000-29,999 0 2336 0
30,000-34,999 0 1880 0
35,000-39,999 0 1772 0
40,000-44,999 0 1368 0
45,000-49,999 0 1345 0
50,000-59,999 0 1539 0
60,000-69,999 0 1201 0
70,000 or over 0 3131 0
Refused 2375 0 0
Don't Know 1471 0 0
<NA> 157 0 0
b) Produce a cross-tab of the new variable income_miss
with the variables crimvict
, sex
, ageg
, edu
, employ
, bornuk
and imd_quint
. What are the sociodemographic characteristics of the sample members who did not provide a substantive answer to the question about their personal income? Could these differences affect the results of the analysis (e.g., regression coefficients)? You can use the sjmisc::flat_table()
function to produce the cross-tab with row or column percentages.
sjmisc::flat_table()
to produce cross-tabs
The function sjmisc::flat_table()
can be used to produce cross-tabs. The function has several arguments to control the output of the table. The argument margin
can be used to produce row or column percentages. The argument show_na
can be used to show the missing values in the table. The argument show_total
can be used to show the total of the table. This function is an alternative to table()
and prop.table()
functions in base R.
More information about sjmisc.
# EXAMPLE: Use flat_table() to produce a cross-tab of two variables
# from the data frame df.
# margin can be 'counts', 'row', 'col' or 'cell'.
sjmisc::flat_table(df, edu, sex, margin = "col")
sex Male Female
edu
None 18.27 21.42
O level/GCSE 15.57 19.87
Apprenticeship or A/AS level 21.47 13.96
Degree or diploma 40.31 40.65
Other 4.38 4.10
💡 Solution
Those not providing information about their income are more likely to be older, to have a lower level of education, to be economically inactive, and to have been born outside the UK. In other words, income is not missing completely at random.
sjmisc::flat_table(df, crimvict, income_miss, margin = "row")
income_miss Missing Non-missing
crimvict
Not a victim 12.70 87.30
Victim 10.52 89.48
sjmisc::flat_table(df, sex, income_miss, margin = "row")
income_miss Missing Non-missing
sex
Male 11.27 88.73
Female 13.52 86.48
sjmisc::flat_table(df, ageg, income_miss, margin = "row")
income_miss Missing Non-missing
ageg
16-24 14.88 85.12
25-34 9.26 90.74
35-44 9.15 90.85
45-54 10.28 89.72
55-64 12.41 87.59
65-74 13.66 86.34
75+ 20.78 79.22
sjmisc::flat_table(df, edu, income_miss, margin = "row")
income_miss Missing Non-missing
edu
None 19.34 80.66
O level/GCSE 11.04 88.96
Apprenticeship or A/AS level 11.10 88.90
Degree or diploma 9.49 90.51
Other 14.49 85.51
sjmisc::flat_table(df, employ, income_miss, margin = "row")
income_miss Missing Non-missing
employ
Employed 8.90 91.10
Unemployed 10.91 89.09
Economically inactive 17.47 82.53
sjmisc::flat_table(df, bornuk, income_miss, margin = "row")
income_miss Missing Non-missing
bornuk
Born abroad 16.49 83.51
Born in UK 11.61 88.39
Refused 86.84 13.16
Don't Know 0.00 100.00
Q1.4 | How to deal with missing values? Alternatives available for the analyst
- Listwise deletion or complete case analysis. The analyst can decide to exclude all cases with missing values in any of the variables involved in the analysis. For example, if you are producing cross-tabs using sex, age and income, all cases that have a missing value for any of the three variables will be excluded from the analysis (see the sketch after this list).
- Pairwise deletion. The analyst can decide to exclude cases with missing values only in the variables involved in each specific analysis. In the scenario above, only the cases with missing values at sex or age would be excluded from the sex~age cross-tab, whilst only the cases with missing values at age or income would be excluded from the age~income cross-tab. As a result, each of the cross-tabs will involve a different number of cases. Both deletion approaches can lead to biased estimates if the missing values are not missing completely at random (MCAR), i.e. if those excluded due to missingness are different from those retained in the analysis.
- Imputation. There is a third, more complex set of alternatives that use statistical methods to estimate the likely response of the respondents who did not provide a valid answer. This is called imputation. There are several methods to impute missing values, such as mean imputation, regression imputation, multiple imputation, etc. You can learn more about them in van Buuren (2018).
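As a quick illustration, here is a minimal sketch (using hypothetical variables x, y and z rather than columns of our dataset) of how the two deletion approaches play out in R:

# Listwise deletion: keep only the rows that are complete on ALL analysis variables
complete_df <- df %>% drop_na(x, y, z)

# Pairwise deletion: each table drops only the rows that are missing on the
# variables it uses, so the number of cases can differ from one table to the next
table(df$x, df$y)   # table() excludes NA values by default
table(df$y, df$z)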
Since our objective is to estimate a model, we will use a complete case (listwise) analysis. This means that we will exclude all cases with missing values in any of the variables involved in the analysis: crimvict
, sex
, ageg
, edu
, employ
, bornuk
, income
and imd_quint.
First, we need to set to missing (<NA>) the "Refused" and "Don't know" levels of some of these variables. The function fct_na_level_to_value() can be used to convert one or more levels of a factor into missing values.
<NA>
in R
There are some alternatives to set values as missing in R:
- You can use base R to identify the cases that you want to set as missing and then use the assignment operator <- to set the value as missing. For example, df$income[df$income == "Refused"] <- NA will set the value "Refused" as missing in the variable income.
- The function na_if() can be used to set a value as missing. For example, na_if(df$income, "Refused") will set the value "Refused" as missing in the variable income. The function na_if() can be used within mutate().
- The forcats package provides the function fct_na_level_to_value() to set one or more levels of a factor as missing values. For example, fct_na_level_to_value(df$income, "Refused") will set the value "Refused" as missing in the variable income. The function can be used within mutate().
Second, we will create an indicator variable that will identify the cases that are complete. The function add_any_miss()
from the package naniar
can be used to create this indicator. The indicator will be used to exclude the cases with missing values in the analysis.
add_any_miss()
The naniar package provides a set of functions to explore and visualise missing data in R. The package can be used to identify the patterns of missingness in the data. The function add_any_miss()
can be used to create an indicator variable that identifies the cases with missing values. The function add_miss_ind()
can be used to create an indicator variable for each variable in the data frame.
Here is an example of how to use the function add_any_miss()
:
# This function will produce a new column in the dataset called {label}_all or {label}_vars that will be "complete" if the case has valid levels for all the variables in the list, and "missing" otherwise.
# add_any_miss(
# data,
# ...,
# label = "any_miss",
# missing = "missing",
# complete = "complete"
# )
df %>%
  select(imd_quint:crimvict) %>%
  naniar::add_any_miss(., imd_quint:crimvict, label = "listwise")
# A tibble: 32,101 × 12
imd_quint income sex ageg edu employ bornuk strata psu weight
<fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl>
1 Q4 5,000… Male 45-54 Degr… Econo… Born … 30011 1.12e7 0.671
2 Q2 70,00… Male 35-44 Degr… Emplo… Born … 30011 1.12e7 1.40
3 Q5 (Less deprive… 25,00… Male 55-64 Appr… Emplo… Born … 30011 1.12e7 0.615
4 Q1 (Most deprive… 20,00… Male 16-24 None Emplo… Born … 30011 1.12e7 1.77
5 Q4 10,00… Male 16-24 Appr… Emplo… Born … 30011 1.12e7 1.60
6 Q5 (Less deprive… 30,00… Fema… 45-54 Degr… Emplo… Born … 30011 1.12e7 1.17
7 Q2 40,00… Male 35-44 Degr… Emplo… Born … 30011 1.12e7 2.80
8 Q2 10,00… Male 55-64 None Emplo… Born … 30011 1.12e7 1.18
9 Q3 15,00… Fema… 35-44 Degr… Emplo… Born … 30011 1.12e7 0.596
10 Q4 15,00… Male 55-64 None Emplo… Born … 30011 1.12e7 1.73
# ℹ 32,091 more rows
# ℹ 2 more variables: crimvict <fct>, listwise_vars <chr>
More information about the package is available at naniar.
💡 Solution
In the next chunk of code we set the non-substantive levels ("Refused" and "Don't Know") of the variables income
and bornuk
as <NA>
.
## Set factor levels as NA for income and bornuk
df <- df %>%
  mutate(income = fct_na_level_to_value(income, c("Refused", "Don't Know")),
         bornuk = fct_na_level_to_value(bornuk, c("Refused", "Don't Know")))
## check that the non-valid levels are set as NA
df %>%
  select(income, bornuk) %>%
  sjmisc::frq()
What is your personal (and partners) gross income (income) <categorical>
# total N=32101 valid N=28098 mean=6.53 sd=3.73
Value | N | Raw % | Valid % | Cum. %
------------------------------------------------
Under 5,000 | 1042 | 3.25 | 3.71 | 3.71
5,000-9,999 | 2938 | 9.15 | 10.46 | 14.16
10,000-14,999 | 3606 | 11.23 | 12.83 | 27.00
15,000-19,999 | 3177 | 9.90 | 11.31 | 38.31
20,000-24,999 | 2763 | 8.61 | 9.83 | 48.14
25,000-29,999 | 2336 | 7.28 | 8.31 | 56.45
30,000-34,999 | 1880 | 5.86 | 6.69 | 63.14
35,000-39,999 | 1772 | 5.52 | 6.31 | 69.45
40,000-44,999 | 1368 | 4.26 | 4.87 | 74.32
45,000-49,999 | 1345 | 4.19 | 4.79 | 79.11
50,000-59,999 | 1539 | 4.79 | 5.48 | 84.58
60,000-69,999 | 1201 | 3.74 | 4.27 | 88.86
70,000 or over | 3131 | 9.75 | 11.14 | 100.00
<NA> | 4003 | 12.47 | <NA> | <NA>
Person was born in the UK (bornuk) <categorical>
# total N=32101 valid N=32056 mean=1.85 sd=0.36
Value | N | Raw % | Valid % | Cum. %
----------------------------------------------
Born abroad | 4960 | 15.45 | 15.47 | 15.47
Born in UK | 27096 | 84.41 | 84.53 | 100.00
<NA> | 45 | 0.14 | <NA> | <NA>
## Complete case analysis indicator
df <- df %>%
  naniar::add_any_miss(., imd_quint:crimvict, label = "listwise")
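A quick optional check of the new indicator: how many cases are complete and will therefore enter the complete case analysis?

# Check the distribution of the complete case indicator created above
table(df$listwise_vars)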
2. Exploring the relationship between area deprivation and crime victimisation
Survey samples are designed in different and complex ways. Some important design features are:
- Clustering: households or individuals are sometimes clustered in geographical units (e.g., postcodes or local authorities). Clusters can be sampled to decrease the costs of the survey. The first set of clusters that are selected are called primary sampling units (PSUs).
- Stratification: the sampling frame is divided into mutually exclusive groups (strata) that are internally more homogeneous than the population as a whole. The sample is then selected from each stratum.
- Weights: weights are used to adjust for non-response and to make the sample representative of the population.
All these features can impact the estimates (e.g., mean, proportion, regression coefficients) and their standard errors (i.e., the precision of the estimate). The weights can impact both the point estimates and the standard errors, whilst the clustering and stratification can impact the standard errors only.
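As a quick preview of the survey package functions introduced below, here is a minimal sketch (with hypothetical objects survey_df, cluster, strat, wt and var1) of where each design feature enters:

# Clustering and stratification only change the standard error;
# the weights change both the point estimate and the standard error.
svy_unweighted <- svydesign(id = ~cluster, strata = ~strat, data = survey_df) # warns that equal probabilities are assumed
svy_full <- svydesign(id = ~cluster, strata = ~strat, weights = ~wt, data = survey_df)
svymean(~var1, svy_unweighted, na.rm = TRUE) # unweighted estimate, design-adjusted SE
svymean(~var1, svy_full, na.rm = TRUE)       # weighted estimate, design-adjusted SE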
As an analyst, you will need to read the survey documentation to understand the sample design and the weights used in the survey. The survey
package in R provides functions to produce estimates that take the complex sample design and weights into account.
Exercise 2. Using the survey
package: bivariate analysis and fitting a logistic regression model
The second step in the analysis is to explore the bivariate and multivariate relationships among the variables. The objective is to produce a series of cross-tabs and a logistic regression model that allow us to understand the relationship between area deprivation and crime victimisation. In this exercise, you will use the survey package to estimate cross-tabs and fit a logistic regression model. The box below provides some basic information for getting started with the survey
package.
survey
package
The survey
package in R provides functions to estimate statistics taking into account the complex sample design and weights. When using the survey package the first step is to create a survey object using the function svydesign()
. These are some of the main arguments of the function:
- id: the variable that identifies the primary sampling unit (PSU) or cluster.
- data: the data frame with the survey data.
- strata: the variable that identifies the strata.
- weights: the variable that contains the weights.
Note this package is formula-based, so you will use ~
to specify the variables (See example below).
# EXAMPLE: Set the complex sample design using the survey package
## The survey documentation can provide you with the name of the variables
## relevant to specify the survey design.
## Note this package is formula based, so you will use "~" quite a
## few times, for example, to specify the variables.
## Let's assume the variable indicating the:
## - PSU/clustering is `cluster` (if the sample is unclustered use "~1")
## - Strata is `strat`
## - Weights is `wt`
## - Data frame is `survey_df`
df_svy <- svydesign(id = ~cluster,
                    data = survey_df,
                    strata = ~strat,
                    weights = ~wt)
Then, this object will be the basis for calculating all the statistics. Throughout the rest of the handout I will walk you through different functions of the package.
# EXAMPLE: Calculate the mean (and standard error) of a variable using the survey package
svymean(~var1,
        df_svy,
        na.rm = TRUE) # remove missing values, as in the mean() function
More information about the package is available at survey.
Q2.1 | Set the complex sample design using the survey
package.
Look for the variables that identify the different elements of the complex sample design (i.e., clustering, stratification and weights) in the data frame and create a new survey design object df_svy
using the function svydesign()
(see box above).
💡 Solution
In this dataset, the names of the variables were helpful to identify them. The variable psu
corresponds to the primary sampling units. The variable strata
identifies the strata used to select the sample design. Finally, the variable weight
identifies the non-response weight.
df_svy <- svydesign(id = ~psu,
                    data = df,
                    strata = ~strata,
                    weights = ~weight)
df_svy
Stratified 1 - level Cluster Sampling design (with replacement)
With (6130) clusters.
svydesign(id = ~psu, data = df, strata = ~strata, weights = ~weight)
Q2.2 | What are the demographic characteristics of those who have been victims of a crime in the last 12 months?
Produce a bivariate analysis of the relationship between the dependent variable crimvict
and the independent variables sex
, ageg
, edu
, employ
, bornuk
, income
and imd_quint
. Use the function svytable()
to produce the cross-tabs.
survey
package
The function svytable()
can be used to produce cross-tabs accounting for the survey sample design. The function has several arguments to control the output of the table. The function can be wrapped in summary()
to produce a summary of the table that includes an adjusted chi-square test.
# EXAMPLE: svytable object to produce a cross-tab of var1 and var2
## svytable(formula, design)
## formula: ~var1+var2
## design: the survey object created with svydesign()
tab_var1_var2 <- svytable(~var1+var2,
                          df_svy)
## This will print an adjusted chi-square test result
summary(tab_var1_var2)
The function prop.table()
can be used to produce the table with row or column percentages.
# EXAMPLE: produce a cross-tab with column percentages
## use the svytable object tab_var1_var2
## margin = 1 shows row percentages; margin = 2 shows column percentages;
prop.table(tab_var1_var2, margin = 2)*100
Earlier we decided to use a listwise (complete cases) approach, so the base for this analysis will only be those who did not have missing values in any of the variables involved. We also computed the variable listwise_vars, which identifies the complete cases (listwise_vars == "complete"). You can use this variable to subset the observations in the survey design object before producing the cross-tabs.
survey
design object: subset()
Sometimes you will need to filter the observations in the survey object before producing the statistics. The function subset()
can be used to filter the observations in the survey object.
# EXAMPLE: subset cases IF var1 == "Yes" AND var2 == "Married"
## subset(design, subset)
df_svy_sub <- subset(df_svy, var1 == "Yes" & var2 == "Married")
# EXAMPLE: subset can also be wrapped in a function
svymean(~var3, subset(df_svy, var1 == "Yes" & var2 == "Married"))
💡 Solution
Citizens living in more deprived areas, younger respondents, those with higher levels of education, and the unemployed are more likely to be victims of a crime and to report it in the survey.
# For each independent variable:
# 1) produce a cross-tab with the dependent variable crimvict
# using only the subset of complete cases listwise_vars == "complete"
# 2) produce the percentage table prop.table()*100
# 3) produce the summary of the table that includes an adjusted chi-square test
# see below for a much less verbose alternative using lapply()
# tab_imd_crimvict <- svytable(~imd_quint+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_imd_crimvict, margin = 1)*100
# summary(tab_imd_crimvict)
#
# tab_income_crimvict <- svytable(~income + crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_income_crimvict, margin = 1)*100
# summary(tab_income_crimvict)
#
# tab_sex_crimvict <- svytable(~sex+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_sex_crimvict, margin = 1)*100
# summary(tab_sex_crimvict)
#
# tab_ageg_crimvict <- svytable(~ageg+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_ageg_crimvict, margin = 1)*100
# summary(tab_ageg_crimvict)
#
# tab_edu_crimvict <- svytable(~edu+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_edu_crimvict, margin = 1)*100
# summary(tab_edu_crimvict)
#
# tab_employ_crimvict <- svytable(~employ+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_employ_crimvict, margin = 1)*100
# summary(tab_employ_crimvict)
#
# tab_bornuk_crimvict <- svytable(~bornuk+crimvict, subset(df_svy, listwise_vars == "complete"))
# prop.table(tab_bornuk_crimvict, margin = 1)*100
# summary(tab_bornuk_crimvict)
## a substantially less verbose way to produce these tables and stats
## extract colnames of the indep. vars
colnam <- df %>%
  select(imd_quint:bornuk) %>%
  colnames()
## loop over the colnames to extract the percentage tables and stats
## function(x): [1] creates the svytable object tab
## [2] stores in a list the percentage table and the stats (chi-square test)
## [3] returns the list
lapply(colnam, function(x) {
  tab <- svytable(as.formula(paste0("~", x, "+crimvict")), subset(df_svy, listwise_vars == "complete"))

  output <- list(tab = prop.table(tab, margin = 1)*100,
                 stats = summary(tab)[[2]])
  output
})
[[1]]
[[1]]$tab
crimvict
imd_quint Not a victim Victim
Q1 (Most deprived areas) 85.593535 14.406465
Q2 86.236368 13.763632
Q3 88.207875 11.792125
Q4 90.061430 9.938570
Q5 (Less deprived areas) 91.056566 8.943434
[[1]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 18.346, ndf = 3.9802, ddf = 23423.2805, p-value = 5.337e-15
[[2]]
[[2]]$tab
crimvict
income Not a victim Victim
Under 5,000 86.33730 13.66270
5,000-9,999 88.62860 11.37140
10,000-14,999 89.88413 10.11587
15,000-19,999 88.70193 11.29807
20,000-24,999 89.43096 10.56904
25,000-29,999 88.49583 11.50417
30,000-34,999 87.43539 12.56461
35,000-39,999 87.67279 12.32721
40,000-44,999 87.60582 12.39418
45,000-49,999 89.44856 10.55144
50,000-59,999 86.86598 13.13402
60,000-69,999 87.15076 12.84924
70,000 or over 87.49570 12.50430
[[2]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 1.6421, ndf = 11.579, ddf = 68145.147, p-value = 0.07587
[[3]]
[[3]]$tab
crimvict
sex Not a victim Victim
Male 87.98230 12.01770
Female 88.54305 11.45695
[[3]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 1.4481, ndf = 1, ddf = 5885, p-value = 0.2289
[[4]]
[[4]]$tab
crimvict
ageg Not a victim Victim
16-24 83.964330 16.035670
25-34 86.323446 13.676554
35-44 87.018210 12.981790
45-54 87.012241 12.987759
55-64 89.148940 10.851060
65-74 92.979486 7.020514
75+ 95.464105 4.535895
[[4]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 34.951, ndf = 5.3278, ddf = 31354.2051, p-value < 2.2e-16
[[5]]
[[5]]$tab
crimvict
edu Not a victim Victim
None 91.337345 8.662655
O level/GCSE 88.311435 11.688565
Apprenticeship or A/AS level 87.102063 12.897937
Degree or diploma 87.424120 12.575880
Other 90.080638 9.919362
[[5]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 10.175, ndf = 3.9746, ddf = 23390.4897, p-value = 3.446e-08
[[6]]
[[6]]$tab
crimvict
employ Not a victim Victim
Employed 86.97212 13.02788
Unemployed 84.44346 15.55654
Economically inactive 91.11154 8.88846
[[6]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 34.37, ndf = 1.9489, ddf = 11469.2804, p-value = 2.826e-15
[[7]]
[[7]]$tab
crimvict
bornuk Not a victim Victim
Born abroad 87.79636 12.20364
Born in UK 88.35912 11.64088
[[7]]$stats
Pearson's X^2: Rao & Scott adjustment
data: svychisq(as.formula(paste0("~", x, "+crimvict")), design = subset(df_svy, listwise_vars == "complete"), statistic = "F")
F = 0.78986, ndf = 1, ddf = 5885, p-value = 0.3742
Q2.3 | What is the relationship between area deprivation and crime victimisation? What are the other significant predictors of crime victimisation?
Fit a logistic regression model to estimate the relationship between area deprivation and crime victimisation. The model should include the independent variables sex
, ageg
, edu
, employ
, bornuk
, income
and imd_quint
. Use the function svyglm()
to fit the model.
survey
package: svyglm()
The function svyglm()
can be used to fit a logistic regression model accounting for the survey sample design and weights. The function has several arguments to control the output of the model. The function can be wrapped in summary()
to produce a summary of the model that includes the coefficients and their standard errors.
# EXAMPLE:
## svyglm(formula, design, family)
## formula: depvar ~ var1 + var2 (the dependent variable goes on the left-hand side)
## design: the survey object created with svydesign()
## family: the family of the model, e.g., quasibinomial for logistic regression
model <- svyglm(depvar ~ var1 + var2,
                df_svy,
                family = quasibinomial)
💡 Solution
People living in more deprived areas are more likely to suffer and report a crime. The other significant predictors of crime victimisation are age (younger people are more likely to be victims of crime) and education (people with higher levels of education are more likely to be victims than those with no qualifications).
model_csd <- svyglm(crimvict ~ imd_quint + income + sex + ageg + edu + employ + bornuk,
                    design = df_svy,
                    family = quasibinomial)
summary(model_csd)
Call:
svyglm(formula = crimvict ~ imd_quint + income + sex + ageg +
edu + employ + bornuk, design = df_svy, family = quasibinomial)
Survey design:
svydesign(id = ~psu, data = df, strata = ~strata, weights = ~weight)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.69874 0.15382 -11.044 < 2e-16 ***
imd_quintQ2 -0.07694 0.07201 -1.068 0.28535
imd_quintQ3 -0.24239 0.07635 -3.175 0.00151 **
imd_quintQ4 -0.44850 0.07987 -5.615 2.06e-08 ***
imd_quintQ5 (Less deprived areas) -0.56583 0.08676 -6.521 7.55e-11 ***
income5,000-9,999 0.06251 0.13754 0.454 0.64951
income10,000-14,999 -0.01975 0.13797 -0.143 0.88616
income15,000-19,999 0.06903 0.14023 0.492 0.62256
income20,000-24,999 -0.03687 0.14352 -0.257 0.79729
income25,000-29,999 0.06564 0.14647 0.448 0.65407
income30,000-34,999 0.15036 0.15268 0.985 0.32475
income35,000-39,999 0.12574 0.15016 0.837 0.40241
income40,000-44,999 0.13242 0.15222 0.870 0.38440
income45,000-49,999 -0.06427 0.15925 -0.404 0.68653
income50,000-59,999 0.19541 0.15240 1.282 0.19982
income60,000-69,999 0.16845 0.16090 1.047 0.29516
income70,000 or over 0.17134 0.14169 1.209 0.22660
sexFemale -0.03049 0.04617 -0.660 0.50907
ageg25-34 -0.24696 0.09302 -2.655 0.00795 **
ageg35-44 -0.27293 0.09149 -2.983 0.00286 **
ageg45-54 -0.25342 0.09412 -2.692 0.00711 **
ageg55-64 -0.42682 0.09815 -4.349 1.39e-05 ***
ageg65-74 -0.83235 0.11187 -7.440 1.15e-13 ***
ageg75+ -1.24144 0.13934 -8.910 < 2e-16 ***
eduO level/GCSE 0.09452 0.08396 1.126 0.26032
eduApprenticeship or A/AS level 0.22773 0.08338 2.731 0.00633 **
eduDegree or diploma 0.23164 0.08012 2.891 0.00385 **
eduOther 0.01968 0.14149 0.139 0.88936
employUnemployed 0.11993 0.15848 0.757 0.44925
employEconomically inactive -0.02075 0.06653 -0.312 0.75513
bornukBorn in UK 0.08278 0.06479 1.278 0.20141
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 0.9907107)
Number of Fisher Scoring iterations: 5
exp(coef(model_csd))
(Intercept) imd_quintQ2
0.1829146 0.9259447
imd_quintQ3 imd_quintQ4
0.7847516 0.6385847
imd_quintQ5 (Less deprived areas) income5,000-9,999
0.5678905 1.0645035
income10,000-14,999 income15,000-19,999
0.9804409 1.0714655
income20,000-24,999 income25,000-29,999
0.9638048 1.0678405
income30,000-34,999 income35,000-39,999
1.1622578 1.1339872
income40,000-44,999 income45,000-49,999
1.1415842 0.9377524
income50,000-59,999 income60,000-69,999
1.2158042 1.1834728
income70,000 or over sexFemale
1.1868982 0.9699723
ageg25-34 ageg35-44
0.7811681 0.7611436
ageg45-54 ageg55-64
0.7761380 0.6525815
ageg65-74 ageg75+
0.4350267 0.2889675
eduO level/GCSE eduApprenticeship or A/AS level
1.0991306 1.2557478
eduDegree or diploma eduOther
1.2606679 1.0198787
employUnemployed employEconomically inactive
1.1274132 0.9794646
bornukBorn in UK
1.0863016
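If you also want a measure of uncertainty around these odds ratios, a small optional extension (not required by the exercise) is to exponentiate the design-based confidence intervals:

# Optional: 95% confidence intervals for the odds ratios
exp(confint(model_csd))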
Q2.4 | How does the complex sample design of the survey affect your coefficient estimates compared to using a simple logistic regression model?
In this final question you are asked to reproduce the model above but using a new survey object df_nosvy
that does not take into account the complex sample design. Compare the coefficient estimates of the model with and without the complex sample design.
💡 Solution
df_nosvy <- svydesign(id = ~1,
                      data = df)
Warning in svydesign.default(id = ~1, data = df): No weights or probabilities
supplied, assuming equal probability
model_nocsd <- svyglm(crimvict ~ imd_quint + income + sex + ageg + edu + employ + bornuk,
                      design = df_nosvy,
                      family = quasibinomial)
exp(cbind(OR = coef(model_csd), coef(model_nocsd)))
OR
(Intercept) 0.1829146 0.1964295
imd_quintQ2 0.9259447 0.9424983
imd_quintQ3 0.7847516 0.7336915
imd_quintQ4 0.6385847 0.6420192
imd_quintQ5 (Less deprived areas) 0.5678905 0.5738531
income5,000-9,999 1.0645035 1.0154513
income10,000-14,999 0.9804409 0.9521272
income15,000-19,999 1.0714655 0.9691949
income20,000-24,999 0.9638048 0.8746230
income25,000-29,999 1.0678405 0.9913687
income30,000-34,999 1.1622578 0.9827621
income35,000-39,999 1.1339872 1.1054911
income40,000-44,999 1.1415842 1.1101780
income45,000-49,999 0.9377524 0.8895463
income50,000-59,999 1.2158042 1.0981366
income60,000-69,999 1.1834728 1.0227561
income70,000 or over 1.1868982 1.0864087
sexFemale 0.9699723 0.9699128
ageg25-34 0.7811681 0.8026176
ageg35-44 0.7611436 0.7450115
ageg45-54 0.7761380 0.7500114
ageg55-64 0.6525815 0.6087448
ageg65-74 0.4350267 0.4367209
ageg75+ 0.2889675 0.2740949
eduO level/GCSE 1.0991306 1.0912705
eduApprenticeship or A/AS level 1.2557478 1.2429151
eduDegree or diploma 1.2606679 1.2974064
eduOther 1.0198787 0.9180863
employUnemployed 1.1274132 0.9551244
employEconomically inactive 0.9794646 0.9914116
bornukBorn in UK 1.0863016 1.0903409
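Since the question above focuses on the coefficients, a possible follow-up (a short sketch reusing the two models fitted above) is to also compare the standard errors, which is where the clustering and stratification mainly show up:

# Optional: compare the standard errors of the design-based and naive models
round(cbind(SE_design = sqrt(diag(vcov(model_csd))),
            SE_naive = sqrt(diag(vcov(model_nocsd)))), 3)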