
Preparation and initial survey analysis
Preparation-and-initial-survey-analysis.RmdThis article shows how to use tools from OMESurvey to check and clean the provided survey data and produce an initial report on ‘missingness’ in the survey
In this article we will show how to prepare and perform initial survey analysis on fake data based on the year 7 pilot study. This fake data has the same structure as those we receive. We show the data for 5 participants (transposed so the table is easier to see).
data = OMESurvey::survey_example
# Remove the themes and add spacing to imagine how the data will arrive)
split = names(data) |> stringr::str_split('__')
names(data) = lapply(split, function(x){if(length(x) == 1){return(x)} ; x[2]}) |> unlist() |> stringr::str_replace_all('_', ' ')Data Cleaning
Our first aim is to clean the data such that the data is ready to be analysed and no further modifications are required. Note that after the first clean we will need to do sanity checks in case there are other issues in the data.
The initial part of cleaning will involve:
- Adding theme / group information to the column names. To using the
tools herein the column name of questions should be in the format
XX__YYwhere XX is the theme/group and YY the question (i.e we use ’__’ double underscore to represent the split between group/theme and the question). - Convert ‘N-A’ to NA.
- Get the school ID from the bar code.
- Create a combined school and class ID (in case multiple schools have the same class ID).
# Clean 1.
# A) Adding theme/group to column names.
group = c('Theme 1' |> rep(6),
'Theme 2'|> rep(7),
'Theme 3'|> rep(12),
'Theme 4'|> rep(4),
'Theme 5'|> rep(12),
'Theme 6'|> rep(7),
'Theme 7'|> rep(6),
'Theme 8'|> rep(7))
names(data)[6:ncol(data)] = paste0(group, '__', names(data)[6:ncol(data)])
names(data) = names(data) |> stringr::str_replace_all(pattern = ' ', replacement = '_')
# B) NAs
for(i in 1:ncol(data)){
data[,i][data[,i] == 'N-A'] = NA
# data[,i] = data[,i] |> stringr::str_to_sentence() # commented out for example.
}
# C + D) School ID and combine school and class information.
school = data$Barcode_ID |> as.character() |> substr(3, 5)
school_and_class = paste0(school,'__', data$`Group_/_class`)
# Combine into a cleaned dataset.
data_clean = data.frame(data[,1:4], school, school_and_class, data[,5:ncol(data)])We show the data for same 5 participants as before (transposed so the table is easier to see).
After the first step of cleaning we see the addition of the school and school_class combined columns new column names and NAs added.
After the first step of cleaning we can now run checks for other
errors in the dataset. To check for issues we can use
check_survey() which checks for barcode issues, unexpected
values in question columns and questions which have multiple answers
(i.e. a value of “99”).
The function returns a list informing if we have any issues of the type listed above.
check = OMESurvey::check_survey(data)
check
#> $barcode
#> $barcode$isIssue
#> [1] FALSE
#>
#> $barcode$issues
#> NULL
#>
#>
#> $question_text
#> $question_text$isIssue
#> [1] TRUE
#>
#> $question_text$issues
#> question Unexpected.value Count Indices
#> 5 Theme_3__Question_21 Neither 2 187, 352
#> 7 Theme_3__Question_22 Almost Never 1 3
#> 11 Theme_5__Question_41 DIsagree a lot 1 252
#> expected
#> 5 Almost always, Most of the time, Half the time, Some of the time, Almost never
#> 7 Almost always, Most of the time, Half the time, Some of the time, Almost never
#> 11 Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#>
#>
#> $multiple_answers
#> $multiple_answers$isIssue
#> [1] TRUE
#>
#> $multiple_answers$issues
#> question Unexpected.value Count Indices
#> 1 Theme_1__Question_3 99 1 344
#> 2 Theme_1__Question_4 99 3 240, 251, 402
#> 3 Theme_2__Question_9 99 2 241, 260
#> 4 Theme_3__Question_20 99 1 56
#> 6 Theme_3__Question_21 99 1 387
#> 8 Theme_5__Question_32 99 2 272, 480
#> 9 Theme_5__Question_35 99 2 154, 177
#> 10 Theme_5__Question_37 99 1 21
#> 12 Theme_6__Question_44 99 2 278, 317
#> 13 Theme_7__Question_52 99 3 26, 128, 335
#> 14 Theme_8__Question_56 99 1 369
#> expected
#> 1 Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 2 Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 3 Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 4 Almost always, Most of the time, Half the time, Some of the time, Almost never
#> 6 Almost always, Most of the time, Half the time, Some of the time, Almost never
#> 8 Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 9 Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 10 Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 12 Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 13 Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 14 Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#>
#> $multiple_answers$barcodes
#> [1] 1200700304 1200700360 1200700413 1200700392 1200700402 1200700410
#> [7] 1200800516 1200700389 1200700423 1200900701 1200800572 1200800585
#> [13] 1200800456 1201000757 1201000896 1200800469 1200800497 1201000908
#> [19] 1200700320We see that most errors occur from casing issues (capital N in
never), since this isn’t case specific to this survey we can fix this
using stringr::str_to_sentence() in the cleaning section
(which is already there just commented out to show the issue). The issue
of ‘Neither’ being used instead of ‘Half the time’ is case specific and
can be fixed for this individual report.
# Run extra cleaning with issues found.
for(i in 1:ncol(data)){
data_clean[,i] = data_clean[,i] |> stringr::str_to_sentence()
}
# Fix neither to Half the time.
data_clean$Theme_3__Question_21[data_clean$Theme_3__Question_21 == 'Neither'] = 'Half the time'If we want to check the survey images for records where we have “99s” we can filter the SharePoint search by the barcodes using the following text
paste0(check$multiple_answers$barcodes, collapse = " OR ")
#> [1] "1200700304 OR 1200700360 OR 1200700413 OR 1200700392 OR 1200700402 OR 1200700410 OR 1200800516 OR 1200700389 OR 1200700423 OR 1200900701 OR 1200800572 OR 1200800585 OR 1200800456 OR 1201000757 OR 1201000896 OR 1200800469 OR 1200800497 OR 1201000908 OR 1200700320"Checking for issues in the ID
As a further precaution we may want to check that the IDs we have in our survey dataset aligns with the information already held for ID and name combinations. This will be useful to track issues in the ID, changes in names or to check for children who used an ‘empty form’ rather then the survey meant for them (is it was missing or misplaced for example).
Within OMESurvey the function
check_IDs() performs the ID checks. An example is shown
below.
# Use a large extract of the data as the 'true' values
true_df = data[1:100,]
# Small subset for our survey results.
data_ex = data[1:10,]
#Change some names and IDs.
data_ex$First_Name[3] = 'Luke'
data_ex$Barcode_ID[7] = data_ex$Barcode_ID[7] |> stringr::str_extract(pattern = '[0-9]{8}')
# Check the IDs using
check_IDs(data = data_ex, true_df = true_df)
#> $length_issue
#> [1] "12008005"
#>
#> $inconsistent_issue
#> Data: ID Data: First Name Data: Surname Potential: ID
#> 1 1200800442 Luke Powell <NA>
#> 2 12008005 Jake Powell 1200800551
#> Potential: First Name Potential: Surname
#> 1 Jake Powell
#> 2 <NA> <NA>Missingness report
Now we have cleaned the data we can create a missingness report to explore if there are empty surveys, which questions are more often missing, etc.
Within OMESurvey we have a pre-written markdown
report which takes survey data as an input and analyses the missingness.
Therefore we only need to run the function
create_missingness_report() to create the report for
whichever survey data we are analysing.
# Create the missingness report.
OMESurvey::create_missingness_report(survey_data = data_clean,
survey_name = 'example survey',
output_dir = getwd()
)Note that by default this creates a .html report where the charts are
interactive (i.e has hover enabled, plus potential other features). This
allows for more informative charts however it comes at the cost of not
been able to edit the document. Therefore using the input
report_kind = 'static' will render the same report except
the output will be in .docx allowing text to be edited/added at the
expense of the interactive plots (where they will be replaced with
static screengrabs).