Skip to contents

This article shows how to use tools from OMESurvey to check and clean the provided survey data and produce an initial report on ‘missingness’ in the survey

In this article we will show how to prepare and perform initial survey analysis on fake data based on the year 7 pilot study. This fake data has the same structure as those we receive. We show the data for 5 participants (transposed so the table is easier to see).

data = OMESurvey::survey_example

# Remove the themes and add spacing to imagine how the data will arrive)
split = names(data) |> stringr::str_split('__')
names(data) = lapply(split, function(x){if(length(x) == 1){return(x)} ; x[2]}) |> unlist() |> stringr::str_replace_all('_', ' ')

Data Cleaning

Our first aim is to clean the data such that the data is ready to be analysed and no further modifications are required. Note that after the first clean we will need to do sanity checks in case there are other issues in the data.

The initial part of cleaning will involve:

  • Adding theme / group information to the column names. To using the tools herein the column name of questions should be in the format XX__YY where XX is the theme/group and YY the question (i.e we use ’__’ double underscore to represent the split between group/theme and the question).
  • Convert ‘N-A’ to NA.
  • Get the school ID from the bar code.
  • Create a combined school and class ID (in case multiple schools have the same class ID).
# Clean 1. 

# A) Adding theme/group to column names.
group = c('Theme 1' |> rep(6),
          'Theme 2'|> rep(7), 
          'Theme 3'|> rep(12), 
          'Theme 4'|> rep(4),
          'Theme 5'|> rep(12), 
          'Theme 6'|> rep(7),
          'Theme 7'|> rep(6), 
          'Theme 8'|> rep(7))             

names(data)[6:ncol(data)] = paste0(group, '__', names(data)[6:ncol(data)]) 
names(data) = names(data) |> stringr::str_replace_all(pattern = ' ', replacement = '_')

# B) NAs
for(i in 1:ncol(data)){
  data[,i][data[,i] == 'N-A'] = NA
  # data[,i] = data[,i] |> stringr::str_to_sentence() # commented out for example.
}

# C + D) School ID and combine school and class information.
school = data$Barcode_ID |> as.character() |>  substr(3, 5)
school_and_class = paste0(school,'__', data$`Group_/_class`) 

# Combine into a cleaned dataset. 
data_clean = data.frame(data[,1:4], school, school_and_class, data[,5:ncol(data)])

We show the data for same 5 participants as before (transposed so the table is easier to see).

After the first step of cleaning we see the addition of the school and school_class combined columns new column names and NAs added.

After the first step of cleaning we can now run checks for other errors in the dataset. To check for issues we can use check_survey() which checks for barcode issues, unexpected values in question columns and questions which have multiple answers (i.e. a value of “99”).

The function returns a list informing if we have any issues of the type listed above.

check = OMESurvey::check_survey(data)
check
#> $barcode
#> $barcode$isIssue
#> [1] FALSE
#> 
#> $barcode$issues
#> NULL
#> 
#> 
#> $question_text
#> $question_text$isIssue
#> [1] TRUE
#> 
#> $question_text$issues
#>                question Unexpected.value Count  Indices
#> 5  Theme_3__Question_21          Neither     2 187, 352
#> 7  Theme_3__Question_22     Almost Never     1        3
#> 11 Theme_5__Question_41   DIsagree a lot     1      252
#>                                                                          expected
#> 5  Almost always, Most of the time, Half the time, Some of the time, Almost never
#> 7  Almost always, Most of the time, Half the time, Some of the time, Almost never
#> 11        Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 
#> 
#> $multiple_answers
#> $multiple_answers$isIssue
#> [1] TRUE
#> 
#> $multiple_answers$issues
#>                question Unexpected.value Count       Indices
#> 1   Theme_1__Question_3               99     1           344
#> 2   Theme_1__Question_4               99     3 240, 251, 402
#> 3   Theme_2__Question_9               99     2      241, 260
#> 4  Theme_3__Question_20               99     1            56
#> 6  Theme_3__Question_21               99     1           387
#> 8  Theme_5__Question_32               99     2      272, 480
#> 9  Theme_5__Question_35               99     2      154, 177
#> 10 Theme_5__Question_37               99     1            21
#> 12 Theme_6__Question_44               99     2      278, 317
#> 13 Theme_7__Question_52               99     3  26, 128, 335
#> 14 Theme_8__Question_56               99     1           369
#>                                                                          expected
#> 1         Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 2         Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 3         Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 4  Almost always, Most of the time, Half the time, Some of the time, Almost never
#> 6  Almost always, Most of the time, Half the time, Some of the time, Almost never
#> 8         Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 9         Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 10        Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 12        Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 13        Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 14        Agree a lot, Agree a little, Neither, Disagree a little, Disagree a lot
#> 
#> $multiple_answers$barcodes
#>  [1] 1200700304 1200700360 1200700413 1200700392 1200700402 1200700410
#>  [7] 1200800516 1200700389 1200700423 1200900701 1200800572 1200800585
#> [13] 1200800456 1201000757 1201000896 1200800469 1200800497 1201000908
#> [19] 1200700320

We see that most errors occur from casing issues (capital N in never), since this isn’t case specific to this survey we can fix this using stringr::str_to_sentence() in the cleaning section (which is already there just commented out to show the issue). The issue of ‘Neither’ being used instead of ‘Half the time’ is case specific and can be fixed for this individual report.

# Run extra cleaning with issues found.
for(i in 1:ncol(data)){
  data_clean[,i] = data_clean[,i] |> stringr::str_to_sentence() 
}

# Fix neither to Half the time.
data_clean$Theme_3__Question_21[data_clean$Theme_3__Question_21 == 'Neither'] = 'Half the time'

If we want to check the survey images for records where we have “99s” we can filter the SharePoint search by the barcodes using the following text

paste0(check$multiple_answers$barcodes, collapse = " OR ")
#> [1] "1200700304 OR 1200700360 OR 1200700413 OR 1200700392 OR 1200700402 OR 1200700410 OR 1200800516 OR 1200700389 OR 1200700423 OR 1200900701 OR 1200800572 OR 1200800585 OR 1200800456 OR 1201000757 OR 1201000896 OR 1200800469 OR 1200800497 OR 1201000908 OR 1200700320"

Checking for issues in the ID

As a further precaution we may want to check that the IDs we have in our survey dataset aligns with the information already held for ID and name combinations. This will be useful to track issues in the ID, changes in names or to check for children who used an ‘empty form’ rather then the survey meant for them (is it was missing or misplaced for example).

Within OMESurvey the function check_IDs() performs the ID checks. An example is shown below.

# Use a large extract of the data as the 'true' values
true_df = data[1:100,]

# Small subset for our survey results.
data_ex  = data[1:10,]
#Change some names and IDs.
data_ex$First_Name[3] = 'Luke'
data_ex$Barcode_ID[7] = data_ex$Barcode_ID[7] |> stringr::str_extract(pattern = '[0-9]{8}')


# Check the IDs using 
check_IDs(data = data_ex, true_df = true_df)
#> $length_issue
#> [1] "12008005"
#> 
#> $inconsistent_issue
#>     Data: ID Data: First Name Data: Surname Potential:  ID
#> 1 1200800442             Luke        Powell           <NA>
#> 2   12008005             Jake        Powell     1200800551
#>   Potential: First Name Potential: Surname
#> 1                  Jake             Powell
#> 2                  <NA>               <NA>

Missingness report

Now we have cleaned the data we can create a missingness report to explore if there are empty surveys, which questions are more often missing, etc.

Within OMESurvey we have a pre-written markdown report which takes survey data as an input and analyses the missingness. Therefore we only need to run the function create_missingness_report() to create the report for whichever survey data we are analysing.

# Create the missingness report.
OMESurvey::create_missingness_report(survey_data = data_clean, 
                                     survey_name = 'example survey',
                                     output_dir = getwd()
                                     )

Note that by default this creates a .html report where the charts are interactive (i.e has hover enabled, plus potential other features). This allows for more informative charts however it comes at the cost of not been able to edit the document. Therefore using the input report_kind = 'static' will render the same report except the output will be in .docx allowing text to be edited/added at the expense of the interactive plots (where they will be replaced with static screengrabs).