Using the scripts

Each code list in the repository has (or will soon have) an associated R script and Stata do file that demonstrate one method of implementing each code list. You are, of course, free to use and adapt these as you wish. Here we provide a brief guide to the scripts and we also provide an example of working with multiple phenotypes. Throughout, we try to demonstrate parsimonious coding, storing in memory as few objects as possible, themselves as small as possible, for example by dropping redundant flag variables and entire objects from memory when no longer needed.

How to find the scripts

The R scripts and Stata do files can be found on each code list page. Some lists may show “Not yet available” while we develop the scripts.

A screenshot with a red arrow showing where to download a code list's R script.

Alternatively, you can download the scripts directly from our GitHub repository.

We are currently working on deployment of the Repository in the Secure Research Service. ECHILD users who need the scripts can e-mail Matthew Jay to find out where they are.

Data used

Every script uses the same dataset to demonstrate the code list: HES admitted patient care records for 2019/2020. Mostly, we only use the top 100,000 rows for the sake of speed, though for some we use all available rows where numbers of certain phenotypes were expected to be small.

Packages required

R users will see that the scripts all make use of data.table and RODBC. The package data.table is extremely useful for working with large datasets due to its efficiency and relative simplicity. Its syntax is also slightly simpler than base R. RODBC enables us to connect to SQL Server to load the data into memory.

Script structure

Our standardised approach to formatting code lists means that we can also adopt a standardised method of implementing them (allowing for slight variations where required to address each code list’s idiosyncrasies). Each script therefore has essentially the same structure:

  1. Global settings, such as the working directory are set and packages are loaded.
  2. The data and code list file are loaded.
  3. Preliminary cleaning. Mostly, this involves removing the dot from ICD-10 and OPCS-4 codes in the code list (as HES omits the dot) and truncating HES diagnosis fields to the first four characters. This is to remove the D and A that are coded in the dagger and asterisk system, which is not relevant to any code lists currently in the Repository.
  4. Converting diagnostic and (where relevant) operation data to long format. We do this because there are up to 20 diagnoses and up to 24 procedures recorded per episode in HES, most of which are actually empty. Converting to long format and removing empty values makes it substantially easier to work with the data and avoids the need of computationally intensive and unnecessary for loops or use of the apply() function.
  5. Identify relevant codes. This is usually done by creating a binary flag that indicates that a given code is in our target code list.
  6. Deal with flags. For example, in the Hardelid et al list of chronic health conditions, some codes are only valid if the admission length of stay is less than three days and others are only valid if the patient is aged at least 10 years. Generally, we deal with these conditions by setting our binary flags to false where these conditions are not met. Once we have done this, we drop all codes that are not validly in our code list.
  7. Create flags in our original data. Finally, we go back to our original dataset (in the case of the example scripts, the 2019/2020 HES data) and create binary flags that indicate a patient ever had the target phenotype. It is a very straightforward process to extend this logic and only include, for example, codes that occur within a certain observation period.

Working with multiple phenotypes

All example scripts only deal with one phenotype at a time. However, users may wish to work with multiple phenotypes at once. We will soon include here example of code that can be used to efficiently implement several phenotypes at once, drawing on research currently underway.