Using the scripts

Each code list in the repository has (or will soon have) an associated R script and Stata do file that demonstrate one method of implementing each code list. You are, of course, free to use and adapt these as you wish. Here we provide a brief guide to the scripts and we also provide an example of working with multiple phenotypes. Throughout, we try to demonstrate parsimonious coding, storing in memory as few objects as possible, themselves as small as possible, for example by dropping redundant flag variables and entire objects from memory when no longer needed.

How to find the scripts

The R scripts and Stata do files can be found on each code list page. Some lists may show “Not yet available” while we develop or update the scripts.

A screenshot with a red arrow showing where to download a code list's R script.

Data used

Every script uses the same dataset to demonstrate the code list: HES admitted patient care records for 2019/2020. Mostly, we only use the top 100,000 rows for the sake of speed, though for some we use all available rows where numbers of certain phenotypes were expected to be small. See the ECHILD How To Guides for examples of how to implement these code lists in a more real-world setting.

Packages required

R users will see that the scripts all make use of data.table and RODBC. The package data.table is extremely useful for working with large datasets due to its efficiency and relative simplicity. Its syntax is also slightly simpler than base R. RODBC enables us to connect to SQL Server to load the data into memory.

Script structure

Our standardised approach to formatting code lists means that we can also adopt a standardised method of implementing them (allowing for slight variations where required to address each code list’s idiosyncrasies). Each script therefore has essentially the same structure:

Global settings, such as the working directory are set and packages are loaded.
The data and code list file are loaded.
Converting diagnostic and (where relevant) operation data to long format. We do this because there are up to 20 diagnoses and up to 24 procedures recorded per episode in HES, most of which are actually empty. Converting to long format and removing empty values makes it substantially easier to work with the data and avoids the need of computationally intensive and unnecessary for loops or use of the apply() function.
Identify relevant codes. This is usually done by creating a binary flag that indicates that a given code is in our target code list.
Deal with flags. For example, in the Hardelid et al list of chronic health conditions, some codes are only valid if the admission length of stay is at least three days and others are only valid if the patient is aged at least 10 years. Generally, we deal with these conditions by setting our binary flags to false where these conditions are not met. Once we have done this, we drop all codes that are not validly in our code list.
Create flags in our original data. Finally, we go back to our original dataset (in the case of the example scripts, the 2019/2020 HES data) and create a spine of patients, where one row represents one patient. We then create binary flags that indicate a patient ever had the target phenotype. It is a very straightforward process to extend this logic and only include, for example, codes that occur within a certain observation period.

Working with multiple phenotypes in a more real-world setting

All example scripts only deal with one phenotype at a time. However, users may wish to work with multiple phenotypes at once. If you are looking for help to implement multiple phenotypes and/or want to see implementation in a real-world setting, we recommend you consult the ECHILD How To Guides. The ECHILD How To Guides will you take you through all the steps of creating an analysis-ready dataset, combining NPD and HES, including applying clinical phenotypes.