Primer
By “phenotype” we mean anything observable about a unit of study. In the context of ECHILD, and especially the Hospital Episode Statistics (HES), we are often referring to health conditions. However, a phenotype may be broader: we have admitted to the Repository, for example, code lists that cover births.
In epidemiological terms, phenotyping is part of the process of ascertaining exposures, outcomes or other variables. One of the advantages of datasets such as the HES Admitted Patient Care data is the availability of rich ICD-10 diagnostic and OPCS-4 procedure data, recorded by trained clinical coders in each hospital according to national standards. This information (as well as other fields) can be used to ascertain whether children have particular conditions or other phenotypes.
A code list is therefore a list of ICD-10, OPCS-4 or other codes that can be used to identify records indicative of the target phenotype in administrative data. Within the Repository, all code lists are available to download as machine-readable *.csv files which can be implemented relatively easily. Each list is accompanied by sample R and Stata code that you are free to use, adapt or ignore as you wish.
How do ICD-10 and OPCS-4 work?
As most code lists in the repository primarily make use of ICD-10 and, to a lesser extent, OPCS-4, we provide here a brief overview of how these coding systems work in HES. For full details and the broader context, users can consult the World Health Organization’s browser, which contains the full ICD-10 classification plus links to the user guide, and the NHS Classifications Browser, which hosts an OPCS-4 and ICD-10 browser.
ICD-10
ICD-10 is the 10th edition of the World Health Organization’s International Statistical Classification of Diseases and Related Health Problems. It is a list of 3-character codes arranged in 21 chapters. Every code consists of a letter and two numbers. For example, chapter I (Roman numerals) is entitled “certain infectious and parasitic diseases” and contains codes beginning with either an A or a B (strictly speaking, the letters are called “blocks” though many users colloquially refer to the blocks as chapters). Other chapters contain codes relevant either to disease type (e.g., neoplasms in chapter II, blocks C and D) or body site (e.g., diseases of the eye and adnexa in chapter VII, block H). ICD-10 not only contains diseases, however, but codes indicating circumstances for contact with healthcare. For example, chapter XV (block O) deals with pregnancy, childbirth and the puerperium, chapter XX (blocks V to Y) codes external causes of morbidity, such as injuries, and chapter XXI (block Z) treats factors influencing health status, such as homelessness.
Each 3-character code can be divided into up to ten 4-character subcodes (.0 to .9). For example, the 3-character code J93 encodes pneumothorax. This code has four 4-character subcodes: J93.0 (spontaneous tension pneumothorax), J93.1 (other spontaneous pneumothorax), J93.8 (other pneumothorax) and J93.9 (pneumothorax, unspecified). Subcode .8 usually refers to “other” or overlapping conditions and .9 refers to unspecified forms. Many codes also have inclusion or exclusion terms, indicating how certain entities should be coded. Thus, the following pneumothoraces should not be coded as J93, but as something else: congenital or perinatal (P25.1); traumatic (S27.0); tuberculosis (current disease) (A15 to A16); and pyopneumothorax (J86.-, where the hyphen indicates that a subcode should be recorded).
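As a rough illustration of how the 3-character and 4-character levels relate when matching records against a code list, the following Python sketch compares a recorded code with listed codes at both levels. The function names and the dot-stripping step are assumptions made for this example; they are not taken from the Repository's R or Stata scripts.

```python
from __future__ import annotations


def normalise(code: str) -> str:
    """Upper-case a code and drop the dot, so 'j93.0' becomes 'J930'."""
    return code.strip().upper().replace(".", "")


def in_code_list(recorded: str, code_list: set[str]) -> bool:
    """Match on the 3-character code or on the 4-character subcode."""
    code = normalise(recorded)
    return code[:3] in code_list or code[:4] in code_list


# Illustrative list: every J93 subcode, plus traumatic pneumothorax only.
code_list = {normalise(c) for c in ["J93", "S27.0"]}

print(in_code_list("J93.1", code_list))  # J93 is listed at 3 characters
print(in_code_list("S27.0", code_list))  # listed at 4 characters
print(in_code_list("S27.1", code_list))  # a different S27 subcode
```

Listing a 3-character code captures all of its subcodes, whereas listing a 4-character subcode captures only that subcode; which level is appropriate depends on the target phenotype.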
Special characters: asterisks (*) and daggers (†)
ICD-10 contains a number of special characters, many of which are self-explanatory when reading the classification. Details can be found in the WHO’s user guide. One convention, however, that is not self-evident and is visible in HES is the asterisk (*) and dagger (†) system. These are used in situations where an underlying disease (marked with a dagger in ICD-10 and with a D in HES) is coded along with an additional, optional code for its manifestation (marked with an asterisk in ICD-10 and an A in HES). For this reason, HES users will find ICD-10 codes longer than 4 characters in the data. The code lists currently in the Repository do not make use of this information, so users will see that our example R and Stata scripts simply ignore them.
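One way to set the marker aside before matching is sketched below in Python. It assumes, for illustration only, that codes are stored without a dot and that the D or A marker is the character immediately following the 4-character code; check the actual layout in your own extract. The function name `split_marker` is hypothetical and the example values are illustrative.

```python
from __future__ import annotations

# HES letters described in the text, mapped to the ICD-10 convention.
MARKERS = {"D": "dagger", "A": "asterisk"}


def split_marker(raw: str) -> tuple[str, str | None]:
    """Return the code truncated to 4 characters and any marker found.

    Assumption: the marker, if present, is the fifth character.
    """
    code = raw.strip().upper().replace(".", "")
    base, extra = code[:4], code[4:]
    return base, MARKERS.get(extra[:1])


for value in ["A170D", "G630A", "J930"]:
    print(value, split_marker(value))
```

Keeping only the first element of the returned pair mirrors the approach of ignoring the markers altogether.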
OPCS-4
OPCS-4 (the 4th edition of the OPCS Classification of Interventions and Procedures) is similar to ICD-10 in that it, too, is a list of 3-character codes, each with a letter (designating a block) and two numbers. Each is divided into up to nine 4-character subcodes (.1 to .9). Unlike ICD-10, OPCS-4 was developed in the UK with the intention of supporting work in the NHS. Cross-country comparisons using procedure codes are therefore even more complex than when using ICD-10 codes. An example of such a study is the cross-country investigation of orchidopexy by Jay et al (2020), which made use of ICD-10 codes Q53, Q55.0 and Q55.1 to identify cryptorchidism in all jurisdictions. In the UK, OPCS-4 codes N08 and N09 were used to identify orchidopexy, whereas other procedure coding systems (the Nordic Medico-Statistical Committee, the Canadian Classification of Health Interventions and the Australian Classification of Health Interventions) were used in other countries. Another complication of OPCS-4 is that it is subject to frequent changes, which can sometimes be quite significant in terms of the codes used and whether auxiliary codes are required. Fortunately, tables of equivalences are available (see below).
How HES is coded
Professional clinical coders employed by hospitals code all admissions on patient discharge according to the NHS National Clinical Coding Standards, available via the NHS Classifications Browser. HES users should always be aware that practice in coding can change over time and across hospitals. In our experience, meeting with clinical coders where possible is invaluable in aiding understanding of how coding occurs and possible limitations in using these codes.
ICD-10 and OPCS-4 data files
Users needing ICD-10 or OPCS-4 (including the tables of equivalences for code changes) in spreadsheet format can obtain these from the NHS’s Technology Reference Update Distribution website. R packages and Stata commands for working with ICD-10 codes also exist, though we are not able to advise on their use.
Examples of phenotype code lists
Currently within the Repository, we have a range of code lists that cover different target phenotypes. The Repository is fully searchable and you can consult the index for an overview of what is currently available.
Code lists may cover conditions generically defined (e.g., chronic health conditions) or target specific conditions (e.g., severe congenital heart defects). A code list might also target something other than a health condition, such as births.
To take one example, consider the Hardelid et al list, which targets chronic health conditions generally. The target phenotype is in fact “any health problem likely to require follow-up for more than one year, where follow-up could be repeated hospital admission, specialist consultation through outpatient department visits, medication or use of support services.” The list contains 1,371 distinct ICD-10 codes, which are divided into nine body systems (e.g. respiratory, cardiac) and further into sub-groups. It has been used, for example, in studies of childhood mortality (Hardelid, Dattani and Gilbert, 2014) and of the cumulative incidence of chronic health conditions across childhood (Jay et al., 2024).
Caveats of phenotype code lists
You must be careful to recognise possible limitations in using code lists. First, when using administrative data, you must always consider the possibility of various biases, such as that induced by the fact that patients admitted to hospital are generally in poorer health than those not admitted. This means you are unlikely to detect all cases and, where cases are detected, they are likely to be more severe than those seen in community settings.
You must also consider the sensitivity and specificity of the code list, an assessment which must be made on a case-by-case basis. Sensitivity analyses are often required to expand or narrow the scope of a code list in any given project.
Timing can be difficult. Whereas each patient episode in HES is date stamped, the date of the episode is not necessarily the date when symptoms began or when the diagnosis was first made; a patient may have had a particular condition long before it is first detected in HES. Likewise, it may not be possible to determine if and when a patient recovers from their condition, nor any functional impairments or effects on quality of life.
Each code list also has its idiosyncrasies. For example, in the Hardelid et al list for chronic health conditions, some codes are only valid if the admission is at least 3 days long and others are only valid where the patient is at least ten years old. Each code list in the Repository is formatted in the same way, meaning that identifying these factors (which are also documented on this website) should be straightforward.
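Conditions of this kind can be applied as simple per-code checks. The Python sketch below shows the idea under stated assumptions: the field names `min_stay_days` and `min_age_years` are hypothetical and do not reflect the Repository's actual column names, and the codes are placeholders rather than real entries from any list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListedCode:
    code: str
    min_stay_days: int = 0  # hypothetical flag: minimum length of admission
    min_age_years: int = 0  # hypothetical flag: minimum age at admission


def code_counts(entry: ListedCode, stay_days: int, age_years: int) -> bool:
    """True if the admission satisfies every condition attached to the code."""
    return stay_days >= entry.min_stay_days and age_years >= entry.min_age_years


# Placeholder codes, not real entries from any list.
needs_long_stay = ListedCode("AAA", min_stay_days=3)
needs_older_child = ListedCode("BBB", min_age_years=10)

print(code_counts(needs_long_stay, stay_days=1, age_years=5))    # stay too short
print(code_counts(needs_long_stay, stay_days=3, age_years=5))    # condition met
print(code_counts(needs_older_child, stay_days=1, age_years=9))  # child too young
```

Always take the actual conditions from the documentation of the code list you are using.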
Emergency admissions
One particular caveat to bear in mind is that some code lists are only designed for use with emergency admissions. This is the case with the lists of adversity-related admissions and stress-related presentations (version 1 and version 2). You should therefore also consult the emergency admissions code list that identifies HES admissions as such with reference to admimeth (admission method) values 21 to 25, 28, and 2A to 2D (all of which are documented as such in the HES Technical Output Specification). Note that codes 25 and 2A to 2D were not available before 2013/14, and so you might see papers, such as Herbert et al (2015) where these codes are not used.
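A minimal Python sketch of this rule follows. The set of values is taken from the text above; the function name and the handling of numeric or lower-case input are assumptions for illustration, and the emergency admissions code list in the Repository remains the reference.

```python
# admimeth values identifying emergency admissions, as listed in the text.
EMERGENCY_ADMIMETH = {"21", "22", "23", "24", "25", "28", "2A", "2B", "2C", "2D"}

# Values that were not available before 2013/14.
NOT_AVAILABLE_BEFORE_2013_14 = {"25", "2A", "2B", "2C", "2D"}


def is_emergency(admimeth) -> bool:
    """True if the admission method value denotes an emergency admission."""
    return str(admimeth).strip().upper() in EMERGENCY_ADMIMETH


print(is_emergency("21"))  # emergency
print(is_emergency("2b"))  # emergency, lower case tolerated
print(is_emergency("11"))  # not in the emergency set
```

When comparing results with older studies, it may help to repeat the analysis with the values in `NOT_AVAILABLE_BEFORE_2013_14` excluded.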
Code lists from other countries
ECHILD users should take particular care with code lists developed in other jurisdictions. Firstly, the use of the same codes in different places may vary, depending on factors such as incentives from government and healthcare bodies to record particular conditions or use particular codes. These may be for reimbursement or other service planning reasons that are independent of the underlying epidemiology, presentation or management of any given condition.
Secondly, adaptations are often necessary in order to use the code system in the UK. In the United States and Canada, modified versions of the ICD-10 coding system are often used, the ICD-10-CM and ICD-10-CA, respectively (other modifications also exist though to date no code list in the Repository has used them). These contain some more detailed classifications (5 character codes) than are available in the standard international ICD-10 and are not used in HES. Where such codes are included in code lists in the Repository, they are truncated to the 4 character codes on which they are based, though this may compromise the validity of the code list, or part of it. This is particularly the case for Feudtner et al’s list of complex chronic conditions, which, for example, contains the code Z94.8 twice, once under malignancy and once under gastroenterological conditions. This is because the ICD-10-CM codes Z94.81 (bone marrow transplant status) and Z94.82 (intestine transplant status) are truncated to the standard ICD-10 Z94.8 (other transplanted organ and tissue status, which incorporates both without distinction).
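The effect of truncation can be checked mechanically. The Python sketch below truncates the two ICD-10-CM codes mentioned above and reports any truncated code that ends up in more than one group. The pairing of each code with a group is inferred from the description above, and the function and variable names are assumptions for this example.

```python
from collections import defaultdict


def truncate(code: str) -> str:
    """Drop the dot and keep the first 4 characters, e.g. 'Z94.81' -> 'Z948'."""
    return code.strip().upper().replace(".", "")[:4]


# (ICD-10-CM code, group) pairs; grouping inferred from the text above.
cm_entries = [
    ("Z94.81", "malignancy"),
    ("Z94.82", "gastroenterological"),
]

groups_by_code = defaultdict(set)
for cm_code, group in cm_entries:
    groups_by_code[truncate(cm_code)].add(group)

# Truncated codes that now belong to more than one group.
ambiguous = {code: groups for code, groups in groups_by_code.items() if len(groups) > 1}
print(ambiguous)
```

A code flagged in this way can no longer distinguish between its groups in HES, which is worth bearing in mind when interpreting group-level results.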
Some code lists also use ICD-9 codes and/or ICD-10-PCS (procedure codes). Because these are not used in HES, and therefore not used in ECHILD, they are removed from code lists in this Repository. As it happens, the version of Feudtner et al’s list in the Repository has had 1,097 codes removed and 194 codes truncated (all of which are documented according to our design principles).
Developing a phenotype code list
Precise methods vary, and all users are encouraged to carefully consult the original publications that accompany each phenotype code list in the Repository. Development, however, would normally occur in a manner like the following.
- After identifying, and defining well, the target phenotype, the relevant coding system (e.g., ICD-10) is consulted to identify all candidate codes. This would include scouring all available codes, bearing in mind that relevant codes may appear in more than one chapter or under various subcodes. Data other than diagnoses and procedures may also be relevant, such as in the Zylbersztejn et al births list, which uses a combination of diagnostic and other information to identify maternity admissions.
- There could then be a process of consultation with clinical experts, coders, other researchers and patient and public groups to refine the list. This is to assess face validity as well as to identify additional codes or potential problems with candidate codes.
- Ideally, the list would be validated against a gold standard source, such as original hospital records, though this is not always possible due to practical constraints.
- Finally, further validation occurs through using the code list in the administrative data, including sensitivity analyses with different observation windows and more or less restricted code groups. It can also be useful to examine incidence of codes over time to identify possible changes in coding unexplained by clinical or epidemiological factors.
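The last step, examining how often listed codes appear over time, can be sketched in Python as follows. The records are invented purely for illustration, and the (year, code) layout and the function name are assumptions rather than a description of how HES data are structured.

```python
from collections import Counter


def yearly_counts(records, code_list):
    """Count, per year, the records whose code appears in the code list."""
    return Counter(year for year, code in records if code in code_list)


# Invented (year, code) records for illustration only.
records = [
    (2015, "J930"),
    (2015, "J931"),
    (2016, "J930"),
    (2016, "S270"),
]

counts = yearly_counts(records, {"J930", "J931"})
for year in sorted(counts):
    print(year, counts[year])
```

An abrupt change in such counts from one year to the next, without a clinical or epidemiological explanation, may point to a change in coding practice.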
Updating code lists
Naturally, validation is a complex process that does not end after the initial exercise. Changes in coding practice and healthcare policy can affect the validity and utility of a code list or particular codes in it. It may therefore be necessary to update code lists over time. While it is beyond the scope of the Repository to carry out this updating work, updated lists may be admitted to it. Compare, for example, the code list of stress-related emergency presentations developed by Blackburn et al with that of Ní Chobhthaigh et al. The latter is an update of the former, accounting for the latest research and practice in paediatric mental health.