Note: this post is a condensed version of ‘Linkage error and linkage bias: a guide for IDI users’ (Kvalsvig, Gibb & Teng 2019).
This post gives a brief overview of how datasets are linked in the IDI, explains linkage rates, and introduces readers to the potential implications of false links and missed links.
How are datasets linked in the IDI?
The IDI spine aims to capture everyone who has ever been a resident in New Zealand. It is constructed through the linkage of tax records (from 1999 onwards), NZ birth records (from 1920 onwards) and visa records (from 1997 onwards) (Black, 2016).
Other data sources in the IDI are each linked to the IDI spine separately (see more here: https://vhin.co.nz/guides/idi-spine/). For example, health data are linked to the spine, and separately education data are linked to the spine. Health and education data are not directly linked; rather, the link between these records exists through their linkages to the spine. This means that an individual’s health and education records can only be linked to each other if that person is in the IDI spine.
For data from within the same agency, unique identifiers such as the NHI number in health records allow for records to be exactly matched before being linked to the spine.
For linkages to the spine there is no such common identifier. Matching is therefore done using information such as date of birth, sex, and the ‘soundex’ of the first and last name (which allows different spellings of a name that sound the same to be treated as a match). For household surveys, the meshblock of the residential address is also used to link surveys to the spine.
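To make the idea of a name ‘soundex’ concrete, here is a minimal Python sketch of the standard American Soundex code, combined with date of birth and sex into a comparison key. The record layout and the `match_key` helper are hypothetical illustrations. This is not Stats NZ's actual linkage procedure, which this post does not describe in detail.

```python
# Letter-to-digit table for the standard American Soundex code.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ("" if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")     # its code still counts for repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not break a run of equal codes
        code = CODES.get(ch, "")         # vowels (and Y) have no code
        if code and code != prev:
            encoded += code
        prev = code                      # a vowel resets the run
    return (encoded + "000")[:4]         # pad or truncate to 4 characters


def match_key(record: dict) -> tuple:
    """Hypothetical comparison key: name soundex codes, date of birth, sex."""
    return (soundex(record["first_name"]), soundex(record["last_name"]),
            record["dob"], record["sex"])


# Two made-up records with different spellings of the same-sounding name.
rec_a = {"first_name": "Stephen", "last_name": "Smith", "dob": "1980-04-12", "sex": "M"}
rec_b = {"first_name": "Steven", "last_name": "Smyth", "dob": "1980-04-12", "sex": "M"}

print(match_key(rec_a) == match_key(rec_b))
```

Because Soundex keeps the first letter of the name, spellings such as ‘Catherine’ and ‘Katherine’ still receive different codes, which is one way a true match can be missed.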
These linkages are redone every time the IDI is ‘refreshed’. When a refresh happens, data within the IDI are updated (to include newer data) and additional data sources can be included in the IDI. The IDI can be refreshed up to four times a year. Each time the IDI is refreshed, StatsNZ emails IDI users to provide updates on changes in the data available and the quality of the linkage.
What are linkage rates?
Linkage rates refer to the percentage of individuals in the data source being linked who are also found in the spine. The link rate is estimated by:
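Written out from the definition above, with wording that is ours rather than Stats NZ's, the calculation is:

```latex
\text{link rate} =
  \frac{\text{number of individuals in the data source who are linked to the spine}}
       {\text{total number of individuals in the data source}}
  \times 100\%
```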
(Note: the data source being linked could be data from the Ministry of Health, data from the Ministry of Education, a social survey, etc.)
Link rates for the most recent refresh at the time of writing, the September 2019 refresh, are presented below. These are taken from the information provided to IDI users via email by Stats NZ.
The link rate is not an especially good measure of linkage quality. The global link rate varies substantially, from 14% for the MBIE data to over 90% for the census and household surveys (GSS, HLFS, PIAAC, and HES). The reported link rates are at the global data source level: they cover multiple datasets from a single agency, all time periods, and all individuals. For example, the global link rate for MBIE appears very low, but these data include tourists from border movement records. Tourists are not expected to be in the spine, so they pull the link rate down. The link rate is in part a measure of the overlap between the population in the dataset and the population in the spine. If the dataset contains many people who are not in the spine, the link rate will be low even if the linking is perfectly accurate. More information on link rates is available in the IDI linkage report that is published with every IDI refresh (Statistical Methods Statistics New Zealand, 2019).
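The effect of population overlap on the link rate can be seen in a small worked example. The counts below are invented purely for illustration and do not describe any real IDI data source.

```python
def link_rate(n_linked: int, n_total: int) -> float:
    """Percentage of individuals in a data source who are linked to the spine."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_linked / n_total


# Hypothetical data source: 1,000 residents (all on the spine) and
# 4,000 tourists (none of whom are expected to be on the spine).
residents = 1_000
tourists = 4_000

# Assume perfectly accurate linking: every resident is linked, no tourist is.
linked = residents

global_rate = link_rate(linked, residents + tourists)
in_scope_rate = link_rate(linked, residents)

print(global_rate)     # 20.0
print(in_scope_rate)   # 100.0
```

The global rate is only 20% even though no linkage error has occurred, because most people in the data source were never eligible for the spine.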
Common explanations for failure to link individuals to the spine include:
- Individuals in the data source to be linked do not meet the requirements to be included in the spine (they were not registered as born in New Zealand, never worked in New Zealand and never applied for a visa to be in New Zealand), such as residents of Australia, the Cook Islands, Niue or Tokelau visiting New Zealand.
- Too much information was missing, or the quality of the information was inadequate, to link them successfully. A specific example is people (often women) whose surname has changed, for example due to marriage.
- Duplicates in the data source being linked to the spine: the unlinked records could be duplicates of people who have already been linked.
Recently, linkages between household surveys (GSS, HLFS, PIAAC, and HES) and the spine were improved by modifying the linkage method to use each person's entire address history, rather than only their address on 2013 census night. This strategy substantially increased survey link rates (from ~80% to >90%) and reduced age and ethnic differences in these link rates.
To inform IDI users about the quality of the linkage, false positive rates (false links) are reported for each data linkage. Stats NZ aims to keep false positive rates below 2% for each data linkage. The next section provides further information on false links.
Linkage error occurs either when data records from different individuals are erroneously linked (false match/false positive), or when records from the same individual are not linked (missed match/false negative). The table below shows the linkage possibilities for each pair of data sets linked.
There is often a trade-off between the rate of false links and the rate of missed links, as steps to reduce the rate of false links (more stringent matching criteria) may increase the rate of missed links.
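The trade-off can be illustrated with a toy example in which each candidate pair of records has a match score and pairs are linked when the score reaches a threshold. The scores, the true match status and the threshold rule are all assumptions made for illustration; they are not taken from the IDI.

```python
# Each candidate pair: (hypothetical match score, whether it is truly the same person).
pairs = [
    (0.95, True), (0.90, True), (0.80, False), (0.75, True),
    (0.60, True), (0.55, False), (0.40, False), (0.30, True),
]


def linkage_errors(pairs, threshold):
    """Count (false links, missed links) when pairs scoring >= threshold are linked."""
    false_links = sum(1 for score, is_match in pairs
                      if score >= threshold and not is_match)
    missed_links = sum(1 for score, is_match in pairs
                       if score < threshold and is_match)
    return false_links, missed_links


for threshold in (0.50, 0.70, 0.85):
    print(threshold, linkage_errors(pairs, threshold))
# 0.5  -> (2, 1)
# 0.7  -> (1, 2)
# 0.85 -> (0, 3)
```

As the threshold becomes more stringent, false links fall from 2 to 0 while missed links rise from 1 to 3.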
False links (false-positives)
In the IDI, a false link means that an individual’s record from an agency or survey data source (e.g. health data for a health care user) is incorrectly linked to a different person on the spine. If person A is the individual in the health data and person B is the individual on the spine, all the health data belonging to person A will be incorrectly linked to all the other data (education, justice, survey data, etc.) of person B.
This will mean that when looking at health-related data (i.e. utilisation, costs, risk factors and outcomes) for person B, all other admin data may be correctly linked but the health-related data will be for another person.
Missed links (false-negatives)
Missed links occur where an individual’s record from an agency or survey data source (e.g. education data for a student) is not linked to that person on the spine. Instead, it is assumed that these data belong to a different person who does not exist on the spine (i.e. to a never-resident). This new identity and its associated education administrative data are effectively lost to most analyses, and the education information cannot be linked to any other data collections (e.g. health).
This will mean that when looking at education-related data (i.e. achievement, performance and outcomes) for the spine identity, all values will be missing, as if the individual has had no interactions with the education system. The individual's education data will effectively be lost to any analysis that uses only spine-linked individuals.
What are the implications for researchers?
False links will create misclassification errors as (for example) educational achievement for person A is attached to the record for person B. Missed links create missing data and reduce the sample size available to researchers. Missed links can also create misclassification when comparing service users to non-service users (e.g. did not get hospitalised over X period, did not receive X benefit). It is impossible to separate those who legitimately did not use a service from those whose data are missing because of linkage errors.
However, linkage errors themselves are not the major cause for concern for researchers using the IDI. Some level of linkage error is inevitable. Rather, the major cause for concern is if these errors vary by factors that are important in an analysis, such as age, ethnicity, or socioeconomic position. Differential linkage error, called linkage bias, can result in incorrect estimates, potentially leading to incorrect conclusions being drawn from an analysis. Missed links can result in selection bias if subgroups are more or less likely to link, for example if the quality of important linking variables (such as name) varies by a factor that matters in the analysis, such as ethnicity.
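A small worked example shows how differential missed links can produce a spurious difference between groups. All numbers are invented: two groups have the same true hospitalisation rate, but hospital records in group B are assumed to fail to link more often, and people whose records fail to link are counted as not hospitalised.

```python
def observed_rate(population: int, true_events: int, missed_links: int) -> float:
    """Observed event rate (%) when events with missed links look like non-events."""
    return 100 * (true_events - missed_links) / population


# Hypothetical groups, each with 10,000 people on the spine and
# 1,000 people truly hospitalised (a true rate of 10% in both groups).
group_a = observed_rate(population=10_000, true_events=1_000, missed_links=50)
group_b = observed_rate(population=10_000, true_events=1_000, missed_links=200)

print(group_a)  # 9.5
print(group_b)  # 8.0
```

An analysis of the linked data would suggest that group B is hospitalised less often than group A, although the true rates are identical. The difference comes entirely from the assumed difference in missed link rates.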
The potential impact of linkage bias on research is unique to each research project. This issue is explored in more detail in a recent Healthier Lives National Science Challenge publication by Kvalsvig et al. (2019), who provide a detailed description of the potential impact of linkage errors and bias on IDI research, suggestions for how to begin to quantify the impact of bias, and the next steps for future research to identify potential linkage errors and adjust analyses to account for potential linkage bias.
Something to consider is that each linkage happens separately for each data source and the spine, and also within the spine. A 2% false link rate for each data linkage could add up to a much larger rate when data from multiple data sources (and multiple linkages) are combined for research purposes.
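As a rough illustration of how these errors can accumulate, the sketch below computes the chance that at least one of several linkages for a person is a false link. It assumes that every linkage has the same false link rate and that errors are independent across linkages. Both are simplifying assumptions of ours, not properties reported for the IDI.

```python
def prob_any_false_link(rate: float, n_linkages: int) -> float:
    """Probability that at least one of n independent linkages is a false link."""
    if not 0 <= rate <= 1:
        raise ValueError("rate must be between 0 and 1")
    return 1 - (1 - rate) ** n_linkages


for n in (1, 3, 5):
    print(n, round(100 * prob_any_false_link(0.02, n), 1))
# 1 linkage  -> 2.0 (%)
# 3 linkages -> 5.9 (%)
# 5 linkages -> 9.6 (%)
```

Under these assumptions, a study that combines five separately linked data sources could have close to one in ten individuals affected by at least one false link.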
Stats NZ provides information on the linkage rates and the false link rate, but it does not currently provide estimates for the rate of missed links. Estimation of missed links is a developing field and it is much more difficult to estimate the rate of missed matches through existing techniques such as clerical review.
We recommend researchers understand the linkage process, the linkage error and potential sources of linkage bias in the data sets they are using in the IDI. The potential impact of linkage bias on results should be discussed in reports based on IDI data. Further research on IDI linkage error and bias is needed to ensure the quality of information obtained from IDI data, and researchers should advocate for more research on this important topic.
BLACK, A. 2016. The IDI prototype spine’s creation and coverage, Available from http://www.stats.govt.nz. In: STATISTICS NEW ZEALAND (ed.) Working Paper No 16-03.
KVALSVIG, A., GIBB, S. & TENG, A. 2019. Linkage error and linkage bias: A guide for IDI users. Wellington: University of Otago Wellington. Available from http://www.vhin.co.nz.
STATISTICAL METHODS STATISTICS NEW ZEALAND 2019. Integrated Data Infrastructure (IDI) refresh: linking report: September 2019 refresh.
By Sheree Gibb, Amanda Kvalsvig, Andrea Teng
Originally published December 2019
This work is licensed under a Creative Commons Attribution 4.0 International License.