Researchers often want to bring data into the Datalab environment or link their own datasets to the IDI or LBD in order to answer specific research questions. This blog post gives some tips on adding external data to the IDI. StatsNZ has published advice about this process on its website; this post builds on that advice from the perspective of an IDI data user.

Ethics approval

The first step is to ensure you have ethics approval for your study. This is done by applying to the relevant committee at your institution, for example your university ethics committee. Make sure the ethics committee has information on the data linkage and on the consent you hold for it. StatsNZ may also require you to contact the Health and Disability Ethics Committee (HDEC) to assess whether your project is in scope. IDI data analyses are frequently out of scope, and StatsNZ will require evidence of this from HDEC (e.g. an email).

Requests to save external information in your Datalab folder

Whilst in the Datalab environment you may want to use external information to help with your research, for example statistical code, written documents or simple datasets. This type of information does not require StatsNZ to do any linkage to the other IDI and LBD datasets. You can simply email your files to the Microdata Access team and ask them to copy the files into your Datalab folders, giving the appropriate folder name. This is the recommended way to bring statistical code into the Datalab environment (e.g. SIAL code, format variables).

This process is not appropriate if you want to link individual record data to the IDI. Linking individual records has privacy implications and consent considerations, and StatsNZ has developed a specific process for it. This is described below.

StatsNZ IDI dataload process for linking a dataset to the IDI

In 2017, StatsNZ introduced an application process for loading datasets into the IDI, to both formalise and prioritise ad-hoc data loads. The process involves meeting with StatsNZ, completing the expression of interest and the subsequent Data Ethics and Privacy Assessment forms to get StatsNZ approval, and then getting a Data Sharing Agreement signed between StatsNZ and the data owners (Board level). An initial meeting with StatsNZ and the data owners helps to ensure everyone is happy with what the project aims to achieve and understands what linked data is. The dataload application form requires users to provide evidence of individual consent and consideration of the relevant privacy issues, as well as an explanation of the benefits of linking and a summary of the quality of the data. As well as applying for the dataload, you will also need to submit a project application to use the IDI Datalab.

There are two types of dataload: full integration and ad-hoc load. A full integration is recommended when:

  • The data is intended to be updated regularly (at least annually)
  • There are no known unique identifier series already within the IDI, and data will be linked probabilistically using personal details
  • There are a large number of tables.

A full integration is tied to the refresh cycle, so the data may take longer to become available than with an ad-hoc load. However, it allows the data to be updated regularly as part of the quarterly IDI refresh cycle and to go through additional processing if necessary.

An ad-hoc load is recommended when:

  • The data has a unique identifier already within the IDI (e.g. IR number, NHI, DIA registration number)
  • The data is a one-off load
  • There is a small number of tables.

In some cases, ad-hoc loads can be completed as Fast Match Loads. This is most suitable when:

  • There is a smaller population that has not yet been linked to the IDI
  • There is not a known unique identifier within the IDI
  • The load is outside of the refresh cycle
  • Name, sex, and date of birth variables are all present.
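
As a rough aid to planning, the criteria above can be written down as a small checklist. The sketch below is only an illustration: the DatasetProfile fields, the simple majority scoring and the suggest_load_type function are assumptions made for this example, not StatsNZ rules, and the "smaller population" criterion for Fast Match is left out because no threshold is given. StatsNZ decides the load type during the application process.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical summary of a dataset you want loaded into the IDI."""
    updated_regularly: bool      # refreshed at least annually
    has_idi_identifier: bool     # e.g. IR number, NHI, DIA registration number
    many_tables: bool            # a large number of tables
    has_name_sex_dob: bool       # name, sex and date of birth all present
    outside_refresh_cycle: bool  # load is needed outside the refresh cycle


def suggest_load_type(p: DatasetProfile) -> str:
    """Suggest a load type by counting how many listed criteria are met.

    The majority rule is an assumption of this sketch: the three
    full-integration criteria mirror the three ad-hoc criteria, so the
    two counts always sum to three and cannot tie.
    """
    full_score = sum([p.updated_regularly,
                      not p.has_idi_identifier,
                      p.many_tables])
    adhoc_score = 3 - full_score
    if full_score > adhoc_score:
        return "full integration"
    # Fast Match needs personal details and no existing identifier.
    if (not p.has_idi_identifier
            and p.outside_refresh_cycle
            and p.has_name_sex_dob):
        return "ad-hoc load (Fast Match candidate)"
    return "ad-hoc load"


# Example profiles (made up for illustration).
survey_panel = DatasetProfile(True, False, True, True, False)
one_off_cohort = DatasetProfile(False, True, False, False, False)
small_trial = DatasetProfile(False, False, False, True, True)

for profile in (survey_panel, one_off_cohort, small_trial):
    print(suggest_load_type(profile))
```

A checklist like this is mainly useful for preparing for the initial meeting with StatsNZ, where the actual load type is agreed.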

StatsNZ encourages data owners to give permission for the linked data to subsequently be used by other researchers.

Data is usually transferred to StatsNZ by secure file transfer protocol (SFTP). An SFTP account is set up for data suppliers when an application to add data is accepted. In rare cases, data may instead be handed over in person on an IronKey (a secured USB drive). Access passwords should always be communicated separately from the data transfer. Linking the data to the IDI can take StatsNZ upwards of three months, depending on the priority the dataload is given and the resources available at StatsNZ at the time of the transfer.

Data linking is done by StatsNZ in one of two ways: deterministically, using government agency identifiers, or probabilistically. Deterministic linkage requires the dataset to be submitted to StatsNZ with government agency identifiers (e.g. NHI). Probabilistic linkage is less exact and more resource intensive: a statistical programme links records based on similarity in name, sex, date of birth and so on. Probabilistic linkage can sometimes be avoided by obtaining government agency identifiers from other sources. For example, if you don’t have NHI numbers for your health research population, you can contact the Ministry of Health and request that they match your study participants to NHI numbers for a small fee.
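
To make the difference between the two approaches concrete, here is a minimal Python sketch. It is a toy illustration, not the method StatsNZ uses: the record layout, the field weights and the 0.85 threshold are all assumptions, the identifiers are made up, and real probabilistic linkage software uses more careful statistical models than a weighted similarity score.

```python
from difflib import SequenceMatcher


def deterministic_link(study, reference, key="nhi"):
    """Link records by exact match on a shared identifier."""
    index = {r[key]: r for r in reference}
    return [(s, index[s[key]]) for s in study if s.get(key) in index]


def similarity(a, b):
    """Weighted agreement on name, sex and date of birth (weights assumed)."""
    name_score = SequenceMatcher(None, a["name"].lower(),
                                 b["name"].lower()).ratio()
    sex_score = 1.0 if a["sex"] == b["sex"] else 0.0
    dob_score = 1.0 if a["dob"] == b["dob"] else 0.0
    return 0.5 * name_score + 0.2 * sex_score + 0.3 * dob_score


def probabilistic_link(study, reference, threshold=0.85):
    """Link each study record to its most similar reference record,
    keeping the pair only if the score reaches the threshold."""
    links = []
    for s in study:
        best = max(reference, key=lambda r: similarity(s, r), default=None)
        if best is not None and similarity(s, best) >= threshold:
            links.append((s, best))
    return links


# Made-up reference population and study records.
reference = [
    {"nhi": "ZZZ0001", "name": "Jane Smith", "sex": "F", "dob": "1980-01-02"},
    {"nhi": "ZZZ0002", "name": "John Brown", "sex": "M", "dob": "1975-06-30"},
]
study_with_id = [
    {"nhi": "ZZZ0002", "name": "J Brown", "sex": "M", "dob": "1975-06-30"},
]
study_without_id = [
    {"name": "Jane Smyth", "sex": "F", "dob": "1980-01-02"},    # spelling variant
    {"name": "Alex Unknown", "sex": "M", "dob": "1990-12-12"},  # no true match
]

exact_links = deterministic_link(study_with_id, reference)
fuzzy_links = probabilistic_link(study_without_id, reference)
print(len(exact_links), len(fuzzy_links))
```

The deterministic link either finds the identifier or it does not, whereas the probabilistic link depends on a threshold: set it too low and unrelated people are linked, too high and genuine matches with spelling variations are missed. This is why supplying an agency identifier is preferable where one can be obtained.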

Once your data is linked, it is saved in the IDI_Adhoc database (for ad-hoc loads) or the relevant IDI_Clean database (for full integration). The next step is to check the data for completeness and quality.

  • Did all the records get linked as expected?
  • How have missing data and zeros been treated?
  • Each individual in the dataset should have a unique identifier. These should be the StatsNZ encrypted government agency identifiers and not the snz_uid, because that ID changes every refresh.
  • How many records are present in the spine, and how many have been successfully matched to other datasets in the IDI?
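
These checks can be scripted once the linked table has been exported to rows. The sketch below is a minimal example in Python using only the standard library. The field names (agency_uid, visits, linked_to_spine) are hypothetical placeholders, not actual IDI column names, and the sample rows are made up.

```python
from collections import Counter


def quality_summary(rows, id_field, value_field, linked_field):
    """Summarise link rate, missing values, zeros and repeated identifiers."""
    n = len(rows)
    linked = sum(1 for r in rows if r.get(linked_field))
    missing = sum(1 for r in rows if r.get(value_field) is None)
    zeros = sum(1 for r in rows if r.get(value_field) == 0)
    id_counts = Counter(r.get(id_field) for r in rows)
    duplicates = sorted(i for i, c in id_counts.items()
                        if i is not None and c > 1)
    return {
        "records": n,
        "linked": linked,
        "link_rate": linked / n if n else 0.0,
        "missing_values": missing,   # None: value not recorded
        "zero_values": zeros,        # 0: recorded as zero, not missing
        "duplicate_ids": duplicates, # identifiers that appear more than once
    }


# Made-up rows standing in for a linked ad-hoc table.
rows = [
    {"agency_uid": 101, "visits": 3, "linked_to_spine": True},
    {"agency_uid": 102, "visits": 0, "linked_to_spine": True},
    {"agency_uid": 103, "visits": None, "linked_to_spine": False},
    {"agency_uid": 101, "visits": 2, "linked_to_spine": True},
]

summary = quality_summary(rows, "agency_uid", "visits", "linked_to_spine")
print(summary)
```

Counting missing values and zeros separately matters because a zero that was silently substituted for a missing value will bias any averages calculated later. Repeated identifiers are not necessarily errors, but they should be expected and explained (for example, one row per event rather than one row per person).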

Linking data to the IDI can add significant value to a research project. However, the current process has a long lead-in time and requires careful advance planning to gain the full benefits. Linking new data is also a way for your research to support and contribute to future research and to improve health and social policy and delivery in New Zealand.

By Andrea Teng, Hayley Denison and Nevil Pierse, with helpful edits by SNZ.

Version: Original 23 May 2018, updated 19 August 2021.