This guide describes the Census 2018 datasets in the IDI. It covers the structure of the data, how it differs from the 2013 Census, issues with the data (particularly those relating to the mitigation strategies used to account for the low response rate at the 2018 Census), and the advantages and disadvantages of the 2018 Census data for research and analysis.
- Census 2018 data differs from previous census data in some important ways:
- It contains much more imputation than previous censuses. Administrative data has been used to impute missing variables and also to impute entire records (people).
- Individuals have been repatriated to their household of usual residence.
- There are new variables and also variables that have been dropped from previous censuses.
- Detailed reports are available on the methodologies used in the 2018 Census and on the quality of different census variables.
- Code and methods used for previous censuses may not work with 2018 data.
- The 2018 Census is considerably more complex than previous censuses, and analysts should familiarise themselves with the methodology and quality evaluations before using Census 2018 data.
The New Zealand Census provides an official count of how many people and dwellings there are in the country at a set point in time. It also provides detailed social, cultural and socio-economic information about the total New Zealand population and key groups in the population. Censuses in New Zealand have been undertaken every five years since 1851, with a few exceptions. The most recent exception was the 2011 Census being delayed until 2013 because of the Christchurch earthquakes.
Datasets from two censuses are in the IDI: the 2013 Census datasets were added in the October 2015 refresh (20151012), and 2018 Census data was added in the January 2020 refresh (20200120). Standalone (i.e., not linked to IDI) census datasets are also available within the Datalab. Stats NZ has also linked historical censuses together into a longitudinal census dataset, and this is available to researchers outside the IDI. Interested users should contact Stats NZ for access to these additional datasets.
What 2018 Census data can I find in the IDI?
There are five 2018 Census data tables in the IDI:
- The individual table <cen_clean.census_individual_2018> contains demographic data on all New Zealand inhabitants; it also includes cultural affiliation, occupation, health, and education variables.
- The family table <cen_clean.census_family_2018> contains data on family composition, family income, and income sources.
- The extended family table <cen_clean.census_ext_family_2018> contains data on extended family type and income.
- The household table <cen_clean.census_household_2018> contains data on household composition and household income.
- The dwelling table <cen_clean.census_dwelling_2018> provides details on bedroom count, housing quality measures, and amenities.
Further information on 2018 Census data structure and variables can be found at https://www.stats.govt.nz/assets/Uploads/Methods/2018-Census-data-user-guide/2018-census-data-user-guide.pdf.
Differences between 2013 and 2018 Census data
2018 Census data differs from 2013 Census data in some important ways. There was more imputation, people were repatriated to their usual residence, new variables were available, and old variables were no longer available.
A range of factors (for a summary see Jack and Graziadei, 2019) meant that the 2018 Census had a much lower response rate than previous censuses. More than 16% of New Zealand residents did not complete a questionnaire for the 2018 Census, compared to less than 8% in the 2013 Census, and non-completion was much higher for Māori and Pacific New Zealanders (2018 Census External Data Quality Panel, 2019a; p10). In response, Stats NZ initiated a large-scale mitigation project that involved the use of alternative data sources to fill the gaps, including data:
- from administrative tables in the IDI;
- from the 2013 Census; and
- imputed from a near neighbour (i.e., copied from the completed 2018 Census form of an individual with similar characteristics to the individual with missing data).
These data were used for two major types of imputation: item imputation (filling in missing responses from someone who filled out a census form but missed some questions) and record imputation (creating entire records for people who did not respond to the census).
Previous censuses have used minimal imputation, typically just item imputation for missing age, sex, usual resident status (for those away on census night), labour force status, and Māori descent (electoral). The large-scale imputation used in the 2018 Census is unprecedented in New Zealand and has important implications for research using Census 2018 data. These are described in the ‘challenges’ section below.
Record imputation (for missing people)
For approximately 12.4% of the New Zealand population, no data was collected for Census 2018. For these missing individuals, Stats NZ used administrative data from the IDI to construct imputed individual records that were added to the census dataset.
Stats NZ linked the 2018 Census file to the IDI-ERP, a population of people who were New Zealand residents as at the census date, constructed from administrative data (see Gibb et al., 2016 for more information). Records that were in the IDI-ERP, but not in the census file, were added to the census file. It is important to note that while the addition of these records results in a census file that has more complete coverage of the New Zealand resident population, there will still be undercoverage and overcoverage.
For a discussion of some of these issues see 2018 Census External Data Quality Panel, 2019a, chapter 3; for a formal analysis of overall coverage, see results from the Post‑enumeration Survey for 2018 at https://www.stats.govt.nz/information-releases/
Some imputed records have missing address data
The imputed census records did not have an address attached, but they need to be attached to a dwelling (address) in order to generate household and family statistics. To assign an address to imputed records, Stats NZ considered addresses from a selection of data sources in the IDI, and selected the most recently updated. This was used to assign the usual residence for that person. In some cases, an address could not be found, or modelling indicated that the address was likely to be incorrect. These individuals were removed from the census file.
Where possible, the addresses were used to assign individuals to dwellings. Modelling was also undertaken to determine the likelihood that these dwelling assignments were correct. When the probability of the dwelling being correct was too low, individuals were not assigned to a dwelling, but they could be assigned to a meshblock (small area) if there was a high probability that the meshblock was correct. As a result, there are individuals in the census file who have a meshblock but are not attached to a dwelling.
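The two-step assignment described above can be pictured as a simple decision rule. The sketch below is illustrative only: the function name `assign_geography` and the threshold values are hypothetical, and this guide does not state the actual cut-offs or models used by Stats NZ.

```python
# Hypothetical cut-offs: the real thresholds are not given in this guide.
DWELLING_THRESHOLD = 0.8
MESHBLOCK_THRESHOLD = 0.8


def assign_geography(p_dwelling, p_meshblock):
    """Decide which geographies to keep for one imputed record.

    p_dwelling and p_meshblock are modelled probabilities (0 to 1) that the
    candidate dwelling and meshblock are correct.
    """
    if p_dwelling >= DWELLING_THRESHOLD:
        # Confident in the dwelling, so keep the dwelling and its meshblock.
        return {"dwelling": True, "meshblock": True}
    if p_meshblock >= MESHBLOCK_THRESHOLD:
        # Dwelling too uncertain, but the small area is likely to be right.
        return {"dwelling": False, "meshblock": True}
    # Neither level is reliable enough to keep.
    return {"dwelling": False, "meshblock": False}


# Three made-up records: (probability dwelling correct, probability meshblock correct).
examples = [(0.95, 0.99), (0.40, 0.90), (0.10, 0.30)]
results = [assign_geography(p_dw, p_mb) for p_dw, p_mb in examples]
```

The second example corresponds to the case noted above: an individual who has a meshblock but is not attached to a dwelling.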
Additional methods were used for some special populations such as prisoners and defence force members; see 2018 Census External Data Quality Panel (2019a, pp20-21) for details.
Item imputation (for missing variable values)
Item imputation is used to fill in individual variable values where there is a census record for a person but some of the information is missing. This can happen if an individual does not answer some questions, or when a record is added to the file from administrative data (as described above). The amount of missing data differs by variable and ranges from 9.5% (educational institution address) to 55.3% (sector of ownership). For most variables it was less than 20% (https://www.stats.govt.nz/reports/2018-census-external-data-quality-panel-data-sources-for-key-2018-census-individual-variables). The amount of missing data also varies across region, age, ethnic group, and other factors.
To impute individual items, Stats NZ used a combination of methods including sourcing information from administrative data, copying over values from the 2013 Census, and copying over values from a similar census record (using the ‘Canadian census edit and imputation system’ [CANCEIS] tool; for a description see 2018 Census External Data Quality Panel, 2019a, pp21-22). The exact methods used differed by variable, and more detail can be found within the DataInfo+ “2018 Census information by variable and quality” page (http://datainfoplus.stats.govt.nz/Item/nz.govt.stats/2ae40a5d-64c8-4704-9829-45f802d78c6c). Imputed values can be identified using the ‘item source’ variables in the Census 2018 dataset. These variables specify the data source used for each case for each variable (e.g., ‘cen_ind_ethgr_impt_ind’ indicates the source of ethnicity data for each individual).
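As a rough sketch of how an item source variable might be used, the following tabulates records by ethnicity source and lists those not sourced from a 2018 Census form. The column name `cen_ind_ethgr_impt_ind` comes from the text above; the source codes, the `person_id` field, and the sample records are invented for illustration, so the real code list should be checked in the census metadata before use.

```python
from collections import Counter

# Hypothetical code list: "11" is ASSUMED to mean "2018 Census form response".
# The real codes must be taken from the census metadata.
CENSUS_FORM_CODES = {"11"}

# Made-up rows standing in for records from the individual table.
records = [
    {"person_id": 1, "cen_ind_ethgr_impt_ind": "11"},
    {"person_id": 2, "cen_ind_ethgr_impt_ind": "11"},
    {"person_id": 3, "cen_ind_ethgr_impt_ind": "21"},
    {"person_id": 4, "cen_ind_ethgr_impt_ind": "31"},
    {"person_id": 5, "cen_ind_ethgr_impt_ind": "11"},
    {"person_id": 6, "cen_ind_ethgr_impt_ind": "41"},
]


def tabulate_sources(rows, source_col):
    """Count how many records used each data source for one variable."""
    return Counter(row[source_col] for row in rows)


def not_from_census_form(rows, source_col, form_codes=CENSUS_FORM_CODES):
    """Return the rows whose value came from an alternative data source."""
    return [row for row in rows if row[source_col] not in form_codes]


source_counts = tabulate_sources(records, "cen_ind_ethgr_impt_ind")
alternative_ids = [
    row["person_id"]
    for row in not_from_census_form(records, "cen_ind_ethgr_impt_ind")
]
```

A tabulation like `source_counts` gives a quick view of how much of a variable relies on alternative sources before deciding how to treat those records.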
Data users should be aware that use of alternative data sources (i.e., administrative data; 2013 census; near-neighbour imputation) will have impacts on data quality. For example, where ethnicity data were not available from the 2018 Census, data were obtained from either the 2013 Census, administrative data, or near neighbour imputation (see http://datainfoplus.stats.govt.nz/Item/nz.govt.stats/7079024d-6231-4fc4-824f-dd8515d33141). Because individual reporting of ethnicity can change over time, and vary by context, using alternative data sources can result in a different and less up to date record of ethnicity than that recorded by the census. In particular, ethnicity from administrative sources might not be updated often, especially for people who used services infrequently.
Repatriation of absentees
Census ‘absentees’ are those who were not staying at their usual residence on census night. In previous censuses, household statistics showed these people as being part of the household that they were staying at on census night, rather than the household in which they usually live. For the 2018 Census this has changed, and absentees have been ‘repatriated’ to their usual residence household. As a result, 2018 household statistics will show absentees as part of the household in which they usually live, rather than the one at which they were staying on census night. Note that household information will be missing if it was not possible to determine an individual’s usual residence household.
New and dropped variables
Census 2018 contains new variables, such as ‘usual residence one year ago’, ‘main means of travel to education’, ‘educational institution address’, and three measures of housing quality: ‘access to basic amenities’, ‘dampness’, and ‘mould’. There is also one key variable, ‘iwi affiliation’, that was asked in 2018 but is not available in the 2018 Census dataset because the variable was considered ‘very poor’ quality. This is because of low response rates among Māori and the lack of suitable alternative data sources; see http://datainfoplus.stats.govt.nz/Item/nz.govt.stats/518050af-47e8-486a-8f3c-f0995d3a716b.
In addition, the format of some questions has changed and this will impact on what is being measured. For example, the travel to work question in 2013 asked about travel to work on census day; the 2018 question asked about usual means of travel to work.
There are several challenges when working with 2018 Census data in the IDI.
Adapting to new imputation and repatriation methods
The repatriation of absentees in the 2018 Census means that users will need to adopt different methods for dealing with absentee records, because methods used for previous censuses will no longer work. It is critical to process absentee records correctly or counts will be incorrect. Processes for repatriating absentees are complex and we recommend that users seek advice from Stats NZ before dealing with absentee records in the 2018 Census.
The imputation from administrative data may cause issues when Census 2018 is combined with, or compared to, administrative data. For example, the assessment of administrative ethnicity variables against the ‘gold standard’ of census ethnicity is not sensible for records where ethnicity has been imputed from administrative sources.
Administrative enumeration resulted in 357,000 people who could not be assigned to dwellings (see ‘Some imputed records have missing address data’, above). This will have an impact on the quality of household- and family-level variables (2018 Census External Data Quality Panel, 2020, section 6).
Users will need to make choices about whether, and how, to use imputed data from the census depending on their specific aims and methodology. For example, to understand the potential impact of imputation on their analysis, users could undertake sensitivity analyses with and without the imputed records. See Atkinson et al., 2020 and Boven et al., 2021 for examples of this.
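A sensitivity analysis of this kind can be sketched as follows: the same estimate is computed once on all records and once on records that were not imputed, and the two are compared. The field names (`imputed_record`, `outcome`) and the data are hypothetical; in practice the imputation flag would be derived from the census source variables.

```python
def proportion(rows, field):
    """Share of rows in which `field` is true."""
    rows = list(rows)
    if not rows:
        raise ValueError("no records to summarise")
    return sum(1 for r in rows if r[field]) / len(rows)


def sensitivity(rows, field, imputed_flag):
    """Compare an estimate with and without imputed records."""
    with_imputed = proportion(rows, field)
    without_imputed = proportion(
        [r for r in rows if not r[imputed_flag]], field
    )
    return {
        "all_records": with_imputed,
        "excluding_imputed": without_imputed,
        "difference": with_imputed - without_imputed,
    }


# Made-up data: 6 responding records (3 with the outcome) and
# 4 imputed records (1 with the outcome).
records = (
    [{"imputed_record": False, "outcome": True}] * 3
    + [{"imputed_record": False, "outcome": False}] * 3
    + [{"imputed_record": True, "outcome": True}] * 1
    + [{"imputed_record": True, "outcome": False}] * 3
)

result = sensitivity(records, "outcome", "imputed_record")
```

In this toy example the estimate is 0.4 on all records and 0.5 when imputed records are excluded. A gap of that size would suggest that conclusions depend on how imputed records are handled.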
Difficulty comparing with previous censuses
The substantial changes to the 2018 Census mean that fair comparisons to previous censuses may not be possible. For example, comparisons of 2013 and 2018 smoking prevalence are likely to be biased because smoking data from 2018 was supplemented with smoking data from 2013. This will overestimate similarity between the two censuses, and underestimate downward trends in smoking (2018 Census External Data Quality Panel, 2019b, section 2.2).
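The direction of this bias can be seen with some simple arithmetic. The sketch below uses made-up prevalence figures and a made-up share of carried-forward values; none of these numbers come from the census. It also makes the simplifying assumption that people whose values were carried forward had the same 2013 prevalence as the population as a whole.

```python
def observed_prevalence(true_2018, prev_2013, share_from_2013):
    """Prevalence seen in 2018 data when a share of values is copied from 2013."""
    return (1 - share_from_2013) * true_2018 + share_from_2013 * prev_2013


# All three inputs are hypothetical.
prev_2013 = 0.15        # prevalence measured in 2013
true_2018 = 0.12        # prevalence that a complete 2018 census would show
share_from_2013 = 0.25  # share of 2018 values carried forward from 2013

observed_2018 = observed_prevalence(true_2018, prev_2013, share_from_2013)

true_change = true_2018 - prev_2013          # about -0.03
observed_change = observed_2018 - prev_2013  # about -0.0225, a smaller fall
```

Because a quarter of the 2018 values in this example are simply the 2013 values, the observed fall is only three quarters of the true fall.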
Quality of 2018 Census variables
Users should be aware of the uncertainty introduced by the new methods used, and the quality of specific census variables, particularly as it relates to the use of administrative data sources. Assessments of the quality of each census variable have been provided by Stats NZ, as well as by the 2018 Census External Data Quality Panel (https://www.stats.govt.nz/reports/2018-census-external-data-quality-panel-assessment-of-variables). These assessments give a brief summary of the quality concerns for specific census variables, and should be required reading before working with these variables.
While these assessments largely restrict their focus to individual variables used on their own, it should be noted that the particularly high use of alternative data sources for the 2018 Census has the potential to skew estimates of association. In particular, associations between 2018 and 2013 Census variables may be artificially inflated for variables where 2013 Census data was used as an alternative data source (e.g., smoking, see above). Also, associations between census variables and other variables in the IDI may be inflated if the other variables were used as an alternative data source. For example, income recorded by the Inland Revenue Department was used as an alternative data source for census-recorded income, so associations between the two are also likely to be inflated.
It is less clear how associations with variables that used large amounts of near neighbour imputation (e.g., occupation) are affected. On the one hand, because imputed data were borrowed from another census respondent, it might be expected that this would weaken whatever associations are in the data. On the other hand, variables were imputed in blocks, which means data for variables within blocks were copied from the same census respondent (e.g., because income and occupation were imputed in the same block, data for both were often copied across from the same census respondent). This may maintain or even inflate associations between variables.
Lower levels of missing data
The incorporation of alternative data into the 2018 Census tended to result in fewer respondents with ‘no information’ than in previous censuses. This is because alternative data sources were used for 2018 Census variables not only for ‘unit non-response’ (i.e., where a census form was not completed) but also for ‘item non-response’ (i.e., where a census form was completed but information was missing for some variables). Indeed, for a number of 2018 Census variables in the IDI there are no records with ‘no information’. One consequence of this is that change between the 2013 and 2018 Censuses must be interpreted with caution, as apparent change may be due to the higher completeness of the 2018 Census variable. Removing the ‘no information’ category before calculating proportions can overcome this problem.
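To illustrate, the sketch below calculates proportions with and without the ‘no information’ category. The counts and category labels are invented and do not correspond to any census variable.

```python
def proportions(counts, exclude=()):
    """Convert category counts to proportions, optionally dropping categories."""
    kept = {k: v for k, v in counts.items() if k not in exclude}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no records left after exclusions")
    return {k: v / total for k, v in kept.items()}


# Invented counts for one yes/no variable in each census.
counts_2013 = {"Yes": 150, "No": 750, "No information": 100}
counts_2018 = {"Yes": 160, "No": 840, "No information": 0}

# Proportions with 'No information' left in the denominator.
raw_2013 = proportions(counts_2013)["Yes"]  # 150 / 1000
raw_2018 = proportions(counts_2018)["Yes"]  # 160 / 1000

# Proportions among records with a stated value only.
adj_2013 = proportions(counts_2013, exclude={"No information"})["Yes"]  # 150 / 900
adj_2018 = proportions(counts_2018, exclude={"No information"})["Yes"]  # 160 / 1000
```

With ‘no information’ left in, the proportion appears to rise between the two censuses (0.15 to 0.16). Once it is removed, the proportion falls slightly (about 0.167 to 0.16), so the apparent rise was an artefact of the more complete 2018 variable.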
The structure and format of some variables have changed compared to previous censuses, so data users should check the format and structure of variables before using them. For example, in 2013 occupational codes were stored in both ANZSCO and NZSCO99 formats, but in 2018 they are stored only in ANZSCO format. Another example is the income sources data, which was split across different variables in 2013 but was combined into a single variable for 2018.
The main geographic reporting unit has also changed in the 2018 Census to Statistical Area 1 (SA1), although other units are still recorded in the census datasets (meshblock, Statistical Area 2 (SA2), Territorial Authority (TA), Regional Council (RC)). Care also needs to be taken when dealing with a dwelling’s geographies versus an individual’s geographies, which may differ for those who were absent from their usual residence on census night (see ‘repatriation of absentees’ section above).
Stats NZ census data quality training: https://rise.articulate.com/share/05dPJqhQYTAM5UFUtLc_B8h1Eho5kOS0
Data quality for 2018 Census, including reports from the Census External Data Quality Panel: https://www.stats.govt.nz/2018-census/data-quality-for-2018-census
Independent Review of New Zealand’s 2018 Census: https://www.stats.govt.nz/reports/
Census metadata in Datainfo+: http://datainfoplus.stats.govt.nz/Item/nz.govt.stats/
Stats NZ 2018 Census front page: https://www.stats.govt.nz/2018-census/
New Zealand Longitudinal Census: https://www.stats.govt.nz/methods/linking-censuses-new-zealand-longitudinal-census-19812006
2018 Census External Data Quality Panel. (2019a). Initial Report of the 2018 Census External Data Quality Panel. Retrieved from: https://www.stats.govt.nz/reports/initial-report-of-the-2018-census-external-data-quality-panel
2018 Census External Data Quality Panel. (2019b). 2018 Census External Data Quality Panel Assessment of Variables. Retrieved from: https://www.stats.govt.nz/reports/2018-census-external-data-quality-panel-assessment-of-variables
2018 Census External Data Quality Panel. (2020). Final Report of the 2018 Census External Data Quality Panel. Retrieved from: https://www.stats.govt.nz/reports/final-report-of-the-2018-census-external-data-quality-panel
Atkinson J, Salmond C, Crampton P. (2020). NZDep2018 Index of Deprivation Final Research Report, December 2020. Wellington: University of Otago. Retrieved from: https://www.otago.ac.nz/wellington/otago823833.pdf
Boven N, Shackleton N, Bolton L, Milne B. (2021). The 2018 New Zealand Socioeconomic Index (NZSei-18): A Brief Technical Summary. Auckland: COMPASS Research Centre, the University of Auckland. Retrieved from: https://cdn.auckland.ac.nz/assets/auckland/
Gibb S, Bycroft C, Matheson-Dunning N. (2016). Identifying the New Zealand resident population in the Integrated Data Infrastructure (IDI). Retrieved from: https://www.stats.govt.nz/assets/Research/Identifying-the-New-Zealand-resident-population-in-the-Integrated-Data-Infrastructure/identifying-nz-resident-population-in-idi.pdf
Jack M, Graziadei C. (2019). Report of the Independent Review of New Zealand’s 2018 Census. Wellington, New Zealand. Retrieved from: https://www.stats.govt.nz/assets/
Original 27/05/2021, written by Barry Milne, Sheree Gibb, June Atkinson, Martin von Randow, Natalia Boven, Kendra Telfer
This work is licensed under a Creative Commons Attribution 4.0 International License.