Using the datalab
Before you can work in the Datalab you will need to complete Statistics New Zealand’s confidentiality training and sign a secrecy agreement.
Statistics New Zealand has Datalab facilities available at its offices in Wellington, Auckland and Christchurch. These are available to all Datalab users. Contact Statistics New Zealand’s microdata access team about accessing their datalabs (email: firstname.lastname@example.org, phone: 04 931 4253).
Many institutions such as universities and government agencies have their own datalabs. Contact the relevant institution for details about accessing these.
The Datalab is a secure environment – nothing can be put into it or taken out of it without passing through SNZ’s microdata access team. The computer that you use to access the IDI can’t be used for anything else – it has no email and most websites are blocked. This means that you need to do some preparation in order to make the best use of your time in the Datalab. If you have code or other files that you want to use in the Datalab, email them to the microdata access team ahead of time so that they can put them in the Datalab environment for you. You are able to take a tablet or laptop into the Datalab, but you will not be able to transfer anything from that device into the Datalab environment. Use of it will be restricted to checking emails, searching the web, looking at your notes, etc.
The Datalab environment has SAS, SQL, STATA, and RStudio. If you need other software you will need to talk to the microdata access team well ahead of time (preferably when making your application).
If you have not used the IDI before, I suggest you spend your first session in the Datalab getting to know the structure and content of the IDI.
Have a look around the IDI wiki. This is only accessible from within the Datalab and contains data dictionaries, information about how to access IDI, efficiency tips, and more. There is a link to the wiki from your Datalab folder. If you are interested, you can also have a browse through the code sharing area and the discussion board. These are linked from the wiki under the ‘IDI research’ heading.
A good way to get a feel for the structure of the IDI is to use SQL Server Management Studio. You can use this in ‘point and click’ fashion; there is no need to program in SQL. There is a shortcut on the desktop of the Datalab computer.
The IDI is organised into ‘refreshes’, which are versions or updates. Within each refresh there is a series of data collections, and within each data collection there are tables. Table names are displayed as collection.table, so ‘moe.enrolment’ refers to the ‘enrolment’ table within the MoE (education) collection. Derived tables (tables created by IDI) are stored in the ‘data’ collection (eg data.personal_detail). In SQL Server Management Studio you can see the names of all tables, including those that you don’t have access to (and cannot open). To view a table that you do have access to, right-click on the table name and select ‘view top 1000 rows’.
Some tables contain a lot of variables, and it’s not always obvious what they mean. IDI data dictionaries are available through the wiki and have explanations of the content of IDI datasets. Some data dictionaries are also available outside the Datalab environment. The quality of the data dictionaries varies. If you have a question about the meaning of a particular variable, or where you might find something, don’t hesitate to ask the microdata access team, the IDI team, other researchers, or post a question on the wiki or on Meetadata. Sometimes it is also helpful to go back to the data dictionaries produced by the source agencies (eg for health or education). You can find these online.
You can also explore the structure through SAS, but it is less convenient. If you run a libname statement for the data collection(s) that you want to view (see the wiki for the format of the libname), you can then view the tables in the left panel underneath the libname.
It is recommended that you make your initial data extractions using SQL. This can be done through SQL server, or from within SAS using the ‘passthrough’ method described on the wiki (under the heading ‘An intro to SAS explicit passthrough queries’). It is also possible to extract data from the IDI using other means (eg a SAS data step) but it will be much slower than using SQL. That said, if you only want to extract a small amount of data, it probably won’t make much difference.
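To illustrate what restricting an extraction in SQL looks like, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for the IDI, purely so the example can run anywhere. The column names (person_id, provider, start_year) and the rows are hypothetical, and in the Datalab you would run the equivalent SQL against the real tables through SQL Server or SAS passthrough. The point is only that the query selects the columns and rows you need, rather than pulling a whole table and filtering afterwards.

```python
import sqlite3

# In-memory SQLite database standing in for the IDI (assumption for illustration only).
conn = sqlite3.connect(":memory:")

# Attach a second in-memory database named "moe" so the table can be addressed
# as collection.table, mirroring the naming described in the text.
conn.execute("ATTACH DATABASE ':memory:' AS moe")

# Hypothetical columns; check the data dictionaries for the real ones.
conn.execute(
    "CREATE TABLE moe.enrolment (person_id INTEGER, provider TEXT, start_year INTEGER)"
)
conn.executemany(
    "INSERT INTO moe.enrolment VALUES (?, ?, ?)",
    [(1, "A", 2012), (2, "B", 2015), (3, "A", 2016)],
)

# Extract only the columns and rows needed for the analysis.
rows = conn.execute(
    "SELECT person_id, start_year "
    "FROM moe.enrolment "
    "WHERE start_year >= ? "
    "ORDER BY person_id",
    (2015,),
).fetchall()

conn.close()
print(rows)
```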
For R users, there is an example of how to connect to IDI using R in the code sharing area. Alternatively, you can make the initial extraction using SQL and then save a csv or similar file to read into R.
Once you have completed the initial extraction, you can save your dataset and continue to use SAS/STATA/R as you normally would. If you are saving your dataset as an SQL table, you can store it in the sandpit (if you do the initial extraction through SQL server, this is your only option). SAS datasets, csv files and other formats can be stored in the Datalab folder. The datalab folder can also be used to store code, notes, tables and write-ups, and any other documents you are using within the Datalab. One Datalab folder is allocated to each microdata access application. If your application is large and contains several sub-projects it is suggested you set up sub-folders within the Datalab folder.
Some examples of information you might need, and where to find it
Age and date of birth
Age and sex can be found in the personal detail table. This table covers all individuals who have a record anywhere in the IDI. Month and year of birth are available, but not day of birth. If an individual has a different sex or date of birth recorded in different data sources, IDI has an algorithm for deriving the values in the personal detail table (ask the microdata access team if you would like details).
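Because only month and year of birth are available, calculating age at a reference date needs an assumption about the day of birth. The Python sketch below assumes the 15th of the month. That choice is mine for illustration and not necessarily what IDI uses, and the function name and arguments are hypothetical. The same logic translates directly to SAS, R or SQL.

```python
from datetime import date


def age_at(birth_year: int, birth_month: int, reference: date) -> int:
    """Age in completed years at `reference`, given only year and month of birth.

    Assumption: the birthday is treated as the 15th of the birth month,
    because day of birth is not available.
    """
    assumed_day = 15
    age = reference.year - birth_year
    # Subtract a year if the assumed birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birth_month, assumed_day):
        age -= 1
    return age


# Example: someone born in June 1990, measured at 10 March 2016.
print(age_at(1990, 6, date(2016, 3, 10)))
```

Ages calculated this way can be out by one year for people whose reference date falls in their birth month, so it is worth checking whether that matters for your analysis.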
Ethnicity
Census-measured, self-defined ethnicity is considered one of the best quality and most appropriate measures of ethnicity. Recent refreshes of the IDI provide a source-ranked ethnicity table, which gives ethnicity from a single source for each person, with first priority on the census, followed by DIA and then health. More details about this table are available on the wiki or Meetadata.
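The source ranking described above can be sketched in Python as follows. This only illustrates the priority idea (census, then DIA, then health) and is not the code used to build the IDI table. The function name, the source labels and the way ethnicities are represented as lists are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Priority order described in the text: census first, then DIA, then health.
SOURCE_PRIORITY = ["census", "dia", "health"]


def ranked_ethnicity(
    by_source: Dict[str, List[str]],
) -> Tuple[Optional[List[str]], Optional[str]]:
    """Return (ethnicities, source) from the highest-priority source with a value.

    Sources that are missing or have an empty list are skipped.
    Returns (None, None) if no source has any ethnicity recorded.
    """
    for source in SOURCE_PRIORITY:
        value = by_source.get(source)
        if value:
            return value, source
    return None, None


# Example: no census record, so the DIA value is used ahead of health.
print(ranked_ethnicity({"health": ["European"], "dia": ["Maori"]}))
```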
For further information on ethnicity in the IDI see – Reid, G, Bycroft, C, Gleisner, F (2016). Comparison of ethnicity information in administrative data and the census. Retrieved from http://archive.stats.govt.nz/methods/research-papers/topss/maori-info-admin-data.aspx
Geographic location
Information about geographic location is recorded in several sources in the IDI. As of the December 2015 refresh, the address sources were IRD, ACC, Health (NHI and PHO), Education, MSD, Census and HLFS. At any given time, an individual may have several different addresses recorded with different administrative providers (if, for example, they have moved house and updated their address with their GP, but not with the IRD). These might be different types of addresses, for example MSD has both residential and postal addresses. You will need to think about which sources and types are best suited to your needs.
IDI geocodes all address information that it receives to a coordinate point. The variables that contain this information are called ‘snz_idi_address_register_uid’ or similar. The coordinate points identify a land parcel (roughly equivalent in most cases to an address or dwelling). However, these have been encrypted so that they are not identifiable. The same encryption key has been used on different IDI datasets, so you can be sure that, for example, location ‘10076534’ in health data is the same location as ‘10076534’ in tax data. The encrypted values cannot, however, reveal the actual location of a dwelling.
Locations are also coded to statistical areas, such as meshblock, territorial authority, and regional council area. These are not encrypted.
Geographic location information is available in the source data collections, and is also combined into two tables in IDI. The ‘address_notification_full’ table contains a list of all address updates that have been recorded in all IDI data sources, along with a flag indicating the data source. The ‘address_notification’ table uses prioritisation to get a single geographic location for an individual at any given time. The prioritisation process is described in IDI geographic tables metadata on the wiki.
When geographic information is needed, use the ‘address_notification_full’ table and select the most recently updated address at the reference date. This method has been shown to provide a more accurate address (based on comparisons against addresses recorded in census data) than a prioritisation method (see http://archive.stats.govt.nz/methods/research-papers/topss/quality-geo-info-idi.aspx).
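As a rough illustration of the "most recently updated address at the reference date" rule, here is a minimal Python sketch. The field names (person_id, notified_on, address_uid, source) are hypothetical stand-ins, so check the geographic tables metadata on the wiki for the real variable names. In practice you would probably do this step in SQL as part of your extraction.

```python
from datetime import date
from typing import Iterable, NamedTuple, Optional


class Notification(NamedTuple):
    person_id: int     # hypothetical person identifier
    notified_on: date  # hypothetical date the address update was recorded
    address_uid: int   # encrypted location identifier
    source: str        # flag indicating the data source


def address_at(
    notifications: Iterable[Notification], person_id: int, reference: date
) -> Optional[Notification]:
    """Return the person's most recent notification on or before `reference`.

    Returns None if the person has no notification by that date.
    """
    candidates = [
        n
        for n in notifications
        if n.person_id == person_id and n.notified_on <= reference
    ]
    return max(candidates, key=lambda n: n.notified_on, default=None)


# Made-up rows for illustration only.
rows = [
    Notification(1, date(2012, 5, 1), 10076534, "IRD"),
    Notification(1, date(2014, 8, 20), 10099999, "PHO"),
    Notification(1, date(2016, 2, 3), 10011111, "MSD"),
]
print(address_at(rows, 1, date(2015, 12, 31)))
```

Here the August 2014 notification is selected, because the February 2016 update comes after the reference date.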
Time spent overseas
You might be interested in identifying whether or not an individual was in NZ at a given date. The easiest place to get this information is the person_overseas_spell table in the ‘data’ collection. This table lists periods spent overseas by each individual in the IDI from 1998 onwards (migration data in IDI does not go further back than 1998). Each row in the table is one overseas spell, so an individual might have many rows. For each overseas spell there is a start date, an end date, and a length in days. Default dates are used where a spell has no recorded start or end: a start date a long time in the past for those arriving in the country for the first time, and an end date a long time in the future for those who have left but not returned.
The table is derived from the MBIE migration tables, which list all border movements in and out of New Zealand. These data are collected from passports, not from the departure/arrival card that you fill out. IDI does hold the card data, but in a different collection called ‘New Zealand Customs Service’. If you need more information than is in the person_overseas_spell table, you will have to get access to the MBIE and/or Customs tables. MBIE is better for some purposes and Customs for others, so talk to SNZ about which collection would suit your needs better. For many people, though, the information in the person_overseas_spell table will be enough.
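The "was this person in NZ on a given date" check can be sketched in Python like this. It assumes each spell is a (start, end) pair of dates and treats both the start and end dates as spent overseas, which you may want to handle differently. The far-future placeholder date is hypothetical, so check the table metadata for the actual default values. Someone with no spells at all is treated as being in NZ, which is also an assumption.

```python
from datetime import date
from typing import Iterable, Tuple

Spell = Tuple[date, date]  # (start, end) of one overseas spell

# Hypothetical stand-in for the default "has not returned" end date.
FAR_FUTURE = date(9999, 12, 31)


def in_nz_on(spells: Iterable[Spell], reference: date) -> bool:
    """True if `reference` does not fall inside any of the person's overseas spells.

    Assumption: start and end dates are both counted as overseas.
    """
    return not any(start <= reference <= end for start, end in spells)


# One person's spells: a 2005 trip, then a departure in 2014 with no return.
spells = [
    (date(2005, 1, 10), date(2005, 2, 1)),
    (date(2014, 6, 1), FAR_FUTURE),
]
print(in_nz_on(spells, date(2010, 3, 1)))
print(in_nz_on(spells, date(2016, 3, 1)))
```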
Getting data out of the datalab
The Datalab is a secure environment and you cannot remove data, tables, graphs, etc without (1) applying confidentialisation in accordance with SNZ’s rules, and (2) having it checked by SNZ checkers.
Important: There is a 5 working day turnaround for checking and releasing outputs from the datalab. If you need some numbers, graphs or tables for a presentation or report, you need to plan ahead.
The process for confidentialising and releasing datalab outputs is described in the Microdata Output Guide. If you have any questions, don’t be afraid to ask the microdata access team. SNZ has the right to revoke your datalab access if you break the rules, so it is in your interests to make sure you are following them correctly.
By Sheree Gibb
Version: Original 10 March 2016, last updated 6 March 2018