It is convention (but not necessary) to write SQL command words in capital letters, and table and variable names in lower case. Square brackets are often used around server, table and variable names, but these are not necessary in most cases.
The most basic command you will use regularly in SQL is selecting a set of variables (columns) or observations (rows) from an existing table. The following code will select the top 100 rows of each variable from the table called table_name. In this case the character * indicates that we want to include all of the variables/columns. Note that ; is a statement terminator in SQL, and while not always required in SQL server its use is encouraged.
The following code examples relate to:
- Extracting data
- Subsetting according to criteria
- Creating tables
- Updating existing tables
SELECT TOP 100 * FROM [IDI_Clean].[data].[personal_detail];
If the number of rows is not indicated, the command will return all of the rows. To return all of the rows in table_name, but for only a subset of the variables, * is replaced by the column names of the variables we are interested in separated by commas:
SELECT snz_uid, snz_sex_code FROM [IDI_Clean].[data].[personal_detail];
The code above will return all rows for the variables called snz_uid and snz_sex_code.
Variables can easily be renamed within this step by writing AS and then the new variable name. Note that you will not be able to refer to this variable by its new name until you make a subsequent query. The following code renames the ‘snz_sex_code’ variable to ‘gender’:
SELECT snz_uid, snz_sex_code AS gender FROM [IDI_Clean].[data].[personal_detail];
Subsetting according to criteria
If we want to filter the results, for example to only view rows where some condition is true, the command WHERE can be used:
SELECT * FROM [IDI_Clean].[data].[personal_detail] WHERE snz_uid = 12345;
The code above will only select rows where snz_uid is ‘12345’.
The commands AND and OR can also be used in conjunction with WHERE to create more complicated filtering criteria. The following code section selects all rows where snz_uid = 1 and the sex variable is not blank:
SELECT * FROM [IDI_Clean].[data].[personal_detail] WHERE snz_uid = 12345 AND (snz_sex_code IS NOT NULL);
Similar queries are possible using <, > and =.
The command GROUP BY can be used to collapse rows together and calculate summary measures such as means or sums for groups. For example, if we wanted to count the number of hospital admissions for each snz_uid we could use the following code:
SELECT snz_uid, COUNT(moh_evt_event_id) as n_hospital_adm FROM [IDI_Clean].[moh_clean].[pub_fund_hosp_discharges] GROUP BY snz_uid;
Similar queries are also possible by using AVG(), MIN() and MAX(). Subsetting and grouping can be combined together into a single select query.
The results can also be ordered based on one or more variables, either in ascending (ASC) or descending (DESC) order. The following code will list the results ordered by sex with the lowest values first, and then by ascending snz_uid. By default, ordering is in ascending order.
SELECT snz_uid, snz_sex_code
ORDER BY snz_sex_code DESC, snz_uid ASC;
Selecting data using the above commands will open a window in SQL server below your script allowing you to view the results, but if you want to save the results in a table you need to use the INTO command. This requires write access to the Sandpit folder for your project. The following command will save the table produced into the Sandpit in the folder for project MAA2018-xxxx (where xxxx is your project number), and the new table will be saved with the name income_table.
SELECT snz_uid, snz_sex_code INTO [IDI.Sandpit].[DL-MAA2018-xxxx].[income_table] FROM [IDI_Clean].[data].[personal_detail];
It is also possible to write data into a temporary table by using # in front of the table name. This table will disappear when you close SQL Server Management Studio, and is only accessible within that same query window. Here is an example:
SELECT snz_uid, snz_sex_code INTO #my_table FROM [IDI_Clean].[data].[personal_detail];
If you run code to create a table and that table already exists, whether it is a normal or temporary table, SQL will produce an error and the new table will not be created. You need to delete the old version of the table before you can create it again.
While you can manually delete tables, if you have code that you will run multiple times it can be more efficient to automate the process. The following code segment first checks if a table called table_name exists in the Sandpit area already, and if it does it deletes it. The ‘U’ specifies that we are interested in tables.
IF OBJECT_ID(‘[IDI_Sandpit].[DL-MAA2017-project_number].[table_name]’,’U’) IS NOT NULL DROP TABLE table_name;
When working with a temporary table called #temp_table, we would perform the same operation with the following code:
IF OBJECT_ID('tempdb..#temp_table','U') IS NOT NULL DROP TABLE #temp_table;
Joins are a commonly used set of functions in SQL. They allow you to join/combine the data from multiple tables into a single new table using one or more IDs that are common to both/all of the tables involved. There are a range of different joins that can be used to combine data. At the end of this document we will include some helpful SQL links, including some that go into the properties and characteristics of the different joins. The command ON is used when performing joins to specify which variables link the tables.
SELECT p.snz_uid, p.snz_sex_code as gender, h.moh_evt_evst_date FROM [IDI_Clean].[data].[personal_detail] AS p LEFT JOIN [IDI_Clean].[moh_clean].[pub_fund_hosp_discharges] AS h ON p.snz_uid = h.snz_uid;
The code above performs a LEFT JOIN. This means that every row of [personal_detail] will be present in the new table (at least once, but also potentially multiple times if it links to multiple items in [pub_fund_hosp_discharges]), while [pub_fund_hosp_discharges] rows will only be present if the snz_uid from that row also exists in [personal_detail]. When specifying the tables to be used, we also assign each of them an alias using AS. When declaring the variables we are selecting, the alias is added as a prefix to each variable, so p.snz_uid indicates that we are selecting the snz_uid from the [personal_detail] table. If there is no matching [pub_fund_hosp_discharges] data for one of the snz_uid in [personal_detail], then the hospitalisations column in the new table for that row will be blank. For this join snz_uid is the identifier that is common between the two tables. Even though we have not included it in the list of variables to be selected, snz_uid needs to exist in the [pub_fund_hosp_discharges] table (we have selected the birth table snz_uid only). If both tables have the same variable name (such as with snz_uid in this case), you need to either include only one copy (as we have here), or rename one of them in the second table. To include all variables from a table, you can use the ‘*’ after the alias.
Other useful types of joins are INNER JOIN (selects records that are common to both tables) and FULL JOIN (selects records that are in either table).
Using existing tables
You can also add a column to a table you have already created and stored in the Sandpit. This is done in two steps, you first ALTER the table to add a column of a specified data type, this will be an empty column to start with, and then in a separate command you UPDATE the table to add data to that column. There are many different data types that can be used in SQL; for a detailed list of the available options see the links at the end of the document.
ALTER TABLE table_name ADD new_variable INT, new_variable_2 VARCHAR(20);
The command above will add an empty integer variable called new_variable to table_name, and a character variable column (VARCHAR) with a maximum of 20 characters called new_variable_2. UPDATE can be used to assign values to both the new variables, and well as to overwrite values to others that were already in the table. The following code will update the rows where income is greater than 75,000, setting the new variable to 1, and changing Old_variable to “High income”. All other variables that haven’t been specified will remain unchanged.
UPDATE table_name SET new_variable = 1, new_variable_2 = “High income” WHERE income > 75000;
UPDATE can also be used to get data from another table by combining it with a JOIN. The following code sets new_variable in table_name equal to income from other_table, matching rows based on snz_uid. Note that we use the alias for table_name in the UPDATE command.
UPDATE t1 SET t1.new_variable = t2.income FROM table_name AS t1 LEFT JOIN other_table AS t2 ON t1.snz_uid = t2.snz_uid;
When running code in SQL Management Server that deletes, creates, or alters tables, GO can also be a useful command and can sometimes avoid errors. GO breaks code up into batches that are processed separately by the server. It is placed between batches of code like this:
IF OBJECT_ID(‘tempdb..#temp_population’,’U’) IS NOT NULL DROP TABLE #temp_population; GO SELECT * INTO #temp_population FROM [IDI_Clean].[dia_clean].[births] WHERE dia_bir_birth_year_nbr > 2000;
Running SQL code within other programmes
Within SAS you can run SQL code by writing ‘proc sql;’ and then the SQL code, followed by ‘quit;’. However, this executes that SQL code within SAS, requiring data to be moved from SQL servers to the SAS server. Therefore in many cases it is not any more efficient than running the equivalent SAS code. One exception is with joins, where SQL code can be more efficient as datasets do not need to be sorted prior to joining.
The most efficient way to use SQL within SAS is to use the ‘passthrough’ procedure. This passes the code to the SQL server to be executed there, and exports the results back to SAS. It is therefore more efficient than executing the code on the SAS server. Here is an example that will access the pho_enrolment table in the IDI_Clean refresh and create a table called pho_raw in the SAS working directory:
proc sql; connect to odbc (dsn="idi_clean_20171020_srvprd"); create table pho_raw as select * from connection to odbc (select snz_uid, moh_pho_sex_snz_code as sex, moh_pho_ethnicity_1_code from moh_clean.pho_enrolment ); disconnect from odbc; quit;
The SQL databases within the IDI can also be accessed within R. This is done using the package ‘RODBC’. First initialise the library, making sure you’ve installed it first:
then specify the connection that will be used with the RODBC function odbcDriverConnect. The following code illustrates this, creating a variable called Con that contains the connection that you will use inside the IDI:
Con <- odbcDriverConnect(‘Driver=ODBC Driver 11 for SQL Server; Trusted_Connection=YES; Server=WPRDSQL36.stats.govt.nz,49530;Database=IDI_Clean’)
In this case, the connection is being set up to access data from IDI_Clean. If we wanted to instead access the Sandpit or an earlier refresh this would have to be specified in this connection.
Now the RODBC function sqlQuery can be used to send SQL commands to the database from within R. By default sqlQuery takes two inputs, the first being the connection used (which we have specified as Con in the above code segment), and the second is a character string containing the SQL code to be run. The following code will create a variable in R called births that will contain all birth records for those born after the year 2000.
births <- sqlQuery(Con, ‘SELECT * FROM [IDI_Clean].[dia_clean].[births] WHERE [dia_bir_birth_year_nbr] > 2000;’)
When working with very large data sets there may be advantages to only loading the selected data into R in batches. The sqlQuery command has an optional input called max, which specifies the maximum number of rows to transfer to R. Remaining rows can then be retrieved using the sqlGetResults command, which only requires the connection channel (Con in this case) and the new max number of rows. The following code again selects all births after the year 2000, but only the top 100 rows, and then uses sqlGetResults to access and append the next 100 rows. close(Con) closes the connection we have opened to the SQL server.
births <- sqlQuery(Con, ‘SELECT * FROM [IDI_Clean].[dia_clean].[births] WHERE [dia_bir_birth_year_nbr] > 2000;’,max=100) births <- rbind(births, sqlGetResults(Con,max=100)) close(Con)
At the time of writing it was not possible to connect directly to an SQL database via Stata within the datalab environment.
PDF available here (for a printable version)
By Oliver Robertson and Sheree Gibb. Thanks to Maddie White for helpful comments on an earlier draft.
Original post 14 August 2018