NHANES Environmental Chemical Data Tutorial - Descriptive Statistics

Task 3: How to Generate Geometric Means Using SUDAAN

In this example, you will use SAS-callable SUDAAN to generate tables of geometric means, standard errors and 95% confidence intervals for Mono-(2-ethyl)-hexyl phthalate by age, sex, race-ethnicity, and survey cycle.
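For reference, a weighted geometric mean is usually defined as the exponentiated weighted mean of the log-transformed values. The formula below is a sketch of that standard definition, not a statement of SUDAAN's exact estimator (the SUDAAN manual is the authority for that). The symbols are assumptions introduced here: $x_i$ is the analyte concentration for participant $i$ (URXMHP in this example) and $w_i$ is that participant's sample weight (WTSPH6YR in this example).

```latex
\mathrm{GM} \;=\; \exp\!\left( \frac{\sum_{i} w_i \,\ln x_i}{\sum_{i} w_i} \right)
```

Because the logarithm is taken first, the geometric mean is less influenced by a few very high concentrations than the arithmetic mean is.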

Step 1: Use proc descript to generate geometric means in SUDAAN

To calculate the geometric means and standard errors, you will use SAS-callable SUDAAN because this software takes into account the complex survey design of NHANES data when determining variance estimates. The SUDAAN procedure, proc descript, is used to generate geometric means and standard errors. The output statement is used to output those estimates along with the sample size (nsum), i.e., the number of survey participants with known values for the variable of interest. The general program for obtaining weighted geometric means and standard errors is below.

WARNING

The design variables, sdmvstra and sdmvpsu, are provided in the demographic data files and are used to calculate variance estimates.
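Before running SUDAAN, it can help to confirm that the design variables are present and populated in the analysis dataset. The following is a minimal sketch, assuming the nh libref is already assigned and points to the library that holds Phthalate_analysis_data. The output dataset name work.phthalate_sorted is hypothetical and used only for illustration.

```sas
/* Sketch: list each stratum/PSU combination, including any missing values */
proc freq data=nh.Phthalate_analysis_data;
  tables sdmvstra*sdmvpsu / list missing;
run;

/* Optional alternative to the notsorted option: sort by the design variables */
/* work.phthalate_sorted is a hypothetical dataset name                       */
proc sort data=nh.Phthalate_analysis_data out=work.phthalate_sorted;
  by sdmvstra sdmvpsu;
run;
```

If you sort the data first and point proc descript at the sorted dataset, the notsorted option should not be needed. The program in this task does not sort the data and relies on notsorted instead.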

Generate Means in SUDAAN
Statements	Explanation
proc descript data=nh.Phthalate_analysis_data design=WR atlevel1=1 atlevel2=2 noprint notsorted ;	Use the proc descript procedure to generate geometric means and specify the sample design using the design option WR (with replacement). The data option refers to the permanent dataset, Phthalate_analysis_data, created in module 10. The noprint option limits the output that is printed. The notsorted option is used because the dataset was not sorted by strata (sdmvstra) and PSU (sdmvpsu) with the SAS procedure proc sort. The atlevel1=1 (strata) and atlevel2=2 (PSU) options are necessary to calculate the degrees of freedom for the t-statistic.
NEST sdmvstra sdmvpsu;	Use the nest statement with strata (sdmvstra) and PSU (sdmvpsu) to account for the design effects.
weight WTSPH6YR;	Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the 6-year Phthalate Subsample Weight (WTSPH6YR) is used.
subgroup age5cat riagendr reth4cat sddsrvyr ;	Use the subgroup statement to list the categorical variables for which statistics are requested. This example uses 5 age categories (age5cat), gender (riagendr), race-ethnicity (reth4cat) and survey cycle (sddsrvyr). These variables also appear in the table statement.
levels 5 2 3 3 ;	Use the levels statement to define the number of categories in each of the subgroup variables. The level must be an integer greater than 0. This example uses five age categories, two genders, three race-ethnicity groups and three survey cycles.
var URXMHP;	Use the var statement to name the variable(s) to be analyzed. In this example, the Mono-(2-ethyl)-hexyl phthalate variable (URXMHP) is used.
table age5cat riagendr reth4cat sddsrvyr;	Use the table statement to specify tabulations for which estimates are requested. If a table statement is not present, a one-dimensional distribution is generated for each variable in the subgroup statement. In this example the estimates are for age categories (age5cat), gender (riagendr), race-ethnicity (reth4cat) and survey cycle (sddsrvyr).
output nsum geomean segeomean atlev2 atlev1/ filename=out3d replace;	Use the output statement to create a dataset that can be used to calculate a 95% confidence interval for the geometric mean. In this example, the sample size (nsum), geometric mean (geomean), standard error of the geometric mean (segeomean), number of PSUs (atlev2) and number of strata (atlev1) are output to a dataset named out3d. The replace option is necessary when creating a dataset in the output statement. Note: For a complete list of statistics that can be requested on the output statement, see the SUDAAN User's Manual.
rtitle "Geometric Means of Mono-(2-ethyl)-hexyl phthalate and standard error for age, sex, race-ethnicity and survey cycle: NHANES 1999-2004" ;	Use the rtitle statement to assign a heading for each page of output.
run ;	The run statement signifies the end of the procedure.
data Result1 (keep=Analyte Demo Nsum geomean segeomean L95CI_T U95CI_T) ;	Create a new dataset (Result1) to calculate the 95% confidence interval of the geometric mean. The keep= dataset option indicates the variables that will be kept in the final dataset.
set out3d;	The set statement sets the dataset that was created in the output statement above.
dfs=atlev2-atlev1;	Calculate the degrees of freedom used with the t-statistic. Degrees of freedom are calculated by subtracting the number of strata (atlev1) from the number of PSUs (atlev2).
L95CI_T=round(geomean+tinv(.025, dfs)*segeomean, .01); U95CI_T=round(geomean+tinv(.975, dfs)*segeomean, .01);	Calculate the lower (L95CI_T) and upper (U95CI_T) limits of the 95% confidence interval of the geometric mean using the t-distribution (tinv) and the standard error of the geometric mean (segeomean) calculated in the proc descript procedure above. Because tinv(.025, dfs) is negative, adding its product with segeomean gives the lower limit.
length Analyte $24 Demo $16;	Use the length statement to define the lengths of new variables created in the data step.
if variable=1 then Analyte="MPH (ng/ml)"; if age5cat=0 then demo="All"; else if age5cat=1 then demo="6-11 years"; else if age5cat=2 then demo="12-19 years"; else if age5cat=3 then demo="20-39 years"; else if age5cat=4 then demo="40-59 years"; else if age5cat=5 then demo="60+ years"; if riagendr=1 then demo="Male"; else if riagendr=2 then demo="Female"; if reth4cat=1 then demo="NH white"; else if reth4cat=2 then demo="NH black"; else if reth4cat=3 then demo="Mexican American"; if sddsrvyr=1 then demo="99-00"; else if sddsrvyr=2 then demo="01-02"; else if sddsrvyr=3 then demo="03-04";	Define the new variables Analyte and Demo using if/then statements, so that each output row is labeled with the analyte and the demographic group it describes.
if demo^=' ';	Only select records where the demo variable is not equal (^=) to missing.
Run;	The run statement signifies the end of the data step.
proc print data=Result1 noobs;	Use proc print to view the results of the data step.
id Analyte Demo;	The id statement identifies the analyte and the demographic variable being displayed.
var Nsum geomean segeomean L95CI_T U95CI_T;	The var statement identifies the variables to be displayed in the output window.
title "Geometric mean of Mono-(2-ethyl)-hexyl phthalate by demographics and survey cycle";	The title statement assigns a heading for the output.
footnote1 "Some results may differ from Environ. Exp. Report due to updates in file";	The footnote statement assigns a footnote for the output.
Run;	The run statement signifies the end of the procedure.
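The confidence interval arithmetic in the data step above can be tried on its own. The following is a minimal sketch using hypothetical input values (they are not NHANES results), with the same variable names as the program. The dataset name ci_sketch and the helper variables t_lower and t_upper are assumptions introduced for illustration.

```sas
/* Sketch: degrees of freedom and t-based 95% CI from hypothetical inputs */
data ci_sketch;
  geomean   = 3.50;  /* hypothetical geometric mean (ng/mL)  */
  segeomean = 0.20;  /* hypothetical standard error          */
  atlev1    = 43;    /* hypothetical number of strata        */
  atlev2    = 87;    /* hypothetical number of PSUs          */

  dfs = atlev2 - atlev1;       /* degrees of freedom: PSUs minus strata   */
  t_lower = tinv(.025, dfs);   /* 2.5th percentile of t (negative value)  */
  t_upper = tinv(.975, dfs);   /* 97.5th percentile of t (positive value) */

  L95CI_T = round(geomean + t_lower*segeomean, .01);
  U95CI_T = round(geomean + t_upper*segeomean, .01);

  /* Write the intermediate and final values to the SAS log */
  put dfs= t_lower= t_upper= L95CI_T= U95CI_T=;
run;
```

Because the t-distribution is symmetric, t_lower and t_upper have the same magnitude and opposite signs, so the interval is centered on geomean.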

Step 2: Review output

The output lists the sample sizes, geometric means, standard errors of the geometric means, and the lower and upper 95% confidence limits for each of the demographic groups.

Also notice that the geometric mean in 2003-2004 (2.34 ng/mL) is larger than the median (50th percentile) result of 1.9 ng/mL from the percentile program in Task 2.
