Continuous NHANES Web Tutorial: Clean & Recode Data: Task 3c

Task 3c: How to Identify Outliers and Evaluate Their Impact Using Stata

In this task, you will check for outliers and their potential impact using the following steps:

Run a univariate analysis to obtain all default descriptive statistics.
Plot survey weight against the distribution of the variable.
Identify outliers and compare the outlier-deleted estimates with the original estimates that include the outliers.

Step 1: Check Distributions by Running a Univariate Analysis

Before you analyze your data, check the distribution and normality of each continuous variable and identify any outliers.

Program to Plot Distribution of Continuous Variable

Use the summarize command with the detail option to get descriptive statistics (such as the mean, minimum and maximum values, standard deviation, and skewness) for participants who were interviewed and examined in the MEC and who were age 20 years and older. Use the histogram command with the normal option to graph the continuous cholesterol variable and overlay a normal density curve. Use the graph box command to draw a box plot of the cholesterol variable.

summarize lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, detail

histogram lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, normal
graph save "C:\STATA\tutorial\descriptive\histogram_discriptive.gph", replace

graph box lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, medtype(line)
graph save "C:\STATA\tutorial\descriptive\box_plot.gph", replace

Highlighted items from the univariate analysis output:

In this example, five extreme serum cholesterol values above 475 mg/dL are identified in the distribution.
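
To list the observations behind the extreme values rather than reading them off the output, one option is to use the percentiles that summarize, detail leaves in r(). The sketch below is not part of the tutorial program: it runs on a small simulated dataset that only borrows the NHANES variable names, and the 1.5 x IQR upper fence and the names upper_fence and high_tc are assumptions chosen for illustration.

```stata
* Simulated stand-in for the analysis file (values are NOT NHANES data)
clear
set seed 12345
set obs 500
generate seqn     = _n
generate ridageyr = 20 + floor(60*runiform())
generate ridstatr = 2
generate wtmec4yr = 5000 + 40000*runiform()
generate lbxtc    = rnormal(200, 40)
replace  lbxtc    = 650 in 1/3    // three planted extreme values

* Weighted descriptive statistics; percentiles are stored in r(p25), r(p75), ...
summarize lbxtc [aweight=wtmec4yr] if (ridageyr >= 20 & ridageyr < .) & ridstatr == 2, detail

* Upper fence from the 1.5 x IQR rule of thumb (assumed rule, not from the tutorial)
scalar upper_fence = r(p75) + 1.5*(r(p75) - r(p25))

* Flag and list candidate outliers; lbxtc < . keeps missing values out of the flag
generate byte high_tc = lbxtc > scalar(upper_fence) & lbxtc < .
list seqn lbxtc wtmec4yr if high_tc == 1
```

A rule of thumb like this only nominates candidates; whether a value is treated as an outlier still depends on the checks in the following steps.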

Step 2: Plot a Graph of Survey Weight Against the Distribution of the Variable

In this example, you will plot the 4-year MEC survey weight (wtmec4yr) against the distribution of the cholesterol variable to determine whether the extreme observations are outliers.

Plot Exam Weight Against Cholesterol

Use the graph twoway scatter command to plot each observation's sample weight (wtmec4yr) against its total serum cholesterol (lbxtc) for participants who were interviewed and examined in the MEC and who were age 20 years and older. Use the mlabel option to label each point with its sequence number (seqn) so that the extreme values can be identified in the output.

graph twoway scatter wtmec4yr lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, mlabel(seqn) title("NHANES 1999-2002: adults age 20 years and older")

Highlighted items from plotting the survey weight against the distribution of the cholesterol variable:

Three outliers with serum cholesterol values higher than 600 mg/dl are identified from the plot.
None of these three observations has an extremely large survey weight.
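
The same check can be made numerically by listing the extreme observations next to their weights. The sketch below is illustrative only: it uses simulated data with the NHANES variable names, and the choice of the 95th percentile as the cutoff for a "large" weight, along with the names wt_p95, extreme_tc, and large_wt, is an assumption rather than an NHANES rule.

```stata
* Simulated stand-in for the analysis file (values are NOT NHANES data)
clear
set seed 2002
set obs 500
generate seqn     = 10000 + _n
generate ridageyr = 20 + floor(60*runiform())
generate ridstatr = 2
generate wtmec4yr = 5000 + 40000*runiform()
generate lbxtc    = rnormal(200, 40)
replace  lbxtc    = 650 in 1/3    // three planted extreme values

* 95th percentile of the exam weight in the analysis sample (assumed cutoff)
summarize wtmec4yr if (ridageyr >= 20 & ridageyr < .) & ridstatr == 2, detail
scalar wt_p95 = r(p95)

* Flag cholesterol values above 600 mg/dL and weights above the cutoff
generate byte extreme_tc = lbxtc > 600 & lbxtc < .
generate byte large_wt   = wtmec4yr > scalar(wt_p95) & wtmec4yr < .

* An extreme value with large_wt == 1 would have more influence on weighted estimates
list seqn lbxtc wtmec4yr large_wt if extreme_tc == 1
```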

Step 3: Identify Outliers and Compare Estimates with Outliers Deleted Against the Original Estimates with Outliers Included

In this step you will:

delete the three outliers identified in the plot above using the SEQN numbers; and
compare the mean of the new dataset without the outliers against the mean of the original dataset that includes the outliers to check the impact of the outlier observations.

Program to Create Dataset Without Outliers and Output Means of Both Datasets

Use the label define and label values commands to attach descriptive value labels to the race/ethnicity variable (ridreth1).

Use the drop command to delete the outliers by their SEQNs, which were identified in the plot of survey weight against the distribution of the variable. The SEQNs associated with these outliers are labeled on the scatter plot output from "Plot Exam Weight Against Cholesterol" above.

Use the mean command to estimate the mean and standard error by race/ethnicity, for both the dataset with the outliers and the dataset without them, among participants who were interviewed and examined in the MEC and who were age 20 years and older. Use the pweight specification to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

label define race 1 "Mex American"
label define race 2 "Other Hispanic", add
label define race 3 "NH White", add
label define race 4 "NH Black", add
label define race 5 "Other Race - Including Multi-Racial", add
label values ridreth1 race

drop if seqn==10494 | seqn==13866 | seqn==17821
save C:\Nhanes\Data\exclu_3sps, replace

// Mean total cholesterol with the three outliers excluded
mean lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2 [pweight=wtmec4yr], over(ridreth1)

// Reload the original dataset, which still contains the outliers
use C:\Nhanes\Data\demo_bp2b, clear
// Mean total cholesterol with the outliers included
mean lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2 [pweight=wtmec4yr], over(ridreth1)

Highlighted items from comparison of the results with and without outliers:

In this example, the outliers do not significantly affect mean cholesterol values of the race/ethnicity subgroups or the overall mean. Therefore, you will use the dataset with the outliers for your analysis.
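
The size of the difference can also be expressed as a percent change in the weighted mean. The sketch below is a hedged illustration, not the tutorial's program: it uses simulated data with the NHANES variable names, compares only the overall mean (no over() groups), and marks outliers with a hypothetical flag variable named outlier instead of dropping them and reloading the file.

```stata
* Simulated stand-in for the analysis file (values are NOT NHANES data)
clear
set seed 1999
set obs 600
generate seqn     = _n
generate ridageyr = 20 + floor(60*runiform())
generate ridstatr = 2
generate wtmec4yr = 5000 + 40000*runiform()
generate lbxtc    = rnormal(200, 40)
replace  lbxtc    = 650 in 1/3    // three planted extreme values

* Flag the outliers by sequence number instead of dropping them
generate byte outlier = inlist(seqn, 1, 2, 3)

* Weighted overall mean with the outliers included; e(b) holds the estimate
mean lbxtc if (ridageyr >= 20 & ridageyr < .) & ridstatr == 2 [pweight=wtmec4yr]
matrix b_all = e(b)

* Weighted overall mean with the outliers excluded
mean lbxtc if (ridageyr >= 20 & ridageyr < .) & ridstatr == 2 & outlier == 0 [pweight=wtmec4yr]
matrix b_trim = e(b)

* Percent change in the mean attributable to the flagged observations
scalar pct_change = 100*(b_all[1,1] - b_trim[1,1]) / b_all[1,1]
display "Percent change in overall mean from removing outliers: " %5.2f scalar(pct_change)
```

Keeping the outliers in the dataset behind a flag means both estimates come from the same file, which avoids saving and reloading a second copy.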