In this task, you will check for outliers and their potential impact using the following steps:
Before you analyze your data, it is very important that you check the distribution and normality of the data and identify outliers for continuous variables.
Use the summarize command with the detail option to get descriptive statistics (mean, minimum and maximum values, standard deviation, and skewness, among others) for the participants who were interviewed and examined in the MEC and who were age 20 years and older. Use the histogram command with the normal option to graph the continuous variable cholesterol and overlay the normal distribution curve. Use the graph box command to draw a box plot of the continuous variable cholesterol.
summarize lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, detail
histogram lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, normal
graph save "C:\STATA\tutorial\descriptive\histogram_discriptive.gph", replace
graph box lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, medtype(line)
graph save "C:\STATA\tutorial\descriptive\box_plot.gph", replace
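The detailed summary also leaves its results in r(), which makes it possible to list the observations in the tails rather than reading them off the graphs. The following is a minimal sketch, not part of the original steps: the 1st and 99th percentile cutoffs and the local macro names lo and hi are illustrative assumptions, and the dataset used above is assumed to be in memory.

```stata
* Sketch: list observations in the extreme tails of total cholesterol.
* The p1/p99 cutoffs are an illustrative assumption, not a rule from this tutorial.
quietly summarize lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, detail
local lo = r(p1)
local hi = r(p99)
display "skewness = " r(skewness) "   p1 = " `lo' "   p99 = " `hi'

* lbxtc < . excludes missing values, which Stata treats as larger than any number
list seqn lbxtc wtmec4yr if (lbxtc < `lo' | lbxtc > `hi') & lbxtc < . & (ridageyr >=20 & ridageyr <.) & ridstatr==2
```

Tighter or looser cutoffs can be substituted; the listing is only a screening aid and does not by itself establish that a value is an outlier.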
Highlighted items from the univariate analysis output:
In this example, you will plot the 4-year MEC survey weight (wtmec4yr) against the distribution of the cholesterol variable to determine whether the extreme observations are outliers.
Use the graph twoway scatter command to plot each observation's sample weight (wtmec4yr) against its total serum cholesterol (lbxtc) for participants who were interviewed and examined in the MEC and who were age 20 years and older. Use the mlabel option to label each point with its sequence number (seqn) so that the extreme values can be identified in the output.
graph twoway scatter wtmec4yr lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, mlabel(seqn) title("NHANES 1999-2002: adults age 20 years and older")
Highlighted items from plotting the survey weight against the distribution of the cholesterol variable:
In this step you will:
Use the label define and label values commands to attach descriptive value labels to the race/ethnicity variable (ridreth1).
Use the drop command to delete the outliers using their SEQNs previously identified in the plot of survey weight versus distribution of the variable. The SEQNs associated with these outliers are labeled on the scatter plot output under plot exam weight against cholesterol.
Use the mean command to determine the mean and standard error by race/ethnicity for both the dataset with the outliers and the dataset without the outliers, for participants who were interviewed and examined in the MEC and who were age 20 years and older. Use the pweight option to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data is used.
label define race 1 "Mex American"
label define race 2 "Other Hispanic", add
label define race 3 "NH White", add
label define race 4 "NH Black", add
label define race 5 "Other Race - Including Multi-Racial", add
label values ridreth1 race
drop if seqn==10494 | seqn==13866 | seqn==17821
save C:\Nhanes\Data\exclu_3sps, replace
// mean total cholesterol without extreme values
mean lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2 [pweight=wtmec4yr], over(ridreth1)
// Mean of serum total cholesterol - including outliers
use C:\Nhanes\Data\demo_bp2b, clear
// mean total cholesterol with extreme values included
mean lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2 [pweight=wtmec4yr], over(ridreth1)
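As an alternative to dropping observations and saving a second file, the same comparison can be made by flagging the three observations and storing both sets of estimates. This is only a sketch under stated assumptions: the variable name outlier and the stored-estimate names with_outliers and without_outliers are hypothetical, and the full dataset loaded above is assumed to still be in memory.

```stata
* Sketch: flag the three extreme observations instead of dropping them.
* "outlier" is a hypothetical variable name introduced for illustration.
generate byte outlier = inlist(seqn, 10494, 13866, 17821)

* Means including the flagged observations
mean lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2 [pweight=wtmec4yr], over(ridreth1)
estimates store with_outliers

* Means excluding the flagged observations
mean lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2 & !outlier [pweight=wtmec4yr], over(ridreth1)
estimates store without_outliers

* Side-by-side means and standard errors for the two runs
estimates table with_outliers without_outliers, se
```

Keeping the flag in the dataset leaves the original records intact, so the excluded SEQNs remain documented and the exclusion is easy to reverse.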
Highlighted items from comparison of the results with and without outliers: