The SAS procedure proc univariate generates descriptive and summary statistics that describe the characteristics of a distribution. These statistics can also help determine whether parametric tests (which assume a normal distribution) or non-parametric tests are appropriate for your analysis. As noted in the Clean & Recode Data module, it is advisable to check for extreme weights and outliers before starting any analysis.
Use proc univariate to generate descriptive statistics. The frequency distribution can be presented in tabular or graphic form. The freq option generates the frequency distribution in tabular form by listing the number of observations for each value of the variable. Because the sample is large and a continuous variable can take a long list of distinct values, requesting the freq option is practical only for nominal or ordinal variables. The plot option generates the frequency distribution in graphic form (histogram, box, and normal probability plots), and the normal option generates statistics to test the normality of the distribution.
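As a rough illustration of the freq option described above, the following sketch requests a frequency table for a nominal variable. It assumes the dataset analysis_data used elsewhere in this tutorial and that gender (riagendr) is stored as a numeric code, since the var statement accepts only numeric variables.

```sas
/* Sketch only: frequency table for a nominal variable.          */
/* Assumes analysis_data exists and riagendr is a numeric code.  */
proc univariate data=analysis_data freq;
  var riagendr;   /* few distinct values, so the table stays short */
run;
```

Requesting the same option for a continuous variable such as total cholesterol would list every distinct value, which is why the option is better reserved for nominal or ordinal variables.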
These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.
Statements | Explanation |
---|---|
proc sort data=analysis_data; by riagendr age; run; | Use the sort procedure to sort the data by the same variables used in the by statement of the univariate procedure. In this example, the data are sorted by gender (riagendr) and age (age). |
proc univariate PLOT NORMAL; | Use the univariate procedure to generate descriptive statistics, which include the number of missing values, mean, standard errors, percentiles, and extreme values. Use the plot option to generate histogram, box, and normal probability plots, and the normal option to generate statistics to test normality. In this example, plots (plot) and normality test statistics (normal) are requested, and the results are generated separately for each combination of the variables in the by statement. |
where ridageyr >= 20; | Use the where statement to select participants 20 years and older. |
by riagendr age; | The by statement determines the groups (all combinations of the variables listed in the by statement) for which separate descriptive statistics are produced. This statement should match the by statement in the preceding sort procedure. |
VAR lbxtc; | Use the var statement to indicate the variable(s) for which descriptive measures are requested. In this example, the total cholesterol variable (lbxtc) is used. |
FREQ wtmec4yr; run; | Use the freq statement with the appropriate sample weight. In this example, the 4-year examination weight (wtmec4yr) is used. WARNING: The freq statement, with the appropriate sample weight, yields an estimate of the standard deviation whose denominator is an estimate of the population size, i.e., the sum of the sample weights. Using the weight statement instead of the freq statement yields an estimate of the standard deviation whose denominator is the sample size. |
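Put together, the statements in the table above might look like the following program. This is a sketch assembled from the table rather than a verified copy of the tutorial's program; the explicit data= on proc univariate is an assumption added for clarity (without it, SAS uses the most recently created dataset).

```sas
/* Sort by the BY variables first; BY-group processing expects sorted input. */
proc sort data=analysis_data;
  by riagendr age;
run;

/* data=analysis_data is an assumption added here for clarity. */
proc univariate data=analysis_data plot normal;
  where ridageyr >= 20;   /* participants 20 years and older              */
  by riagendr age;        /* separate output for each gender-by-age group */
  var lbxtc;              /* total cholesterol                            */
  freq wtmec4yr;          /* 4-year examination weight as a frequency     */
run;
```

One point to keep in mind when reading the results: the freq statement treats each weight as a count of observations, and SAS truncates non-integer frequency values to integers, so the fractional part of each sample weight is dropped.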
The univariate procedure generates extensive descriptive statistics, including moments, percentiles, extremes, missing values, basic statistical measures, and tests for location. Below is a snapshot of the output from the SAS program, showing the result of using the plot and normal options.
In some instances, you may not need all of the statistics generated by proc univariate. You can use proc univariate to select a few descriptive statistics and output the results to a SAS dataset to view.
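A minimal sketch of this approach is shown here, assuming the same dataset, subgroup, and weight as the first program. The output dataset name (stats_out) and the output variable names (n_tc, mean_tc, std_tc, median_tc) are hypothetical and chosen only for illustration.

```sas
/* Sketch only: keep a few statistics in a dataset instead of printing everything. */
/* stats_out, n_tc, mean_tc, std_tc, and median_tc are hypothetical names.         */
proc sort data=analysis_data;
  by riagendr age;
run;

proc univariate data=analysis_data noprint;  /* noprint suppresses the default output */
  where ridageyr >= 20;
  by riagendr age;
  var lbxtc;
  freq wtmec4yr;
  output out=stats_out n=n_tc mean=mean_tc std=std_tc median=median_tc;
run;

/* View the selected statistics, one row per gender-by-age group. */
proc print data=stats_out;
run;
```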
Statements | Explanation |
---|---|
proc sort data=analysis_data; by riagendr age; run; | Use the sort procedure to sort the data by the same variables that will be used in the by statement of the univariate procedure. In this example, the data are sorted by gender (riagendr) and age (age). |
proc univariate NOPRINT; | Use the univariate procedure to generate descriptive statistics. Use the noprint option to suppress the detailed default descriptive statistics. |
where ridageyr >= |