High-Dimensional Data Analysis

Last updated on May 12, 2020. For current content related to this research, please visit Runze Li’s website.

High-dimensional data, including genetic data, are becoming increasingly available as data collection technology evolves. Behavioral scientists need powerful, effective analytic methods to glean maximum scientific insight from these data.

Over the last few years, Runze Li and other statisticians have been developing new methods for analyzing high-dimensional data. Now, Center researchers are extending these methods for use in behavioral research focused on, for example, preventing drug abuse and HIV-risk behavior. Future statistical work will develop methods to analyze genetic data simultaneously with intensive longitudinal data. This work will allow scientists to identify which genetic, individual, and social factors predict drug abuse, HIV-risk behavior, and related health behaviors.

Runze Li
John Dziak
Lisa Dierker
Helen Kamens

High-Dimensional Variable Screening

In genetic studies, the number of variables is extremely large relative to the number of participants: there may be hundreds of subjects and hundreds of thousands of variables. This severely limits exploratory data analysis because many standard multivariate procedures break down when the number of variables exceeds the sample size. Ordinary least squares regression, for example, no longer has a unique solution. As a result, it is necessary to reduce the full set of variables to a subset of predictors that potentially affect the outcome of interest. High-dimensional variable-screening procedures allow researchers to narrow the set of candidate variables before the main analysis.
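To make the idea concrete, here is a minimal sketch of one simple screening rule: rank each predictor by the absolute value of its marginal correlation with the outcome and keep the top few. This illustrates the general approach only; it is not the interface or the algorithm of the VariableScreening R package. The function names (`pearson`, `screen`), the simulated data, and the cutoff of 20 are all assumptions made for the example.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def screen(columns, y, keep):
    """Return the indices of the `keep` predictors with the largest
    absolute marginal correlation with the outcome y."""
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Simulated toy data (hypothetical, not from any study):
# 100 subjects, 2000 predictors, and only predictors 0 and 1
# actually influence the outcome.
rng = random.Random(0)
n, p = 100, 2000
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [3 * columns[0][i] - 3 * columns[1][i] + rng.gauss(0, 1) for i in range(n)]

selected = screen(columns, y, keep=20)
print(selected)
```

In this toy setting the two truly active predictors have much stronger marginal correlations than the noise predictors, so they survive the screen and the remaining 20 variables can be passed to a conventional model. Marginal correlation can miss predictors that matter only jointly with others, which is one reason more refined screening procedures exist.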

We developed the VariableScreening R package to allow researchers to screen for meaningful variables.

Read about our statistical work on variable screening.

High-Dimensional Variable Selection

Other types of genetic studies focus on specific genes. In these studies the sample size is somewhat larger than the number of predictors (e.g., 500 subjects and 300 variables), and many of the variables are often highly correlated. A model that retains every variable may include many unimportant ones, and it may predict less accurately and be harder to interpret.

In these cases, a more parsimonious model becomes desirable. Approaches such as penalized least squares and penalized likelihood with the smoothly clipped absolute deviation (SCAD) penalty can select significant variables. We are developing broadly applicable techniques for high-dimensional variable selection. We also developed PROC SCAD, a pair of SAS procedures using the SCAD penalty for high-dimensional variable selection.
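As an illustration of how the SCAD penalty selects variables, the sketch below writes out the penalty function and the closed-form thresholding rule it produces for a single coefficient in the special case of orthonormal predictors. This is not the code behind PROC SCAD. The function names are hypothetical, and the default `a = 3.7` is the value conventionally used with SCAD, assumed here rather than taken from this page.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for one coefficient (requires lam > 0 and a > 2)."""
    t = abs(theta)
    if t <= lam:
        # Small coefficients: same linear penalty as the lasso.
        return lam * t
    if t <= a * lam:
        # Intermediate coefficients: quadratic piece that tapers the penalty.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return lam * lam * (a + 1) / 2


def scad_threshold(z, lam, a=3.7):
    """Penalized least squares solution for one coefficient when the
    predictors are orthonormal; z is the unpenalized estimate."""
    sign = 1.0 if z >= 0 else -1.0
    t = abs(z)
    if t <= 2 * lam:
        # Soft thresholding: estimates at or below lam are set exactly to zero.
        return sign * max(t - lam, 0.0)
    if t <= a * lam:
        # Shrinkage is gradually relaxed.
        return ((a - 1) * z - sign * a * lam) / (a - 2)
    # Large estimates are left unchanged.
    return z


for z in (0.5, 1.5, 3.0, 5.0):
    print(z, scad_threshold(z, lam=1.0))
```

The thresholding rule shows the two properties that motivate SCAD: small estimates are set exactly to zero, which removes the corresponding variables from the model, while large estimates pass through without shrinkage. With correlated predictors no such closed form applies, and the penalized criterion has to be minimized iteratively.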

Read about our statistical work in variable selection, organized by data type.
