Data

In our experiment, we analyzed data from over 2,800 individuals drawn from six large clinical neuroimaging studies: the Alzheimer’s Disease Neuroimaging Initiative (ADNI, www.adni-info.org), the Open-Access Series of Imaging Studies (OASIS, oasis-brains.org), the Autism Brain Imaging Data Exchange (ABIDE, tinyurl.com/fcon1000-abide), the Attention Deficit Hyperactivity Disorder (ADHD) sample from the ADHD-200 Consortium (tinyurl.com/fcon1000-adhd), the Center for Biomedical Research Excellence (COBRE) schizophrenia sample (tinyurl.com/fcon1000-cobre), and the MIND Clinical Imaging Consortium (MCIC) schizophrenia sample (coins.mrn.org).

We used FreeSurfer (surfer.nmr.mgh.harvard.edu, distribution version 5.1), a freely available, widely used, and extensively validated brain MRI analysis software package, to process the structural brain MRI scans and compute morphological measurements.

Please use the individual data links at the end of this page to access the FreeSurfer-derived measurements.

To obtain the raw MRI data (for example, in order to employ an alternative image processing pipeline), we recommend that users request access from the original studies.

In our analyses, we defined four sets of measurements (or feature types) to be used by the prediction models. These were:

1) Feature set 1 (aseg): Volumes of the following anatomical structures (saved as stats/aseg.stats under the FreeSurfer subject directory), normalized by each subject’s intracranial volume (ICV) to account for head size variation: left and right cerebral white matter, cerebral cortex, lateral ventricle, inferior lateral ventricle, cerebellum white matter, cerebellum cortex, thalamus proper, caudate, putamen, pallidum, hippocampus, and amygdala, plus the 3rd and 4th ventricles.

2) Feature set 2 (aparc): Average thickness within the following cortical parcellations (saved as stats/lh.aparc.stats and stats/rh.aparc.stats under the FreeSurfer subject directory): superior frontal, rostral middle frontal, caudal middle frontal, pars opercularis, pars triangularis, pars orbitalis, lateral orbitofrontal, medial orbitofrontal, precentral, paracentral, frontal pole, superior parietal, inferior parietal, supramarginal, postcentral, precuneus, superior temporal, middle temporal, inferior temporal, banks of the superior temporal sulcus, fusiform, transverse temporal, entorhinal, temporal pole, parahippocampal, lateral occipital, lingual, cuneus, pericalcarine, rostral anterior (frontal), caudal anterior (frontal), posterior (parietal), isthmus (parietal).

3) Feature set 3 (aparc+aseg): The union of the first two feature sets.

4) Feature set 4 (thick): Cortical thickness values sampled onto the fsaverage5 template (which contains 3,414 vertices per hemisphere) and smoothed on the surface with an approximately Gaussian kernel with a full-width-at-half-maximum (FWHM) of 5 mm. Note that we make these measurements available as sampled onto fsaverage (a higher-resolution template); the first 3,414 values in each file correspond to the fsaverage5 vertices.
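Given the note in feature set 4, users working from the fsaverage-sampled files can recover the fsaverage5 data with a simple slice. A minimal Python sketch follows; the array length and variable names are illustrative placeholders, not part of the distribution:

```python
import numpy as np

N_FS5 = 3414  # per-hemisphere fsaverage5 vertex count, as stated above

# Placeholder for one subject's left-hemisphere thickness row as distributed
# on the higher-resolution fsaverage template (length here is arbitrary).
thick_lh_fsaverage = np.zeros(10242)

# Keep only the leading entries, which correspond to the fsaverage5 vertices.
thick_lh_fs5 = thick_lh_fsaverage[:N_FS5]
```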

HOW TO INTERPRET/USE THE DOWNLOADED DATA

For each target variable, there is a corresponding Matlab (.mat) file, which contains the following Matlab variables:

  • data_aseg: a matrix in which each row is a subject and each column is the volume of a brain structure normalized by head size, i.e., intracranial volume (ICV).
  • labels_aseg: a cell that contains the names of the structures that make up data_aseg.
  • data_aparc: a matrix in which each row is a subject and each column is the average thickness of a cortical ROI.
  • labels_aparc: a cell that contains the names of the cortical ROIs that make up data_aparc.
  • thick_mat_lh, thick_mat_rh: two matrices (left and right hemispheres) that contain cortical thickness data smoothed (FWHM 5 mm) and sampled onto fsaverage5. Each row is a separate subject.
  • sbj_names: a cell that contains the subject IDs.
  • labels: the target variable data for the subjects (the variable you are trying to predict).
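As a sketch of how these variables fit together, the following Python snippet builds a tiny synthetic file with the same variable names and loads it back via scipy.io.loadmat; the subject IDs, structure names, and values are invented, and real files have many more rows and columns:

```python
import numpy as np
from scipy.io import savemat, loadmat

# Build a tiny synthetic file with the same variable names as the
# distributed .mat files (2 fake subjects, 3 fake structures).
savemat("example_target.mat", {
    "data_aseg": np.array([[0.010, 0.020, 0.030],
                           [0.015, 0.018, 0.025]]),
    "labels_aseg": np.array(["Left-Hippocampus", "Right-Hippocampus",
                             "3rd-Ventricle"], dtype=object),
    "sbj_names": np.array(["subj001", "subj002"], dtype=object),
    "labels": np.array([[71.0], [65.5]]),
})

m = loadmat("example_target.mat")
X = m["data_aseg"]       # subjects x structures feature matrix
y = m["labels"].ravel()  # one target value per subject, in the same row order
assert X.shape[0] == y.shape[0]
```

The row order ties everything together: row i of the data matrices, entry i of sbj_names, and entry i of labels all refer to the same subject.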

We also distribute the following text (comma separated value, csv) files for each target variable:

  • *_data_aseg.csv: The first line is a header indicating what each entry corresponds to. The first entry is SID (subject ID); the following entries correspond to the brain structures. Each subsequent line is a separate subject. The numerical data are the volumes of the brain structures normalized by head size, i.e., intracranial volume (ICV).
  • *_data_aparc.csv: The first line is a header indicating what each entry corresponds to. The first entry is SID (subject ID); the following entries correspond to the cortical ROIs. Each subsequent line is a separate subject.
  • *_thick_fsaverage5_lh.csv: No header line. Whole-hemisphere cortical thickness data sampled onto fsaverage5 (lh stands for left hemisphere). Each line is a separate subject; note that the first entry on each line is the thickness value at the first surface vertex, not a subject ID.
  • *_thick_fsaverage5_rh.csv: Same format, for the right hemisphere.
  • *_Labels.csv: The header line is SID, Label. The first column contains the subject IDs; the second column contains the values of the target variable you are trying to predict.
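To make the layout concrete, here is a Python sketch that parses a tiny stand-in for a *_data_aseg.csv file using only the standard library; the subject IDs, column names, and values below are invented for illustration:

```python
import csv
import io

# Two-subject stand-in for a *_data_aseg.csv file; real files have many
# more columns (one per brain structure) and rows (one per subject).
aseg_csv = """SID,Left-Hippocampus,Right-Hippocampus
subj001,0.0031,0.0032
subj002,0.0028,0.0030
"""

rows = list(csv.DictReader(io.StringIO(aseg_csv)))
sids = [r["SID"] for r in rows]                             # subject IDs
features = [[float(v) for k, v in r.items() if k != "SID"]  # numeric features
            for r in rows]
```

The same pattern applies to *_data_aparc.csv; the *_thick_fsaverage5_*.csv files have no header, so a plain csv.reader with float conversion on every entry suffices.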

Finally, we distribute the 5-fold cross-validation lists that were used to compute the cross-validation results in our paper. These can be found under the folder 5fold.

Note that there are 100 different random-split (RS) 5-fold partitionings, not just one. This differs somewhat from the conventional way of doing k-fold cross-validation, which relies on a single k-fold partitioning of the data; instead, we provide 100 random partitionings.

Each partitioning can be found under the corresponding RS* subdirectory. For example, RS1 contains the 5-fold cross-validation lists for the first random split.

For each target variable, there are *_fold_k{1-5}_train.txt and *_fold_k{1-5}_test.txt files. These are the lists of subjects (and corresponding target variable values, or labels) that should be used for training and testing, respectively, in each of the 5 folds.

You should then run your usual 5-fold cross-validation on each of these RS directories. Each run yields one accuracy estimate, obtained by pooling across the five folds, so in total you will have 100 such estimates. The mean across these 100 estimates gives the cross-validation accuracy to report, and their spread quantifies its uncertainty; for example, you can compute the standard deviation.
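The protocol above can be skeletonized in Python as follows. For illustration the five folds are generated on the fly and the per-fold score is a placeholder; in practice you should read the subject lists from the distributed RS* fold files and substitute your model's test accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 20  # arbitrary stand-in
accuracy_per_rs = []

for rs in range(100):  # one pass per random split (RS1 .. RS100)
    # Illustrative random 5-way partition; in practice, use the subject lists
    # in the *_fold_k{1-5}_train.txt and *_fold_k{1-5}_test.txt files.
    folds = np.array_split(rng.permutation(n_subjects), 5)
    fold_scores = []
    for k in range(5):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        # ... train on train_idx, evaluate on test_idx ...
        fold_scores.append(len(test_idx) / n_subjects)  # placeholder score
    accuracy_per_rs.append(np.mean(fold_scores))        # pooled over 5 folds

mean_acc = np.mean(accuracy_per_rs)  # the accuracy estimate to report
std_acc = np.std(accuracy_per_rs)    # its variability across the 100 splits
```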

We provide a sample Matlab script and the results obtained with NAF on the Age_cont variable of MCIC for the 100 random-split 5-fold cross-validation runs.

Link to the Autism Brain Imaging Data Exchange (ABIDE) data analyzed in our experiment.

Link to the Attention Deficit Hyperactivity Disorder (ADHD) data analyzed in our experiment.

Link to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data analyzed in our experiment.

Link to the Center for Biomedical Research Excellence (COBRE) schizophrenia data we analyzed in our experiment.

Link to the MIND Clinical Imaging Consortium (MCIC) schizophrenia data analyzed in our experiment.

Link to the Open-Access Series of Imaging Studies (OASIS) data analyzed in our experiment.