Objective In an effort to standardize behavioral measures and their data representation, the present study develops a methodology for incorporating measures found in the National Cancer Institute's (NCI) grid-enabled measures (GEM) portal, a repository for behavioral and social measures, into the cancer data standards registry and repository (caDSR).
Methods The methodology consists of four parts for curating GEM measures into the caDSR: (1) develop unified modeling language (UML) models for behavioral measures; (2) create common data elements (CDE) for UML components; (3) bind CDE with concepts from the NCI thesaurus; and (4) register CDE in the caDSR.
Results UML models have been developed for four GEM measures, which have been registered in the caDSR as CDE. New behavioral concepts related to these measures have been created and incorporated into the NCI thesaurus. Best practices for representing measures using UML models have been developed and adopted in practice (eg, by the caDSR). One dataset based on a GEM-curated measure is available for use by other systems and users connected to the grid.
Conclusions Behavioral and population science data can be standardized by using and extending current standards. A new branch of CDE for behavioral science was developed for the caDSR. It expands the caDSR domain coverage beyond the clinical and biological areas. In addition, missing terms and concepts specific to the behavioral measures addressed in this paper were added to the NCI thesaurus. A methodology was developed and refined for curation of behavioral and population science data.
There is an ever-increasing volume of biomedical data that is potentially available for sharing and reuse. This increase is due to several factors including the rapid creation of biomedical knowledge from ongoing biomedical research activities, the reduced cost of computer storage hardware, and advances in informatics technologies used to manage these data. At the same time, the growing adoption of electronic health record systems and other health information technologies in healthcare organizations,1–3 fueled by federal health information technology programs,4 which store clinical data in data elements rather than clinical narrative notes, is creating new opportunities for outcomes researchers. Due to past efforts to create data standards in various biomedical domains, such as the clinical data interchange standards consortium (CDISC), systematized nomenclature of medicine-clinical terms, and logical observation identifiers names and codes, it is possible to conduct analyses of this growing body of disparate data; this transformation is increasing data liquidity,5,6 enabling clinical, research, and operational data to become available where and when they are needed. However, relatively little effort has been made to create data standards to support the needs of behavioral scientists and researchers in the domain of behavioral medicine.
Data standards, including terminologies and common data elements (CDE), are a critical first step towards achieving automated data integration. The National Cancer Institute (NCI) established an informatics infrastructure through the caBIG program with the goal of developing new tools, technologies and infrastructures to enable the interoperability of biomedical data from diverse systems.7,8 With this vision now being enabled by the national cancer informatics program,9 the resultant infrastructure utilizes terminologies housed in the NCI enterprise vocabulary services (EVS)10 and CDE in the NCI's cancer data standards registry and repository (caDSR).11 The caDSR is a metadata registry and repository with a set of tools to create, find, and deploy CDE for use in the development of biomedical software applications; it supports data management workflow and adherence to an extended version of the ISO/IEC 11179 metadata standard.12 In order to provide a completely unambiguous definition for each CDE, the caDSR binds its components to concepts from the NCI thesaurus13,14 and other controlled vocabularies contained in the NCI EVS. This system is potentially useful to behavioral scientists and researchers in behavioral medicine because it provides an infrastructure to share and reuse standardized datasets.
Data integration in behavioral science
Behavioral scientists are beginning to realize the benefits of integrating multidimensional and multilevel ‘big data’.8 Several projects have been developed to facilitate the sharing of behavioral data by establishing repositories for standard behavioral and social science measures. These projects include grid-enabled measures (GEM),15,16 consensus measures for phenotypes and exposures (PhenX),17–19 patient-reported outcomes measurement information system (PROMIS),20,21 NIH toolbox for the assessment of neurologic and behavioral functioning,22 and the national collaborative on childhood obesity research.23 Each of these repositories is intended to encourage the use of standard measures using common definitions and CDE in order to enable data aggregation and data integration across studies. However, each one also serves a different domain and is structured differently, which reduces interoperability across these systems and makes integration or aggregation of data across studies from those distinct domains difficult to implement.
In contrast to these measure repositories, PopSciGrid demonstrated how data from disparate data sources could be integrated and analyzed by using CDE and an informatics infrastructure.24,25 The NCI's division of cancer control and population science funded the development of this project, which took subsets of variables focused on tobacco use from two public health survey datasets, and harmonized them by using the tools in the caDSR; it was this harmonization of the CDE that enabled the data integration.
The ability to resolve semantic conflicts between heterogeneous data sources is one of the major challenges in the data integration field. An ontology is one of the solutions used to address the semantic heterogeneity problem encountered during data integration.26,27 Data integration is a new and challenging area in the field of behavioral and population science due to the ‘lack of consensus' about measures among researchers.28 Several projects have been conducted to investigate this issue. For example, a taxonomy of 26 commonly used behavioral change techniques was developed in England.29 Professional organizations like the American Psychological Association have developed their own thesaurus of terms but these have not yet been adopted by other groups.
The GEM portal30 was conceptualized by the NCI to help accomplish two overarching goals: to evaluate and promote the use of standardized measures (ie, promote the use of common measures); and enable sharing of the resulting harmonized data that contain these common measures. GEM is a dynamic web-based portal15,16 that encourages and enables standardization of measures and data harmonization by enabling a virtual community of researchers who interact with each other using Web V.2.0 technologies. GEM contains behavioral and social science measures organized by theoretical constructs. Currently, it has 902 registered users and 457 measures based on 238 constructs, and these numbers are increasing every day. Through GEM, researchers are able to upload publicly available behavioral measures and associated metadata for use. Researchers can also provide comments and ratings about these measures and their associated theoretical constructs. GEM has an active user community. Currently, 143 (31%) of the measures have at least one rating and users have contributed 637 comments in total. There are 64 (7%) and 96 (11%) registered users who have rated a measure and have commented on a measure, respectively. GEM also provides a virtual platform for researchers to share their behavioral datasets (eg, it contains nine behavioral datasets). It provides an environment for ‘prospective meta-analyses' in which research is designed for integration.16
The major challenge in behavioral data sharing is a lack of standards. Before the formation of international standards that cover the domain of behavioral science, several standard platforms are being developed to cover clinical and biological domains in the USA. The caDSR is one of these platforms. In this project, we will explore an approach that expands the current standard platform to cover behavioral science and its applications. We will demonstrate this approach by developing behavioral data standards (CDE) to enlarge the domain coverage of the caDSR.
It is critical to represent data within a standard framework to facilitate data sharing. A formal method for incorporating datasets from a new domain into an infrastructure (eg, the grid) can not only speed up data sharing, but also improve data standards development and management. In this section, we present the methodology for incorporating measures from GEM, which is being promoted as a standard behavioral and social measure repository, into a standard data element repository (caDSR), and demonstrate the procedure for loading behavioral datasets onto the grid.
As the architecture presented in figure 1 shows, grid applications access and integrate the structured datasets on the grid (see dotted lines in figure 1) by using CDE as the interface. This requires the datasets to be represented in a standard structure. The CDE are registered and maintained in a central repository. The registered CDE also serve as the interface for different grid applications.
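The role of CDE as a shared interface can be illustrated with a minimal sketch. All variable names and CDE identifiers below are hypothetical placeholders chosen for illustration; they are not taken from the caDSR or from any dataset described in this paper.

```python
# Hypothetical local-variable-to-CDE mappings for two survey datasets.
# The CDE identifiers are placeholders, not real caDSR public IDs.
CDE_MAP_SURVEY_A = {"smk_now": "CDE_CURRENT_SMOKER", "age_yrs": "CDE_AGE"}
CDE_MAP_SURVEY_B = {"current_smoker": "CDE_CURRENT_SMOKER", "age": "CDE_AGE"}


def to_cde_records(rows, cde_map):
    """Rename each row's local variable names to shared CDE identifiers.

    Variables without a CDE mapping are dropped, since they cannot be
    interpreted by other applications.
    """
    return [
        {cde_map[name]: value for name, value in row.items() if name in cde_map}
        for row in rows
    ]


survey_a = [{"smk_now": 1, "age_yrs": 34}]
survey_b = [{"current_smoker": 0, "age": 51, "interviewer": "x"}]

# Once both datasets are keyed by the same CDE, they can be pooled directly.
pooled = (
    to_cde_records(survey_a, CDE_MAP_SURVEY_A)
    + to_cde_records(survey_b, CDE_MAP_SURVEY_B)
)
```

In this sketch an application only needs to know the CDE identifiers, not the local column names used by each contributing study.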
Incorporating new domains (ie, behavioral and population science) into the grid requires developing and registering new CDE in the central repository (caDSR). The flowchart of the population science CDE development is presented in figure 2. The curation methodology consists of four major parts: develop information models (unified modeling language; UML) for GEM measures; create CDE for components of UML models; bind CDE with concepts in the NCI thesaurus (semantic integration); and register the resulting metadata (CDE) in the caDSR. The outcome is the registration of measures, questions, and their responses as CDE so that the variables representing the measures' scores can be shared via grid applications. In the present study, four widely used measures found in GEM (ie, the subjective numeracy scale (SNS),31,32 functional assessment of cancer therapy-general (FACT-G),33 perceived stress scale (PSS),34 and Center for Epidemiologic Studies depression scale (CES-D))35 were selected to demonstrate the methodology.
Develop UML models for GEM measures
Measures can be registered into the caDSR by manual curation or a semi-automated UML model approach. As a grid service (caGrid)36,37 cannot be created with manually curated CDE, the UML model approach was selected to automate measure curation. Another benefit of the UML model is that measures and their items can be represented in a form understandable to both researchers and informaticians. A UML model template was developed for the measures. In the template, each measure was represented as two UML classes corresponding to its items and scores. The first class (ie, measure class) contains the name and items of the measure that enable a researcher to query the data by the name of a measure and the associated items in that measure. The second class (ie, score class) contains the derived scores (eg, total, subscale or component) of the measure. For example, the two UML classes for the SNS are SubjectiveNumeracyScale and SubjectiveNumeracyScale_Score. The first class contains the name of the measure and its eight items. The second class contains its three derived scores: total score, abilitySubscale, and preferenceSubscale. This class has important information that a researcher might use to query for a particular area of interest (eg, ability subscale of the SNS). The two classes are connected by a one-to-one relationship. Figure 3 depicts the UML models for the SNS.
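The two-class template can be sketched in code as follows. This is only an illustration of the structure described above, not the registered model: the item attribute names (`item1` to `item8`) and the `SnsRecord` wrapper are assumptions, and only the three scores named in the text are included.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SubjectiveNumeracyScale:
    """Measure class: the measure name plus its eight items."""
    name: str = "Subjective Numeracy Scale"
    # item1..item8 are placeholder attribute names (assumption).
    item1: Optional[int] = None
    item2: Optional[int] = None
    item3: Optional[int] = None
    item4: Optional[int] = None
    item5: Optional[int] = None
    item6: Optional[int] = None
    item7: Optional[int] = None
    item8: Optional[int] = None


@dataclass
class SubjectiveNumeracyScale_Score:
    """Score class: the derived scores named in the text."""
    totalScore: Optional[float] = None
    abilitySubscale: Optional[float] = None
    preferenceSubscale: Optional[float] = None


@dataclass
class SnsRecord:
    """Hypothetical wrapper expressing the one-to-one relationship."""
    measure: SubjectiveNumeracyScale
    score: SubjectiveNumeracyScale_Score


record = SnsRecord(SubjectiveNumeracyScale(), SubjectiveNumeracyScale_Score())
measure_attribute_count = len(fields(SubjectiveNumeracyScale))
```

Keeping the scores in a separate class lets a query target a derived value (eg, the ability subscale) without touching the item-level responses.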
Create CDE for components of UML models
This process consists of two parts: create CDE for UML components; and organize the newly created CDE in the caDSR.
A CDE is created for each attribute of the UML class. For example, nine and four CDE are created for the SubjectiveNumeracyScale and SubjectiveNumeracyScale_Score class, respectively (see figure 3). In this step, existing CDE from the caDSR are reused or new CDE are created as appropriate. A gap analysis comparing GEM measures with the measures registered in the caDSR was conducted before the start of this project. The results showed that only two GEM measures were registered in the caDSR at the time. Reuse was therefore not feasible, because few existing CDE were associated with GEM measures, and new CDE were developed for the four selected GEM measures.
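The one-CDE-per-attribute rule can be expressed as a short sketch. The record layout (`uml_class`, `attribute`, `cde_key`) and the item attribute names are hypothetical; actual caDSR CDE carry considerably more metadata.

```python
def cde_stubs(class_name, attribute_names):
    """Return one CDE stub per UML attribute, keyed as 'Class.attribute'."""
    return [
        {
            "uml_class": class_name,
            "attribute": attr,
            "cde_key": f"{class_name}.{attr}",
        }
        for attr in attribute_names
    ]


# The measure class holds the measure name plus eight items
# (attribute names here are placeholders).
measure_attrs = ["name"] + [f"item{i}" for i in range(1, 9)]
stubs = cde_stubs("SubjectiveNumeracyScale", measure_attrs)
```

With nine attributes on the measure class, the function yields nine stubs, matching the count given for SubjectiveNumeracyScale above.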
The classification scheme (CS) is used to group CDE together logically in the caDSR. It provides a way to find CDE once they are registered in the caDSR. There are two types of CS: container type, which is used to group similar projects and their associated CDE; and project type, which is used to group CDE based on each project. We decided to group measures from different projects together in the container type CS. This facilitates researchers' ability to find measures in one place and use them from the CDE browser. Under the ‘measures' container (shown as a red container on the bottom of figure 4), users can easily find the four GEM measures curated by this group and their corresponding CDE. For example, a user can find all 13 CDE related to the SNS measure by clicking the ‘SubjectiveNumeracyScale’ under the ‘Measure’ container.
Bind CDE with concepts in the NCI thesaurus (semantic integration)
The UML model was exported as an XML metadata interchange (XMI) file that can be read by the semantic integration workbench (SIW).38 The XMI is a standard for exchanging metadata information via XML. The SIW was used to annotate CDE semantically with concepts from standard terminologies. The NCI EVS provides terminology content, tools and services to code, analyze, and share cancer information accurately. This study focuses on the annotation of cancer prevention and population science data. Therefore, the NCI thesaurus was selected to provide terminology content for annotation purposes.
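To give a sense of what a tool reading the XMI export has to do, the sketch below lists the classes and attributes in a small XMI-like fragment using the Python standard library. The fragment is a simplified, hand-written stand-in: the namespace URI and element layout are assumptions, and it is neither the actual file produced in this study nor the format handling implemented by the SIW.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the UML elements in this illustrative fragment.
UML_NS = "omg.org/UML1.3"

# Hand-written, simplified fragment (not the study's real export).
XMI_SAMPLE = """<XMI xmi.version="1.1" xmlns:UML="omg.org/UML1.3">
  <XMI.content>
    <UML:Model name="GEM">
      <UML:Class name="SubjectiveNumeracyScale_Score">
        <UML:Classifier.feature>
          <UML:Attribute name="totalScore"/>
          <UML:Attribute name="abilitySubscale"/>
          <UML:Attribute name="preferenceSubscale"/>
        </UML:Classifier.feature>
      </UML:Class>
    </UML:Model>
  </XMI.content>
</XMI>"""


def classes_and_attributes(xmi_text):
    """Map each UML class name to the list of its attribute names."""
    root = ET.fromstring(xmi_text)
    result = {}
    for cls in root.iter(f"{{{UML_NS}}}Class"):
        attrs = [a.get("name") for a in cls.iter(f"{{{UML_NS}}}Attribute")]
        result[cls.get("name")] = attrs
    return result


model = classes_and_attributes(XMI_SAMPLE)
```

Each class and attribute recovered this way is a candidate for semantic annotation with thesaurus concepts.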
Although CDE should be annotated by the concepts of the standard terminology, the contents of the CDE may not be covered completely by the standard terminology. For example, there is no hierarchical structure for behavioral and population science in the NCI thesaurus. It requires extending the scope of the standard terminology (NCI thesaurus). The guidelines for concept generation consist of three parts: (1) concepts; (2) concept definitions; and (3) hierarchical locations.
(1) Rules for new concept generation are as follows. Each measure name is assigned to a concept; for example, ‘C91235 SNS’ is created for the SNS. Each question in a measure is assigned to a concept; for example, ‘C91239 How Good At Fractions’ is created to represent the question ‘How good are you at working with fractions?’ Each question response is represented by a concept (eg, ‘C91249 Not at all good’).
(2) Rules for concept definition generation are as follows. The definition for a measure concept is the description of the measure from GEM. There are two definitions for each question concept. The first definition is the original question; for example, the first definition for ‘C91239 How Good At Fractions’ is ‘How good are you at working with fractions?’ The second definition reflects the intent of the question; for example, C91239 is also defined as ‘A survey question about how proficient a person is with percentages’.
(3) Rules for organizing concepts in the NCI thesaurus hierarchies are developed as follows: measure concepts are grouped into the existing clinical questionnaire branch ‘C91105 Clinical or Research Assessment Questionnaire’; question concepts are allocated under the ‘C91103 Findings-Based Question’; and question response concepts are grouped into the ‘C91106 Clinical or Research Assessment Answer’.
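The three placement rules can be summarized as a lookup from concept kind to parent concept. The concept codes and labels are those quoted in the text; the function name and record layout are illustrative assumptions.

```python
# Parent concept for each kind of new concept, as stated in rule (3).
PARENT_BY_KIND = {
    "measure": "C91105",   # Clinical or Research Assessment Questionnaire
    "question": "C91103",  # Findings-Based Question
    "response": "C91106",  # Clinical or Research Assessment Answer
}


def place_concept(code, label, kind):
    """Attach a new concept to the hierarchy branch that matches its kind."""
    return {"code": code, "label": label, "parent": PARENT_BY_KIND[kind]}


concepts = [
    place_concept("C91235", "SNS", "measure"),
    place_concept("C91239", "How Good At Fractions", "question"),
    place_concept("C91249", "Not at all good", "response"),
]
```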
Register the resulting metadata (CDE) in the caDSR
The NCI tool, UML Loader,39 is used to register or upload the annotated XMI file (results from the previous step) into the caDSR.
Registered four GEM measures in the caDSR
Four GEM measures (ie, CES-D, SNS, PSS, and FACT-G) were registered in the caDSR. There are 83 CDE and 413 concepts from the NCI thesaurus that are associated with these four measures. The scope of the caDSR was extended by adding CDE for a new domain (ie, behavioral and population science) from this project. All CDE were newly created. They were organized into the ‘measure’ container in the caDSR.
A CDE may have more than one concept associated with it. For example, three concepts ‘C91235 SNS’, ‘C78209 Ability’, and ‘C25338 Score’ were used to annotate the CDE ‘SNS Ability Score’. The first concept (C91235) is a new concept. The other two are existing NCI thesaurus concepts. Table 1 shows the distributions of the CDE and NCI thesaurus concepts.
|GEM measures||No of CDE||No of concepts||No of newly created concepts||No of reusing concepts|
CDE, common data element; GEM, grid-enabled measures; PSS, perceived stress scale; SNS, subjective numeracy scale.
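The many-to-one annotation of a CDE by several concepts, as in the ‘SNS Ability Score’ example above, can be represented with a simple mapping. The concept codes come from the text; the data layout and helper function are assumptions made for illustration.

```python
# Concepts created for this project, per the text; the others are reused.
NEW_CONCEPTS = {"C91235"}

annotation = {
    "SNS Ability Score": [
        ("C91235", "SNS"),
        ("C78209", "Ability"),
        ("C25338", "Score"),
    ],
}


def split_new_and_reused(concepts):
    """Separate a CDE's concept codes into newly created and reused ones."""
    new = [code for code, _ in concepts if code in NEW_CONCEPTS]
    reused = [code for code, _ in concepts if code not in NEW_CONCEPTS]
    return new, reused


new_codes, reused_codes = split_new_and_reused(annotation["SNS Ability Score"])
```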
Developed new behavioral concepts for the NCI thesaurus
There were 66 new concepts proposed for addition into the NCI thesaurus. The NCI thesaurus accepted the proposal and added these concepts in its three hierarchies ‘C91105 Clinical or Research Assessment Questionnaire’, ‘C91103 Findings-Based Question’ and ‘C91106 Clinical or Research Assessment Answer’. Four out of the 66 concepts were created to represent the four GEM measures. They were grouped into the first hierarchy C91105. Fifty concepts were generated for the questions from the measures. They are children of the C91103. The rest of the concepts were created for the question responses. They are children of the C91106. These 66 new concepts, with 347 existing concepts, enable the NCI thesaurus to cover the CDE annotation for the four GEM measures completely. This process is a key step in promoting these CDE as standards.
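The concept counts reported above can be cross-checked with a short calculation. The only derived figure is the number of response concepts, which the text gives as ‘the rest’.

```python
total_new = 66
new_by_kind = {"measure": 4, "question": 50}

# 'The rest' of the new concepts represent question responses.
new_by_kind["response"] = total_new - sum(new_by_kind.values())

existing = 347
total_concepts = total_new + existing
```

The result is 12 response concepts and 413 concepts overall, which agrees with the total reported for the four registered measures.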
Developed best practice for the curation process
A set of best practices was generated for the caDSR based on the lessons learned during each phase of the curation process of this project. The best practices outline techniques and methods for efficiently and consistently curating behavioral measures in the caDSR. The scope of the best practices includes naming conventions, definitions, and value domains. A completed report on best practices for behavioral measures curation can be found on the NCI Wiki.40 These guidelines provide standard procedures to incorporate behavioral measures into the caDSR. They will ease the burden of curation for future behavioral and population scientists.
The method and best practices developed in this study are now used by the caDSR curation team. For example, 12 measures such as 10 PROMIS global health, assessment of survivor concerns, and breast cancer treatment outcomes from the NCI cooperative groups' protocols have been curated into the caDSR based on these best practices. New measures such as breast lymphedema symptom survey and lymphedema and breast cancer questionnaire—short form are in progress. As a result of measure curation, new behavioral CDE and concepts are being developed for the caDSR and NCI thesaurus, respectively.
Uploaded one behavioral dataset on the grid and provided a link to GEM's ‘datasets' tab
One behavioral dataset was uploaded as a grid node to demonstrate the method. The dataset came from a study that surveyed a diverse population of Temple University undergraduate students (N=1200) to assess smoking and smoking-related behaviors (principal investigator Dr Suzanne M. Miller). This behavioral dataset contained one curated GEM measure (CES-D). The metadata of this study and a link to the grid node were also uploaded into GEM. Users can search for information about this study through GEM's ‘datasets’ tab (see figure 5).
Importance of standards in behavioral science
The efforts described in this paper were undertaken to achieve two important goals: to develop standards for curating scientific measures related to behavioral research; and to use and share these standards to facilitate sharing of semantically interoperable data containing CDE. These standards are a critical step towards bringing order to behavioral science, including research, so that data can be shared for knowledge synthesis and practice, and interventions can be tested, compared, and improved on. Although these ideas are novel to many behavioral and population scientists, the practice of creating and using standards is established in other disciplines, notably genetics, in which a fledgling area of study (where standards had not yet been created) and the sheer number of data points provided the necessary ingredients to push for harmonized data. Within behavioral science, it is heartening to see efforts such as those of Michie and colleagues, who have developed a taxonomy of behavior change terms to help create evidence-based guidelines for implementation research.29,41 It is hoped that these types of taxonomy and ontology development efforts, combined with the data integration work described in this paper, will eventually lead to faster, more efficient scientific discoveries that will lead to better and more sustainable behavior change.
Need for an ontology structure for behavioral and population science concepts
One issue was raised during the curation process: how to choose the ‘right’ concepts from the NCI thesaurus to annotate the model. Currently, the hierarchical structure and semantic relationships of the concepts for behavioral measures are not available in the NCI thesaurus. Behavioral concepts are organized in the hierarchies based on clinical domains. In order to annotate a behavioral measure, concepts have to be identified from different locations in the NCI thesaurus hierarchies, which reduces the efficiency of concept selection. New concepts have to be generated if there is no existing concept in the NCI thesaurus. Currently, these new behavioral concepts are incorporated into three locations (questionnaire, question, and answer) in the NCI thesaurus. All new concepts are children of these three concepts. It is recommended that the hierarchical and semantic relationships for behavioral and population science concepts inside the NCI thesaurus be enhanced and built out further. The hierarchical structure will provide an appropriate location to host new concepts from the behavioral and population science domain. The semantic relationships will link behavioral concepts with existing concepts from other domains. This will improve the knowledge structure of the NCI thesaurus.
Data standardization in behavioral science
The importance of standards has been recognized in behavioral science and healthcare. An international standard needs to be recognized and adopted widely by its community. There are several ongoing projects that may serve as the standard for behavioral science (such as GEM, PROMIS, PhenX, etc.). Due to the complexity of behavioral science and resource limitations, these projects have their own scope and development methodologies to meet different requirements. Although these projects are at different phases, they may be the seeds for the formation of the future international standard. The existing standards need to expand scopes and applications. The method of CDE/caDSR developed for GEM may be applied to PROMIS and PhenX.
The biomedical research integrated domain group (BRIDG) is an overarching model for clinical research that is supported by four major stakeholders: the US Food and Drug Administration, HL7, CDISC and NCI.42 The NCI hosts BRIDG in the caDSR registry of data elements. All the elements in BRIDG are annotated with terminology from the NCI EVS. As CDISC is also one of the major stakeholders, the content of its study data tabulation model is also annotated with EVS terminology; often NCI and CDISC use the very same concepts/terminology to annotate the data elements. Every element in the caDSR can be aligned with the BRIDG. All the CDE in the caDSR are composed of terminology from EVS: data elements are created and annotated with terminology by using wizards (eg, SIW) and loaders (eg, UML Loader). CDE in the caDSR therefore always use EVS terminology, which is the same source of terminology used by the BRIDG. In addition, the caDSR has the ability to promote a CDE to a ‘standard’ CDE by changing its registration status after it goes through community review. A ‘standard’ CDE means that its content has been vetted and recommended for reuse.
Data sharing cannot be done appropriately without standards. The sharing of data is a challenge in behavioral science. Besides the challenges from technology, there are legal issues in dealing with proprietary measures, which applications like the GEM portal still need to address, especially with regard to sharing data collected from those measures. Behavioral and population scientists seem reluctant to share data, which may be for several reasons including: lack of technical expertise; time constraints; and fear of misuse of their data or being ‘scooped’. This paper demonstrates a methodology of developing and using data standards for behavioral and population science research. The standardization of data will promote data sharing and reuse among behavioral and population scientists. It also speeds up the adoption of new technologies in behavioral and population science.
There is a need to shift population science and behavioral research priorities from creating new measures to identifying and reusing standard measures to move research forward through knowledge synthesis.8 Using a web-based, grid infrastructure is one mechanism that allows scientists to share data and knowledge ranging from the molecular to the social level. A culture shift from ‘single principal investigator’ to ‘collaboration and team science’ is needed to enable technology truly to advance research.8
Behavioral and population science data can be standardized by using and extending current standards. This study demonstrates, using a smoking-related behavioral dataset, that CDE and an ontology (the NCI thesaurus) can be applied to standardize behavioral data. A new branch of CDE for behavioral science was developed for the caDSR, expanding the caDSR domain coverage beyond the clinical and biological areas. In addition, missing terms and concepts specific to the behavioral measures addressed in this paper were added to the NCI thesaurus. A methodology was developed for curation of behavioral and population science data. Best practices and documentation, adopted as standards for registering measures in the caDSR, were developed for practical use.40 Although the caDSR is expanding quickly, a concerted effort (ie, resources, policy, organization, innovation, etc.) is required to promote it as a viable international standard for biomedical research.
All the listed authors contributed substantially to the conception and design or analysis and interpretation of data. All the authors contributed drafts and revisions to the manuscript and approved the current revised version. No person who fulfills the criteria for authorship has been left out of the author list.
Provenance and peer review
Not commissioned; externally peer reviewed.