Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, both commercial (such as when two similar companies need to merge their databases) and scientific (for example, combining research results from different bioinformatics repositories). Data integration appears with increasing frequency as the volume of data (that is, big data) and the need to share existing data explode. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users.
Issues with combining heterogeneous data sources, often referred to as information silos, under a single query interface have existed for some time. In the early 1980s, computer scientists began designing systems for interoperability of heterogeneous databases. The first data integration system driven by structured metadata was designed at the University of Minnesota in 1991, for the Integrated Public Use Microdata Series (IPUMS). IPUMS used a data warehousing approach, which extracts, transforms, and loads data from heterogeneous sources into a single view schema so data from different sources become compatible. By making thousands of population databases interoperable, IPUMS demonstrated the feasibility of large-scale data integration. The data warehouse approach offers a tightly coupled architecture because the data are already physically reconciled in a single queryable repository, so it usually takes little time to resolve queries.
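The extract, transform, and load steps of the warehouse approach can be illustrated with a minimal sketch. The two sources, their field names, and the `person` table below are hypothetical and chosen only for illustration; they do not reflect the actual IPUMS schema or tooling.

```python
import sqlite3

# Two hypothetical heterogeneous sources with different shapes and types.
source_a = [{"person": "Ann", "birth_year": 1950}]   # list of dicts, integer year
source_b = [("Bob", "1948")]                          # list of tuples, year as text

def extract_transform():
    """Extract rows from both sources and transform them to one common shape."""
    for record in source_a:
        yield (record["person"], int(record["birth_year"]))
    for name, year in source_b:
        yield (name, int(year))

# Load: physically store the reconciled rows in a single queryable repository.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE person (name TEXT, birth_year INTEGER)")
warehouse.executemany("INSERT INTO person VALUES (?, ?)", extract_transform())
warehouse.commit()

# Queries now run against the warehouse only, not the original sources.
rows = warehouse.execute(
    "SELECT name, birth_year FROM person ORDER BY birth_year"
).fetchall()
print(rows)
```

Because the data are copied at load time, any later change in a source is invisible to the warehouse until the extract, transform, and load steps are run again.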
The data warehouse approach is less feasible for data sets that are frequently updated, requiring the extract, transform, load (ETL) process to be continuously re-executed for synchronization. Difficulties also arise in constructing data warehouses when one has only a query interface to summary data sources and no access to the full data. This problem frequently emerges when integrating several commercial query services like travel or classified advertisement web applications.
As of 2009 the trend in data integration favored loosening the coupling between data and providing a unified query interface to access real-time data over a mediated schema (see Figure 2), which allows information to be retrieved directly from original databases. This is consistent with the SOA approach popular in that era. This approach relies on mappings between the mediated schema and the schemas of the original sources, and on transforming a query into specialized queries that match the schemas of the original databases. Such mappings can be specified in two ways: as a mapping from entities in the mediated schema to entities in the original sources (the “Global As View” (GAV) approach), or as a mapping from entities in the original sources to the mediated schema (the “Local As View” (LAV) approach). The latter approach requires more sophisticated inferences to resolve a query on the mediated schema, but makes it easier to add new data sources to a (stable) mediated schema.
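As a rough illustration of the GAV idea, the sketch below defines one mediated relation as a view over two sources and answers a query by evaluating that view. All names (`SOURCE_A`, `SOURCE_B`, `Customer`, `customer_view`, `query`) and the data are hypothetical; a real mediator would rewrite queries rather than materialize the whole view.

```python
# Hypothetical source relations from two original databases.
SOURCE_A = [{"cust_id": 1, "full_name": "Ada Lovelace"}]
SOURCE_B = [{"id": 7, "first": "Alan", "last": "Turing"}]

def customer_view():
    """GAV definition: mediated relation Customer(name, origin) as a view over the sources."""
    for row in SOURCE_A:
        yield {"name": row["full_name"], "origin": "A"}
    for row in SOURCE_B:
        yield {"name": f'{row["first"]} {row["last"]}', "origin": "B"}

# Each mediated relation maps to the view that defines it over the sources.
GAV_MAPPING = {"Customer": customer_view}

def query(relation, predicate):
    """Answer a selection on a mediated relation by unfolding its GAV view."""
    return [row for row in GAV_MAPPING[relation]() if predicate(row)]

result = query("Customer", lambda r: r["origin"] == "B")
print(result)
```

Under LAV the direction is reversed: each source would be described as a view over the mediated schema, and answering a query would require rewriting it in terms of those views, which this sketch does not attempt.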
As of 2010 some of the work in data integration research concerns the semantic integration problem. This problem addresses not the structuring of the architecture of the integration, but how to resolve semantic conflicts between heterogeneous data sources. For example, if two companies merge their databases, certain concepts and definitions in their respective schemas, such as “earnings”, inevitably have different meanings. In one database it may mean profits in dollars (a floating-point number), while in the other it might represent the number of sales (an integer). A common strategy for the resolution of such problems involves the use of ontologies, which explicitly define schema terms and thus help to resolve semantic conflicts. This approach represents ontology-based data integration. On the other hand, the problem of combining research results from different bioinformatics repositories requires benchmarking the similarities computed from different data sources on a single criterion such as positive predictive value. This makes the data sources directly comparable, so they can be integrated even when the natures of the experiments are distinct.
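The “earnings” conflict can be made concrete with a toy sketch in which a small lookup table stands in for an ontology. The source names, the concept names `profit_usd` and `units_sold`, and the `annotate` helper are assumptions made for illustration; real ontology-based systems use far richer term definitions.

```python
# Hypothetical stand-in for an ontology: each (source, field) pair is tied
# to an explicitly defined concept and the data type that concept uses.
ONTOLOGY = {
    ("company_a", "earnings"): {"concept": "profit_usd", "type": float},
    ("company_b", "earnings"): {"concept": "units_sold", "type": int},
}

def annotate(source, record):
    """Rename ambiguous fields to their ontology concept and coerce the value type."""
    out = {}
    for field, value in record.items():
        term = ONTOLOGY.get((source, field))
        if term is None:
            out[field] = value                      # field has no known conflict
        else:
            out[term["concept"]] = term["type"](value)
    return out

a = annotate("company_a", {"region": "EU", "earnings": "1250.50"})
b = annotate("company_b", {"region": "EU", "earnings": "310"})
print(a)
print(b)
```

After annotation the two records no longer share an ambiguous `earnings` field, so a merged schema can keep both quantities without confusing them.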
As of 2011 it was determined that current data modeling methods were imparting data isolation into every data architecture in the form of islands of disparate data and information silos. This data isolation is an unintended artifact of the data modeling methodology that results in the development of disparate data models. Disparate data models, when instantiated as databases, form disparate databases. Enhanced data model methodologies have been developed to eliminate the data isolation artifact and to promote the development of integrated data models. One enhanced data modeling method recasts data models by augmenting them with structural metadata in the form of standardized data entities. As a result of recasting multiple data models, the set of recast data models will now share one or more commonality relationships that relate the structural metadata now common to these data models. Commonality relationships are a peer-to-peer type of entity relationships that relate the standardized data entities of multiple data models. Multiple data models that contain the same standard data entity may participate in the same commonality relationship. When integrated data models are instantiated as databases and are properly populated from a common set of master data, then these databases are integrated.
Since 2011, data hub approaches have been of greater interest than fully structured (typically relational) enterprise data warehouses. Since 2013, data lake approaches have risen to the level of data hubs. (See the popularity of all three search terms on Google Trends.) These approaches combine unstructured or varied data into one location, but do not necessarily require an (often complex) master relational schema to structure and define all data in the hub.
Consider a web application where a user can query a variety of information about cities (such as crime statistics, weather, hotels, demographics, etc.). Traditionally, the information must be stored in a single database with a single schema. But any single enterprise would find information of this breadth somewhat difficult and expensive to collect. Even if the resources exist to gather the data, it would likely duplicate data in existing crime databases, weather websites, and census data.
A data-integration solution may address this problem by considering these external resources as materialized views over a virtual mediated schema, resulting in “virtual data integration”. This means application developers construct a virtual schema (the mediated schema) to best model the kinds of answers their users want. Next, they design “wrappers” or adapters for each data source, such as the crime database and weather website. These adapters simply transform the local query results (those returned by the respective websites or databases) into an easily processed form for the data integration solution (see Figure 2). When an application user queries the mediated schema, the data-integration solution transforms this query into appropriate queries over the respective data sources. Finally, the virtual database combines the results of these queries into the answer to the user’s query.
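The wrapper and mediator roles described above can be sketched as follows. The two source functions, their return formats, and the attribute names `burglaries` and `temp_c` are hypothetical stand-ins for a crime database and a weather website; no real service is being modeled.

```python
# Hypothetical raw sources, each with its own result format.
def crime_source(city):
    data = {"Springfield": ("SPRINGFIELD", 120)}             # tuple: (name, burglaries)
    return data.get(city)

def weather_source(city):
    data = {"Springfield": {"name": "Springfield", "temp_f": 68.0}}
    return data.get(city)

# Wrappers: turn each local result into the form the mediated schema expects.
def crime_wrapper(city):
    row = crime_source(city)
    return {} if row is None else {"burglaries": row[1]}

def weather_wrapper(city):
    rec = weather_source(city)
    if rec is None:
        return {}
    return {"temp_c": round((rec["temp_f"] - 32) * 5 / 9, 1)}  # Fahrenheit to Celsius

# Mediated schema: each attribute is served by exactly one wrapper.
WRAPPERS = {"burglaries": crime_wrapper, "temp_c": weather_wrapper}

def query_city(city, attributes):
    """Split a mediated-schema query into per-source queries and combine the results."""
    answer = {"city": city}
    for attr in attributes:
        answer.update(WRAPPERS[attr](city))
    return answer

result = query_city("Springfield", ["burglaries", "temp_c"])
print(result)
```

In this arrangement, supporting a new source only requires writing another wrapper and registering it in `WRAPPERS`; the query function and the existing wrappers stay unchanged.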
This solution offers the convenience of adding new sources by simply constructing an adapter or an application software blade for them. It contrasts with ETL systems or with a single database solution, which require manual integration of an entire new data set into the system. Virtual ETL solutions leverage a virtual mediated schema to implement data harmonization, whereby the data are copied from the designated “master” source to the defined targets, field by field. Advanced data virtualization is also built on the concept of object-oriented modeling in order to construct a virtual mediated schema or virtual metadata repository, using a hub-and-spoke architecture.
Each data source is disparate and as such is not designed to support reliable joins between data sources. Therefore, data virtualization as well as data federation depends upon accidental data commonality to support combining data and information from disparate data sets. Because of this lack of data value commonality across data sources, the return set may be inaccurate, incomplete, and impossible to validate.
One solution is to recast disparate databases to integrate these databases without the need for ETL. The recast databases support commonality constraints where referential integrity may be enforced between databases. The recast databases provide designed data access paths with data value commonality across databases.
The theory of data integration forms a subset of database theory and formalizes the underlying concepts of the problem in first-order logic. Applying the theories gives indications as to the feasibility and difficulty of data integration. While its definitions may appear abstract, they have sufficient generality to accommodate all manner of integration systems, including those that involve nested relational / XML databases and those that treat databases as programs. Connections to particular database systems such as Oracle or DB2 are provided by implementation-level technologies such as JDBC and are not studied at the theoretical level.
In the life sciences
Large-scale questions in science, such as global warming, invasive species spread, and resource depletion, increasingly require the collection of disparate data sets for meta-analysis. This type of data integration is especially challenging for ecological and environmental data because metadata standards are not agreed upon and there are many different data types produced in these fields. National Science Foundation initiatives such as Datanet are intended to make data integration easier for scientists by providing cyberinfrastructure and setting standards. The five funded Datanet initiatives are DataONE, led by William Michener at the University of New Mexico; The Data Conservancy, led by Sayeed Choudhury of Johns Hopkins University; SEAD: Sustainable Environment through Actionable Data, led by Margaret Hedstrom of the University of Michigan; the DataNet Federation Consortium, led by Reagan Moore of the University of North Carolina; and Terra Populus, led by Steven Ruggles of the University of Minnesota. The Research Data Alliance has more recently explored creating global data integration frameworks. The OpenPHACTS project, funded through the European Union Innovative Medicines Initiative, built a drug discovery platform by linking datasets from providers such as the European Bioinformatics Institute, the Royal Society of Chemistry, UniProt, WikiPathways and DrugBank.