• A journal of the China Computer Federation (CCF)
  • A core journal of Chinese science and technology
  • A Chinese core journal

Computer Engineering & Science


Data quality assurance based on metamodel control in big data environment

YANG Dongju1,2,XU Chenyang1,2   

  (1. Beijing Key Laboratory on Integration and Analysis of Large-Scale Stream Data, Beijing 100144, China;
   2. Research Center for Cloud Computing, North China University of Technology, Beijing 100144, China)
     
  • Received: 2018-08-10; Revised: 2018-10-19; Online: 2019-02-25; Published: 2019-02-25

Abstract:

In the data integration process, increasingly heterogeneous data sources pose new challenges to the quality of the integrated data. To address the data quality problems that arise after integration in the traditional ETL model, such as redundant, invalid, duplicate, missing, inconsistent, erroneous, and mis-formatted data, we propose an ETL integration model based on metamodel control, and we define its mapping rules in detail. By combining the metamodel with the mapping mechanism in the extraction, transformation, and loading phases, the quality of the integrated data can be effectively guaranteed. The proposed metamodel has been applied to the data integration business of scientific and technological resource management. Analysis of these data integration cases shows that the solution can effectively support the construction of data warehouses in the big data environment and improve data quality after integration.
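As a minimal illustrative sketch (not the authors' implementation), the metamodel-and-mapping idea in the transformation phase can be expressed as a small rule-driven step: target fields, their source fields, and their validation constraints are declared as metadata, and records that violate the declared rules (missing values, format errors, duplicates) are rejected before loading. All names below (`METAMODEL`, `transform`, the field names) are hypothetical.

```python
import re

# Hypothetical metamodel: each target field declares its source field,
# whether it is required, and an optional format pattern (mapping rule).
METAMODEL = {
    "id":    {"source": "ID",    "required": True,  "pattern": r"^\d+$"},
    "email": {"source": "Email", "required": True,  "pattern": r"^[^@\s]+@[^@\s]+$"},
    "dept":  {"source": "Dept",  "required": False, "pattern": None},
}

def transform(records):
    """Map heterogeneous source records onto the target schema and
    enforce the declared quality rules, returning (clean, rejected)."""
    clean, rejected, seen = [], [], set()
    for rec in records:
        row, ok = {}, True
        for field, rule in METAMODEL.items():
            value = rec.get(rule["source"])
            value = value.strip() if isinstance(value, str) else value
            if value in (None, ""):
                if rule["required"]:          # missing required value
                    ok = False
                value = None
            elif rule["pattern"] and not re.match(rule["pattern"], value):
                ok = False                    # format error
            row[field] = value
        if ok and row["id"] in seen:          # duplicate record
            ok = False
        if ok:
            seen.add(row["id"])
            clean.append(row)
        else:
            rejected.append(rec)
    return clean, rejected
```

In this sketch the quality checks live entirely in the metadata dictionary, so adding a new source only requires new mapping entries, not new transformation code, which is the kind of decoupling a metamodel-controlled ETL process aims for.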

Key words: big data, data warehouse, ETL, metadata model, mapping, data integration