• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

• High Performance Computing

  • Funding:

    Key Program of the National Natural Science Foundation of China (61832004)

Data quality assurance based on metamodel control in big data environment

YANG Dongju1,2,XU Chenyang1,2   

  1. (1.Beijing Key Laboratory on Integration and Analysis of LargeScale Stream Data,Beijing 100144;
    2.Research Center for Cloud Computing,North China University of Technology,Beijing 100144,China)
     
  • Received:2018-08-10 Revised:2018-10-19 Online:2019-02-25 Published:2019-02-25

Abstract:

In the data integration process, an increasingly rich variety of heterogeneous source data brings new challenges to improving data quality after integration. To address the data quality problems that arise after integration with the traditional ETL model, such as redundant, invalid, duplicate, missing, and inconsistent data, erroneous values, and format errors, we propose an ETL integration model based on metadata model control and define in detail the mapping rules used in the data integration process. Combining the metamodels of the extraction, transformation, and loading phases with the mapping mechanism effectively safeguards the quality of the integrated data. The proposed metamodel has been applied to the data integration business of scientific and technological resource management. An analysis of data integration examples from scientific and technological resource management verifies that this data integration scheme can effectively support the construction of data warehouses in a big data environment and improve data quality after integration.
 

Key words: big data, data warehouse, ETL, metadata model, mapping, data integration
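
The abstract describes metadata-controlled ETL only at a high level, so the following is a minimal Python sketch of the general idea and not the authors' metamodel. It shows declarative mapping rules that drive the transformation step and reject records with missing, malformed, or duplicate values before loading. Every name in it (`FieldRule`, `transform`, `run_etl`, and the sample fields `proj_id`, `budget`, `year`) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical metamodel entry: how one source field maps to the target."""
    source: str                    # field name in the heterogeneous source
    target: str                    # field name in the warehouse schema
    convert: Callable[[Any], Any]  # conversion applied in the transform phase
    required: bool = True          # reject the record if the value is missing


def transform(record: Dict[str, Any],
              rules: List[FieldRule]) -> Tuple[Optional[dict], List[str]]:
    """Apply the mapping rules to one record; return (row, problems)."""
    row: Dict[str, Any] = {}
    problems: List[str] = []
    for rule in rules:
        raw = record.get(rule.source)
        if raw is None or raw == "":
            if rule.required:
                problems.append(f"missing: {rule.source}")
            continue
        try:
            row[rule.target] = rule.convert(raw)
        except (ValueError, TypeError):
            problems.append(f"bad format: {rule.source}={raw!r}")
    return (None if problems else row), problems


def run_etl(records, rules, key):
    """Extract, transform, load; invalid and duplicate rows are rejected.

    `key` must be the target name of a required rule.
    """
    loaded, rejected, seen = [], [], set()
    for record in records:
        row, problems = transform(record, rules)
        if row is None:
            rejected.append((record, problems))
        elif row[key] in seen:
            rejected.append((record, [f"duplicate: {key}={row[key]!r}"]))
        else:
            seen.add(row[key])
            loaded.append(row)
    return loaded, rejected


# Assumed sample rules and source rows, invented for this sketch.
RULES = [
    FieldRule("proj_id", "project_id", lambda v: str(v).strip()),
    FieldRule("budget", "budget_cny", float),
    FieldRule("year", "start_year", int, required=False),
]

SOURCE = [
    {"proj_id": " P001 ", "budget": "120000", "year": "2018"},
    {"proj_id": "P001", "budget": "120000"},  # duplicate key after trimming
    {"proj_id": "P002", "budget": "n/a"},     # format error
    {"budget": "5000"},                       # missing required field
]

loaded, rejected = run_etl(SOURCE, RULES, key="project_id")
```

Keeping the rules as data instead of hard-coding them in the pipeline is the design choice this sketch is meant to illustrate: adding a new heterogeneous source then means adding rule entries, not new transformation code.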