Liu Yue, Liu Da-Hui, Ge Xian-Yuan, Yang Zheng-Wei, Ma Shu-Chang, Zou Zhe-Yi, Shi Si-Qi
  • A large amount of data and knowledge, generated and stored as text in the peer-reviewed scientific literature, is important for materials research and development. Although text mining can extract this information automatically, the difficulty of acquiring high-quality textual data hinders its general application in materials science. Herein, we systematically analyze the problems of textual data and the related research from the perspectives of data quality and data quantity. We then propose a pipeline for constructing high-quality datasets for text mining in materials science. In this pipeline, a traceable automatic literature acquisition scheme ensures the traceability of the textual data. Next, a data processing method driven by downstream tasks generates high-quality pre-annotated corpora adapted to the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to achieve high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to increase the data quantity. Experimental results on datasets covering various material systems show that our method effectively improves the accuracy of downstream models; the F1-score on the named entity recognition task for NASICON-type solid electrolyte materials reaches 84%. This study provides important insight into the general application of text mining in materials science and is expected to advance materials design and discovery driven bidirectionally by data and knowledge.
        Corresponding author: Shi Si-Qi, sqshi@shu.edu.cn
      • Funds: Project supported by the National Key Research and Development Program of China (Grant No. 2021YFB3802101) and the National Natural Science Foundation of China (Grant Nos. 92270124, 52073169, 52102313).
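      | Acquisition method | Database | Document type | Access | Number of documents | Reference |
      |---|---|---|---|---|---|
      | Indexed-database API | CAplus | Papers, patents, reports | Subscription | | www.cas.org/support/documentation/references |
      | | DOAJ | Papers | Partial subscription | | doaj.org |
      | | PubMed Central | Papers | Open access | Relatively few | www.ncbi.nlm.nih.gov/pmc |
      | | Science Direct | Papers | Subscription | | dev.elsevier.com/api_docs.html |
      | | Scopus | Abstracts | Open access | Relatively few | |
      | | Springer Nature | Papers, books | Subscription | | dev.springernature.com/ |
      | Web crawler | Web pages | Papers, patents, reports, books | Open access | | requests.readthedocs.io, crummy.com/software/BeautifulSoup |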
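The acquisition routes in the table above only help downstream mining if each retrieved document stays traceable to its source. The following is a minimal sketch of such a provenance record, not the authors' implementation. The names `TitleParser`, `Record`, and `make_record`, the route labels, and the example URL are all hypothetical. It uses only the Python standard library in place of requests and BeautifulSoup so that it runs without extra dependencies, and it parses an HTML string instead of fetching a live page.

```python
import hashlib
from dataclasses import dataclass
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only enter the first <title>; later ones are ignored.
        if tag == "title" and not self.title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


@dataclass(frozen=True)
class Record:
    source_url: str  # where the document came from
    route: str       # "api" or "crawler" (hypothetical labels)
    title: str
    sha256: str      # fingerprint of the raw content


def make_record(source_url: str, route: str, html: str) -> Record:
    """Build a provenance record for one retrieved document."""
    parser = TitleParser()
    parser.feed(html)
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    return Record(source_url, route, parser.title.strip(), digest)


page = "<html><head><title> NASICON conductivity </title></head><body>...</body></html>"
record = make_record("https://example.org/paper/1", "crawler", page)
print(record.title, record.sha256[:12])
```

Storing the content hash next to the source URL makes it possible to detect later whether a corpus entry still matches the document it was extracted from.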

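      | Name | Scope | Open source | Version iteration | Feature completeness | Difficulty | Friendliness |
      |---|---|---|---|---|---|---|
      | OSCAR4[25] | Chemical reactions and biochemistry | | | | Moderate | |
      | ChemicalTagger[26] | Chemical synthesis actions and conditions | | | | Moderate | |
      | ChemDataExtractor[27] | General chemistry and materials science | | | | Easy | |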

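      | Source | Goal | Number of labels | Label categories | Domain | Application examples |
      |---|---|---|---|---|---|
      | Weston et al.[11] | Link the latest research results in materials science with the historical literature | 7 | Inorganic material, phase structure, descriptor, property, application, synthesis method, characterization method | Inorganic materials | Target-material retrieval, literature search and summarization, meta-information analysis |
      | He et al.[13] | Mine reaction-precursor information from the literature on inorganic solid-state synthesis | 3 | Material, synthesis precursor, target compound | Inorganic solid-state synthesis reactions | Data mining of solid-state synthesis precursors, meta-information analysis |
      | Friedrich et al.[12] | Annotate information related to SOFC experiments in scientific publications | 4 (SOFC), 17 (SOFC-slot) | Experiment, material, value, application, etc. | Battery materials | Build a scientific corpus on SOFCs and use it for multiple experimental information extraction tasks |
      | Wang et al.[10] | Automatically mine from the literature the high-quality, reliable data required by data-driven materials design models | 6 | Element, alloy named entity, composition content, property descriptor, property value, other | Alloy materials | Prediction of the $\gamma'$ solvus temperature of cobalt-based single-crystal superalloys |
      | Nie et al.[9] | Build a semantic representation framework to explore potential lithium-ion battery cathode materials | 3 | Inorganic material, lithium-ion battery cathode material, property descriptor | Battery materials | Design and optimization of new lithium-ion battery cathode materials |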

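      | Entity label | Definition | Examples |
      |---|---|---|
      | Composition | Content related to chemical formulas; content describing amounts within a material, etc. | NaCl, CaCl2; Na concentration, Electrons charge carriers. |
      | Structure | Crystal structure; phase; names used to characterize the crystal structure, etc. | Fcc, Phase; Bottleneck, Channel, Path. |
      | Property | Measurable quantities with units; qualitative properties or phenomena exhibited by a material; nouns describing a material's physical/chemical behavior or physical/chemical mechanisms, etc. | Conductivity, Activation, Radius; Ferroelectric, Metallic; Phase transition, Ionic reaction. |
      | Processing | Material synthesis techniques or processing routes; material modification methods, etc. | Solid state reaction, Annealing; Doping. |
      | Characterization | Any experiment, theory, model, or formula used to characterize a material. | XRD, STM, Photoluminescence, DFT; Bethe-Salpeter equation. |
      | Application | Any high-level application; any specific device or system, etc. | Cathode, Photovoltaics; Battery Management System. |
      | Feature | Special descriptions of the sample type or shape. | Single crystal, Bulk, nanotube, Quantum dot. |
      | Condition | The environment or external conditions of a material. | 980 °C, 1000 MPa. |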
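The eight entity labels above translate directly into a tag set for sequence labeling. The sketch below assumes the BIO scheme that appears in the annotated examples of this article (`B-Property`, `I-Property`, `O`). The names `ENTITY_LABELS`, `BIO_TAGS`, and `is_valid_bio` are hypothetical and only illustrate how an annotation could be checked for well-formedness.

```python
ENTITY_LABELS = [
    "Composition", "Structure", "Property", "Processing",
    "Characterization", "Application", "Feature", "Condition",
]

# One B- and one I- tag per label, plus "O" for tokens outside any entity.
BIO_TAGS = ["O"] + [f"{prefix}-{label}"
                    for label in ENTITY_LABELS
                    for prefix in ("B", "I")]


def is_valid_bio(tags):
    """Return True if every tag is known and each I-X follows B-X or I-X."""
    previous = "O"
    for tag in tags:
        if tag not in BIO_TAGS:
            return False
        # An inside tag must continue an entity of the same type.
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            return False
        previous = tag
    return True


print(len(BIO_TAGS))
print(is_valid_bio(["O", "B-Property", "I-Property", "O"]))
```

A check like this is cheap to run over a whole corpus and catches the most common manual annotation slip, an `I-` tag with no preceding `B-` tag.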

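      | Relation label (A to B) | Definition | Entity types that may hold this relation |
      |---|---|---|
      | Cause-Effect | A affects B | Property-Property, Composition-Structure, Structure-Property, ... |
      | Component-Whole | A is part of B | Composition-Composition, ... |
      | Feature-Of | A is a feature of B | Feature-Composition, Feature-Application, ... |
      | Located-Of | A occupies the site of B | Composition-Structure, ... |
      | Instance-Of | A is an instance of B | Composition-Composition, Structure-Structure, Property-Property, ... |
      | Condition-On | The condition of A is B | Processing-Condition, ... |
      | Method-Of | The characterization method of A is B | Property-Characterization, ... |
      | Other | A and B have a relation other than the types above | |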
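The relation scheme above can be encoded as a lookup from relation label to entity-type pairs. Because every row of the table ends with "...", the listed pairs are examples and not an exhaustive schema, so this sketch flags unlisted pairs for review instead of rejecting them. `LISTED_PAIRS`, `Relation`, `check`, and the three status strings are hypothetical names chosen for illustration.

```python
from typing import NamedTuple

# Entity-type pairs that the relation table lists explicitly (A type, B type).
LISTED_PAIRS = {
    "Cause-Effect": {("Property", "Property"),
                     ("Composition", "Structure"),
                     ("Structure", "Property")},
    "Component-Whole": {("Composition", "Composition")},
    "Feature-Of": {("Feature", "Composition"), ("Feature", "Application")},
    "Located-Of": {("Composition", "Structure")},
    "Instance-Of": {("Composition", "Composition"),
                    ("Structure", "Structure"),
                    ("Property", "Property")},
    "Condition-On": {("Processing", "Condition")},
    "Method-Of": {("Property", "Characterization")},
    "Other": set(),
}


class Relation(NamedTuple):
    label: str
    head_type: str  # entity type of A
    tail_type: str  # entity type of B


def check(relation: Relation) -> str:
    """Classify an annotated relation against the listed type pairs."""
    if relation.label not in LISTED_PAIRS:
        return "unknown-label"
    if (relation.head_type, relation.tail_type) in LISTED_PAIRS[relation.label]:
        return "listed"
    return "needs-review"


print(check(Relation("Method-Of", "Property", "Characterization")))
print(check(Relation("Method-Of", "Characterization", "Property")))
```

Since the relations are directed (A to B), the second call swaps the argument types and is therefore routed to review.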

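      | Annotation tool | Suited task | Text requirements | Role management | Difficulty | Friendliness | Extensibility | Reference |
      |---|---|---|---|---|---|---|---|
      | Label Studio | Multimodal annotation | Strict | Incomplete | Moderate | | | labelstud.io |
      | Brat | Relation annotation | Moderate | Complete | Moderate | | | github.com/nlplab/brat |
      | Doccano | Text classification | Strict | Fairly complete | Moderate | | | github.com/doccano |
      | EasyData | Entity and relation annotation | Moderate | Complete | Easy | | | ai.baidu.com/easydata/ |
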
      Algorithm 1  Data augmentation method cDA-DK
      Input:  original dataset $D_{\mathrm{train}} = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$
              pretrained language model $P_{\mathrm{DistilRoBERTa}}$
              materials-domain dictionary $C = \{w_1, w_2, \dots, w_m\}$
      Output: augmented dataset $D_{\mathrm{synthetic}}$
      1: begin
      2: for $w_i \in C$ do
      3:   add $w_i$ to the vocabulary of $P_{\mathrm{DistilRoBERTa}}$ and train its word embedding
      4: fine-tune $P_{\mathrm{DistilRoBERTa}}$ on the downstream text data augmentation task to obtain $F_{\mathrm{DistilRoBERTa}}$
      5: initialize $D_{\mathrm{synthetic}} = \{\}$
      6: for $(x_i, y_i) \in D_{\mathrm{train}}$ do
      7:   $(\hat{x}_i, \hat{y}_i) = F_{\mathrm{DistilRoBERTa}}(x_i, y_i)$  // generate a new sample
      8:   $D_{\mathrm{synthetic}} = D_{\mathrm{synthetic}} \cup \{(\hat{x}_i, \hat{y}_i)\}$  // add the generated sample to the augmented dataset
      9: end
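The generation loop of Algorithm 1 (steps 5 to 8) can be sketched in Python with the model passed in as a callable. This is not the cDA-DK model: the fine-tuned DistilRoBERTa of steps 2 to 4 is replaced here by a toy dictionary-substitution generator so that the sketch runs without any model weights. The names `augment`, `make_toy_generator`, and `toy_lexicon` are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Sample = Tuple[List[str], List[str]]  # (tokens x_i, BIO tags y_i)


def augment(train: List[Sample],
            generate: Callable[[Sample], Sample]) -> List[Sample]:
    """Steps 5-8 of Algorithm 1: one generated sample per training sample."""
    synthetic: List[Sample] = []
    for sample in train:
        synthetic.append(generate(sample))
    return synthetic


def make_toy_generator(lexicon: Dict[str, List[str]],
                       seed: int = 0) -> Callable[[Sample], Sample]:
    """Toy stand-in for the fine-tuned model (an assumption, not cDA-DK):
    swap each entity-initial token for another lexicon word of the same
    entity type and keep the tag sequence unchanged."""
    rng = random.Random(seed)

    def generate(sample: Sample) -> Sample:
        tokens, tags = sample
        new_tokens = []
        for token, tag in zip(tokens, tags):
            # tag[2:] is the entity type for B-/I- tags and "" for "O".
            candidates = [w for w in lexicon.get(tag[2:], []) if w != token]
            if tag.startswith("B-") and candidates:
                new_tokens.append(rng.choice(candidates))
            else:
                new_tokens.append(token)
        return new_tokens, list(tags)

    return generate


train = [("The ionic conductivity decreases".split(),
          ["O", "B-Property", "I-Property", "O"])]
toy_lexicon = {"Property": ["ionic", "electrode", "electric"]}  # hypothetical
d_synthetic = augment(train, make_toy_generator(toy_lexicon))
d_augmented = train + d_synthetic  # original plus generated samples
print(d_synthetic[0])
```

Passing the generator as a parameter keeps the loop identical whether the samples come from this toy substitution or from a fine-tuned language model.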

      | Dataset | Samples | Entity types | Entities | Relation types | Relations |
      |---|---|---|---|---|---|
      | CoNLL-2004 | 1,441 | 4 | 5,347 | 5 | 2,020 |
      | NASICON | 2,434 | 8 | 4,857 | 8 | 2,297 |

      | Dataset | Samples | Entities | Relations | Example |
      |---|---|---|---|---|
      | Original dataset | 2,434 | 4,857 | 2,297 | The (O) ionic (B-Property) conductivity (I-Property) decreases (O) with (O) increasing (O) activation (B-Property) energy (I-Property) . (O) |
      | cDA-DK augmented dataset | 4,846 | 9,714 | 4,594 | The (O) electrode (B-Property) conductivity (I-Property) decreases (O) with (O) increasing (O) electric (B-Property) energy (I-Property) . (O) |
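The inline `token (TAG)` notation used in the examples above can be turned back into token and tag sequences, from which the entity mentions follow. The sketch below assumes that tokens contain no parentheses or whitespace. `parse_annotated` and `extract_entities` are hypothetical helper names.

```python
import re

# A token (no whitespace or parentheses) followed by its tag in parentheses.
PAIR = re.compile(r"([^\s()]+)\s*\(([^()]+)\)")


def parse_annotated(text):
    """Split 'token (TAG)' pairs into parallel token and tag lists."""
    pairs = PAIR.findall(text)
    return [token for token, _ in pairs], [tag for _, tag in pairs]


def extract_entities(tokens, tags):
    """Group B-/I- runs into (entity text, entity type) tuples."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous entity before starting a new one
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


example = ("The (O) ionic (B-Property) conductivity (I-Property) decreases (O) "
           "with (O) increasing (O) activation (B-Property) energy (I-Property) . (O)")
tokens, tags = parse_annotated(example)
print(extract_entities(tokens, tags))
# [('ionic conductivity', 'Property'), ('activation energy', 'Property')]
```

The optional whitespace in the pattern also accepts pairs written without spaces, such as `ionic(B-Property)`.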

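      | Dataset name | Application domain | Renamed as | Sample size | Corpus size | Source |
      |---|---|---|---|---|---|
      | NASICON entity recognition dataset | NASICON-type solid electrolytes | Dataset 1 | 2,434 | 55 papers | Annotated by domain experts |
      | | | Dataset 2 | 2,434 | | Data augmentation |
      | | | Dataset 3 | 305 | 35 papers | Annotated by non-experts |
      | Matscholar[11] | Inorganic materials | Dataset 4 | 5,459 | 800 abstracts | Annotated by domain experts |
      | | | Dataset 5 | 5,459 | | Data augmentation |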

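      | Dataset | Material class | Sample size | Precision | Recall | F1-score |
      |---|---|---|---|---|---|
      | Dataset 1 | NASICON-type solid electrolytes | 2,434 | 0.78 | 0.83 | 0.80 |
      | Dataset 2 | | 2,434 | 0.68 | 0.72 | 0.70 |
      | Dataset 2+3 | | 2,739 | 0.83 | 0.85 | 0.84 |
      | Dataset 4 | Inorganic materials | 5,459 | 0.86 | 0.90 | 0.88 |
      | Dataset 5 | | 5,459 | 0.75 | 0.78 | 0.77 |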
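The F1-score column is the harmonic mean of precision and recall, $F_1 = 2PR/(P+R)$. The short sketch below recomputes it from the precision and recall values in the table above. The function name `f1_score` and the dictionary `reported` are hypothetical, and the zero-denominator convention is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table.
reported = {
    "Dataset 1": (0.78, 0.83),
    "Dataset 2": (0.68, 0.72),
    "Dataset 2+3": (0.83, 0.85),
    "Dataset 4": (0.86, 0.90),
    "Dataset 5": (0.75, 0.78),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```

Because the table reports precision and recall rounded to two decimals, a value recomputed this way can differ from the reported F1-score in the last digit.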
    • Supplementary material: 补充材料-7-20222316-070701.pdf
    Metrics
    • Abstract views: 4828
    • PDF Downloads: 175
    • Cited By: 0
    Publishing process
    • Received Date: 05 December 2022
    • Accepted Date: 07 February 2023
    • Available Online: 09 February 2023
    • Published Online: 05 April 2023
