
Details

南美白对虾养殖领域中文命名实体识别数据集构建 (EI indexed)

Construction of Chinese Named Entity Recognition Dataset in Penaeus Vannamei Farming Field

Document type: Journal article

Chinese title: 南美白对虾养殖领域中文命名实体识别数据集构建

English title: Construction of Chinese Named Entity Recognition Dataset in Penaeus Vannamei Farming Field

Authors: 彭小红 (Peng Xiaohong)[1]; 邓峰 (Deng Feng)[1]; 余应淮 (Yu Yinghuai)[1]

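Affiliation: [1] School of Mathematics and Computer Science, Guangdong Ocean University (广东海洋大学数学与计算机学院), Zhanjiang, Guangdong 524088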

Year: 2025

Volume: 61

Issue: 9

Pages: 353

Chinese journal name: 计算机工程与应用

English journal name: Computer Engineering and Applications

Indexed in: Peking University Core Journals (北大核心) 2023; EI (accession number: 20251918362881); Peking University Core Journals (北大核心)

Funding: Guangdong Province Smart Platform for the Modern Shrimp Seed Industry (广东省对虾现代种业智慧平台, 2022GCZX001).

Language: Chinese

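Chinese keywords (translated): named entity recognition; VamNER dataset; inter-annotator agreement (IAA); bidirectional encoder representations from Transformers (BERT); bidirectional long short-term memory network (BiLSTM); conditional random fields (CRF)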

English keywords: named entity recognition; VamNER dataset; inter-annotator agreement (IAA); bidirectional encoder representations from Transformers (BERT); bidirectional long short-term memory network (BiLSTM); conditional random fields (CRF)

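Chinese abstract (translated): This study builds a high-quality dataset, named VamNER, for named entity recognition (NER) in the Penaeus vannamei farming domain. To ensure diversity, high-quality papers from the past 10 years were collected from the CNKI database and combined with authoritative books to build the corpus. Experts were invited to discuss the entity types, and professionally trained annotators labeled the corpus in the IOB2 format; annotation was split into a pre-annotation stage and a formal annotation stage to improve efficiency. In the pre-annotation stage, inter-annotator agreement (IAA) reached 0.87, indicating high consistency among annotators. The final VamNER dataset contains 6115 sentences and 384602 characters, covering 10 entity types and 12814 entities. Comparison with several general-domain datasets and one domain-specific dataset reveals the distinctive characteristics of VamNER. The experiments used a pre-trained bidirectional encoder representations from Transformers (BERT) model, a bidirectional long short-term memory network (BiLSTM), and a conditional random fields (CRF) model; the best model reached an F1 of 82.8% on the test set. VamNER is the first NER dataset focused on Penaeus vannamei farming; it provides a rich resource for Chinese domain-specific NER research and is expected to advance NER research in aquaculture.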

English abstract: This research is dedicated to constructing a high-quality dataset, named VamNER, for the named entity recognition (NER) task in the field of Penaeus vannamei farming. To ensure the diversity of the dataset, high-quality papers from the past 10 years were collected from the CNKI database and combined with authoritative books for corpus construction. Experts were invited to discuss entity types, and professionally trained annotators annotated the corpus in the IOB2 format. The annotation process was divided into two stages, pre-annotation and formal annotation, to improve efficiency. In the pre-annotation stage, the inter-annotator agreement (IAA) reached 0.87, indicating high consistency among annotators. The final VamNER dataset contains 6115 sentences with a total of 384602 characters, covering 10 entity types and 12814 entities. The study reveals the unique properties of VamNER by comparing it with multiple general-domain datasets and one domain-specific dataset. The experiments use a pre-trained bidirectional encoder representations from Transformers (BERT) model, a bidirectional long short-term memory network (BiLSTM), and a conditional random fields (CRF) model. The F1 value of the optimal model on the test set reaches 82.8%. VamNER is the first NER dataset focusing on the field of Penaeus vannamei farming; it provides rich resources for Chinese domain-specific NER research and is expected to promote the development of NER research in the aquaculture field.

