A Method for Extraction of Keywords from Safety Information in Civil Aviation
Abstract: Keywords summarize the content of civil aviation safety information and help safety personnel extract and retrieve that information. After reviewing current keyword extraction techniques and analyzing the characteristics of civil aviation terminology, this paper proposes a keyword extraction model based on the naive Bayes model. The features selected for the model are word length, part of speech, and a Term Frequency-Inverse Document Frequency (TF-IDF) value that incorporates word position and paragraph span; together these represent the basic attributes of each candidate word. The model is trained on civil aviation safety information whose keywords have been manually labeled, in order to obtain the probability of each feature. These feature probabilities are then used to compute a keyword score for every candidate word, and the three highest-scoring words are output as keywords. Experiments on civil aviation safety information show that, compared with the traditional TF-IDF algorithm and the KEA algorithm, the proposed method improves accuracy by 18% and 11.9%, respectively, and improves the recognition rate of civil aviation terms by 15.3% and 12.3%, respectively. The results indicate that, compared with these traditional algorithms, the proposed method substantially improves both the accuracy of keyword extraction and the ability to recognize civil aviation terms.
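The scoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption: features are treated as discrete values (with the TF-IDF value pre-binned into labels such as "high" and "low"), probabilities are estimated with add-one smoothing, and the score is the KEA-style posterior probability that a candidate is a keyword. The names `train`, `score`, and `top_keywords` are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict


def train(samples):
    """Count feature values per class.

    samples: list of (features, is_keyword) pairs, where features is a dict
    such as {"length": 2, "pos": "n", "tfidf": "high"} (assumed discrete).
    Both classes must appear at least once in the training data.
    """
    class_counts = Counter()
    feat_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values = defaultdict(set)           # feature name -> values seen in training
    for feats, label in samples:
        class_counts[label] += 1
        for name, value in feats.items():
            feat_counts[(label, name)][value] += 1
            values[name].add(value)
    return class_counts, feat_counts, values


def _joint(model, feats, label):
    """P(label) * product of P(feature value | label), add-one smoothed."""
    class_counts, feat_counts, values = model
    p = class_counts[label] / sum(class_counts.values())
    for name, value in feats.items():
        count = feat_counts[(label, name)][value]
        p *= (count + 1) / (class_counts[label] + len(values[name]))
    return p


def score(model, feats):
    """Posterior probability that a candidate with these features is a keyword."""
    yes = _joint(model, feats, True)
    no = _joint(model, feats, False)
    return yes / (yes + no)


def top_keywords(model, candidates, k=3):
    """candidates: dict mapping word -> features. Returns the k best words."""
    ranked = sorted(candidates, key=lambda w: score(model, candidates[w]),
                    reverse=True)
    return ranked[:k]
```

With `k=3` the function mirrors the abstract's choice of outputting the three highest-scoring words. In practice the TF-IDF feature would have to be computed and discretized beforehand, including whatever position and paragraph-span weighting the paper applies, which this sketch does not attempt to reproduce.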