Word Segmentation and Part-of-Speech Tagging

Overview

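Written Chinese does not use explicit delimiters (such as spaces) between words, so word segmentation is the most basic step in most natural language processing tasks. A part of speech describes the role a word plays in its context, and part-of-speech (POS) tagging identifies the part of speech of each word in order to determine that role. POS tagging is generally another basic NLP step built on top of segmentation. To meet BosonNLP's own NLP needs, BosonNLP implemented this segmentation and POS tagging system with a method that enumerates segmentations and POS tags jointly, and makes it available to other developers through an open API.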

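Both segmentation and POS tagging in BosonNLP are implemented as sequence labeling. Labeling word boundaries and parts of speech over a sentence, with the word as the unit, keeps the speed and efficiency of string-matching segmentation while also using context to recognize unseen words and resolve ambiguity automatically. It also avoids the cascading amplification in which segmentation errors cause POS tagging errors.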
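
To make the sequence-labeling idea concrete, here is a minimal sketch of how joint boundary and POS labels can be decoded into tagged words. The source does not say which label scheme BosonNLP uses, so the character-level B/M/E/S scheme, the function name `decode_labels`, and the sample labels below are all illustrative assumptions rather than a description of the actual system.

```python
def decode_labels(chars, labels):
    """Merge per-character labels like 'B-n', 'E-n', 'S-d' into (word, pos) pairs.

    Assumed scheme: B = begin, M = middle, E = end, S = single-character word,
    followed by '-' and a POS tag.
    """
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    words = []
    buffer = ""
    for ch, label in zip(chars, labels):
        boundary, pos = label.split("-", 1)
        buffer += ch
        if boundary in ("E", "S"):  # a word ends at this character
            words.append((buffer, pos))
            buffer = ""
    if buffer:
        # Unterminated trailing word: keep it rather than drop characters.
        words.append((buffer, pos))
    return words


# Hypothetical labels, written by hand for illustration only.
chars = list("法院受理案件")
labels = ["B-n", "E-n", "B-v", "E-v", "B-n", "E-n"]
print(decode_labels(chars, labels))
```

Because each label carries the boundary and the POS tag together, a decoder of this kind has no separate tagging pass that could inherit errors from an earlier segmentation pass.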

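The BosonNLP segmentation and POS tagging system is implemented entirely in-house. On top of the original algorithm and corpus, it adds several optimizations:

  • Recognition of special tokens such as URLs and email addresses
  • An adjusted and optimized POS tag set with finer-grained labels (22 major categories, 69 tags)
  • Corrections to the training corpus
  • Traditional-to-Simplified conversion, so that sentences in Traditional Chinese or in mixed Traditional and Simplified Chinese can be processed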

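The system also provides several segmentation options to meet the needs of different developers:

  • Space retention option
  • New-word enumeration strength option
  • Traditional-to-Simplified conversion option
  • Special character conversion option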

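Performance:

  • Results on the People's Daily test set:

                                   Precision   Recall     F1
    Segmentation                   0.976725    0.981847   0.979279
    Segmentation and POS tagging   0.954014    0.959017   0.956509

  • To meet BosonNLP's own NLP needs, we also prepared a new segmentation corpus covering news, Weibo posts, reviews, and other categories of data from the past two years. This dataset contains many recently coined words, non-standard internet slang, typos, and so on, which makes it harder to process. Results on this dataset:

                                   Precision   Recall     F1
    Segmentation                   0.969493    0.974508   0.971994
    Segmentation and POS tagging   0.946201    0.951096   0.948642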
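
As a quick consistency check on the figures above, the F1 column can be recomputed from the precision and recall columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the source does not state the formula, and the row names are shorthand chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the two result tables.
rows = {
    "People's Daily / segmentation": (0.976725, 0.981847, 0.979279),
    "People's Daily / segmentation + POS": (0.954014, 0.959017, 0.956509),
    "New corpus / segmentation": (0.969493, 0.974508, 0.971994),
    "New corpus / segmentation + POS": (0.946201, 0.951096, 0.948642),
}

for name, (p, r, reported) in rows.items():
    print(name, "computed:", round(f1_score(p, r), 6), "reported:", reported)
```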

Usage

For the part of speech that each tag denotes, see the POS tag reference.

URL
http://api.bosonnlp.com/tag/analysis?space_mode=0&oov_level=3&t2s=0&special_char_conv=0

Parameters

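space_mode (space retention option)

value  Description
0      Do not keep spaces (default)
1      Collapse each run of consecutive spaces into a single space
2      Keep all spaces; the spacing of the original text is unchanged
3      Drop spaces between English words; collapse each run of consecutive spaces between Chinese words into a single space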

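oov_level (new-word enumeration strength option)

value  Description
0      Do not enumerate new words; only words found in the dictionary appear in the segmentation result
1-4    Allow new words; the likelihood of new words appearing increases from 1 to 4
The default is 3 (normal level).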

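t2s (Traditional-to-Simplified conversion option)

value  Description
0      Conversion off; keep the original text (default)
1      Convert all Traditional Chinese to Simplified Chinese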

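special_char_conv (special character conversion option)

value  Description
0      No special character conversion; keep the original text (default)
1      Convert special characters: "\n", "\r", and "\t" become "_Enter_", "_Enter_", and "_Tab_" respectively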
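
The four query parameters can be assembled programmatically instead of being typed into the URL by hand. The following is a minimal sketch using only the standard library; the helper name `build_tag_url` and its range checks are assumptions made for this example (they mirror the value tables above) and are not part of any BosonNLP SDK.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.bosonnlp.com/tag/analysis"


def build_tag_url(space_mode=0, oov_level=3, t2s=0, special_char_conv=0):
    """Build the request URL, rejecting values outside the documented ranges."""
    if space_mode not in (0, 1, 2, 3):
        raise ValueError("space_mode must be 0, 1, 2 or 3")
    if oov_level not in (0, 1, 2, 3, 4):
        raise ValueError("oov_level must be between 0 and 4")
    if t2s not in (0, 1):
        raise ValueError("t2s must be 0 or 1")
    if special_char_conv not in (0, 1):
        raise ValueError("special_char_conv must be 0 or 1")
    params = {
        "space_mode": space_mode,
        "oov_level": oov_level,
        "t2s": t2s,
        "special_char_conv": special_char_conv,
    }
    return BASE_URL + "?" + urlencode(params)


print(build_tag_url())
print(build_tag_url(space_mode=1, t2s=1))
```

Validating locally is a design choice for this sketch: it surfaces a mistyped option before any request is sent, at the cost of having to update the checks if the service ever accepts new values.
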

HTTP Method

POST

HTTP Header

Content-Type  application/json
Accept        application/json
X-Token       YOUR_API_TOKEN (replace with your own token)

HTTP Request Body

The text to be segmented and tagged, or a list of such texts, in JSON format. For example:

"\u8fd9\u4e2a\u4e16\u754c\u597d\u590d\u6742"

Note

A single request may contain at most 100 documents.
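
One way to respect this limit is to split a long list of texts into batches before sending. The sketch below only prepares the JSON request bodies and sends nothing; the helper name `iter_request_bodies` and the constant name are assumptions made for this example.

```python
import json

MAX_TEXTS_PER_REQUEST = 100  # limit stated in the note above


def iter_request_bodies(texts, batch_size=MAX_TEXTS_PER_REQUEST):
    """Yield JSON request bodies, each holding at most batch_size texts."""
    if not 1 <= batch_size <= MAX_TEXTS_PER_REQUEST:
        raise ValueError("batch_size must be between 1 and 100")
    for start in range(0, len(texts), batch_size):
        yield json.dumps(texts[start:start + batch_size])


texts = ["文本%d" % i for i in range(250)]
bodies = list(iter_request_bodies(texts))
print(len(bodies))  # number of requests needed

# json.dumps escapes non-ASCII characters by default, which produces the
# \uXXXX form used in the request body example.
print(json.dumps("这个世界好复杂"))
```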

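HTTP Response Body

The segmentation and POS tagging result in JSON format. As the examples in this document show, the response is a list with one object per input text, and each object has the following keys:

key   type  Description
word  list  Segmentation result
tag   list  POS tagging result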
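
Since `word` and `tag` are parallel lists, a client usually pairs them up. The following is a minimal sketch that parses a response and returns (word, tag) pairs per input text. The sample response is copied from one of the cURL examples in this document; the function name `pair_words_with_tags` and the length check are assumptions made for this example.

```python
import json

# Sample response body copied from the cURL examples in this document.
raw = (
    '[{"tag": ["n", "n", "vi", "n", "v", "v", "m", "q"], '
    '"word": ["亚投行", "意向", "创始", "成员国", "确定", "为", "57", "个"]}]'
)


def pair_words_with_tags(response_text):
    """Return one list of (word, tag) tuples per input text."""
    paired = []
    for item in json.loads(response_text):
        words, tags = item["word"], item["tag"]
        if len(words) != len(tags):
            raise ValueError("word and tag lists differ in length")
        paired.append(list(zip(words, tags)))
    return paired


for sentence in pair_words_with_tags(raw):
    print(" ".join("%s/%s" % pair for pair in sentence))
```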

cURL Examples

Different space retention settings (space_mode):

$ curl -X POST \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -H "X-Token: YOUR_API_TOKEN" \
     --data "\"人民法院案件受理制度改革  下月起法院将有案必立\"" \
     'http://api.bosonnlp.com/tag/analysis?space_mode=0&oov_level=3&t2s=0'
[{"tag": ["nl", "n", "n", "n", "n", "t", "f", "n", "d", "vyou", "n", "d", "v"], "word": ["人民法院", "案件", "受理", "制度", "改革", "下月", "起", "法院", "将", "有", "案", "必", "立"]}]
$ curl -X POST \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -H "X-Token: YOUR_API_TOKEN" \
     --data "\"人民法院案件受理制度改革  下月起法院将有案必立\"" \
     'http://api.bosonnlp.com/tag/analysis?space_mode=1&oov_level=3&t2s=0'
[{"tag": ["nl", "n", "n", "n", "n", "w", "t", "f", "n", "d", "vyou", "n", "d", "v"], "word": ["人民法院", "案件", "受理", "制度", "改革", " ", "下月", "起", "法院", "将", "有", "案", "必", "立"]}]
$ curl -X POST \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -H "X-Token: YOUR_API_TOKEN" \
     --data "\"人民法院案件受理制度改革  下月起法院将有案必立\"" \
     'http://api.bosonnlp.com/tag/analysis?space_mode=2&oov_level=3&t2s=0'
[{"tag": ["nl", "n", "n", "n", "n", "w", "t", "f", "n", "d", "vyou", "n", "d", "v"], "word": ["人民法院", "案件", "受理", "制度", "改革", "  ", "下月", "起", "法院", "将", "有", "案", "必", "立"]}]

Different new-word enumeration strength settings (oov_level):

$ curl -X POST \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -H "X-Token: YOUR_API_TOKEN" \
     --data "[\"亚投行意向创始成员国确定为57个\",\"“流量贵”频被吐槽\"]" \
     'http://api.bosonnlp.com/tag/analysis?space_mode=0&oov_level=0&t2s=0'
[{"tag": ["ns", "v", "n", "n", "vi", "n", "v", "v", "m", "q"], "word": ["亚", "投", "行", "意向", "创始", "成员国", "确定", "为", "57", "个"]}, {"tag": ["wyz", "n", "a", "wyy", "d", "pbei", "v"], "word": ["“", "流量", "贵", "”", "频", "被", "吐槽"]}]
$ curl -X POST \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -H "X-Token: YOUR_API_TOKEN" \
     --data "[\"亚投行意向创始成员国确定为57个\",\"“流量贵”频被吐槽\"]" \
     'http://api.bosonnlp.com/tag/analysis?space_mode=0&oov_level=1&t2s=0'
[{"tag": ["ns", "n", "n", "vi", "n", "v", "v", "m", "q"], "word": ["亚", "投行", "意向", "创始", "成员国", "确定", "为", "57", "个"]}, {"tag": ["wyz", "n", "a", "wyy", "d", "pbei", "v"], "word": ["“", "流量", "贵", "”", "频", "被", "吐槽"]}]
$ curl -X POST \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -H "X-Token: YOUR_API_TOKEN" \
     --data "[\"亚投行意向创始成员国确定为57个\",\"“流量贵”频被吐槽\"]" \
     'http://api.bosonnlp.com/tag/analysis?space_mode=0&oov_level=3&t2s=0'
[{"tag": ["n", "n", "vi", "n", "v", "v", "m", "q"], "word": ["亚投行", "意向", "创始", "成员国", "确定", "为", "57", "个"]}, {"tag": ["wyz", "n", "a", "wyy", "d", "pbei", "v"], "word": ["“", "流量", "贵", "”", "频", "被", "吐槽"]}]
$ curl -X POST \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -H "X-Token: YOUR_API_TOKEN" \
     --data "[\"亚投行意向创始成员国确定为57个\",\"“流量贵”频被吐槽\"]" \
     'http://api.bosonnlp.com/tag/analysis?space_mode=0&oov_level=4&t2s=0'
[{"tag": ["n", "n", "vi", "n", "v", "v", "m", "q"], "word": ["亚投行", "意向", "创始", "成员国", "确定", "为", "57", "个"]}, {"tag": ["wyz", "n", "wyy", "d", "pbei", "v"], "word": ["“", "流量贵", "”", "频", "被", "吐槽"]}]
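
The effect of oov_level is easiest to see by comparing the word lists returned at different levels. The sketch below does this offline with word lists copied from the responses above; the helper name `new_words` is an assumption made for this example.

```python
def new_words(baseline_words, candidate_words):
    """Return words in candidate_words that do not occur in baseline_words."""
    known = set(baseline_words)
    return [w for w in candidate_words if w not in known]


# Word lists copied from the responses above (first input sentence).
level0 = ["亚", "投", "行", "意向", "创始", "成员国", "确定", "为", "57", "个"]
level1 = ["亚", "投行", "意向", "创始", "成员国", "确定", "为", "57", "个"]
level3 = ["亚投行", "意向", "创始", "成员国", "确定", "为", "57", "个"]

# Word lists for the second input sentence at levels 0 and 4.
second_level0 = ["“", "流量", "贵", "”", "频", "被", "吐槽"]
second_level4 = ["“", "流量贵", "”", "频", "被", "吐槽"]

print(new_words(level0, level1))                # ['投行']
print(new_words(level0, level3))                # ['亚投行']
print(new_words(second_level0, second_level4))  # ['流量贵']
```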

Different Traditional-to-Simplified conversion settings (t2s):

$ curl -X POST \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -H "X-Token: YOUR_API_TOKEN" \
     --data "\"臺灣沿用傳統漢字,稱之為正體字\"" \
     'http://api.bosonnlp.com/tag/analysis?space_mode=0&oov_level=3&t2s=0'
[{"tag": ["ns", "v", "n", "nz", "wd", "v", "n"], "word": ["臺灣", "沿用", "傳統", "漢字", ",", "稱之為", "正體字"]}]
$ curl -X POST \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -H "X-Token: YOUR_API_TOKEN" \
     --data "\"臺灣沿用傳統漢字,稱之為正體字\"" \
     'http://api.bosonnlp.com/tag/analysis?space_mode=0&oov_level=3&t2s=1'
[{"tag": ["ns", "v", "n", "n", "wd", "v", "n"], "word": ["台湾", "沿用", "传统", "汉字", ",", "称之为", "正体字"]}]

Different special character conversion settings (special_char_conv):

$ curl -X POST \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -H "X-Token: YOUR_API_TOKEN" \
     --data "[\"亚投行 意向创始成员国确定为57个\n\",\"“流量贵”频被吐槽\n\"]" \
     'http://api.bosonnlp.com/tag/analysis?space_mode=0&oov_level=3&t2s=0&special_char_conv=0'
[{"tag": ["n", "n", "vi", "n", "v", "v", "m", "q", "w"], "word": ["亚投行", "意向", "创始", "成员国", "确定", "为", "57", "个", "\n"]}, {"tag": ["wyz", "n", "wyy", "d", "pbei", "v", "w"], "word": ["“", "流量贵", "”", "频", "被", "吐槽", "\n"]}]
$ curl -X POST \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -H "X-Token: YOUR_API_TOKEN" \
     --data "[\"亚投行 意向创始成员国确定为57个\n\",\"“流量贵”频被吐槽\n\"]" \
     'http://api.bosonnlp.com/tag/analysis?space_mode=0&oov_level=3&t2s=0&special_char_conv=1'
[{"tag": ["n", "n", "vi", "n", "v", "v", "m", "q", "w"], "word": ["亚投行", "意向", "创始", "成员国", "确定", "为", "57", "个", "_Enter_"]}, {"tag": ["wyz", "n", "wyy", "d", "pbei", "v", "w"], "word": ["“", "流量贵", "”", "频", "被", "吐槽", "_Enter_"]}]
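
If the placeholder tokens need to be turned back into control characters on the client side, a small lookup is enough. This is a sketch under the assumption that the placeholders appear as whole tokens, as in the response above; the helper name `restore_special_chars` is hypothetical. Because both "\n" and "\r" are converted to "_Enter_", the original character cannot be recovered exactly, and this sketch maps "_Enter_" back to "\n".

```python
# Placeholder tokens produced when special_char_conv=1.
PLACEHOLDERS = {"_Enter_": "\n", "_Tab_": "\t"}


def restore_special_chars(words):
    """Replace placeholder tokens with control characters; keep other words."""
    return [PLACEHOLDERS.get(word, word) for word in words]


words = ["“", "流量贵", "”", "频", "被", "吐槽", "_Enter_"]
print(restore_special_chars(words))
```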

Python Example

# -*- encoding: utf-8 -*-
from __future__ import print_function, unicode_literals

import json
import requests


TAG_URL = 'http://api.bosonnlp.com/tag/analysis'
# Any option left at its default can be omitted from TAG_URL. The full TAG_URL is:
# 'http://api.bosonnlp.com/tag/analysis?space_mode=0&oov_level=3&t2s=0&special_char_conv=0'
# To set space_mode to 1:
# TAG_URL = \
#     'http://api.bosonnlp.com/tag/analysis?space_mode=1'
# To set oov_level to 1:
# TAG_URL = \
#     'http://api.bosonnlp.com/tag/analysis?oov_level=1'
# To set t2s to 1:
# TAG_URL = \
#     'http://api.bosonnlp.com/tag/analysis?t2s=1'
# To set special_char_conv to 1:
# TAG_URL = \
#     'http://api.bosonnlp.com/tag/analysis?special_char_conv=1'

s = ['亚投行意向创始成员国确定为57个', '“流量贵”频被吐槽']
data = json.dumps(s)
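# Replace YOUR_API_TOKEN with your own token. The other two headers match
# the HTTP Header section of this document.
headers = {
    'X-Token': 'YOUR_API_TOKEN',
    'Content-Type': 'application/json',
    'Accept': 'application/json',
}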
resp = requests.post(TAG_URL, headers=headers, data=data.encode('utf-8'))


for d in resp.json():
    print(' '.join(['%s/%s' % it for it in zip(d['word'], d['tag'])]))

Run

$ python tag_api_example.py
亚投行/n 意向/n 创始/vi 成员国/n 确定/v 为/v 57/m 个/q
“/wyz 流量/n 贵/a ”/wyy 频/d 被/pbei 吐槽/v

Python SDK Example

# -*- encoding: utf-8 -*-
from __future__ import print_function, unicode_literals

from bosonnlp import BosonNLP

# Note: replace this with your own API token when testing.
nlp = BosonNLP('YOUR_API_TOKEN')

s = ['亚投行意向创始成员国确定为57个', '“流量贵”频被吐槽']

result = nlp.tag(s)
# The full call with every parameter spelled out:
# result = nlp.tag(s, space_mode=0, oov_level=3, t2s=0, special_char_conv=0)
# To set space_mode to 1:
# result = nlp.tag(s, space_mode=1, oov_level=3, t2s=0, special_char_conv=0)
# To set oov_level to 1:
# result = nlp.tag(s, space_mode=0, oov_level=1, t2s=0, special_char_conv=0)
# To set t2s to 1:
# result = nlp.tag(s, space_mode=0, oov_level=3, t2s=1, special_char_conv=0)
# To set special_char_conv to 1:
# result = nlp.tag(s, space_mode=0, oov_level=3, t2s=0, special_char_conv=1)

for d in result:
    print(' '.join(['%s/%s' % it for it in zip(d['word'], d['tag'])]))

For details on segmentation and POS tagging with the Python SDK, see the Python SDK documentation.