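Hylanda Chinese Word Segmentation Documentation -- Hylanda Big Data Analytics Platform Intelligence Center

User Guide

Welcome to the HylandaNLP word segmentation API service

The API service is an online Chinese word segmentation cloud service provided by Hylanda (海量信息技术有限公司). This document is aimed at API developers; if you have questions about its content, contact us through the customer service group. To get started, register an account on the Hylanda big data analytics platform, then contact customer service to request an apikey. The key and your registered account together serve as your credentials. The system is currently in a testing phase: every account can use it free of charge, with a cap of 100,000 calls per month in total.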

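Word segmentation

This endpoint provides word segmentation, with support for segmentation, part-of-speech tagging, and proper-name recognition.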

Endpoint details

URL

http://bigdata.hylanda.com/HylandaNlp/hlsegapi/wordseg

HTTP Method

POST

HTTP Header

Parameter	Type	Value
Content-Type	String	application/json
Accept	String	application/json
id	String	YOUR_ID (replace with your own id)
apikey	String	YOUR_API_KEY (replace with your own apikey)

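HTTP Request Body

The request body is JSON.

Parameter	Type	Value
content	String	Text to segment, UTF-8 encoded, at most 10,000 characters
keyword	Boolean	Whether to return keywords: true returns them, false does not (optional; returned by default)
uuid	Boolean	Whether to return the semantic fingerprint (optional; returned by default)
mode	String	Segmentation granularity: "large" for coarse-grained segmentation, "search" for retrieval-optimized segmentation, "normal" for standard segmentation (optional; defaults to "normal")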
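
The body parameters above can be assembled and checked on the client before a request is sent. The following is a minimal sketch rather than part of any official SDK: the helper name `build_wordseg_body` and the client-side checks are assumptions made for illustration. Only the field names, the three mode values, and the 10,000-character limit come from the table.

```python
import json

MAX_CONTENT_CHARS = 10000  # limit stated in the parameter table
VALID_MODES = ("normal", "large", "search")  # granularities listed in the table


def build_wordseg_body(content, mode=None, keyword=None, uuid=None):
    """Return the JSON request body for the wordseg endpoint as UTF-8 bytes.

    Hypothetical helper: optional fields are left out when None so that the
    server-side defaults described in the table apply.
    """
    if len(content) > MAX_CONTENT_CHARS:
        raise ValueError("content exceeds %d characters" % MAX_CONTENT_CHARS)
    if mode is not None and mode not in VALID_MODES:
        raise ValueError("mode must be one of %s" % ", ".join(VALID_MODES))

    body = {"content": content}
    if mode is not None:
        body["mode"] = mode
    if keyword is not None:
        body["keyword"] = bool(keyword)
    if uuid is not None:
        body["uuid"] = bool(uuid)
    # ensure_ascii=False keeps the Chinese text readable; the bytes are UTF-8.
    return json.dumps(body, ensure_ascii=False).encode("utf-8")
```

Rejecting an unknown mode locally (for example a value with a stray leading space) avoids spending one of the monthly calls on a request that cannot succeed.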

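HTTP Response Format

The response is JSON.

Parameter	Type	Required	Value
ret	String	Yes	"success" on success, "error" on failure
errsmg	String	No	Error message returned when segmentation fails
uuid	String	No	Semantic fingerprint
keywords	JSONArray	No	See the example
items	JSONArray	Yes	See the example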
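
A caller usually wants to turn an "error" result into an exception and treat the optional fields as possibly absent. The sketch below shows one way to do that, assuming the response text has already been read from the HTTP connection. `WordSegError` and `parse_wordseg_response` are hypothetical names; the key `errsmg` is spelled as in the table, and the keyword list is read from `keywords`, the key used in the sample response.

```python
import json


class WordSegError(Exception):
    """Raised when the service reports ret == "error"."""


def parse_wordseg_response(text):
    """Decode a wordseg response and return (items, keywords, uuid)."""
    result = json.loads(text)
    if result.get("ret") != "success":
        # "errsmg" is the field name as spelled in the response table.
        raise WordSegError(result.get("errsmg", "unknown error"))
    # keywords and uuid are optional, so fall back to empty values.
    return result["items"], result.get("keywords", []), result.get("uuid")
```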

Request samples

curl

curl -i -X POST 'http://bigdata.hylanda.com/HylandaNlp/hlsegapi/wordseg' \
 -H 'Content-Type: application/json' \
 -H 'id: xx' \
 -H 'apikey: xx' \
 --data '{"content":"欢迎使用海量分词", "mode":"search"}'
								
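java

// Requires Apache HttpClient 4.x. JSONObject is assumed to be fastjson's;
// any JSON library offering put() and toJSONString() works the same way.
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import org.apache.http.HttpResponse;
import org.apache.http.ParseException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.HTTP;
import org.apache.http.util.EntityUtils;
import com.alibaba.fastjson.JSONObject;

// id and apikey are String variables holding your own credentials.
String url = "http://bigdata.hylanda.com/HylandaNlp/hlsegapi/wordseg";
HttpPost httpRequest = new HttpPost(url);
HttpClient httpClient = new DefaultHttpClient();
String strResult = null;
JSONObject datajson = new JSONObject();
datajson.put("content", "欢迎使用海量分词");
datajson.put("mode", "search"); // use retrieval-optimized segmentation

try {
	httpRequest.addHeader("Content-Type", "application/json");
	httpRequest.addHeader("id", id);
	httpRequest.addHeader("apikey", apikey);
	httpRequest.setEntity(new StringEntity(datajson.toJSONString(), "utf-8"));

	HttpResponse httpResponse = httpClient.execute(httpRequest);
	strResult = EntityUtils.toString(httpResponse.getEntity(), HTTP.UTF_8);
} catch (UnsupportedEncodingException e) {
	e.printStackTrace();
} catch (ParseException e) {
	e.printStackTrace();
} catch (IOException e) {
	e.printStackTrace();
}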
							
python

#!/usr/bin/python2
# -*- coding: UTF-8 -*-
import json
import urllib2


def testSegment():
    app_id = "xx"
    api_key = "xx"
    seg_mode = "search"
    url_seg = 'http://bigdata.hylanda.com/HylandaNlp/hlsegapi/wordseg'
    is_get_keywords = True
    testText = "欢迎使用海量分词"

    body_value = {"content": testText, "mode": seg_mode, "keyword": str(is_get_keywords).lower()}
    post_data = json.dumps(body_value)

    request = urllib2.Request(url_seg, post_data)
    request.add_header("Content-Type", "application/json")
    request.add_header("id", app_id)
    request.add_header("apikey", api_key)
    response = urllib2.urlopen(request).read().decode('utf-8')

    result = json.loads(response)
    # Print each token as word/part-of-speech, then each keyword as word:weight
    for x in result[u'items']:
        print ("%s/%s") % (x[u'word'], x[u'nature']),
    print ""
    for x in result[u'keywords']:
        print ("%s:%s") % (x[u'word'], x[u'weight']),


if __name__ == "__main__":
    testSegment()

Example

Request body
{
    "content": "欢迎使用海量中文分词"
}

Response
{
    "ret": "success",
    "keywords": [
        {
            "weight": 2.01193821378137, // keyword weight
            "word": "分词"
        },
        {
            "weight": 1.49973885715222,
            "word": "海量"
        },
        {
            "weight": 0.934162899062701,
            "word": "中文"
        },
        {
            "weight": 0.669546468690703,
            "word": "欢迎"
        },
        {
            "weight": 0.479062470189289,
            "word": "使用"
        }
    ],
    "uuid": "13785293628915636434",
    "items": [
        {
            "offset": 0,  // position of the token in the text
            "len": 2,
            "nature": "v", // part-of-speech tag of the token
            "word": "欢迎"
        },
        {
            "offset": 2,
            "len": 2,
            "nature": "v",
            "word": "使用"
        },
        {
            "offset": 4,
            "len": 2,
            "nature": "n",
            "word": "海量"
        },
        {
            "offset": 6,
            "len": 2,
            "nature": "nz",
            "word": "中文"
        },
        {
            "offset": 8,
            "len": 2,
            "nature": "n",
            "word": "分词"
        }
    ]
}
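
In this sample, each item's `offset` and `len` line up with character positions in `content`: ten characters, split into five two-character words. The source does not state the unit explicitly, so the sketch below assumes offsets count characters and checks that assumption against the sample data. `words_from_offsets` is an illustrative name.

```python
def words_from_offsets(content, items):
    """Cut each token out of content using its offset and len fields."""
    return [content[it["offset"]:it["offset"] + it["len"]] for it in items]


content = "欢迎使用海量中文分词"
items = [
    {"offset": 0, "len": 2, "nature": "v", "word": "欢迎"},
    {"offset": 2, "len": 2, "nature": "v", "word": "使用"},
    {"offset": 4, "len": 2, "nature": "n", "word": "海量"},
    {"offset": 6, "len": 2, "nature": "nz", "word": "中文"},
    {"offset": 8, "len": 2, "nature": "n", "word": "分词"},
]

recovered = words_from_offsets(content, items)
# Every slice should equal the "word" reported for that item.
assert recovered == [it["word"] for it in items]
```

If the assertion ever fails on real responses, the offsets are probably counted in some other unit (bytes, for instance) and the slicing would need to change accordingly.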

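Part-of-speech tag abbreviations

Tag	Description	Tag	Description	Tag	Description	Tag	Description
Ag	adjectival morpheme	A	adjective	Ad	adverbial adjective	An	nominal adjective
B	distinguishing word	c	conjunction	Dg	adverbial morpheme	d	adverb
e	interjection	f	locality word	g	morpheme	h	prefix
i	idiom	j	abbreviation	k	suffix	l	fixed expression
m	numeral	Ng	nominal morpheme	n	noun	nr	person name
ns	place name	nt	organization name	nz	other proper noun	o	onomatopoeia
p	preposition	q	measure word	r	pronoun	s	place word
Tg	temporal morpheme	t	time word	u	particle	Vg	verbal morpheme
v	verb	vd	adverbial verb	vn	nominal verb	w	punctuation
x	non-morpheme character	y	modal particle	z	status word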
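
Because proper names carry their own tags (nr, ns, nt, nz in the table above), recognized names can be pulled out of a response by filtering `items` on `nature`. A minimal sketch follows; `PROPER_NOUN_TAGS` and `proper_nouns` are illustrative names, not part of the API, and the comparison assumes the lowercase tags seen in the sample response.

```python
# Tags for person, place, organization, and other proper nouns.
PROPER_NOUN_TAGS = {"nr", "ns", "nt", "nz"}


def proper_nouns(items):
    """Return (word, tag) pairs for items tagged as proper names."""
    return [(it["word"], it["nature"]) for it in items
            if it["nature"] in PROPER_NOUN_TAGS]


sample_items = [
    {"offset": 4, "len": 2, "nature": "n", "word": "海量"},
    {"offset": 6, "len": 2, "nature": "nz", "word": "中文"},
]
names = proper_nouns(sample_items)
```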

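Loading a custom dictionary

The Hylanda segmentation API supports user-defined dictionaries. Uploading one helps the service recognize niche vocabulary and categories specific to your application. The dictionary must follow this format:

1. The custom dictionary is a plain-text file, UTF-8 encoded, with one word per line.

2. Each word has three columns: the word string, the word's attributes, and the IDF weighting level, separated by tabs. Only the word string is required; for any column left out, the system uses its default value.

3. "#" marks a comment, which is ignored at load time.

4. Attributes are separated by ASCII commas and can be part-of-speech tags, the stop-word flag, or custom attributes.

5. Part-of-speech tags follow the Peking University standard and are used as a reference during part-of-speech tagging; if omitted, the word defaults to a noun.

6. The stop-word flag is stopword; SegOption.outputStopWord controls whether stop words are output.

7. Custom attributes take no part in segmentation. If Token.userTag is non-empty in the segmentation result, the word's custom attributes can be read from it.

8. IDF weighting has five levels, from lowest to highest idf-lv1 through idf-lv5. The higher the level, the more weight the word carries in keyword calculation. If omitted, the system defaults to idf-lv3 (medium weight).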
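
The format rules above can be turned into a small generator for dictionary files. This is a sketch under stated assumptions: `format_entry` and `build_dictionary` are hypothetical helpers, and because the rules do not say how to skip the attribute column while still giving an IDF level, the sketch refuses that combination instead of guessing.

```python
IDF_LEVELS = tuple("idf-lv%d" % n for n in range(1, 6))  # idf-lv1 .. idf-lv5


def format_entry(word, attributes=(), idf_level=None):
    """Return one dictionary line: word, attributes, IDF level, tab-separated."""
    if not word or "\t" in word or "\n" in word:
        raise ValueError("word must be non-empty and free of tabs/newlines")
    if word.startswith("#"):
        raise ValueError("a leading '#' would turn the line into a comment")
    columns = [word]
    if attributes:
        columns.append(",".join(attributes))  # ASCII comma between attributes
    if idf_level is not None:
        if idf_level not in IDF_LEVELS:
            raise ValueError("idf_level must be one of %s" % ", ".join(IDF_LEVELS))
        if not attributes:
            # Skipping the middle column is not described in the rules.
            raise ValueError("give at least one attribute when setting idf_level")
        columns.append(idf_level)
    return "\t".join(columns)


def build_dictionary(entries):
    """Join entries (tuples of format_entry arguments) into file content."""
    return "\n".join(format_entry(*entry) for entry in entries) + "\n"


text = build_dictionary([
    ("苹果", ["n", "nt", "fruit", "corp"]),
    ("的", ["stopword"]),
    ("海量中文分词系统", ["nz"], "idf-lv5"),
])
```

Write `text` to disk with UTF-8 encoding, for example `open("usrdict.txt", "w", encoding="utf-8")`, before uploading it.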

Endpoint details

URL

http://bigdata.hylanda.com/HylandaNlp/hlsegapi/loadusrdict

HTTP Method

POST

HTTP Header

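Parameter	Type	Value
Content-Type	String	application/json
Accept	String	application/json
id	String	YOUR_ID (replace with your own id)
apikey	String	YOUR_API_KEY (replace with your own apikey)

HTTP Request Body

The request body is a file stream; the uploaded custom dictionary file must not exceed 2 MB.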
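
Checking the size before uploading avoids a round trip that is bound to fail. The sketch below reads the limit as 2 × 1024 × 1024 bytes, which is an assumption since the source does not define the unit precisely; `encode_dictionary` is an illustrative name.

```python
MAX_DICT_BYTES = 2 * 1024 * 1024  # upload limit, read here as 2 MiB (assumption)


def encode_dictionary(text):
    """Encode dictionary text as UTF-8 and enforce the upload size limit."""
    data = text.encode("utf-8")
    if len(data) > MAX_DICT_BYTES:
        raise ValueError(
            "dictionary is %d bytes, limit is %d" % (len(data), MAX_DICT_BYTES)
        )
    return data
```

Note that the check is on encoded bytes, not characters: most Chinese characters take three bytes in UTF-8.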

HTTP Response Format

The response is JSON.

Parameter	Type	Required	Value
ret	String	Yes	"success" on success, "error" on failure
errsmg	String	No	Error message returned on failure

Request samples

curl

curl -i -X POST 'http://bigdata.hylanda.com/HylandaNlp/hlsegapi/loadusrdict' \
 -H 'id: xx' \
 -H 'apikey: xx' \
 -F 'filename=@/full/path/to/usrdict.txt'

Replace /full/path/to/usrdict.txt with the full path of your custom dictionary file.
								
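java

// Requires Apache HttpClient 4.x.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import org.apache.http.HttpResponse;
import org.apache.http.ParseException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.InputStreamEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.HTTP;
import org.apache.http.util.EntityUtils;

// id and apikey are String variables holding your own credentials.
String url = "http://bigdata.hylanda.com/HylandaNlp/hlsegapi/loadusrdict";
HttpPost httpRequest = new HttpPost(url);
HttpClient httpClient = new DefaultHttpClient();
String strResult = null;

try {
	// Send the dictionary file as the raw request body
	File file = new File("usrdict.txt");
	InputStream input = new FileInputStream(file);
	httpRequest.addHeader("id", id);
	httpRequest.addHeader("apikey", apikey);
	httpRequest.setEntity(new InputStreamEntity(input, file.length()));
	HttpResponse httpResponse = httpClient.execute(httpRequest);
	strResult = EntityUtils.toString(httpResponse.getEntity(), HTTP.UTF_8);
} catch (UnsupportedEncodingException e) {
	e.printStackTrace();
} catch (ParseException e) {
	e.printStackTrace();
} catch (IOException e) {
	e.printStackTrace();
}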
							
python

#!/usr/bin/python2
# -*- coding: UTF-8 -*-
import json
import urllib2


def testLoadUserDict():
    app_id = "xx"
    api_key = "xx"
    url_load_usr_dict = 'http://bigdata.hylanda.com/HylandaNlp/hlsegapi/loadusrdict'
    # Dictionary content: one entry per line, columns separated by tabs
    post_data = "爱他美奶粉\tn\n海量中文分词系统\tnz\tidf-lv5"

    request = urllib2.Request(url_load_usr_dict, post_data)
    request.add_header("id", app_id)
    request.add_header("apikey", api_key)
    response = urllib2.urlopen(request).read().decode('utf-8')

    result = json.loads(response)
    return result[u'ret'] == u'success'

Example

Sample custom dictionary
…
投资部长合不拢嘴	l,d
#卧虎藏龙	n,movie
苹果	n,nt,fruit,corp
的	stopword
海量中文分词系统	nz	idf-lv5
…

Response
{
    "ret": "success"
}

Unloading the custom dictionary

Users can unload a custom dictionary they have uploaded.

Endpoint details

URL

http://bigdata.hylanda.com/HylandaNlp/hlsegapi/unloadusrdict

HTTP Method

POST

HTTP Header

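Parameter	Type	Value
Content-Type	String	application/json
Accept	String	application/json
id	String	YOUR_ID (replace with your own id)
apikey	String	YOUR_API_KEY (replace with your own apikey)

HTTP Response Format

The response is JSON.

Parameter	Type	Required	Value
ret	String	Yes	"success" on success, "error" on failure
errsmg	String	No	Error message returned on failure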

Request samples

curl

curl -i -X POST 'http://bigdata.hylanda.com/HylandaNlp/hlsegapi/unloadusrdict' \
 -H 'id: xx' \
 -H 'apikey: xx'
								
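java

// Requires Apache HttpClient 4.x.
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import org.apache.http.HttpResponse;
import org.apache.http.ParseException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.HTTP;
import org.apache.http.util.EntityUtils;

// id and apikey are String variables holding your own credentials.
String url = "http://bigdata.hylanda.com/HylandaNlp/hlsegapi/unloadusrdict";
HttpPost httpRequest = new HttpPost(url);
HttpClient httpClient = new DefaultHttpClient();
String strResult = null;
try {
	httpRequest.addHeader("id", id);
	httpRequest.addHeader("apikey", apikey);
	HttpResponse httpResponse = httpClient.execute(httpRequest);
	strResult = EntityUtils.toString(httpResponse.getEntity(), HTTP.UTF_8);
} catch (UnsupportedEncodingException e) {
	e.printStackTrace();
} catch (ParseException e) {
	e.printStackTrace();
} catch (IOException e) {
	e.printStackTrace();
}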
							
python

#!/usr/bin/python2
# -*- coding: UTF-8 -*-
import json
import urllib2


def testUnloadUsrDict():
    app_id = "xx"
    api_key = "xx"
    url_unload_usr_dict = 'http://bigdata.hylanda.com/HylandaNlp/hlsegapi/unloadusrdict'
    # An empty body makes urllib2 send a POST request
    post_data = ""

    request = urllib2.Request(url_unload_usr_dict, post_data)
    request.add_header("id", app_id)
    request.add_header("apikey", api_key)
    response = urllib2.urlopen(request).read().decode('utf-8')

    result = json.loads(response)
    return result[u'ret'] == u'success'

Example

Response
{
    "ret": "success"
}

Contact us

Phone

400-005-0958

Email

nlp@hylanda.com
