dmitry-kan
Linguistic Component: Tokenizer for the Russian language
Technical description
SemanticAnalyzer Group, 2013-08-29 www.semanticanalyzer.info
This document describes the technical details of a tokenizer for the Russian language. The component has two modes of operation:
Processing of generic texts: news, technical articles, etc.
Processing of Twitter messages
The demo package, sent upon request, contains the following:
The tokenizer as a binary Java library
The run_tokenizer.sh script for quickly checking the module's functionality
The messages_to_tokenize.txt file, containing examples of generic text and tweets to tokenize with the run_tokenizer.sh script
The algorithm is rule-based: a set of rules, implemented using Flex (JFlex), extracts individual tokens from a text stream.
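To illustrate what rule-based token extraction looks like, here is a minimal sketch that uses java.util.regex as a stand-in for JFlex. It is not the component's actual grammar: the class name RuleTokenizerSketch and all patterns are assumptions made for illustration, and only the token type names are taken from this document. Note also that regex alternation picks the first rule that matches, whereas JFlex prefers the longest match, so rule order matters more here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: ordered regex rules standing in for JFlex rules.
final class RuleTokenizerSketch {
    // {token type, pattern}. Patterns are illustrative guesses, not the real rules.
    // Inner groups are non-capturing so that rule i maps to capturing group i + 1.
    private static final String[][] RULES = {
        {"HYPERLINK", "(?:https?://|www\\.)\\S+"},
        {"TWITTER_HASHTAG", "#[\\p{L}\\p{N}_]+"},
        {"TWITTER_USERNAME", "@[\\p{L}\\p{N}_]+"},
        {"ALPHANUM", "[\\p{L}\\p{N}]+"},
        {"PUNCT", "\\p{Punct}"},
    };

    private static final Pattern PATTERN = buildPattern();

    // Joins all rules into one alternation, one capturing group per rule.
    private static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (String[] rule : RULES) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append('(').append(rule[1]).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    // Returns tokens formatted as "text, type: TYPE"; unmatched characters
    // (such as whitespace) are skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            for (int i = 0; i < RULES.length; i++) {
                if (m.group(i + 1) != null) {
                    tokens.add(m.group() + ", type: " + RULES[i][0]);
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("#ht. done! @dm www.test.com/x?y")) {
            System.out.println(token);
        }
    }
}
```

Unlike the component described here, this sketch has no emoticon handling and always emits punctuation tokens.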
Speed of processing
Server: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz
Operating system: Ubuntu 10.04, Java 1.7.0_21 64-bit server
Throughput: 38497 characters/ms, 5158 tokens/ms
Tests were conducted in a single thread.
Format of the messages_to_tokenize.txt file
This file describes input data for the tokenizer module for demo purposes.
Format: Text\tText type
Text: textual data in Russian for tokenization
\t: tab symbol
Text type: supported values are GENERAL_TEXT and TWITTER
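As an illustration of the input format, the following sketch parses one line of messages_to_tokenize.txt into its text and text type. The class and member names (InputLineSketch, Entry, TextType, parse) are hypothetical and not part of the tokenizer library. Splitting on the last tab, so that tabs inside the text itself are tolerated, is also an assumption.

```java
// Hypothetical sketch: parsing one "Text<TAB>Text type" line of the demo input file.
final class InputLineSketch {
    // The two supported text types named in the format description.
    enum TextType { GENERAL_TEXT, TWITTER }

    // Simple holder for one parsed input line.
    static final class Entry {
        final String text;
        final TextType type;

        Entry(String text, TextType type) {
            this.text = text;
            this.type = type;
        }
    }

    // Splits on the LAST tab (an assumption). Throws IllegalArgumentException
    // when the tab is missing or the text type is not a supported value.
    static Entry parse(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected Text<TAB>Text type, got: " + line);
        }
        String text = line.substring(0, tab);
        TextType type = TextType.valueOf(line.substring(tab + 1).trim());
        return new Entry(text, type);
    }
}
```

For example, InputLineSketch.parse(":)this is it!\tTWITTER") yields an entry with the text ":)this is it!" and the type TWITTER.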
Examples of tokenization
The run_tokenizer.sh script generates the file messages_to_tokenize.out. For the following input file messages_to_tokenize.txt (the text and the text type are separated by a tab):
:)this is it! #По_русски @dm ;-D www.test.com/x?y TWITTER
the following output is generated:
:)this is it! #По_русски @dm ;-D www.test.com/x?y TWITTER
emopostkn, type: ALPHANUM
this, type: ALPHANUM
is, type: ALPHANUM
it, type: ALPHANUM
!, type: PUNCT
#По_русски, type: TWITTER_HASHTAG
@dm, type: TWITTER_USERNAME
emopostkn, type: ALPHANUM
www.test.com/x?y, type: HYPERLINK
In this output, both emoticons, :) and ;-D, appear as the placeholder token emopostkn with the type ALPHANUM.
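To consume the generated file programmatically, each token line of the form "text, type: TYPE" can be split back into its two parts. The sketch below assumes that every token line follows exactly the form shown in the example above. The class name OutputLineSketch and its parse method are hypothetical helpers, not part of the library.

```java
// Hypothetical sketch: splitting a "text, type: TYPE" line from messages_to_tokenize.out.
final class OutputLineSketch {
    private static final String SEPARATOR = ", type: ";

    // Returns {text, type} for a token line, or null for any line that does not
    // contain the separator (for instance, the echoed input line).
    static String[] parse(String line) {
        // Use the last occurrence so that a token consisting of a comma still parses.
        int idx = line.lastIndexOf(SEPARATOR);
        if (idx < 0) {
            return null;
        }
        String text = line.substring(0, idx);
        String type = line.substring(idx + SEPARATOR.length());
        return new String[] { text, type };
    }
}
```

For example, OutputLineSketch.parse("@dm, type: TWITTER_USERNAME") returns the pair "@dm" and "TWITTER_USERNAME".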
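Example of using the library from Java code
import java.io.StringReader;
// Tokenizer, TwitterFlexTokenizer and Token are provided by the tokenizer library.

Tokenizer twitterTokenizer = new TwitterFlexTokenizer(new StringReader("#ht. done!"), true);
// The same Token instance is passed back in on every call.
Token reusableToken = Token.newReusableToken();
// getNextToken returns null once the input is exhausted.
while ((reusableToken = twitterTokenizer.getNextToken(reusableToken)) != null) {
    System.out.println(reusableToken);
}

Output:
Token[text=#ht,type=TWITTER_HASHTAG]
Token[text=done,type=ALPHANUM]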