

Linguistic Component: Tokenizer for the Russian language

Technical description

SemanticAnalyzer Group, 2013-08-29 www.semanticanalyzer.info

This document describes the technical details of a tokenizer for the Russian language. The component has two modes of operation:

Processing of generic texts: news, technical articles, etc.

Processing of Twitter messages

The demo package, sent upon request, contains the following:

The tokenizer's Java library in binary form

The run_tokenizer.sh script for quickly checking the functionality of the module

The messages_to_tokenize.txt file containing examples of generic text and tweets for tokenization using the run_tokenizer.sh script

The algorithm is based on a set of rules, implemented using Flex (JFlex), that extract individual tokens from a text stream.
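
To illustrate the idea of rule-based token extraction, here is a minimal sketch that uses java.util.regex as a stand-in for JFlex rules. It is not the component's actual grammar: the class name RuleTokenizerSketch, the patterns and the rule ordering are assumptions made for illustration, and emoticon handling is left out. Only the token type names are taken from the examples in this document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleTokenizerSketch {
    // Ordered rules: at a given position the first alternative that matches wins.
    // These patterns are illustrative assumptions, not the component's real JFlex rules.
    private static final Pattern RULES = Pattern.compile(
          "(?<HASHTAG>#[\\p{L}\\p{N}_]+)"
        + "|(?<USERNAME>@[\\p{L}\\p{N}_]+)"
        + "|(?<HYPERLINK>(?:https?://|www\\.)\\S+)"
        + "|(?<ALPHANUM>[\\p{L}\\p{N}]+)"
        + "|(?<PUNCT>\\p{Punct})");

    // Regex group names and the token type each one is reported as.
    private static final String[] GROUPS =
        {"HASHTAG", "USERNAME", "HYPERLINK", "ALPHANUM", "PUNCT"};
    private static final String[] TYPES =
        {"TWITTER_HASHTAG", "TWITTER_USERNAME", "HYPERLINK", "ALPHANUM", "PUNCT"};

    // Returns tokens formatted as "<text>, type: <TYPE>".
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RULES.matcher(text);
        while (m.find()) { // characters matched by no rule (e.g. whitespace) are skipped
            for (int i = 0; i < GROUPS.length; i++) {
                if (m.group(GROUPS[i]) != null) {
                    tokens.add(m.group() + ", type: " + TYPES[i]);
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("#По_русски @dm www.test.com/x?y it!")) {
            System.out.println(token);
        }
    }
}
```

One difference is worth keeping in mind: JFlex selects the longest match and uses rule order only to break ties, whereas a regex alternation takes the first alternative that matches. The sketch therefore lists the more specific rules (hashtag, username, hyperlink) before the generic ones.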

Speed of processing

Server: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz
Operating system: Ubuntu 10.04, Java 1.7.0_21 64-bit server
38497 characters/ms
5158 tokens/ms
Tests were conducted in a single thread.
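
For readers who want to take comparable measurements, the following is a rough sketch of how characters/ms and tokens/ms figures could be computed in a single thread. It does not reproduce the test harness used for the numbers above: the class ThroughputSketch, the whitespace-based countTokens stand-in and the synthetic input are all assumptions, and a real measurement would call the tokenizer library on a representative corpus.

```java
class ThroughputSketch {
    // Converts a processed count and an elapsed time in nanoseconds into a per-millisecond rate.
    static double perMillisecond(long count, long elapsedNanos) {
        return count * 1_000_000.0 / elapsedNanos;
    }

    // Hypothetical stand-in for the tokenizer: counts whitespace-separated chunks.
    static long countTokens(String text) {
        long tokens = 0;
        boolean inToken = false;
        for (int i = 0; i < text.length(); i++) {
            boolean whitespace = Character.isWhitespace(text.charAt(i));
            if (!whitespace && !inToken) {
                tokens++; // a new token starts at the first non-whitespace character
            }
            inToken = !whitespace;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Synthetic input; a real measurement would use a representative corpus.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("это тестовое сообщение ");
        }
        String text = sb.toString();

        countTokens(text); // warm-up pass so the JIT compiler can optimize the loop

        long start = System.nanoTime();
        long tokens = countTokens(text);
        long elapsed = System.nanoTime() - start;

        System.out.printf("%.0f characters/ms%n", perMillisecond(text.length(), elapsed));
        System.out.printf("%.0f tokens/ms%n", perMillisecond(tokens, elapsed));
    }
}
```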

Format of the messages_to_tokenize.txt file

This file describes input data for the tokenizer module for demo purposes. Format:

Text\tText type

Text contains textual data in Russian for tokenization
\t is the tab character
Text type: supported values are GENERAL_TEXT and TWITTER.
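
The format above can be read with a few lines of Java. The sketch below is only an illustration of the "Text\tText type" layout: the class InputLineSketch, its fields and the choice to split on the last tab are assumptions, not part of the demo package.

```java
class InputLineSketch {
    // The two text types named in the format description.
    enum TextType { GENERAL_TEXT, TWITTER }

    final String text;
    final TextType type;

    InputLineSketch(String text, TextType type) {
        this.text = text;
        this.type = type;
    }

    // Parses one "Text\tText type" line. Splitting on the last tab is an assumption
    // that lets the text itself contain tab characters.
    static InputLineSketch parse(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected Text<TAB>Text type: " + line);
        }
        String text = line.substring(0, tab);
        // valueOf throws IllegalArgumentException for unsupported type names.
        TextType type = TextType.valueOf(line.substring(tab + 1).trim());
        return new InputLineSketch(text, type);
    }

    public static void main(String[] args) {
        InputLineSketch parsed =
            parse(":)this is it! #По_русски @dm ;-D www.test.com/x?y\tTWITTER");
        System.out.println(parsed.type + " -> " + parsed.text);
    }
}
```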

Examples of tokenization

The run_tokenizer.sh script will generate the following file: messages_to_tokenize.out. For the following input file messages_to_tokenize.txt:

:)this is it! #По_русски @dm ;-D www.test.com/x?y TWITTER

This output gets generated:


:)this is it! #По_русски @dm ;-D www.test.com/x?y TWITTER
emopostkn, type: ALPHANUM
this, type: ALPHANUM
is, type: ALPHANUM
it, type: ALPHANUM
!, type: PUNCT
#По_русски, type: TWITTER_HASHTAG
@dm, type: TWITTER_USERNAME
emopostkn, type: ALPHANUM
www.test.com/x?y, type: HYPERLINK

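Example of using the library from Java code

// StringReader is part of the JDK; Tokenizer, TwitterFlexTokenizer and Token come from the tokenizer library.
import java.io.StringReader;

Tokenizer twitterTokenizer = new TwitterFlexTokenizer(new StringReader("#ht. done!"), true);

// A single Token instance is reused across calls; the loop ends when getNextToken returns null.
Token reusableToken = Token.newReusableToken();
while ((reusableToken = twitterTokenizer.getNextToken(reusableToken)) != null) {
    System.out.println(reusableToken);
}

Output:

Token[text=#ht,type=TWITTER_HASHTAG]
Token[text=done,type=ALPHANUM]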