Contents

Figures

  1. Ferret Example Screenshot

Tips

  1. Expand this list

Warnings

  1. The latest source code may be unstable



Chapter 1
Introduction

rmmseg-cpp is a high-performance Chinese word segmentation utility for Ruby. It features full Ferret integration and can also be used from ordinary Ruby programs.

rmmseg-cpp is a rewrite in C++ of the original RMMSeg gem, which is written in pure Ruby. Though I tried hard to tune RMMSeg, it still consumes a lot of memory and segments text rather slowly.

The interface is almost identical to that of RMMSeg, but the performance is much better, so this gem is the preferable choice for production use. However, if you want to understand how the MMSEG segmenting algorithm works, the source code of RMMSeg is a better place to look.

Chapter 2
Setup

2.1  Requirements

Your system needs the following software to run rmmseg-cpp.

Software   Notes
Ruby       Version 1.8.x is required
RubyGems   rmmseg-cpp is released as a gem
g++        Used to build the native extension

2.2  Installation

2.2.1  Using RubyGems

To install the gem remotely from RubyForge:

sudo gem install rmmseg-cpp

Or you can download the gem file manually from RubyForge and install it locally:

sudo gem install --local rmmseg-cpp-x.y.z.gem

2.2.2  From Git

To build the gem manually from the latest source code, you’ll need to have git and rake installed.

Warning 1.  The latest source code may be unstable

While I try to avoid such problems, the source code in the repository might still be broken at times. Following the latest source code is generally not recommended.

The source code of rmmseg-cpp is hosted at GitHub. You can get it with git clone:

git clone git://github.com/pluskid/rmmseg-cpp.git

then you can use Rake to build and install the gem:

cd rmmseg-cpp
rake gem:install

Chapter 3
Usage

3.1  Stand Alone rmmseg

rmmseg-cpp comes with a script named rmmseg. To see its basic usage, run it with the -h option:

rmmseg -h

It reads from STDIN and prints the result to STDOUT. Here is a real example:

$ echo "我们都喜欢用 Ruby" | rmmseg
我们 都 喜欢 用 Ruby

3.2  Use in Ruby program

3.2.1  Initialize

To use rmmseg-cpp in a Ruby program, first load it with RubyGems:

require 'rubygems'
require 'rmmseg'

Then you may customize the dictionaries used by rmmseg-cpp (see the rdoc on how to add your own dictionaries) and load all dictionaries:

RMMSeg::Dictionary.load_dictionaries

Now rmmseg-cpp is ready to segment text. If you want to load your own dictionaries, set RMMSeg::Dictionary.dictionaries before calling load_dictionaries. For example:

RMMSeg::Dictionary.dictionaries = [[:chars, "my_chars.dic"],
                                   [:words, "my_words.dic"],
                                   [:words, "my_words2.dic"]]

The char-dictionary and the word-dictionary share the same basic format: each line holds a number, then a space, then the string. Note that the dictionary file SHOULD end with a newline. The number has a different meaning in each kind of dictionary.

In a char-dictionary, the number is the frequency of the character. In a word-dictionary, the number is the number of characters in the word. Note that this is NOT the number of bytes in the word.
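The word-dictionary format described above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the helper names word_dic_line and word_dic_body are hypothetical and are not part of rmmseg-cpp, and the sample words are arbitrary. Characters are counted with unpack("U*") so that the count is in characters rather than bytes.

```ruby
# -*- coding: utf-8 -*-

# Format one word-dictionary line: "<number of characters> <word>".
# (Hypothetical helper, not part of rmmseg-cpp.)
def word_dic_line(word)
  # unpack("U*") decodes the UTF-8 string into code points, so the size
  # is the number of characters, NOT the number of bytes.
  char_count = word.unpack("U*").size
  "#{char_count} #{word}"
end

# Build the body of a word-dictionary file, one entry per line,
# ending with the newline that the format calls for.
def word_dic_body(words)
  words.map { |w| word_dic_line(w) }.join("\n") + "\n"
end

# Print a small sample dictionary; redirect the output to a file
# such as my_words.dic to use it.
print word_dic_body(["我们", "喜欢", "程序员"])
```

Each of the first two sample words is two characters long (six bytes in UTF-8), so its line starts with 2; the third is three characters long, so its line starts with 3.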

3.2.2  Ferret Integration

To use rmmseg-cpp with Ferret, you’ll of course need to have Ferret installed. If you have problems running the example below, please try updating to the latest versions of both Ferret and rmmseg-cpp first. Then require the Ferret support of rmmseg-cpp:

require 'rmmseg/ferret'

rmmseg-cpp comes with a ready-to-use Ferret analyzer:

analyzer = RMMSeg::Ferret::Analyzer.new { |tokenizer|
  Ferret::Analysis::LowerCaseFilter.new(tokenizer)
}
index = Ferret::Index::Index.new(:analyzer => analyzer)

A complete example can be found in misc/ferret_example.rb. The result of running that example is shown in Figure 1. Ferret Example Screenshot.

3.2.3  Normal Ruby program

rmmseg-cpp can also be used in normal Ruby programs. Just create an Algorithm object and call next_token until a nil is returned:

algor = RMMSeg::Algorithm.new(text)
loop do
  tok = algor.next_token
  break if tok.nil?
  puts "#{tok.text} [#{tok.start}..#{tok.end}]"
end
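The same loop can be wrapped in a small helper when you only need the token texts. This is a minimal sketch: token_texts is a hypothetical name, not part of rmmseg-cpp, and it assumes only what is shown above, namely that next_token returns a token responding to text, or nil when the input is exhausted.

```ruby
# Collect the text of every token produced by a segmenter object.
# `algor` is expected to behave like RMMSeg::Algorithm: next_token returns
# a token responding to #text, or nil once there are no more tokens.
# (Hypothetical helper, not part of rmmseg-cpp.)
def token_texts(algor)
  texts = []
  # The assignment yields nil at the end of input, which ends the loop.
  while (tok = algor.next_token)
    texts << tok.text
  end
  texts
end
```

With the gem installed and the dictionaries loaded, calling token_texts(RMMSeg::Algorithm.new(text)).join(" ") should give a space-separated string of the segmented words.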

Chapter 4
Who uses it

Tip 1.  Expand this list

If you use rmmseg-cpp and would like your project to appear in this list, please contact me.

  • JavaEye: One of the biggest software developer communities in China.

Chapter 5
Resources