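Working Through 言語処理100本ノック2015 (100 NLP Knocks 2015) in Ruby [Chapter 6]

The code is up on GitHub. Long outputs that are omitted from this post are also in the output/ directory.

This time it's Chapter 6, "Processing English Text". We've finally reached the halfway point!

Perform the following processing on the English text (nlp.txt).

50. Sentence segmentation

Treat the pattern (. or ; or : or ? or !) → whitespace character → uppercase English letter as a sentence boundary, and output the input document in one-sentence-per-line format.

Answer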

# 50.rb

require 'active_record'

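File.open('nlp.txt') do |lines|
  lines.each do |line|
    # Sentences: uppercase start, lazily up to end punctuation followed by " X" or a newline
    sentences = line.scan(/[A-Z].+?[.;:?!](?=(?:\s[A-Z])|\n)/)
    if sentences.present?
      sentences.each do |sentence|
        puts sentence.rstrip
      end
    else
      # No sentence matched (e.g. a heading): output the whole line
      line.scan(/^(.+?)\n$/) do |chars|
        puts chars
      end
    end
  end
end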

A regular expression that follows the rule in the problem statement cannot pick up headings such as Natural language processing or History, so those are picked up separately (/^(.+?)\n$/).
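
To see the two regular expressions at work on small inputs, here is a minimal sketch. It assumes plain Ruby (empty? instead of ActiveSupport's present?), and split_sentences is a hypothetical helper name that is not part of 50.rb.

```ruby
# split_sentences is a hypothetical helper used only in this sketch.
def split_sentences(line)
  # Same boundary rule as 50.rb: uppercase start, end punctuation,
  # then either " <uppercase>" or the end of the line.
  sentences = line.scan(/[A-Z].+?[.;:?!](?=(?:\s[A-Z])|\n)/)
  # Lines without end punctuation (headings) fall back to the whole line.
  sentences.empty? ? line.scan(/^(.+?)\n$/).flatten : sentences
end

p split_sentences("It works. Does it? Yes: Always!\n")
# => ["It works.", "Does it?", "Yes:", "Always!"]
p split_sentences("History\n")
# => ["History"]
```

The lookahead (?=...) checks what follows the punctuation without consuming it, so the next sentence can still start from its uppercase letter.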

Output (partially omitted)

Natural language processing
From Wikipedia, the free encyclopedia
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
As such, NLP is related to the area of human-computer interaction.
Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.
History
The history of NLP generally starts in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.
The authors claimed that within three or five years, machine translation would be a solved problem.
...

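51. Word segmentation

Treating whitespace as the word delimiter, take the output of 50 as input and output it in one-word-per-line format. Output a blank line at the end of each sentence.

Answer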

require './util'

File.open('50.txt') do |sentences|
  sentences.each do |sentence|
    puts sentence.split
    puts
  end
end

The bare puts at the end of each sentence outputs the blank line.
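
As a small self-contained sketch of the same loop, the version below reads from a StringIO instead of 50.txt. The two input lines are made up for illustration.

```ruby
require 'stringio'

# Two sample sentences standing in for 50.txt (made up for this sketch).
input = StringIO.new("From Wikipedia, the free encyclopedia\nHistory\n")
out = StringIO.new

input.each_line do |sentence|
  out.puts sentence.split # split with no argument splits on runs of whitespace
  out.puts                # blank line marks the end of the sentence
end

print out.string
```

puts given an array writes each element on its own line, which is why no explicit loop over the words is needed.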

Output (partially omitted)

Natural
language
processing

From
Wikipedia,
the
free
encyclopedia

Natural
...

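52. Stemming

Take the output of 51 as input, apply Porter's stemming algorithm, and output each word and its stem in tab-separated format. In Python, the stemming module is a good choice as an implementation of Porter's stemming algorithm.

Answer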

require 'active_record'
require 'lingua/stemmer'

stemmer = Lingua::Stemmer.new(language: 'en')

File.open('51.txt') do |words|
  words.each do |word|
    target = word.chomp
    if target.present?
      puts "#{target}\t#{stemmer.stem(target)}"
    else
      puts
    end
  end
end

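I used a gem called ruby-stemmer. If I remember correctly, I compared my results with someone who did this in Python with stemming, and they matched. On a blank line word.chomp becomes an empty string, so in that case I output a blank line.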


Output (partially omitted)

Natural  Natur
language    languag
processing  process

From    From
Wikipedia,  Wikipedia,
the the
free    free
encyclopedia    encyclopedia

Natural Natur
...

53. Tokenization

Use Stanford Core NLP to obtain the analysis results of the input text in XML format. Also, read this XML file and output the input text in one-word-per-line format.

Answer

./corenlp.sh -annotators tokenize,ssplit,pos,lemma,parse,ner,dcoref --file nlp.txt

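Looking at the later problems, it seems the results for words, lemmas, parts of speech, mentions, representative mentions, and dependency parsing are needed. dcoref is what produces the mentions, but according to the dependency table, other annotators are required as well.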

Annotator dependencies | Stanford CoreNLP

# util.rb

require 'rexml/document'

def xml_elements(filename: 'nlp.txt.xml')
  doc = REXML::Document.new(File.new(filename))
  doc.elements
end

def sentence_tokens
  xml_elements.each('root/document/sentences/sentence') do |sentence|
    sentence.elements.each('tokens/token') do |token|
      block_given? ? yield(sentence, token) : token
    end
  end
end

class REXML::Element
  alias_method :to_text, :text
  remove_method :text

  %w(type idx).each do |attribute_name|
    define_method attribute_name do
      self.attributes[attribute_name]
    end
  end

  %w(id).each do |attribute_name|
    define_method attribute_name do
      self.attributes[attribute_name].to_i
    end
  end

  %w(sentence start end).each do |element_name|
    define_method element_name do
      self.elements[element_name].to_text.to_i
    end
  end

  %w(word lemma POS NER text governor dependent).each do |element_name|
    define_method element_name do
      self.elements[element_name].to_text
    end
  end
end

I decided to read the XML with REXML. Opening a sentence and looking at its tokens one by one comes up often, so I bundled that into sentence_tokens.
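
For reference, here is a minimal sketch of the same traversal without the helper methods, run against an inline XML string. The fragment is made up and only imitates the shape of the paths used in util.rb; the real CoreNLP output contains much more.

```ruby
require 'rexml/document'

# A made-up fragment shaped like the paths used in util.rb.
xml = <<~XML
  <root><document><sentences>
    <sentence id="1"><tokens>
      <token id="1"><word>Natural</word><lemma>natural</lemma><POS>JJ</POS></token>
      <token id="2"><word>language</word><lemma>language</lemma><POS>NN</POS></token>
    </tokens></sentence>
  </sentences></document></root>
XML

doc = REXML::Document.new(xml)
rows = []
doc.elements.each('root/document/sentences/sentence') do |sentence|
  sentence.elements.each('tokens/token') do |token|
    # attributes[...] returns a String, so ids are converted with to_i
    rows << [sentence.attributes['id'].to_i,
             token.attributes['id'].to_i,
             token.elements['word'].text]
  end
end

p rows
# => [[1, 1, "Natural"], [1, 2, "language"]]
```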

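Also, attributes and elements such as id and word are referenced often but are verbose to write, so I defined methods for them. There is an element called text, and a method for it would clash with REXML::Element#text, so I created an alias method called to_text and removed the original text.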
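
The alias-then-redefine trick can be tried in isolation. The sketch below uses a toy Node class (hypothetical, standing in for REXML::Element) so that it runs without any XML.

```ruby
# Toy stand-in for an XML element; Node and its fields are hypothetical.
class Node
  attr_reader :children

  def initialize(text, children = {})
    @text = text
    @children = children
  end

  # Plays the role of the library's original #text (the node's own text).
  def text
    @text
  end
end

class Node
  alias_method :to_text, :text # keep the original behaviour under a new name
  remove_method :text          # drop the old name so it can be reused

  # #text now means "text of the child element named text".
  def text
    children.fetch('text').to_text
  end
end

mention = Node.new(nil, 'text' => Node.new('the rules'))
puts mention.text                     # prints "the rules"
puts mention.children['text'].to_text # prints "the rules"
```

The catch with this pattern is that any other code relying on the original text now gets the new behaviour, so it is best kept to small scripts like these.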

# 53.rb

require './util'

sentence_tokens do |_, token|
  puts token.word
end

Output (partially omitted)

Natural
language
processing
From
Wikipedia
,
the
free
encyclopedia
Natural
...

54. Part-of-speech tagging

Read the XML analysis results of Stanford Core NLP and output the word, lemma, and part of speech in tab-separated format.

Answer

require './util'

sentence_tokens do |_, token|
  puts [
    token.word,
    token.lemma,
    token.POS
  ].join("\t")
end

Output (partially omitted)

Natural  natural JJ
language    language    NN
processing  processing  NN
From    from    IN
Wikipedia   Wikipedia   NNP
,   ,   ,
the the DT
free    free    JJ
encyclopedia    encyclopedia    NN
Natural natural JJ
...

55. Named entity extraction

Extract all person names in the input text.

Answer

require './util'

sentence_tokens do |_, token|
  puts token.word if token.NER == 'PERSON'
end

NER stands for Named Entity Recognition.

Output

Alan
Turing
Joseph
Weizenbaum
MARGIE
Schank
Wilensky
Meehan
Lehnert
Carbonell
Lehnert
Racter
Jabberwacky
Moore

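56. Coreference resolution

Based on the results of Stanford Core NLP's coreference resolution, replace the mentions in the text with their representative mentions. When replacing, take care that the original mention remains identifiable, as in "representative mention (mention)".

Answer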

# util.rb

...

class Mention
  attr_accessor :start, :endd, :text, :representative_text

  def initialize(start, endd, text, representative_text)
    @start = start
    @endd = endd
    @text = text
    @representative_text = representative_text
  end
end

For each mention, keep its start position, end position, mention text, and representative mention text.

# 56.rb

require './util'

# Mentions grouped by sentence
sentence_mentions = []

xml_elements.each('root/document/coreference/coreference') do |coreference|
  representative = coreference.elements['mention[@representative="true"]']
  representative_text = representative.text

  coreference.elements.each('mention[not(@representative)]') do |mention|
    sentence = mention.sentence
    start = mention.start
    endd = mention.end - 1
    text = mention.text
    sentence_mentions[sentence] ||= []
    sentence_mentions[sentence] << Mention.new(start, endd, text, representative_text)
  end
end

endds = []
outputs = []

sentence_tokens do |sentence, token|
  mentions = sentence_mentions[sentence.id]
               &.select { |mention| mention.start == token.id }
               &.sort_by { |mention| -mention.endd }

  output = ''

  mentions&.each do |mention|
    endds << mention.endd
    output += "[#{mention.representative_text}("
  end

  output += token.word

  endds.count(token.id).times { output += ')]' }
  endds.delete(token.id)

  outputs << output
end

puts outputs.join(' ')

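end is a reserved word, so the variable is named endd.

Replaced mentions can be nested, so I chose to replace them in the form [representative mention(mention)]. To close the right number of brackets when nested mentions share the same end position, end positions are appended to endds; once a token has been output, as many closing brackets are output as there are occurrences of the current token's id in endds.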
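
The bracket bookkeeping is easier to follow on toy data. In this sketch the tokens and mentions are made up (two mentions that both end on token 3), and a Struct replaces the Mention class.

```ruby
Mention = Struct.new(:start, :endd, :representative_text)

# Hypothetical data: token ids are 1-based, and both mentions end on token 3.
tokens = { 1 => 'a', 2 => 'b', 3 => 'c', 4 => 'd' }
mentions = [Mention.new(1, 3, 'R1'), Mention.new(2, 3, 'R2')]

endds = []
outputs = tokens.map do |id, word|
  output = ''
  # Open a bracket for each mention starting here, outermost (latest end) first.
  mentions.select { |m| m.start == id }.sort_by { |m| -m.endd }.each do |m|
    endds << m.endd
    output += "[#{m.representative_text}("
  end
  output += word
  # Close one bracket per pending mention that ends on this token.
  endds.count(id).times { output += ')]' }
  endds.delete(id)
  output
end

result = outputs.join(' ')
puts result
# prints: [R1(a [R2(b c)])] d
```

Because endds holds one entry per open mention, count(id) on token 3 returns 2 and both brackets are closed there.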

Output (partially omitted, with line breaks added for display)

...
However , [the systems(systems based on [hand-written rules(hand-written rules)])] can only be 
made more accurate by increasing the complexity of [the rules([the rules(the rules)] , which is a 
much more difficult task)] . In particular , there is a limit to the complexity of systems based on 
hand-crafted rules , beyond which the systems become more and more unmanageable . However , 
creating more data to input to [Systems based on machine-learning algorithms(machine-learning systems)] 
...

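57. Dependency parsing

Visualize the dependency parse results (collapsed-dependencies) of Stanford Core NLP as a directed graph. For visualization, it is a good idea to convert the dependency tree to the DOT language and use Graphviz. Also, to visualize a directed graph directly from Python, pydot is a good choice.

Answer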

require 'gviz'
require './util'

# First sentence
sentence = xml_elements['root/document/sentences/sentence']

Graph do
  sentence.elements.each('dependencies[@type="collapsed-dependencies"]/dep') do |dep|
    unless dep.type == 'punct'
      governor_id = dep.elements['governor'].idx.to_id
      dependent_id = dep.elements['dependent'].idx.to_id
      route governor_id => dependent_id
      node governor_id, label: dep.governor
      node dependent_id, label: dep.dependent
    end
  end

  save('57', :png)
end

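This is the same as 44. Visualizing dependency trees from two posts ago. It goes through each dep of the dependencies whose type is collapsed-dependencies in order and draws the graph in the form governor => dependent.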
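
For readers without the gviz gem, the same governor => dependent idea can be sketched by writing DOT text by hand. This is not what gviz does internally; the dependency triples below are made up, and the graph name sentence is arbitrary.

```ruby
# Hypothetical dependency rows:
# [type, governor idx, governor word, dependent idx, dependent word]
deps = [
  ['amod',  3, 'processing', 1, 'Natural'],
  ['nn',    3, 'processing', 2, 'language'],
  ['punct', 3, 'processing', 4, '.']
]

lines = ['digraph sentence {']
deps.each do |type, g_idx, g_word, d_idx, d_word|
  next if type == 'punct' # skip punctuation, as in 57.rb

  lines << %(  #{g_idx} [label="#{g_word}"];)
  lines << %(  #{d_idx} [label="#{d_word}"];)
  lines << "  #{g_idx} -> #{d_idx};"
end
lines << '}'

# uniq drops the repeated declaration of the governor node.
dot = lines.uniq.join("\n")
puts dot
```

Saving the printed text to a file such as 57.dot and running dot -Tpng 57.dot -o 57.png with Graphviz should render it.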


Output

[Figure: dependency graph of the first sentence, saved as 57.png]

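58. Tuple extraction

Based on the dependency parse results (collapsed-dependencies) of Stanford Core NLP, output "subject predicate object" triples in tab-separated format. Use the following definitions of subject, predicate, and object.

Predicate: a word that has children (dependents) in both an nsubj relation and a dobj relation

Subject: the child (dependent) in an nsubj relation from the predicate

Object: the child (dependent) in a dobj relation from the predicate

Answer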

# util.rb

...
class Dependant
  attr_accessor :governor, :dependent, :governor_idx, :dependent_idx

  def initialize(governor, dependent, governor_idx, dependent_idx)
    @governor = governor
    @dependent = dependent
    @governor_idx = governor_idx
    @dependent_idx = dependent_idx
  end
end
...

Children in an nsubj or dobj relation are held as instances of the Dependant class.

# 58.rb

require './util'

xml_elements.each('root/document/sentences/sentence') do |sentence|
  nsubj_list = []
  dobj_list = []

  sentence.elements.each('dependencies[@type="collapsed-dependencies"]/dep') do |dep|
    type = dep.type

    if type == 'nsubj' || type == 'dobj'
      governor = dep.governor
      dependent = dep.dependent
      governor_idx = dep.elements['governor'].idx.to_i
      dependent_idx = dep.elements['dependent'].idx.to_i
      eval("#{type}_list") << Dependant.new(governor, dependent, governor_idx, dependent_idx)
    end
  end

  governor_idx = nsubj_list.map(&:governor_idx) & dobj_list.map(&:governor_idx)
  governor_idx.each do |idx|
    nsubj = nsubj_list.select { |dep| dep.governor_idx == idx }
    dobj = dobj_list.select { |dep| dep.governor_idx == idx }

    subjects   = nsubj.map(&:dependent).uniq # subjects
    predicates = nsubj.map(&:governor).uniq  # predicates
    objects    = dobj.map(&:dependent).uniq  # objects
    puts subjects.product(predicates, objects).map { |triple| triple.join("\t") }
  end
end

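First, build nsubj_list and dobj_list. Apparently nsubj means a nominative noun phrase that depends on the predicate, and dobj means an accusative noun phrase that depends on the predicate (from the draft proposal for Japanese Universal Dependencies [PDF]). The deps whose governor_idx appears in both nsubj_list and dobj_list give the subject, predicate, and object.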
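
The key step is the intersection of governor indices. Here is a minimal sketch on made-up data, using a Struct named Dep (hypothetical) in place of the Dependant class and the XML.

```ruby
Dep = Struct.new(:type, :governor, :governor_idx, :dependent, :dependent_idx)

# Hypothetical deps for one sentence; only "published" has both relations.
deps = [
  Dep.new('nsubj', 'published', 4, 'Turing', 3),
  Dep.new('dobj',  'published', 4, 'article', 6),
  Dep.new('nsubj', 'proposed', 12, 'which', 11)
]

nsubj_list = deps.select { |d| d.type == 'nsubj' }
dobj_list  = deps.select { |d| d.type == 'dobj' }

# Array#& keeps the governor indices present in both lists.
shared = nsubj_list.map(&:governor_idx) & dobj_list.map(&:governor_idx)

tuples = shared.flat_map do |idx|
  subjects = nsubj_list.select { |d| d.governor_idx == idx }
  objects  = dobj_list.select { |d| d.governor_idx == idx }
  subjects.product(objects).map { |s, o| [s.dependent, s.governor, o.dependent] }
end

tuples.each { |t| puts t.join("\t") }
```

Comparing indices rather than words matters when the same verb appears twice in one sentence.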

Output

understanding    enabling    computers
others  involve generation
Turing  published   article
experiment  involved    translation
ELIZA   provided    interaction
patient exceeded    base
ELIZA   provide response
which   structured  information
underpinnings   discouraged sort
that    underlies   approach
Some    produced    systems
which   make    decisions
systems rely    which
that    contains    errors
implementations involved    coding
algorithms  take    set
Some    produced    systems
which   make    decisions
models  have    advantage
they    express certainty
Systems have    advantages
Automatic   make    use
that    make    decisions

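59. Parsing S-expressions

Read the phrase structure analysis results (S-expressions) of Stanford Core NLP and display all noun phrases (NP) in the text. Display all nested noun phrases as well.

Answer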

# util.rb

...
class Object
  def is_pair?
    self.is_a?(Array) && self.first.is_a?(String) && self.last.is_a?(String)
  end
end

An array of the form [String, String] will be called a pair. It is one unit of an S-expression and the smallest phrase.

# 59.rb

require './util'

def parse(s_expr)
  read(tokenize(s_expr))
end

def tokenize(s_expr)
  s_expr.gsub(/[()]/, ' \0 ').split
end

def read(tokens)
  token = tokens.shift
  if token == '('
    l = []
    l << read(tokens) until tokens[0] == ')'
    tokens.shift
    l
  else
    token
  end
end

def output_words(expr)
  if expr.is_pair?
    expr.last
  else
    expr[1..-1].map { |e| output_words(e) }.join(' ')
  end
end

def evaluate(expr, pos: 'NP')
  if expr.first == pos
    puts output_words(expr)
  end
  expr[1..-1].each { |e| evaluate(e, pos: pos) } unless expr.is_pair?
end

xml_elements.each('root/document/sentences/sentence/parse') do |parse|
  parse.to_text
    .yield_self { |text| parse(text) }
    .yield_self { |text| evaluate(text) }
end

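In tokenize, brackets appear consecutively like (((, so spaces are inserted around them to make split easy. In read, tokens are examined one at a time; everything from ( to ) is collected into a variable l, recursing along the way.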
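
Running the two steps on a tiny input shows what each one produces. The S-expression below is made up, and the local variable is called list instead of l for readability; otherwise the logic follows 59.rb.

```ruby
def tokenize(s_expr)
  # Pad every bracket with spaces so split yields brackets as separate tokens.
  s_expr.gsub(/[()]/, ' \0 ').split
end

def read(tokens)
  token = tokens.shift
  return token unless token == '('

  list = []
  list << read(tokens) until tokens[0] == ')'
  tokens.shift # discard the closing bracket
  list
end

s_expr = '(NP (DT a) (NN field))' # made-up input
p tokenize(s_expr)
# => ["(", "NP", "(", "DT", "a", ")", "(", "NN", "field", ")", ")"]
p read(tokenize(s_expr))
# => ["NP", ["DT", "a"], ["NN", "field"]]
```

Note that read consumes the token array as it goes, which is what lets each recursive call pick up where the previous one stopped.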

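A pair has the form [POS tag, word], and when nested it looks like [POS tag 1, [POS tag 2, word]]. evaluate checks whether the outermost tag (POS tag 1) is 'NP', and if so output_words extracts the word parts in order, working inward.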
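
The nested-NP behaviour can be checked on an already parsed tree. This sketch returns the phrases as an array instead of printing them, and uses standalone functions (pair?, words, phrases are hypothetical names) rather than the Object#is_pair? patch. The tree is made up.

```ruby
# A pair is [tag, word]: exactly two strings.
def pair?(expr)
  expr.is_a?(Array) && expr.size == 2 && expr.all? { |e| e.is_a?(String) }
end

# Join the words under a node, left to right.
def words(expr)
  return expr.last if pair?(expr)

  expr[1..-1].map { |child| words(child) }.join(' ')
end

# Collect the text of every node labelled `label`, outer phrases first.
def phrases(expr, label: 'NP')
  found = []
  found << words(expr) if expr.first == label
  expr[1..-1].each { |child| found.concat(phrases(child, label: label)) } unless pair?(expr)
  found
end

tree = ['NP',
        ['NP', ['DT', 'a'], ['NN', 'field']],
        ['PP', ['IN', 'of'],
         ['NP', ['NN', 'computer'], ['NN', 'science']]]]

p phrases(tree)
# => ["a field of computer science", "a field", "computer science"]
```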

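yield_self is a new method introduced in Ruby 2.5, and I used it because I had wanted to try it. Without yield_self it would be written as follows, but I think using it gives more of a "connected, single pipeline" feel.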

parsed = parse(parse.to_text)
evaluate(parsed)
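
To compare the two styles side by side without the XML, here is a small sketch. The lambdas strip_brackets and count_words are made-up stand-ins for parse and evaluate.

```ruby
# Made-up helper steps standing in for parse and evaluate.
strip_brackets = ->(text) { text.delete('()') }
count_words    = ->(text) { text.split.size }

s_expr = '(NP (DT a) (NN field))'

# Nested calls read inside-out.
nested = count_words.call(strip_brackets.call(s_expr))

# yield_self passes the receiver to the block and returns the block's result,
# so the steps read top to bottom.
chained = s_expr
          .yield_self { |text| strip_brackets.call(text) }
          .yield_self { |text| count_words.call(text) }

p nested, chained
```

Both variables hold the same value; only the reading order of the code differs.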

Output (partially omitted)

Natural language processing
Wikipedia
the free encyclopedia Natural language processing -LRB- NLP -RRB-
the free encyclopedia Natural language processing
NLP
a field of computer science , artificial intelligence , and linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages
a field of computer science
a field
computer science
artificial intelligence
...

If you know a better way to write any of this, I'd be happy to hear about it in an issue or the like!


ようじょのおえかきちょう
