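Tech Tips: Trying Out Spark's CountVectorizer

Monday, June 6, 2016

Trying Out Spark's CountVectorizer

When you need a bag-of-words (BoW) matrix as a preprocessing step for NLP, CountVectorizer is a convenient way to build one.

Sample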

package com.kenjih

import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.ml.feature.{ CountVectorizer, CountVectorizerModel }
import org.apache.spark.sql.SQLContext

object CountVectorizerSample {

  def run(sc: SparkContext): Unit = {
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

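    // Each input line is "<id>\t<space-separated words>".
    val path = "data/sample1.txt"
    val rdd = sc.textFile(path).map { line =>
      val ws = line.split("\t")
      val textId = ws(0)
      val words = ws(1).split(" ")
      (textId, words)
    }
    // The "text" column is an Array[String], which is what CountVectorizer expects.
    val df = rdd.toDF("id", "text")

    // Fit the vocabulary, keeping only words that appear in at least 2 documents.
    val cvm: CountVectorizerModel = new CountVectorizer()
      .setInputCol("text")
      .setOutputCol("features")
      .setMinDF(2)
      .fit(df)

    // Print (index, word) pairs, then the count vector of each document.
    cvm.vocabulary.zipWithIndex.map(_.swap).foreach(println)
    cvm.transform(df).select("features").show()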
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    run(sc)
  }
}

Output

kenjih$ cat data/sample1.txt 
0 hello hello hello
1 hello world
2 goodbye world
3 I love you
4 you love me% 
kenjih$ 
kenjih$ spark-submit --master local --class com.kenjih.CountVectorizerSample target/scala-2.10/spark-sample.jar
(0,hello)
(1,you)
(2,love)
(3,world)
+-------------------+
|           features|
+-------------------+
|      (4,[0],[3.0])|
|(4,[0,3],[1.0,1.0])|
|      (4,[3],[1.0])|
|(4,[1,2],[1.0,1.0])|
|(4,[1,2],[1.0,1.0])|
+-------------------+
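
Each row of the features column above is printed as a sparse vector in the form (size, [indices], [values]). As a small aid for reading it, the following plain-Scala sketch maps one row back to words by hand. It does not use Spark; the object name SparseRowSketch and the helper describe are made up for illustration, and the vocabulary array is simply copied from the (index, word) pairs printed in this run.

```scala
// Sketch: reading one printed row, (4,[0,3],[1.0,1.0]), by hand.
// SparseRowSketch and describe are hypothetical names, not Spark APIs.
object SparseRowSketch {
  // Vocabulary exactly as printed by the sample run: array position = index.
  val vocabulary: Array[String] = Array("hello", "you", "love", "world")

  // Pair each index with its value and look the index up in the vocabulary.
  def describe(indices: Array[Int], values: Array[Double]): List[(String, Double)] =
    indices.zip(values).map { case (i, v) => (vocabulary(i), v) }.toList

  def main(args: Array[String]): Unit = {
    // Second row of the output, which came from the document "hello world".
    println(describe(Array(0, 3), Array(1.0, 1.0)))
  }
}
```

Read this way, the first row (4,[0],[3.0]) says that index 0 (hello) occurs three times, which matches the document "hello hello hello".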

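Explanation

The column that CountVectorizer is applied to must be an Array[String].
setMinDF sets the minimum document frequency: a word goes into the vocabulary only if it appears in at least that many documents. In the sample, with setMinDF(2), goodbye, I, and me each appear in only one document, so they are dropped.
CountVectorizerModel.vocabulary returns the words in the vocabulary. Calling zipWithIndex on it is handy for checking which index corresponds to which word.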
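
To make the effect of setMinDF(2) on the sample data concrete, here is a plain-Scala sketch that needs no Spark. It is only an illustration of the document-frequency filter, not how Spark implements CountVectorizer: the object name MinDfSketch and its helpers are made up for this example, and index assignment and the other CountVectorizer parameters are left out.

```scala
// Plain-Scala sketch of the minimum-document-frequency filter (no Spark needed).
// MinDfSketch and its helper functions are hypothetical names for illustration.
object MinDfSketch {
  // The five documents from data/sample1.txt, already split into words.
  val docs: Seq[Seq[String]] = Seq(
    Seq("hello", "hello", "hello"),
    Seq("hello", "world"),
    Seq("goodbye", "world"),
    Seq("I", "love", "you"),
    Seq("you", "love", "me")
  )

  // Document frequency: in how many documents does each word appear at least once?
  def documentFrequency(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatMap(_.distinct).groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Keep only the words whose document frequency is at least minDF.
  def vocabulary(docs: Seq[Seq[String]], minDF: Int): Set[String] =
    documentFrequency(docs).collect { case (w, df) if df >= minDF => w }.toSet

  // Count the in-vocabulary words of a single document.
  def counts(doc: Seq[String], vocab: Set[String]): Map[String, Int] =
    doc.filter(vocab).groupBy(identity).map { case (w, ws) => (w, ws.size) }

  def main(args: Array[String]): Unit = {
    val vocab = vocabulary(docs, minDF = 2)
    println(vocab.toSeq.sorted.mkString(", "))
    docs.foreach(d => println(counts(d, vocab)))
  }
}
```

With minDF = 2 the sketch keeps hello, love, world, and you, the same four words that appear in the vocabulary printed by the Spark run, although the sketch does not reproduce Spark's index order.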
