Sample
package com.kenjih

import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.ml.feature.{ CountVectorizer, CountVectorizerModel }
import org.apache.spark.sql.SQLContext

object CountVectorizerSample {

  def run(sc: SparkContext): Unit = {
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Each line of the input file is "<id>\t<space-separated words>".
    val path = "data/sample1.txt"
    val rdd = sc.textFile(path).map { line =>
      val ws = line.split("\t")
      val textId = ws(0)
      val words = ws(1).split(" ")
      (textId, words)
    }
    val df = rdd.toDF("id", "text")

    // Fit a vocabulary over the corpus; words appearing in fewer than
    // 2 documents are excluded (minDF).
    val cvm: CountVectorizerModel = new CountVectorizer()
      .setInputCol("text")
      .setOutputCol("features")
      .setMinDF(2)
      .fit(df)

    // Print the vocabulary as (index, word) pairs.
    cvm.vocabulary.zipWithIndex.map(_.swap).foreach(println)
    cvm.transform(df).select("features").show()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    run(sc)
  }
}
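To build target/scala-2.10/spark-sample.jar with sbt package, a build definition along these lines should work. This is a minimal sketch: the Spark version (1.6.3) is an assumption; any release from 1.5 on, which is when CountVectorizer was added, should do.

name := "spark-sample"

scalaVersion := "2.10.6"  // matches target/scala-2.10 in the spark-submit command below

// Spark version is an assumption; CountVectorizer requires Spark 1.5+.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.3" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.3" % "provided"
)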
Execution result
kenjih$ cat data/sample1.txt
0	hello hello hello
1	hello world
2	goodbye world
3	I love you
4	you love me
kenjih$ spark-submit --master local --class com.kenjih.CountVectorizerSample target/scala-2.10/spark-sample.jar
(0,hello)
(1,you)
(2,love)
(3,world)
+-------------------+
|           features|
+-------------------+
|      (4,[0],[3.0])|
|(4,[0,3],[1.0,1.0])|
|      (4,[3],[1.0])|
|(4,[1,2],[1.0,1.0])|
|(4,[1,2],[1.0,1.0])|
+-------------------+
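Each features row is a sparse vector of size 4 (the vocabulary size): (4,[0],[3.0]) means word index 0 ("hello") occurs 3 times in document 0. A minimal sketch for mapping the vectors back to word counts, assuming the df and cvm from the sample above (in Spark 1.x the ml feature column holds an mllib SparseVector):

import org.apache.spark.mllib.linalg.SparseVector

// Decode each sparse count vector back into "word=count" pairs
// using the fitted vocabulary.
cvm.transform(df).select("features").collect().foreach { row =>
  val v = row.getAs[SparseVector](0)  // CountVectorizer emits sparse vectors
  val counts = v.indices.zip(v.values).map {
    case (i, c) => s"${cvm.vocabulary(i)}=${c.toInt}"
  }
  println(counts.mkString(" "))
}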
Explanation
- The column that CountVectorizer is applied to should hold an Array[String], i.e., the text already split into words.
- setMinDF sets the minimum document frequency: a word that appears in fewer documents than the threshold is excluded from the vocabulary. In the example above, minDF = 2 drops "goodbye", "I", and "me", each of which appears in only one document.
- CountVectorizerModel.vocabulary exposes the words in the fitted vocabulary. Applying zipWithIndex makes it easy to check which index corresponds to which word. A model can also be built from a fixed vocabulary; see the sketch after this list.
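When the vocabulary is known up front (or was saved from an earlier fit), a CountVectorizerModel can be constructed directly from an Array[String] instead of calling fit. A minimal sketch, assuming the same df as above; the vocabulary contents here are just the four words the fitted model produced:

// Build a model from an a-priori vocabulary; indices follow array order.
val cvmFixed = new CountVectorizerModel(Array("hello", "world", "love", "you"))
  .setInputCol("text")
  .setOutputCol("features")

cvmFixed.transform(df).select("features").show()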