Learning R: Text Mining with R (the tm Package)
Published: 2019-06-25


 

First, install and load the tm package.
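A minimal sketch of that setup step. It assumes a CRAN mirror is reachable and only installs tm when it is not already present:

```r
# Install tm only if it is missing, then attach it for this session.
if (!requireNamespace("tm", quietly = TRUE)) {
  install.packages("tm")
}
library(tm)
```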


 

1. Read the text

x = readLines("222.txt")

2. Build the corpus

> r=Corpus(VectorSource(x))
> r
A corpus with 7012 text documents
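Steps 1 and 2 can be tried without the original data. The sketch below is only an illustration: the three lines of text are made up and written to a temporary file that stands in for "222.txt", and it uses VCorpus(), which recent tm versions provide alongside Corpus().

```r
library(tm)

# Hypothetical stand-in for "222.txt": one document per line.
path <- tempfile(fileext = ".txt")
writeLines(c("stem cells renew",
             "stem cells differ",
             "cancer therapy trial"), path)

x <- readLines(path)           # character vector, one element per line
r <- VCorpus(VectorSource(x))  # each element becomes one document
print(r)
length(r)                      # number of documents in the corpus
```

Because VectorSource() treats each vector element as a document, the number of documents equals the number of lines read.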

3. Write the corpus out to disk

> writeCorpus(r)
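Called with no other arguments, writeCorpus() writes one text file per document into the current working directory. A small sketch that sends the files to a temporary directory instead (the directory name and the two texts are made up):

```r
library(tm)

# Toy corpus; the texts are invented for illustration.
r <- VCorpus(VectorSource(c("stem cells renew", "cancer therapy trial")))

# Write one file per document into a dedicated output directory.
out_dir <- file.path(tempdir(), "corpus_out")
dir.create(out_dir, showWarnings = FALSE)
writeCorpus(r, path = out_dir)

list.files(out_dir)  # one file for each document
```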

 

4. Inspect the corpus

> print(r)
A corpus with 7012 text documents
> summary(r)
A corpus with 7012 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

  > inspect(r[2:2])

  A corpus with 1 text document

  The metadata consists of 2 tag-value pairs and a data frame

  Available tags are:
  create_date creator
  Available variables in the data frame are:
  MetaID

  [[1]]

  Female; Genital Neoplasms, Female/*therapy; Humans

  > r[[2]]

  Female; Genital Neoplasms, Female/*therapy; Humans
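The printed layout above comes from an older tm release, and newer versions format it differently. As a rough sketch on a made-up corpus, a document's text can also be pulled out as an ordinary character vector, which is handy for checking what a document actually contains:

```r
library(tm)

# Toy corpus; the texts are invented for illustration.
r <- VCorpus(VectorSource(c("stem cells renew",
                            "stem cells differ",
                            "cancer therapy trial")))

inspect(r[2])                    # summary of a one-document sub-corpus
second <- as.character(r[[2]])   # plain text of the second document
second
meta(r[[2]])                     # per-document metadata (id, language, ...)
```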

5. Build the document-term matrix

> dtm = DocumentTermMatrix(r)
> head(dtm)
A document-term matrix (6 documents, 16381 terms)

Non-/sparse entries: 110/98176
Sparsity           : 100%
Maximal term length: 81
Weighting          : term frequency (tf)

6. Inspect the document-term matrix

> inspect(dtm[1:2,1:4])
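To make the shape of the matrix concrete, here is a small sketch on an invented three-document corpus. Rows are documents, columns are terms, and each cell holds a term count. Converting to a dense matrix with as.matrix() is only sensible for small examples like this one:

```r
library(tm)

# Toy corpus; the texts are invented for illustration.
r <- VCorpus(VectorSource(c("stem cells renew",
                            "stem cells differ",
                            "cancer therapy trial")))

dtm <- DocumentTermMatrix(r)
dim(dtm)                  # documents x terms
inspect(dtm[1:2, 1:4])    # first two documents, first four terms

m <- as.matrix(dtm)       # dense copy, fine for a tiny corpus
m[2, "stem"]              # count of "stem" in the second document
```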

7. Find terms that occur at least 200 times

> findFreqTerms(dtm,200)
 [1] "acute"          "adjuvant"       "advanced"       "after"
 [5] "and"            "breast"         "cancer"         "cancer:"
 [9] "carcinoma"      "cell"           "chemotherapy"   "clinical"
[13] "colorectal"     "factor"         "for"            "from"
[17] "group"          "growth"         "iii"            "leukemia"
[21] "lung"           "lymphoma"       "metastatic"     "non-small-cell"
[25] "oncology"       "patients"       "phase"          "plus"
[29] "prostate"       "randomized"     "receptor"       "response"
[33] "results"        "risk"           "study"          "survival"
[37] "the"            "therapy"        "treatment"      "trial"
[41] "tumor"          "with"
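The list above still contains stop words ("and", "the", "with") and a punctuation variant ("cancer:"), which suggests the corpus was not cleaned first. The original post does not do this, but one possible sketch of a cleaning pass before counting looks like the following. The three texts are made up, and the threshold of 2 is chosen only to suit the toy data:

```r
library(tm)

# Toy corpus; the texts are invented for illustration.
r <- VCorpus(VectorSource(c("Cancer: the trial and the therapy",
                            "The cancer trial with chemotherapy",
                            "Stem cells for the lung")))

# Normalise case, drop punctuation and English stop words, tidy whitespace.
r <- tm_map(r, content_transformer(tolower))
r <- tm_map(r, removePunctuation)
r <- tm_map(r, removeWords, stopwords("english"))
r <- tm_map(r, stripWhitespace)

dtm <- DocumentTermMatrix(r)
freq <- findFreqTerms(dtm, lowfreq = 2)  # terms occurring at least twice overall
freq
```

After cleaning, "cancer:" and "cancer" are counted as the same term, and function words no longer crowd the frequent-term list.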

7. Remove sparse terms (terms that appear in only a few documents)

inspect(removeSparseTerms(dtm, 0.4))
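The second argument is the maximum sparsity a term may have. As I understand it, with 0.4 a term survives only if it appears in more than 60% of the documents, which is a strict filter on a real corpus. A sketch on made-up data where the effect is easy to see:

```r
library(tm)

# Toy corpus; the texts are invented for illustration.
r <- VCorpus(VectorSource(c("stem cells renew",
                            "stem cells differ",
                            "cancer therapy trial")))
dtm <- DocumentTermMatrix(r)

# Keep only terms present in more than 60% of the 3 documents (i.e. in 2 or 3).
dense <- removeSparseTerms(dtm, 0.4)
Terms(dense)
```

Here only "stem" and "cells" appear in two of the three documents, so they are the only terms expected to remain.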

8. Find terms whose correlation with "stem" is at least 0.5

> findAssocs(dtm, "stem", 0.5)
 stem cells
 1.00  0.61
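The correlation here is computed between term columns of the document-term matrix, so two terms score high when they tend to occur in the same documents. A sketch on invented data follows. It uses a limit of 0.8 rather than 0.5 purely to keep the toy result unambiguous, and it assumes a recent tm version, where the result is a list with one element per query term:

```r
library(tm)

# Toy corpus; the texts are invented for illustration.
r <- VCorpus(VectorSource(c("stem cells renew",
                            "stem cells differ",
                            "cancer therapy trial")))
dtm <- DocumentTermMatrix(r)

# Terms whose per-document counts correlate with those of "stem".
assoc <- findAssocs(dtm, "stem", 0.8)
assoc$stem
```

"cells" occurs in exactly the same documents as "stem", so it should be reported, while "cancer" never co-occurs with it and should not.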

9. Compute document dissimilarity (cosine distance)

> dist_dtm <- dissimilarity(dtm, method = 'cosine')
> head(dist_dtm)
[1] 1.0000000 0.7958759 0.8567770 0.9183503 0.9139337 0.9309934
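Newer tm releases no longer seem to ship dissimilarity(), so the call above may fail on a current installation. The sketch below computes the same kind of cosine distance (1 minus cosine similarity) by hand in base R on a made-up corpus. proxy::dist(as.matrix(dtm), method = "cosine") is a commonly suggested alternative, but that is an assumption to check against your installed packages.

```r
library(tm)

# Toy corpus; the texts are invented for illustration.
r <- VCorpus(VectorSource(c("stem cells renew",
                            "stem cells differ",
                            "cancer therapy trial")))
m <- as.matrix(DocumentTermMatrix(r))

# Cosine similarity between document rows, then distance = 1 - similarity.
norms <- sqrt(rowSums(m^2))
sim <- (m %*% t(m)) / (norms %o% norms)
dist_dtm <- as.dist(1 - sim)
dist_dtm
```

Documents that share no terms end up at distance 1, and identical documents at distance 0.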

10. Clustering

> hc <- hclust(dist_dtm, method = 'ave')
> plot(hc,xlab='')
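Here 'ave' is matched to hclust's "average" linkage. Beyond plotting the dendrogram, the tree can be cut into a fixed number of groups with cutree(). The sketch below is self-contained and uses four made-up documents and a hand-computed cosine distance. The choice of k = 2 is arbitrary and only suits the toy data:

```r
library(tm)

# Toy corpus with two obvious topics; the texts are invented for illustration.
r <- VCorpus(VectorSource(c("stem cells renew",
                            "stem cells differ",
                            "cancer therapy trial",
                            "cancer therapy response")))
m <- as.matrix(DocumentTermMatrix(r))

# Cosine distance between documents (1 - cosine similarity).
norms <- sqrt(rowSums(m^2))
dist_dtm <- as.dist(1 - (m %*% t(m)) / (norms %o% norms))

hc <- hclust(dist_dtm, method = "average")  # average-linkage clustering
plot(hc, xlab = "")                         # dendrogram
groups <- cutree(hc, k = 2)                 # assign each document to one of 2 clusters
groups
```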

 

 

     

Reposted from: https://www.cnblogs.com/todoit/archive/2012/07/13/2589741.html
