Background: a while ago I was apartment hunting, searching listings on 58.com and Ganji. Infuriatingly, agents pose as individuals and post huge numbers of listings, wasting everyone's time and feelings. In the spirit of frugality, I figured I'd build a program to take over from my brain and automatically filter the listings, recommending only the ones posted by actual individuals. The algorithm is simple and the writing is plain, so experts may want to pass on by; criticism is welcome.
If you repost, please credit the source: from zsw2zkl's blog:
The project is divided into three parts:
1. Web page scraping + parsing
2. Word segmentation (using the ansj segmenter, which is based on a hidden Markov model)
3. Model training + prediction
The first two parts only exist to prepare data for the third, so the rest of this post revolves around part 3.
Model training is based on the Naive Bayes algorithm.
1. Feature selection: the words produced by segmentation serve as the features of each listing. Why this choice? Think about how your own brain decides whether a listing comes from an agent or an individual: you read the description and get a feel from the wording, and the vocabulary agents use is generally very different from what individuals use.
2. Preparing training samples: the scraper from part 1 collects listings from the web. This rests on a simple fact: almost no individual will label their own listing as an agent's, so anything scraped from the agent category is essentially guaranteed to be agent data. I scraped about 277 such listings, which became the agent-class training data. The rental site also offers manually verified listings; these are not fully trustworthy (I have evidence, but let's not dwell on it), but they can basically be treated as individual postings. I scraped about 269 of those as the individual-class training data.
3. Segmentation: each resulting token becomes a feature of the article. Rental listings contain phrases whose meaning only holds as a unit: for example, 无中介费 ("no agent fee") should not be split into 无 and 中介费, since the two form a single meaning, and 朝南 ("south-facing") should not be split either. This is of course a personal judgment call with no fixed procedure. I prepared a user dictionary to improve accuracy (ps: not wanting to intervene or tune the results too early, I only added a few entries and stopped). After segmentation I renamed the files: agent listings start with "broker" plus an index, individual listings with "person" plus an index, and each listing's title was appended to its content. A sketch of this step follows.
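As an illustration, here is a minimal sketch of this step using ansj: it registers the multi-word units in a user dictionary, segments one listing, and writes the tokens as the single space-separated line that the training code below expects. I'm assuming an ansj_seg 2.x-style API (UserDefineLibrary.insertWord, ToAnalysis.parse); the sample text and output file name are made up.

import java.io.FileWriter;
import org.ansj.domain.Term;
import org.ansj.library.UserDefineLibrary;
import org.ansj.splitWord.analysis.ToAnalysis;

public class SegmentDemo {
    public static void main(String[] args) throws Exception {
        // keep multi-word units intact; the nature tag and frequency are arbitrary here
        UserDefineLibrary.insertWord("无中介费", "userDefine", 1000);
        UserDefineLibrary.insertWord("朝南", "userDefine", 1000);

        // hypothetical listing text, title already appended to the content
        String listing = "个人直租,两居室朝南,采光好,无中介费";
        StringBuilder sb = new StringBuilder();
        for (Term term : ToAnalysis.parse(listing)) {
            sb.append(term.getName()).append(" ");
        }

        // one article per file; the class is encoded in the file name
        try (FileWriter fw = new FileWriter("person6")) {
            fw.write(sb.toString().trim());
        }
    }
}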
4. Training with NB (Naive Bayes): well, now that every article has its features, the computation can begin. The goal is to obtain, for every word, the probability of that word under each class (maximum likelihood). Reference Java code follows.
The data structure is {word: [in_broker_num, in_person_num]}: the word is the key, the value is a List where index 0 is the agent ("broker") count and index 1 is the individual ("person") count.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

String learn_file_root = "corpus"; // directory holding the segmented articles (path is illustrative)
String end = "1";                  // suffix marking the held-out files

Map<String, List<Double>> m = new HashMap<>();
// number of distinct words seen in each class
float broker_num = 0.0f;
float person_num = 0.0f;
// total word (token) count in each class
float broker_total = 0.0f;
float person_total = 0.0f;
// count every word's occurrences per class across all training articles
for (File f : new File(learn_file_root).listFiles()) {
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
    String fileName = f.getName();
    // hold out some articles for cross validation,
    // e.g. all articles whose names end with "1" are kept for predicting
    if (fileName.endsWith(end)) {
        continue;
    }
    int index = 0;
    if (fileName.contains("person")) {
        index = 1;
    }
    String line = br.readLine();
    String[] split = line.split(" ");
    br.close();
    for (String word : split) {
        List<Double> list = m.get(word);
        if (list == null) {
            list = new ArrayList<Double>() {
                private static final long serialVersionUID = -1256219944522765531L;
                {
                    add(0.0);
                    add(0.0);
                }
            };
            m.put(word, list);
        }
        Double count = list.get(index);
        list.set(index, count + 1);
    }
}
System.out.println(m);
// tally per-class vocabulary sizes and token totals
for (Entry<String, List<Double>> wordM : m.entrySet()) {
    List<Double> list = wordM.getValue();
    if (list.get(0) != 0.0) {
        broker_num++;
        broker_total += list.get(0);
    }
    if (list.get(1) != 0.0) {
        person_num++;
        person_total += list.get(1);
    }
}
// turn raw counts into add-one-smoothed log probabilities
for (Entry<String, List<Double>> wordM : m.entrySet()) {
    List<Double> list = wordM.getValue();
    ArrayList<Double> list2 = new ArrayList<Double>() {
        private static final long serialVersionUID = 1L;
        {
            add(0.0);
            add(0.0);
        }
    };
    list2.set(0, Math.log((list.get(0) + 1) / (broker_num + broker_total)));
    list2.set(1, Math.log((list.get(1) + 1) / (person_num + person_total)));
    wordM.setValue(list2);
}
System.out.println(m);
Because the articles being predicted will contain words that never appeared in the model, add-one smoothing is applied here: 1 is added to every word's count.
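Written out as a formula, the estimate the code above computes for a word w in class c is the following, where n_c is the number of distinct words seen in class c and N_c is its total token count (note the denominator uses n_c rather than the global vocabulary size, matching the code rather than textbook Laplace smoothing):

\[
\log P(w \mid c) = \log \frac{\operatorname{count}(w, c) + 1}{n_c + N_c}
\]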
5. OK, now that the model is trained, we use it to predict: we compute each article's probability under each class, then assign the article to the class with the larger probability. This relies on an independence assumption, which of course isn't very reasonable, since the words in an article are mostly not independent of one another; but granting it makes the probabilities computable. Since the code above already took logs, for each article we only need to sum the per-word values; the decision rule is sketched below, with reference code after it.
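In log space this is the standard Naive Bayes argmax. One observation: the code below stands in Math.log(broker_num) and Math.log(person_num), the per-class distinct-word counts, for the prior term log P(c), rather than a document-frequency prior:

\[
\hat{c} = \arg\max_{c \in \{\text{broker},\ \text{person}\}} \Big( \log P(c) + \sum_{w \in d} \log P(w \mid c) \Big)
\]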
List<String> l_broker = new ArrayList<String>();
List<String> l_person = new ArrayList<String>();
for (File f : new File(learn_file_root).listFiles()) {
    List<String> l_null = new ArrayList<String>();     // words unseen in training
    List<String> l_not_null = new ArrayList<String>(); // words seen in training
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
    String fileName = f.getName();
    // only the held-out files are predicted
    if (!fileName.endsWith(end)) {
        continue;
    }
    String line = br.readLine();
    String[] split = line.split(" ");
    br.close();
    Double sum_broker = 0.0;
    Double sum_person = 0.0;
    for (String word : split) {
        List<Double> list = m.get(word);
        if (list == null) {
            l_null.add(word);
            // unseen in training: fall back to a small constant
            // (note: not a log value, unlike the sums below)
            sum_broker += 1.0 / broker_num;
            sum_person += 1.0 / person_num;
            continue;
        }
        l_not_null.add(word);
        sum_broker += list.get(0);
        sum_person += list.get(1);
    }
    // "prior" terms: log of the per-class distinct-word counts
    sum_broker += Math.log(broker_num);
    sum_person += Math.log(person_num);
    // System.out.println("l_null: " + l_null.size() + " " + l_null);
    // System.out.println("l_not_null: " + l_not_null.size() + " " + l_not_null);
    if (sum_broker > sum_person) {
        l_broker.add(fileName);
    } else {
        l_person.add(fileName);
    }
}
System.out.println("broker: " + l_broker.size() + " " + l_broker);
System.out.println("person: " + l_person.size() + " " + l_person);

6. Finally, let's compute accuracy, recall, and related metrics. The definitions, as they map onto the counts in the code, are sketched below, followed by the reference code.
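This just restates what the code below computes, with TP_broker standing for broker_right and TP_person for person_right:

\[
\text{accuracy} = \frac{TP_{\text{broker}} + TP_{\text{person}}}{|l\_broker| + |l\_person|},\qquad
\text{precision}_{\text{broker}} = \frac{TP_{\text{broker}}}{|l\_broker|},\qquad
\text{recall}_{\text{broker}} = \frac{TP_{\text{broker}}}{TP_{\text{broker}} + |l\_person| - TP_{\text{person}}}
\]

and symmetrically for the person class: the recall denominator counts all true members of a class, i.e. those classified correctly plus those misfiled into the other list.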
int broker_right = 0;
for (String string : l_broker) {
    if (string.contains("broker")) {
        broker_right++;
    }
}
int person_right = 0;
for (String string : l_person) {
    if (string.contains("person")) {
        person_right++;
    }
}
System.out.println("accuracy: " + (broker_right + person_right) * 1.0 / (l_broker.size() + l_person.size()));
System.out.println("broker precision: " + broker_right * 1.0 / l_broker.size());
System.out.println("person precision: " + person_right * 1.0 / l_person.size());
System.out.println("broker recall: " + broker_right * 1.0 / (broker_right + l_person.size() - person_right));
System.out.println("person recall: " + person_right * 1.0 / (person_right + l_broker.size() - broker_right));

7. Results from a sample run are as follows:
broker: 25 [broker106, broker116, broker126, broker146, broker156, broker166, broker176, broker186, broker196, broker206, broker226, broker236, broker246, broker256, broker26, broker266, broker276, broker36, broker46, broker56, broker66, broker76, broker86, broker96, person196]
person: 30 [broker136, broker16, broker216, broker6, person106, person116, person126, person136, person146, person156, person16, person166, person176, person186, person206, person216, person226, person236, person246, person256, person26, person266, person36, person46, person56, person6, person66, person76, person86, person96]
accuracy: 0.9090909090909091
broker precision: 0.96
person precision: 0.8666666666666667
broker recall: 0.8571428571428571
person recall: 0.9629629629629629