Classifying Text with Logistic Regression [DataGrand Cup (达观杯), Part 1]


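I have recently been taking part in the DataGrand Cup (达观杯) text classification competition and have tried several methods for classifying the text, so I am keeping a record of them here :)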


Competition overview

Competition URL:


Data
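“train_set.csv: this dataset is used to train the model, and each row corresponds to one article. The articles have been anonymized at both the character level and the word level. There are four columns: the first is the article's index (id); the second is the article body represented at the character level, i.e. the body split into separate characters (article); the third is the representation at the word level, i.e. the body split into separate words (word_seg); the fourth is the article's label (class). Note: each number corresponds to one character, one word, or one punctuation mark. Character IDs and word IDs are numbered independently.”
“test_set.csv” contains only id, article, and word_seg.
The participant's job is to build a model that predicts each text's class from either article or word_seg.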
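To make the data layout concrete, here is a minimal sketch that builds a two-row sample in the described format and turns the word_seg column into unigram and bigram counts. The token IDs and labels in the sample are made up for illustration and are not taken from the competition data.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-row sample in the described layout; all IDs are invented.
csv_text = (
    "id,article,word_seg,class\n"
    "0,11 12 13 14,101 102 103,3\n"
    "1,12 15 11,102 104,1\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Each document is a whitespace-separated string of anonymized token IDs,
# so a bag-of-words vectorizer can treat every ID as a "word".
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(df["word_seg"])

# Sorted list of the unigram and bigram features that were found.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(counts.shape)
```

One detail worth keeping in mind: the default token_pattern of CountVectorizer only keeps tokens of two or more characters, so a single-digit ID such as "7" would be silently dropped unless token_pattern is changed.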
For my first attempt I used the logistic regression from sklearn to make the predictions. The code is as follows:


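import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

print('start')
df_train = pd.read_csv('./train_set.csv')
df_test = pd.read_csv('./test_set.csv')
# Only the word-level text (word_seg) is used; drop the unused columns.
df_train.drop(columns=['article', 'id'], inplace=True)
df_test.drop(columns=['article'], inplace=True)

# Unigram and bigram counts, limited to the 100000 most frequent features.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, max_features=100000)
vectorizer.fit(df_train['word_seg'])
x_train = vectorizer.transform(df_train['word_seg'])
x_test = vectorizer.transform(df_test['word_seg'])
y_train = df_train['class'] - 1  # shift labels so they start at 0

# dual=True is only supported by the liblinear solver, so it is set explicitly
# (newer scikit-learn versions default to lbfgs).
lg = LogisticRegression(C=4, dual=True, solver='liblinear')
lg.fit(x_train, y_train)

y_test = lg.predict(x_test)

df_test['class'] = y_test.tolist()
df_test['class'] = df_test['class'] + 1  # shift labels back to the original range
df_result = df_test.loc[:, ['id', 'class']]
df_result.to_csv('./result.csv', index=False)
print('end')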


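After submitting the predictions to the website, the accuracy was about 0.73.
Next I tried changing the vectorizer from CountVectorizer to TF-IDF (TfidfVectorizer).
The code is as follows (only one line really changed, haha):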


import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

print('start')
df_train = pd.read_csv('./train_set.csv')
df_test = pd.read_csv('./test_set.csv')
df_train.drop(columns=['article', 'id'], inplace=True)
df_test.drop(columns=['article'], inplace=True)

print('inprocess1')
# The only real change from the first script: TfidfVectorizer replaces CountVectorizer.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, max_features=100000)
vectorizer.fit(df_train['word_seg'])
x_train = vectorizer.transform(df_train['word_seg'])
x_test = vectorizer.transform(df_test['word_seg'])
y_train = df_train['class'] - 1  # shift labels so they start at 0

print('inprocess2')
# dual=True is only supported by the liblinear solver, so it is set explicitly.
lg = LogisticRegression(C=4, dual=True, solver='liblinear')
lg.fit(x_train, y_train)

y_test = lg.predict(x_test)

print('inprocess3')
df_test['class'] = y_test.tolist()
df_test['class'] = df_test['class'] + 1  # shift labels back to the original range
df_result = df_test.loc[:, ['id', 'class']]
df_result.to_csv('./result.csv', index=False)
print('end')


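After submitting this version, the accuracy rose from 0.73 to 0.76, which is not bad. My current rank is somewhere past 100, and the entrant in first place is at 0.8-something. If you are interested, give it a try too :)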
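As a small illustration of what switching from CountVectorizer to TfidfVectorizer changes, the sketch below compares the two on a toy corpus. The token IDs are made up, and both vectorizers use their default settings rather than the ones from the scripts above; it only shows the re-weighting effect and is not an explanation of the score difference.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up token IDs: "10" appears in every document, "99" in only one.
docs = ["10 20 10", "10 30", "10 99"]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
counts = count_vec.fit_transform(docs).toarray()
tfidf = tfidf_vec.fit_transform(docs).toarray()

# Both vectorizers build the same vocabulary, so column indices line up.
col_common = count_vec.vocabulary_["10"]
col_rare = count_vec.vocabulary_["99"]

# In the third document both tokens occur once, so their raw counts are equal,
# but TF-IDF gives the rarer token the larger weight.
print(counts[2, col_common], counts[2, col_rare])
print(tfidf[2, col_common], tfidf[2, col_rare])
```

With raw counts, a token that shows up in every article weighs as much as one that appears in only a few; TF-IDF scales down the former, which may help a linear model focus on the more distinctive tokens.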
