Classifying text content with a support vector machine (SVM) [达观杯 2]

Previously:

Classifying text content with logistic regression [达观杯 1] https://siyuanyuanyuan.blogspot.com/2018/08/to-classify-text-content-by-logistic.html


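As mentioned last time, I'm taking part in the 达观杯 text classification competition. With a logistic regression model, the best accuracy I reached was 0.76. This time I'm going to try an SVM model and see whether it can improve on that.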


Here's the code:

import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

print('start')
# Load the data and drop the columns that are not used as model input.
df_train = pd.read_csv('./train_set.csv')
df_test = pd.read_csv('./test_set.csv')
df_train.drop(columns=['article', 'id'], inplace=True)
df_test.drop(columns=['article'], inplace=True)

# Alternative kept for reference: raw counts instead of TF-IDF weights.
#vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, max_features=100000)

print('vectorizer')
# TF-IDF over unigrams and bigrams of the 'word_seg' column.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, max_features=100000)
vectorizer.fit(df_train['word_seg'])
x_train = vectorizer.transform(df_train['word_seg'])
x_test = vectorizer.transform(df_test['word_seg'])
# Shift the labels down by one for training; they are shifted back before saving.
y_train = df_train['class'] - 1

print('train and predict')
# Instantiate the classifier, then fit it and predict the test set.
svm = LinearSVC()
svm.fit(x_train, y_train)
y_test = svm.predict(x_test)

print('output')
# Restore the original label numbering and write the 'id' and 'class' columns.
df_test['class'] = y_test.tolist()
df_test['class'] = df_test['class'] + 1
df_result = df_test.loc[:, ['id', 'class']]
df_result.to_csv('./result.csv', index=False)
print('end')

As you can see, compared with the logistic regression code, only a small part changed:

svm = LinearSVC()
svm.fit(x_train, y_train)
y_test = svm.predict(x_test)
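To make the "only the model changes" point concrete, here is a minimal sketch on a made-up toy corpus. The documents, labels, and the `fit_and_predict` helper are hypothetical stand-ins and not the competition data. The same TF-IDF features go into both estimators, and only the classifier object differs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Made-up stand-in for the 'word_seg' column: space-separated tokens.
docs = ["11 22 33 11", "11 22 22 33", "77 88 99 77", "88 99 99 77"]
labels = [0, 0, 1, 1]

# Same feature extraction for both models.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
x = vectorizer.fit_transform(docs)


def fit_and_predict(model, new_docs):
    """Fit any scikit-learn classifier on the shared features and predict."""
    model.fit(x, labels)
    return model.predict(vectorizer.transform(new_docs)).tolist()


new_docs = ["11 33 22", "99 77 88"]
lr_pred = fit_and_predict(LogisticRegression(), new_docs)
svm_pred = fit_and_predict(LinearSVC(), new_docs)
print(lr_pred, svm_pred)
```

Because both classes expose the same `fit` / `predict` interface, swapping one for the other does not touch the rest of the pipeline.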

To save time, I started with LinearSVC. It ran for roughly 30 minutes. So how did it do?

[Image: 微信图片_20180822130529.png]


Surprisingly, the score actually dropped by zero-point-something.
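A comparison like this can also be made locally before submitting, by holding out part of the labelled training data. The post does not do this, so the sketch below is only an assumption about how it might look, using a made-up corpus and scikit-learn's `train_test_split` and `accuracy_score`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Made-up corpus: two classes with disjoint vocabularies, 10 documents each.
docs = [" ".join(["11", "22", "33"] * (i + 1)) for i in range(10)]
docs += [" ".join(["77", "88", "99"] * (i + 1)) for i in range(10)]
labels = [0] * 10 + [1] * 10

# Hold out 30% of the labelled data; stratify keeps both classes in each part.
docs_fit, docs_val, y_fit, y_val = train_test_split(
    docs, labels, test_size=0.3, stratify=labels, random_state=0)

# Fit the vectorizer on the fitting part only, then reuse it for validation.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
x_fit = vectorizer.fit_transform(docs_fit)
x_val = vectorizer.transform(docs_val)

model = LinearSVC()
model.fit(x_fit, y_fit)
val_accuracy = accuracy_score(y_val, model.predict(x_val))
print('validation accuracy:', val_accuracy)
```

On the real data the held-out score will not match the leaderboard exactly, but it gives a quick way to compare two models without waiting for a submission.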


The next step is to try other SVM models.
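Which SVM variant comes next is not stated here. One possible candidate, shown purely as an assumption, is scikit-learn's kernel `SVC` with an RBF kernel. The sketch below uses a made-up toy corpus, and the parameter values are simply the library defaults written out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up stand-in for the 'word_seg' column: space-separated tokens.
docs = ["11 22 33 11", "11 22 22 33", "77 88 99 77", "88 99 99 77"]
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
x = vectorizer.fit_transform(docs)

# Kernel SVM with an RBF kernel instead of the linear model.
kernel_svm = SVC(kernel='rbf', C=1.0, gamma='scale')
kernel_svm.fit(x, labels)

kernel_pred = kernel_svm.predict(
    vectorizer.transform(["11 33 22", "99 77 88"])).tolist()
print(kernel_pred)
```

Kernel `SVC` training time grows much faster with the number of samples than `LinearSVC` does, so on a dataset where the linear model already takes about 30 minutes it may be worth trying on a subsample first.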

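That's it for this post. If you have questions about this competition or about machine learning in general, feel free to leave a comment and discuss them with me.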
