I am trying to predict article pageviews from the length of the title plus the text content of the article. I used TF-IDF as follows:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
corpus = result_df['_text'].tolist()
# Build the document-term count matrix, dropping English stop words
count_vect = CountVectorizer(min_df=1, stop_words='english')
dtm = count_vect.fit_transform(corpus)
word_counts = dtm.toarray()
# Reweight the raw counts by TF-IDF
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(word_counts)
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
words_df = pd.DataFrame(tfidf.toarray(), columns=count_vect.get_feature_names_out())
I use standard scaling as follows:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
# StandardScaler expects 2-D input, so select the column as a one-column DataFrame
result_df['_title'] = scaler.fit_transform(result_df[['_title']]).ravel()
So my X looks as follows:
_title 00 000 0002 0003 000667 000709 000725 001 0013 ... última últimamente último única únicamente único únicos útiles 네일_박은경 유니스텔라
0 62 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
1 41 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
2 53 0.000000 0.020781 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.000000
My Y (target values) looks like this:
0 166.0
1 24.0
2 22.0
Now I'm trying to run a basic linear regression, and I get an absolutely astronomical root mean squared error (RMSE):
from sklearn.model_selection import train_test_split
# as_matrix() was removed in pandas 1.0; .values works across versions
X_train, X_test, Y_train, Y_test = train_test_split(X.values, Y, test_size=0.2)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)
views_predictions = lin_reg.predict(X_test)
lin_mse = mean_squared_error(Y_test, views_predictions)
lin_rmse = np.sqrt(lin_mse)  # value is 770956447401244.75
The mean of my Y values is 1.487 and the standard deviation is ~8000, so this number cannot be right. Even guessing the same number every time would give a far lower error than this.
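To make the "guessing the same number each time" comparison concrete, here is a minimal sketch of that constant baseline. It assumes the constant guess is the mean of Y. The helper names `rmse` and `constant_baseline_rmse` are made up for illustration, and the sample list holds only the three target rows shown above, not the real data. It uses only the standard library so it runs without the rest of the pipeline.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def constant_baseline_rmse(y_true):
    """RMSE obtained by predicting the mean of y_true for every row."""
    mean_value = statistics.mean(y_true)
    return rmse(y_true, [mean_value] * len(y_true))

# Hypothetical sample: only the three target rows shown above, not the full data.
sample_views = [166.0, 24.0, 22.0]
baseline = constant_baseline_rmse(sample_views)

# When the guess is the mean of the same data, the RMSE equals the
# population standard deviation, so both printed values should agree.
print(baseline, statistics.pstdev(sample_views))
```

On the same data, the RMSE of a mean-only guess is the population standard deviation, which is why an RMSE many orders of magnitude above ~8000 looks wrong. On a held-out test split the two would only match approximately.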
With DecisionTreeRegressor, the RMSE is 15053.957646453207.