SciKit-Learn: Astronomical Error Using Linear Regression

I am trying to predict article pageviews from the length of the title and the text content of the article. I compute TF-IDF features as follows:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
corpus = result_df['_text'].tolist()
# Build the document-term matrix of raw word counts
count_vect = CountVectorizer(min_df=1, stop_words='english')
dtm = count_vect.fit_transform(corpus)
# Reweight the counts with TF-IDF (the sparse matrix can be passed directly)
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(dtm)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words_df = pd.DataFrame(tfidf.toarray(), columns=count_vect.get_feature_names_out())

I use standard scaling as follows:

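from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
# StandardScaler expects 2-D input, so pass a one-column DataFrame and flatten the result
result_df['_title'] = scaler.fit_transform(result_df[['_title']]).ravel()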

So my X looks as follows:

_title  00  000 0002    0003    000667  000709  000725  001 0013    ... última  últimamente último  única   únicamente  único   únicos  útiles  네일_박은경  유니스텔라
0   62  0.000000    0.000000    0.0 0.0 0.0 0.0 0.0 0.000000    0.0 ... 0.000000    0.0 0.0 0.0 0.0 0.000000    0.0 0.0 0.0 0.0
1   41  0.000000    0.000000    0.0 0.0 0.0 0.0 0.0 0.000000    0.0 ... 0.000000    0.0 0.0 0.0 0.0 0.000000    0.0 0.0 0.0 0.0
2   53  0.000000    0.020781    0.0 0.0 0.0 0.0 0.0 0.000000    0.0 ... 0.000000    

My Y (target values) looks like this:

0        166.0
1         24.0
2         22.0

Now I'm trying to run a basic linear regression, and I get an absolutely astronomical root mean squared error (RMSE):

from sklearn.model_selection import train_test_split
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
X_train, X_test, Y_train, Y_test = train_test_split(X.to_numpy(), Y, test_size=0.2)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)
views_predictions = lin_reg.predict(X_test)
lin_mse = mean_squared_error(Y_test, views_predictions)
lin_rmse = np.sqrt(lin_mse)  # value is 770956447401244.75

The mean of my Y values is 1,487 and the standard deviation is ~8000, so this number cannot be right. Even guessing the same number every time would do far better than this.
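The baseline claim is easy to check: a "model" that always predicts the mean of Y has an RMSE equal to the population standard deviation of Y, so an RMSE far above ~8000 is worse than a constant guess. A minimal sketch, assuming made-up target values (only the first three come from the sample above; the rest are hypothetical):

```python
import math
import statistics

# Pageview counts for illustration; the last two values are invented
y = [166.0, 24.0, 22.0, 5000.0, 310.0]

mean_y = statistics.fmean(y)
# RMSE of a constant predictor that outputs mean_y for every article
baseline_rmse = math.sqrt(sum((v - mean_y) ** 2 for v in y) / len(y))

# The two printed numbers coincide: constant-mean RMSE == population std
print(baseline_rmse, statistics.pstdev(y))
```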

By comparison, a DecisionTreeRegressor gives an RMSE of 15053.957646453207.

So, what am I doing wrong?

Thanks!

1 answer

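Your model is overfitting. After vectorization there are far more features (one per distinct word) than articles, and plain LinearRegression has no regularization, so nothing stops it from assigning huge coefficients to rare words.

For example, suppose the word "aardvark" is rare in the training set, and the model ends up giving it a coefficient of 1000. Now a test article comes along in which the "aardvark" feature equals 1000. The model predicts 1000 × 1000 = 1000000 views! The true value is 3. So the error on this single article is 999997.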
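The arithmetic behind the "aardvark" example can be sketched in a few lines. The coefficient of 1000 and the feature value of 1000 are assumptions, chosen only to reproduce the prediction of 1000000 and the error of 999997 quoted above:

```python
# Assumed values: a large weight learned for a rare word, and a test
# article in which that word's feature value is unusually high.
weight_aardvark = 1000   # coefficient the model assigned to "aardvark"
feature_value = 1000     # value of the "aardvark" feature in the test article
true_views = 3           # actual pageviews of that article

# A linear model's contribution from one feature is weight * value
predicted_views = weight_aardvark * feature_value
error = predicted_views - true_views
print(predicted_views, error)  # 1000000 999997
```

A handful of such articles in the test set is enough to dominate the RMSE, because each error is squared before averaging.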

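What can you do about it?

  • Reduce the number of features. With CountVectorizer(min_df=10), only words that occur in at least 10 documents are kept.
  • Use a regularized model. Replace LinearRegression with Ridge or Lasso, which penalize large coefficients. With tf-idf features, a GBM is another option.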



Source: https://habr.com/ru/post/1680153/

