I have 2.2 million samples to classify into 7,500 categories. I use pandas and scikit-learn in Python for this.
Below is a sample of my dataset
    itemid     description                                         category
    11802974   SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters      Architectural Diffusers
    10688548   ANTIQUE BRONZE FINISH PUSHBUTTON switch             Door Bell Pushbuttons
    9836436    Descente pour Cable tray fitting and accessories    Tray Cable Drop Outs
Below are the steps that I followed:
- Preprocessing
- Vectorization
- Training
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from sklearn.feature_extraction.text import CountVectorizer

    dataset = pd.read_csv("trainset.csv", encoding="ISO-8859-1", low_memory=False)

    # Keep letters only, then lowercase
    dataset['description'] = dataset['description'].str.replace('[^a-zA-Z]', ' ', regex=True)
    dataset['description'] = dataset['description'].str.replace(r'[\d]', ' ', regex=True)  # redundant: digits were already removed above
    dataset['description'] = dataset['description'].str.lower()

    # Remove English stop words and collapse repeated whitespace
    stop = stopwords.words('english')
    dataset['description'] = dataset['description'].str.replace(
        r'\b(' + r'|'.join(stop) + r')\b\s*', ' ', regex=True)
    dataset['description'] = dataset['description'].str.replace(r'\s\s+', ' ', regex=True)

    # Tokenize, then lemmatize once per part of speech
    # (note: set() deduplicates tokens and loses word order)
    dataset['description'] = dataset['description'].apply(word_tokenize)
    lemmatizer = WordNetLemmatizer()
    ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    POS_LIST = [NOUN, VERB, ADJ, ADV]
    for tag in POS_LIST:
        dataset['description'] = dataset['description'].apply(
            lambda x: list(set(lemmatizer.lemmatize(item, tag) for item in x)))
    dataset['description'] = dataset['description'].apply(lambda x: " ".join(x))

    # Build the (sparse) document-term matrix
    countvec = CountVectorizer(min_df=0.0005)
    documenttermmatrix = countvec.fit_transform(dataset['description'])
    column = countvec.get_feature_names()  # get_feature_names_out() in newer scikit-learn
    y_train = dataset['category'].tolist()

    # Free what is no longer needed
    del dataset
    del stop
    del tag
The resulting document-term matrix is a SciPy CSR sparse matrix with about 12,000 features and 2.2 million samples.
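For reference, the actual footprint of a CSR matrix can be read off its underlying arrays (a quick sanity check; the exact number depends on how sparse the descriptions are):

    # Shape and real memory use of the CSR matrix
    print(documenttermmatrix.shape)  # expect roughly (2200000, 12000)
    size_gb = (documenttermmatrix.data.nbytes
               + documenttermmatrix.indices.nbytes
               + documenttermmatrix.indptr.nbytes) / 1e9
    print(f"CSR matrix uses about {size_gb:.2f} GB")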
For training, I tried XGBoost through its scikit-learn API:
    from xgboost import XGBClassifier

    model = XGBClassifier(silent=False, n_estimators=500,
                          objective='multi:softmax', subsample=0.8)
    model.fit(documenttermmatrix, y_train, verbose=True)
After 2-3 minutes of execution, I got this error:
    OSError: [WinError 541541187] Windows Error 0x20474343
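If I understand correctly, with objective='multi:softmax' XGBoost keeps a gradient/hessian pair per sample per class, which here would be roughly 2,200,000 × 7,500 × 2 × 4 bytes ≈ 132 GB, already more than my RAM, and it also grows one tree per class per boosting round (500 × 7,500 = 3.75 million trees). Would a smoke test on a subsample be a sensible first step? Something like this (untested sketch; the subsample size is arbitrary, and a random subsample may not contain all 7,500 categories):

    # Untested sketch: same model on a small random subsample first
    import numpy as np
    rng = np.random.default_rng(0)
    idx = rng.choice(documenttermmatrix.shape[0], size=100_000, replace=False)
    X_small = documenttermmatrix[idx]
    y_small = [y_train[i] for i in idx]
    model = XGBClassifier(n_estimators=50, objective='multi:softmax', subsample=0.8)
    model.fit(X_small, y_small)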
I also tried scikit-learn's Naive Bayes, for which I got a MemoryError.
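Assuming MultinomialNB (the usual choice for count features), would out-of-core training via partial_fit be the right direction? A sketch of what I have in mind (the batch size is an arbitrary guess):

    # Sketch: out-of-core Naive Bayes, feeding the CSR matrix in row chunks
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    clf = MultinomialNB()
    classes = np.unique(y_train)  # partial_fit needs the full class list up front
    batch = 100_000               # arbitrary chunk size
    for start in range(0, documenttermmatrix.shape[0], batch):
        clf.partial_fit(documenttermmatrix[start:start + batch],
                        y_train[start:start + batch],
                        classes=classes)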
Question
I use a SciPy sparse matrix, which consumes very little memory, and I delete all unused objects before running XGBoost or Naive Bayes. My system has 128 GB of RAM, yet I still run into memory problems during training.
I am new to Python. Is there something wrong in my code? Can anyone tell me how to use memory efficiently and keep the training going?