Text classification of a large dataset in python

I have 2.2 million samples to classify into over 7,500 categories. I use pandas and scikit-learn in Python for this.

Below is a sample of my dataset

    itemid      description                                         category
    11802974    SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters      Architectural Diffusers
    10688548    ANTIQUE BRONZE FINISH PUSHBUTTON switch             Door Bell Pushbuttons
    9836436     Descente pour Cable tray fitting and accessories    Tray Cable Drop Outs

Below are the steps that I followed:

  • Preprocessing
  • Vectorization
  • Training

    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from sklearn.feature_extraction.text import CountVectorizer

    # load the training data
    dataset = pd.read_csv("trainset.csv", encoding="ISO-8859-1", low_memory=False)

    # keep letters only, strip digits, lowercase
    dataset['description'] = dataset['description'].str.replace('[^a-zA-Z]', ' ')
    dataset['description'] = dataset['description'].str.replace(r'[\d]', ' ')
    dataset['description'] = dataset['description'].str.lower()

    # remove English stop words and collapse repeated whitespace
    stop = stopwords.words('english')
    lemmatizer = WordNetLemmatizer()
    dataset['description'] = dataset['description'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
    dataset['description'] = dataset['description'].str.replace(r'\s\s+', ' ')

    # tokenize, then lemmatize once per part of speech
    dataset['description'] = dataset['description'].apply(word_tokenize)
    ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    POS_LIST = [NOUN, VERB, ADJ, ADV]
    for tag in POS_LIST:
        dataset['description'] = dataset['description'].apply(
            lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
    dataset['description'] = dataset['description'].apply(lambda x: " ".join(x))

    # build the document-term matrix and the target list
    countvec = CountVectorizer(min_df=0.0005)
    documenttermmatrix = countvec.fit_transform(dataset['description'])
    column = countvec.get_feature_names()
    y_train = dataset['category']
    y_train = dataset['category'].tolist()

    # free unused objects before training
    del dataset
    del stop
    del tag

The resulting document-term matrix is a SciPy CSR sparse matrix with roughly 12k features and 2.2 million samples.
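The sparse matrix can be inspected to confirm its shape and how much memory it actually holds (a quick sketch using the documenttermmatrix variable created above):

    # a CSR matrix stores only the non-zero counts plus their column indices
    # and row pointers, so its footprint is far smaller than a dense array
    print(documenttermmatrix.shape)   # (2200000, ~12000)
    print(documenttermmatrix.nnz)     # number of stored non-zero entries
    size_gb = (documenttermmatrix.data.nbytes
               + documenttermmatrix.indices.nbytes
               + documenttermmatrix.indptr.nbytes) / 1e9
    print(size_gb, "GB held in memory")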

For training, I tried XGBoost through its scikit-learn API:

    from xgboost import XGBClassifier

    model = XGBClassifier(silent=False, n_estimators=500, objective='multi:softmax', subsample=0.8)
    model.fit(documenttermmatrix, y_train, verbose=True)

After 2-3 minutes of running the above code, I got this error:

 OSError: [WinError 541541187] Windows Error 0x20474343 

I also tried Naive Bayes from scikit-learn, for which I got a MemoryError.
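A minimal sketch of such a Naive Bayes fit on the sparse matrix (assuming scikit-learn's MultinomialNB, which works on the CSR count matrix directly; the exact variant used is not shown above):

    from sklearn.naive_bayes import MultinomialNB

    # MultinomialNB fits on the sparse count matrix without densifying it;
    # the choice of MultinomialNB here is an assumption for illustration
    nbmodel = MultinomialNB()
    nbmodel.fit(documenttermmatrix, y_train)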

Question

I used a SciPy sparse matrix, which consumes very little memory, and I also delete all unused objects before running XGBoost or Naive Bayes. I use a system with 128 GB of RAM, but I still run into memory problems during training.

I am new to Python. Is there something wrong with my code? Can anyone tell me how I can use memory efficiently and get past this?

1 answer

I think I can explain the problem in your code. The OS error is:

"

 ERROR_DS_RIDMGR_DISABLED 8263 (0x2047) 

The directory service has detected that the subsystem that allocates relative identifiers is disabled. This can occur as a protective mechanism when the system determines that a significant portion of the relative identifiers (RIDs) has been exhausted.

"

(via https://msdn.microsoft.com/en-us/library/windows/desktop/ms681390)

I think you have exhausted a significant portion of the RIDs at this point in your code:

 dataset['description'] = dataset['description'].apply(lambda x: list(set([lemmatizer.lemmatize(item,tag) for item in x]))) 

You pass the lemmatizer in your lambda, but lambdas are anonymous, so it looks like you may be creating 2.2 million copies of that lemmatizer at runtime.

Since you are having memory problems, you should also try changing the low_memory flag to True.
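A minimal sketch of that change, keeping the other read_csv arguments from your code:

    # low_memory=True lets pandas parse the file in smaller internal chunks,
    # lowering peak memory during the read at the cost of possible
    # mixed-type columns
    dataset = pd.read_csv("trainset.csv", encoding="ISO-8859-1", low_memory=True)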

Reply to the comment:

I checked the pandas documentation, and you can define the function outside of dataset['description'].apply() and then refer to that function in the call to dataset['description'].apply(). This is how I would write that function:

    def lemmatize_descriptions(x):
        return list(set([lemmatizer.lemmatize(item, tag) for item in x]))

Then the call to apply() would be:

 dataset['description'] = dataset['description'].apply(lemmatize_descriptions) 
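For completeness, the per-tag loop from your code would then become (tag and lemmatizer are still read from the enclosing scope, just as with the lambda):

    # same loop as before, now calling the named function instead of a lambda
    for tag in POS_LIST:
        dataset['description'] = dataset['description'].apply(lemmatize_descriptions)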

Here is the documentation.
