How to programmatically determine whether a URL belongs to an e-commerce or non-e-commerce website?

My project has a module that accepts a URL and determines whether it belongs to an e-commerce or a non-e-commerce website.

I tried the following approaches:

  • Using Apache Mahout classification: URL ---> fetch the HTML dump ---> preprocess the HTML dump:

    a) remove all HTML tags

    b) remove stop words (aka common words) such as CDATA, href, value, and, of, between, etc.

    c) train a model and then test it.
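Steps a) and b) can be sketched roughly as follows. This is only a minimal illustration using Python's standard library, not the exact preprocessing used with Mahout; the `STOP_WORDS` set, the `TextExtractor` class, and the `preprocess` function are hypothetical names, and a real stop-word list would be much longer.

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list, seeded with the examples from step b).
STOP_WORDS = {"cdata", "href", "value", "and", "of", "between", "the", "a", "to", "in"}


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def preprocess(html):
    """Strip tags, lowercase the text, tokenize, and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token list would then be written out in whatever input format the classifier expects.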

I used the following options for training:

bin/mahout trainclassifier \
  -i training-data \
  -o bayes-model \
  -type bayes -ng 1

Testing:

bin/mahout testclassifier \
  -d test-data \
  -m bayes-model \
  -type bayes -source hdfs -ng 1 -method sequential

I get 73% accuracy with the bayes algorithm and 52% with the cbayes algorithm.

I am going to improve the preprocessing stage by extracting information that is typically found on e-commerce websites, for example a "Checkout" button, a "recipient link", a price / dollar symbol, "Cash on delivery" text, "30 days guarantee", etc.

Any suggestions on how to extract this information, or any other ways to classify a site as e-commerce or non-e-commerce?

1 answer

I am very surprised that you get such good accuracy with simple HTML extraction and a Bayes classifier.

But you seem to be on the right track with features like the checkout button and pricing.
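As a rough illustration of how such cues could be turned into features, here is a minimal sketch in Python. It assumes the page has already been reduced to plain text; the `CUES` patterns and the `cue_features` function are hypothetical and would need tuning against real pages.

```python
import re

# Hypothetical cue patterns based on the signals mentioned in the question.
CUES = {
    "checkout": re.compile(r"\bcheck\s?out\b", re.I),
    "price": re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?"),
    "cash_on_delivery": re.compile(r"\bcash on delivery\b", re.I),
    "guarantee": re.compile(r"\b\d+[- ]days? (?:money[- ]back )?guarantee\b", re.I),
}


def cue_features(text):
    """Return a dict mapping each cue name to its match count in text."""
    return {name: len(pattern.findall(text)) for name, pattern in CUES.items()}
```

These counts could be appended to the bag-of-words features, so the classifier sees a few strong, hand-picked signals alongside the raw tokens.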

Here is a paper I came across yesterday while reading about Yandex:

To find out or buy? Product Review vs Web Store Classifier

It is about how to distinguish these two kinds of sites and describes some of the methods the authors used. They also used an SVM instead of naive Bayes.


Source: https://habr.com/ru/post/906587/

