My project has a module that takes a URL and determines whether it belongs to an e-commerce or a non-e-commerce website.
I tried the following approach, using Apache Mahout for classification:
URL ---> take HTML dump ---> preprocess HTML dump:
a) remove all HTML tags
b) remove stop words (common words) such as CDATA, href, value, and, of, between, etc.
c) train a model and then test it.
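Steps a) and b) could be sketched roughly as follows in Python. This is only an illustration under assumptions: the `STOP_WORDS` list, the `TextExtractor` class, and the `preprocess` helper are hypothetical names that are not part of Mahout, and the tokenizer keeps only ASCII letters.

```python
import re
from html.parser import HTMLParser

# Illustrative stop-word list (assumption): markup leftovers plus common words.
STOP_WORDS = {"cdata", "href", "value", "and", "of", "between", "the", "a", "to", "in"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def preprocess(html):
    """Strip tags, lowercase, tokenize, and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token list would then be written out, one document per line, in whatever layout the training step expects.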
I used the following options for training:
bin/mahout trainclassifier -i training-data -o bayes-model -type bayes -ng 1
Testing:
bin/mahout testclassifier -d test-data -m bayes-model -type bayes -source hdfs -ng 1 -method sequential
I get 73% accuracy with bayes, and 52% with the cbayes algorithm.
I plan to improve the preprocessing stage by extracting cues that typically appear on e-commerce websites, for example a "Checkout" button, a "recipient link", a price / dollar symbol, "Cash on delivery" text, "30 days guarantee", etc.
Any suggestions on how to extract this information, or on other ways to classify a site as e-commerce or non-e-commerce?
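As a starting point, here is a minimal sketch of extracting such cues with regular expressions. The patterns and the names `CUE_PATTERNS` and `cue_features` are assumptions made for illustration; they cover only some of the cues listed above and would need tuning on real pages. The function expects the page text after tag removal but before stop-word removal, since phrases such as "cash on delivery" contain stop words.

```python
import re

# Hypothetical cue patterns (assumptions, not an exhaustive list).
CUE_PATTERNS = {
    "checkout": re.compile(r"\bcheck\s?out\b", re.I),
    "price": re.compile(r"\$\s?\d+(?:\.\d{2})?"),
    "cash_on_delivery": re.compile(r"\bcash on delivery\b", re.I),
    "guarantee": re.compile(r"\b\d+[- ]days?\s+guarantee\b", re.I),
}


def cue_features(text):
    """Count how many times each cue pattern occurs in the page text."""
    return {name: len(p.findall(text)) for name, p in CUE_PATTERNS.items()}
```

The counts could then be used as extra features, for instance by appending a marker token per matched cue to the document before training. Whether that actually improves the accuracy would have to be measured.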