How to classify URLs? What are the features of a URL? How to select and extract features from a URL?

I have just started working on a classification problem. It is a two-class problem: my trained machine learning model will have to decide (predict) whether to allow a URL or block it.

My question is very specific.

  • How to classify URLs? Should I use conventional text analysis methods?
  • What are the features of a URL?
  • How to select and extract features from a URL?
1 answer

I assume that you do not have access to the content behind the URL, so you can only extract features from the URL string itself. Otherwise, it makes sense to use the content as well.

Here are some features that I would try. See this document for more ideas:

  • All components of the URL. For example, this page has the following URL:

    http://stackoverflow.com/questions/26456904/how-to-classify-urls-what-are-urls-features-how-to-select-and-extract-features

Tokens that appear in the different parts of the URL should each carry some value for classification. In this case, the last part of the path, once tokenized, yields strong features for this page (e.g. classify, urls, select, extract, features). Tokenizing the whole URL gives:

    • stackoverflow
    • com
    • questions
    • 26456904
    • how to classify urls what are urls features how to select and extract features
  • URL length
  • n-grams (2-grams shown as examples below)
    • stackoverflow-com
    • com-questions
    • questions-26456904
    • 26456904-how
    • how-to
    • ...
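
The three feature types above (URL tokens, URL length, token 2-grams) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function names `tokenize_url`, `bigrams` and `extract_features`, the choice to split on every non-alphanumeric character, and the `tok=` / `2gram=` key prefixes are my own assumptions, not something prescribed by the answer.

```python
import re
from urllib.parse import urlparse


def tokenize_url(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens.

    Assumption: the scheme (http/https) is ignored and every run of
    non-alphanumeric characters acts as a separator.
    """
    parsed = urlparse(url)
    text = parsed.netloc + parsed.path
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def bigrams(tokens):
    """Join each pair of adjacent tokens with a hyphen (2-grams)."""
    return [f"{a}-{b}" for a, b in zip(tokens, tokens[1:])]


def extract_features(url):
    """Build a sparse feature dict: URL length, token counts, 2-gram counts."""
    tokens = tokenize_url(url)
    features = {"url_length": len(url)}
    for tok in tokens:
        key = f"tok={tok}"
        features[key] = features.get(key, 0) + 1
    for gram in bigrams(tokens):
        key = f"2gram={gram}"
        features[key] = features.get(key, 0) + 1
    return features


url = ("http://stackoverflow.com/questions/26456904/"
       "how-to-classify-urls-what-are-urls-features-"
       "how-to-select-and-extract-features")
tokens = tokenize_url(url)
features = extract_features(url)
print(tokens[:5])           # first tokens: host parts, "questions", the id, ...
print(bigrams(tokens)[:5])  # first 2-grams, hyphen-joined
```

A dictionary like `features` can then be turned into a numeric vector and passed to whatever two-class model you train; which of these features actually help is something to check on your own labelled URLs.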

Source: https://habr.com/ru/post/976953/

