Resolution absolute path from relative path

I am creating a web crawler and I am trying to find a way to find the absolute path from a relative path. I took 2 test sites. One in ROR and 1, made using Pyro CMS.

In the latter, I found href tags with the link "index.php". So, if I scan on now http://example.com/xyz, then my crawler will add and make it http://example.com/xyz/index.php. But the problem is that I have to add to the root instead of ie, it should be http://example.com/index.php. Therefore, if I scan http://example.com/xyz/index.php, I will find another "index.php" that will be added again.

So far in ROR, if the relative path starts with '/', I could easily find out that this is the root site.

I can handle the case of index.php, but there can be so many rules that I need to take care of if I start doing it manually. I am sure there is an easier way to do this.

+2
source share
1 answer

At Go, the package pathis your friend.

You can get a directory or folder from a path using path.Dir(), for example

p := "/xyz/index.php"
dir := path.Dir(p)
fmt.Println("dir:", dir) // Output: "/xyz"

If you find a link with a root path (starts with a slash), you can use it as is.

If it is relative, you can join it using the dirabove using path.Join(). Join()also "clear" the URL:

p2 := path.Join(dir, "index.php")
fmt.Println("p2:", p2)
p3 := path.Join(dir, "./index.php")
fmt.Println("p3:", p3)
p4 := path.Join(dir, "../index.php")
fmt.Println("p4:", p4)

Conclusion:

p2: /xyz/index.php
p3: /xyz/index.php
p4: /index.php

"", path.Join(), path.Clean(), , . :

  • .
  • . ( ).
  • .. path name ( ) .., .
  • .., , "/.." "/" .

"" URL ( , ..), url.Parse() url.URL raw url, URL- , :

uraw := "http://example.com/xyz/index.php"
u, err := url.Parse(uraw)
if err != nil {
    fmt.Println("Invalid url:", err)
}
fmt.Println("Path:", u.Path)

:

Path: /xyz/index.php

Go Playground.

+1

Source: https://habr.com/ru/post/1689188/


All Articles