Fetching URLs from Emacs Buffer?

How to write an Emacs Lisp function to find all hrefs in an HTML file and extract all links?

Input:

<html>
 <a href="http://www.stackoverflow.com" _target="_blank"> StackOverFlow </a>
 <h1> Emacs Lisp </h1>
 <a href="http://news.ycombinator.com" _target="_blank"> Hacker News </a>
</html>

Output:

http://www.stackoverflow.com | StackOverFlow
http://news.ycombinator.com | Hacker News

I saw the re-search-forward function mentioned several times during my search. Based on what I have read so far, this is what I think I need to do:

(defun extra-urls (file)
 ...
 (setq buffer (...
 (while
        (re-search-forward "http://" nil t)
        (when (match-string 0)
...
)))

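If there is at most one link per line, one approach is to rewrite the buffer in place with a series of regexp replacements: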

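(defun getlinks ()
  "Reduce the current buffer to one url|description line per link."
  ;; 1. Tag each line that contains a link as LINK:url|description.
  (goto-char (point-min))
  (replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
  ;; 2. Blank out every line that does not start with LINK.
  ;;    The \n in each set keeps a match from spilling onto the next line.
  (goto-char (point-min))
  (replace-regexp "^\\([^L\n]\\|\\(L[^I\n]\\)\\|\\(LI[^N\n]\\)\\|\\(LIN[^K\n]\\)\\).*$" "")
  ;; 3. Collapse runs of newlines, removing the blanked lines.
  (goto-char (point-min))
  (replace-regexp "\n+" "\n")
  ;; 4. Strip the LINK: tag.
  (goto-char (point-min))
  (replace-regexp "^LINK:\\(.*\\)$" "\\1"))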

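It replaces every link with LINK:url|description, blanks every line that contains anything else, deletes the empty lines, and finally removes the "LINK:" prefix.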

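HOWTO: (1) correct the bug in your example HTML file by replacing <href with <a href, (2) copy the function above into the Emacs scratch buffer, (3) press C-x C-e after the final ")" to load the function, (4) open your example HTML file, (5) run the function with M-: (getlinks).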

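Note: the third replace-regexp matches newline characters, so its pattern and replacement must contain nothing but the newline. Do not indent or pad them.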
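
For comparison, the same regexp idea can be applied without modifying the buffer. The following is a minimal sketch, not part of the answer above: my-links-in-string is a hypothetical helper name, it assumes double-quoted href attributes that come first in the tag, and unlike getlinks it accepts several links on one line.

```elisp
(require 'subr-x) ; for string-trim on older Emacs versions

(defun my-links-in-string (html)
  "Return a list of (URL . TEXT) pairs for simple <a href> links in HTML.
Hypothetical helper: only double-quoted href attributes placed
directly after \"<a \" are recognised."
  (let ((links '()))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      ;; Group 1 is the URL, group 2 is the link text.
      (while (re-search-forward
              "<a href=\"\\([^\"]+\\)\"[^>]*>\\([^<]+\\)</a>" nil t)
        (push (cons (match-string 1) (string-trim (match-string 2)))
              links)))
    ;; `push' builds the list backwards, so restore document order.
    (nreverse links)))

;; Usage: format each pair as "url | text", as in the question.
(mapconcat (lambda (link) (concat (car link) " | " (cdr link)))
           (my-links-in-string
            "<a href=\"http://news.ycombinator.com\" _target=\"_blank\"> Hacker News </a>")
           "\n")
```

Returning a list instead of editing the buffer leaves the HTML file untouched and lets the caller decide how to format the result.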


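Building on the regexp above, this version takes a file name, collects every URL and title, and writes the results to a separate buffer.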

(defun extract-urls (fname)
  "Extract HTML href URLs and titles from the file FNAME.
Results are appended to the buffer \"new-urls.csv\" as url|title lines."
  (let ((urls '()))
    ;; Read the file into a temporary buffer, so no file buffer is left
    ;; behind and an already open buffer for FNAME is not disturbed.
    (with-temp-buffer
      (insert-file-contents fname)
      (goto-char (point-min))
      (while (re-search-forward
              "<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>" nil t)
        ;; Group 1 is the URL, group 2 is the title.
        (push (concat (match-string 1) "|" (match-string 2) "\n") urls)))
    ;; Send the results to the output buffer, in document order.
    (with-current-buffer (get-buffer-create "new-urls.csv")
      (goto-char (point-max))
      (mapc #'insert (nreverse urls)))
    ;; Finally, show the output buffer.
    (switch-to-buffer "new-urls.csv")))

;; Process a list of files
(mapc #'extract-urls '("/tmp/foo.html"
                       "/tmp/bar.html"))


You can use the built-in 'xml library. To parse a file like yours, the following does what you want:

(require 'xml)

(defun my-grab-html (file)
  "Insert href|text at point for each link directly under the root of FILE."
  (interactive "fHtml file: ")
  ;; `car' because `xml-parse-file' returns a list of nodes.
  (let ((res (car (xml-parse-file file))))
    (mapc (lambda (n)
            ;; Skip string children: the parser preserves whitespace as text.
            (when (consp n)
              (let ((link (cdr (assq 'href (xml-node-attributes n))))
                    (text (car (xml-node-children n)))) ; text for the link
                (when link
                  ;; Insert only string text; a link may wrap another element.
                  (insert link "|" (if (stringp text) text "") "\n")))))
          (xml-node-children res))))

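Note that this relies on the HTML being well-formed XML, and it only inspects the direct children of the root element.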
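
When the links are nested deeper in the document, the parse tree can be walked recursively. This is a rough sketch under the same well-formedness assumption: my-collect-links is a hypothetical name, and the input string is made up for illustration.

```elisp
(require 'xml)
(require 'subr-x) ; for string-trim on older Emacs versions

(defun my-collect-links (node)
  "Return a list of (HREF . TEXT) for every `a' element at or below NODE.
Hypothetical helper: NODE is a parse tree as returned by xml.el."
  (let ((found '()))
    (when (consp node)
      ;; Record this node if it is a link with an href attribute.
      (when (eq (xml-node-name node) 'a)
        (let ((href (xml-get-attribute-or-nil node 'href))
              (text (car (xml-node-children node))))
          (when href
            (push (cons href (if (stringp text) (string-trim text) ""))
                  found))))
      ;; Recurse into the children; string children are ignored
      ;; by the `consp' test above.
      (dolist (child (xml-node-children node))
        (setq found (append found (my-collect-links child)))))
    found))

;; Usage: parse a string in a temporary buffer and collect its links.
(with-temp-buffer
  (insert "<html><body><p><a href=\"http://example.com\"> Example </a></p></body></html>")
  (my-collect-links (car (xml-parse-region (point-min) (point-max)))))
```

Real-world HTML is often not valid XML, so this only helps for clean input such as the example in the question.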


Source: https://habr.com/ru/post/1721370/

