Fetching URLs from Emacs Buffer?

Question

How can I write an Emacs Lisp function that finds all hrefs in an HTML file and extracts the links together with their link text?

Input:

<html>
 <a href="http://www.stackoverflow.com" _target="_blank">StackOverFlow</a>
 <h1>Emacs Lisp</h1>
 <a href="http://news.ycombinator.com" _target="_blank">Hacker News</a>
</html>

Output:

http://www.stackoverflow.com | StackOverFlow
http://news.ycombinator.com | Hacker News

I saw the re-search-forward function mentioned several times while searching. Based on what I have read so far, this is what I think I need to do:

(defun extra-urls (file)
 ...
 (setq buffer (...
 (while
        (re-search-forward "http://" nil t)
        (when (match-string 0)
...
)))

+3

elisp

anon 29 . '09 7:58

3

(defun extract-urls (fname)
  "Extract href URLs and link titles from the HTML file FNAME.
Append them to the buffer \"new-urls.csv\" as URL|TITLE lines."
  (let ((links '()))
    (with-temp-buffer                   ; no file buffer left behind to clean up
      (insert-file-contents fname)
      (goto-char (point-min))
      ;; No leading "^.*", so several links on one line are all found;
      ;; "[^>]*" also accepts an <a> tag with no attributes after href.
      (while (re-search-forward
              "<a href=\"\\([^\"]+\\)\"[^>]*>\\([^<]+\\)</a>" nil t)
        (push (concat (match-string 1) "|" (match-string 2) "\n") links)))
    (with-current-buffer (get-buffer-create "new-urls.csv") ; send results to new buffer
      (goto-char (point-max))
      (mapc #'insert (nreverse links))) ; nreverse restores document order
    (switch-to-buffer "new-urls.csv"))) ; finally, show the new buffer

;; Process a list of files (mapc, since only the side effects matter)
;;
(mapc #'extract-urls '("/tmp/foo.html"
                       "/tmp/bar.html"))

+5

anon 01 . '09 14:02

You can use the 'xml library for this, for example:

(require 'xml)

(defun my-grab-html (file)
  "Insert URL|TEXT lines at point for the links in the HTML file FILE.
Only direct children of the root element are examined."
  (interactive "fHtml file: ")
  (let ((res (car (xml-parse-file file)))) ; car, because xml-parse-file returns a list of nodes
    (mapc (lambda (n)
            (when (consp n) ; skip whitespace strings; the parser preserves whitespace
              (let ((link (cdr (assq 'href (xml-node-attributes n)))))
                (when link
                  (insert link)
                  (insert "|")
                  (insert (car (xml-node-children n))) ; the text of the link
                  (insert "\n")))))
          (xml-node-children res))))
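
The function above only looks at the direct children of the root element, which is enough for the sample input. If the links may be nested deeper (inside a div, say), a recursive walk over the parsed tree is one option. This is a minimal sketch, assuming well-formed markup that xml.el can parse; `my-collect-links` is a hypothetical helper name, not part of xml.el:

```elisp
(require 'xml)

(defun my-collect-links (node)
  "Return a list of (HREF . TEXT) pairs for every `a' element under NODE.
NODE is a parsed node as produced by xml.el.  Hypothetical helper."
  (let ((links '()))
    (when (consp node)                  ; strings are text/whitespace, skip them
      (let ((href (xml-get-attribute-or-nil node 'href)))
        (when (and href (eq (xml-node-name node) 'a))
          ;; Assumes the first child of the link is its text, as in the sample.
          (push (cons href (car (xml-node-children node))) links)))
      ;; Recurse into the children and keep document order.
      (dolist (child (xml-node-children node))
        (setq links (append links (my-collect-links child)))))
    links))
```

It can be tried on a string without touching any file:

```elisp
(with-temp-buffer
  (insert "<html><div><a href=\"http://example.org\">Example</a></div></html>")
  (my-collect-links (car (xml-parse-region (point-min) (point-max)))))
```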

+1

Trey Jackson 29 . '09 16:05

Heinzi · Accepted Answer · 2009-10-29T10:09:30+0000

(defun getlinks ()
  (beginning-of-buffer)
  (replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
  (beginning-of-buffer)
  (replace-regexp "^\\([^L]\\|\\(L[^I]\\)\\|\\(LI[^N]\\)\\|\\(LIN[^K]\\)\\).*$" "")
  (beginning-of-buffer)
  (replace-regexp "
+" "
")
  (beginning-of-buffer)
  (replace-regexp "^LINK:\\(.*\\)$" "\\1")
)

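This replaces every line containing a link with LINK:url|description, empties all lines that contain anything else, collapses the resulting empty lines, and finally removes the "LINK:" prefix.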
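
If you would rather leave the buffer untouched, the same link pattern can be used to collect the results in a list instead. This is a minimal sketch, not part of the answer's function: `my-getlinks-as-list` is a hypothetical name, and the regexp drops the leading "^.*" so that several links on one line are all found.

```elisp
(defun my-getlinks-as-list ()
  "Return \"URL|DESCRIPTION\" strings for the links in the current buffer.
Hypothetical non-destructive variant; the buffer text is not modified."
  (save-excursion
    (goto-char (point-min))
    (let ((links '()))
      (while (re-search-forward
              "<a href=\"\\([^\"]+\\)\"[^>]*>\\([^<]+\\)</a>" nil t)
        (push (concat (match-string-no-properties 1) "|"
                      (match-string-no-properties 2))
              links))
      (nreverse links))))               ; push reverses, so restore document order
```

With the HTML buffer current, evaluate `(my-getlinks-as-list)` with M-: to see the list in the echo area.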

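HOWTO: (1) fix the bug in your example HTML file by replacing <href with <a href, (2) copy the function above into the Emacs *scratch* buffer, (3) press C-x C-e after the final ")" to load the function, (4) open your example HTML file, (5) run the function with M-: (getlinks).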

Note: the line breaks inside the third replace-regexp are significant. Do not indent those two lines.
