GVim regular expression to remove duplicate domains from a list

I need a regular expression written for use in gVim that will remove duplicate domains from a list of URLs (gVim can be downloaded here: http://www.vim.org/download.php).

I have a list of over 6,000,000 URLs in a .txt file (which opens in gVim for editing).

URLs are in this format:

http://www.example.com/some-url.php
http://example2.com/another_url.html
http://example3.com/
http://www.example4.com/anotherURL.htm
http://www.example.com/some-url2.htm
http://example.com/some-url3.html
http://www.example2.com/somethingelse.php
http://example5.com

In other words, there is no single URL format: some have www., some do not, and the paths all look different.

I need a regular expression for gVim that will remove all DOMAIN duplicates from the list (each duplicate together with its whole URL line), keeping only the first instance found.

That is, for the sample above, the list should end up like this:

http://www.example.com/some-url.php
http://example2.com/another_url.html
http://example3.com/
http://www.example4.com/anotherURL.htm
http://example5.com

For reference, here are two pages on regular expressions in gVim:

http://supportweb.cs.bham.ac.uk/documentation/tutorials/docsystem/build/tutorials/gvim/gvim.html#Vi-Regular-Expressions

http://www.softpanorama.org/Editors/Vimorama/vim_regular_expressions.shtml


You could do it with a single substitution: %s!\v%(^http://%(www\.)?(%([^./]+\.)+[^./]+)%(/.*)?$\_.{-})@<=^http://%(www\.)?\1%(/.*)?\n!!g, but on 6,000,000 URLs it will be very slow. A faster approach:

:let g:gotDomains={}
:%g/^/let curDomain=matchstr(getline('.'), '\v^http://%(www\.)?\zs[^/]+') | if !has_key(g:gotDomains, curDomain) | let g:gotDomains[curDomain]=1 | else | delete _ | endif

Explanation:

  • let g:gotDomains={} initializes an empty dictionary that will hold the domains already seen.
  • %g/^/{command} runs {command} on every line of the buffer.
  • let curDomain=matchstr(...) extracts the domain from the current line:

    • getline('.') returns the text of the current line.
    • \v switches the pattern to "very magic" mode (fewer backslashes needed).
    • ^ anchors the pattern at the start of the line.
    • \zs marks where the reported match starts (everything before \zs must match but is not returned, so the http:// and optional www. prefixes are skipped).
  • if !has_key(g:gotDomains, curDomain) checks whether this domain has been seen before.

  • let g:gotDomains[curDomain]=1 records the domain in the dictionary (the value 1 is arbitrary; only the key matters).
  • delete _ deletes the duplicate line into the blackhole register (so no register is clobbered by the deleted text).
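The dictionary-based approach above can be sketched outside Vim as well. A minimal Python equivalent (a sketch, assuming the same first-occurrence rule and the same treatment of an optional www. prefix; dedupe_by_domain is a hypothetical helper name):

```python
import re

# Mirrors the Vim pattern '\v^http://%(www\.)?\zs[^/]+':
# skip "http://" and an optional "www.", capture up to the next "/".
DOMAIN_RE = re.compile(r'^http://(?:www\.)?([^/]+)')

def dedupe_by_domain(lines):
    """Keep only the first URL seen for each domain."""
    seen = {}   # plays the role of g:gotDomains
    kept = []
    for line in lines:
        m = DOMAIN_RE.match(line)
        domain = m.group(1) if m else line  # non-URL lines are kept as-is
        if domain not in seen:
            seen[domain] = 1
            kept.append(line)
    return kept
```

Run on the eight sample URLs above, this keeps exactly the five lines shown in the expected output, since www.example.com and example.com count as the same domain.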

Or simply:

%! sort | uniq
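Note that this filter removes only byte-identical lines, not domain duplicates: all eight sample URLs are distinct strings, so all eight would survive. A quick Python mimic of sort | uniq illustrates this (sort_uniq is a hypothetical helper name, not part of any answer):

```python
# Mimic the effect of ':%! sort | uniq': sort all lines,
# then drop adjacent (i.e. byte-identical) duplicates.
def sort_uniq(lines):
    out = []
    for line in sorted(lines):
        if not out or out[-1] != line:  # keep the first of each identical run
            out.append(line)
    return out
```

For example, sort_uniq(["http://www.example.com/a", "http://example.com/b"]) keeps both lines even though the domain is the same, because the lines themselves differ.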

Source: https://habr.com/ru/post/1771036/

