Crawler for MP3 links

I am looking for a good way to implement this. I am working on a simple website crawler that goes through a specific set of websites and collects all the MP3 links into a database. I don’t want to download the files, just crawl the links, index them and make them searchable. So far it works for some sites, but others use URL redirects and similar tricks that confuse the crawler.

Any ideas? How does beemp3.com index all of these links?

Thanks

+3
3 answers

You can request the HTTP headers for each link and check the MIME type. If the Content-Type is audio/mpeg, you have found an MP3 link.
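A minimal sketch of that header check in Python, using only the standard library. The function names and the set of accepted MIME types are assumptions made for illustration; servers are not consistent about what they report for MP3 files. `urllib` follows redirects on its own, which may help with the redirecting sites mentioned in the question, though it may reissue a redirected request as GET. The body is never read either way.

```python
import urllib.request
from typing import Optional

# MIME types sometimes reported for MP3 files (an assumption; servers vary).
MP3_MIME_TYPES = {"audio/mpeg", "audio/mp3", "audio/x-mpeg", "audio/x-mp3"}


def is_mp3_content_type(content_type: Optional[str]) -> bool:
    """Return True if a Content-Type header value denotes MP3 audio."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in MP3_MIME_TYPES


def resolve_mp3_link(url: str, timeout: float = 10.0) -> Optional[str]:
    """Request only the headers for `url`, following redirects.

    Returns the final URL if the server reports an MP3 MIME type,
    otherwise None.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            if is_mp3_content_type(response.headers.get("Content-Type")):
                return response.url  # URL after any redirects
    except (OSError, ValueError):
        # Unreachable host, HTTP error status, or malformed URL.
        return None
    return None
```

Calling `resolve_mp3_link` on every link the crawler finds and storing the non-None results would give you the final, post-redirect URLs to index. Some servers reject HEAD requests or report a generic type such as application/octet-stream, so a fallback on the file extension may still be needed.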

+1

Another option is to let Google do the indexing. Replace QUERY_TEXT with your search terms in this Google query:

QUERY_TEXT intitle:"index.of" "parent directory" "size" "last modified" "description"
[snd] (mp4|mp3|avi)
-inurl:(jsp|php|html|aspx|htm|cf|shtml|lyrics|mp3s|mp3|index)
-gallery
-intitle:"last modified"
-intitle:(intitle|mp3)
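As a rough illustration, the query above could be assembled into a search URL like this. The constant and function names are hypothetical, and the sketch only builds the URL; fetching or parsing the results is left out. How Google treats these operators may change over time.

```python
from urllib.parse import quote_plus

# Operators copied from the query above, joined into a single line.
DIRECTORY_LISTING_FILTER = (
    'intitle:"index.of" "parent directory" "size" "last modified" "description" '
    "[snd] (mp4|mp3|avi) "
    "-inurl:(jsp|php|html|aspx|htm|cf|shtml|lyrics|mp3s|mp3|index) "
    '-gallery -intitle:"last modified" -intitle:(intitle|mp3)'
)


def build_search_url(query_text: str) -> str:
    """Substitute `query_text` for QUERY_TEXT and URL-encode the whole query."""
    full_query = f"{query_text} {DIRECTORY_LISTING_FILTER}"
    return "https://www.google.com/search?q=" + quote_plus(full_query)
```

For example, `build_search_url("some artist")` gives a URL whose `q` parameter starts with `some+artist+intitle%3A%22index.of%22`.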
0


Python:
Take a look at Scrapy (a crawling framework written in Python), which is similar in style to the Django Framework. IIRC it follows the DRY principle (as Django does).


Perhaps you can edit your question and add information about your crawler: is it written from scratch, is it an off-the-shelf solution, etc.?

0

Source: https://habr.com/ru/post/1712577/

