How to extract movie title from file name

I am trying to extract movie metadata (title and year) from file names.

The naming pattern is not standardized, but it is not random either, so I am trying to cover as many cases as possible.
To give you an idea, here are some example file names:

samples = [
    'The Movie Title.avi',
    'The Movie Title DVDRIP. Useless.info.avi',
    'The Movie Title [2005].avi',
    'The Movie Title (2005) [Useless.info].avi',
    'The Movie Title 2005 H264 DVDRip Useless-Info.avi',
    'The Movie Title 2005 XviD Useless info.avi',
    'The Movie Title {2005} DVDRIP. UselessInfo.avi',
    'The.Movie.Title.2005.Useless.info.avi',
    '[Useless.info]_The.Movie.Title.2005.Useless.avi',
]

I wrote "Useless info" in some places because what appears there can be anything and cannot be used to extract information (it changes from file to file). Also note that 'The Movie Title' may contain numbers or non-alphanumeric characters, for example 'The Movie Title 2 - The Return'.

The expected output is:

 metadata = {'title': 'The Movie Title', 'year': '2005'} 

I am currently using a chain of regexes, but I don't know what the best way to do this is.

3 answers

As you mentioned in one of the comments, the purpose of processing file names into a standardized movie title is to compare two lists.

With your current approach, you are likely to miss many corner cases.

First of all, you need to think carefully about which variations you accept. You mentioned different positions for "the" in the movie title. What about spelling errors and letter case? What about word order?

Instead of making your code longer and longer, I would recommend that you look for a more general solution.

A few ideas came to my mind. Take what you like, mix as you like, heat a little, and it will be well cooked. Here we go:

  • LCS: the longest common substring problem or the longest common subsequence problem. Useful when:
    • word order is important.
    • you want something general: just specify how long the common substring / subsequence must be as a percentage of the input (the max, min, average, or sum of the two file name lengths; your choice).
  • Match sets of words instead of strings. This makes the comparison robust to word order, repetition, and similar variations. Since you are writing in Python, it is easy to build sets of words or a map keyed by word sets. Here are some suggestions:
    • For each movie, instead of applying a regular expression to the whole string: (1) split the movie file name into words, (2) eliminate stop words such as "the", "movie", etc., (3) stem the remaining words ("walking" minus "ing" → "walk", etc.), (4) put the remaining words into a set, (5) the resulting set represents the movie.
    • For each list: convert all movie file names to sets (as described above) and put all these sets into a set. You now have a set of sets of strings (in Python the inner sets must be frozenset objects so that they are hashable).
    • For lists A and B: just compute A ^ B or A - B, whichever you need (see the Python manual: set).
  • If you later need to get from the set representing a movie back to the movie file name, then when creating A and B also build maps MA and MB that map each "set of words" to its "file name".
  • LCS again, but now imagine that your alphabet consists of words. If you are not familiar with formal languages, imagine that the movie name is written in special letters, where each letter is exactly one word. You then have a sequence of words and can search for a subsequence of words. Applying LCS now gives you the longest common subsequence of words in the two movie titles.
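
The word-set and word-level LCS ideas above could be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word list, the helper names (`words`, `word_set`, `lcs_length`) and the sample file names are hypothetical, and the stemming step (3) is left out for brevity.

```python
import re

# Hypothetical stop-word list; extend it with whatever noise your files contain.
STOP_WORDS = {"the", "a", "an", "avi", "dvdrip", "xvid", "h264"}

def words(file_name):
    """Split a file name into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", file_name.lower())

def word_set(file_name):
    """Represent a file name as a frozenset of words, stop words removed."""
    return frozenset(w for w in words(file_name) if w not in STOP_WORDS)

def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

# Hypothetical sample lists to compare.
list_a = ["The.First.Title.2005.XviD.avi", "Second Title (1999).avi"]
list_b = ["First Title, The [2005] DVDRip.avi", "Third Title 2010.avi"]

# Maps MA and MB: word set -> original file name.
ma = {word_set(name): name for name in list_a}
mb = {word_set(name): name for name in list_b}

common = ma.keys() & mb.keys()     # movies present in both lists
only_in_a = ma.keys() - mb.keys()  # movies present only in list A

print([ma[s] for s in common])
print([ma[s] for s in only_in_a])
print(lcs_length(words(list_a[0]), words(list_b[0])))
```

The set comparison ignores word order entirely, while `lcs_length` respects it; dividing its result by the length of the shorter or longer word list gives the percentage threshold mentioned in the first bullet.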

Why not download a database (possibly from Wikipedia) with a list of movie titles and release dates, and then compare the file names against that list? There are so many edge cases that this may be more effective.
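
A minimal sketch of that lookup, assuming the database has already been loaded into a list of (title, year) pairs. The `KNOWN_MOVIES` data, the helper names and the 0.6 cutoff are assumptions made for illustration; the fuzzy matching uses `difflib.get_close_matches` from the standard library.

```python
import difflib
import re

# Hypothetical reference data, e.g. loaded from a downloaded movie database.
KNOWN_MOVIES = [
    ("The Movie Title", "2005"),
    ("Another Movie", "1999"),
    ("The Movie Title 2 - The Return", "2007"),
]

def normalize(text):
    """Lowercase the text and keep only alphanumeric words, space separated."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def best_match(file_name, cutoff=0.6):
    """Return the closest known movie as a metadata dict, or None if no match."""
    stem = file_name.rsplit(".", 1)[0]  # drop the file extension
    candidates = {normalize(f"{title} {year}"): (title, year)
                  for title, year in KNOWN_MOVIES}
    hits = difflib.get_close_matches(normalize(stem), list(candidates),
                                     n=1, cutoff=cutoff)
    if not hits:
        return None
    title, year = candidates[hits[0]]
    return {"title": title, "year": year}

print(best_match("The Movie Title [2005].avi"))
print(best_match("The.Movie.Title.2005.Useless.info.avi"))
```

The cutoff is a trade-off: a lower value tolerates more junk in the file name but raises the risk of matching the wrong movie, so it would need tuning against real data.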


This question is old, but in case someone still needs it: I found a Python library called PTN very useful. Many thanks to the person who wrote it!

Install it: pip install parse-torrent-name

import PTN

torrentName = "[Torrent9.info ] Silicon.Valley.S04E04.VOSTFR.WEB-DL.XviD-T9.avi"
info = PTN.parse(torrentName)
print(info)

Output: {'episode': 4, 'codec': 'XviD', 'title': 'Silicon.Valley.', 'group': 'T9', 'website': 'Torrent9.info', 'excess': 'VOSTFR', 'season': 4, 'quality': 'WEB-DL'}

So this is what you need!


Source: https://habr.com/ru/post/1391706/

