How to convert text to title case?

I have a text file containing a list of titles that I need to convert to title case (each word should start with a capital letter, except for most articles, conjunctions and prepositions).

For example, this list of book titles:

    barbarians at the gate
    hot, flat, and crowded
    A DAY LATE AND A DOLLAR SHORT
    THE HITCHHIKER GUIDE TO THE GALAXY

should be changed to:

    Barbarians at the Gate
    Hot, Flat, and Crowded
    A Day Late and a Dollar Short
    The Hitchhiker Guide to the Galaxy

I wrote the following code:

    while (<DATA>) {
        # Uppercase any lowercase letter that follows whitespace
        $_ =~ s/(\s+)([a-z])/$1.uc($2)/eg;
        print $_;
    }

But it capitalizes the first letter of every word it touches, even words such as "at", "the" and "a" in the middle of a title:

    barbarians At The Gate
    hot, Flat, And Crowded
    A DAY LATE AND A DOLLAR SHORT
    THE HITCHHIKER GUIDE TO THE GALAXY

How can I do this?

2 answers

Thanks to the comment "See also Lingua::EN::Titlecase" from Håkon Hægland, which pointed the way to a solution.

    use Lingua::EN::Titlecase;

    while (<DATA>) {
        chomp(my $line = $_);
        # The object stringifies to the title-cased text
        my $tc = Lingua::EN::Titlecase->new($line);
        print "$tc\n";
    }

You can also try this regex: ^(.)(.*?)\b|\b(at|to|that|and|this|the|a|is|was)\b|\b(\w)([\w']*?(?:[^\w'-]|$)) and replace with \U$1\L$2\U$3\L$4 . It works by matching the first letter of each word that is not an article, capitalizing it, and then matching the rest of the word. This seems to work in PHP; I have not tried it in Perl, but it will most likely work there too.

  • ^(.)(.*?)\b matches the first letter of the first word (group 1) and the rest of that word (group 2). This ensures the first word is always capitalized, even when it is an article.
  • \b(word|multiple words|...)\b matches any connecting word so that it is not capitalized.
  • (\w)([\w']*?(?:[^\w'-]|$)) matches the first letter of a word (group 3) and the rest of the word (group 4). Here I used [^\w'-] instead of \b , so hyphens and apostrophes are also treated as word characters. This prevents the s in 's from being capitalized.

In the replacement, \U uppercases the characters that follow it and \L lowercases them. If you want, you can add more articles or other words to the regular expression to prevent them from being capitalized.
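
For Perl specifically, the same idea can be sketched with the /e modifier, which computes each replacement in code rather than with one long alternation. This is only a minimal sketch: the helper name title_case and the stop list %small are assumptions made for illustration, and the list is far from complete.

```perl
use strict;
use warnings;

# Hypothetical stop list of small words; extend it as needed.
my %small = map { $_ => 1 } qw(a an and at but by for in of on or the to);

# title_case() is an illustrative helper name, not part of any module.
sub title_case {
    my ($line) = @_;
    my $seen_word = 0;    # false until the first word has been handled
    # A "word" is a run of word characters, apostrophes and hyphens.
    $line =~ s{([\w'-]+)}{
        my $word = lc $1;
        my $keep_lower = $seen_word && $small{$word};
        $seen_word = 1;
        $keep_lower ? $word : ucfirst($word);
    }ge;
    return $line;
}

print title_case($_), "\n" for (
    'barbarians at the gate',
    'hot, flat, and crowded',
    'A DAY LATE AND A DOLLAR SHORT',
    'THE HITCHHIKER GUIDE TO THE GALAXY',
);
```

Lowercasing each word before deciding what to do with it lets the all-caps titles come out the same way as the lowercase ones, and the $seen_word flag keeps the first word capitalized even when it is an article.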

UPDATE: I changed the regex so that you can also include connecting phrases made up of several words. But that will still make for a very long regular expression ...
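
One way to keep a long pattern manageable in Perl is to build the alternation from a list instead of typing it by hand. The sketch below assumes the title has already had every word capitalized and only lowercases the connectives again; the list @connectives and the helper name lower_connectives are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical list of connecting words and multi-word phrases.
my @connectives = ('as well as', 'in front of', 'and', 'at', 'the', 'a', 'to', 'of');

# Longest entries first, so "as well as" is tried before "a".
# Words inside a phrase may be separated by any run of whitespace.
my $alternation = join '|',
    map  { join '\s+', map { quotemeta($_) } split ' ', $_ }
    sort { length($b) <=> length($a) } @connectives;
my $connective_re = qr/\b(?:$alternation)\b/i;

# lower_connectives() is an illustrative helper name.
sub lower_connectives {
    my ($title) = @_;
    # (?!^) skips a connective at the very start of the title.
    $title =~ s/(?!^)($connective_re)/lc $1/ge;
    return $title;
}

print lower_connectives('The Cat As Well As The Dog In Front Of A House'), "\n";
```

Adding a word or phrase then means adding one list entry, and quotemeta keeps any punctuation in an entry from being read as regex syntax.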


Source: https://habr.com/ru/post/1261044/

