Work with raw text data from a scanned catalog.
I only want to save 2 types of lines:
- the beginning with the number (artists 'works)
- contains 2 matched upper letters ** with accents ** (artists' names)
I want to remove everything else easily (with true -false?)
my data
ÁÀDFDS (artist 1 with accents)
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB (artist 2)
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED (artist 3)
az*ù*ù*ù (bad string)
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDFSF (artist 4)
6 Sed cursus augue in tempus scelerisque.
A..gdgdgdg (bad string begining with a upper case letter)
7 in commodo enim in laoreet gravida.
Expected results
with accents DFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
B'BDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
ÉÈDDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida.
Data is imported into R using:
readlines ("clipboard")
I can identify strings, including artist names, in capital letters with various regular expressions
eg.
[A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO][A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO |']
I can identify strings, including works of art
^[0-9]+[\s]
Any help would be greatly appreciated.
source
share