Transliteration script for linux shell

Question


I have several .txt files containing text in one alphabet, and I want to transliterate the text into another alphabet. Some characters of alphabet 1 map 1:1 to characters of alphabet 2 (e.g., a becomes e), while others map 1:2 (e.g., x becomes ch).

I would like to do this using a simple script for the Linux shell.

With tr or sed, I can convert the 1:1 characters:

 sed -e 'y/abcdefghijklmnopqrstuvwxyz/nopqrstuvwxyzabcdefghijklm/'

a becomes n, b becomes o, et cetera (this looks to me like a Caesar cipher).
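For comparison, the same 1:1 mapping can be sketched with tr, which the question also mentions. The function name rot13 is made up for this illustration. Like the y command in sed, tr only maps single characters to single characters, so it cannot express a 1:2 rule such as x becoming ch.

```shell
# Hypothetical helper: rotate lowercase ASCII letters by 13 places (1:1 only).
rot13() {
    tr 'a-z' 'n-za-m'
}

echo "hello" | rot13   # prints: uryyb
```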

But how can I deal with the 1:2 characters?

+5

linux shell sed tr

user3946687 Aug 16 '14 at 8:46


4 answers

Using Awk:

 #!/usr/bin/awk -f
 BEGIN {
     FS = OFS = ""
     table["a"] = "e"
     table["x"] = "ch"
     # and so on...
 }
 {
     for (i = 1; i <= NF; ++i) {
         if ($i in table) {
             $i = table[$i]
         }
     }
 }
 1

Usage:

 awk -f script.awk file

Test:

 # echo "the quick brown fox jumps over the lazy dog" | awk -f script.awk
 the quick brown foch jumps over the lezy dog
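The script above relies on an empty FS splitting each line into single characters. gawk and mawk do this, but POSIX leaves the behaviour of an empty FS unspecified. A minimal sketch of a variant that walks the line with substr() instead is shown here; the function name translit is an assumption for illustration, and the two table entries are the same examples as above. Depending on the awk implementation and locale, substr() may count bytes rather than characters, so this is only a starting point for non-ASCII alphabets.

```shell
# Hypothetical wrapper: transliterate stdin (or the named files) one character at a time.
translit() {
    awk '
    BEGIN {
        table["a"] = "e"
        table["x"] = "ch"
        # and so on...
    }
    {
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c in table)
                out = out table[c]   # mapped: append the replacement (may be several chars)
            else
                out = out c          # unmapped: keep the character as is
        }
        print out
    }' "$@"
}

echo "the lazy fox" | translit   # prints: the lezy foch
```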

+4

konsolebox Aug 16 '14 at 8:57


This can be done quite concisely with a Perl one-liner:

 perl -pe '%h=(a=>"xy",c=>"z"); s/(.)/defined $h{$1} ? $h{$1} : $1/eg'

or, equivalently (thanks, jaypal):

 perl -pe '%h=(a=>"xy",c=>"z"); s|(.)|$h{$1}//=$1|eg'

%h is a hash containing the characters (keys) and their substitutions (values). s is the substitution command (as in sed). The g modifier makes the substitution global, and e means that the replacement part is evaluated as an expression. The pattern captures each character in turn and replaces it with its value in the hash if one exists; otherwise the original character is kept. The -p switch means that every line of input is printed automatically after processing.

Testing:

 $ perl -pe '%h=(a=>"xy",c=>"z"); s|(.)|$h{$1}//=$1|eg' <<<"abc"
 xybz

+2

Tom Fenech Aug 16 '14 at 16:20


Using sed.

Write a transliterate.sed file containing:

 s/a/e/g
 s/x/ch/g

and then run from the command line to get the transliterated output.txt from input.txt:

 sed -f transliterate.sed input.txt > output.txt

If you need this more often, consider adding #!/bin/sed -f as the first line and making the file executable with chmod 744 transliterate.sed, as described on the Wikipedia page for sed.
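One thing to keep in mind with this approach: sed applies the s commands one after another to the same line, so the output of an earlier rule can be matched again by a later rule. The sketch below uses a made-up pair of rules (a becomes e, then e becomes i) purely to show the effect; the function name chained is an assumption for illustration.

```shell
# Hypothetical rule set: the second rule also rewrites the e produced by the first.
chained() {
    sed -e 's/a/e/g' -e 's/e/i/g'
}

echo "ae" | chained   # prints: ii (not ei)
```

If the mapping contains such overlaps, a single-pass lookup per character, as in the awk and Perl answers, avoids the problem.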

0

mgoni Apr 26 '19 at 11:54


Ed Morton · Accepted Answer · 2014-08-17T14:26:26+0000

Not an answer, just showing a shorter, idiomatic way to populate the table[] array from @konsolebox's answer, as discussed in the related comments:

 BEGIN {
     split("a  e b", old)
     split("x ch o", new)
     for (i in old)
         table[old[i]] = new[i]
     FS = OFS = ""
 }

so the mapping of old to new characters is clearly shown, in that each char in the first split() maps to the char(s) directly below it. For any other mapping you want, you just need to change the strings in the split() calls, not change 26-ish explicit assignments to table[].

You can even create a generic script to do the mapping and simply pass the old and new strings in as variables:

 BEGIN {
     split(o, old)
     split(n, new)
     for (i in old)
         table[old[i]] = new[i]
     FS = OFS = ""
 }

then in the shell, something like this:

 old="a  e b"
 new="x ch o"
 awk -v o="$old" -v n="$new" -f script.awk file

and you can protect yourself from your own mistakes when populating the strings, for example:

 BEGIN {
     numOld = split(o, old)
     numNew = split(n, new)

     if (numOld != numNew) {
         printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
         exit 1
     }

     for (i=1; i <= numOld; i++) {
         if (old[i] in table) {
             printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
             exit 1
         }
         if (newvals[new[i]]++) {
             printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
         }
         table[old[i]] = new[i]
     }
 }

Wouldn't it be good to know if you wrote that b maps to x and then mistakenly wrote that b maps to y? The above really is the best way to do this, but it's your call, of course.

Here is one complete solution, as discussed in the comments below:

 BEGIN {
     numOld = split("a  e b", old)
     numNew = split("x ch o", new)

     if (numOld != numNew) {
         printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
         exit 1
     }

     for (i=1; i <= numOld; i++) {
         if (old[i] in map) {
             printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
             exit 1
         }
         if (newvals[new[i]]++) {
             printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
         }
         map[old[i]] = new[i]
     }

     FS = OFS = ""
 }
 {
     for (i = 1; i <= NF; ++i) {
         if ($i in map) {
             $i = map[$i]
         }
     }
     print
 }

I renamed the table array to map only because, IMHO, that better reflects the purpose of the array.

Save the above in a file named script.awk and run it as awk -f script.awk inputfile
