Add Tab Separator to Grep

I am new to grep and awk, and I would like to produce tab-separated values in the "frequency.txt" file (this script goes through a large corpus and then lists each individual word and how many times it is used in the corpus; I modified it for the Khmer language). I looked around (grep a tab in UNIX), but I cannot find an example that makes sense to me for this bash script (I'm too much of a newbie).

I use this bash script in Cygwin:

#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
#
# Note: the second -e expression contains a literal zero-width space (U+200B), which is invisible here.
sed -e 's/[a-zA-Z]//g' -e 's/​/ /g' -e 's/\t/ /g' \
    -e 's/[«|»|:|;|.|,|(|)|-|?|។|"|"]//g' -e 's/[0-9]//g' \
    -e 's/ /\n/g' -e 's/០//g' -e 's/១//g' -e 's/២//g' \
    -e 's/៣//g' -e 's/៤//g' -e 's/៥//g' -e 's/៦//g' \
    -e 's/៧//g' -e 's/៨//g' -e 's/៩//g' dictionary.txt | \
  tr [:upper:] [:lower:] | \
  sort | \
  uniq -c | \
  sort -rn > frequency.txt
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'

Awk prints the pairs with a comma, but only on the screen. How can I put a tab (a comma would also work) between the frequency and the term in the file?

Sample input from dictionary.txt (Khmer text, which the sed command above cleans up):

ព្រ ះ វិញ្ញាណ នឹង ប្រពន្ធ ថ្មោង ថ្មី ពោល ថា អញ្ជើញ មក ហើយ អ្នក ណា ដែល ឮ ក៏ ថា អញ្ជើញ មក ដែរ អ្នក ណា ដែល ស្រេក ន ោះ មាន តែ មក ហើយ អ្នក ណា ដែល ចង់ បាន មាន តែ យក ទឹក ជីវិត ន ោះ ច ុះ ឥត ចេញ ថ្លៃ ទេ.

Current output in frequency.txt (space between frequency and term):

25605 នឹង
25043 ជា
22004 បាន
20515 ន ោះ

Desired output in frequency.txt (TAB between frequency and term):

25605TAB នឹង
25043TAB ជា
22004TAB បាន
20515TAB ន ោះ



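Part of the sed can be replaced with tr:

tr -d 'a-zA-Z0-9:;.,()?"|-'
tr '\t' ' '

GNU tr works byte by byte, so the multibyte characters («, », ។ and the Khmer digits ០ to ៩) have to stay in the sed expression; handing them to tr would delete single bytes out of every Khmer character.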

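Some remarks:

  • 's/​/ /g' - if the pattern is really empty, sed reuses the previous regex, here [a-zA-Z], whose matches have already been removed, so this is a no-op; if it holds an invisible character such as a zero-width space, it turns that character into a space
  • 's/[«|»|:|;|.|,|(|)|-|?|។|"|"]//g' - the | separators are not needed inside a bracket expression (there they are ordinary characters), so this can be written 's/[«»:;.,()?។""|-]//g' (keep the | only if you want it removed, and put the - last so that it is not read as a range)
  • 's/ /\n/g' - this puts one word per line, which sort and uniq need; tr ' ' '\n' does the same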
Then, to put a tab after the count printed by uniq -c:

sed 's/^ *\([0-9]\+\) /\1\t/'
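
A minimal sketch of that sed step, run on made-up ASCII words standing in for the Khmer terms. It assumes GNU sed (as shipped with Cygwin), because \+ and \t in this form are GNU extensions; the function name tally_tab is hypothetical.

```shell
#!/bin/bash
# Count words read from stdin (one per line) and print "count<TAB>word",
# most frequent first.
tally_tab() {
  sort | uniq -c | sort -rn |
    sed 's/^ *\([0-9]\+\) /\1\t/'   # drop uniq's padding, space -> tab
}

printf 'b\na\nb\nc\nb\na\n' | tally_tab > frequency.txt
cat frequency.txt
```

With this sample input, frequency.txt ends up with three lines: the counts 3, 2 and 1, each followed by a tab and its word.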

Or, with AWK:

awk 'BEGIN{OFS="\t"} {print $2, $1}'
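
A minimal sketch of the AWK route that keeps the frequency first, as in the desired output, and writes to a file rather than only to the screen. The awk program sits in single quotes, with the tab written in double quotes inside it. The function name count_tab_term, the output file name and the sample lines are hypothetical.

```shell
#!/bin/bash
# Turn uniq -c style lines ("   25605 term") into "25605<TAB>term".
# The whole remainder of the line is kept, so a term that contains a
# space is not truncated to its first field.
count_tab_term() {
  awk '{
    count = $1
    sub(/^ *[0-9]+ /, "")     # strip padding, count and one space
    print count "\t" $0
  }'
}

printf '      3 foo\n      2 bar baz\n' | count_tab_term > frequency_tab.txt
cat frequency_tab.txt
```

Swapping the two parts of the print statement gives the term first and the count second instead.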

awk "<"?


The following script should get you where you need to go. The pipe to tee lets you see the output on the screen while also writing it to ./outfile.

#!/bin/sh  

sed ':a;N;s/[a-zA-Z0-9។០១២៣៤៥៦៧៨៩\n«»:;.,()?""-]//g;ta' < dictionary.txt | \
gawk '{$0=toupper($0);for(i=1;i<=NF;i++)a[$i]++}
   END{for(item in a)printf "%s\t%d\n", item, a[item]}' | \
tee ./outfile
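
As a rough sketch of the same count-then-tee idea with one pair per output line: it uses plain awk rather than gawk and sorts the result, because the order of awk's for-in loop is unspecified. The function name count_words and the sample input are hypothetical.

```shell
#!/bin/sh
# Count whitespace-separated words on stdin; print "word<TAB>count".
count_words() {
  awk '{ for (i = 1; i <= NF; i++) n[$i]++ }
       END { for (w in n) printf "%s\t%d\n", w, n[w] }' |
    LC_ALL=C sort
}

# tee shows the result on screen and saves a copy in ./outfile.
printf 'a b a\nc a\n' | count_words | tee ./outfile
```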

Source: https://habr.com/ru/post/1789125/


All Articles