Detecting CJK characters in a string (C#)

I use iTextSharp to generate a series of PDF files with Open Sans as the default font. Names are sometimes inserted into the PDF content. My problem is that some of these names contain CJK characters (stored in nvarchar columns in SQL Server), and as far as I know, Open Sans does not currently support CJK characters. I need to keep Open Sans as the default font, so ideally I would like to detect CJK characters in the strings retrieved from the database and switch to a CJK font when printing those characters.

Is there a suitable regex pattern for this? Unfortunately, I could not find any regex patterns that help with it.

Thanks in advance for your help!

+4
3 answers

Use iTextSharp.text.pdf.FontSelector:

    iTextSharp.text.pdf.FontSelector selector = new iTextSharp.text.pdf.FontSelector();
    // Add both fonts to the FontSelector
    selector.AddFont(openSansfont);
    selector.AddFont(chinesefont);
    iTextSharp.text.Phrase phrase = selector.Process(yourTxt);

FontSelector will use the correct font for you!

The description from the source file FontSelector.cs:

Selects the appropriate fonts that contain the glyphs needed to render text correctly. The fonts are checked in order until the character is found.

I forgot which order it searches in, so please try it out! Edit: the fonts are checked in order, from the first AddFont call to the last.

http://itextpdf.com/examples/iia.php?id=214

+2

In case anyone else comes across this question, I found another solution that uses the Unicode named blocks listed here ( http://msdn.microsoft.com/en-us/library/20bw873z.aspx#SupportedNamedBlocks ) in a regular expression.

    var name = "Joe Bloggs";
    var cjkRegex = new Regex(@"\p{IsCJKUnifiedIdeographs}");
    if (cjkRegex.IsMatch(name)) {
        // switch to CJK font
    } else {
        // keep calm and carry on
    }

EDIT:

You may have to match more than just the unified ideographs. Try using this as the regular expression:

    string r = @"\p{IsHangulJamo}|" +
               @"\p{IsCJKRadicalsSupplement}|" +
               @"\p{IsCJKSymbolsandPunctuation}|" +
               @"\p{IsEnclosedCJKLettersandMonths}|" +
               @"\p{IsCJKCompatibility}|" +
               @"\p{IsCJKUnifiedIdeographsExtensionA}|" +
               @"\p{IsCJKUnifiedIdeographs}|" +
               @"\p{IsHangulSyllables}|" +
               @"\p{IsCJKCompatibilityForms}";

This works for all the Korean text I tried.
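Since the question asks about switching fonts only for the CJK characters, a detection pattern like the one above can also be used to split a string into runs. The following is a minimal sketch, not a tested production implementation: the class name `CjkRuns` and the method `Split` are hypothetical, and the pattern is the same set of named blocks rewritten as a single character class.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: splits text into alternating CJK and non-CJK runs.
public static class CjkRuns
{
    // The same named blocks as the pattern above, for use inside [...] or [^...].
    private const string Blocks =
        @"\p{IsHangulJamo}\p{IsCJKRadicalsSupplement}\p{IsCJKSymbolsandPunctuation}" +
        @"\p{IsEnclosedCJKLettersandMonths}\p{IsCJKCompatibility}" +
        @"\p{IsCJKUnifiedIdeographsExtensionA}\p{IsCJKUnifiedIdeographs}" +
        @"\p{IsHangulSyllables}\p{IsCJKCompatibilityForms}";

    // Either a maximal run of CJK characters (named group "cjk")
    // or a maximal run of everything else.
    private static readonly Regex RunPattern =
        new Regex("(?<cjk>[" + Blocks + "]+)|[^" + Blocks + "]+");

    public static List<(string Text, bool IsCjk)> Split(string text)
    {
        var runs = new List<(string Text, bool IsCjk)>();
        foreach (Match m in RunPattern.Matches(text))
        {
            // The "cjk" group succeeds only when the first alternative matched.
            runs.Add((m.Value, m.Groups["cjk"].Success));
        }
        return runs;
    }
}
```

For `"Joe 도형이 Bloggs"` this yields three runs: `"Joe "` (not CJK), `"도형이"` (CJK), and `" Bloggs"` (not CJK). Each run could then be added to a Phrase as a Chunk that uses either the Open Sans font or the CJK font.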

+5

I actually edited dave's answer to make it work, but apparently the edit is visible only to me until it has been peer reviewed, so I am posting the solution as my own answer. Basically, dave just needs to expand his regex a bit:

    string regex = @"\p{IsHangulJamo}|" +
                   @"\p{IsCJKRadicalsSupplement}|" +
                   @"\p{IsCJKSymbolsandPunctuation}|" +
                   @"\p{IsEnclosedCJKLettersandMonths}|" +
                   @"\p{IsCJKCompatibility}|" +
                   @"\p{IsCJKUnifiedIdeographsExtensionA}|" +
                   @"\p{IsCJKUnifiedIdeographs}|" +
                   @"\p{IsHangulSyllables}|" +
                   @"\p{IsCJKCompatibilityForms}";

which will detect Korean characters when used as follows:

    string subject = "도형이";
    Match match = Regex.Match(subject, regex);
    if (match.Success) {
        // change to Korean font
    } else {
        // keep calm and carry on
    }
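If the check runs for every row read from the database, it may be worth building the regex once and reusing it. The sketch below wraps the pattern above in a hypothetical `CjkDetector.ContainsCjk` helper. It also adds the `IsHiragana` and `IsKatakana` named blocks, which is an assumption made here for Japanese names and is not part of the original answers; remove them if they are not needed.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: a reusable check built from the named blocks above.
public static class CjkDetector
{
    // One character class instead of an alternation; built once and reused.
    // IsHiragana and IsKatakana are an added assumption for Japanese kana.
    private static readonly Regex CjkPattern = new Regex(
        @"[\p{IsHangulJamo}\p{IsCJKRadicalsSupplement}\p{IsCJKSymbolsandPunctuation}" +
        @"\p{IsHiragana}\p{IsKatakana}" +
        @"\p{IsEnclosedCJKLettersandMonths}\p{IsCJKCompatibility}" +
        @"\p{IsCJKUnifiedIdeographsExtensionA}\p{IsCJKUnifiedIdeographs}" +
        @"\p{IsHangulSyllables}\p{IsCJKCompatibilityForms}]",
        RegexOptions.Compiled);

    // Returns true if the text contains at least one character from the blocks above.
    public static bool ContainsCjk(string text)
    {
        return !string.IsNullOrEmpty(text) && CjkPattern.IsMatch(text);
    }
}
```

With this helper, `CjkDetector.ContainsCjk("Joe Bloggs")` is false, while `CjkDetector.ContainsCjk("도형이")` is true.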
0

Source: https://habr.com/ru/post/1479497/

