Encoding in languages other than ASCII

Question

I have always found this strikingly odd, but in C# you can write your source code in another language. I wrote some sample Korean source code to illustrate my point:

    using System;
    using System.Collections.Generic;

    namespace 대한민국
    {
        // A student with a name (이름) and a motto (좌우명).
        public class 학생
        {
            public string 이름 { get; private set; }
            public string 좌우명 { get; private set; }

            public 학생(string 이름, string 좌우명)
            {
                this.이름 = 이름;
                this.좌우명 = 좌우명;
            }
        }

        // A university that keeps a list of enrolled students.
        public class 대학교
        {
            private List<학생> 재학생목록 = new List<학생>();

            public void 입학(학생 입학생)
            {
                재학생목록.Add(입학생);
            }

            public void 재학생출력()
            {
                foreach (학생 선택된학생 in 재학생목록)
                {
                    Console.WriteLine("이름: {0}", 선택된학생.이름);
                    Console.WriteLine("좌우명: {0}", 선택된학생.좌우명);
                }
            }
        }

        public class 프로그램
        {
            static void Main(string[] args)
            {
                대학교 스쿨오브헬 = new 대학교();
                스쿨오브헬.입학(new 학생("전땅끄", "본인은 단돈 29만원과 땅끄로 이 신성하고 거룩한 국가의 민주주의를 발전시켰소"));
                스쿨오브헬.입학(new 학생("이피카츄", "여러분 이거 다 거짓말인거 아시죠!!!"));
                스쿨오브헬.입학(new 학생("빵상아줌마", "빵빵 똥똥똥똥 땅땅 따라라라라~~~"));
                스쿨오브헬.재학생출력();
            }
        }
    }

The above code compiles and gives a valid result.

With the exception of keywords, you can write your source code in languages other than English. Of course, this is highly impractical, and nobody would do it.

My questions are:

1. Is this a C# feature or a Visual Studio feature? (I could not get a similar program to work in C++ in Visual Studio 2010.)
2. What is the performance impact? (I would assume next to nothing, but I was not sure whether some kind of crazy conversion is going on to let you code with non-ASCII characters.)
3. What is the rationale for implementing this feature?

+4

c# .net unicode

l46kok 01 Oct '12 at 11:47

2 answers

1: this is a feature of the C# language specification, so: C#

2: none at all; parsers don't really care whether something is Fred or 프로그램; neither is meaningful to the compiler

3: because not all developers speak English (or use the Latin alphabet) as their main language. It is quite possible that 프로그램 expresses the intent of the class very readily and meaningfully to the developers working on that project.
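
To illustrate point 2, here is a minimal sketch of why Fred and 프로그램 look the same to a lexer: both consist only of characters in the Unicode letter categories that the C# identifier grammar accepts. The names `IdentifierChars`, `IsLetterCharacter` and `LooksLikeIdentifier` are made up for this example, and the check is a deliberate simplification: it ignores keywords, the `@` prefix, Unicode escapes, and characters outside the Basic Multilingual Plane.

```csharp
using System;
using System.Globalization;

public static class IdentifierChars
{
    // Letter categories Lu, Ll, Lt, Lm, Lo and Nl.
    public static bool IsLetterCharacter(char c)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:   // Hangul syllables fall here
            case UnicodeCategory.LetterNumber:
                return true;
            default:
                return false;
        }
    }

    // Simplified test: a letter or '_' first, then letters, digits,
    // connectors, combining marks or format characters.
    public static bool LooksLikeIdentifier(string s)
    {
        if (string.IsNullOrEmpty(s)) return false;
        if (!(IsLetterCharacter(s[0]) || s[0] == '_')) return false;

        for (int i = 1; i < s.Length; i++)
        {
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(s[i]);
            bool ok = IsLetterCharacter(s[i])
                || cat == UnicodeCategory.DecimalDigitNumber
                || cat == UnicodeCategory.ConnectorPunctuation
                || cat == UnicodeCategory.NonSpacingMark
                || cat == UnicodeCategory.SpacingCombiningMark
                || cat == UnicodeCategory.Format;
            if (!ok) return false;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikeIdentifier("Fred"));     // True
        Console.WriteLine(LooksLikeIdentifier("프로그램"));  // True
        Console.WriteLine(LooksLikeIdentifier("2fast"));    // False: starts with a digit
    }
}
```

Nothing in the check treats Latin letters specially, which is the sense in which neither spelling is significant to the compiler.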

+5

Marc Gravell 01 Oct '12 at 11:52

Jon · Accepted Answer · 2012-10-01T11:57:54+0000

1) Both the C# specification and the CLI specification allow this.

The C# standard says that

A source file is an ordered sequence of Unicode characters.

and

An identifier in a conforming program must be in the canonical format defined by Unicode Normalization Form C, as defined by Unicode Standard Annex 15. The behavior when encountering an identifier not in Normalization Form C is implementation-defined; however, a diagnostic is not required.
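
Why Normalization Form C matters is easy to see with Hangul, where the same syllable can be spelled as one precomposed character or as a sequence of conjoining jamo. The following is only a sketch using the standard `string.Normalize` and `string.IsNormalized` methods; the class name `NormalizationDemo` and the helper `SameAfterNfc` are invented for the illustration and say nothing about how any compiler actually handles this.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // True when both spellings are identical once brought into
    // Normalization Form C (hypothetical helper for this example).
    public static bool SameAfterNfc(string a, string b)
    {
        string na = a.Normalize(NormalizationForm.FormC);
        string nb = b.Normalize(NormalizationForm.FormC);
        return string.Equals(na, nb, StringComparison.Ordinal);
    }

    public static void Main()
    {
        // "한" as a single precomposed syllable (U+D55C) ...
        string composed = "\uD55C";
        // ... and as three conjoining jamo (U+1112 U+1161 U+11AB).
        string decomposed = "\u1112\u1161\u11AB";

        // Only the precomposed spelling is already in Form C.
        Console.WriteLine(composed.IsNormalized(NormalizationForm.FormC));
        Console.WriteLine(decomposed.IsNormalized(NormalizationForm.FormC));

        // Raw comparison sees two different strings ...
        Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal));
        // ... but after normalization they are the same.
        Console.WriteLine(SameAfterNfc(composed, decomposed));
    }
}
```

Two source files that look identical on screen could therefore contain different character sequences for the same name, which is the situation the normalization requirement addresses.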

The ECMA CLI standard says the following:

I.8.5 Naming
Names are given to entities of the type system so that they can be referred to by other parts of the type system or by the implementations of the types. Types, fields, methods, properties, and events have names. With respect to the type system, values, locals, and parameters do not have names. An entity of the type system is given a single name (e.g., there is only one name for a type).
I.8.5.1 Valid names
All name comparisons are done on a byte-by-byte (i.e., case-sensitive, locale-independent, also known as code-point comparison) basis. Where names are used to access built-in VES-supplied functionality (e.g., the class initialization method), there is always an accompanying indication on the definition so as not to build in any set of reserved names.

An important note follows:

CLS Rule 4: Assemblies shall follow Annex 7 of Technical Report 15 of the Unicode Standard 3.0 governing the set of characters permitted to start and be included in identifiers, available online at http://www.unicode.org/unicode/reports/tr15/tr15-18.html. Identifiers shall be in the canonical format defined by Unicode Normalization Form C. For CLS purposes, two identifiers are the same if their lowercase mappings (as specified by the Unicode locale-insensitive, one-to-one lowercase mappings) are the same. That is, for two identifiers to be considered different under the CLS they shall differ in more than simply their case. However, in order to override an inherited definition the CLI requires the precise encoding of the original declaration be used.
[Note: CLS (consumer): Need not consume types that violate CLS Rule 4, but shall have a mechanism to allow access to named items that use one of its own keywords as the name. CLS (extender): Need not create types that violate CLS Rule 4. Shall provide a mechanism for defining new names that obey these rules, but are the same as a keyword in the language. CLS (framework): Shall not export types that violate CLS Rule 4. Should avoid the use of names that are commonly used as keywords in programming languages.]
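
The difference between the two comparison rules quoted above can be sketched as follows. `ClsNameComparison`, `SameCliName` and `SameClsName` are names invented for this example, and `ToLowerInvariant` is used only as an approximation of the "locale-insensitive, one-to-one lowercase mappings" that the rule refers to; treat it as an assumption, not as what a compiler does.

```csharp
using System;

public static class ClsNameComparison
{
    // I.8.5.1 style comparison: case-sensitive and culture-independent.
    // Ordinal comparison works on UTF-16 code units, which for equality
    // of well-formed strings matches comparing code points.
    public static bool SameCliName(string a, string b)
    {
        return string.Equals(a, b, StringComparison.Ordinal);
    }

    // Approximation of CLS Rule 4: names that differ only by case
    // count as the same name.
    public static bool SameClsName(string a, string b)
    {
        return string.Equals(a.ToLowerInvariant(), b.ToLowerInvariant(),
                             StringComparison.Ordinal);
    }

    public static void Main()
    {
        // Different names for the CLI, the same name for CLS purposes.
        Console.WriteLine(SameCliName("Student", "student")); // False
        Console.WriteLine(SameClsName("Student", "student")); // True

        // Hangul has no case, so both rules agree.
        Console.WriteLine(SameCliName("학생", "학생"));         // True
        Console.WriteLine(SameClsName("학생", "학생"));         // True
    }
}
```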

2) There should be no performance impact. The CLI rules state that name matching is done using Unicode locale-insensitive mappings, which means that matching two names requires converting them to sequences of Unicode code points. If a compiler or runtime chose to store this information in a variable-length encoding such as UTF-8 and convert to code points on the fly, there would theoretically be some performance difference; in practice I would not expect any implementation to do this, or the difference to be measurable if one did.

Note that CLS Rule 4 states that "in order to override an inherited definition the CLI requires the precise encoding of the original declaration be used", which places a specific constraint on overriding names. But since this is not a universal requirement, you would still have to "convert everything to code points before comparing".
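
As a small illustration of comparing names as code points versus comparing their stored bytes, the sketch below checks equality both ways. The class `EncodingComparison` and its methods are hypothetical and assume well-formed strings (no lone surrogates); it is not a description of how any particular runtime stores or compares names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class EncodingComparison
{
    // Compare the variable-length UTF-8 encodings byte by byte.
    public static bool EqualAsUtf8(string a, string b)
    {
        byte[] x = Encoding.UTF8.GetBytes(a);
        byte[] y = Encoding.UTF8.GetBytes(b);
        return x.SequenceEqual(y);
    }

    // Decode to Unicode code points first, then compare.
    public static bool EqualAsCodePoints(string a, string b)
    {
        return CodePoints(a).SequenceEqual(CodePoints(b));
    }

    // Assumes well-formed UTF-16; ConvertToUtf32 throws on lone surrogates.
    private static List<int> CodePoints(string s)
    {
        var result = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            result.Add(char.ConvertToUtf32(s, i));
            if (char.IsHighSurrogate(s[i])) i++; // skip the low surrogate
        }
        return result;
    }

    public static void Main()
    {
        string[][] pairs =
        {
            new[] { "학생", "학생" },
            new[] { "학생", "대학교" },
            new[] { "Fred", "fred" },
        };

        foreach (string[] p in pairs)
        {
            // For well-formed input both answers agree.
            Console.WriteLine("{0} vs {1}: utf8={2}, codepoints={3}",
                p[0], p[1], EqualAsUtf8(p[0], p[1]), EqualAsCodePoints(p[0], p[1]));
        }
    }
}
```

For well-formed input the two checks give the same answer for equality, so the choice of storage encoding affects at most how much decoding work is done, not the result.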

3) Again, this is in the CLI specification, so the language has to support it.
