Least commonly used separator character in plain text <ASCII 128

Question

For coding reasons that may horrify you (I'm too embarrassed to say), I need to keep several text elements on the same line.

I will delimit them with a symbol.

Which character is best for this, i.e. which character appears least often in text? It must be printable and preferably below 128 in ASCII, to avoid locale issues.

+67

delimiter ascii

Too embarrassed to say Jan 29 '09 at 15:35

16 answers

I would choose the "Unit Separator" ASCII code "US": ASCII 31 (0x1F)

In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII:

ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.

ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).

ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.

ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.

The Unit Separator is in ASCII, and Unicode has support for displaying it (typically "US" in a single glyph), but many fonts do not display it.

If you must display it, I would recommend displaying it in the application, after it has been parsed into fields.
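As a rough illustration of this suggestion, here is a minimal Python sketch that joins fields with the Unit Separator and records with the Record Separator. The helper names `pack` and `unpack` are hypothetical, and the sketch assumes the text itself never contains these two control codes (it raises an error if it does).

```python
# Hypothetical helpers; US and RS are the ASCII control codes described above.
US = "\x1f"  # Unit Separator: goes between fields
RS = "\x1e"  # Record Separator: goes between records

def pack(records):
    """Join a list of records (each a list of strings) into one string."""
    for record in records:
        for field in record:
            if US in field or RS in field:
                raise ValueError("field contains a separator control code")
    return RS.join(US.join(record) for record in records)

def unpack(text):
    """Split a packed string back into a list of records."""
    return [record.split(US) for record in text.split(RS)]

rows = [["Smith, John", "a|b^c"], ["O'Neil", "x~y"]]
packed = pack(rows)
restored = unpack(packed)
```

Because commas, pipes and carets are ordinary data here, none of them needs escaping; only the two control codes are reserved.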

+20

Edwin Buck Jan 09 '17 at 19:31

Probably | or ^ or ~. You could also combine two characters.

+17

SQLMenace Jan 29 '09 at 15:38

When working with different languages, this symbol: ¬ has turned out to be the best. However, I am still testing.

+14

Icarin Sep 01 '10 at 16:49

What about the CSV format? Characters can be escaped in standard CSV, and many parsers have already been written.
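For instance, a minimal sketch using Python's standard `csv` module to keep several fields on one line. The helper names `to_line` and `from_line` are hypothetical, and the sketch assumes no field contains an embedded newline (such a field would be quoted but would still span more than one physical line).

```python
import csv
import io

def to_line(fields):
    """Serialize a list of strings as one CSV line, without a trailing newline."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(fields)
    return buf.getvalue()

def from_line(line):
    """Parse one CSV line back into a list of strings."""
    return next(csv.reader([line]))

fields = ["plain", "has, comma", 'has "quotes"']
line = to_line(fields)
parsed = from_line(line)
```

The writer quotes only the fields that need it and doubles any embedded quote characters, so the comma can appear freely in the data.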

+13

Alex Fort Jan 29 '09 at 15:38

You said "printable", but that can include characters such as tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, because commas can sometimes appear in text.

(Interestingly, the ASCII table has the characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are or were.)

If by "printable" you mean a character that a user could recognize and easily type, I would go for the pipe | first, with a few other odd characters ( @ or ~ or ^ or \ , or the backtick, which I can't seem to enter here) as possibilities. These characters +=!$%&*()-'":;<>,.?/ seem more likely to occur in user input. As for the underscore _ and hash # , and the braces {} and brackets [] , I don't know.

+12

Jason S Jan 30 '09 at 1:29

Can you use the pipe symbol? It is usually the next most common delimiter after comma and tab. Most text is unlikely to contain a pipe, and ord('|') returns 124 for me, so it seems to fit your requirements.

+9

Jay Jan 29 '09 at 15:38

For quick escaping, I use something like this. Say you want to concatenate str1, str2 and str3; what I do is:

 delimitedStr = str1.Replace("@", "@a").Replace("|", "@p") + "|"
              + str2.Replace("@", "@a").Replace("|", "@p") + "|"
              + str3.Replace("@", "@a").Replace("|", "@p");

Then, to retrieve the originals, use:

 splitStr = delimitedStr.Split("|".ToCharArray());
 str1 = splitStr[0].Replace("@p", "|").Replace("@a", "@");
 str2 = splitStr[1].Replace("@p", "|").Replace("@a", "@");
 str3 = splitStr[2].Replace("@p", "|").Replace("@a", "@");

Note: the order of the replacements is important.

It's unbreakable and easy to use.
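The same scheme can be written for any number of fields. The following is a Python sketch of it, not the answer's own code; the function names are hypothetical.

```python
def escape(field):
    # Escape '@' first, so the '@' introduced by "@p" is not escaped again.
    return field.replace("@", "@a").replace("|", "@p")

def unescape(part):
    # Reverse order: restore '|' from "@p" first, then '@' from "@a".
    return part.replace("@p", "|").replace("@a", "@")

def join_fields(fields):
    """Join strings with '|' after escaping, so no raw '|' remains in a field."""
    return "|".join(escape(f) for f in fields)

def split_fields(line):
    """Split on '|' and undo the escaping of each part."""
    return [unescape(p) for p in line.split("|")]

fields = ["a|b", "user@example.com", "@p", ""]
line = join_fields(fields)
restored = split_fields(line)
```

After escaping, every '@' in a field is followed by either 'a' or 'p', which is why a literal "@p" in the input survives the round trip.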

+7

Mohammad Amin Aug 13 2018-11-11T00:

Pipe for the win! |

+2

Eppz Jan 29 '09 at 15:41

We use ASCII 0x7f, which is pseudo-printable and hardly ever comes up in regular usage.

+2

Joe Jan 30 '09 at 1:09

It can be good or bad (usually bad) depending on the situation and language, but don't forget that you can always Base64-encode everything. Then you don't have to worry about escaping and unescaping various patterns on each side, and you can simply join and split strings on a character that is not used in the Base64 alphabet.

I have had to resort to this solution when putting XML documents into XML properties / nodes. Properties cannot contain CDATA blocks, and nodes escaped as CDATA obviously cannot contain further CDATA blocks without breaking the structure.
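A minimal Python sketch of this idea, assuming UTF-8 text and using a comma as the separator, since the comma is not part of the standard Base64 alphabet. The function names are hypothetical.

```python
import base64

def encode_fields(fields, sep=","):
    """Base64-encode each string and join the results with sep."""
    return sep.join(
        base64.b64encode(f.encode("utf-8")).decode("ascii") for f in fields
    )

def decode_fields(line, sep=","):
    """Split on sep and Base64-decode each part back to a string."""
    return [base64.b64decode(p).decode("utf-8") for p in line.split(sep)]

fields = ["a,b", "x|y", "¬"]
line = encode_fields(fields)
restored = decode_fields(line)
```

The cost is that the line is no longer human-readable and grows by roughly a third.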

CSV is probably the best idea for most situations.

+2

Coxy Feb 11 '09 at 5:59

You will probably have to choose something and ignore its other uses.

may be a good candidate.

+1

Iain Holder Jan 29 '09 at 15:39

Well, to some extent this will depend on the nature of your text, but the vertical bar (0x7C) does not crop up in text very often.

+1

Jackson Jan 29 '09 at 15:39

I don't think I have ever seen an ampersand followed by a comma in natural text, but you can first check the file to see whether it contains the delimiter and, if so, use an alternative. If you want to be sure that the delimiter used will never cause a conflict, loop over the file checking for the delimiter you want, and if it is found, double the delimiter string until the file no longer contains a match. It doesn't matter if similar strings remain, because your program will only look for exact matches.
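One way to read the doubling idea, as a Python sketch. The function name and the default candidate `"&,"` are assumptions taken from the example in this answer, and the text is assumed to be already loaded into a string.

```python
def find_safe_delimiter(text, candidate="&,"):
    """Double the candidate until it no longer occurs anywhere in text."""
    delimiter = candidate
    while delimiter in text:
        delimiter += delimiter  # "&," -> "&,&," -> "&,&,&,&," ...
    return delimiter

# Caveat: this only checks the text as a whole. A field that ends with part
# of the delimiter (for example "a&") can still make a split ambiguous once
# the fields are joined, so an escaping scheme is the more robust option.
clean = find_safe_delimiter("salt and pepper")
clash = find_safe_delimiter("salt & pepper, a&,b")
```

The loop always terminates, because the delimiter eventually becomes longer than the text.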

+1

Matthew Lynam Feb 11 '09 at 5:28

Both the pipe and the caret are obvious choices. I would note that if users are expected to type the entire response, the caret is easier to find on any keyboard than the pipe.

+1

Will Johnson Aug 19 '13 at 23:55

I'm not sure you need to use ASCII, but if you can encode the text as UTF-8, you can find a really obscure character: ╡ (U+2561), which I use a lot in my programs.

You could also look into object serialization and simply create new fields for all the elements you might need.

+1

wdavies973 Feb 21 '17 at 0:00

Nick Fortescue · Accepted Answer · 2009-01-29 15:48

Assuming you cannot use CSV, I would say go with the data. Take some sample data and do a simple character count for each value 0-127. Choose one of the characters that does not occur. If there are too many to choose from, get a larger data set. It won't take long to write, and you will get the best answer for your case.

The answer will differ between problem domains: for example, | (pipe) is common in shell scripts, ^ is common in mathematical formulas, and the same is probably true of most other characters.

Personally, I think I would go for | (pipe) if given a choice, but going with real data is safer.

And whatever you do, make sure you have worked out an escaping scheme!
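A character count of this kind might look like the following Python sketch. The function name is hypothetical, and the range is narrowed to the printable, non-space ASCII characters (33 to 126) on the assumption that the delimiter must be printable, as the question asks.

```python
from collections import Counter

def unused_printable_ascii(sample):
    """Return the printable ASCII characters (33-126) absent from sample."""
    counts = Counter(sample)
    return [chr(code) for code in range(33, 127) if counts[chr(code)] == 0]

sample = "cat file.txt | grep x^2, then e-mail bob@example.com"
candidates = unused_printable_ascii(sample)
```

In real use, `sample` would be a representative chunk of the actual data rather than a single line.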
