Grep for (curly | microsoft | smart) quotes

I have a huge folder full of XML documents, some of which may be broken because they contain curly quotes, i.e. the quotes that Microsoft Word produces, also known as smart quotes. I just want a quick check to see what I am up against. Does anyone know how to grep for them so that I can easily find the culprits?

Edit

Here is a simplified example.

<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>Pretend this is a curly quote: '</item>
</items>
+3
5 answers

Curly quotes have the following Unicode code points and UTF-8 sequences:

Name                                     CodePoint     UTF-8 sequence
----                                     ---------     --------------
LEFT SINGLE QUOTATION MARK               U+2018        0xE2 0x80 0x98
RIGHT SINGLE QUOTATION MARK              U+2019        0xE2 0x80 0x99
SINGLE LOW-9 QUOTATION MARK              U+201A        0xE2 0x80 0x9A
SINGLE HIGH-REVERSED-9 QUOTATION MARK    U+201B        0xE2 0x80 0x9B 
LEFT DOUBLE QUOTATION MARK               U+201C        0xE2 0x80 0x9C
RIGHT DOUBLE QUOTATION MARK              U+201D        0xE2 0x80 0x9D
DOUBLE LOW-9 QUOTATION MARK              U+201E        0xE2 0x80 0x9E
DOUBLE HIGH-REVERSED-9 QUOTATION MARK    U+201F        0xE2 0x80 0x9F

If the XML is encoded in UTF-8, these are the byte sequences to look for.
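As a quick sanity check on the table, the byte sequences can be regenerated from the code points. This is only a sketch; the names `CURLY_QUOTES` and `utf8_hex` are made up for illustration.

```python
import unicodedata

# The eight code points listed in the table: U+2018 through U+201F.
CURLY_QUOTES = [chr(cp) for cp in range(0x2018, 0x2020)]


def utf8_hex(ch):
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))


for ch in CURLY_QUOTES:
    # Prints the Unicode name, the code point and the UTF-8 bytes.
    print(f"{unicodedata.name(ch):40} U+{ord(ch):04X}        {utf8_hex(ch)}")
```

All eight characters share the prefix 0xE2 0x80 and differ only in the last byte (0x98 to 0x9F), which makes them easy to match with a single byte range.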

+5

Building on the UTF-8 sequences from dalle's answer, you can do this:

grep -r -P "\xE2\x80\x9C" .

-r searches recursively, and -P makes grep interpret the pattern as a Perl regular expression.
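Where a grep with -P is not available, roughly the same search can be sketched in Python. Note that the command above matches only U+201C, while this sketch matches all eight code points U+2018 to U+201F. The function name `find_curly_quotes` and the `*.xml` filter are assumptions for illustration, and the files are assumed to be UTF-8.

```python
import re
from pathlib import Path

# Bytes E2 80 98 through E2 80 9F are U+2018..U+201F in UTF-8.
CURLY = re.compile(rb"\xE2\x80[\x98-\x9F]")


def find_curly_quotes(root, pattern="*.xml"):
    """Yield (path, line_number, line_bytes) for each line with a curly quote."""
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        # Read as bytes so files with a broken encoding do not stop the scan.
        with path.open("rb") as fh:
            for lineno, line in enumerate(fh, start=1):
                if CURLY.search(line):
                    yield path, lineno, line.rstrip(b"\r\n")


if __name__ == "__main__":
    for path, lineno, line in find_curly_quotes("."):
        print(f"{path}:{lineno}: {line.decode('utf-8', errors='replace')}")
```

Run from the top of the folder, it prints one `path:line: text` entry per offending line, similar to what grep -rn would show.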

+4

xml, , , , , , XML ( , ).

If you just want to replace them with a plain ", something like sed -i.bak 's/[“”„]/"/g' file1 file2 ... should do it (on Linux/OSX, or under cygwin on Windows).
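The same replacement can be sketched in Python, extended to the single-quote variants as well. The names `TABLE` and `straighten_quotes` are hypothetical, and the mapping of every curly variant to a plain ' or " is an assumption about what is wanted.

```python
# Curly single and double quotes, U+2018..U+201B and U+201C..U+201F.
SINGLE = "\u2018\u2019\u201A\u201B"
DOUBLE = "\u201C\u201D\u201E\u201F"

# Translation table: each curly quote maps to its ASCII counterpart.
TABLE = str.maketrans(SINGLE + DOUBLE, "'" * len(SINGLE) + '"' * len(DOUBLE))


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(TABLE)


print(straighten_quotes("\u201CHello\u201D, said O\u2019Connor"))
```

One caveat for XML: a straight " that lands inside a double-quoted attribute value makes the document ill-formed, so it is safer to apply this to text content only, or to check the files with a parser afterwards.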

0

, , , , . - // . :

Text     | Error
----------------
O*Connor | Yes
O'Connor | No
O’Connor | No

CF code:

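<!--- Only check names that contain punctuation or digits --->
<cfif REFind("[[:punct:],[:digit:]]", textName) GT 0>
    <!--- Strip everything except letters, straight/curly apostrophes, primes, hyphens and spaces --->
    <cfset temp_name = textName.ReplaceAll(JavaCast("string", "[^A-Za-z\u2018\u2019\u201A\u201B\u2032\u2035\'\-\ ]"), JavaCast("string", ""))>
    <cfif len(temp_name) EQ len(textName)>
        <!--- Only allowed characters (e.g. a single quote or hyphen) were found: do nothing --->
    <cfelse>
        <cfset errormsg = "The text contains a special character">
    </cfif>
</cfif>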

Source: http://axonflux.com/handy-regexes-for-smart-quotes
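For comparison, the same check can be sketched in Python using the allowed-character set from the CF code. The function name `has_special_character` is hypothetical; the behaviour is meant to match the Text/Error table in this answer.

```python
import re

# Letters, straight and curly apostrophes, primes, hyphen and space are allowed.
ALLOWED = re.compile(r"[A-Za-z\u2018\u2019\u201A\u201B\u2032\u2035'\- ]*")


def has_special_character(name):
    """Return True if name contains any character outside the allowed set."""
    return ALLOWED.fullmatch(name) is None


for name in ["O*Connor", "O'Connor", "O\u2019Connor"]:
    print(name, "->", "Error" if has_special_character(name) else "OK")
```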

0

On a Mac, the built-in grep does not support the -P option, but you can install GNU grep with Homebrew:

brew install grep

Then run:

ggrep -r -P "\xE2\x80\x9C" .
etc.

Thanks to dalle and neubert for the script.

0

Source: https://habr.com/ru/post/1795562/

