Removing non-ascii characters from a string

Question

I get weird characters when retrieving data from a website:

Â

How do I delete everything that is not an extended ASCII character?

+43

php

LordZardeck Jan 08 '12 at 22:26

8 answers

Do you want only printable ASCII characters?

Use this:

    <?php
    header('Content-Type: text/html; charset=UTF-8');

    $str = "abqwrešđčžsff";
    $res = preg_replace('/[^\x20-\x7E]/', '', $str);
    echo "($str)($res)";

Or, even better, convert your input to UTF-8 and use the phputf8 library to transliterate the "abnormal" characters into their ASCII representation:

    require_once('libs/utf8/utf8.php');
    require_once('libs/utf8/utils/bad.php');
    require_once('libs/utf8/utils/validation.php');
    require_once('libs/utf8_to_ascii/utf8_to_ascii.php');

    if (!utf8_is_valid($str)) {
        $str = utf8_bad_strip($str);
    }

    $str = utf8_to_ascii($str, '');

+31

DamirR Jan 08 2018-12-12T00:

$clearstring=filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
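
Note that FILTER_SANITIZE_STRING also strips tags and encodes quotes, and it is deprecated as of PHP 8.1. A minimal sketch of the same idea with FILTER_UNSAFE_RAW, which only applies what the flags request, might look like this. The helper name strip_high_bytes is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function strip_high_bytes(string $raw): string
{
    // FILTER_UNSAFE_RAW leaves the string alone except for what the flags request;
    // FILTER_FLAG_STRIP_HIGH removes every byte with a value above 127.
    return filter_var($raw, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
}

// "\xC3\x82" is "Â" encoded as UTF-8; both bytes are above 127 and get removed.
echo strip_high_bytes("aA\xC3\x82 <b>\"quoted\"</b>"); // aA <b>"quoted"</b>
```

Because the filter works on bytes, every multibyte UTF-8 character is removed completely, while tags and quotes are left as they are.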

+17

Utopia Aug 24 '15 at 8:46

We have a web application that had to send data to a legacy system that could only process the first 128 characters of the ASCII character set.

The solution we needed was something to "translate" as many characters as possible into their ASCII equivalents, but leave everything that cannot be translated alone.

Normally I would do something like this:

    <?php
    // transliterate
    if (function_exists('iconv')) {
        $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    }
    ?>

... but this replaces everything that cannot be translated with a question mark (?).

So we ended up doing the following. See the end of this function for a (commented-out) PHP regex that simply strips out non-ASCII characters.

 <?php public function cleanNonAsciiCharactersInString($orig_text) { $text = $orig_text; // Single letters $text = preg_replace("/[∂άαáàâãªä]/u", "a", $text); $text = preg_replace("/[∆ΛÁÀÂÃÄ]/u", "A", $text); $text = preg_replace("/[Ђ]/u", "b", $text); $text = preg_replace("/[β]/u", "B", $text); $text = preg_replace("/[çς©]/u", "c", $text); $text = preg_replace("/[Ç]/u", "C", $text); $text = preg_replace("/[δ]/u", "d", $text); $text = preg_replace("/[éèêëέëèε℮є]/u", "e", $text); $text = preg_replace("/[ÉÈÊË€ξЄ€∑]/u", "E", $text); $text = preg_replace("/[₣]/u", "F", $text); $text = preg_replace("/[Њњ]/u", "H", $text); $text = preg_replace("/[ђћЋ]/u", "h", $text); $text = preg_replace("/[ÍÌÎÏ]/u", "I", $text); $text = preg_replace("/[íìîïιίϊі]/u", "i", $text); $text = preg_replace("/[Јј]/u", "j", $text); $text = preg_replace("/[ΚЌ]/u", 'K', $text); $text = preg_replace("/[ќ]/u", 'k', $text); $text = preg_replace("/[ℓ∟]/u", 'l', $text); $text = preg_replace("/[]/u", "M", $text); $text = preg_replace("/[ñηήηπⁿ]/u", "n", $text); $text = preg_replace("/[Ñ∏Ν]/u", "N", $text); $text = preg_replace("/[óòôõºöοσό]/u", "o", $text); $text = preg_replace("/[ÓÒÔÕÖθΩθΩ]/u", "O", $text); $text = preg_replace("/[ρφ]/u", "p", $text); $text = preg_replace("/[®]/u", "R", $text); $text = preg_replace("/[Ѓѓ]/u", "r", $text); $text = preg_replace("/[Ѕ]/u", "S", $text); $text = preg_replace("/[ѕ]/u", "s", $text); $text = preg_replace("/[]/u", "T", $text); $text = preg_replace("/[τ†‡]/u", "t", $text); $text = preg_replace("/[úùûüџμΰµυϋύ]/u", "u", $text); $text = preg_replace("/[√]/u", "v", $text); $text = preg_replace("/[ÚÙÛÜЏ]/u", "U", $text); $text = preg_replace("/[Ψψωώẅẃẁ]/u", "w", $text); $text = preg_replace("/[ẀẄẂ]/u", "W", $text); $text = preg_replace("/[Χχ]/u", "x", $text); $text = preg_replace("/[ỲΫ¥]/u", "Y", $text); $text = preg_replace("/[ỳγўЎ]/u", "y", $text); $text = preg_replace("/[ζ]/u", "Z", $text); // Punctuation $text = preg_replace("/[‚‚]/u", ",", $text); $text 
= preg_replace("/[`‛′'']/u", "'", $text); $text = preg_replace("/[″""«»„]/u", '"', $text); $text = preg_replace("/[—–―−–‾⌐─↔→←]/u", '-', $text); $text = preg_replace("/[ ]/u", ' ', $text); $text = str_replace("…", "...", $text); $text = str_replace("≠", "!=", $text); $text = str_replace("≤", "<=", $text); $text = str_replace("≥", ">=", $text); $text = preg_replace("/[‗≈≡]/u", "=", $text); // Exciting combinations $text = str_replace("", "bl", $text); $text = str_replace("℅", "c/o", $text); $text = str_replace("₧", "Pts", $text); $text = str_replace("™", "tm", $text); $text = str_replace("№", "No", $text); $text = str_replace("", "4", $text); $text = str_replace("‰", "%", $text); $text = preg_replace("/[∙•]/u", "*", $text); $text = str_replace("‹", "<", $text); $text = str_replace("›", ">", $text); $text = str_replace("‼", "!!", $text); $text = str_replace("⁄", "/", $text); $text = str_replace("∕", "/", $text); $text = str_replace("⅞", "7/8", $text); $text = str_replace("⅝", "5/8", $text); $text = str_replace("⅜", "3/8", $text); $text = str_replace("⅛", "1/8", $text); $text = preg_replace("/[‰]/u", "%", $text); $text = preg_replace("/[Љљ]/u", "Ab", $text); $text = preg_replace("/[]/u", "IO", $text); $text = preg_replace("/[ﬁﬂ]/u", "fi", $text); $text = preg_replace("/[]/u", "3", $text); $text = str_replace("£", "(pounds)", $text); $text = str_replace("₤", "(lira)", $text); $text = preg_replace("/[‰]/u", "%", $text); $text = preg_replace("/[↨↕↓↑│]/u", "|", $text); $text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text); //2) Translation CP1252. 
$trans = get_html_translation_table(HTML_ENTITIES); $trans['f'] = '&fnof;'; // Latin Small Letter F With Hook $trans['-'] = array( '&hellip;', // Horizontal Ellipsis '&tilde;', // Small Tilde '&ndash;' // Dash ); $trans["+"] = '&dagger;'; // Dagger $trans['#'] = '&Dagger;'; // Double Dagger $trans['M'] = '&permil;'; // Per Mille Sign $trans['S'] = '&Scaron;'; // Latin Capital Letter S With Caron $trans['OE'] = '&OElig;'; // Latin Capital Ligature OE $trans["'"] = array( '&lsquo;', // Left Single Quotation Mark '&rsquo;', // Right Single Quotation Mark '&rsaquo;', // Single Right-Pointing Angle Quotation Mark '&sbquo;', // Single Low-9 Quotation Mark '&circ;', // Modifier Letter Circumflex Accent '&lsaquo;' // Single Left-Pointing Angle Quotation Mark ); $trans['"'] = array( '&ldquo;', // Left Double Quotation Mark '&rdquo;', // Right Double Quotation Mark '&bdquo;', // Double Low-9 Quotation Mark ); $trans['*'] = '&bull;'; // Bullet $trans['n'] = '&ndash;'; // En Dash $trans['m'] = '&mdash;'; // Em Dash $trans['tm'] = '&trade;'; // Trade Mark Sign $trans['s'] = '&scaron;'; // Latin Small Letter S With Caron $trans['oe'] = '&oelig;'; // Latin Small Ligature OE $trans['Y'] = '&Yuml;'; // Latin Capital Letter Y With Diaeresis $trans['euro'] = '&euro;'; // euro currency symbol ksort($trans); foreach ($trans as $k => $v) { $text = str_replace($v, $k, $text); } // 3) remove <p>, <br/> ... $text = strip_tags($text); // 4) &amp; => & &quot; => ' $text = html_entity_decode($text); // transliterate // if (function_exists('iconv')) { // $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text); // } // remove non ascii characters // $text = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text); return $text; } ?>
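
The core idea of the function above (map what you know, leave the rest alone) can also be sketched with a single lookup table and strtr(). This is a simplified illustration, not the author's code: the constant ASCII_MAP and the function to_ascii_where_known are hypothetical names, the table is deliberately tiny, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical, deliberately tiny mapping table; extend it as needed.
// Assumes this source file is saved as UTF-8.
const ASCII_MAP = [
    'á' => 'a', 'é' => 'e', 'ñ' => 'n', 'Ç' => 'C',
    '…' => '...', '≤' => '<=', '™' => 'tm', '—' => '-',
];

function to_ascii_where_known(string $text): string
{
    // strtr() with an array replaces every key it finds in a single pass
    // and leaves all other bytes untouched.
    return strtr($text, ASCII_MAP);
}

// "Ω" is not in the table, so it is left alone.
echo to_ascii_where_known('Señor café… ≤ 5™ Ω'); // Senor cafe... <= 5tm Ω
```

A single table avoids compiling one regular expression per target letter, and already-replaced text is never scanned again.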

+15

Silas Palmer Jul 24. '14 at 4:35

I also think that using a regular expression might be the best solution.

Here is my suggestion:

    function convert_to_normal_text($text) {
        $normal_characters = "a-zA-Z0-9\s`~!@#$%^&*()_+-={}|:;<>?,.\/\"\'\\\[\]";
        $normal_text = preg_replace("/[^$normal_characters]/", '', $text);
        return $normal_text;
    }

Then you can use it as follows:

    $before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+ŤĎ and some non-ASCII characters: Ąąśćł.';
    $after = convert_to_normal_text($before);
    echo $after;

Output:

 Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .

+2

simhumileco Aug 17 '16 at 12:20

I just needed to add a header:

 header('Content-Type: text/html; charset=UTF-8');
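
One common reason a stray "Â" shows up is UTF-8 text being read as ISO-8859-1. The sketch below reproduces that misreading for a non-breaking space (U+00A0), on the assumption that this is what happened in the question; the variable names are illustrative only.

```php
<?php
// A non-breaking space (U+00A0) encoded as UTF-8 is the two bytes C2 A0.
$nbsp = "\xC2\xA0";

// Simulate a reader that wrongly treats each byte as an ISO-8859-1 character
// and re-encodes the result as UTF-8.
$misread = iconv('ISO-8859-1', 'UTF-8', $nbsp);

// Byte C2 becomes "Â" (C3 82) and byte A0 becomes a non-breaking space (C2 A0).
echo bin2hex($misread); // c382c2a0
```

Declaring the charset in the Content-Type header tells the browser to decode the bytes as UTF-8, so no stripping is needed in that case.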

+1

ALHaines Sep 10 '13 at 16:24

This should be pretty straightforward and does not need the iconv function:

    // Remove all characters that are not the separator, a-z, 0-9, underscore, or whitespace
    $string = preg_replace('![^'.preg_quote('-').'a-z0-9_\s]+!', '', strtolower($string));
    // Replace all separator characters and whitespace by a single separator
    $string = preg_replace('!['.preg_quote('-').'\s]+!u', '-', $string);

0

Goran Jakovljevic Mar 13 '15 at 7:30

I think the best way to do something like this is to use the ord() function. This way you can keep characters written in any language. Just remember to check the encoding of your text first: this works on single-byte character sets and will not work with Unicode.

    $name = "βγδεζηΘKgfgebhjrf!@#$%^&";

    // This function removes all non-Greek and non-English characters (Greek ISO charset)
    function replace_characters($string) {
        $new_string = ''; // initialize to avoid an undefined-variable notice
        $str_length = strlen($string);
        for ($x = 0; $x < $str_length; $x++) {
            $character = $string[$x];
            $code = ord($character);
            if (($code > 64 && $code < 91) ||
                ($code > 96 && $code < 123) ||
                ($code > 192 && $code < 210) ||
                ($code > 210 && $code < 218) ||
                ($code > 219 && $code < 250) ||
                $code == 252 || $code == 254) {
                $new_string = $new_string . $character;
            }
        }
        return $new_string;
    } // end function

    $name = replace_characters($name);
    echo $name;

0

websolutions.gr Apr 25 '15 at 12:56

Chris Bornhoft · Accepted Answer · 2012-01-08 22:34

A regular expression replacement would be a better option. Using $str as an example string and matching it against [:print:], which is a POSIX character class:

    $str = 'aAÂ';
    $str = preg_replace('/[[:^print:]]/', '', $str); // should be aA

The negated class [:^print:] matches all non-printable characters. Any characters that are not part of the current character set will be deleted.

Note: Before using this method, you must ensure that your current character set is ASCII. POSIX character classes support both ASCII and Unicode and will only match according to the current character set. Starting with PHP 5.6, the default encoding is UTF-8.
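
If relying on the current character set is a concern, an explicit byte range is an alternative that does not depend on how POSIX classes are interpreted. This is a minimal sketch, not the accepted answer's method; the helper names is_ascii and strip_non_ascii are hypothetical.

```php
<?php
// Hypothetical helper names; not from the original answer.

// True when the string contains only bytes in the 7-bit ASCII range.
function is_ascii(string $s): bool
{
    return preg_match('/[^\x00-\x7F]/', $s) === 0;
}

// Removes every byte outside the 7-bit ASCII range. Without the /u modifier
// the pattern is applied byte by byte, so multibyte UTF-8 characters are
// removed completely.
function strip_non_ascii(string $s): string
{
    return preg_replace('/[^\x00-\x7F]/', '', $s) ?? '';
}

$str = "aA\xC3\x82"; // "aAÂ" encoded as UTF-8
var_dump(is_ascii($str));   // bool(false)
echo strip_non_ascii($str); // aA
```

Unlike the printable-only range \x20-\x7E, this version keeps control characters such as tabs and newlines.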
