Normalize British and American English for Elasticsearch

Is there a best practice for normalizing British and American English in Elasticsearch?

Using the Synonym Token Filter would require an incredibly long configuration file. Several thousand words are spelled differently in the UK and the USA, and it is almost impossible to find a truly comprehensive list. Here is a list of nearly 2,000 words, but it is far from complete.

Ideally, I would like to create an ES analyzer/filter with rules for converting US spellings to UK spellings. This may be the best approach, but I don't know where to start: what type of filters do I need? It does not have to cover everything; it just has to normalize most search terms. For example: "gray" - "grey", "color" - "colour", "center" - "centre", etc.

2 answers

This is the approach I settled on after some time. It is a combination of basic rules, "fixes", and synonyms. First, use a char_filter to apply a set of basic spelling rules. It is not 100% correct, but it does a good job:

"char_filter": { "en_char_filter": { "type": "mapping", "mappings": [ # fixes "aerie=>axerie", "aeroplane=>airplane", "aloe=>aloxe", "canoe=>canoxe", "coerce=>coxerce", "poem=>poxem", "prise=>prixse", # whole words "armour=>armor", "behaviour=>behavior", "centre=>center" "colour=>color", "clamour=>clamor", "draught=>draft", "endeavour=>endeavor", "favour=>favor", "flavour=>flavor", "harbour=>harbor", "honour=>honor", "humour=>humor", "labour=>labor", "litre=>liter", "metre=>meter", "mould=>mold", "neighbour=>neighbor", "plough=>plow", "saviour=>savior", "savour=>savor", # generic transformations "ae=>e", "ction=>xion", "disc=>disk", "gramme=>gram", "isable=>izable", "isation=>ization", "ise=>ize", "ising=>izing", "ll=>l", "oe=>e", "ogue=>og", "sation=>zation", "yse=>yze", "ysing=>yzing" ] } } 

The "fixes" entries are intended to prevent misapplication of the other rules. For example, "prise=>prixse" prevents "prise" from becoming "prize", which has a different meaning. You may need to adapt this to suit your needs.
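To see why the "fixes" work, here is a minimal Python sketch that imitates the mapping char filter outside Elasticsearch. It assumes the filter scans left to right and, at each position, applies the longest matching key without re-scanning its own output. The helper names `apply_mappings` and `parse_rules` are made up for this illustration, and only a handful of the rules above are included.

```python
def parse_rules(rules):
    """Turn "from=>to" strings into a {from: to} dict."""
    return dict(rule.split("=>", 1) for rule in rules)


def apply_mappings(text, mappings):
    """Rewrite text left to right, preferring the longest key at each position.

    This is a simplified model of a mapping char filter, not the real thing.
    """
    keys = sorted(mappings, key=len, reverse=True)  # try longest keys first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mappings[key])
                i += len(key)  # skip the matched input; output is not re-scanned
                break
        else:
            out.append(text[i])  # no rule matched here, copy the character
            i += 1
    return "".join(out)


RULES = ["prise=>prixse", "colour=>color", "isation=>ization", "ise=>ize"]
with_fix = parse_rules(RULES)
without_fix = parse_rules(RULES[1:])  # same rules minus the "prise" fix

print(apply_mappings("prise", with_fix))                # prixse
print(apply_mappings("prise", without_fix))             # prize
print(apply_mappings("colour organisation", with_fix))  # color organization
```

Because the longer key "prise" wins over "ise" at the position where it starts, the word is turned into a harmless placeholder instead of "prize". Note that the same rule also rewrites "prise" inside longer words, which is fine as long as indexing and searching use the same analyzer.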

Then add a synonym filter to catch the most common exceptions:

 "en_synonym_filter": { "type": "synonym", "synonyms": EN_SYNONYMS } 

Here is our list of synonyms, which includes the most important keywords for our use case. You might want to adapt this list to your needs:

EN_SYNONYMS = (
    "accolade, prize => award", "accoutrement => accouterment", "aching, pain => hurt",
    "acw, anticlockwise, counterclockwise, counter-clockwise => ccw", "adaptor => adapter",
    "advocate, attorney, barrister, procurator, solicitor => lawyer", "ageing => aging",
    "agendas, agendum => agenda", "almanack => almanac", "aluminium => aluminum",
    "america, united states, usa", "amphitheatre => amphitheater",
    "anti-aliased, anti-aliasing => antialiased", "arbour => arbor", "ardour => ardor",
    "arse => ass", "artefact => artifact", "aubergine => eggplant", "automobile, motorcar => car",
    "axe => ax", "bannister => banister", "barbecue => bbq", "battleaxe => battleax",
    "baulk => balk", "beetroot => beet", "biassed => biased", "biassing => biasing",
    "biscuit => cookie", "black american, african american, afro-american, negro",
    "bobsleigh => bobsled", "bonnet => hood", "bulb, electric bulb, light bulb, lightbulb",
    "burned => burnt", "bussines, bussiness => business",
    "business man, business people, businessman", "business woman, business people, businesswoman",
    "bussing => busing", "cactus, cactuses => cacti", "calibre => caliber", "candour => candor",
    "candy floss, candyfloss, cotton candy",
    "car park, parking area, parking ground, parking lot, parking-lot, parking place, parking",
    "carburettor => carburetor", "castor => caster", "cataloguing => cataloging",
    "catboat, sailboat, sailing boat", "champion, gainer, victor, win, winner => victory",
    "chat => talk", "chequebook => checkbook", "chequer => checker",
    "chequerboard => checkerboard", "chequered => checkered",
    "christmas tree ball, christmas tree ball ornament, christmas ball ornament, christmas bauble",
    "christmas, x-mas => xmas", "cinema => movies", "clangour => clangor",
    "clarinettist => clarinetist", "conditioning => conditioner", "conference => meeting",
    "coriander => cilantro", "corporate => company", "cosmos, universe => outer space",
    "cosy, cosiness => cozy", "criminal => crime", "curriculums => curricula",
    "cypher => cipher", "daddy, father, pa, papa => dad", "defence => defense",
    "defenceless => defenseless", "demeanour => demeanor",
    "departure platform, station platform, train platform, train station",
    "dishrag => dish cloth", "dishtowel, dishcloth => dish towel", "doughnut => donut",
    "downspout => drainpipe", "drugstore => pharmacy", "e-mail => email",
    "enamoured => enamored", "england => britain", "english => british", "epaulette => epaulet",
    "exercise, excercise, training, workout => fitness", "expressway, motorway, highway => freeway",
    "facebook => facebook, social media", "fanny => buttocks", "fanny pack => bum bag",
    "farmyard => barnyard", "faucet => tap", "fervour => fervor", "fibre => fiber",
    "fibreglass => fiberglass", "flashlight => torch", "flautist => flutist", "flier => flyer",
    "flower fly, hoverfly, syrphid fly, syrphus fly", "foot-walk, sidewalk, sideway => pavement",
    "football, soccer", "forums => fora", "fourth => 4", "freshman => fresher",
    "chips, fries, french fries", "gaol => jail", "gaolbird => jailbird",
    "gaolbreak => jailbreak", "gaoler => jailer", "garbage, rubbish => trash",
    "gasoline => petrol", "gases, gasses", "gauge => gage", "gauged => gaged",
    "gauging => gaging", "gipsy, gipsies, gypsies => gypsy", "glamour => glamor",
    "glueing => gluing", "gravesite, sepulchre, sepulture => sepulcher", "grey => gray",
    "greyish => grayish", "greyness => grayness", "groyne => groin",
    "gryphon, griffon => griffin", "hand shake, shake hands, shaking hands, handshake",
    "haulier => hauler", "hobo, homeless, tramp => bum",
    "new year, new year eve, hogmanay, silvester, sylvester", "holiday => vacation",
    "holidaymaker, holiday-maker, vacationer, vacationist => tourist", "homosexual, fag => gay",
    "inbox, letterbox, outbox, postbox => mailbox",
    "independence day, 4th of july, fourth of july, july 4th, july 4, 4th july, july fourth, forth of july, 4 july, fourth july, 4th july",
    "infant, suckling, toddler => baby", "infeasible => unfeasible",
    "inquire, inquiry => enquire", "insure => ensure", "internet, website => www",
    "jelly => jam", "jewelery, jewellery => jewelry", "jogging => running", "journey => travel",
    "judgement => judgment", "kerb => curb", "kiwifruit => kiwi", "laborer => worker",
    "lacklustre => lackluster", "ladybeetle, ladybird, ladybug => ladybird beetle",
    "larrikin, scalawag, rascal, scallywag => naughty boy", "leaf => leaves",
    "licence, licenced, licencing => license", "liquorice => licorice", "lorry => truck",
    "loupe, magnifier, magnifying, magnifying glass, magnifying lens, zoom",
    "louvred => louvered", "louvres => louver", "lustre => luster", "mail => post",
    "mailman => postman", "marriage, married, marry, marrying, wedding => wed",
    "mayonaise => mayo", "meagre => meager", "misdemeanour => misdemeanor", "mitre => miter",
    "mom, momma, mummy, mother => mum", "moonlight => moon light", "moult => molt",
    "moustache, moustached => mustache", "nappy => diaper", "nightlife => night life",
    "normalcy => normality", "octopus => kraken", "odour => odor", "odourless => odorless",
    "offence => offense", "omelette => omelet", "# fix torres del paine", "paine => painee",
    "pajamas => pyjamas", "pantyhose => tights", "parenthesis, parentheses => bracket",
    "parliament => congress", "parlour => parlor", "persnickety => pernickety",
    "philtre => filter", "phoney => phony", "popsicle => iced-lolly", "porch => veranda",
    "pretence => pretense", "pullover, jumper => sweater", "pyjama => pajama",
    "railway => railroad", "rancour => rancor", "rappel => abseil",
    "row house, serial house, terrace house, terraced house, terraced housing, town house",
    "rigour => rigor", "rumour => rumor", "sabre => saber", "saltpetre => saltpeter",
    "sanitarium => sanatorium", "santa, santa claus, st nicholas, st nicholas day",
    "sceptic, sceptical, scepticism, sceptics => skeptic", "sceptre => scepter",
    "shaikh, sheikh => sheik", "shivaree => charivari", "silverware, flatware => cutlery",
    "simultaneous => simultanous", "sleigh => sled",
    "smoulder, smouldering => smolder", "sombre => somber", "speciality => specialty",
    "spectre => specter", "splendour => splendor", "spoilt => spoiled", "street => road",
    "streetcar, tramway, tram => trolley-car", "succour => succor",
    "sulphate, sulphide, sulphur, sulphurous, sulfurous => sulfur",
    "super hero, superhero => hero", "surname => last name", "sweets => candy",
    "syphon => siphon", "syphoning => siphoning", "tack, thumb-tack, thumbtack => drawing pin",
    "tailpipe => exhaust pipe", "taleban => taliban", "teenager => teen", "television => tv",
    "thank you, thanks", "theatre => theater", "tickbox => checkbox", "ticked => checked",
    "timetable => schedule", "tinned => canned", "titbit => tidbit", "toffee => taffy",
    "tonne => ton", "transportation => transport", "trapezium => trapezoid",
    "trousers => pants", "tumour => tumor", "twitter => twitter, social media", "tyre => tire",
    "tyres => tires", "undershirt => singlet", "university => college", "upmarket => upscale",
    "valour => valor", "vapour => vapor", "vigour => vigor", "waggon => wagon",
    "windscreen, windshield => front shield", "world championship, world cup, worldcup",
    "worshipper, worshipping => worshiping", "yoghourt, yoghurt => yogurt",
    "zip, zip code, postal code, postcode", "zucchini => courgette"
)
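For orientation, here is one way the two pieces might be wired into index settings, sketched as a Python dict. The answer does not show the analyzer itself, so the analyzer name `en_analyzer`, the `standard` tokenizer, the `lowercase` filter, and the helper `build_settings` are all assumptions for illustration. The mappings and synonyms are truncated stand-ins for the full lists.

```python
import json

# Truncated stand-ins for the full mapping and synonym lists.
EN_CHAR_MAPPINGS = ["colour=>color", "centre=>center", "ise=>ize"]
EN_SYNONYMS = ("aluminium => aluminum", "grey => gray", "football, soccer")


def build_settings(char_mappings, synonyms, analyzer_name="en_analyzer"):
    """Assemble index settings that chain the char filter and the synonym filter.

    The analyzer name, tokenizer, and lowercase filter are illustrative choices.
    """
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    "en_char_filter": {"type": "mapping", "mappings": list(char_mappings)}
                },
                "filter": {
                    "en_synonym_filter": {"type": "synonym", "synonyms": list(synonyms)}
                },
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        # char filters run on the raw text, before tokenization
                        "char_filter": ["en_char_filter"],
                        "tokenizer": "standard",
                        # token filters run in order, after tokenization
                        "filter": ["lowercase", "en_synonym_filter"],
                    }
                },
            }
        }
    }


settings = build_settings(EN_CHAR_MAPPINGS, EN_SYNONYMS)
print(json.dumps(settings, indent=2))
```

Two things follow from this ordering. The synonym filter only ever sees text that the char filter has already rewritten, so the left-hand side of each synonym rule has to be written in its post-mapping form. And because the char filter runs before lowercasing, the mappings are case-sensitive, so capitalized input such as "Colour" would pass through unchanged in this setup.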

I realize this answer deviates somewhat from the OP's original question, but if you just want to normalize US/British spelling, there is a manageably sized list here (~1,700 replacements): http://www.tysto.com/uk-us-spelling-list.html. I am sure there are others you could use to build a consolidated master list.
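Building such a master list could look roughly like the Python sketch below. It assumes each source list has already been loaded as (british, american) pairs; the sample pairs, the helper name `build_spelling_rules`, and the "first list wins" conflict policy are illustrative choices, not part of any existing tool.

```python
def build_spelling_rules(*pair_lists):
    """Merge (british, american) pairs into sorted "british => american" rules."""
    merged = {}
    for pairs in pair_lists:
        for british, american in pairs:
            british = british.strip().lower()
            american = american.strip().lower()
            # Skip empty entries and words spelled the same in both variants.
            if british and american and british != american:
                merged.setdefault(british, american)  # first list wins on conflict
    return [f"{uk} => {us}" for uk, us in sorted(merged.items())]


# Hypothetical sample data standing in for two downloaded spelling lists.
LIST_A = [("colour", "color"), ("centre", "center")]
LIST_B = [("Grey", "gray"), ("colour", "color"), ("program", "program")]

rules = build_spelling_rules(LIST_A, LIST_B)
print(rules)  # ['centre => center', 'colour => color', 'grey => gray']
```

The resulting strings use the same "a => b" format as a synonym filter's rule list, so they could be passed to one directly.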

Beyond spelling changes, you must be very careful not to blindly replace words in isolation with their (alleged!) American English counterparts. I would advise against all but the most clear-cut lexical substitutions. For example, I see nothing wrong with this one:

"counterclockwise, counterclockwise, counterclockwise => counterclockwise"

but this one

"hobo, homeless, tramp => bum"

will index "homeless man" as *"bum man", which is nonsense. (Not to mention that hobos, homeless people, and "tramps" are quite different: http://knowledgenuts.com/2014/11/26/the-difference-between-hobos-tramps-and-bums/ .)
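The problem is easy to reproduce with a rough Python model of explicit synonym rules. The sketch assumes whitespace tokenization, lowercasing, and single-word terms only; `parse_single_word_rules` and `rewrite` are hypothetical helpers, and "tramp steamer" is an extra phrase added for illustration.

```python
def parse_single_word_rules(rules):
    """Map each left-hand term of an "a, b => c" rule to its replacement."""
    table = {}
    for rule in rules:
        left, right = rule.split("=>")
        for term in left.split(","):
            table[term.strip()] = right.strip()
    return table


def rewrite(text, table):
    """Replace every token that has a rule, regardless of its context."""
    return " ".join(table.get(token, token) for token in text.lower().split())


table = parse_single_word_rules(["hobo, homeless, tramp => bum"])
print(rewrite("Homeless man", table))   # bum man
print(rewrite("tramp steamer", table))  # bum steamer
```

Each token is replaced without regard to its neighbours, so an adjective or the first half of a compound gets rewritten as if it were the standalone noun.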

In general, beyond spelling, the dialectal differences between American and British English are complex and cannot be reduced to simple word lists.

P.S. If you really want to do it right (i.e., take grammatical context into account), you probably need a context-sensitive paraphrase model to "translate" British into American English (or vice versa, depending on your needs) before the text ever hits the ES index. This could be done (with sufficient parallel data) using an off-the-shelf statistical translation model or, possibly, even custom software using natural language parsing, POS tagging, chunking, etc.


Source: https://habr.com/ru/post/970923/

