This is the approach I settled on after some time. It combines ground rules, "fixes", and synonyms. First, use a char_filter to define a set of ground rules for spelling. It is not 100% correct, but it does a good job:
"char_filter": { "en_char_filter": { "type": "mapping", "mappings": [ # fixes "aerie=>axerie", "aeroplane=>airplane", "aloe=>aloxe", "canoe=>canoxe", "coerce=>coxerce", "poem=>poxem", "prise=>prixse", # whole words "armour=>armor", "behaviour=>behavior", "centre=>center" "colour=>color", "clamour=>clamor", "draught=>draft", "endeavour=>endeavor", "favour=>favor", "flavour=>flavor", "harbour=>harbor", "honour=>honor", "humour=>humor", "labour=>labor", "litre=>liter", "metre=>meter", "mould=>mold", "neighbour=>neighbor", "plough=>plow", "saviour=>savior", "savour=>savor", # generic transformations "ae=>e", "ction=>xion", "disc=>disk", "gramme=>gram", "isable=>izable", "isation=>ization", "ise=>ize", "ising=>izing", "ll=>l", "oe=>e", "ogue=>og", "sation=>zation", "yse=>yze", "ysing=>yzing" ] } }
The list of "fixes" is there to prevent the generic rules from being misapplied. For instance, "prise=>prixse" prevents the "ise=>ize" rule from turning "prise" into "prize", which has a different meaning. You may need to adapt this to suit your needs.
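To see why a "fix" entry shields a word from a generic rule, here is a rough Python simulation of how a mapping char filter picks rules. It assumes greedy matching (the longest key that matches at a given position wins, and the replacement is not scanned again). The function name `apply_mappings` and the shortened `SAMPLE` list are made up for illustration and are not part of Elasticsearch.

```python
def apply_mappings(text, mappings):
    """Simulate a mapping char filter: at each position the longest
    matching key wins, and its replacement is not re-scanned."""
    rules = dict(m.split("=>") for m in mappings)
    keys = sorted(rules, key=len, reverse=True)  # try longest keys first
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # no rule matches here, so copy the character unchanged
            out.append(text[i])
            i += 1
    return "".join(out)


# Shortened, illustrative subset of the mappings above
SAMPLE = ["prise=>prixse", "ise=>ize", "colour=>color", "ll=>l"]

print(apply_mappings("organise", SAMPLE))  # organize
print(apply_mappings("prise", SAMPLE))     # prixse
print(apply_mappings("colour", SAMPLE))    # color
```

In this simulation the rules work on substrings, so "surprise" also becomes "surprixse". That should be harmless as long as the same char filter runs at both index time and search time.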
Then add a synonym filter to catch the most common exceptions:
"en_synonym_filter": { "type": "synonym", "synonyms": EN_SYNONYMS }
Here is our list of synonyms, which includes the most important keywords for our use case. You might want to adapt this list to your needs:
EN_SYNONYMS = (
    "accolade, prize => award", "accoutrement => accouterment", "aching, pain => hurt",
    "acw, anticlockwise, counterclockwise, counter-clockwise => ccw", "adaptor => adapter",
    "advocate, attorney, barrister, procurator, solicitor => lawyer", "ageing => aging",
    "agendas, agendum => agenda", "almanack => almanac", "aluminium => aluminum",
    "america, united states, usa", "amphitheatre => amphitheater",
    "anti-aliased, anti-aliasing => antialiased", "arbour => arbor", "ardour => ardor",
    "arse => ass", "artefact => artifact", "aubergine => eggplant", "automobile, motorcar => car",
    "axe => ax", "bannister => banister", "barbecue => bbq", "battleaxe => battleax",
    "baulk => balk", "beetroot => beet", "biassed => biased", "biassing => biasing",
    "biscuit => cookie", "black american, african american, afro-american, negro",
    "bobsleigh => bobsled", "bonnet => hood", "bulb, electric bulb, light bulb, lightbulb",
    "burned => burnt", "bussines, bussiness => business",
    "business man, business people, businessman",
    "business woman, business people, businesswoman", "bussing => busing",
    "cactus, cactuses => cacti", "calibre => caliber", "candour => candor",
    "candy floss, candyfloss, cotton candy",
    "car park, parking area, parking ground, parking lot, parking-lot, parking place, parking",
    "carburettor => carburetor", "castor => caster", "cataloguing => cataloging",
    "catboat, sailboat, sailing boat", "champion, gainer, victor, win, winner => victory",
    "chat => talk", "chequebook => checkbook", "chequer => checker",
    "chequerboard => checkerboard", "chequered => checkered",
    "christmas tree ball, christmas tree ball ornament, christmas ball ornament, christmas bauble",
    "christmas, x-mas => xmas", "cinema => movies", "clangour => clangor",
    "clarinettist => clarinetist", "conditioning => conditioner", "conference => meeting",
    "coriander => cilantro", "corporate => company", "cosmos, universe => outer space",
    "cosy, cosiness => cozy", "criminal => crime", "curriculums => curricula",
    "cypher => cipher", "daddy, father, pa, papa => dad", "defence => defense",
    "defenceless => defenseless", "demeanour => demeanor",
    "departure platform, station platform, train platform, train station",
    "dishrag => dish cloth", "dishtowel, dishcloth => dish towel", "doughnut => donut",
    "downspout => drainpipe", "drugstore => pharmacy", "e-mail => email",
    "enamoured => enamored", "england => britain", "english => british",
    "epaulette => epaulet", "exercise, excercise, training, workout => fitness",
    "expressway, motorway, highway => freeway", "facebook => facebook, social media",
    "fanny => buttocks", "fanny pack => bum bag", "farmyard => barnyard", "faucet => tap",
    "fervour => fervor", "fibre => fiber", "fibreglass => fiberglass", "flashlight => torch",
    "flautist => flutist", "flier => flyer", "flower fly, hoverfly, syrphid fly, syrphus fly",
    "foot-walk, sidewalk, sideway => pavement", "football, soccer", "forums => fora",
    "fourth => 4", "freshman => fresher", "chips, fries, french fries", "gaol => jail",
    "gaolbird => jailbird", "gaolbreak => jailbreak", "gaoler => jailer",
    "garbage, rubbish => trash", "gasoline => petrol", "gases, gasses", "gauge => gage",
    "gauged => gaged", "gauging => gaging", "gipsy, gipsies, gypsies => gypsy",
    "glamour => glamor", "glueing => gluing", "gravesite, sepulchre, sepulture => sepulcher",
    "grey => gray", "greyish => grayish", "greyness => grayness", "groyne => groin",
    "gryphon, griffon => griffin", "hand shake, shake hands, shaking hands, handshake",
    "haulier => hauler", "hobo, homeless, tramp => bum",
    "new year, new year eve, hogmanay, silvester, sylvester", "holiday => vacation",
    "holidaymaker, holiday-maker, vacationer, vacationist => tourist", "homosexual, fag => gay",
    "inbox, letterbox, outbox, postbox => mailbox",
    "independence day, 4th of july, fourth of july, july 4th, july 4, 4th july, july fourth, forth of july, 4 july, fourth july, 4th july",
    "infant, suckling, toddler => baby", "infeasible => unfeasible",
    "inquire, inquiry => enquire", "insure => ensure", "internet, website => www",
    "jelly => jam", "jewelery, jewellery => jewelry", "jogging => running",
    "journey => travel", "judgement => judgment", "kerb => curb", "kiwifruit => kiwi",
    "laborer => worker", "lacklustre => lackluster",
    "ladybeetle, ladybird, ladybug => ladybird beetle",
    "larrikin, scalawag, rascal, scallywag => naughty boy", "leaf => leaves",
    "licence, licenced, licencing => license", "liquorice => licorice", "lorry => truck",
    "loupe, magnifier, magnifying, magnifying glass, magnifying lens, zoom",
    "louvred => louvered", "louvres => louver", "lustre => luster", "mail => post",
    "mailman => postman", "marriage, married, marry, marrying, wedding => wed",
    "mayonaise => mayo", "meagre => meager", "misdemeanour => misdemeanor", "mitre => miter",
    "mom, momma, mummy, mother => mum", "moonlight => moon light", "moult => molt",
    "moustache, moustached => mustache", "nappy => diaper", "nightlife => night life",
    "normalcy => normality", "octopus => kraken", "odour => odor", "odourless => odorless",
    "offence => offense", "omelette => omelet", "# fix torres del paine", "paine => painee",
    "pajamas => pyjamas", "pantyhose => tights", "parenthesis, parentheses => bracket",
    "parliament => congress", "parlour => parlor", "persnickety => pernickety",
    "philtre => filter", "phoney => phony", "popsicle => iced-lolly", "porch => veranda",
    "pretence => pretense", "pullover, jumper => sweater", "pyjama => pajama",
    "railway => railroad", "rancour => rancor", "rappel => abseil",
    "row house, serial house, terrace house, terraced house, terraced housing, town house",
    "rigour => rigor", "rumour => rumor", "sabre => saber", "saltpetre => saltpeter",
    "sanitarium => sanatorium", "santa, santa claus, st nicholas, st nicholas day",
    "sceptic, sceptical, scepticism, sceptics => skeptic", "sceptre => scepter",
    "shaikh, sheikh => sheik", "shivaree => charivari", "silverware, flatware => cutlery",
    "simultaneous => simultanous", "sleigh => sled", "smoulder, smouldering => smolder",
    "sombre => somber", "speciality => specialty", "spectre => specter",
    "splendour => splendor", "spoilt => spoiled", "street => road",
    "streetcar, tramway, tram => trolley-car", "succour => succor",
    "sulphate, sulphide, sulphur, sulphurous, sulfurous => sulfur",
    "super hero, superhero => hero", "surname => last name", "sweets => candy",
    "syphon => siphon", "syphoning => siphoning", "tack, thumb-tack, thumbtack => drawing pin",
    "tailpipe => exhaust pipe", "taleban => taliban", "teenager => teen", "television => tv",
    "thank you, thanks", "theatre => theater", "tickbox => checkbox", "ticked => checked",
    "timetable => schedule", "tinned => canned", "titbit => tidbit", "toffee => taffy",
    "tonne => ton", "transportation => transport", "trapezium => trapezoid",
    "trousers => pants", "tumour => tumor", "twitter => twitter, social media",
    "tyre => tire", "tyres => tires", "undershirt => singlet", "university => college",
    "upmarket => upscale", "valour => valor", "vapour => vapor", "vigour => vigor",
    "waggon => wagon", "windscreen, windshield => front shield",
    "world championship, world cup, worldcup", "worshipper, worshipping => worshiping",
    "yoghourt, yoghurt => yogurt", "zip, zip code, postal code, postcode",
    "zucchini => courgette"
)
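When adapting a list this long, a quick sanity check can help spot terms that appear as input in more than one rule. The sketch below is only an illustration: `parse_rule` and `find_overlaps` are hypothetical helper names, and the parsing assumes the simple comma and `=>` syntax used above. Overlapping rules are not necessarily errors, but they are worth a second look.

```python
from collections import Counter


def parse_rule(rule):
    """Split a rule into (inputs, outputs). For an equivalence rule
    without "=>", every term counts as both input and output."""
    if "=>" in rule:
        left, right = rule.split("=>", 1)
    else:
        left = right = rule

    def terms(side):
        return [t.strip() for t in side.split(",") if t.strip()]

    return terms(left), terms(right)


def find_overlaps(rules):
    """Return the terms that act as an input in more than one rule."""
    counts = Counter()
    for rule in rules:
        if rule.lstrip().startswith("#"):
            continue  # skip comment-style entries
        inputs, _ = parse_rule(rule)
        counts.update(set(inputs))  # count each term once per rule
    return sorted(term for term, n in counts.items() if n > 1)


# Shortened, illustrative subset of the list above
SAMPLE = (
    "business man, business people, businessman",
    "business woman, business people, businesswoman",
    "grey => gray",
)

print(find_overlaps(SAMPLE))  # ['business people']
```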