How to read values from numbers written as words?

As we all know, numbers can be written either as numerals or spelled out by name. While there are plenty of examples of converting 123 into "one hundred twenty-three", I could not find good examples of converting in the other direction.

Some of the caveats are:

  • cardinal / nominal or ordinal: “one” and “first”
  • common misspellings: “forty” / “fourty”
  • hundreds / thousands: 2100 → "twenty one hundred", as well as "two thousand one hundred"
  • separators: “eleven hundred fifty two”, but also “elevenhundred fiftytwo” or “eleven-hundred fifty-two” and whatnot
  • colloquialisms: “thirty-something”
  • fractions: “one third”, “two fifths”
  • common names: 'dozen', 'half'

There are probably more caveats that are not listed yet. Assume the algorithm has to be very robust and even cope with spelling mistakes.

What fields / papers / studies / algorithms should I read to learn how to write all this? Where is the information?

PS: My final parser should understand 3 different languages: English, Russian and Hebrew. More languages may be added at a later stage. Hebrew also has masculine / feminine numbers: "one man" and "one woman" use different words for "one", "ehad" and "ahat". Russian has its own complexities too.

Google does a great job of this, for example:

http://www.google.com/search?q=two+thousand+and+one+hundred+plus+five+dozen+and+four+fifths+in+decimal

(the reverse is also possible http://www.google.com/search?q=999999999999+in+english )

+47
language-agnostic algorithm numbers parsing nlp
Sep 16 '08 at 7:47
12 answers

I was playing around with a PEG parser to do what you wanted (and may post that as a separate answer later) when I noticed that there is a very simple algorithm that does a remarkably good job with the regular forms of numbers in English, Spanish and German, at the very least.

For example, to work with English, you need a dictionary that explicitly maps words to values:

"one" -> 1, "two" -> 2, ... "twenty" -> 20, "dozen" -> 12, "score" -> 20, ... "hundred" -> 100, "thousand" -> 1000, "million" -> 1000000 

... etc.

Algorithm:

 total = 0
 prior = null
 for each word w
     v <- value(w) or next if no value defined
     prior <- case
         when prior is null:  v
         when prior > v:      prior+v
         else                 prior*v
     if w in {thousand,million,billion,trillion...}
         total <- total + prior
         prior <- null
 total = total + prior unless prior is null

For example, this happens as follows:

 total    prior    v         unconsumed string
     0        _              four score and seven
                   4         score and seven
     0        4
                   20        and seven
     0       80
                   _         seven
     0       80
                   7
     0       87
    87

 total     prior     v         unconsumed string
       0        _              two million four hundred twelve thousand eight hundred seven
                     2         million four hundred twelve thousand eight hundred seven
       0        2
                     1000000   four hundred twelve thousand eight hundred seven
 2000000        _
                     4         hundred twelve thousand eight hundred seven
 2000000        4
                     100       twelve thousand eight hundred seven
 2000000      400
                     12        thousand eight hundred seven
 2000000      412
                     1000      eight hundred seven
 2000000   412000
                     1000      eight hundred seven
 2412000        _
                     8         hundred seven
 2412000        8
                     100       seven
 2412000      800
                     7
 2412000      807
 2412807

And so on. I'm not saying it's perfect, but for something quick and dirty it does quite well.
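For readers who want to try the total/prior algorithm above, here is a minimal Python sketch of it. The names `VALUES`, `SCALES` and `words_to_number` are made up for this illustration, the dictionary is only a small English sample, and "billion" / "trillion" are assumed to use the American (short scale) values.

```python
# Small sample dictionary; a real one would hold many more entries.
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

VALUES = {word: n for n, word in enumerate(ONES)}
VALUES.update({word: 10 * n for n, word in enumerate(TENS, start=2)})
VALUES.update({"dozen": 12, "score": 20, "hundred": 100, "thousand": 10**3,
               "million": 10**6, "billion": 10**9, "trillion": 10**12})

# Words that close the current group and flush it into the total.
SCALES = {"thousand", "million", "billion", "trillion"}


def words_to_number(text):
    total = 0
    prior = None
    for w in text.lower().replace("-", " ").split():
        v = VALUES.get(w)
        if v is None:          # unknown word such as "and": skip it
            continue
        if prior is None:
            prior = v
        elif prior > v:        # "eighty" then "seven": add
            prior += v
        else:                  # "four" then "score": multiply
            prior *= v
        if w in SCALES:
            total += prior
            prior = None
    if prior is not None:
        total += prior
    return total


print(words_to_number("four score and seven"))
print(words_to_number("two million four hundred twelve thousand eight hundred seven"))
```

The two calls correspond to the two traces shown above.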




Addressing your specific list (added on edit):

  • cardinal / nominal or ordinal: “one” and “first” - just put them in the dictionary
  • english / british: "fourty" / "forty" - ditto
  • hundreds / thousands: 2100 → "twenty one hundred", and also "two thousand one hundred" - works as is
  • separators: “eleven hundred fifty two”, but also “elevenhundred fiftytwo” or “eleven-hundred fifty-two” and whatnot - for a start, just define the “next word” as the longest prefix that matches a defined word, or up to the next non-word if none do
  • colloquialisms: "thirty-something" - works
  • fractions: “one third”, “two fifths” - not yet ...
  • common names: "dozen", "half" - works; you can even do things like "half a dozen"
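The "longest prefix that matches a defined word" rule for separators can be sketched roughly as follows. This is only an illustration: `WORDS` and `split_number_words` are hypothetical names, the word list is a small English sample, and unknown text is simply passed through as its own token.

```python
import re

# Known number words, longest first so the greedy match prefers "seventeen" to "seven".
WORDS = sorted(
    ("zero one two three four five six seven eight nine ten eleven twelve "
     "thirteen fourteen fifteen sixteen seventeen eighteen nineteen "
     "twenty thirty forty fifty sixty seventy eighty ninety "
     "hundred thousand million billion").split(),
    key=len, reverse=True)


def split_number_words(text):
    tokens = []
    # Runs of letters are chunks; spaces, hyphens etc. act as separators.
    for chunk in re.findall(r"[a-z]+", text.lower()):
        pos = 0
        while pos < len(chunk):
            match = next((w for w in WORDS if chunk.startswith(w, pos)), None)
            if match is None:
                tokens.append(chunk[pos:])   # no known word here: keep the rest as-is
                break
            tokens.append(match)
            pos += len(match)
    return tokens


print(split_number_words("elevenhundred fiftytwo"))
print(split_number_words("eleven-hundred fifty-two"))
```

Both calls produce the same four tokens, which can then be fed to whatever word-to-value step follows.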

Number 6 is the only one I don't have a ready answer for, and that is because of the ambiguity between ordinals and fractions (at least in English), added to the fact that my last cup of coffee was many hours ago.

+42
Mar 17 '09 at 5:28

This is not an easy problem, and I don't know how to do it. I might sit down and try to write something like this. I would do it in Prolog, Java or Haskell. As far as I can see, there are several issues:

  • Tokenization: sometimes numbers are written "eleven hundred fifty two", but I have also seen "elevenhundred fiftytwo" or "eleven-hundred-fifty-two" and whatnot. One would have to survey which forms are actually in use. This can be especially tricky for Hebrew.
  • Spelling mistakes: that's not so hard. You have a limited set of words, and a bit of Levenshtein-distance magic should do the trick.
  • Alternative forms, as you already mentioned, exist. This includes ordinals / cardinals as well as "forty" / "fourty" and ...
  • ... common names or commonly used phrases and named entities. Would you want to extract 30 from the Thirty Years War or 2 from the Second World War?
  • Roman numerals too?
  • Colloquialisms such as "thirty-something" and "three Euro and shrapnel", which I wouldn't know how to treat.
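As a rough illustration of the "Levenshtein magic" mentioned in the spelling-mistakes item, the sketch below snaps a misspelled token to the nearest known number word. Everything here is an assumption made for the example: the names `levenshtein`, `NUMBER_WORDS` and `correct`, the sample word list, and the cut-off of one edit.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]


NUMBER_WORDS = ("zero one two three four five six seven eight nine ten eleven "
                "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
                "nineteen twenty thirty forty fifty sixty seventy eighty ninety "
                "hundred thousand million billion").split()


def correct(word, max_distance=1):
    """Return the closest known number word, or None if nothing is close enough."""
    word = word.lower()
    best = min(NUMBER_WORDS, key=lambda known: levenshtein(word, known))
    return best if levenshtein(word, best) <= max_distance else None


print(correct("fourty"))
print(correct("thousnd"))
print(correct("banana"))
```

A larger `max_distance` catches more typos but also starts confusing short words with each other, so the threshold is a tuning decision rather than a fixed rule.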

If you are interested in this, I could give it a shot this weekend. My idea is probably to use UIMA and tokenize with it, then go on to further tokenize / disambiguate, and finally translate. There may be more issues; let's see if I can come up with a few more interesting things.

Sorry, this is not a real answer, just an extension of your question. I will let you know if I find / write anything.

By the way, if you are interested in the semantics of numbers, I just found an interesting article from Friederike Moltmann, discussing some issues regarding the logical interpretation of numbers.

+11
Sep 18 '08 at 0:26

I have some code that I wrote a while ago: text2num. It does some of what you want, except that it does not handle ordinal numbers. I haven't actually used this code for anything, so it is largely untested!

+10
Sep 16 '08 at 7:52

Use the Python library pattern-en:

 >>> from pattern.en import number
 >>> number('two thousand fifty and a half')
 => 2050.5
+7
Aug 03 2018-11-11T00:

Keep in mind that the same number names mean different values in Europe and America.

European standard:

 One Thousand
 One Million
 One Thousand Millions (British also use Milliard)
 One Billion
 One Thousand Billions
 One Trillion
 One Thousand Trillions

Here is a small link to it.




An easy way to see the difference is as follows:

 (American counting Trillion) == (European counting Billion) 
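To make the difference concrete, here is a small sketch with one lookup table per convention. The names `SHORT_SCALE`, `LONG_SCALE` and `scale_value` are invented for this example; the values are the usual short-scale (American) and long-scale (traditional European) definitions.

```python
# Short scale: every new name is 1000 times the previous one.
SHORT_SCALE = {"million": 10**6, "billion": 10**9, "trillion": 10**12}

# Long scale: every new "-illion" is a million times the previous one,
# with "-illiard" names for the thousand-fold steps in between.
LONG_SCALE = {"million": 10**6, "milliard": 10**9, "billion": 10**12,
              "billiard": 10**15, "trillion": 10**18}


def scale_value(word, long_scale=False):
    table = LONG_SCALE if long_scale else SHORT_SCALE
    return table[word.lower()]


# The identity from the answer: an American trillion is a European billion.
print(scale_value("trillion") == scale_value("billion", long_scale=True))
```

A parser that has to serve several languages would probably need to pick the table per language or per locale rather than hard-coding one of them.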
+5
Mar 17 '09 at 14:06

Ordinal numbers are not applicable because they cannot be combined in meaningful ways with other numbers in the language (... at least in English)

e.g. one hundred and first, eleven second, etc.

However, there is another English / American caveat with the word "and"

i.e.

one hundred and one (English)
one hundred one (American)

In addition, there is the use of "a" to mean one in English

a thousand = one thousand

... On the other hand, the Google Calculator does an amazing job.

one hundred three thousand times the speed of light

And even...

two thousand one hundred and a dozen

... WTF?!? score plus a dozen in Roman numerals

+4
Mar 14 '09 at 0:56

Here is an extremely robust solution in Clojure.

AFAIK it is a unique implementation approach.

 ;----------------------------------------------------------------------
 ; numbers.clj
 ; written by: Mike Mattie codermattie@gmail.com
 ;----------------------------------------------------------------------
 (ns operator.numbers
   (:use compojure.core)
   (:require
     [clojure.string :as string] ))

 (def number-word-table {
   "zero"      0
   "one"       1
   "two"       2
   "three"     3
   "four"      4
   "five"      5
   "six"       6
   "seven"     7
   "eight"     8
   "nine"      9
   "ten"       10
   "eleven"    11
   "twelve"    12
   "thirteen"  13
   "fourteen"  14
   "fifteen"   15
   "sixteen"   16
   "seventeen" 17
   "eighteen"  18
   "nineteen"  19
   "twenty"    20
   "thirty"    30
   "forty"     40   ; correct spelling, missing from the original table
   "fourty"    40   ; common misspelling, kept from the original table
   "fifty"     50
   "sixty"     60
   "seventy"   70
   "eighty"    80
   "ninety"    90
 })

 (def multiplier-word-table {
   "hundred"  100
   "thousand" 1000
 })

 ; sum the values of plain number words ("twenty" "two" -> 22)
 (defn sum-words-to-number [ words ]
   (apply + (map (fn [ word ] (number-word-table word)) words)) )

 ; are you down with the sickness ?
 (defn words-to-number [ words ]
   (let
     [ n           (count words)

       ; [index value] pairs for every multiplier word in the input
       multipliers (filter
                     (fn [x] (not (false? x)))
                     (map-indexed
                       (fn [ i word ]
                         (if (contains? multiplier-word-table word)
                           (vector i (multiplier-word-table word))
                           false))
                       words) )

       x           (ref 0) ]

     (loop [ indices (reverse (conj (reverse multipliers) (vector n 1)))
             left    0
             combine + ]
       (let
         [ right (first indices) ]

         (dosync
           (alter x combine (* (if (> (- (first right) left) 0)
                                 (sum-words-to-number (subvec words left (first right)))
                                 1)
                               (second right)) ))

         (when (> (count (rest indices)) 0)
           (recur (rest indices) (inc (first right))
             (if (= (inc (first right)) (first (second indices)))
               *
               +))) ) )
     @x ))

Here are some examples.

 (operator.numbers/words-to-number ["six" "thousand" "five" "hundred" "twenty" "two"])
 (operator.numbers/words-to-number ["fifty" "seven" "hundred"])
 (operator.numbers/words-to-number ["hundred"])
+3
Nov 13 2018-11-11T16

My LPC implementation of some of your requirements (in English only):

 internal mapping inordinal = ([]);
 internal mapping number = ([]);

 #define Numbers ([\
     "zero"      : 0,  "one"       : 1,  "two"       : 2,  "three"     : 3,  \
     "four"      : 4,  "five"      : 5,  "six"       : 6,  "seven"     : 7,  \
     "eight"     : 8,  "nine"      : 9,  "ten"       : 10, "eleven"    : 11, \
     "twelve"    : 12, "thirteen"  : 13, "fourteen"  : 14, "fifteen"   : 15, \
     "sixteen"   : 16, "seventeen" : 17, "eighteen"  : 18, "nineteen"  : 19, \
     "twenty"    : 20, "thirty"    : 30, "forty"     : 40, "fifty"     : 50, \
     "sixty"     : 60, "seventy"   : 70, "eighty"    : 80, "ninety"    : 90, \
     "hundred"   : 100, \
     "thousand"  : 1000, \
     "million"   : 1000000, \
     "billion"   : 1000000000, \
 ])

 #define Ordinals ([\
     "zeroth"      : 0,  "first"       : 1,  "second"      : 2,  "third"       : 3,  \
     "fourth"      : 4,  "fifth"       : 5,  "sixth"       : 6,  "seventh"     : 7,  \
     "eighth"      : 8,  "ninth"       : 9,  "tenth"       : 10, "eleventh"    : 11, \
     "twelfth"     : 12, "thirteenth"  : 13, "fourteenth"  : 14, "fifteenth"   : 15, \
     "sixteenth"   : 16, "seventeenth" : 17, "eighteenth"  : 18, "nineteenth"  : 19, \
     "twentieth"   : 20, "thirtieth"   : 30, "fortieth"    : 40, "fiftieth"    : 50, \
     "sixtieth"    : 60, "seventieth"  : 70, "eightieth"   : 80, "ninetieth"   : 90, \
     "hundredth"   : 100, \
     "thousandth"  : 1000, \
     "millionth"   : 1000000, \
     "billionth"   : 1000000000, \
 ])

 varargs int denumerical(string num, status ordinal) {
     if(ordinal) {
         if(member(inordinal, num))
             return inordinal[num];
     } else {
         if(member(number, num))
             return number[num];
     }
     int sign = 1;
     int total = 0;
     int sub = 0;
     int value;
     string array parts = regexplode(num, " |-");
     if(sizeof(parts) >= 2 && parts[0] == "" && parts[1] == "-")
         sign = -1;
     for(int ix = 0, int iix = sizeof(parts); ix < iix; ix++) {
         string part = parts[ix];
         switch(part) {
         case "negative" :
         case "minus"    :
             sign = -1;
             continue;
         case ""         :
             continue;
         }
         if(ordinal && ix == iix - 1) {
             if(part[0] >= '0' && part[0] <= '9' && ends_with(part, "th"))
                 value = to_int(part[..<3]);
             else if(member(Ordinals, part))
                 value = Ordinals[part];
             else
                 continue;
         } else {
             if(part[0] >= '0' && part[0] <= '9')
                 value = to_int(part);
             else if(member(Numbers, part))
                 value = Numbers[part];
             else
                 continue;
         }
         if(value < 0) {
             sign = -1;
             value = - value;
         }
         if(value < 10) {
             if(sub >= 1000) {
                 total += sub;
                 sub = value;
             } else {
                 sub += value;
             }
         } else if(value < 100) {
             if(sub < 10) {
                 sub = 100 * sub + value;
             } else if(sub >= 1000) {
                 total += sub;
                 sub = value;
             } else {
                 sub *= value;
             }
         } else if(value < sub) {
             total += sub;
             sub = value;
         } else if(sub == 0) {
             sub = value;
         } else {
             sub *= value;
         }
     }
     total += sub;
     return sign * total;
 }
+2
Mar 17 '09 at 5:43

Well, I am late with my answer to this question, but I was working on a small test scenario that seemed to work very well for me. I used a (simple, but ugly and large) regular expression to locate all the words for me. The expression is as follows:

 (?<Value>(?:zero)|(?:one|first)|(?:two|second)|(?:three|third)|(?:four|fourth)|
 (?:five|fifth)|(?:six|sixth)|(?:seven|seventh)|(?:eight|eighth)|(?:nine|ninth)|
 (?:ten|tenth)|(?:eleven|eleventh)|(?:twelve|twelfth)|(?:thirteen|thirteenth)|
 (?:fourteen|fourteenth)|(?:fifteen|fifteenth)|(?:sixteen|sixteenth)|
 (?:seventeen|seventeenth)|(?:eighteen|eighteenth)|(?:nineteen|nineteenth)|
 (?:twenty|twentieth)|(?:thirty|thirtieth)|(?:forty|fortieth)|(?:fifty|fiftieth)|
 (?:sixty|sixtieth)|(?:seventy|seventieth)|(?:eighty|eightieth)|(?:ninety|ninetieth)|
 (?<Magnitude>(?:hundred|hundredth)|(?:thousand|thousandth)|(?:million|millionth)|
 (?:billion|billionth)))

The line breaks are shown here for formatting purposes only.

Anyway, my method was to execute this RegEx with a library such as PCRE and then read back the named matches. It worked on all of the different examples listed in this question, minus the "one half" types, as I did not add them, but, as you can see, it would not be hard to do so. This addresses a lot of issues. For example, it addresses the following items from the original question and other answers:

  • cardinal / nominal or ordinal: “one” and “first”
  • common spelling errors: "forty" / "fourty" (note that this is NOT EXPLICITLY addressed here; it is something you would want to handle before passing the string to this parser. This parser sees this example as "FOUR"...)
  • hundreds / thousands: 2100 → "twenty one hundred", as well as "two thousand one hundred"
  • separators: “eleven hundred fifty two”, but also “elevenhundred fiftytwo” or “eleven-hundred fifty-two” and whatnot
  • colloquialisms: “thirty-something” (this is also NOT FULLY addressed, as what IS "something"? Well, this code simply reads this number as "30").

Now, instead of storing this regular expression monster in your source, I was considering creating this RegEx at runtime using something like the following:

 char *ones[] = {"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
                 "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
                 "seventeen", "eighteen", "nineteen"};
 char *tens[] = {"", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
 char *ordinalones[] = { "", "first", "second", "third", "fourth", "fifth", "", "", "", "", "", "", "twelfth" };
 char *ordinaltens[] = { "", "", "twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth",
                         "seventieth", "eightieth", "ninetieth" };

and so on...

The easy part here is that we only store the words that matter. In the case of SIXTH, you will notice that there is no entry for it, because it is just the normal number with TH tacked on ... But ones like TWELVE need different attention.

So, now we have the code to build our (ugly) RegEx, and we just execute it on our number strings.

One thing I would recommend is to filter out, or eat, the word "AND". It is not necessary, and it only leads to other issues.

So, you need to set up a function that passes the named matches for "Magnitude" to a function that looks at all the possible magnitude values and multiplies your current result by that magnitude. Then you create a function that looks at the "Value" named matches and returns an int (or whatever you are using) based on the value found there.

All VALUE matches are ADDED to your result, while magnitude matches multiply the result by the magnitude value. So, "Two hundred fifty thousand" becomes "2", then "2 * 100", then "200 + 50", then "250 * 1000", ending up with 250,000 ...

Just for fun, I wrote a vbScript version of this, and it did a great job with all the examples above. VBScript does not support named matches, so I had to work a bit harder to get the correct result, but I got it. Bottom line: if it's a "VALUE" match, add it to your accumulator. If it's a magnitude match, multiply your accumulator by 100, 1000, 1,000,000, 1,000,000,000, etc. This will give you some pretty amazing results, and all you have to do to handle things like "one half" is add them to your RegEx, put in a code token for them, and handle them.
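The add-for-VALUE, multiply-for-magnitude rule described above might look roughly like this in Python. It is only a sketch under several assumptions: the pattern is built at runtime from small sample tables instead of being the exact expression shown earlier, ordinals are left out, Python's `(?P<name>...)` spelling of named groups is used, and the names `VALUES`, `MAGNITUDES`, `TOKEN` and `accumulate` are made up. Treating a bare magnitude word as "one times the magnitude" is also an addition of this sketch.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

VALUES = {word: n for n, word in enumerate(ONES)}
VALUES.update({word: 10 * n for n, word in enumerate(TENS, start=2)})
MAGNITUDES = {"hundred": 100, "thousand": 10**3, "million": 10**6, "billion": 10**9}

# Build the alternation at runtime; \b keeps "seven" from matching inside "seventeen".
TOKEN = re.compile(
    r"\b(?:(?P<Magnitude>%s)|(?P<Value>%s))\b"
    % ("|".join(MAGNITUDES), "|".join(sorted(VALUES, key=len, reverse=True))),
    re.IGNORECASE)


def accumulate(text):
    acc = 0
    for m in TOKEN.finditer(text):
        if m.group("Magnitude"):
            # Assumption of this sketch: a bare "hundred" counts as 1 * 100.
            acc = max(acc, 1) * MAGNITUDES[m.group("Magnitude").lower()]
        else:
            acc += VALUES[m.group("Value").lower()]
    return acc


print(accumulate("Two hundred fifty thousand"))
print(accumulate("eleven hundred and fifty two"))
print(accumulate("thirty something"))
```

Words that are not in either table, such as "and" or "something", are simply never matched, which has the same effect as filtering them out. Note that with a single accumulator a phrase like "two thousand one hundred" comes out as 200100, so mixed magnitudes need extra handling, for example a separate running total like the one used in the total/prior algorithm from the first answer.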

Well, I hope this post helps someone out there. If anyone wants to, I can post the vbScript pseudo code that I used to verify this, but this is not very nice code and NOT production code.

If I may ask: what is the final language this will be written in? C++, or something like a scripting language? Greg Hewgill's source will go a long way toward helping you understand how all of this comes together.

Let me know if I can provide any other assistance. Sorry, I only know English / American, so I can not help you with other languages.

+2
Mar 20 '09 at 23:33

I was converting ordinal edition statements from early modern books (e.g. "2nd edition", "Editio quarta") into integers and needed support for ordinals 1-100 in English and ordinals 1-10 in several Romance languages. Here is what I came up with in Python:

 def get_data_mapping():
     data_mapping = {
         "1st": 1, "2nd": 2, "3rd": 3,
         "tenth": 10, "eleventh": 11, "twelfth": 12, "thirteenth": 13,
         "fourteenth": 14, "fifteenth": 15, "sixteenth": 16,
         "seventeenth": 17, "eighteenth": 18, "nineteenth": 19,
         "twentieth": 20,
         # words that imply a second edition
         "new": 2, "newly": 2, "nova": 2, "nouvelle": 2, "altera": 2, "andere": 2,
         # latin
         "primus": 1, "secunda": 2, "tertia": 3, "quarta": 4, "quinta": 5,
         "sexta": 6, "septima": 7, "octava": 8, "nona": 9, "decima": 10,
         # italian
         "primo": 1, "secondo": 2, "terzo": 3, "quarto": 4, "quinto": 5,
         "sesto": 6, "settimo": 7, "ottavo": 8, "nono": 9, "decimo": 10,
         # french
         "premier": 1, "deuxième": 2, "troisième": 3, "quatrième": 4, "cinquième": 5,
         "sixième": 6, "septième": 7, "huitième": 8, "neuvième": 9, "dixième": 10,
         # spanish ("quinto" and "decimo" repeat the italian entries)
         "primero": 1, "segundo": 2, "tercero": 3, "cuarto": 4, "quinto": 5,
         "sexto": 6, "septimo": 7, "octavo": 8, "noveno": 9, "decimo": 10,
     }

     # create 4th, 5th, ... 19th (range replaces Python 2's xrange)
     for i in range(16):
         data_mapping[str(4 + i) + "th"] = 4 + i

     # create 20th, 21st, 22nd, ... 99th (range(80) so that 99th is included)
     suffixes = {"1": "st", "2": "nd", "3": "rd"}
     for i in range(80):
         last_char = str(i)[-1]
         data_mapping[str(20 + i) + suffixes.get(last_char, "th")] = 20 + i

     ordinals = [
         "first", "second", "third", "fourth", "fifth",
         "sixth", "seventh", "eighth", "ninth"
     ]

     # create first, second ... ninth
     for c, i in enumerate(ordinals):
         data_mapping[i] = c + 1

     # create twenty-first, twenty-second ... ninety-ninth
     tens = ["twenty", "thirty", "forty", "fifty",
             "sixty", "seventy", "eighty", "ninety"]
     for ci, i in enumerate(tens):
         for cj, j in enumerate(ordinals):
             data_mapping[i + "-" + j] = 20 + (ci * 10) + (cj + 1)
         # create twentieth, thirtieth ... ninetieth
         data_mapping[i.replace("y", "ieth")] = 20 + (ci * 10)

     return data_mapping
0
Dec 29 '16 at 21:29

Try

  • Open an HTTP request to "http://www.google.com/search?q=" + number + "+in+decimal".

  • Parse the result for your number.

  • Cache the number / result pairs to amortize the requests over time.

-1
Mar 19 '09 at 19:42

One place to start looking is the GNU get_date lib, which can parse just about any textual date into a timestamp. Although it is not exactly what you are looking for, its solution to a similar problem could provide a lot of useful clues.

-2
Mar 17 '09 at 2:00