Well, I am late with an answer to this question, but I was working on a small test case that seems to work very well for me. I used a (simple, but ugly and large) regular expression to find all the words for me. The expression is as follows:
    (?<Value>(?:zero)|(?:one|first)|(?:two|second)|(?:three|third)|(?:four|fourth)|
    (?:five|fifth)|(?:six|sixth)|(?:seven|seventh)|(?:eight|eighth)|(?:nine|ninth)|
    (?:ten|tenth)|(?:eleven|eleventh)|(?:twelve|twelfth)|(?:thirteen|thirteenth)|
    (?:fourteen|fourteenth)|(?:fifteen|fifteenth)|(?:sixteen|sixteenth)|
    (?:seventeen|seventeenth)|(?:eighteen|eighteenth)|(?:nineteen|nineteenth)|
    (?:twenty|twentieth)|(?:thirty|thirtieth)|(?:forty|fortieth)|(?:fifty|fiftieth)|
    (?:sixty|sixtieth)|(?:seventy|seventieth)|(?:eighty|eightieth)|(?:ninety|ninetieth)|
    (?<Magnitude>(?:hundred|hundredth)|(?:thousand|thousandth)|(?:million|millionth)|
    (?:billion|billionth)))
It is shown here with line breaks for formatting purposes only.
Anyway, my method was to run this RegEx with a library such as PCRE and then read back the named matches. It worked on all the different examples listed in this question, minus the "one half" types, as I did not add them in, but, as you can see, it would not be hard to do so. This addresses a lot of issues. For example, it addresses the following items from the original question and other answers:
- cardinal/nominal or ordinal: "one" and "first"
- common misspellings: "forty"/"fourty" (note that this is NOT EXPLICITLY addressed; it is something you would want to fix before passing the string to this parser. This parser sees this example as "FOUR"...)
- hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand one hundred"
- separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and so on
- colloquialisms: "thirty-something" (this is also NOT FULLY addressed, since what IS "something"? This code simply finds the number as 30).
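As a rough illustration of reading back the named matches, here is a minimal Python sketch. It is not the PCRE setup described above: Python's `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`, the word lists are deliberately shortened, and the helper names (`alternation`, `named_matches`) are made up for this example. The alternatives are also sorted longest first, which is an assumption of this sketch so that "eleventh" is tried before "eleven".

```python
import re

# Hypothetical, shortened word lists; the full expression has many more entries.
VALUE_WORDS = ["one", "first", "two", "second", "eleven", "eleventh",
               "thirty", "thirtieth", "fifty", "fiftieth"]
MAGNITUDE_WORDS = ["hundred", "hundredth", "thousand", "thousandth"]

def alternation(words):
    # Longest words first, so a longer word wins over its own prefix.
    return "|".join(sorted(words, key=len, reverse=True))

# Same nesting as the expression in the post: Magnitude sits inside Value.
PATTERN = re.compile(
    "(?P<Value>" + alternation(VALUE_WORDS)
    + "|(?P<Magnitude>" + alternation(MAGNITUDE_WORDS) + "))",
    re.IGNORECASE,
)

def named_matches(text):
    """Return (word, is_magnitude) for every number word found in text."""
    return [(m.group("Value").lower(), m.group("Magnitude") is not None)
            for m in PATTERN.finditer(text)]

print(named_matches("eleven-hundred fifty-two"))
print(named_matches("thirty-something"))
```

Because the pattern has no word-boundary anchors, it finds number words regardless of the separator, but it would also find "one" inside a word like "someone", so real input probably needs some pre-filtering.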
Now, instead of storing this regular expression monster in your source, I was considering creating this RegEx at runtime using something like the following:
    const char *ones[] = {"zero", "one", "two", "three", "four", "five", "six",
                          "seven", "eight", "nine", "ten", "eleven", "twelve",
                          "thirteen", "fourteen", "fifteen", "sixteen",
                          "seventeen", "eighteen", "nineteen"};
    const char *tens[] = {"", "", "twenty", "thirty", "forty", "fifty", "sixty",
                          "seventy", "eighty", "ninety"};
    /* Irregular ordinals only; an empty string means "cardinal + th". */
    const char *ordinalones[] = {"", "first", "second", "third", "fourth", "fifth",
                                 "", "", "eighth", "ninth", "", "", "twelfth"};
    const char *ordinaltens[] = {"", "", "twentieth", "thirtieth", "fortieth",
                                 "fiftieth", "sixtieth", "seventieth",
                                 "eightieth", "ninetieth"};

and so on...
The easy part here is that we only store the words that matter. In the case of SIXTH, you will notice that there is no entry for it, because it is just the normal number with TH tacked on. But ones like TWELVE need different attention.
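To make the idea of building the expression at runtime concrete, here is a small Python sketch under a few assumptions: it uses a dictionary of irregular ordinals instead of parallel arrays, falls back to "cardinal + th" (or "-ieth" for words ending in "y") for everything else, and sorts the alternatives longest first so that "fourteen" is not read as "four". The names `IRREGULAR_ORDINALS`, `ordinal`, `alternation` and `build_pattern` are hypothetical, and the group syntax is Python's `(?P<name>...)`.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()
MAGNITUDES = "hundred thousand million billion".split()

# Only the ordinals that are NOT simply "cardinal + th" are stored.
IRREGULAR_ORDINALS = {
    "one": "first", "two": "second", "three": "third", "five": "fifth",
    "eight": "eighth", "nine": "ninth", "twelve": "twelfth",
}

def ordinal(word):
    """Return the ordinal form of a cardinal number word."""
    if word in IRREGULAR_ORDINALS:
        return IRREGULAR_ORDINALS[word]
    if word.endswith("y"):           # twenty -> twentieth
        return word[:-1] + "ieth"
    return word + "th"               # six -> sixth

def alternation(words):
    """Join cardinals and their ordinals, longest alternative first."""
    both = list(words) + [ordinal(w) for w in words]
    return "|".join(sorted(both, key=len, reverse=True))

def build_pattern():
    """Assemble the Value/Magnitude expression from the word tables."""
    return ("(?P<Value>" + alternation(ONES + TENS)
            + "|(?P<Magnitude>" + alternation(MAGNITUDES) + "))")

NUMBER_WORDS = re.compile(build_pattern(), re.IGNORECASE)
print([m.group("Value") for m in NUMBER_WORDS.finditer("twelfth sixth fourteen")])
```

Note that this sketch also generates "zeroth", which the handwritten expression does not contain.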
OK, so now we have the code to build our (ugly) RegEx; now we just execute it on our number strings.
One thing I would recommend is to filter out, or eat, the word "and". It is not necessary and only leads to other problems.
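One possible way to eat the word "and" before parsing, sketched in Python (the function name `strip_and` is made up for this example). The word-boundary anchors matter here, because "thousand" itself ends in "and".

```python
import re

def strip_and(text):
    """Drop the standalone word "and" and collapse the leftover whitespace."""
    without_and = re.sub(r"\band\b", " ", text, flags=re.IGNORECASE)
    return " ".join(without_and.split())

print(strip_and("two thousand and one hundred and five"))
```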
So, what you will want to do is set up a function that passes the named matches for "Magnitude" into a function that looks at all the possible magnitude values and multiplies your current result by that value. Then you create a function that looks at the "Value" named matches and returns an int (or whatever you are using) based on the value found there.
All VALUE matches are added to your result, while magnitude matches multiply the result by the magnitude value. So, "Two hundred fifty thousand" becomes "2", then "2 * 100", then "200 + 50", then "250 * 1000", ending with 250,000...
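The add-and-multiply rule can be sketched in Python as follows. This is only an illustration with cardinal words, Python's `(?P<name>...)` group syntax, and made-up function names. `literal_total` follows the rule exactly as stated. `grouped_total` is a variation that is not from the post: multiplying the whole running total goes wrong on inputs such as "two thousand one hundred", so it keeps a separate accumulator for the current group and only folds it into the total at "thousand" and above.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()
VALUES = {word: number for number, word in enumerate(ONES)}
VALUES.update({word: 10 * (i + 2) for i, word in enumerate(TENS)})
MAGNITUDES = {"hundred": 100, "thousand": 1000,
              "million": 10**6, "billion": 10**9}

def alternation(words):
    # Longest first, so "fourteen" is tried before "four".
    return "|".join(sorted(words, key=len, reverse=True))

PATTERN = re.compile(
    "(?P<Value>" + alternation(VALUES)
    + "|(?P<Magnitude>" + alternation(MAGNITUDES) + "))",
    re.IGNORECASE,
)

def literal_total(text):
    """The rule as stated: values are added, magnitudes multiply the result."""
    result = 0
    for m in PATTERN.finditer(text):
        word = m.group("Value").lower()
        if m.group("Magnitude") is None:
            result += VALUES[word]
        else:
            result *= MAGNITUDES[word]
    return result

def grouped_total(text):
    """Variation (an assumption of this sketch): accumulate per group."""
    total, current = 0, 0
    for m in PATTERN.finditer(text):
        word = m.group("Value").lower()
        if m.group("Magnitude") is None:
            current += VALUES[word]
        elif word == "hundred":
            current = max(current, 1) * 100      # bare "hundred" counts as 100
        else:
            total += max(current, 1) * MAGNITUDES[word]
            current = 0
    return total + current

print(literal_total("Two hundred fifty thousand"))   # 2, 200, 250, 250000
print(literal_total("two thousand one hundred"))     # 2, 2000, 2001, 200100
print(grouped_total("two thousand one hundred"))     # 2100
```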
Just for fun, I wrote a VBScript version of this, and it worked great with all the examples above. VBScript does not support named matches, so I had to work a little harder to get the correct result, but I got it. Bottom line: if it is a "VALUE" match, add it to your accumulator. If it is a magnitude match, multiply your accumulator by 100, 1000, 1,000,000, 1,000,000,000, etc. This will give you some pretty amazing results, and all you have to do to handle things like "one half" is add them to your RegEx, put in a code marker for them, and process them.
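Without named matches, one way to tell values from magnitudes is to look the matched text up in a table. The Python sketch below is not the VBScript version mentioned above. It uses shortened tables, and the `FRACTIONS` table is a hypothetical "code marker" for words like "half", handled here by dividing the accumulator, which only covers plain phrases such as "one half" or "three quarters".

```python
import re
from fractions import Fraction

# Shortened, hypothetical tables for illustration.
VALUES = {"one": 1, "two": 2, "three": 3, "fifty": 50}
MAGNITUDES = {"hundred": 100, "thousand": 1000}
FRACTIONS = {"half": 2, "quarter": 4, "quarters": 4}   # word -> denominator

ALL_WORDS = list(VALUES) + list(MAGNITUDES) + list(FRACTIONS)
# A plain alternation with no groups at all, longest alternative first.
PATTERN = re.compile("|".join(sorted(ALL_WORDS, key=len, reverse=True)),
                     re.IGNORECASE)

def accumulate(text):
    """Classify each matched word by table lookup instead of by group name."""
    accumulator = Fraction(0)
    for word in PATTERN.findall(text):
        word = word.lower()
        if word in VALUES:
            accumulator += VALUES[word]
        elif word in MAGNITUDES:
            accumulator *= MAGNITUDES[word]
        else:
            accumulator /= FRACTIONS[word]
    return accumulator

print(accumulate("one half"))
print(accumulate("two hundred fifty thousand"))
```

Mixed phrases such as "two and a half" would need more than this marker table.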
Well, I hope this post helps someone out there. If anyone wants, I can post the VBScript pseudo code that I used to test this, but it is not pretty code and NOT production code.
If I may ask: what is the final language this will be written in? C++, or something like a scripting language? Greg Hewgill's source will go a long way toward helping you understand how this all comes together.
Let me know if I can be of any other help. Sorry, I only know English/American, so I cannot help you with other languages.