How to determine if a string is an English sentence or code?

Consider the following two lines: the first is code, the second is an English sentence (more precisely, a phrase). How can I detect that the first is code and the second is not?

1. for (int i = 0; i < b.size(); i++) {
2. do something in English (not necessary to be a sentence).

I am thinking about counting special characters (for example, "=", ";", "++", etc.) and checking whether the count exceeds some threshold. Are there any better ways to do this? Any Java libraries?

Please note that the code may not be parsable, because it is not necessarily a complete method / statement / expression.

My assumption is that English sentences are fairly regular and most likely contain only ",", ".", "_", "(", ")", etc. They do not contain anything like this: write("the whole lot of text");

+6
7 answers

The basic idea is to convert the string into a sequence of tokens. For example, the line of code above may become "KEY, SEPARATOR, ID, ASSIGN, NUMBER, SEPARATOR, ...". We can then use simple rules to separate code from English.

Check the code here.

+2

You can try the OpenNLP sentence parser. It returns the n best parses for a sentence. For most English sentences it returns at least one. I believe that for most code snippets it will not return any, and hence you can be fairly sure that the input is not an English sentence.

Use this code to parse:

    // Initialize the sentence detector
    final SentenceDetectorME sdetector = EasyParserUtils
            .getOpenNLPSentDetector(Constants.SENTENCE_DETECTOR_DATA);

    // Initialize the parser
    final Parser parser = EasyParserUtils
            .getOpenNLPParser(Constants.PARSER_DATA_LOC);

    // Get sentences of the text
    final String sentences[] = sdetector.sentDetect(essay);

    // Go through the sentences and parse each
    for (final String sentence : sentences) {
        // Parse the sentence, requesting up to 10 parses
        final Parse[] parses = ParserTool.parseLine(sentence, parser, 10);
        if (parses.length == 0) {
            // Most probably this is code
        } else {
            // An English sentence
        }
    }

These are the two helper methods (from EasyParserUtils) used in the code:

    public static Parser getOpenNLPParser(final String parserDataURL) {
        try (final InputStream isParser = new FileInputStream(parserDataURL)) {
            // Get model for the parser and initialize it
            final ParserModel parserModel = new ParserModel(isParser);
            return ParserFactory.create(parserModel);
        } catch (final IOException e) {
            e.printStackTrace();
            return null;
        }
    }

and

    public static SentenceDetectorME getOpenNLPSentDetector(
            final String sentDetDataURL) {
        try (final InputStream isSent = new FileInputStream(sentDetDataURL)) {
            // Get model for the sentence detector and initialize it
            final SentenceModel sentDetModel = new SentenceModel(isSent);
            return new SentenceDetectorME(sentDetModel);
        } catch (final IOException e) {
            e.printStackTrace();
            return null;
        }
    }

+4

Look into lexical analysis and parsing (just as if you were writing a compiler). You may not even need a parser unless you require complete statements.

+3

You could use a Java parser, or generate one from a BNF grammar, but the problem is that, as you said, the code may not be parsable, so that approach would fail.

My advice: use custom regular expressions to detect patterns that are specific to code. Use as many of them as possible to get a good success rate.

Some examples:

  • for\s*\( (for loop)
  • while\s*\( (while loop)
  • [a-zA-Z_$][a-zA-Z\d_$]*\s*\( (method or constructor call)
  • \)\s*\{ (beginning of the block / method)
  • ...

Yes, this is a long shot, but given what you want, you do not have many options.
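
As a rough sketch of how the patterns listed above could be combined, the class below counts how many of them occur in a line and treats the line as code once a minimum number is reached. The class name `CodePatternDetector`, the `minHits` parameter, and the last two patterns (trailing semicolon, common operators) are assumptions added for illustration and are not part of the answer. Requiring more than one hit matters because the identifier-followed-by-parenthesis pattern also matches ordinary English such as "English (not ...".

```java
import java.util.List;
import java.util.regex.Pattern;

final class CodePatternDetector {

    // The first four patterns come from the list above; the last two are
    // hypothetical additions and can be dropped or replaced.
    private static final List<Pattern> CODE_PATTERNS = List.of(
            Pattern.compile("for\\s*\\("),                        // for loop
            Pattern.compile("while\\s*\\("),                      // while loop
            Pattern.compile("[a-zA-Z_$][a-zA-Z\\d_$]*\\s*\\("),   // call
            Pattern.compile("\\)\\s*\\{"),                        // start of a block
            Pattern.compile(";\\s*$"),                            // assumption: trailing semicolon
            Pattern.compile("\\+\\+|--|==|!=|&&|\\|\\|")          // assumption: common operators
    );

    /** Returns how many distinct patterns occur somewhere in the line. */
    static int countMatches(String line) {
        int hits = 0;
        for (Pattern pattern : CODE_PATTERNS) {
            if (pattern.matcher(line).find()) {
                hits++;
            }
        }
        return hits;
    }

    /** Treats the line as code when at least minHits patterns occur in it. */
    static boolean looksLikeCode(String line, int minHits) {
        return countMatches(line) >= minHits;
    }
}
```

With `minHits` set to 2, the loop header from the question is reported as code and the English phrase is not. The threshold is a guess and should be tuned on real samples.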

+1

There is no need to reinvent the wheel; compilers already do this for you. The first step of any compilation is lexical analysis, which checks whether the tokens in the file belong to the language. That alone will not help us, since English words and Java tokens do not look very different. The second step, parsing, however, will report an error for any sentence written in English instead of Java (or for anything else that is not proper Java). So instead of pulling in external libraries and trying an alternative approach, why not use the Java compiler that is already available?

You can have a wrapper class like

    public class Test {
        public static void main() {
            /* Insert code to check here */
        }
    }

which you then compile, and if that goes well, then boom, you know it is valid code. Of course, this will not work with incomplete code fragments, such as the for loop in your example, which has no closing brace. If it does not compile, you can treat the line in other ways, for example by trying to parse it with your own pseudo-English parser built with flex and bison, the lexer and parser generators commonly used for building compilers. I don't know exactly what you are trying to accomplish with your program, but this way you can tell whether the input is code, plain English, or just garbage that you don't care about. Parsing natural language is very hard, and current approaches rely on inexact statistical methods, so they are not always right, which may be unacceptable in your program.
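
The wrapper idea can be sketched with the standard `javax.tools` API. This is a variation on the answer rather than its exact method: it runs only the parse phase of javac through `JavacTask.parse()`, so that undeclared names such as `b` do not count as errors, and it keeps the wrapper source in memory. It assumes a JDK is present (a plain JRE has no system compiler); the names `SyntaxCheck`, `StringSource`, and `parsesAsStatements` are made up for the example.

```java
import com.sun.source.util.JavacTask;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.List;
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

final class SyntaxCheck {

    /** Source file held in memory, so nothing is written to disk. */
    private static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String code) {
            super(URI.create("string:///Test.java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Returns true if the fragment parses as statements inside a method body. */
    static boolean parsesAsStatements(String fragment) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler; a JDK is required");
        }
        // Same wrapper as in the answer, with the fragment pasted into main().
        String source = "public class Test { public static void main() {\n"
                + fragment + "\n} }";
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, diagnostics, List.of("-proc:none"), null,
                List.of(new StringSource(source)));
        try {
            task.parse(); // syntax only: no symbol resolution, no class files
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return diagnostics.getDiagnostics().stream()
                .noneMatch(d -> d.getKind() == Diagnostic.Kind.ERROR);
    }
}
```

As the answer warns, an unbalanced fragment such as the loop header from the question is rejected just like English text, so a negative result here only means "not a complete run of statements".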

+1

Here is a very simple method that seems to work quite well on a few samples. Take out the System.out calls; they are for illustrative purposes only. As you can see from the example output, code comments look like text, so if large non-Javadoc block comments are mixed in with the code, you may get false positives. The hard-coded thresholds are my own estimates. Feel free to tune them.

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(arg);
            System.out.println(codeStatus(arg));
        }
    }

    static CodeStatus codeStatus(String string) {
        // Split at word boundaries: words, numbers and punctuation runs become separate tokens
        String[] words = string.split("\\b");
        int nonText = 0;
        for (String word : words) {
            // Count tokens that are not a plain word, a number, or common punctuation
            if (!word.matches("^[A-Za-z][a-z]*|[0-9]+(.[0-9]+)?|[ .,]|. $")) {
                nonText++;
            }
        }
        System.out.print("\n");
        double percentage = ((double) nonText) / words.length;
        System.out.println(percentage);
        if (percentage > .2) {
            return CodeStatus.CODE;
        }
        if (percentage < .1) {
            return CodeStatus.TEXT;
        }
        return CodeStatus.INDETERMINATE;
    }

    enum CodeStatus {
        CODE, TEXT, INDETERMINATE
    }

Output result:

    You can try the OpenNLP sentence parser. It returns the n best parses for a sentence. For most English sentences it returns at least one. I believe, that for most code snippets it won't return any and hence you can be quite sure it is not an English sentence.
    0.0297029702970297
    TEXT

    Use this code for parsing:
    0.18181818181818182
    INDETERMINATE

    // Initialize the sentence detector
    0.125
    INDETERMINATE

    final SentenceDetectorME sdetector = EasyParserUtils .getOpenNLPSentDetector(Constants.SENTENCE_DETECTOR_DATA);
    0.6
    CODE

    // Initialize the parser
    0.16666666666666666
    INDETERMINATE

    final Parser parser = EasyParserUtils .getOpenNLPParser(Constants.PARSER_DATA_LOC);
    0.5333333333333333
    CODE

    // Get sentences of the text
    0.1
    INDETERMINATE

    final String sentences[] = sdetector.sentDetect(essay);
    0.38461538461538464
    CODE

    // Go through the sentences and parse each
    0.07142857142857142
    TEXT

    for (final String sentence : sentences) { // Parse the sentence, produce only 1 parse final Parse[] parses = ParserTool.parseLine(sentence, parser, 10); if (parses.length == 0) { // Most probably this is code } else { // An English sentence } }
    0.2537313432835821
    CODE

    and these are the two helper methods (from EasyParserUtils) used in the code:
    0.14814814814814814
    INDETERMINATE

    public static Parser getOpenNLPParser(final String parserDataURL) { try (final InputStream isParser = new FileInputStream(parserDataURL);) { // Get model for the parser and initialize it final ParserModel parserModel = new ParserModel(isParser); return ParserFactory.create(parserModel); } catch (final IOException e) {
    0.3835616438356164
    CODE

+1

Here is an ideal and safe solution. The basic idea is to first collect all the keywords and special characters of the language, and then use that set to tokenize the line. For example, the line of code in the question becomes "KEY, SEPARATOR, ID, ASSIGN, NUMBER, SEPARATOR, ...". We can then use simple rules to separate code from English.
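
A minimal sketch of that idea, under several assumptions: the keyword set is a small, incomplete subset of Java's, the token categories beyond those named in the answer (`OPERATOR`, `OTHER`) are invented here, and the single rule used (more than half of the tokens are something other than a plain identifier) is just one possible "simple rule" with a guessed threshold. Unlike the abbreviated example in the answer, this tokenizer also reports `int` as a KEY.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenClassifier {

    enum Kind { KEY, ID, NUMBER, ASSIGN, OPERATOR, SEPARATOR, OTHER }

    // Assumption: an illustrative subset of Java keywords, extend as needed.
    private static final Set<String> KEYWORDS = Set.of(
            "for", "while", "if", "else", "int", "double", "boolean",
            "return", "new", "class", "public", "static", "void", "final");

    // Alternatives are tried left to right, so "==" is matched before "=".
    private static final Pattern TOKEN = Pattern.compile(
            "(?<word>[A-Za-z_$][A-Za-z\\d_$]*)"
            + "|(?<number>\\d+(?:\\.\\d+)?)"
            + "|(?<operator>\\+\\+|--|==|!=|<=|>=|&&|\\|\\||[+\\-*/%<>!])"
            + "|(?<assign>=)"
            + "|(?<separator>[;,.(){}\\[\\]])"
            + "|(?<other>\\S)");

    /** Converts a line into a sequence of token kinds, skipping whitespace. */
    static List<Kind> tokenize(String line) {
        List<Kind> kinds = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            if (m.group("word") != null) {
                kinds.add(KEYWORDS.contains(m.group("word")) ? Kind.KEY : Kind.ID);
            } else if (m.group("number") != null) {
                kinds.add(Kind.NUMBER);
            } else if (m.group("operator") != null) {
                kinds.add(Kind.OPERATOR);
            } else if (m.group("assign") != null) {
                kinds.add(Kind.ASSIGN);
            } else if (m.group("separator") != null) {
                kinds.add(Kind.SEPARATOR);
            } else {
                kinds.add(Kind.OTHER);
            }
        }
        return kinds;
    }

    /** Hypothetical rule: code if more than half of the tokens are not plain identifiers. */
    static boolean looksLikeCode(String line) {
        List<Kind> kinds = tokenize(line);
        if (kinds.isEmpty()) {
            return false;
        }
        long nonIds = kinds.stream().filter(k -> k != Kind.ID).count();
        return (double) nonIds / kinds.size() > 0.5;
    }
}
```

On the two lines from the question, the loop header yields mostly keywords, separators and operators, while the English phrase yields mostly identifiers, so the rule separates them. Other rules, such as looking for specific token sequences like KEY followed by SEPARATOR, could be layered on the same token stream.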

0

Source: https://habr.com/ru/post/977008/