Eliminating unnecessary backtracking in a Java regular expression

Question

Hi, I am very new to the regex world. I would like to retrieve the timestamp, location, and id_str fields from my test string in Java.

20110302140010915|{"user":{"is_translator":false,"show_all_inline_media":false,"following":null,"geo_enabled":true,"profile_background_image_url":"http:\/\/a3.twimg.com\/a\/1298918947\/images\/themes\/theme1\/bg.png","listed_count":0,"favourites_count":2,"verified":false,"time_zone":"Mountain Time (US & Canada)","profile_text_color":"333333","contributors_enabled":false,"statuses_count":152,"profile_sidebar_fill_color":"DDEEF6","id_str":"207356721","profile_background_tile":false,"friends_count":14,"followers_count":13,"created_at":"Mon Oct 25 04:05:43 +0000 2010","description":null,"profile_link_color":"0084B4","location":"WaKeeney, KS","profile_sidebar_border_color":"C0DEED",

I tried this

 (\d*).*?"id_str":"(\d*)",.*"location":"([^"]*)"

It does a lot of backtracking if I use the lazy quantifier .*? (3000 steps in RegexBuddy), and I cannot replace it with a fixed count because the number of characters between the id_str and location anchors is not always the same. In addition, it can be catastrophic if no location is found in the row.

How can I 1) avoid unnecessary backtracking?

and

2) fail faster on a string that does not match?

Thanks.

+4

java regex

Asked Jun 22 '13 at 21:04

2 answers

This looks like JSON and, trust me, it is pretty easy to parse it that way.

 import org.json.JSONObject;
 // String.split takes a regex, so the pipe has to be escaped
 String[] input = inputStr.split("\\|", 2);
 System.out.println("Timestamp: " + input[0]); // 20110302140010915
 JSONObject user = new JSONObject(input[1]).getJSONObject("user");
 System.out.println("ID: " + user.getString("id_str")); // 207356721
 System.out.println("Location: " + user.getString("location")); // WaKeeney, KS
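One detail about the split step is worth spelling out: String.split treats its argument as a regular expression, and a bare | is the alternation operator, so the pipe has to be escaped or quoted to be taken literally. Here is a minimal sketch using only the standard library (the class and method names are made up for illustration):

```java
import java.util.regex.Pattern;

final class PipeSplitDemo {
    // Splits "timestamp|json" at the first literal pipe only (limit 2),
    // so any later pipe inside the JSON part is left alone.
    static String[] splitRecord(String record) {
        // Pattern.quote turns "|" into a literal; unquoted it means alternation.
        return record.split(Pattern.quote("|"), 2);
    }

    public static void main(String[] args) {
        String[] parts = splitRecord("20110302140010915|{\"user\":{\"id_str\":\"207356721\"}}");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

Writing "\\|" instead of Pattern.quote("|") should behave the same way; which one reads better is a matter of taste.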

Link :
JSON Java API docs

+5

Ravi thapliyal Jun 22 '13 at 21:32

Casimir et Hippolyte · Accepted Answer · 2013-06-22T21:19:28+0000

Instead, you can try:

 (\d*+)(?>[^"]++|"(?!id_str":))+"id_str":"(\d*+)",(?>[^"]++|"(?!location":))+"location":"([^"]*+)"

The idea here is to eliminate as much backtracking as possible by using only possessive quantifiers and atomic groups, together with restricted character classes (as in the last capture group).
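In Java source the pattern needs its backslashes and double quotes escaped, which is easy to get wrong. Here is a rough sketch of how it could be used with java.util.regex. The class name, the method name, and the shortened sample line are only assumptions for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TweetFieldExtractor {
    // The pattern from this answer, escaped for a Java string literal.
    private static final Pattern FIELDS = Pattern.compile(
        "(\\d*+)(?>[^\"]++|\"(?!id_str\":))+\"id_str\":\"(\\d*+)\","
        + "(?>[^\"]++|\"(?!location\":))+\"location\":\"([^\"]*+)\"");

    // Returns {timestamp, id_str, location}, or null when the line does not match.
    static String[] extract(String line) {
        Matcher m = FIELDS.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the test string from the question.
        String line = "20110302140010915|{\"user\":{\"verified\":false,\"id_str\":\"207356721\","
            + "\"friends_count\":14,\"location\":\"WaKeeney, KS\",\"profile_link_color\":\"0084B4\",";
        String[] fields = extract(line);
        System.out.println(fields == null ? "no match" : String.join(" | ", fields));
    }
}
```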

For example, to avoid the first lazy quantifier, I use this:

 (?>[^"]++|"(?!id_str":))+

The regex engine takes as many characters that are not double quotes as it can (and does not record a single backtracking position, because a possessive quantifier is used). When a double quote is found, a negative lookahead checks that it is not followed by the id_str": anchor. This whole part is wrapped in an atomic group (no backtracking is possible inside it), repeated one or more times.
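To see where this sub-pattern stops, it can be run on its own. The sketch below is only an illustration (the class name, method name, and the tiny input are invented); it uses Matcher.lookingAt, which matches from the start of the input without requiring the whole input to match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AtomicScanDemo {
    // Runs of non-quote characters, or a single quote that does not start "id_str":
    private static final Pattern SKIP = Pattern.compile("(?>[^\"]++|\"(?!id_str\":))+");

    // Returns the index where the scan stops, or -1 if nothing can be consumed.
    static int scanEnd(String input) {
        Matcher m = SKIP.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "{\"a\":1,\"id_str\":\"42\"}";
        // The scan should stop exactly at the quote that opens "id_str":
        System.out.println(scanEnd(input) + " / " + input.indexOf("\"id_str\":"));
    }
}
```

The scan consumes every other quote it meets and halts only in front of the anchor, which is what lets the rest of the pattern continue with a literal "id_str": match.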

Do not be afraid of using a lookahead inside the group: it fails quickly and is only tested when a double quote is found. However, you can try the same thing with i if you are sure that it is less frequent than " (or with an even rarer character, if you can find one):

 (?>[^i]++|i(?!d_str":))+id_str":(...

EDIT: the best choice here seems to be the comma (,), which is less frequent (200 steps versus 422 with the double quote):

 (\d*+)(?>[^,]++|,(?!"id_str":))+,"id_str":"(\d*+)",(?>[^,]++|,(?!"location":))+,"location":"([^"]*+)"

For better performance, and if your input allows it, add an anchor (^) to your pattern when the match always starts at the beginning of the string, or at the beginning of a line (in multi-line mode).

 ^(\d*+)(?>[^"]++|"(?!id_str":))+"id_str":"(\d*+)",(?>[^"]++|"(?!location":))+"location":"([^"]*+)"
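As a rough illustration of the anchored pattern failing quickly on rows without a location, here is a sketch that feeds it one line at a time. The class name, method name, and output format are assumptions. Since [^"] also matches line breaks, the sketch matches each line separately instead of running multi-line mode over one big string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchoredLineScanner {
    // Anchored pattern from this answer, escaped for a Java string literal.
    private static final Pattern ANCHORED = Pattern.compile(
        "^(\\d*+)(?>[^\"]++|\"(?!id_str\":))+\"id_str\":\"(\\d*+)\","
        + "(?>[^\"]++|\"(?!location\":))+\"location\":\"([^\"]*+)\"");

    // Returns "timestamp|id_str|location" for each matching line; other lines are skipped.
    static List<String> scan(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ANCHORED.matcher(line);
            // With ^, only an attempt starting at position 0 can succeed.
            if (m.find()) {
                out.add(m.group(1) + "|" + m.group(2) + "|" + m.group(3));
            }
        }
        return out;
    }
}
```

Without the anchor, a failed attempt at position 0 would be followed by new attempts at every later position of the line, which is where much of the cost of a non-matching row comes from.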
