Redshift / Postgres: how can I ignore rows that generate errors? (Invalid JSON in json_extract_path_text)

I am trying to run a query in Redshift that selects using json_extract_path_text. Unfortunately, some of the JSON entries in this column are not valid.

What happens: when the query hits an invalid JSON value, it aborts with a "JSON parsing error".

What I want: ignore any rows with invalid JSON in this column, but return all rows where the JSON parses.

Why I can't get it to do what I want: I don't think I understand error handling in Redshift / Postgres. It should be possible to simply skip any rows that generate errors, but when I tried EXEC SQL WHENEVER SQLERROR CONTINUE (based on the Postgres docs) I got a "syntax error at or near SQLERROR".

+6
7 answers

Create a Python UDF:

 create or replace function f_json_ok(js varchar(65535))
 returns boolean
 immutable
 as $$
     if js is None:
         return None
     import json
     try:
         json.loads(js)
         return True
     except:
         return False
 $$ language plpythonu

Use it like this:

 select *
 from schema.table
 where 'DesiredValue' = case
         when f_json_ok(json_column) then json_extract_path_text(json_column, 'Key')
         else 'nope'
       end
+9

I assume that the JSON data is actually stored in a TEXT column rather than a JSON column (otherwise you could not have stored non-JSON there).

If there is some pattern in the data that lets you build a regular expression that detects the valid strings (or the invalid ones), you can use a CASE expression. For instance:

 SELECT CASE
          WHEN mycol !~ 'not_json' THEN json_extract_path_text(mycol, ....)
          ELSE NULL
        END AS mystuff
 ...

replacing not_json with a regular expression that detects the non-JSON values.

This may or may not be practical depending on the format of your data.
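For example, a crude heuristic is to attempt extraction only on values that at least begin with an opening brace. A minimal sketch, assuming a hypothetical table mytable with a text column mycol and a key 'Key' (this only filters obvious non-JSON; malformed values that do start with a brace would still error):

 -- Crude heuristic, not real validation: only parse values that start with '{'.
 -- mytable, mycol and 'Key' are placeholder names.
 SELECT CASE
          WHEN mycol ~ '^[[:space:]]*[{]' THEN json_extract_path_text(mycol, 'Key')
          ELSE NULL
        END AS mystuff
 FROM mytable;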

According to the answers to this question, it is apparently possible to fully validate arbitrary JSON using some regular expression implementations, but alas, not the one used by PostgreSQL.

+2

Edit: it looks like Redshift only supports Python UDFs, so this answer will not work there. I am leaving it here for posterity (and in case it helps someone who is not using Redshift).

Potentially relevant: here is a plpgsql function that will try to decode JSON and return a default value if that fails:

 CREATE OR REPLACE FUNCTION safe_json(i text, fallback json) RETURNS json AS $$
 BEGIN
     RETURN i::json;
 EXCEPTION
     WHEN others THEN
         RETURN fallback;
 END;
 $$ LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Then you can use it as follows:

 SELECT *
 FROM (
     SELECT safe_json(my_text, '{"error": "invalid JSON"}'::json) AS my_json
     FROM my_table
 ) AS x

This ensures you always end up with valid JSON.
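Building on that, a minimal sketch of extracting a field safely (my_text, my_table and 'Key' are hypothetical names): rows that fail to parse fall back to an empty object, so the extraction yields NULL instead of raising an error.

 -- Invalid rows fall back to '{}', so extracting 'Key' simply returns NULL for them.
 SELECT json_extract_path_text(
            safe_json(my_text, '{}'::json),
            'Key'
        ) AS my_value
 FROM my_table;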

+2

Update: the UDF solution above seems ideal; at the time I wrote this, that answer did not exist yet. The following are just some workarounds that got the job done.

Although json_extract_path_text cannot ignore errors, Redshift COPY has a MAXERROR parameter.

So you can use something like this:

 COPY raw_json
 FROM 's3://data-source'
 CREDENTIALS 'aws_access_key_id;aws_secret_access_key'
 JSON 's3://json_path.json'
 MAXERROR 1000;

One pitfall with the json_path.json file: you cannot use $ on its own to reference the root element:

 { "jsonpaths": [ "$['_id']", "$['type']", "$" <--------------- this will fail. ] } 

So it is convenient to have a "top-level" element wrapping the other fields, so that $['data'] captures everything in your record:

 { "data": { "id": 1 ... } } { "data": { "id": 2 ... } } 

If you cannot change the original format, Redshift UNLOAD will help:

 UNLOAD ('select_statement')
 TO 's3://object_path_prefix'

It is easy to have the select_statement concatenate the wrapper around each row: { "data": + old row + } ...
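A minimal sketch of such an UNLOAD, assuming the raw rows live in a hypothetical varchar column raw_line of a table staging_raw_lines (note that single quotes inside the quoted select_statement must be doubled):

 -- Wrap every raw row in a {"data": ...} envelope on the way out.
 UNLOAD ('SELECT ''{"data": '' || raw_line || ''}'' FROM staging_raw_lines')
 TO 's3://object_path_prefix'
 CREDENTIALS 'aws_access_key_id;aws_secret_access_key';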

Then Redshift is happy again!

+2

Redshift lacks many Postgres features, such as error handling.

How I deal with this:

  • Use CREATE TABLE AS to create a "fixup" table containing the JSON field and whatever key in the main table you are trying to query. Make sure you set DISTKEY and SORTKEY on the JSON field.

  • Add two columns to the fixup table: valid_json (BOOLEAN) and extract_test (VARCHAR).

  • Try to UPDATE extract_test with some text from the JSON field using JSON_EXTRACT_PATH_TEXT.

  • Use the errors to identify common characters that mangle the JSON. Importing weblog data, for example, I might find ???? or something similar.

  • UPDATE the fixup table SET valid_json = false for JSON fields containing those values.

  • Finally, clear the JSON fields in the source table with UPDATE original_table SET json_field = NULL FROM fixup_table f WHERE original_table.id = f.id AND f.valid_json = FALSE (see the sketch below).

It is still manual, but it is much faster than fixing rows one by one in a large table, and with the correct DISTKEY / SORTKEY on your fixup table the queries run quickly.
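A sketch of the whole flow, with hypothetical table and column names (original_table, fixup_table, json_field, 'some_key'):

 -- 1. Fixup table distributed and sorted on the JSON column.
 CREATE TABLE fixup_table DISTKEY (json_field) SORTKEY (json_field) AS
 SELECT id, json_field FROM original_table;

 ALTER TABLE fixup_table ADD COLUMN valid_json BOOLEAN;
 ALTER TABLE fixup_table ADD COLUMN extract_test VARCHAR(65535);

 -- 2. Probe the JSON; the parse error message points at the offending value.
 UPDATE fixup_table
 SET extract_test = JSON_EXTRACT_PATH_TEXT(json_field, 'some_key');

 -- 3. Flag rows containing a bad pattern discovered from those errors.
 UPDATE fixup_table
 SET valid_json = FALSE
 WHERE json_field LIKE '%????%';

 -- 4. Null out the bad JSON in the source table.
 UPDATE original_table
 SET json_field = NULL
 FROM fixup_table f
 WHERE original_table.id = f.id
   AND f.valid_json = FALSE;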

+1

You can use the following function:

 CREATE OR REPLACE FUNCTION isValidJSONv2(i varchar(MAX)) RETURNS int stable AS $CODE$
 import json
 import sys
 try:
     if i is None:
         return 0
     json_object = json.loads(i)
     return 1
 except:
     return 0
 $CODE$ language plpythonu;

The problem remains that even if you filter with this function, using the JSON parsing functions in the SELECT list can still raise the error. You will need to separate the valid values from the invalid JSON into different tables. I posted this problem here: https://forums.aws.amazon.com/thread.jspa?threadID=232468
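A minimal sketch of that separation (valid_json_rows is a hypothetical name): materialize only the rows that parse, then run the extraction against that table.

 CREATE TABLE valid_json_rows AS
 SELECT *
 FROM schema.table
 WHERE isValidJSONv2(json_column) = 1;

 SELECT json_extract_path_text(json_column, 'Key')
 FROM valid_json_rows;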

0

Redshift now supports passing a boolean argument that makes json_extract_path_text return null for invalid JSON:

select json_extract_path_text('invalid', 'path', true)

returns null

https://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html
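Applied to the original question, a minimal sketch (json_column, schema.table and 'Key' are placeholders):

 -- With true as the final null_if_invalid argument, bad JSON yields NULL instead of an error.
 SELECT json_extract_path_text(json_column, 'Key', true) AS my_value
 FROM schema.table
 WHERE json_extract_path_text(json_column, 'Key', true) IS NOT NULL;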

0
