How to efficiently normalize data while inserting into an SQL table (Postgres)

I want to import a large log file into a (Postgres) SQL database.

Some columns are highly repetitive; for example, the "event_type" column takes one of about 10 distinct values.

I have a rough understanding of data normalization.

First, is it correct to assume that it is useful (for storage size, indexing, and query speed) to store event_type in a separate table (possibly with a foreign key relation)?

To normalize, I would have to find the distinct event_type values in the raw log and insert them into an event_types table.

There are many fields of this kind, not just event_type.

So, secondly: Is there a way to tell the database to create and maintain such a table when inserting data?

Are there other strategies for this? I work with pandas.

1 answer

This is a typical situation when you start building a database from data that has so far been stored elsewhere, for example in a log file. There is a solution, as usual, but it is not a very fast one. Perhaps you can write a log message handler that processes messages as they arrive; if the flow (messages per second) is not too large, you will not notice the overhead, especially if you can then drop writing the message to a flat text file.

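First, on normalization: yes, you should normalize, and to the so-called third normal form (3NF). In essence, this means that every piece of real-world data (such as your event_type) is stored once and only once. (There are cases where you can relax this rule and stay at 2NF, usually for query optimization, for example with well-known abbreviations such as ISO codes or M/F (male/female), but otherwise 3NF is the way to go.)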

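In your case, say event_type is a char(20). An int key takes 4 bytes. With 1000 log messages that store event_type as char(20), you need about 20 kB just for that column. If other fields in the log message repeat in the same way, the savings grow accordingly. Values such as a date or timestamp should be stored in their native types (4 and 8 bytes respectively), which gives smaller storage, better performance, and more functionality (for example, comparing dates or querying ranges).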
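To make the storage argument concrete, here is a rough back-of-the-envelope calculation in Python. It assumes 1000 log rows, 10 distinct event types (the figure from the question), 20 bytes per char(20) value, and 4 bytes per int key. It ignores per-row and per-page overhead, so the numbers are only indicative.

```python
# Rough storage estimate for one repeated column; ignores row/page overhead.
ROWS = 1000            # number of log messages (assumed)
DISTINCT_TYPES = 10    # distinct event_type values (from the question)
CHAR20_BYTES = 20      # assumed payload of one char(20) value
INT_BYTES = 4          # size of one int key


def inline_bytes(rows=ROWS):
    """Bytes used when every row stores the event_type text itself."""
    return rows * CHAR20_BYTES


def normalized_bytes(rows=ROWS, types=DISTINCT_TYPES):
    """Bytes used when rows store an int key plus a small lookup table."""
    lookup = types * (INT_BYTES + CHAR20_BYTES)
    return rows * INT_BYTES + lookup


print(inline_bytes())      # 20000
print(normalized_bytes())  # 4240
```

Under these assumptions the normalized layout needs roughly a fifth of the space for this one column, and the ratio improves as the row count grows.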

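Secondly, no, you cannot tell the database to create and maintain such tables for you; you have to create them yourself. Once they exist, a stored function can parse the log messages and put the data into the right tables.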

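For log messages you could do something like this (assuming you want to do the parsing in the database rather than in Python):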

CREATE FUNCTION ingest_log_message(mess text) RETURNS int AS $$
DECLARE
  parts  text[];
  et_id  int;
  log_id int;
BEGIN
  parts := regexp_split_to_array(mess, ','); -- Whatever your delimiter is

  -- Assuming:
  --   parts[1] is a timestamp
  --   parts[2] is your event_type
  --   parts[3] is the actual message

  -- Get the event_type identifier. If event_type is new, INSERT it, else just get the id.
  -- Do likewise with other log message parts whose unique text is located in a separate table.
  SELECT id INTO et_id
  FROM event_type
  WHERE type_text = parts[2];
  IF NOT FOUND THEN
    INSERT INTO event_type (type_text)
    VALUES (parts[2])
    RETURNING id INTO et_id;
  END IF;

  -- Now insert the log message.
  -- Variables are passed as parameters here, so no quote_literal() is needed.
  INSERT INTO log_message (dt, et, msg)
  VALUES (parts[1]::timestamp, et_id, parts[3])
  RETURNING id INTO log_id;

  RETURN log_id;
END; $$ LANGUAGE plpgsql STRICT;

The tables needed for this are:

CREATE TABLE event_type (
  id        serial PRIMARY KEY,
  type_text char(20)
);

CREATE TABLE log_message (
  id        serial PRIMARY KEY,
  dt        timestamp,
  et        integer REFERENCES event_type,
  msg       text
);

You can then call the function with a simple SELECT, which returns the id of the newly inserted log message:

SELECT * FROM ingest_log_message(the_message);

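A note on quote_literal(): it serves two purposes: (1) escaping quotes inside text (such as "isn't"); and (2) guarding against SQL injection. It is only needed when you build a dynamic SQL string for EXECUTE. In static PL/pgSQL statements, variables are passed as parameters, and wrapping a value in quote_literal() there would store the added quotes as part of the data.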

You will, of course, need to tailor all of this to your own needs.
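If you would rather do the normalization on the client side in Python before loading, the same idea can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the helper name normalize_events and the sample lines are hypothetical, and it assumes the same three-field layout (timestamp, event_type, message) as the function above.

```python
import csv
import io


def normalize_events(lines, delimiter=","):
    """Split raw log lines into a lookup table and rows that reference it.

    Assumes each line has exactly three fields: timestamp, event_type, message.
    Returns (event_types, rows): event_types maps text -> integer id,
    rows holds (timestamp, event_type_id, message) tuples.
    """
    event_types = {}
    rows = []
    for ts, event_type, msg in csv.reader(lines, delimiter=delimiter):
        # setdefault assigns the next id only the first time a value is seen
        et_id = event_types.setdefault(event_type, len(event_types) + 1)
        rows.append((ts, et_id, msg))
    return event_types, rows


# Hypothetical sample data, for illustration only
sample = io.StringIO(
    "2014-05-01 10:00:00,login,user alice\n"
    "2014-05-01 10:00:05,error,disk full\n"
    "2014-05-01 10:00:09,login,user bob\n"
)
event_types, rows = normalize_events(sample)
print(event_types)  # {'login': 1, 'error': 2}
print(rows[2])      # ('2014-05-01 10:00:09', 1, 'user bob')
```

The two results could then be bulk-loaded into event_type and log_message, for example with COPY. With pandas, pandas.factorize produces similar integer codes for a column, although its codes start at 0 rather than 1.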


Source: https://habr.com/ru/post/1540944/

