Using JSON instead of normalized data, is this approach correct?

There are microblog posts and related votes/emoticons, stored in MySQL InnoDB tables. There is a requirement for two types of pages:

(A) A listing page containing many microblog posts (say 25 per page), each shown with its vote/emoticon counts.

eg.

BIG FULL REPORT

Not so funny content that should be funny. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus euismod, therefore, pellentesque ...... DETAILS ....

(3) like, (5) boring, (7) smiling

...plus 24 other posts on the same page.

(B) A permalink page containing one post with detailed vote information plus vote/emoticon counts.

BIG FULL REPORT

Not so funny content that should be funny. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus euismod, therefore, pellentesque. Quisque viverra adipiscing auctor. Mauris ut diam risus, in fermentum elit. Aliquam urna lectus, egestas sit amet cursus et, auctor ut elit. Nulla tempus suscitit nisi, nec condimentum dui fermentum non. In eget lacus mi, ut placerat nisi.

(You, Derp and 1 more ), (5) boring , (7) smiling

First approach:

Table 1:

post_id | post_content | post_title | creation_time 

Table 2, for storing votes, likes, emoticons:

 action_id | post_id | action_type | action_creator | creation_time 

To display the listing page or a single post, the first table is queried for the posts and the second for all actions associated with them. Whenever a vote or similar action is cast, a row is inserted into the post_actions table.
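The read path of this first approach can be sketched as follows. This is an illustrative sketch only, using Python's sqlite3 as a stand-in for MySQL/InnoDB; the table and column names come from the question, and the sample data is made up.

```python
import sqlite3

# Approach 1: fully normalized. Counts are computed at read time with a JOIN.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    post_id INTEGER PRIMARY KEY,
    post_title TEXT,
    post_content TEXT,
    creation_time TEXT
);
CREATE TABLE post_actions (
    action_id INTEGER PRIMARY KEY,
    post_id INTEGER REFERENCES posts(post_id),
    action_type TEXT,
    action_creator INTEGER,
    creation_time TEXT
);
""")
conn.execute("INSERT INTO posts (post_id, post_title, post_content, creation_time) "
             "VALUES (1, 'BIG FULL REPORT', 'Not so funny content', '2014-01-01')")
votes = [("like", 10), ("like", 11), ("like", 12), ("boring", 13), ("smile", 14)]
conn.executemany(
    "INSERT INTO post_actions (post_id, action_type, action_creator, creation_time) "
    "VALUES (1, ?, ?, '2014-01-01')", votes)

# Listing page: one query joins posts to their aggregated action counts.
rows = conn.execute("""
    SELECT p.post_id, p.post_title, a.action_type, COUNT(*) AS cnt
    FROM posts p JOIN post_actions a ON a.post_id = p.post_id
    GROUP BY p.post_id, a.action_type
""").fetchall()
counts = {(r[0], r[2]): r[3] for r in rows}
print(counts[(1, "like")])  # 3
```

With 25 posts per page the `GROUP BY` runs over all their actions on every page view, which is exactly the cost the second approach tries to avoid.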

Second approach:

Table 1:

 post_id | post_content | post_title | creation_time | action_data 

Where action_data could be something like { "likes" : 3,"smiles":4 ...}

Table 2:

 action_id | post_id | action_type | action_creator | creation_time 

To display the listing page, only the first table is queried, returning both the posts and their action data; to display a single post with detailed actions, the second table is also queried for all actions associated with that post. Whenever a vote or similar action is cast, a row is inserted into the post_actions table, and the action_data field in table 1 is updated to store the new counts.
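The write path of the second approach is a read-modify-write on the JSON column. A minimal sketch of that logic, where `record_action` is a hypothetical helper (in MySQL the same read-modify-write should run inside a transaction, e.g. with `SELECT ... FOR UPDATE`, to avoid lost updates under concurrency):

```python
import json

def record_action(action_data_json, action_type):
    """Bump one counter inside the denormalized action_data JSON blob."""
    counts = json.loads(action_data_json) if action_data_json else {}
    counts[action_type] = counts.get(action_type, 0) + 1
    return json.dumps(counts, sort_keys=True)

row = '{"likes": 3, "smiles": 4}'   # current action_data value from table 1
row = record_action(row, "likes")
row = record_action(row, "boring")
print(row)  # {"boring": 1, "likes": 4, "smiles": 4}
```

Note that every action now costs one INSERT into table 2 plus one UPDATE of table 1, which matters at the stated peak of ~50 actions per second.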

Assuming there are 100K posts and 10x as many actions, i.e. 1 million or more action rows: does approach 2 have benefits? Any shortcomings besides having to read, modify, and write back the JSON data? Are there ways approach 2 can be measured and improved?

Adding additional information based on feedback:

  • Python scripts will read and write the data.
  • The MySQL database servers will be separate from the web servers.
  • Inserts from post creation are low, i.e. about 10,000 per day, but inserts from actions can be much higher, say a peak of 50 inserts per second from votes, likes, and emoticons.
  • I care about comparing write and read performance of both approaches, both now and as the data grows.
+4
4 answers

I would recommend either storing all the like/vote data (aggregated and atomic) inside table 1 and dropping table 2 entirely, OR using two tables without aggregated data, relying on JOINs, smart queries, and good indexes.

Why? Because otherwise you will be querying and writing to both tables every time a comment/vote/like is made. With around 10 actions per post that exist only to display interaction, I would store all of it in table 1, possibly adding an extra column for each action type. You could use JSON or just serialize() arrays, which should be slightly faster.

Which solution you choose in the end will depend greatly on how many actions you get and how you want to use them. Getting all actions for one post is easy and very fast with the first solution, but searching inside the serialized data is impractical. The second solution, on the other hand, costs more storage and requires careful query design and indexes.

+7

Assuming that the system receives far more reads than writes, I can think of several ways to do this. You can take advantage of the fact that social networking sites do not really need strictly consistent data, only eventually consistent data, as long as each user always sees their own actions.

Option 1.

Add a column for each action type to table 1 and increment it each time a new action occurs. That way, the main listing page is very fast.

Table 1

 post_id | post_content | post_title | creation_time | action1_count | action2_count | action3_count | ... 

What is nice about this approach is that when rendering a permalink you do not need to fetch all the actions for a post from table 2. Just fetch the last 5 actions of each type, plus any actions taken by the viewer. For inspiration, see: How to get the last 2 items for each category in one select (with mysql)
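The counter-column idea boils down to a single atomic UPDATE per action. A sketch, again using sqlite3 in place of MySQL and made-up column names:

```python
import sqlite3

# Option 1: action counts live on the post row itself, so the listing page
# needs no join at all.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE posts (
    post_id INTEGER PRIMARY KEY,
    post_title TEXT,
    like_count INTEGER DEFAULT 0,
    boring_count INTEGER DEFAULT 0,
    smile_count INTEGER DEFAULT 0)""")
conn.execute("INSERT INTO posts (post_id, post_title) VALUES (1, 'BIG FULL REPORT')")

# A relative increment (count = count + 1) is safe under concurrency,
# unlike reading the value into Python and writing it back.
for _ in range(3):
    conn.execute("UPDATE posts SET like_count = like_count + 1 WHERE post_id = 1")
conn.execute("UPDATE posts SET smile_count = smile_count + 1 WHERE post_id = 1")

likes, smiles = conn.execute(
    "SELECT like_count, smile_count FROM posts WHERE post_id = 1").fetchone()
print(likes, smiles)  # 3 1
```

The trade-off versus the JSON column is that adding a new action type means an ALTER TABLE instead of just a new JSON key.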

Option 2.

This is similar to your first approach, but the action counts are kept in a Redis hash, or just as a JSON object in memcached. That makes the read-heavy home page queries fast. The downside is that if Redis restarts (and, with memcached, whenever it restarts) you need to reinitialize the counts, or simply rebuild them lazily when someone views the permalink page.
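A sketch of the cache variant. With the real redis-py client the calls would be `r.hincrby(...)` and `r.hgetall(...)`; a tiny in-memory stand-in class is used here so the example runs without a Redis server, and the key name `post:1:counts` is an assumption:

```python
class FakeRedisHashes:
    """Minimal stand-in mimicking Redis HINCRBY/HGETALL semantics."""
    def __init__(self):
        self.store = {}

    def hincrby(self, key, field, amount=1):
        h = self.store.setdefault(key, {})
        h[field] = h.get(field, 0) + amount
        return h[field]

    def hgetall(self, key):
        return dict(self.store.get(key, {}))

r = FakeRedisHashes()
for action in ["like", "like", "smile", "like"]:
    r.hincrby("post:1:counts", action)

# On a cache miss (e.g. after a Redis restart), rebuild the hash from table 2
# with SELECT action_type, COUNT(*) ... GROUP BY action_type, then re-populate.
print(r.hgetall("post:1:counts"))  # {'like': 3, 'smile': 1}
```

Because HINCRBY is atomic on the server, concurrent voters never clobber each other's increments.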

+2

First of all, I would say that Option 2 smells of premature optimization. Unless you already have measurements showing that the extra join for the listing-page queries hurts performance, I would stick to Option 1.

The main problem with Option 2 is maintenance: every time you need to change something you have to change it in two places, and to fix a bug or backfill old records with a new field you will need to perform string manipulation on the JSON across all posts, usually on the database side.

In my experience, the performance advantage of Option 2 will be negligible; most of the latency of a database query (at least for such short queries) comes from the round trip to the remote server.

Also, if you encapsulate the query properly, switching between the two approaches (or to a different one, e.g. caching the most frequently read records) will be quite simple. Use the simplest approach (Option 1) first, and change it only when you have data about actual problems with your implementation (which are unlikely to be the ones you expect now).

For clarity, here is a list of the advantages and disadvantages of Option 1 (Option 2 is the mirror image):

Option 1

pros

  • Fast writes
  • Easy maintenance
  • Lower storage requirements
  • No data duplication

cons

  • Slower reads for the listing page
+2

The performance difference between insert, delete, and update matters: inserts are much faster than deletes and updates. Therefore, I would choose the solution that minimizes deletes and updates.

Table 1 would look like your first approach:
post_id | post_content | post_title | creation_time

Table 2 is almost the same, but without action_id:
post_id | action_type | action_creator | creation_time

Table 2 would have a composite index on the post_id, action_type, and action_creator columns.

The column order of the composite index matters for fast queries, because the index can be used even when not all of its columns appear in the query, as long as the columns used form a leftmost prefix. The index will be used for this query: select ... from table_2 where post_id = 1 and action_type = 2
but not fully for this one: select ... from table_2 where post_id = 1 and action_creator = 2

A quick explanation: a composite index works like a tree, and to use a column you must also use all the columns above it in the tree. That is, you cannot filter on action_creator via the index without also filtering on post_id and action_type.

 -post_id |--action_type |--action_creator 
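The leftmost-prefix rule can be illustrated with a small sketch. `usable_prefix` is a hypothetical helper, not a MySQL API: it reports how many leading index columns a set of equality predicates can use.

```python
# Composite index column order, as in the answer above.
INDEX = ("post_id", "action_type", "action_creator")

def usable_prefix(index_cols, where_cols):
    """Count the leading index columns covered by the WHERE clause's
    equality predicates; a gap stops index usage for later columns."""
    used = 0
    for col in index_cols:
        if col in where_cols:
            used += 1
        else:
            break
    return used

print(usable_prefix(INDEX, {"post_id", "action_type"}))     # 2 -> both via index
print(usable_prefix(INDEX, {"post_id", "action_creator"}))  # 1 -> creator filtered after
print(usable_prefix(INDEX, {"action_creator"}))             # 0 -> index unusable
```

This matches the two example queries above: the first uses both filtered columns through the index, while the second can use only `post_id` and must check `action_creator` row by row.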

However, now all your queries always hit the composite index, and writes are basically just inserts into both table 1 and table 2.

If you eventually end up with a huge table 2 because of the large number of actions, you can partition it by post_id in the future. Since most of your users hit the newer entries most of the time, you can put that "hot" partition on faster disks and give it more of the database's memory cache. Or later add http://memcached.org/ in front of the database.
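The routing logic behind range partitioning by post_id can be sketched like this; the boundary values and partition names are made up for illustration (in MySQL this would be declared with `PARTITION BY RANGE (post_id)` rather than done in application code):

```python
# Upper-bound post_id and name for each cold/warm partition; newer posts
# (higher ids) fall through to the "hot" partition on faster storage.
BOUNDARIES = [(50_000, "p_archive"), (100_000, "p_warm")]
HOT = "p_hot"

def partition_for(post_id):
    """Return the partition a given post_id's action rows would live in."""
    for upper, name in BOUNDARIES:
        if post_id < upper:
            return name
    return HOT

print(partition_for(10))       # p_archive
print(partition_for(99_999))   # p_warm
print(partition_for(123_456))  # p_hot
```

Because queries filter on post_id (the leading index column), the database can prune to a single partition per lookup.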

-1

Source: https://habr.com/ru/post/1440735/
