What is the best way to store a markdown field in my database when I need to display both HTML and plain text representations?

I have a database and a website. One field in my interface is plain text, but I want it to support Markdown. I'm trying to work out what should be stored in the database, because several views need to be supported (PDF reports, web pages, Excel files, etc.).

My concern is that since some of these views do not support HTML, I do not just want to have an HTML version of this field.

Should I store two copies (one plain text and one HTML), or should I store HTML and strip the HTML tags on the fly when rendering for Excel, for example?

I need to figure out the correct format (or formats) to store in the database in order to be able to display both:

  • HTML and
  • Plain text (no Markdown or HTML syntax)

Any suggestions would be appreciated, as I don't want to go down the wrong path. To be clear, I do not want any HTML tags or Markdown syntax to show up in my Excel output.

+6
6 answers

Do it like this:

  • Store the original data (the Markdown text).
  • Generate the derived data (HTML and plain text) on the fly.
  • Measure performance:
    • If it is acceptable, you're done, woohoo!
    • If not, cache the derived data.

Caching can be done in many ways: you can generate the derived data immediately and store it in the database, or you can store NULL at first and generate it lazily (if and when it is needed). You can even cache it outside the database.

But no matter what you do, make sure that the cache is never "out of date": when the original data changes, the derived data in the cache must be regenerated or at least marked as "dirty." One way to do this is through triggers.
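As a rough illustration of the lazy cache with trigger-based invalidation described above, here is a minimal sketch using SQLite from Python. The table name `note`, the columns `body_md` and `body_html`, and the `render_html` function are all assumptions made up for this example; in particular `render_html` is only a placeholder, and a real application would call an actual Markdown library there.

```python
import html
import sqlite3


def render_html(markdown_text: str) -> str:
    # Placeholder renderer (assumption): a real application would call a
    # Markdown library here. This one only escapes the text and wraps it.
    return "<p>" + html.escape(markdown_text) + "</p>"


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE note (
        id        INTEGER PRIMARY KEY,
        body_md   TEXT NOT NULL,   -- original data (Markdown)
        body_html TEXT             -- cached derived data, NULL = dirty
    );

    -- When the Markdown changes, mark the cached HTML as dirty.
    CREATE TRIGGER note_invalidate_html
    AFTER UPDATE OF body_md ON note
    BEGIN
        UPDATE note SET body_html = NULL WHERE id = NEW.id;
    END;
    """
)


def get_html(note_id: int) -> str:
    """Return the cached HTML, regenerating it lazily if it is dirty."""
    body_md, body_html = conn.execute(
        "SELECT body_md, body_html FROM note WHERE id = ?", (note_id,)
    ).fetchone()
    if body_html is None:
        body_html = render_html(body_md)
        conn.execute(
            "UPDATE note SET body_html = ? WHERE id = ?", (body_html, note_id)
        )
    return body_html
```

The trigger fires only on updates to `body_md`, so writing the regenerated HTML back into `body_html` does not invalidate the cache again. The same pattern would apply to a cached plain-text column.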

+9

You need to store the data in a canonical format, that is, in the one true format in your database. It sounds like that format should be a text column containing Markdown. This answers the database design part of your question.

Then, depending on which format you need to export, you take the canonical format and convert it to the desired output format. This can be as simple as outputting the Markdown text as-is, or it can mean running it through some kind of parser to strip the Markdown or convert it to HTML.
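To make the "convert from the canonical format on export" idea concrete, here is a small sketch in Python. The function names `markdown_to_plain` and `export` are hypothetical, and the regular expressions cover only a small subset of Markdown (links, images, `#` headings, asterisk emphasis, inline code). They will mangle some ordinary text, for example arithmetic written with asterisks, so a real Markdown parser would be the safer choice in practice.

```python
import re


def markdown_to_plain(markdown_text: str) -> str:
    """Strip a small subset of Markdown syntax, leaving plain text."""
    text = markdown_text
    # Images first, then links: keep only the alt text / link text.
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Leading heading markers such as "## Title".
    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)
    # Bold and italic markers written with asterisks: **bold**, *italic*.
    text = re.sub(r"\*{1,3}([^*\n]+)\*{1,3}", r"\1", text)
    # Inline code spans.
    text = re.sub(r"`([^`\n]+)`", r"\1", text)
    return text


def export(markdown_text: str, fmt: str) -> str:
    """Convert the canonical Markdown to the requested output format."""
    if fmt == "markdown":
        return markdown_text
    if fmt == "plain":
        return markdown_to_plain(markdown_text)
    # HTML is left out on purpose: it would need a real Markdown renderer.
    raise ValueError(f"unsupported format: {fmt}")
```

The point of the `export` dispatcher is that every output format is derived from the single stored Markdown value, so adding a new view later does not require a schema change.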

+6

Most seem to say that they simply store the data as HTML in a database and then process it to turn it into plain text. In my opinion, there are some disadvantages:

  • Most likely, you will need application code to strip the HTML and extract the plain text. Imagine you wanted to do this in SQL Server. What if you want to write a stored procedure or query that uses the plain text version? How do you extract the text in SQL? It is possible with a function, but it is a lot of work.

  • Processing an HTML block can be slow. I would guess that for small HTML blocks it will be very fast, but it is certainly more overhead than just reading a text field.

  • HTML parsers do not always work well, and they can be complex. Your users can be very creative and insert fragments that your parser will not handle well. From experience, I know that it's not always trivial to extract plain text from HTML.

I would suggest what most email providers do:

  • Keep rich text / HTML version and text version. Two fields in the database.
  • As with email providers, users may want these two fields to have different content.
  • You can write a user interface function that allows the user to enter HTML code and then convert it through the application into a text version. This gives the user a good starting point, and they can massage / edit the plain text version before saving to the database.
+2

I would suggest storing it as HTML, since that is the richest format in this case, and stripping the tags when retrieving the data for other formats (such as PDF, LaTeX, or something else). The following question shows an easy way to strip tags.

Regular expression to remove HTML tags
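For illustration, here is one way tag stripping might look in Python. Note that this sketch deliberately swaps the regular-expression approach from the linked question for the standard library's `html.parser.HTMLParser`, which also decodes entities such as `&amp;`. The class name `_TextExtractor` and the function `html_to_text` are made up for this example, and skipping the contents of `script` and `style` elements is an added assumption.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and script/style bodies."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        return "".join(self._chunks)


def html_to_text(markup: str) -> str:
    """Return the text content of an HTML fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

This keeps whitespace exactly as it appears in the markup and does not insert line breaks for block elements, so output destined for Excel or PDF would probably need some extra normalization.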

From my point of view, storing the data (original and stripped) in two separate fields is a waste of space, and also an integrity problem, since in theory one of the fields can be modified without the other being changed.

Good luck

+1

  • Always keep the source; in your case that is Markdown.
  • Also save frequently used formats.
  • Use on-demand conversion / rendering for less commonly used formats.

    Explanation:

  • Always keep the source. It may be needed for various purposes, e.g. re-editing the same input, audit trails, debugging, etc.

  • CPU / RAM overhead: if the same format is requested frequently, you trade that overhead for disk storage, which is cheap compared to the former.

  • Temporary overhead, see #2.

+1

I think what I would do, if storage is not a problem, is store the canonical version, but automatically generate from it, in persisted computed fields, any other versions that might be required. You want the fields to be persisted because it makes no sense to redo the conversion every time you need the data. And you want them to be computed because you don't want them to get out of sync with the canonical version.

In essence, this is using the database as a cache for other versions, but a cache that guarantees data integrity.

0

Source: https://habr.com/ru/post/947893/

