How Firefox Reader View Works

Summary

I’m looking for criteria by which I can create a web page and be [reasonably] sure that it will be available in Firefox Reader View if the user wishes.

Some sites offer this option and some do not. Some pages with more text lack the option, while others with much less text have it. Stack Overflow, for instance, displays only the question, not any answers, in Reader View.

Question

I updated Firefox from 38.0.1 to 38.0.5 and found a new Reader View feature - a kind of overlay that removes “page clutter” and makes text easier to read. Reader View appears at the right-hand side of the address bar as a clickable icon on certain pages.

This is great, but from a programming point of view I want to know how Reader View works and what criteria determine which pages it applies to. I did some research on the Mozilla Firefox website without finding clear answers (the answers I found all concerned other programming topics). I, of course, Googled / Binged, and this only came back with links to Firefox add-ons - this is not an add-on but a core part of the new Firefox version.

I assumed that Reader View used HTML5 and extracted the contents of <article> tags, but this is not the case, since it works on Wikipedia, which does not seem to use <article> or similar HTML5 tags; instead, Reader View picks out specific <div> elements and displays them alone. This feature works on some HTML5 pages, such as Wikipedia, but not on others.

If anyone has any idea how Firefox Reader View works and how website developers can make use of it, could you share it? Or, if you know where this information can be found, could you point me in the right direction, since I could not find it?

+43
javascript firefox firefox-reader-view
Jun 05 '15 at 8:18
3 answers

You need at least one <p> tag around the text that you want to see in Reader View, and at least 516 characters in 7 words inside that text.

For example, this will trigger Reader View:

 <body> <p> 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 123456789 123456 </p> </body> 

See my example at https://stackoverflow.com/a/312960/
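
The two thresholds quoted in this answer can be checked with a few lines of JavaScript. This is only a sketch of the rule as stated here, not code taken from Firefox: the helper name `meetsReportedThresholds` is hypothetical, and counting whitespace toward the 516 characters is an assumption.

```javascript
// Thresholds as reported in the answer above; not verified against Firefox's source.
const MIN_CHARS = 516;
const MIN_WORDS = 7;

// Hypothetical helper: does a paragraph's text reach both reported thresholds?
// Assumption: whitespace between words counts toward the character total.
function meetsReportedThresholds(text) {
  const trimmed = text.trim();
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  return trimmed.length >= MIN_CHARS && wordCount >= MIN_WORDS;
}

// A text shaped like the sample: five 99-character words plus two short ones.
const longWord = "1234567890".repeat(9) + "123456789"; // 99 characters
const sampleText = [
  longWord, longWord, longWord, longWord, longWord, "123456789", "123456",
].join(" ");

console.log(sampleText.length, meetsReportedThresholds(sampleText)); // prints: 516 true
```

Removing a single character from `sampleText` makes the check fail, which matches the "at least 516 characters" wording.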

+41
Jun 10 '15 at 7:51

Having read the GitHub code this morning, the process seems to be that the page elements are listed in order of likelihood, with <section>, <p>, <div> and <article> at the top of the list (i.e. most likely).

Then, each of these “nodes” is assigned a score based on things such as comma counts and the class names that apply to the node. This is a somewhat multifaceted process in which scores are added for chunks of text, but scores are also apparently reduced for invalid parts or syntax. The scores of the parts inside a node are reflected in that node's overall score, i.e. the parent element contains the scores of all the lower elements, I think.
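
The scoring idea described above might look roughly like the following. It is a simplified sketch over plain objects, not the Readability.js implementation: the pattern lists, the weight of 25 and the names `classWeight`, `commaScore` and `scoreNode` are all assumptions made for illustration.

```javascript
// Illustrative patterns only; the real library's lists may differ.
const POSITIVE_HINT = /article|body|content|entry|main|post|text/i;
const NEGATIVE_HINT = /comment|footer|sidebar|sponsor|widget|banner/i;

// Assumed weight: class names that look like content add points,
// class names that look like clutter subtract them.
function classWeight(className) {
  let weight = 0;
  if (NEGATIVE_HINT.test(className)) weight -= 25;
  if (POSITIVE_HINT.test(className)) weight += 25;
  return weight;
}

// One point per comma in the node's own text.
function commaScore(text) {
  return text.split(",").length - 1;
}

// A node is { className, text, children }. Its score is its own score
// plus the scores of everything nested inside it.
function scoreNode(node) {
  const own = classWeight(node.className) + commaScore(node.text);
  const nested = node.children.reduce((sum, child) => sum + scoreNode(child), 0);
  return own + nested;
}

const page = {
  className: "post-content",
  text: "",
  children: [
    { className: "", text: "One, two, three.", children: [] },
    { className: "comment", text: "Nice, thanks", children: [] },
  ],
};

console.log(scoreNode(page)); // prints: 3  (25 + 2 + (-25 + 1))
```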

The value of this score determines whether an HTML page can be viewed in Firefox Reader View.

I am not entirely clear whether the score value is set by Firefox or by the readability function.

JavaScript is really not my forte, so I think someone else should check out the link provided by Richard ( https://github.com/mozilla/readability ) and see whether they can provide a more thorough answer.

What I did not see, but expected to see, is a score based on the amount of text content in the <p> or <div> tags (or others).

If you have any improvements to this question or answer, please share!

EDIT: Images in <div> or <figure> (HTML5) tags within the <p> element appear to be retained in Reader View when the page text qualifies.

+18
Jun 06 '15 at 22:43

I followed Martin's link to the Readability.js GitHub repository and looked at the source code. Here is what I could make out.

The algorithm works with paragraph tags. First, it tries to identify parts of the page that are definitely not content, such as forms, and removes them. Then it goes through the paragraph nodes on the page and assigns each a score based on how content-rich it is: it gives them points for things like the number of commas, the length of the content, etc. Note that a paragraph with fewer than 25 characters is immediately discarded.
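
As a rough illustration of the paragraph scoring step, here is a minimal sketch. The 25-character cut-off comes from the description above; the base score of 1, one point per comma and one point per 100 characters capped at 3 are assumptions, and `scoreParagraph` is a hypothetical name rather than a function from the library.

```javascript
const MIN_PARAGRAPH_LENGTH = 25; // cut-off mentioned in the answer

// Returns null for paragraphs that are discarded, otherwise a numeric score.
function scoreParagraph(text) {
  const trimmed = text.trim();
  if (trimmed.length < MIN_PARAGRAPH_LENGTH) return null;

  let score = 1;                                          // assumed base score
  score += trimmed.split(",").length - 1;                 // one point per comma
  score += Math.min(Math.floor(trimmed.length / 100), 3); // assumed length bonus, capped
  return score;
}

console.log(scoreParagraph("Too short."));                                               // prints: null
console.log(scoreParagraph("This sentence, with two commas, is long enough to count.")); // prints: 3
console.log(scoreParagraph("a".repeat(1000)));                                           // prints: 4
```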

Then the scores “bubble up” the DOM tree: each paragraph adds part of its score to all of its ancestor nodes. The direct parent gets the full score added to its total, the grandparent only half, the great-grandparent a third, and so on. This allows the algorithm to identify higher-level elements that are likely to be the main content section.
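
The bubbling step can be sketched as a walk up the parent chain. The divisors below (1, 2, 3, ...) follow the description in this answer; the library itself may use different values. The node objects and the name `bubbleScore` are hypothetical.

```javascript
// Adds a paragraph's score to each ancestor: full score to the parent,
// half to the grandparent, a third to the great-grandparent, and so on.
function bubbleScore(paragraph, score, totals) {
  let level = 1;
  for (let node = paragraph.parent; node; node = node.parent) {
    totals.set(node, (totals.get(node) || 0) + score / level);
    level += 1;
  }
}

// Tiny stand-in for a DOM branch: body > article > div > p
const body = { name: "body", parent: null };
const article = { name: "article", parent: body };
const div = { name: "div", parent: article };
const p = { name: "p", parent: div };

const totals = new Map();
bubbleScore(p, 6, totals);

for (const [node, total] of totals) {
  console.log(node.name, total);
}
// prints:
// div 6
// article 3
// body 2
```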

Although this is just Firefox's algorithm, my guess is that if a page works well for Firefox, it will work well for other browsers too.

For these Reader View algorithms to work on your website, you want them to correctly identify the content-heavy sections of your page. This means that you want the content-heavy nodes on your page to get high scores in the algorithm.

So, here are some rules of thumb to improve page quality in the eyes of these algorithms:

  • Use paragraph tags in your content! Many people tend to ignore them in favor of <br /> tags. Although the result may look similar, many content-related algorithms (not just Reader View) depend heavily on them.
  • Use HTML5 semantic elements in your markup, e.g. <article>, <nav>, <section>, <aside>. Although they are not the only criterion (as you noted in the question), they are very useful for computers reading your pages (not just Reader View) to distinguish between different sections of your content. Readability.js uses them to guess which nodes are likely or unlikely to contain important content.
  • Wrap the main content in a single container, such as an <article> or <div> element. This container will receive score points from all the paragraph tags within it and be identified as the main content section.
  • Keep the DOM tree shallow in content-dense areas. If you have many elements breaking up your content, you only make life more difficult for the algorithm: there will not be a single element that stands out as the parent of many paragraphs of content, but many separate ones with low scores.
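
To see why a single shallow container helps, the sketch below compares two hypothetical layouts using the bubbling rule described earlier in this answer (full score to the parent, half to the grandparent). The node names, the flat score of 6 per paragraph and the helper `totalsFor` are assumptions made for illustration only.

```javascript
function makeNode(name, parent = null) {
  return { name, parent };
}

// Sums bubbled paragraph scores per ancestor name (names are unique here).
function totalsFor(paragraphs, scorePerParagraph) {
  const totals = new Map();
  for (const p of paragraphs) {
    let level = 1;
    for (let node = p.parent; node; node = node.parent) {
      totals.set(node.name, (totals.get(node.name) || 0) + scorePerParagraph / level);
      level += 1;
    }
  }
  return totals;
}

// Layout A: body > article > four paragraphs
const bodyA = makeNode("body");
const articleA = makeNode("article", bodyA);
const sharedParagraphs = [1, 2, 3, 4].map((i) => makeNode("p" + i, articleA));

// Layout B: body > four wrapper divs, each holding one paragraph
const bodyB = makeNode("body");
const splitParagraphs = [1, 2, 3, 4].map((i) =>
  makeNode("p" + i, makeNode("div" + i, bodyB))
);

const sharedTotals = totalsFor(sharedParagraphs, 6);
const splitTotals = totalsFor(splitParagraphs, 6);

console.log(sharedTotals.get("article")); // prints: 24
console.log(splitTotals.get("div1"));     // prints: 6
```

In layout A one element clearly stands out as the content container; in layout B each wrapper collects only a single paragraph's score.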
+8
Nov 22 '16 at 16:58