How do I avoid many database round trips and a lot of redundant data?

I have worked with various applications and have repeatedly come across this situation. So far I have not figured out what works best.

Here's the scenario:

  • I have an application, either desktop or web.
  • I need to fetch simple documents from a database. Each document contains general information plus details about its items, hence these tables:

GeneralDetails table:

 | DocumentID | DateCreated | Owner  |
 | 1          | 07/07/07    | Naruto |
 | 2          | 08/08/08    | Goku   |
 | 3          | 09/09/09    | Taguro |

ItemDetails table:

 | DocumentID | Item        | Quantity |
 | 1          | Marbles     | 20       |
 | 1          | Cards       | 56       |
 | 2          | Yo-yo       | 1        |
 | 2          | Chess board | 3        |
 | 2          | GI Joe      | 12       |
 | 3          | Rubber Duck | 1        |

As you can see, the tables have a one-to-many relationship. Now, to get all the documents together with their corresponding items, I always do one of two things:

Method 1 - multiple round trips (pseudocode):

  Documents = GetFromDB("select DocumentID, Owner " +
                        "from GeneralDetails")
  For Each Document in Documents
  {
      Display(Document["Owner"])
      DocumentItems = GetFromDB("select Item, Quantity " +
                                "from ItemDetails " +
                                "where DocumentID = " + Document["DocumentID"])
      For Each DocumentItem in DocumentItems
      {
          Display(DocumentItem["Item"] + " " + DocumentItem["Quantity"])
      }
  }

Method 2 - a lot of redundant data (pseudocode):

 DocumentsAndItems = GetFromDB("select g.DocumentID, g.Owner, i.Item, i.Quantity " + "from GeneralDetails as g " + "inner join ItemDetails as i " + "on g.DocumentID = i.DocumentID") //Display... 

I used the first method back in college for desktop applications; the performance seemed fine, so I assumed everything was in order.

Until one day I read an article along the lines of "make the web faster", which said that many round trips to the database are bad. Since then, I have used the second method.

With the second method I avoid round trips by using an inner join to retrieve both tables at once, but it produces unnecessary, redundant data. See the result set:

 | DocumentID | Owner  | Item        | Quantity |
 | 1          | Naruto | Marbles     | 20       |
 | 1          | Naruto | Cards       | 56       |
 | 2          | Goku   | Yo-yo       | 1        |
 | 2          | Goku   | Chess board | 3        |
 | 2          | Goku   | GI Joe      | 12       |
 | 3          | Taguro | Rubber Duck | 1        |

The result set repeats DocumentID and Owner on every row. It looks like a denormalized table.

So the question is: how can I avoid the round trips and, at the same time, avoid the redundant data?

+6
10 answers

The approach used by ActiveRecord and other ORMs is to select from the first table, collect the identifiers, and then use those identifiers in an IN clause for the second select.

SELECT * FROM ItemDetails WHERE DocumentId IN ([Comma-separated list of identifiers here])

Benefits:

  • No redundant data

Disadvantages:

  • Two queries

Generally speaking, your first method is known as the "N + 1 query problem", and the solutions to it are called "eager loading". I am inclined to think your Method 2 is preferable: database latency usually costs more than the extra redundant data costs in transfer time, but YMMV. Like almost everything in software, it is a trade-off.
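As a concrete illustration, here is a hedged C#/ADO.NET sketch of that two-query eager load against the question's tables. The parameter-building names are mine, not from any library; the point is that parameterizing the IN list is safer than concatenating raw IDs into the SQL:

 using System.Collections.Generic;
 using System.Data.SqlClient;
 using System.Linq;

 static void EagerLoad(SqlConnection conn)
 {
     // Query 1: the parent rows.
     var documentIds = new List<int>();
     using (var cmd = new SqlCommand("SELECT DocumentID, Owner FROM GeneralDetails", conn))
     using (var reader = cmd.ExecuteReader())
         while (reader.Read())
             documentIds.Add((int)reader["DocumentID"]);

     if (documentIds.Count == 0) return;

     // Query 2: all child rows in one shot, via a parameterized IN clause.
     var names = documentIds.Select((id, i) => "@p" + i).ToArray();
     var sql = "SELECT DocumentID, Item, Quantity FROM ItemDetails " +
               "WHERE DocumentID IN (" + string.Join(", ", names) + ")";
     using (var cmd = new SqlCommand(sql, conn))
     {
         for (int i = 0; i < documentIds.Count; i++)
             cmd.Parameters.AddWithValue(names[i], documentIds[i]);
         using (var reader = cmd.ExecuteReader())
             while (reader.Read())
             {
                 // Group the items by DocumentID in memory as needed...
             }
     }
 }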

+3

An inner join is better, because it gives the database more options for optimization.

In general, you cannot write a query that produces no redundant results; the relational model is too restrictive for that. I would just live with it: optimizing these cases is the database's job.

If you really do run into performance problems (most likely a network bottleneck), you can write a stored procedure that runs the query and collapses the detail rows. For your example, it would produce a result like:

 | DocumentID | Owner  | Items                      | Quantity |
 | 1          | Naruto | Marbles, Cards             | 20, 56   |
 | 2          | Goku   | Yo-yo, Chess board, GI Joe | 1, 3, 12 |
 | 3          | Taguro | Rubber Duck                | 1        |

But this, of course, no longer conforms to first normal form, so you will need to parse it on the client. If you use an XML-enabled database (for example, Oracle or MS SQL Server), you can even build an XML document on the server and send that to the client.
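On a recent SQL Server, the collapsing itself can be done in plain SQL. Here is a hedged sketch assuming SQL Server 2017+ for STRING_AGG (older versions would need the FOR XML PATH('') concatenation trick instead), called from the kind of client code used elsewhere in this thread:

 using System;
 using System.Data.SqlClient;

 static void PrintCollapsed(string connectionString)
 {
     // One row per document; detail rows are collapsed into CSV columns.
     // WITHIN GROUP keeps the Items and Quantity lists in the same order.
     const string sql = @"
         SELECT g.DocumentID, g.Owner,
                STRING_AGG(i.Item, ', ')
                    WITHIN GROUP (ORDER BY i.Item) AS Items,
                STRING_AGG(CAST(i.Quantity AS varchar(10)), ', ')
                    WITHIN GROUP (ORDER BY i.Item) AS Quantity
         FROM GeneralDetails AS g
         INNER JOIN ItemDetails AS i ON g.DocumentID = i.DocumentID
         GROUP BY g.DocumentID, g.Owner;";

     using (var conn = new SqlConnection(connectionString))
     using (var cmd = new SqlCommand(sql, conn))
     {
         conn.Open();
         using (var reader = cmd.ExecuteReader())
             while (reader.Read())
                 Console.WriteLine("{0} | {1} | {2} | {3}",
                     reader["DocumentID"], reader["Owner"],
                     reader["Items"], reader["Quantity"]);
     }
 }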

But no matter what you do, remember: premature optimization is the root of all evil. Do not do this kind of thing before you are 100% sure that you really have a problem that you can solve in this way.

+3

You can read the first table, extract the keys, and then fetch the matching rows from the second table with a second select.

Something like:

 DocumentItems = GetFromDB("select Item, Quantity " +
                           "from ItemDetails " +
                           "where DocumentID in (" + LISTING_OF_KEYS + ")")
+2

The second method is, of course, the way to go. But you should not select columns you are not going to use, so if you only need Item and Quantity, do this:

 DocumentsAndItems = GetFromDB("select i.Item, i.Quantity " + "from GeneralDetails as g " + "inner join ItemDetails as i " + "on g.DocumentID = i.DocumentID") 

(I assume you have other conditions that would go into the where part of the query; otherwise the join is not needed at all.)
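For example, with a purely hypothetical filter in the where part:

 DocumentsAndItems = GetFromDB("select i.Item, i.Quantity " +
                               "from GeneralDetails as g " +
                               "inner join ItemDetails as i " +
                               "on g.DocumentID = i.DocumentID " +
                               "where g.Owner = 'Goku'")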

+1

If you are using .NET and MS SQL Server, a simple solution would be to look into MARS (Multiple Active Result Sets). Here is sample code pulled directly from the Visual Studio 2015 help on the MARS demo:

 using System;
 using System.Data;
 using System.Data.SqlClient;

 class Class1
 {
     static void Main()
     {
         // By default, MARS is disabled when connecting
         // to a MARS-enabled host.
         // It must be enabled in the connection string.
         string connectionString = GetConnectionString();

         int vendorID;
         SqlDataReader productReader = null;
         string vendorSQL = "SELECT VendorId, Name FROM Purchasing.Vendor";
         string productSQL =
             "SELECT Production.Product.Name FROM Production.Product " +
             "INNER JOIN Purchasing.ProductVendor " +
             "ON Production.Product.ProductID = " +
             "Purchasing.ProductVendor.ProductID " +
             "WHERE Purchasing.ProductVendor.VendorID = @VendorId";

         using (SqlConnection awConnection = new SqlConnection(connectionString))
         {
             SqlCommand vendorCmd = new SqlCommand(vendorSQL, awConnection);
             SqlCommand productCmd = new SqlCommand(productSQL, awConnection);

             productCmd.Parameters.Add("@VendorId", SqlDbType.Int);
             awConnection.Open();

             using (SqlDataReader vendorReader = vendorCmd.ExecuteReader())
             {
                 while (vendorReader.Read())
                 {
                     Console.WriteLine(vendorReader["Name"]);

                     vendorID = (int)vendorReader["VendorId"];
                     productCmd.Parameters["@VendorId"].Value = vendorID;

                     // The following line of code requires
                     // a MARS-enabled connection.
                     productReader = productCmd.ExecuteReader();
                     using (productReader)
                     {
                         while (productReader.Read())
                         {
                             Console.WriteLine("  " + productReader["Name"].ToString());
                         }
                     }
                 }
             }
             Console.WriteLine("Press any key to continue");
             Console.ReadLine();
         }
     }

     private static string GetConnectionString()
     {
         // To avoid storing the connection string in your code,
         // you can retrieve it from a configuration file.
         return "Data Source=(local);Integrated Security=SSPI;" +
                "Initial Catalog=AdventureWorks;MultipleActiveResultSets=True";
     }
 }

Hope this puts you on the path to understanding. There are many different philosophies on the topic of round trips, and a lot depends on the type of application you are writing and the data store you are connecting to. If this is an intranet project and there are not many concurrent users, then a large number of database calls is not much of a problem, beyond what it says about the tidiness of your code! (Grin.) If this is a web application, it is a different story, and you should try not to go back to the database too often, if you can avoid it at all. MARS is a good answer for this problem, since everything comes back from the server in one shot and you can then iterate over the returned data. I hope this is useful to you!

+1

The answer depends on your task.

1. If you want to generate a list/report, you need Method 2 with its redundant data. You transfer more data over the network, but save the time of assembling the content.

2. If you want to display the general list first and show the details only on the user's request, Method 1 is better. Generating and sending a limited set of data will be very fast.

3. If you want to preload all the data into the application, you can use XML. It delivers ALL the data without redundancy. However, it requires additional programming: XML encoding in SQL and decoding on the client.

I would do something like this to build the XML on the SQL side:

 ;WITH t AS
 (
     SELECT g.DocumentID, g.Owner, i.Item, i.Quantity
     FROM GeneralDetails AS g
     INNER JOIN ItemDetails AS i ON g.DocumentID = i.DocumentID
 )
 SELECT 1 AS Tag, NULL AS Parent,
        DocumentID AS [Document!1!DocumentID],
        Owner      AS [Document!1!Owner],
        NULL       AS [ItemDetails!2!Item],
        NULL       AS [ItemDetails!2!Quantity]
 FROM t
 GROUP BY DocumentID, Owner
 UNION ALL
 SELECT 2 AS Tag, 1 AS Parent, DocumentID, Owner, Item, Quantity
 FROM t
 ORDER BY [Document!1!DocumentID], [Document!1!Owner],
          [ItemDetails!2!Item], [ItemDetails!2!Quantity]
 FOR XML EXPLICIT;
+1

As far as I can see, you have several options:

  • Flatten the rows so that all of a document's items appear without the redundant data, i.e. "Marbles, Cards".
  • Return your query results as compressed XML that your program can parse as if it came from the database.
    • This gives you the benefit of a single trip, but you also get all the data in one payload, which can be massive.
  • This option would be my preference: implement a form of lazy loading (see the sketch below).
    • This means the "extra" data is only loaded when necessary. So although it makes several trips, each trip fetches only the data that is actually required.
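A minimal C# sketch of that lazy-loading option, assuming the question's tables and a plain ADO.NET/SQL Server connection (the DocumentBrowser class and its cache are illustrative, not from the original post):

 using System.Collections.Generic;
 using System.Data.SqlClient;

 class ItemDetail { public string Item; public int Quantity; }

 class DocumentBrowser
 {
     private readonly string connectionString;
     // Cache so each document's items are fetched at most once.
     private readonly Dictionary<int, List<ItemDetail>> cache =
         new Dictionary<int, List<ItemDetail>>();

     public DocumentBrowser(string connectionString)
     {
         this.connectionString = connectionString;
     }

     // Called only when the user actually selects/expands a document.
     public List<ItemDetail> GetItemsFor(int documentId)
     {
         List<ItemDetail> items;
         if (cache.TryGetValue(documentId, out items))
             return items; // already loaded: no extra trip

         items = new List<ItemDetail>();
         using (var conn = new SqlConnection(connectionString))
         using (var cmd = new SqlCommand(
             "SELECT Item, Quantity FROM ItemDetails WHERE DocumentID = @id", conn))
         {
             cmd.Parameters.AddWithValue("@id", documentId);
             conn.Open();
             using (var reader = cmd.ExecuteReader())
                 while (reader.Read())
                     items.Add(new ItemDetail
                     {
                         Item = (string)reader["Item"],
                         Quantity = (int)reader["Quantity"]
                     });
         }
         cache[documentId] = items;
         return items;
     }
 }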
+1

Somehow, in my application with ~200 forms/screens and a database with ~300 tables, I have never needed either the first or the second method.

In my application, quite often the user sees two grids (tables) on the screen, next to each other:

  • the main GeneralDetails grid with a list of documents (usually there is a search function that restricts the results with a variety of filters).

  • a detail grid with rows from the ItemDetails table for the selected document. Not for all documents, only for the one current document. When the user selects another document in the first grid, I (re)run a query to retrieve the details of that one selected document.

Thus, there is no join between the master and detail tables, and no loop to fetch the details for every master document.

Why do you need details for all documents on the client?

I would say that best practices come down to common sense:

It is always good to transmit over the network only the data you need, without redundancy. And it is always good to keep the number of queries/requests as low as possible. Instead of sending multiple queries in a loop, send a single query that returns all the necessary rows, and then slice it up on the client if you really have to.


If you need to somehow process a batch of documents together with their details, that is a different story, and so far I have always managed to do it on the server side, without transferring all that data to the client.

If for some reason you do need to bring the list of all master documents, along with the details for all of them, to the client, I would issue two queries without any loops:

 SELECT ... FROM GeneralDetails
 SELECT ... FROM ItemDetails

These two queries return two arrays of rows; if necessary, I would then combine the master and detail data into in-memory structures on the client.
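A hedged C# sketch of that client-side combine, assuming the question's schema (the Document and ItemDetail shapes are illustrative):

 using System.Collections.Generic;
 using System.Data.SqlClient;

 class ItemDetail { public string Item; public int Quantity; }

 class Document
 {
     public int DocumentID;
     public string Owner;
     public List<ItemDetail> Items = new List<ItemDetail>();
 }

 static Dictionary<int, Document> LoadAll(string connectionString)
 {
     var docs = new Dictionary<int, Document>();
     using (var conn = new SqlConnection(connectionString))
     {
         conn.Open();

         // Query 1: master rows, no redundancy.
         using (var cmd = new SqlCommand("SELECT DocumentID, Owner FROM GeneralDetails", conn))
         using (var r = cmd.ExecuteReader())
             while (r.Read())
                 docs[(int)r["DocumentID"]] = new Document
                 {
                     DocumentID = (int)r["DocumentID"],
                     Owner = (string)r["Owner"]
                 };

         // Query 2: detail rows, matched to their masters in memory by key.
         using (var cmd = new SqlCommand("SELECT DocumentID, Item, Quantity FROM ItemDetails", conn))
         using (var r = cmd.ExecuteReader())
             while (r.Read())
                 docs[(int)r["DocumentID"]].Items.Add(new ItemDetail
                 {
                     Item = (string)r["Item"],
                     Quantity = (int)r["Quantity"]
                 });
     }
     return docs;
 }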

0

You can also optimize this by retrieving the data from the two tables separately. After that, you can either loop through the records or join them in memory to build the same result set the SQL server would.

With an ORM, you can retrieve the objects separately in two round trips: one to retrieve GeneralDetails, and another to retrieve the ItemDetails matching the loaded GeneralDetails.DocumentId values. Although that is two round trips to the database, they are leaner than either of the other two methods.

Here is an example of NHibernate:

 void XXX()
 {
     var queryGeneral = uow.Session.QueryOver<GeneralDetails>();
     var theDate = DateTime.Now.AddDays(-5);
     // Whatever other criteria apply.
     queryGeneral.AndRestrictionOn(c => c.SubmittedOn).IsBetween(theDate).And(theDate.AddDays(3));
     var generalDetails = queryGeneral.List();

     var neededDocIds = generalDetails.Select(gd => gd.DocumentId).Distinct().ToArray();

     var queryItems = uow.Session.QueryOver<ItemDetails>();
     queryItems.AndRestrictionOn(i => i.DocumentId).IsIn(neededDocIds);
     var itemDetails = queryItems.List();

     // The records from both tables are now in the generalDetails and
     // itemDetails lists, so you can manipulate them in memory...
 }

I believe (I have no live example at hand) that with an ADO.NET DataSet you can actually save the second round trip to the database. You don't even have to join the results; that is a matter of coding style and workflow, but as a rule you can update your user interface while working with the two result sets:

 void YYY()
 {
     var sql = "SELECT * FROM GeneralDetails WHERE DateCreated BETWEEN '2015-06-01' AND '2015-06-20';";
     sql += @"
         WITH cte AS
         (
             SELECT DocumentId FROM GeneralDetails
             WHERE DateCreated BETWEEN '2015-06-01' AND '2015-06-20'
         )
         SELECT * FROM ItemDetails
         INNER JOIN cte ON ItemDetails.DocumentId = cte.DocumentId";

     var ds = new DataSet();
     using (var conn = new SqlConnection("a conn string"))
     using (var da = new SqlDataAdapter())
     {
         conn.Open();
         da.SelectCommand = conn.CreateCommand();
         da.SelectCommand.CommandText = sql;
         da.Fill(ds);
     }
     // Now the two tables are in the dataset, so you can loop through them
     // and do your stuff...
 }
  • Note: I wrote the code above only as an example, and it has not been tested!
0

Since asking this question, I have realized there are other areas of the application I can optimize when retrieving data. In such cases I do the following:

  • Ask myself whether I really need to fetch many documents together with their sub-items at once. Usually the UI shows entries in a list, and only when the user needs the sub-items (clicks on an entry) do I fetch them.

  • If it really is necessary to display many records with sub-items, for example posts/comments, I show only some of them: think pagination, or a "load more" function.

To summarize, I end up with lazy loading: getting data only when the user needs it.

A way to avoid multiple trips to the database server (although it does not guarantee better performance, since it requires more processing both on the database server and in the application) is to retrieve several record sets in one call: one result set for the parent documents and one for the sub-items. See the pseudocode:

  recordSets = GetData("select * from parentDocs where [condition]; " +
                       "select * from subItems where [condition]")
  //join the parent documents and subitems here

I may need a temporary table for the parent documents, so that I can reuse it in the condition of the second query, since I only need to retrieve the sub-items of the selected parent documents. A sketch of this multiple-result-set approach follows.
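A minimal C#/ADO.NET sketch of that batched approach, assuming SQL Server and the question's tables (the date filter and temp table are illustrative):

 using System;
 using System.Data.SqlClient;

 static void LoadBatch(string connectionString)
 {
     // One batch, one round trip: the temp table keeps the two SELECTs consistent.
     const string sql = @"
         SELECT DocumentID, Owner
         INTO #docs
         FROM GeneralDetails
         WHERE DateCreated >= @since;

         SELECT DocumentID, Owner FROM #docs;

         SELECT i.DocumentID, i.Item, i.Quantity
         FROM ItemDetails AS i
         INNER JOIN #docs AS d ON d.DocumentID = i.DocumentID;";

     using (var conn = new SqlConnection(connectionString))
     using (var cmd = new SqlCommand(sql, conn))
     {
         cmd.Parameters.AddWithValue("@since", new DateTime(2015, 6, 1));
         conn.Open();
         using (var reader = cmd.ExecuteReader())
         {
             // First result set: the parent documents.
             while (reader.Read())
                 Console.WriteLine("{0}: {1}", reader["DocumentID"], reader["Owner"]);

             // Second result set: the sub-items, reached with NextResult().
             reader.NextResult();
             while (reader.Read())
                 Console.WriteLine("  {0} x {1}", reader["Item"], reader["Quantity"]);
         }
     }
 }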

I should also point out that testing beats blindly applying principles right away, since it really does depend on each case.

0

Source: https://habr.com/ru/post/895302/

