MySQL query help: how to process data on most of the last rows per day from a large dataset

I have a somewhat complicated set of tables for which I need to do some construction / optimization of SQL queries. Currently, most of the logic that is used to get the results we need is performed at the application level, which leads to terrible performance due to complete table crawls, etc. SQL is not my strong suit, so I thought I would go so that the SO crowd could see if anyone could reach out.

Infrastructure History:

  • DB is MySQL5
  • We access this data through Hibernate using Java
  • The contents of most of these tables are relatively static, with the exception of the Salesperson-hourly-performance table, which contains a row for each hour of every day when the seller is active (for example, made or received a call) with the seller’s current performance account for the whole day. Given the number of sellers in the companies in question, this table can increase by 20K + rows per day.

Data objects

I created a simplified version of the table setup that includes the relevant data. The "real" tables contain about 20 companies, 300 branches, 20 thousand sellers and millions of sales performance reports.

CREATE TABLE `so_test`.`company` ( `id` int(10) unsigned NOT NULL auto_increment, `name` varchar(45) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=latin1; INSERT INTO company VALUES (7, 'CompanyXX'); CREATE TABLE `so_test`.`division` ( `id` int(10) unsigned NOT NULL auto_increment, `name` varchar(45) NOT NULL, `campanyId` int(10) unsigned NOT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=18 DEFAULT CHARSET=latin1; INSERT INTO division VALUES (17, 'APAC #1'); CREATE TABLE `so_test`.`salesperson` ( `id` int(10) unsigned NOT NULL auto_increment, `divisionId` int(10) unsigned NOT NULL, `name` varchar(45) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=213860 DEFAULT CHARSET=latin1; INSERT INTO salesperson VALUES (213859, 'bob jones'); CREATE TABLE `so_test`.`salesperson_hourly_performance` ( `id` int(10) unsigned NOT NULL auto_increment, `timestamp` DATETIME NOT NULL, `salesPersonId` int(10) unsigned NOT NULL, `callsInBound` int(10) unsigned NOT NULL, `callsOutBound` int(10) unsigned NOT NULL, `issuedOrders` int(10) unsigned NOT NULL, `salesRevenue` decimal(10,4) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=552395 DEFAULT CHARSET=latin1; INSERT INTO salesperson_hourly_performance VALUES (552394, '2009-05-03 22:00:00', 213859, 15, 17, 14, 10798.0478), (551254, '2009-05-03 21:00:00', 213859, 14, 16, 13, 9802.3620), (551115, '2009-05-03 20:00:00', 213859, 13, 14, 12, 9183.8250), (550072, '2009-05-03 19:00:00', 213859, 11, 13, 11, 8490.8678), (549613, '2009-05-03 18:00:00', 213859, 10, 11, 9, 7230.1125), (549389, '2009-05-03 17:00:00', 213859, 9, 10, 8, 6486.2173), (548861, '2009-05-03 16:00:00', 213859, 7, 9, 7, 5537.8553), (548059, '2009-05-03 15:00:00', 213859, 6, 8, 6, 4663.8469), (547466, '2009-05-03 14:00:00', 213859, 5, 7, 5, 4082.6388), (546729, '2009-05-03 13:00:00', 213859, 4, 6, 4, 3057.7368), (546611, '2009-05-03 12:00:00', 213859, 3, 5, 2, 1751.6135), (545642, '2009-05-03 11:00:00', 213859, 2, 4, 2, 1751.6135), (545558, '2009-05-03 10:00:00', 213859, 1, 3, 0, 0.0000), (545072, '2009-05-03 09:00:00', 213859, 1, 2, 0, 0.0000), (565071, '2009-05-04 13:00:00', 213859, 19, 17, 6, 4200.1710), (575070, '2009-05-06 14:00:00', 213859, 0, 2, 1, 120.0000); 

Business Requirements:

  • Configure a set of web-based user-panel user interfaces that provide a separate performance overview for companies, departments, and individual vendors.
  • The user interface is much similar to one other, except for the data set: the company control panel combines all the data of all sellers in each of the component divisions and displays a row for each company, while the section toolbar for a particular company aggregates the data of each seller in this division and line per division.
  • User interfaces allow the user to select a date range for the report panel and sort by any of the columns. Displayed columns include:

    (Company | Department | Sales person) Name, Total issued orders, Total sales revenue, Total incoming calls, Total outgoing outgoing calls.

My problem / request for SO:

The “inherited” approach (which was shameful, but sort of sorted-extremely acceptable when the output was for the daily magazine) was to programmatically iterate through performance data for each of the relevant objects (for example, each salesperson in a department in a company ), find the "last" for each of the specified days in the specified date range and summarize the data. However, given the massive dataset and the need to present this data live in the user interface, I need guidance / examples on how to create effective SQL queries against this dataset, which will allow pagination and sorting.

  • If there is any kind soul, please show me a reasonable request that receives the sum of each column of seller performance data for a certain date range (bearing in mind that for each day, the row used for the amount is the last in date for that day, for this seller).

  • A query that performs query # 1 across the entire range of sellers (for example, all sellers in a given company) with support for pagination and ordering in a specific column?

Hope I have included enough details to clarify what I ask ... please let me know if you need more information.

Many thanks to SO SQL gods!

UPDATE:

Missing keys from salesPerson → division and from division → company were added. In addition, the fixed timestamp data type must be DATETIME instead of VARCHAR.

+4
source share
2 answers

I don’t know what the company and the division are joining ... but this:

 select c.name as company, d.name as division, s.name as salesperson, sum(h.callsinbound) as callsinboundsum, sum(h.callsoutbound) as callsoutboundsum, sum(h.issuedorders) as issuedorderssum, sum(h.revenue) as revenuesum from sales_history_performance h inner join (select th.salespersonid, date(th.timestamp) as my_date, max(th.timestamp) as max_time from sales_history_performance th inner join salesperson ts on th.salespersonid = ts.id where th.timestamp between '5/1/2009' and '5/3/2009' --inclusive in MySQL group by th.salespersonid, date(th.timestamp) ) t on h.salespersonid = t.salespersonid and h.timestamp = t.max_time inner join salesperson s on h.salespersonid = s.id inner join division d on s.divisionid = d.id inner join company c on d.companyid = c.id group by c.name, d.name, s.name order by 1,2,3 

You can edit the line and sp.name like '%' , which you commented out to add to it any filter that you need.

What this does is as follows: it goes out and builds a table of the top timestamp on a daily basis. If the identifier in sales_history_performance significantly larger for subsequent records, use this since you are unlikely to get duplicates. In any case, it joins the table, summing up all the columns with a metric, compared to the seller. You can withdraw the seller from an external request if you want to get a company number. Be that as it may, he will return all sellers.

Update . I added to the company and unit. This is a fairly general request. If you want to limit the division / company / seller, you can do this in the WHERE clause of the outer query, although you can get some benefit from doing this in the inner query - it's just a bit harder to maintain.

+1
source

Bearing in mind that for each day, the line used for the amount is the last by date for> that day, for this seller)

This information is difficult to digest. I was interested in what you say that the daily amount is stored in the salesperson_hourly_performance table, summing up daily diaries and hourly resumes in the same table.

In your example, there is no relation to the unit and the company. But to break down sales per person per day by a specific date range:

 select s.name,substring(timestamp,1,11) as day,sum(callsInBound),sum(callsOutBound),sum(issuedOrders),sum(salesRevenue) from salesperson_hourly_performance facts , salesperson s where facts.salesPersonId = s.id and timestamp >= "2009-05-03 00:00:00" and timestamp < "2009-05-07 00:00:00" group by s.name,day order by day asc; +-----------+-------------+-------------------+--------------------+-------------------+-------------------+ | name | day | sum(callsInBound) | sum(callsOutBound) | sum(issuedOrders) | sum(salesRevenue) | +-----------+-------------+-------------------+--------------------+-------------------+-------------------+ | bob jones | 2009-05-03 | 101 | 125 | 93 | 72836.7372 | | bob jones | 2009-05-04 | 19 | 17 | 6 | 4200.7100 | | bob jones | 2009-05-06 | 0 | 2 | 1 | 120.0000 | +-----------+-------------+-------------------+--------------------+-------------------+-------------------+ 

Saving the timestamp as the actual type of timestamp / datetime will give you easier flexibility regarding dates and times. There mysql functions to convert strings to datetimes, which probably can help your queries if it really should be a varchar column

Edit, I would really not mix the details in this table. Store one table per day, one table for several hours.

if you only need the string with the highest date on the day of use, for example.

 SELECT p.name, Substring(TIMESTAMP,1,11) AS DAY, Sum(callsinbound), Sum(callsoutbound), Sum(issuedorders), Sum(salesrevenue) FROM (SELECT sh.salespersonid, Substring(sh.TIMESTAMP,1,11) AS DAY, Max(TIMESTAMP) AS max_ts FROM salesperson_hourly_performance sh GROUP BY sh.salespersonid, DAY) t INNER JOIN salesperson_hourly_performance shp ON t.salespersonid = shp.salespersonid AND t.max_ts = shp.TIMESTAMP INNER JOIN salesperson p ON shp.salespersonid = p.id GROUP BY p.name, DAY; 

Add where sentences where you need, for example. according to the first request

0
source

Source: https://habr.com/ru/post/1286157/


All Articles