Which runs faster in SAS PROC SUMMARY: more peer groups with fewer values, or fewer peer groups with more values?

I need a theoretical answer.

Imagine you have a table with 1.5 billion rows (the table is column-organized in DB2 BLU).

You use SAS, and you want to compute statistics with PROC SUMMARY, such as min, max, mean, standard deviation, and the 10th and 90th percentiles, over your peer groups.

For example, you have 30,000 peer groups with 50,000 values in each peer group (1.5 billion values in total).

Alternatively, you have 30 million peer groups with 50 values in each peer group. Again, that gives you 1.5 billion values.

Will it run faster with fewer peer groups but more values in each peer group, or with a large number of peer groups but fewer values in each?

I was able to test the first case (30,000 peer groups and 50,000 values per peer group), and it took about 16 minutes. But I cannot test the second case.

Can you give a rough runtime forecast for the case of 30 million peer groups with 50 values in each?

One more dimension to the question: would it be faster to compute these statistics with PROC SQL instead?

Sample code below:

 proc summary data=table_blu missing chartype;
   class var1 var2; /* var1 and var2 together define the peer group */
   var values;
   output out=stattable(rename=(_type_=type) drop=_freq_)
     n=n min=min max=max mean=mean std=std
     q1=q1 q3=q3 p10=p10 p90=p90 p95=p95;
 run;
2 answers

So, there are a few things to think about here.

The first point, and probably the biggest in terms of performance, is getting the data from DB2 to SAS. (I am assuming this is not an in-database instance of SAS; correct me if it is.) That is a large table, and moving it across the wire takes time. Because of that, if you can compute all these statistics inside DB2 with an SQL statement, that will likely be your fastest option.
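
As a minimal sketch of that idea, explicit SQL pass-through can push the basic aggregations down to DB2 so that only the summary rows cross the wire. The connection options (database=, user=, password=) are placeholders, and percentile aggregates depend on your DB2 version (recent releases offer PERCENTILE_CONT/PERCENTILE_DISC):

 proc sql;
   connect to db2 (database=mydb user=myuser password=mypw); /* placeholders */
   create table stattable as
   select * from connection to db2 (
     select var1,
            var2,
            count(*)         as n,
            min("VALUES")    as min,  /* quoted: VALUES is an SQL keyword */
            max("VALUES")    as max,
            avg("VALUES")    as mean,
            stddev("VALUES") as std   /* DB2 STDDEV is the population form; SAS STD defaults to the sample form */
     from table_blu
     group by var1, var2
   );
   disconnect from db2;
 quit;

This way only the summary rows come back to SAS instead of 1.5 billion detail rows.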

Assuming you do move the table over to the SAS server:

A table sorted by the CLASS variables will be much faster to process than an unsorted table. If SAS knows the table is sorted, it does not have to scan the table for records that belong to a group; it can do a block read instead of random I/O.
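
One way to exploit that, sketched here with the question's variable names: sort once, then use a BY statement instead of CLASS. Note that BY produces only the full var1*var2 grouping, whereas CLASS without NWAY also produces the marginal _TYPE_ levels:

 proc sort data=table_blu out=table_sorted;
   by var1 var2;
 run;

 proc summary data=table_sorted missing;
   by var1 var2; /* sequential block reads; one BY group in memory at a time */
   var values;
   output out=stattable(drop=_type_ _freq_)
     n=n min=min max=max mean=mean std=std
     q1=q1 q3=q3 p10=p10 p90=p90 p95=p95;
 run;

Bear in mind that a one-time sort of 1.5 billion rows is itself expensive; it pays off mainly when the data already arrives sorted or will be summarized repeatedly.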

If the table is not sorted, then the larger the number of groups, the more table scanning has to happen.

The point is that the speed of getting data from disk to the CPU will be of paramount importance for the unsorted case.

From there, you run into memory and CPU issues. PROC SUMMARY is multi-threaded, and SAS will process N groups at a time. If each group can fit into the memory allocated to its thread, you will not have a problem. If a group is too big, SAS will have to page.
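
If memory becomes the constraint, two knobs are worth knowing about; the values below are illustrative, not tuned recommendations. The SUMSIZE= system option caps the memory available to summarization procedures, and QMETHOD=P2 estimates the quantiles (Q1, Q3, P10, P90, P95) in a fixed amount of memory per group instead of holding full order statistics, at the cost of approximate results:

 options threads sumsize=2G; /* 2G is illustrative, not a recommendation */

 proc summary data=table_blu missing chartype qmethod=p2;
   class var1 var2;
   var values;
   output out=stattable(rename=(_type_=type) drop=_freq_)
     n=n min=min max=max mean=mean std=std
     q1=q1 q3=q3 p10=p10 p90=p90 p95=p95;
 run;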

I scaled the problem down to a 15M-row example:

 %let grps=3000;
 %let pergrp=5000;
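
The data step that built the test table was not shown; a minimal reconstruction consistent with the logs below (WORK.TEST, 15,000,000 observations; the variable names are my assumptions) could be:

 data test;
   do grp = 1 to &grps;   /* number of peer groups */
     do i = 1 to &pergrp; /* values per peer group */
       values = ranuni(1);
       output;
     end;
   end;
   drop i;
 run;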

Unsorted:

 NOTE: There were 15000000 observations read from the data set WORK.TEST.
 NOTE: The data set WORK.SUMMARY has 3001 observations and 9 variables.
 NOTE: PROCEDURE SUMMARY used (Total process time):
       real time           20.88 seconds
       cpu time            31.71 seconds

SORTED:

 NOTE: There were 15000000 observations read from the data set WORK.TEST.
 NOTE: The data set WORK.SUMMARY has 3001 observations and 9 variables.
 NOTE: PROCEDURE SUMMARY used (Total process time):
       real time           5.44 seconds
       cpu time            11.26 seconds

===============================

 %let grps=300000;
 %let pergrp=50;

Unsorted:

 NOTE: There were 15000000 observations read from the data set WORK.TEST.
 NOTE: The data set WORK.SUMMARY has 300001 observations and 9 variables.
 NOTE: PROCEDURE SUMMARY used (Total process time):
       real time           19.26 seconds
       cpu time            41.35 seconds

SORTED:

 NOTE: There were 15000000 observations read from the data set WORK.TEST.
 NOTE: The data set WORK.SUMMARY has 300001 observations and 9 variables.
 NOTE: PROCEDURE SUMMARY used (Total process time):
       real time           5.43 seconds
       cpu time            10.09 seconds

I ran these several times and the run times were similar. The sorted times are roughly equal across the two cases, and faster.

The more-groups/fewer-per-group case was faster unsorted in real time, but look at the total CPU time: it is higher. My laptop has a very fast SSD, so I/O was probably not the limiting factor; the disk was able to keep multiple cores fed. On a system with a slower disk, the total run times could differ.

In the end, the answer depends heavily on the structure of your data and on the specifics of your server and database.


Not a theoretical answer, but still relevant IMO ...

To speed up PROC SUMMARY on large tables, add the GROUPINTERNAL option to the CLASS statement. That is, provided you do not want formats applied to the variables before grouping.

e.g.:

 class age / groupinternal; 

This tells SAS that it does not need to apply a format to each value before working out which class group the value belongs to. By default, every value has a format applied to it, even if you did not explicitly specify one. That hardly matters on small tables, but on large ones it can.

In this simple test, it cuts the run time from 60 seconds on my machine down to 40 seconds (YMMV):

 data test;
   set sashelp.class;
   do i = 1 to 10000000;
     output;
   end;
 run;

 proc summary data=test noprint nway missing;
   class age / groupinternal;
   var height;
   output out=smry mean=;
 run;

Source: https://habr.com/ru/post/1266965/

