Writing efficient queries in SAS using PROC SQL with Teradata

EDIT: Here is a more complete set of code that shows exactly what is happening, per the answer below.

libname output '/data/files/jeff';

%let DateStart = '01Jan2013'd;
%let DateEnd = '01Jun2013'd;

proc sql;
    CREATE TABLE output.id AS (
        SELECT DISTINCT id
        FROM mydb.sale_volume AS sv
        WHERE sv.category IN ('a', 'b', 'c')
            AND sv.trans_date BETWEEN &DateStart AND &DateEnd
    );

    CREATE TABLE output.sums AS (
        SELECT id, SUM(sales) AS total
        FROM mydb.sale_volume AS sv
        INNER JOIN output.id AS ids
            ON ids.id = sv.id
        WHERE sv.trans_date BETWEEN &DateStart AND &DateEnd
        GROUP BY id
    );
quit;

The goal is simply to query the table for the ids that fall into certain categories, then summarize the activity of those ids across all categories.

The above approach is much slower than:

  • Running the first query to get the subset of ids
  • Running a second query that sums the amounts for every id
  • Running a third query that inner joins the two result sets

If I understand correctly, it can be more efficient to make sure all of the work is passed through to the database, rather than cross-loading data between SAS and Teradata.
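For reference, here is a sketch of that three-step variant, reusing the table names and macro variables from the code above (an illustration of the description, not a tested program):

    proc sql;
        /* Step 1: the subset of ids flagged by category */
        CREATE TABLE output.id AS (
            SELECT DISTINCT id
            FROM mydb.sale_volume
            WHERE category IN ('a', 'b', 'c')
                AND trans_date BETWEEN &DateStart AND &DateEnd
        );

        /* Step 2: totals for every id, regardless of category */
        CREATE TABLE output.all_sums AS (
            SELECT id, SUM(sales) AS total
            FROM mydb.sale_volume
            WHERE trans_date BETWEEN &DateStart AND &DateEnd
            GROUP BY id
        );

        /* Step 3: inner join the two result sets */
        CREATE TABLE output.sums AS (
            SELECT a.id, a.total
            FROM output.all_sums AS a
            INNER JOIN output.id AS ids
                ON ids.id = a.id
        );
    quit;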


After I posted a question yesterday, a member suggested I ask a separate, more specific question about the performance issue in my situation.

I use SAS Enterprise Guide to write programs and data queries. I do not have permission to change the underlying data, which is stored in Teradata.

My main problem is writing efficient SQL queries in this environment. For example, I query a large table (with tens of millions of records) for a small subset of ids. Then I use that subset to query the larger table again:

proc sql;
    CREATE TABLE subset AS (
        SELECT id
        FROM bigTable
        WHERE someValue = x
            AND date BETWEEN a AND b
    );
quit;

This runs in seconds and returns 90k ids. Then I want to use this set of ids to query the large table again, and that is where the problems start. I want to summarize the values over time for these ids:

proc sql;
    CREATE TABLE subset_data AS (
        SELECT bigTable.id, SUM(bigTable.value) AS total
        FROM bigTable
        INNER JOIN subset
            ON subset.id = bigTable.id
        WHERE bigTable.date BETWEEN a AND b
        GROUP BY bigTable.id
    );
quit;

For some reason this takes a very long time. The difference is that the first query filters on "someValue", while the second considers all activity, regardless of what is in "someValue". For example, I might flag every customer who orders a pizza, and then look at every purchase made by all the customers who ordered pizza.

I am not overly familiar with SAS, so I'm looking for any advice on how to do this more efficiently or speed up the process. I am open to any thoughts or suggestions, and please let me know if I can provide more detail. I'm just surprised that the second query takes so long to process.

+6
5 answers

The most important thing to understand when using SAS to access data in Teradata (or any other external database, for that matter) is that the SAS software prepares SQL and sends it to the database. The idea is to try to free you (the user) from all the database details. SAS does this using a concept called "implicit pass-through", which just means that SAS translates SAS code into DBMS code. Among the many things that happen is data type conversion: SAS has only two (and only two) data types, numeric and character.
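As a minimal sketch of what implicit pass-through looks like from the user's side (the server name, credentials, and table names here are placeholders, not from the question): you assign a LIBNAME to Teradata and write ordinary PROC SQL against it, and SAS translates what it can into Teradata SQL behind the scenes.

    libname td teradata server=tdprod user=userid password=password database=mydb;

    proc sql;
        /* Looks like ordinary SAS SQL; SAS translates it and ships
           as much of the work as it can to Teradata */
        create table work.totals as
        select id, sum(sales) as total
        from td.sale_volume
        where trans_date between '01Jan2013'd and '01Jun2013'd
        group by id;
    quit;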

SAS handles the translation for you, but it can be confusing. For example, I've seen lazy database table designs with VARCHAR(400) columns whose values never come close to that length (a column for a person's name, say). In the database this is not a big deal, but since SAS does not have a VARCHAR data type, it creates a 400-character variable for that column in every row. Even with dataset compression, this can make the resulting SAS dataset unnecessarily large.
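One hedged workaround (the 50-character length and the td.customers name are assumptions for illustration): declare a shorter length before reading the wide column, so the SAS copy is not padded out to 400 characters.

    data work.customers_slim;
        length customer_name $50; /* declare the shorter length first     */
        set td.customers;         /* values longer than 50 are truncated  */
    run;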

The alternative is "explicit pass-through", where you write your own queries using the actual syntax of the DBMS in question. These queries execute entirely in the DBMS and return the results back to SAS (which still performs the data type conversion for you). For example, here is an explicit pass-through query that joins two tables and creates a SAS dataset as the result:

proc sql;
    connect to teradata (user=userid password=password mode=teradata);
    create table mydata as
    select * from connection to teradata (
        select a.customer_id
             , a.customer_name
             , b.last_payment_date
             , b.last_payment_amt
        from base.customers a
        join base.invoices b
            on a.customer_id = b.customer_id
        where b.bill_month = date '2013-07-01'
          and b.paid_flag = 'N'
    );
quit;

Note that everything inside a pair of parentheses is native Teradata SQL and that the join operation itself is performed inside the database.

The sample code you provided in your question is NOT a complete SAS/Teradata program. To help you better, you would need to show the real program, including any LIBNAME references. For example, suppose your real program looks like this:

proc sql;
    CREATE TABLE subset_data AS
    SELECT bigTable.id, SUM(bigTable.value) AS total
    FROM TDATA.bigTable bigTable
    JOIN TDATA.subset subset
        ON subset.id = bigTable.id
    WHERE bigTable.date BETWEEN a AND b
    GROUP BY bigTable.id;

This would imply a previously assigned LIBNAME statement through which SAS connects to Teradata. The syntax of that WHERE clause matters a great deal to whether SAS can pass the complete query to Teradata at all. (You do not show, for example, what "a" and "b" refer to.) It is entirely possible that the only way SAS can perform the join is to drag both tables back into a local work session and do the join on your SAS server.
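A standard way to check what SAS is actually shipping to the database (a generic SAS/ACCESS trace option, not something specific to this question) is to turn on SASTRACE before running the query and read the generated SQL in the log:

    /* Write every piece of SQL that SAS sends to the DBMS to the SAS log */
    options sastrace=',,,d' sastraceloc=saslog nostsuffix;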

My other suggestion would be to try to convince your Teradata administrators to let you create "driver" tables in some utility database. The idea is to create a relatively small table inside Teradata containing the ids you want to extract, then use that table to perform explicit joins. I'm sure this would require a bit more formal database training (how to define a proper index and how to "collect statistics", for example), but with that knowledge and ability your work will simply fly.
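Here is a sketch of the driver-table idea, assuming your administrators grant you a utility database (called scratch here; the names and the COLLECT STATISTICS syntax are illustrative):

    libname scratch teradata server=tdprod user=userid password=password database=scratch;

    proc sql;
        /* Upload the small list of ids into Teradata */
        create table scratch.ids as
        select distinct id from work.subset;

        /* Then the join runs entirely inside Teradata via explicit pass-through */
        connect to teradata (user=userid password=password mode=teradata);
        execute (collect statistics on scratch.ids column (id)) by teradata;
        create table work.subset_data as
        select * from connection to teradata (
            select b.id, sum(b.value) as total
            from mydb.bigTable b
            join scratch.ids i
                on i.id = b.id
            group by b.id
        );
    quit;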

I could go on and on, but I'll stop here. I use SAS with Teradata every day, against what I'm told is one of the largest Teradata environments on the planet. I enjoy programming in both.

+8

You are making the assumption that the 90k records in your first query are unique ids. Is that certain?

I ask because the implication of your second query is that they are not unique:
- one id can have multiple rows over time, with different somevalues

If id is not unique in the first dataset, you need a GROUP BY id or a DISTINCT in that first query.
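For example, the first query from the question with duplicates removed (same placeholder names):

    proc sql;
        CREATE TABLE subset AS (
            SELECT DISTINCT id
            FROM bigTable
            WHERE someValue = x
                AND date BETWEEN a AND b
        );
    quit;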

Imagine that the 90k rows consist of 30k unique ids, an average of 3 rows per id.

Then imagine those 30k unique ids actually average 9 rows each in your time window once you include the rows where somevalue <> x.

The join will then produce 3 x 9 = 27 rows per id, about 810k rows in total.

And as these two numbers grow, the number of records in your second query grows geometrically.


Alternative query

If that isn't the issue, an alternative query (not necessarily better, but an alternative) would be:

SELECT bigTable.id, SUM(bigTable.value) AS total
FROM bigTable
WHERE bigTable.date BETWEEN a AND b
GROUP BY bigTable.id
HAVING MAX(CASE WHEN bigTable.somevalue = x THEN 1 ELSE 0 END) = 1
+1

If the id is unique, you can try creating a format.

Create a dataset that looks like this:

fmtname, start, label

where fmtname is the same for all records and is a legal format name (starts with a letter or underscore, contains only alphanumerics or underscores, and does not end in a number); start is the id value; and label is 1. Then add one more row with the same fmtname, a blank start, a label of 0, and an additional variable hlo='o' (for "other"). Import it into PROC FORMAT with the CNTLIN option, and you now have a 1/0 lookup for id membership.

Here is a quick example using SASHELP.CLASS. The id here is name, but it could be numeric or character, whichever suits your use.

data for_fmt;
    set sashelp.class;
    retain fmtname '$IDF'; * Format name is up to you. Needs a $ if the id is character, no $ if numeric;
    start = name;          * This would be your id variable - the lookup value;
    label = '1';
    output;
    if _n_ = 1 then do;    * One extra record to map everything else to 0;
        hlo = 'o';
        call missing(start);
        label = '0';
        output;
    end;
run;

proc format cntlin=for_fmt;
quit;

Now, instead of doing a join, you can run your query "normally" but with an additional where clause: and put(id, $IDF.) = '1'. This won't be optimized with an index or anything, but it may be faster than the join. (It may also not be faster; it depends on how the SQL optimizer handles it.)
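Put together, the second query from the question might look like this, with the format standing in for the join (a sketch using the question's placeholder names):

    proc sql;
        create table subset_data as
        select id, sum(value) as total
        from bigTable
        where date between a and b
          and put(id, $IDF.) = '1'
        group by id;
    quit;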

+1

If the id is unique, you can add a UNIQUE PRIMARY INDEX(id) to that table; otherwise it defaults to a Non-Unique PI. Knowing about the uniqueness helps the optimizer produce a better plan.
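For example, declared when the table is created, here via explicit pass-through from SAS (the INTEGER column type is an assumption):

    proc sql;
        connect to teradata (user=userid password=password mode=teradata);
        execute (
            create table mydb.subset
            ( id integer not null )
            unique primary index (id)
        ) by teradata;
    quit;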

Without additional information, such as an Explain output (just put EXPLAIN in front of the SELECT), it is hard to tell how this could be improved.
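A sketch of one way to get that from SAS: EXPLAIN returns its plan as a result set, so a pass-through query can fetch it (table names are the question's placeholders):

    proc sql;
        connect to teradata (user=userid password=password mode=teradata);
        select * from connection to teradata (
            explain
            select b.id, sum(b.value)
            from mydb.bigTable b
            join mydb.subset s
                on s.id = b.id
            group by b.id
        );
    quit;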

+1

One alternative is to use SAS procedures instead of SQL. I don't know what your actual SQL is doing, but if you are just doing frequencies (or anything else that can be done in a procedure), you could do:

proc sql;
    create view blah as
    select ... (your join);
quit;

proc freq data=blah;
    tables id / out=summary(rename=(count=total) keep=id count);
run;

Any number of other procedures are options as well (PROC MEANS, PROC TABULATE, etc.). This can be faster than doing the sum in SQL, depending on details such as how your data is organized, what you are actually doing, and how much memory you have. It has the added benefit that SAS may be able to run it in-database if the view is created in the database, which could be faster. (In fact, it might be even faster still to run the frequency against the base table and then join the results to the smaller table afterwards.)

0

Source: https://habr.com/ru/post/949138/

