Re: using dataset variables and formatting without observations

WendyT · Posted 08-03-2009 11:18 AM

Good morning!

I'm working on a series of programs that download data from an Oracle database. The first step downloads data for a particular project to the sas file 'PROJECT' based on a where condition.

There are some projects that have special additional calculation(s) that need to be done and appended back to the 'PROJECT' file.

My plan is to use PROJECT to generate the special data, concatenate the special dataset(s) to one dataset named SPECIAL, then use PROC DATASETS and APPEND to add SPECIAL back to the PROJECT file.

I want to use the attributes of the PROJECT file, so all the variable types and sizes match for the append.

A SET statement with (OBS=0) appears to work correctly.

Is there a better or more appropriate way to do this, and where might I find some documentation?

Thanks for your suggestions.

Wendy

*make PROJECT dataset from Oracle database ;
PROC SQL NOPRINT ;
CREATE PROJECT
AS SELECT … FROM …. WHERE… ;

*make special datasets from PROJECT or other sources ;

PROC SQL NOPRINT ;
CREATE SPECIAL1
AS SELECT … FROM PROJECT... WHERE … ;
QUIT ;

DATA SPECIAL2 ; SET PROJECT ;
Gyrate …gyrate… etc. ;
RUN ;

DATA SPECIAL3 ; SET SOMETHINGELSE ;
etc... ;
RUN ;

*concatenate special datasets using the PROJECT dataset as a model ;

DATA SPECIAL ;
SET PROJECT(OBS=0) SPECIAL1 SPECIAL2 SPECIAL3 ;
RUN ;

*add special datasets back to PROJECT ;

PROC DATASETS ;
APPEND BASE=PROJECT DATA=SPECIAL ;
RUN ;

sbb · Posted 08-03-2009 01:34 PM

With the SET you have constructed, a SAS ERROR condition is generated when a given SAS variable has a NUMERIC / CHARACTER type conflict in two or more datasets listed on the SET statement.

And, do consider that SAS will simply truncate a variable's length when the "type" conflict is not present but a "shorter length" with a WARNING message, that sometimes goes unnoticed.

My general approach (and recommendation) is to define either a AUTOCALL macro or INCLUDE member with specific LENGTH, FORMAT and LABEL definitions, however I do not typically have external DBMS data feeds to consider.

Scott Barry
SBBWorks, Inc.

ChrisNZ · Posted 08-03-2009 06:42 PM

While I see nothing odd in you post, I would agree with Scott: define the lengths (and types!) yourself, rather than let sas do it and then try to fix them.

It is difficult to add more comments without knowing more about your constraints.

Replacing the last 2 steps with one might be faster if SPECIAL is larger than PROJECT:

DATA PROJECT;
SET PROJECT SPECIAL1 SPECIAL2 SPECIAL3 ;
RUN ;

High-Performance SAS Coding - Third Edition

ballardw · Posted 08-04-2009 06:55 PM

You may find the information in Dictionary.columns useful. Look at the items related to your PROJECTS dataset.

WendyT · Posted 08-05-2009 03:03 PM

The Oracle database I am using is for Water Quality at my agency, and the PROJECT datasets coming down from the Oracle database tend to be large.

My original code did SET all of the datasets, but I realized that I would need a more general solution to implement across other projects.

Everyone will have some sort of data to add through SPECIAL, but the contents of SPECIAL will differ between projects. For example, most projects will want a Total Nitrogen calculation, my project has parameters of light transmission in water that live in a different database, and others might have something like pesticide data.

SPECIAL should always be much smaller than PROJECT. For example, this morning's run for my project had 1.1 million rows in PROJECT, and 50K rows in SPECIAL.

In addition, I need the final result of PROJECT+SPECIAL to be sorted. I tested run and cpu time of SETting the datasets followed by PROC SORT, PROC SQL using OUTER UNION CORR and ORDER BY, and to my surprise, APPEND followed by PROC SORT was the clear winner.

Thanks also for the caution to watch my variables and lengths carefully. For most of the cases, I am using PROC SQL, and defining the value of a couple of fields, like

'C' AS REMARK,
'TN_CALC' AS ANALYTE,

and the text fields in Oracle are of length 20 and 240, respectively. I definitely could specify the LENGTH explicitly like this, but was hoping to not have to do that.

'C' AS REMARK LENGTH=20,
'TN_CALC' AS ANALYTE LENGTH=240,

I'm not familiar with AUTOCALL, so will have to do some reading.

Thanks so much for the suggestions!

Wendy Message was edited by: WendyT

using dataset variables and formatting without observations