Solved: Re: Loading and manipulating multiple datasets simultaneously

mark_ph · Posted 08-13-2014 10:06 AM

Dear All,

I need to call multiple datasets from the same library in SAS and change the format of one variable called DATE1 in both datasets. All the datasets have the same column variables because I have one dataset per year. I have managed to do it by using the following code (case 2 datasets):

data dataset1 dataset2;

set exp.dataset1 exp.dataset2 ;

date1_mod = INPUT(PUT(date1,8.),YYMMDD8.);

format date1_mod YYMMDD8.

run;

This code works, but I would like to ask you whether there is a more efficient way to do it (I may need to call 15-20 datasets).

Once I import the datasets, I need to use PROC SQL and CREATE TABLE in order to perform another operation on both datasets. The code below works in the case of a single dataset, but it fails with multiple datasets.

My first attempt tries to extend the case with 1 dataset in the following way:

proc sql;

create table mod_dataset1 mod_dataset2 as

select V1, V2, V3, V4, V5, V6, month(date1_mod) as month

from dataset1 dataset2

group by V1, calculated month

having date1_mod=max(date1_mod);

quit;

My second attempt involves the creation of a function MYFUN. This doesn't work either.

* creating function myfun;

function myfun(dataset);

proc sql;

create table newdataset as

select V1, V2, V3, V4, V5, V6, month(date1_mod) as month

from dataset

group by V1, calculated month

having date1_mod=max(date1_mod);

quit;

endsub;

run;

* calling function myfun;

newdataset = myfun(dataset);

Any help would be highly appreciated.

Reeza · Posted 08-13-2014 10:43 AM

I think you're looking to create a macro rather than a function.

For example:

%macro change_data(data_in);

data &data_in.;

set exp.data_in ;

date1_mod = INPUT(PUT(date1,8.),YYMMDD8.);

format date1_mod YYMMDD8.

run;

proc sql;

create table mod_&data_in. as

select V1, V2, V3, V4, V5, V6, month(date1_mod) as month

from &data_in.

group by V1, calculated month

having date1_mod=max(date1_mod);

quit;

%mend;

%change_data(dataset1);

%change_data(dataset2);

%change_data(dataset3);

View solution in original post

RW9 · Posted 08-13-2014 10:11 AM

Personally I would just generate the code from the metadata:

data _null_;

set sashelp.vtable (where=(libname="YOUR_LIBNAME"));

/* for each row returned do */

call execute('data your_libname.'||strip(memname)||';

set your_libname.'||strip(memname)||';

date1_mod=input(put(date1,8.),yymmdd8.);

format...;

run; /* Note you could put further code after here like your proc sql */');

run;

Using the above methodology (or you could use macro), you can create complex large programs with very little starting information.

Astounding · Posted 08-13-2014 10:35 AM

Looking at your DATA step, I question whether you understand what it does. Do you realize that dataset1 and dataset2 will be exact copies of one another? They will contain all the observations from both sources of data (exp.dataset1 and exp.dataset2). Did you want something a little bit different, like dataset1 reflecting the contents of exp.dataset1 and dataset2 reflecting the contents of exp.dataset2?

I like the general solution proposed by RW9, but I would steer you toward identifying columns rather than tables. (I think the source of the metadata would be sashelp.vcolumn but you would have to check.) That way, you can look for only those data sets that specifically contain DATE1 and skip processing any others.

Good luck.

Reeza · Posted 08-13-2014 10:43 AM

I think you're looking to create a macro rather than a function.

For example:

%macro change_data(data_in);

data &data_in.;

set exp.data_in ;

date1_mod = INPUT(PUT(date1,8.),YYMMDD8.);

format date1_mod YYMMDD8.

run;

proc sql;

create table mod_&data_in. as

select V1, V2, V3, V4, V5, V6, month(date1_mod) as month

from &data_in.

group by V1, calculated month

having date1_mod=max(date1_mod);

quit;

%mend;

%change_data(dataset1);

%change_data(dataset2);

%change_data(dataset3);

ballardw · Posted 08-13-2014 11:02 AM

Can't help with the value reassignment though I would suggest going back to the original read date step and use the correct informat to begin with.

Changing characteristics of variables such as Label and Format for multiple datasets may best be done with proc datasets.

mark_ph · Posted 08-13-2014 11:14 AM

Thank you guys for your answers.

Astounding, you are right. I have just realized that my first code produces two identical datasets. As you guessed correctly, I wanted dataset1 to reflect the contents of exp.dataset1 and dataset2 to reflect the contents of exp.dataset2, because I was trying to extend the case with a sigle dataset.

data dataset1;

set exp.dataset1;

date1_mod = INPUT(PUT(date1,8.),YYMMDD8.);

format date1_mod YYMMDD8.

run;

Astounding · Posted 08-13-2014 11:43 AM

mark_ph,

You will need to somehow process each data set individually. Macro language is a good tool for that, but I'm not sure how well versed you are in that area. So I'll stick with the RW9 approach. (Technically, this does use macro language, but it doesn't necessarily appear that it does so.)

data _null_;

set sashelp.vcolumn;

where libname='EXP' and upcase(name)='DATE1';

call execute (' *** clumsily string together all the code you need for a single data set, DATA step plus SQL code *** ');

run;

To get an idea of what you have to work with coming from sashelp.vcolumn, try running this program:

data test;

name='Mark';

run;

proc contents data=sashelp.vcolumn;

run;

proc print data=sashelp.vcolumn;

where libname='WORK' and upcase(memname)='TEST';

run;

Without going into the details of CALL EXECUTE, this should at least get you started in a good direction. If this proves to be too clumsy, you could certainly switch gears. Use these tools to get a list of all the data sets you will need to process, and then follow Reeza's approach.

Good luck.

jakarman · Posted 08-13-2014 11:15 AM

Why using proc sql or a datastep. These have the intention of changing data.

You can do a lot on changing data-descritpions (base) with proc-datasets.
Base SAS(R) 9.4 Procedures Guide, Third Edition

proc datasets library=... ;

modify dataset1 ; format date1_mod YYMMDD8. ;

modify dataset2 ; format date1_mod YYMMDD8. ;

modify dataset3 ; format date1_mod YYMMDD8. ;
run;

I expect it will just do an update in place in the header of the dataset. Not copying/replacing it.
There are no notices on IO when running this.

---->-- ja karman --<-----

Howles · Posted 08-19-2014 11:52 AM

Using PROC DATASETS to apply a format is certainly good practice, in general. But the original post here included the task of deriving a new variable, something PROC DATASETS doesn't do.

Also, I suggest using a DATA step to read the N annual data sets into a single new data set. That makes it possible to use BY and/or WHERE processing for tasks involving selection and/or stratification by year. That's another good practice.

jakarman · Posted 08-20-2014 05:24 AM

Howard, To confess ...
I did read the question and did run to fast into those words of "changing format" oops...( should say sorry)
Seeing afterwards the question was also a variable type conversion. I left my answer as a fit for the original question.
The best thing to do (good practice) could be solving the type when data is coming in. Infile/input options access/dbtype.

The combining to an analytics dataset (by where processing) is an advice I agree.
I looked at your nice paper http://support.sas.com/resources/papers/proceedings14/1277-2014.pdf

The sum statement is not well known SAS(R) 9.4 Statements: Reference, Third Edition. I often fail to remember that one.
The scope to just the SAS environment is an important one as you indicated, external dbms may support something.
Advice of not using the monotonic function as not documented not reliable agree on that as good practice.
Working out 5 approaches with a performance comparison, marvelous. Just a pitty a of the performance of the last two. But it is an indication what will happen if the dataset becomes larger. A long introduction...

There is something missing, that is: "what the effects are of the multi-threading processing in proc sql (Tk-kernel)?".
This is new since 9. The monotonic function can not be reliable anymore when there are several threads parallel processed.
Using the macro-environment for intermediate storage and counting is thinking on a possible parallel interruption break on the wrong moment. It will hardly be possible to test or validate something like that. I think of an interrupt between symget/symput (FCMP usage). The probability that it will happen is almost zero, but is it really zero?

---->-- ja karman --<-----

Classroom Training Available!