I have a dataset called low_2002 as follows:
DATA low_2002;
input county_id ammonia nitrogen voc;
DATALINES;
1001 568.76 2329.47 6695.85
1003 0.24 0.00 50.41
1005 10.73 347.76 325.23
1007 0.00 59.06 77.60
1009 4.99 0.00 0.00
;
RUN;
and another dataset called TRI as follows:
DATA TRI;
input county_id voc;
DATALINES;
1001 432.47
1003 0.00
1005 3.22
1007 0.00
1009 1.52
;
RUN;
What I would like to do is replace each county_id's VOC value in low_2002 with the corresponding value in TRI, sequentially, while keeping everything else exactly the same. This will produce 5 datasets. I want to name them low_2002_TRI_1, low_2002_TRI_2, ..., low_2002_TRI_5. For example, low_2002_TRI_1 will be: (county_id 1001 has the VOC value of 6695.85
replaced with 432.47
, while everything else stays the same).
DATA low_2002_TRI_1;
input county_id ammonia nitrogen voc;
DATALINES;
1001 568.76 2329.47 432.47
1003 0.24 0.00 50.41
1005 10.73 347.76 325.23
1007 0.00 59.06 77.60
1009 4.99 0.00 0.00
;
RUN;
low_2002_TRI_2 will be:
DATA low_2002_TRI_2;
input county_id ammonia nitrogen voc;
DATALINES;
1001 568.76 2329.47 6695.85
1003 0.24 0.00 0.00
1005 10.73 347.76 325.23
1007 0.00 59.06 77.60
1009 4.99 0.00 0.00
;
RUN;
and all the way until low_2002_TRI_5:
DATA low_2002_TRI_5;
input county_id ammonia nitrogen voc;
DATALINES;
1001 568.76 2329.47 6695.85
1003 0.24 0.00 50.41
1005 10.73 347.76 325.23
1007 0.00 59.06 77.60
1009 4.99 0.00 1.52
;
RUN;
Then, I would like to save each of the low_2002_TRI_1, low_2002_TRI_2, ..., low_2002_TRI_5 as csv files with no variable names and the county_id variable dropped, so that low_2002_TRI_1.csv looks like:
568.76 | 2329.47 | 432.47 |
0.24 | 0.00 | 50.41 |
10.73 | 347.76 | 325.23 |
0.00 | 59.06 | 77.60 |
4.99 | 0.00 | 0.00 |
and low_2002_TRI_2 looks like:
568.76 | 2329.47 | 6695.85 |
0.24 | 0.00 | 0.00 |
10.73 | 347.76 | 325.23 |
0.00 | 59.06 | 77.60 |
4.99 | 0.00 | 0.00 |
The above is just an example of 5 county_id's, but in reality, I will be applying this procedure to a fuller dataset with approximately 3000 county_id's.
At first you might imagine this task would read one obs from LOW_2002 at a time. It would then selectively output the unchanged obs to all but one of the output datasets. Then make the change, using the data from TRI, and output the changed obs to the single appropriate dataset. Go on and do the next record from LOW_2002.
But this requires a lot of hard coding of dataset names, among other complexities.
A better way is to load the original LOW_2002 into a hash object (i.e. a table in memory). Then read one record at a time from TRI. For each such record from TRI, modify the variable VOC in the hash object and then output the hash object to the corresponding dataset. Then restore the original value to the hash object.
DATA low_2002;
input county_id ammonia nitrogen voc;
DATALINES;
1001 568.76 2329.47 6695.85
1003 0.24 0.00 50.41
1005 10.73 347.76 325.23
1007 0.00 59.06 77.60
1009 4.99 0.00 0.00
RUN;
DATA TRI;
input county_id voc;
DATALINES;
1001 432.47
1003 0.00
1005 3.22
1007 0.00
1009 1.52
RUN;
data _null_;
if 0 then set low_2002;
if _n_=1 then do;
declare hash h(dataset:'low_2002',ordered:'a');
h.definekey('county_id');
h.definedata(all:'Y');
h.definedone();
end;
set tri (rename=(voc=new_voc));
h.find();
tmp_voc=voc;
voc=new_voc;
h.replace();
h.output(dataset:cats('low_2002_tri_',_n_));
voc=tmp_voc;
h.replace();
run;
Among the advantages of the above is that you don't have to know in advance how many datasets will be produced. Instead just dynamically generate as many dataset names as needed by constructing the dataset names in the hash OUTPUT method. It will produce as many datasets as there are observations in TRI.
BTW, this program assumes there is a one-to-one correspondence of COUNTY_ID in TRI to COUNTY_ID in LOW_2002.
At first you might imagine this task would read one obs from LOW_2002 at a time. It would then selectively output the unchanged obs to all but one of the output datasets. Then make the change, using the data from TRI, and output the changed obs to the single appropriate dataset. Go on and do the next record from LOW_2002.
But this requires a lot of hard coding of dataset names, among other complexities.
A better way is to load the original LOW_2002 into a hash object (i.e. a table in memory). Then read one record at a time from TRI. For each such record from TRI, modify the variable VOC in the hash object and then output the hash object to the corresponding dataset. Then restore the original value to the hash object.
DATA low_2002;
input county_id ammonia nitrogen voc;
DATALINES;
1001 568.76 2329.47 6695.85
1003 0.24 0.00 50.41
1005 10.73 347.76 325.23
1007 0.00 59.06 77.60
1009 4.99 0.00 0.00
RUN;
DATA TRI;
input county_id voc;
DATALINES;
1001 432.47
1003 0.00
1005 3.22
1007 0.00
1009 1.52
RUN;
data _null_;
if 0 then set low_2002;
if _n_=1 then do;
declare hash h(dataset:'low_2002',ordered:'a');
h.definekey('county_id');
h.definedata(all:'Y');
h.definedone();
end;
set tri (rename=(voc=new_voc));
h.find();
tmp_voc=voc;
voc=new_voc;
h.replace();
h.output(dataset:cats('low_2002_tri_',_n_));
voc=tmp_voc;
h.replace();
run;
Among the advantages of the above is that you don't have to know in advance how many datasets will be produced. Instead just dynamically generate as many dataset names as needed by constructing the dataset names in the hash OUTPUT method. It will produce as many datasets as there are observations in TRI.
BTW, this program assumes there is a one-to-one correspondence of COUNTY_ID in TRI to COUNTY_ID in LOW_2002.
Thanks! That's really smart and efficient way of doing it. Do you have any ideas how to accomplish the last part? I would like to export each dataset to csv without the county_id variable. So for example, for Low_2002_tri_1, I would do:
data Low_2002_tri_1(drop=county_id);
set Low_2002_tri_1;
run;
proc export data=Low_2002_tri_1
outfile="E:\folder\Low_2002_tri_1.csv"
dbms=csv
replace;
putnames=no;
run;
But, how can I loop this for all datasets?
Ah nevermind, I figured it out by writing a macro, thanks again!
@elbarto wrote:
Thanks! That's really smart and efficient way of doing it. Do you have any ideas how to accomplish the last part? I would like to export each dataset to csv without the county_id variable. So for example, for Low_2002_tri_1, I would do:
data Low_2002_tri_1(drop=county_id); set Low_2002_tri_1; run; proc export data=Low_2002_tri_1 outfile="E:\folder\Low_2002_tri_1.csv" dbms=csv replace; putnames=no; run;
But, how can I loop this for all datasets?
Pretty much every where you use a data set you can use Data set Options to drop, keep or rename variables.
So
proc export data=Low_2002_tri_1 (drop=country_id) outfile="E:\folder\Low_2002_tri_1.csv" dbms=csv replace; putnames=no; run;
No need for that potentially dangerous data step to corrupt your data set.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.