BookmarkSubscribeRSS Feed
Ronein
Onyx | Level 15

Hello

Lets say I want to set  multiple data sets  and I want to have only rows of my population.

The population is defined by the customers exist in data set t1.

IS there a  better code (less time  run) to do it?

In real world each table have around 2 million rows and  have 12 tables so it take long time run

Data t1;
input id x y year;
cards;
1 19 29 2025
2 10 15 2025
3 30 29 2025
;
run;

Data t2;
input id x y year;
cards;
1 34 36 2024
3 23 12 2024
;
run;

Data t3;
input id x y year;
cards;
2 56 23 2023
5 18 17 2023
7 43 23 2023
;
run;

proc sql;
create table t2_pop as
select a.*
from t2 as a
inner join t1  as b
on a.id=b.id
;
quit;

proc sql;
create table t3_pop as
select a.*
from t3 as a
inner join t1  as b
on a.id=b.id
;
quit;

data want ;
set t1
	t2_pop 
	t3_pop;
Run;


11 REPLIES 11
Kathryn_SAS
SAS Employee

You can use one DATA step with MERGE and then another DATA step to SET the results with the original data set.

data combine;
merge t1(in=a) t2(in=b) t3(in=c);
by id;
if a;
run;

data final;
set t1 combine;
run;

proc print data=final;
run;
Ronein
Onyx | Level 15
I think the merge you did have a problem because data sets have same columns names so it take only columns from t3
Kathryn_SAS
SAS Employee

My MERGE is doing what your 2 PROC SQL steps are doing. Then I SET those results with the original data set. If you run my code with your data sets, you will see that the Final data set is the same as your Want data set.

Ronein
Onyx | Level 15
merge t1(in=a) t2(in=b) t3(in=c);
The x here is from t3 only
Kathryn_SAS
SAS Employee

My code also includes a BY and an IF statement. If you run the complete code:

Data t1;
input id x y year;
cards;
1 19 29 2025
2 10 15 2025
3 30 29 2025
;
run;

Data t2;
input id x y year;
cards;
1 34 36 2024
3 23 12 2024
;
run;

Data t3;
input id x y year;
cards;
2 56 23 2023
5 18 17 2023
7 43 23 2023
;
run;

data combine;
merge t1(in=a) t2(in=b) t3(in=c);
by id;
if a;
run;

proc print data=combine;
run;

You get this output for the first part:

Obs    id     x     y    year

 1      1    34    36    2024
 2      2    56    23    2023
 3      3    23    12    2024
Tom
Super User Tom
Super User

SQL does not guarantee order unless you add an ORDER BY clause to the query.

 

But your output does not look like the right answer to me. 

 

There should be 6 observations, 2 for each ID in the final result. 3 come from the T1 dataset and then T2 contributes one observation each for IDs 1 and 3 and T3 only contributes the one observation for ID 2.

 

To see this keep track of where the data came from.

data want1;
  set t1(in=in1) t2-t3 indsname=dsn;
  by id;
  retain pop;
  if first.id then pop=in1;
  if pop;
  dsname = dsn;
run;

proc print;
run;

Result

Obs    id     x     y    year    pop    dsname

 1      1    19    29    2025     1     WORK.T1
 2      1    34    36    2024     1     WORK.T2
 3      2    10    15    2025     1     WORK.T1
 4      2    56    23    2023     1     WORK.T3
 5      3    30    29    2025     1     WORK.T1
 6      3    23    12    2024     1     WORK.T2

This data step

data combine;
merge t1(in=a) t2(in=b) t3(in=c);
by id;
if a;
run;

Only keeps three observations because there are just three observations in T1 and none of the other datasets have more than one observation per ID. 

 

Essentially instead of subsetting to just  the subjects that appear in the first dataset you are UPDATING the data in the first dataset with the values from the later datasets.  

 

Note that if the same ID appeared in T2 and T3 then the value for T2 will disappear as its value is overwritten by the value from T3.

 

I am not sure that is what the user wanted, but in that case you might also want to look at using the UPDATE statement instead of the MERGE statement.   Although UPDATE only allows two source datasets so if you would need to first combine all of the follow up datasets into one.  One of the nice things about the UPDATE statement if the goal is a single observation per ID is that it will apply the transactions in order so if there were updates in 2023 and then new updates in 2024 they will be applied in order.  And you only need to enter the data into the transactions for the variables that have actually changed.  If the value in the transaction dataset is missing for a variable then no change is made to the current value of that variable.

 

 

 

Tom
Super User Tom
Super User

@Kathryn_SAS wrote:

My MERGE is doing what your 2 PROC SQL steps are doing. Then I SET those results with the original data set. If you run my code with your data sets, you will see that the Final data set is the same as your Want data set.


No.  MERGE is not the same thing as SQL JOIN.

 

Especially when the datasets in question have common non-key variables.

 

In a MERGE the values of such common variable come from the last dataset mentioned in the MERGE statement that contributes to the observation.  In SQL you need to explicitly decide which version of the variables you want. If you select multiple variables with the same name (for example by using the * wildcard) then the values used will be those coming from the first one listed in the SELECT statement.  The later ones are just ignored.

 

In this case the SQL join in question was only selecting variables from the A alias (T2 and T3)  so there is no risk of values being mistakenly overwritten.

 

Kathryn_SAS
SAS Employee

In the example that was provided, when I run the code below and compare the PROC SQL results to the Merge, before the final step of concatenating with the original data set (t1), the data sets are the same for each ID value; although the order of the observations is different. 

Data t1;
input id x y year;
cards;
1 19 29 2025
2 10 15 2025
3 30 29 2025
;
run;

Data t2;
input id x y year;
cards;
1 34 36 2024
3 23 12 2024
;
run;

Data t3;
input id x y year;
cards;
2 56 23 2023
5 18 17 2023
7 43 23 2023
;
run;

proc sql;
create table t2_pop as
select a.*
from t2 as a
inner join t1  as b
on a.id=b.id
;
quit;

/*proc print data=t2_pop;*/
/*run;*/

proc sql;
create table t3_pop as
select a.*
from t3 as a
inner join t1  as b
on a.id=b.id
;
quit;

/*proc print data=t3_pop;*/
/*run;*/

data combine;
set t2_pop t3_pop;
run;

title 'Data set from 2 SQL steps';
proc print data=combine;
run;

data combine1;
merge t1(in=a) t2(in=b) t3(in=c);
by id;
if a;
run;

title 'Data set from Merge';
proc print data=combine1;
run;
Data set from 2 SQL steps                                            15:12 Tuesday, July 1, 2025   4

Obs    id     x     y    year

 1      1    34    36    2024
 2      3    23    12    2024
 3      2    56    23    2023



Data set from Merge                                                  15:12 Tuesday, July 1, 2025   5

Obs    id     x     y    year

 1      1    34    36    2024
 2      2    56    23    2023
 3      3    23    12    2024

mkeintz
PROC Star

 

If each dataset is sorted by ID, then

 

data want (drop=_:);
  set t1 (in=in1) t2 t3;
  by id;
  retain _found_in_t1;
  if first.id then _found_in_t1=in1;
  if _found_in_t1;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

Note that if you name the data sets carefully you can use LIST syntax to shorten the set statement.

 

data want1;
   set somelib.setX: ;
run;

The colon in the name means that ALL data sets in the Somelib library whose names start with the letters SETX would be used on the set statement.

 

data want2;
   set somelib.set23 - Somelib.set37;
run;

The - between the names means that the data sets somelib.set23, somelib.set24, etc. sequentially to somelib.set37 would be used. This syntax requires that all of the numbers from 23 through 37 exist or will throw errors, or at least did the last time I used this. 

 

Both constructs may be used in a single set statement.

 

The reason your code takes a long time to run is how long the multiple passes/combining of sets take. It has little to do with the set statements.

Tom
Super User Tom
Super User

You could use the SET statement to INTERLEAVE the observations instead of the MERGE statement.  That way since each observation stays separate there is no risk of overwriting values because of common variable names.

 

Because the IN= indicator variable will be reset to false when moving past the set of observations read from the POPulation dataset you will need to create another retained variable to keep track of whether any observations were found in the population dataset.

 

General solution might look like this if the dataset are sorted by ID.  Then you could create a retained variable POP to remember the IN= indicator variable's value on the first observation per ID and use that to subset the data.

data want;
  set t1(in=in1) t2-t3;
  by id;
  retain pop;
  if first.id then pop=in1;
  if pop;
run;

Note if the population dataset is actually just the list of ID (and not actual observations you want to write out) then just modify so those observations are not written.

  if first.id then pop=in1;
  else if pop;

 

If you did want to do it with SQL then use the UNION set operator.

proc sql;
create table want as
select * from t1
union
select * from t2 where id in (select id from t1)
union
select * from t3 where id in (select id from t1)
;
quit;

If the variables are not in the same order in all of the dataset then add the CORRESPONDING keyword to the UNIONs.  If you could have duplicate observations then add the ALL keyword to the UNIONs.

 

If the dataset are not sorted (and the POPULATION dataset is small enough) you could use HASH object.

data want;
  if _n_=1 then do;
    declare hash h(dataset:'t1');
    h.definekey('id');
    h.definedata('id');
    h.definedone();
  end;
  set t1-t3;
  if 0=h.find();
run;
Spoiler
1    data t1;
2      input id x y year;
3    cards;

NOTE: The data set WORK.T1 has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


7    ;
8
9    data t2;
10     input id x y year;
11   cards;

NOTE: The data set WORK.T2 has 2 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


14   ;
15
16   data t3;
17     input id x y year;
18   cards;

NOTE: The data set WORK.T3 has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


22   ;
23
24   data want1;
25     set t1(in=in1) t2-t3;
26     by id;
27     retain pop;
28     if first.id then pop=in1;
29     if pop;
30   run;

NOTE: There were 3 observations read from the data set WORK.T1.
NOTE: There were 2 observations read from the data set WORK.T2.
NOTE: There were 3 observations read from the data set WORK.T3.
NOTE: The data set WORK.WANT1 has 6 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


31
32   proc print;
33   run;

NOTE: There were 6 observations read from the data set WORK.WANT1.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


34
35   proc sql;
36   create table want2 as
37   select * from t1
38   union
39   select * from t2 where id in (select id from t1)
40   union
41   select * from t3 where id in (select id from t1)
42   order by id
43   ;
NOTE: Table WORK.WANT2 created, with 6 rows and 4 columns.

44   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds


45   proc print;
46   run;

NOTE: There were 6 observations read from the data set WORK.WANT2.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


47
48   data want3;
49     if _n_=1 then do;
50       declare hash h(dataset:'t1');
51       h.definekey('id');
52       h.definedata('id');
53       h.definedone();
54     end;
55     set t1-t3;
56     if 0=h.find();
57   run;

NOTE: There were 3 observations read from the data set WORK.T1.
NOTE: There were 3 observations read from the data set WORK.T1.
NOTE: There were 2 observations read from the data set WORK.T2.
NOTE: There were 3 observations read from the data set WORK.T3.
NOTE: The data set WORK.WANT3 has 6 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


58
59   proc print;
60   run;

NOTE: There were 6 observations read from the data set WORK.WANT3.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 1306 views
  • 2 likes
  • 5 in conversation