I have 3 files, all have ID variables. The IDs are in the same format accross the 3 files. They can appear in only 1 file, or 2 files, or 3 files. In each file they are unique (no duplicate), but not sorted. I want to get all unique IDs from the 3 files, only the ID variable is enough.
The files are large, so the best is to avoid proc sort. Any suggestions?
Using proc sql and doing a union is probably one of the easiest ways. I created 3 tables with IDs, scrambled them and then created another table with the just the unique IDs. From the 6.2 million rows in the original tables, there were only 4.4 million unique IDs created.
data tbl1 (keep=id skey);
do i = 1 to 3000000;
id = (i*2);
skey = rand('Uniform');
output;
end;
run;
data tbl2 (keep=id skey);
do i = 1 to 2000000;
id = (i*3);
skey = rand('Uniform');
output;
end;
run;
data tbl3 (keep=id skey);
do i = 1 to 1200000;
id = (i*5);
skey = rand('Uniform');
output;
end;
run;
proc sort data=tbl1; by skey; run;
proc sort data=tbl2; by skey; run;
proc sort data=tbl3; by skey; run;
proc sql stimer;
create table id_list as
select distinct id from tbl1
union select distinct id from tbl2
union select distinct id from tbl3;
/*
NOTE: Table WORK.ID_LIST created, with 4400000 rows and 1 columns.
NOTE: SQL Statement used (Total process time):
real time 3.31 seconds
cpu time 4.88 seconds
*/
I couldn't program this fast and easily enough, but I know there are those watching who can.
Create a hash table based on ID. There would be one variable in the data portion of the hash table, which would work as follows. When loading from each data source, add 1 to the data portion that matches the ID. At the end, unload any ID values that have a data portion of 1.
Note that using a counter doesn't tell you where the ID came from. If you need that information (other than just the fact that the ID was unique), you could increment by 1, 2, and 4 (or by 1, 10, and 100) instead of adding 1 each time.
Of course, all of this depends on having enough memory to store the hash table. So it may depend on your definition of a "large" data set.
EDITED AFTER THE FACT:
I'm not sure I understood what you meant by getting "unique IDs". I was trying to get those IDs that appeared in only one of the three data sets. It could easily be that you meant to get a complete list of all IDs, but with no duplicates.
Hi @fengyuwuzu,
When it comes to performance questions, many technical details can be important. So, I guess it would be interesting (for the experts in this matter out there) to know:
1. How many observations (in millions) do the three files (SAS datasets, or?) have?
2. What are type, length and typical values of the ID variable?
3. Do indexes exist for one or more of the three files (see PROC CONTENTS output)?
4. Do you have an idea how large the overlaps between the sets of IDs will be?
5. Any hardware limitations (RAM, free disk space) that may be important?
6. What are you planning to do with the dataset containing the unique IDs?
Using proc sql and doing a union is probably one of the easiest ways. I created 3 tables with IDs, scrambled them and then created another table with the just the unique IDs. From the 6.2 million rows in the original tables, there were only 4.4 million unique IDs created.
data tbl1 (keep=id skey);
do i = 1 to 3000000;
id = (i*2);
skey = rand('Uniform');
output;
end;
run;
data tbl2 (keep=id skey);
do i = 1 to 2000000;
id = (i*3);
skey = rand('Uniform');
output;
end;
run;
data tbl3 (keep=id skey);
do i = 1 to 1200000;
id = (i*5);
skey = rand('Uniform');
output;
end;
run;
proc sort data=tbl1; by skey; run;
proc sort data=tbl2; by skey; run;
proc sort data=tbl3; by skey; run;
proc sql stimer;
create table id_list as
select distinct id from tbl1
union select distinct id from tbl2
union select distinct id from tbl3;
/*
NOTE: Table WORK.ID_LIST created, with 4400000 rows and 1 columns.
NOTE: SQL Statement used (Total process time):
real time 3.31 seconds
cpu time 4.88 seconds
*/
@Pamela_JSRCC: Good suggestion! I think it would save time to omit the DISTINCT keyword. The result of the UNION operator is free of duplicates anyway.
Thank you for all replies.
One of my file is about 80G with >750 million rows. But I decided to extract the ID column only which makes the file much smaller.
With the ID_only files, proc sort is not a problem any more. (my ID is characteric with length 44)
Yes, I wanted to add one column indicating source, like AB showing the ID exists in both A and D sets; ABC indicating in all 3 sets.
Thank you all again!
Once you have the id_list table, you can easily do left joins to create flags for the sources. I prefer the flag(0,1) implementation because it makes it easy to sum them to get counts:
proc sql;
select count(*) as cnt
from id_list;
create table id_summary as
select a.id,
case when missing(b1.id) then 0 else 1 end as tbl1_flag 'tbl1 flag',
case when missing(b2.id) then 0 else 1 end as tbl2_flag 'tbl2 flag',
case when missing(b3.id) then 0 else 1 end as tbl3_flag 'tbl3 flag',
calculated tbl1_flag +
calculated tbl2_flag +
calculated tbl3_flag as tbl_cnt 'tbl cnt'
from id_list A
left join tbl1 B1 on b1.id = a.id
left join tbl2 B2 on b2.id = a.id
left join tbl3 B3 on b3.id = a.id
;
select tbl_cnt, count(*) as cnt
from id_summary
group by tbl_cnt
;
select tbl1_flag, tbl2_flag, tbl3_flag, count(*) as cnt
from id_summary
group by tbl1_flag, tbl2_flag, tbl3_flag
;
select sum(tbl1_flag) as tbl1_cnt,
sum(tbl2_flag) as tbl2_cnt,
sum(tbl3_flag) as tbl3_cnt
from id_summary;
/*
cnt
--------
4400000
tbl cnt cnt
------------------
1 2800000
2 1400000
3 200000
tbl1 tbl2 tbl3
flag flag flag cnt
--------------------------------------
0 0 1 400000
0 1 0 800000
0 1 1 200000
1 0 0 1600000
1 0 1 400000
1 1 0 800000
1 1 1 200000
tbl1_cnt tbl2_cnt tbl3_cnt
----------------------------
3000000 2000000 1200000
*/
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.