BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
fengyuwuzu
Pyrite | Level 9

I have 3 files, all have ID variables. The IDs are in the same format accross the 3 files. They can appear in only 1 file, or 2 files, or 3 files. In each file they are unique (no duplicate), but not sorted. I want to get all unique IDs from the 3 files, only the ID variable is enough.

The files are large, so the best is to avoid proc sort. Any suggestions?

1 ACCEPTED SOLUTION

Accepted Solutions
Pamela_JSRCC
Quartz | Level 8

Using proc sql and doing a union is probably one of the easiest ways.  I created 3 tables with IDs, scrambled them and then created another table with the just the unique IDs.  From the 6.2 million rows in the original tables, there were only 4.4 million unique IDs created.


data tbl1 (keep=id skey);
  do i = 1 to 3000000;
      id   = (i*2);
      skey = rand('Uniform');
      output;
  end;
run;
data tbl2 (keep=id skey);
  do i = 1 to 2000000;
      id   = (i*3);
      skey = rand('Uniform');
      output;
  end;
run;
data tbl3 (keep=id skey);
  do i = 1 to 1200000;
      id   = (i*5);
      skey = rand('Uniform');
      output;
  end;
run;
proc sort data=tbl1; by skey; run;
proc sort data=tbl2; by skey; run;
proc sort data=tbl3; by skey; run;

proc sql stimer;
create table id_list as
      select distinct id from tbl1
union select distinct id from tbl2
union select distinct id from tbl3;
/*
NOTE: Table WORK.ID_LIST created, with 4400000 rows and 1 columns.
NOTE: SQL Statement used (Total process time):
      real time           3.31 seconds
      cpu time            4.88 seconds
*/

View solution in original post

7 REPLIES 7
Astounding
PROC Star

I couldn't program this fast and easily enough, but I know there are those watching who can.

 

Create a hash table based on ID.  There would be one variable in the data portion of the hash table, which would work as follows.  When loading from each data source, add 1 to the data portion that matches the ID.  At the end, unload any ID values that have a data portion of 1.

 

Note that using a counter doesn't tell you where the ID came from.  If you need that information (other than just the fact that the ID was unique), you could increment by 1, 2, and 4 (or by 1, 10, and 100) instead of adding 1 each time.

 

Of course, all of this depends on having enough memory to store the hash table.  So it may depend on your definition of a "large" data set.

 

EDITED AFTER THE FACT:

 

I'm not sure I understood what you meant by getting "unique IDs".  I was trying to get those IDs that appeared in only one of the three data sets.  It could easily be that you meant to get a complete list of all IDs, but with no duplicates.

FreelanceReinh
Jade | Level 19

Hi @fengyuwuzu,

 

When it comes to performance questions, many technical details can be important. So, I guess it would be interesting (for the experts in this matter out there) to know:

 

1. How many observations (in millions) do the three files (SAS datasets, or?) have?
2. What are type, length and typical values of the ID variable?
3. Do indexes exist for one or more of the three files (see PROC CONTENTS output)?
4. Do you have an idea how large the overlaps between the sets of IDs will be?
5. Any hardware limitations (RAM, free disk space) that may be important?
6. What are you planning to do with the dataset containing the unique IDs?

 

Pamela_JSRCC
Quartz | Level 8

Using proc sql and doing a union is probably one of the easiest ways.  I created 3 tables with IDs, scrambled them and then created another table with the just the unique IDs.  From the 6.2 million rows in the original tables, there were only 4.4 million unique IDs created.


data tbl1 (keep=id skey);
  do i = 1 to 3000000;
      id   = (i*2);
      skey = rand('Uniform');
      output;
  end;
run;
data tbl2 (keep=id skey);
  do i = 1 to 2000000;
      id   = (i*3);
      skey = rand('Uniform');
      output;
  end;
run;
data tbl3 (keep=id skey);
  do i = 1 to 1200000;
      id   = (i*5);
      skey = rand('Uniform');
      output;
  end;
run;
proc sort data=tbl1; by skey; run;
proc sort data=tbl2; by skey; run;
proc sort data=tbl3; by skey; run;

proc sql stimer;
create table id_list as
      select distinct id from tbl1
union select distinct id from tbl2
union select distinct id from tbl3;
/*
NOTE: Table WORK.ID_LIST created, with 4400000 rows and 1 columns.
NOTE: SQL Statement used (Total process time):
      real time           3.31 seconds
      cpu time            4.88 seconds
*/

FreelanceReinh
Jade | Level 19

@Pamela_JSRCC: Good suggestion! I think it would save time to omit the DISTINCT keyword. The result of the UNION operator is free of duplicates anyway.

fengyuwuzu
Pyrite | Level 9

Thank you for all replies.

One of my file is about 80G with >750 million rows. But I decided to extract the ID column only which makes the file much smaller.

With the ID_only files, proc sort is not a problem any more. (my ID is characteric with length 44)

 

Yes, I wanted to add one column indicating source, like AB showing the ID exists in both A and D sets; ABC indicating in all 3 sets.

 

Thank you all again!

Pamela_JSRCC
Quartz | Level 8

Once you have the id_list table, you can easily do left joins to create flags for the sources.  I prefer the flag(0,1) implementation because it makes it easy to sum them to get counts:

 

proc sql;
select count(*) as cnt
       from id_list;
create table id_summary as
select a.id,
       case when missing(b1.id) then 0 else 1 end as tbl1_flag 'tbl1 flag',
       case when missing(b2.id) then 0 else 1 end as tbl2_flag 'tbl2 flag',
       case when missing(b3.id) then 0 else 1 end as tbl3_flag 'tbl3 flag',
       calculated tbl1_flag +
          calculated tbl2_flag +
             calculated tbl3_flag                 as tbl_cnt   'tbl cnt'
from   id_list A
left join tbl1 B1 on b1.id = a.id
left join tbl2 B2 on b2.id = a.id
left join tbl3 B3 on b3.id = a.id
;
select tbl_cnt, count(*) as cnt
from   id_summary
group  by tbl_cnt
;
select tbl1_flag, tbl2_flag, tbl3_flag, count(*) as cnt
from   id_summary
group  by tbl1_flag, tbl2_flag, tbl3_flag
;
select  sum(tbl1_flag) as tbl1_cnt,
        sum(tbl2_flag) as tbl2_cnt,
        sum(tbl3_flag) as tbl3_cnt
from    id_summary;
/*
     cnt
--------
 4400000

 tbl cnt       cnt
------------------
       1   2800000
       2   1400000
       3    200000

    tbl1      tbl2      tbl3
    flag      flag      flag       cnt
--------------------------------------
       0         0         1    400000
       0         1         0    800000
       0         1         1    200000
       1         0         0   1600000
       1         0         1    400000
       1         1         0    800000
       1         1         1    200000

tbl1_cnt  tbl2_cnt  tbl3_cnt
----------------------------
 3000000   2000000   1200000
*/

 

fengyuwuzu
Pyrite | Level 9
Thank you!! This is really a brilliant idea. I learned a lot again.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1214 views
  • 1 like
  • 4 in conversation