Solved: Re: merge all IDs from 3 files , anyway to avoid sorting (files are la...

fengyuwuzu · Posted 04-20-2016 03:12 PM

I have 3 files, all have ID variables. The IDs are in the same format accross the 3 files. They can appear in only 1 file, or 2 files, or 3 files. In each file they are unique (no duplicate), but not sorted. I want to get all unique IDs from the 3 files, only the ID variable is enough.

The files are large, so the best is to avoid proc sort. Any suggestions?

Pamela_JSRCC · Posted 04-20-2016 04:28 PM

Using proc sql and doing a union is probably one of the easiest ways. I created 3 tables with IDs, scrambled them and then created another table with the just the unique IDs. From the 6.2 million rows in the original tables, there were only 4.4 million unique IDs created.

data tbl1 (keep=id skey);
do i = 1 to 3000000;
      id   = (i*2);
      skey = rand('Uniform');
      output;
end;
run;
data tbl2 (keep=id skey);
do i = 1 to 2000000;
      id   = (i*3);
      skey = rand('Uniform');
      output;
end;
run;
data tbl3 (keep=id skey);
do i = 1 to 1200000;
      id   = (i*5);
      skey = rand('Uniform');
      output;
end;
run;
proc sort data=tbl1; by skey; run;
proc sort data=tbl2; by skey; run;
proc sort data=tbl3; by skey; run;

proc sql stimer;
create table id_list as
      select distinct id from tbl1
union select distinct id from tbl2
union select distinct id from tbl3;
/*
NOTE: Table WORK.ID_LIST created, with 4400000 rows and 1 columns.
NOTE: SQL Statement used (Total process time):
      real time           3.31 seconds
      cpu time            4.88 seconds
*/

View solution in original post

Astounding · Posted 04-20-2016 04:00 PM

I couldn't program this fast and easily enough, but I know there are those watching who can.

Create a hash table based on ID. There would be one variable in the data portion of the hash table, which would work as follows. When loading from each data source, add 1 to the data portion that matches the ID. At the end, unload any ID values that have a data portion of 1.

Note that using a counter doesn't tell you where the ID came from. If you need that information (other than just the fact that the ID was unique), you could increment by 1, 2, and 4 (or by 1, 10, and 100) instead of adding 1 each time.

Of course, all of this depends on having enough memory to store the hash table. So it may depend on your definition of a "large" data set.

EDITED AFTER THE FACT:

I'm not sure I understood what you meant by getting "unique IDs". I was trying to get those IDs that appeared in only one of the three data sets. It could easily be that you meant to get a complete list of all IDs, but with no duplicates.

FreelanceReinh · Posted 04-20-2016 04:20 PM

Hi @fengyuwuzu,

When it comes to performance questions, many technical details can be important. So, I guess it would be interesting (for the experts in this matter out there) to know:

1. How many observations (in millions) do the three files (SAS datasets, or?) have?
2. What are type, length and typical values of the ID variable?
3. Do indexes exist for one or more of the three files (see PROC CONTENTS output)?
4. Do you have an idea how large the overlaps between the sets of IDs will be?
5. Any hardware limitations (RAM, free disk space) that may be important?
6. What are you planning to do with the dataset containing the unique IDs?

Pamela_JSRCC · Posted 04-20-2016 04:28 PM

Using proc sql and doing a union is probably one of the easiest ways. I created 3 tables with IDs, scrambled them and then created another table with the just the unique IDs. From the 6.2 million rows in the original tables, there were only 4.4 million unique IDs created.

data tbl1 (keep=id skey);
do i = 1 to 3000000;
      id   = (i*2);
      skey = rand('Uniform');
      output;
end;
run;
data tbl2 (keep=id skey);
do i = 1 to 2000000;
      id   = (i*3);
      skey = rand('Uniform');
      output;
end;
run;
data tbl3 (keep=id skey);
do i = 1 to 1200000;
      id   = (i*5);
      skey = rand('Uniform');
      output;
end;
run;
proc sort data=tbl1; by skey; run;
proc sort data=tbl2; by skey; run;
proc sort data=tbl3; by skey; run;

proc sql stimer;
create table id_list as
      select distinct id from tbl1
union select distinct id from tbl2
union select distinct id from tbl3;
/*
NOTE: Table WORK.ID_LIST created, with 4400000 rows and 1 columns.
NOTE: SQL Statement used (Total process time):
      real time           3.31 seconds
      cpu time            4.88 seconds
*/

FreelanceReinh · Posted 04-20-2016 04:46 PM

@Pamela_JSRCC: Good suggestion! I think it would save time to omit the DISTINCT keyword. The result of the UNION operator is free of duplicates anyway.

fengyuwuzu · Posted 04-20-2016 05:14 PM

Thank you for all replies.

One of my file is about 80G with >750 million rows. But I decided to extract the ID column only which makes the file much smaller.

With the ID_only files, proc sort is not a problem any more. (my ID is characteric with length 44)

Yes, I wanted to add one column indicating source, like AB showing the ID exists in both A and D sets; ABC indicating in all 3 sets.

Thank you all again!

Pamela_JSRCC · Posted 04-20-2016 05:45 PM

Once you have the id_list table, you can easily do left joins to create flags for the sources. I prefer the flag(0,1) implementation because it makes it easy to sum them to get counts:

proc sql;
select count(*) as cnt
       from id_list;
create table id_summary as
select a.id,
       case when missing(b1.id) then 0 else 1 end as tbl1_flag 'tbl1 flag',
       case when missing(b2.id) then 0 else 1 end as tbl2_flag 'tbl2 flag',
       case when missing(b3.id) then 0 else 1 end as tbl3_flag 'tbl3 flag',
       calculated tbl1_flag +
          calculated tbl2_flag +
             calculated tbl3_flag                 as tbl_cnt 'tbl cnt'
from   id_list A
left join tbl1 B1 on b1.id = a.id
left join tbl2 B2 on b2.id = a.id
left join tbl3 B3 on b3.id = a.id
;
select tbl_cnt, count(*) as cnt
from   id_summary
group by tbl_cnt
;
select tbl1_flag, tbl2_flag, tbl3_flag, count(*) as cnt
from   id_summary
group by tbl1_flag, tbl2_flag, tbl3_flag
;
select sum(tbl1_flag) as tbl1_cnt,
        sum(tbl2_flag) as tbl2_cnt,
        sum(tbl3_flag) as tbl3_cnt
from    id_summary;
/*
     cnt
--------
4400000

tbl cnt       cnt
------------------
       1   2800000
       2   1400000
       3    200000

    tbl1      tbl2      tbl3
    flag      flag      flag       cnt
--------------------------------------
       0         0         1    400000
       0         1         0    800000
       0         1         1    200000
       1         0         0   1600000
       1         0         1    400000
       1         1         0    800000
       1         1         1    200000

tbl1_cnt tbl2_cnt tbl3_cnt
----------------------------
3000000   2000000   1200000
*/

fengyuwuzu · Posted 04-21-2016 10:33 AM

Thank you!! This is really a brilliant idea. I learned a lot again.

merge all IDs from 3 files , anyway to avoid sorting (files are large)

Re: merge all IDs from 3 files , anyway to avoid sorting (files are large)

Re: merge all IDs from 3 files , anyway to avoid sorting (files are large)

Re: merge all IDs from 3 files , anyway to avoid sorting (files are large)

Re: merge all IDs from 3 files , anyway to avoid sorting (files are large)

Re: merge all IDs from 3 files , anyway to avoid sorting (files are large)

Re: merge all IDs from 3 files , anyway to avoid sorting (files are large)

Re: merge all IDs from 3 files , anyway to avoid sorting (files are large)

Re: merge all IDs from 3 files , anyway to avoid sorting (files are large)

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away