Solved: Re: Using two data sets to form another data set (no combine or merge)

Alireza_Boloori · Posted 07-04-2017 08:22 PM

Thank you very much for your thoughts in advance!

I use SAS 9.4. I have a data set with 17170 rows, and I want to create an array based on one of variables in this data set. Then, I want to use that array in another data set to filter its observations (i.e., more than 1 million observations). So, if any observation in the new data set is not among those 17170 different values, then that specific row has to be removed.

Here is the code I used (where libsas refers to my local directory):

data libsas.new_data;
set libsas.data1; /* data1 has 17170 rows */
array schedule(*) X1; /* I want to form array "schedule" based on variable X1 in data1 */

set libsas.data2; /* data2 is the second data set in which I want to filter observations */
if X2 in schedule; /* X2 is the variable in data2 where I want to filter observations */
run;

However, what I get is a data with 17170 rows (i.e., the same size as my first data), while it should be much more that. Would you please help me to understand what's wrong? Thanks!

art297 · Posted 07-04-2017 08:46 PM

You got that result because that is the way SAS works unless told otherwise. If you had read both datasets within separate DO UNTIL loops, you'd have gotten what you wanted. e.g.,

data data1;
  input x1;
  cards;
3
4
5
7
;

data data2;
  input x2 z;
  cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
3 3
4 4
5 5
6 6
7 7
3 3
4 4
5 5
6 6
7 7
;

data new_data (drop=i x1);
  array schedule(20000) _temporary_; 
  do until (eof1);
    set data1 end=eof1;
    i+1;
    schedule(i)=x1;
  end;
  do until (eof2);
    set data2 end=eof2;
    if X2 in schedule then output;
  end;
run;

Art, CEO, AnalystFinder.com

View solution in original post

Astounding · Posted 07-04-2017 08:42 PM

Your array contains one element named X1. It does not contain thousands of elements, just one.

SAS contains many ways to select matches. Here is one:

proc sql;

create table libsas.new_data as select * from libsas.data2

where x2 in (select distinct x1 from libsas.data1);

quit;

Alireza_Boloori · Posted 07-13-2017 09:41 PM

@Astounding Thank you for your feedback!

art297 · Posted 07-04-2017 08:46 PM

You got that result because that is the way SAS works unless told otherwise. If you had read both datasets within separate DO UNTIL loops, you'd have gotten what you wanted. e.g.,

data data1;
  input x1;
  cards;
3
4
5
7
;

data data2;
  input x2 z;
  cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
3 3
4 4
5 5
6 6
7 7
3 3
4 4
5 5
6 6
7 7
;

data new_data (drop=i x1);
  array schedule(20000) _temporary_; 
  do until (eof1);
    set data1 end=eof1;
    i+1;
    schedule(i)=x1;
  end;
  do until (eof2);
    set data2 end=eof2;
    if X2 in schedule then output;
  end;
run;

Art, CEO, AnalystFinder.com

Alireza_Boloori · Posted 07-13-2017 09:40 PM

@art297 Thank you very much!

Reeza · Posted 07-04-2017 09:50 PM

SQL is the easiest way to solve this. See @Astounding solution. A SAS data step doesn't process data all at once, it reads a line at a time.But this means it doesn't have information on the records either ahead or below, though there are ways around that. But in general, in SAS, an array is simply a way to reference variables in the same row.

art297 · Posted 07-04-2017 10:27 PM

@Reeza: @Alireza_Boloori's subject line indicated NO COMBINE of MERGE.

And I don't know if I'd say the SQL solution is easier but, in this case, should run about 100 times faster than the datastep approach.

Art, CEO, AnalystFinder.com

Alireza_Boloori · Posted 07-13-2017 09:42 PM

@Reeza Thank you for your feedback! I was thinking the same about using SQL. I guess I have to start learning it.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away