Solved: Re: how to find the value I am looking for from my dataset

EEEY · Posted 09-30-2015 03:53 PM

I have a small dataset with values like: emplid 1111 2222 3333 Another bigger dataset like: mgrid_1 mgrid_2 mgrid_3 mgrid_4...mgrid_11 101 1111 121212 202 23232 2222 34321 202 1111 23432 3333 11212 4444 55555 6666 Basically, the value of emlid = 1111,2222,3333 locate in different columns of my bigger dataset. The result I am looking for would be emplid mgrid 1111 101 2222 23232 3333 23432. Could anybody help me with a simple solution? Thanks a lot!

Patrick · Posted 10-06-2015 01:49 AM

May be I missunderstand your data and how it's structured. It appears the same value in "have2" can exist more than once. Below code will return at least one row per source row from have1 - but return multiple rows if empid matches more than once to a value in have2.

In below code "&colon;" needs to be replaced with ":"

data have1;
  input empid;
cards;
1111
2222
3333
2124
9999
;

data have2;
  input mgrid_1 mgrid_2 mgrid_3 mgrid_4 mgrid_5;
cards;
101 1111 1212 2124 14322
201 2211 2222 3434 5656
301 3322 2124 5434 3333
401 3355 1199 55 77
;

data want(keep=empid mgrid);
  if _n_=1 then
    do;
      if 0 then set have1(keep=empid) have2(keep=mgrid_1 rename=(mgrid_1=mgrid));
      dcl hash h1(multidata&colon;'y');
      h1.defineKey('empid');
      h1.defineData('mgrid');
      h1.defineDone();

      array mgrids {*} mgrid_1 - mgrid_5;
      do while(not last);
        set have2 end=last;
        do i=2 to dim(mgrids);
          h1.add(key:mgrids[i], data&colon;mgrids[i-1]);
        end;
      end;
    end;
    call missing(of _all_);

    set have1;
    _rc=h1.find();
    output;
    do while(h1.find_next() = 0);
      output;
    end;
 run;

View solution in original post

Steelers_In_DC · Posted 09-30-2015 04:04 PM

I have no idea what you are trying to say. Put in a format like below so we can see what your actual datasets look like:

data have1;
input emplid$;
cards;
1111
2222
3333
;

data have2;
input mgrid_1 mgrid2 mgrid3;
cards;
1111 2222 3333
;

Steelers_In_DC · Posted 09-30-2015 04:04 PM

also a dataset that you 'want'.

EEEY · Posted 09-30-2015 04:14 PM

data have1;
input emplid$;
cards;
1111
2222
3333
;

data have2;
input mgrid_1 mgrid_2 mgrid_3 mgrid_4 mgrid_5;
cards;
101 1111 1212 2124 14322
201 2211 2222 3434 5656
301 3322 2124 5434 3333
401 3355 1199 55 77
;

data want;
input emplid mgrid;
cards;
1111 101
2222 2211
3333 5434
;

My dataset would be like this. The last dataset is what I am looking for

PaigeMiller · Posted 09-30-2015 04:06 PM

I think you'd need to use an ARRAY that loops over all possible columns where the emplid might be. This also allows you to determine the value in the column preceding this.

array x{i} col1-col999;
do i=2 to 999;    /* Note 1111 cannot be in col1 */
    if x{i}=1111 then do; emplid=x{i}; mgrid=x{i-1}; output; end;
end;

--
Paige Miller

EEEY · Posted 09-30-2015 04:17 PM

Thanks PaigeMiller. I guess that would be the solution but I am really bad at array. I have posted the code for the datasets I have and I want. the value 1111 is in my dataset have1.. how should i modify your program?

PaigeMiller · Posted 09-30-2015 04:31 PM

Well, to tell you the truth, I'd like to see you learn a little about arrays instead of me doing the work for you.

I believe you ought to be able to give it a try to get my code to work on your exact data. If it doesn't work, show us the code and the SASLOG and we'll figure out how to make progress.

--
Paige Miller

Steelers_In_DC · Posted 09-30-2015 04:37 PM

I am also not too familiar with arrays. I need to get on that. Here is a solution using the same mindset but taking advantage of transpose:

data have1;
input emplid$;
cards;
1111
2222
3333
;

data have2;
input mgrid_1$ mgrid_2$ mgrid_3$ mgrid_4$ mgrid_5$;
cards;
101 1111 1212 2124 14322
201 2211 2222 3434 5656
301 3322 2124 5434 3333
401 3355 1199 55 77
;

proc transpose data=have2 out=tran2;by mgrid_1;var mgrid_2 mgrid_3 mgrid_4 mgrid_5;

data lag;
set tran2;
by mgrid_1;
mgrid = lag(col1);
if first.mgrid_1 then mgrid = mgrid_1;
run;

proc sql;
create table want as
select a.emplid,b.mgrid
from have1 a left join
lag b on
a.emplid = b.col1;

EEEY · Posted 10-03-2015 09:27 PM

Thanks for your response. Your code is working well. But my dataset Have2 has 182K rows of record and 11 columns.

I am not sure if proc transpose is a good idea.

EEEY · Posted 10-03-2015 09:26 PM

Thanks PaigeMiller for your response. Actually your code is working well. But the thing is 1111 is just one value in the data set Have1. In my really data set Have1, I have 200+ rows. How to modify your code.

The code I have :

data have1;
set emp_level_final2;
keep mgrid_1-mgrid_11;
run;

data have2;
set MS_MGR;
keep emplid;
run;

data want;
set have1;
array x{11} mgrid_1-mgrid_11;

set have2;
array y{1} emplid;
;
do i=2 to 11; /* Note 1111 cannot be in col1 */
if x{i}=y{1} then do; emplid=y{1};sup_id=x{i-1}; output; end;
end;
run;

By using the code above, it only gave me 1 row in the Want data set that I knew it should have come back with about 100 rows). There was no errors in the log. Please help!

MikeZdeb · Posted 10-04-2015 03:07 PM

Hi. You can search a numeric array with the WHICHN function (no loop needed ... there's a WHICHC for character data). To look for one value ...

* note NUMERIC emplid;
data have1;
input emplid @@;
datalines;
1111 2222 3333
;

data have2;
input mgrid_1-mgrid_5;
datalines;
101 1111 1212 2124 14322
201 2211 2222 3434 5656
301 3322 2124 5434 3333
401 3355 1199 55 77
;

data want (keep=emplid mgrid);
retain emplid 1111;
array grid(*) mgrid_1-mgrid_5;
set have2;
found = whichn(emplid, of grid(*));
if found then do; mgrid=grid(found-1); output; end;
run;

If you want to search for more than one EMPLID ...

data want (keep=emplid mgrid);
* lookup array (HAVE1) ... places all values in an array;
array lookup(3) _temporary_
do j=1 to howmany;
set have1 nobs=howmany;
lookup(j) = emplid;
end;

* check grid (HAVE2) for values in lookup (HAVE1);
array grid(5) mgrid_1-mgrid_5;
do until (last);
   set have2 end=last;
do j=1 to howmany;
   found = whichn(lookup(j), of grid(*));
   if found then do; emplid=lookup(j); mgrid=grid(found-1); output; end;
end;
end;
run;

data set WANT ...

Obs emplid mgrid

1      1111       101
2      2222      2211
3      3333      5434

EEEY · Posted 10-08-2015 12:37 AM

Thanks Mike for your solution. Unfortunately emplid and mgrid in my 2 datasets are both character instead of numeric. I am wondering if I still can use with whichn function.

The code is :

data want (keep=emplid mgrid);

* lookup array (HAVE1) ... places all values in an array;

array lookup(*) $ _temporary_;

do j=1 to howmany;

set have1 nobs=howmany;

lookup(j) = emplid;

end;

* check grid (HAVE2) for values in lookup (HAVE1);

array

grid(11) $ mgrid_1-mgrid_11;

do until (last);

set have2 end=last;

do j=1 to 267;

found = whichn(lookup(j), of grid(*));

if found then do; emplid=lookup(j); mgrid=grid(found-1); output; end;

end;

run;

The log I got is:

24 data want (keep=emplid mgrid);

25 * lookup array (HAVE1) ... places all values in an array;

26 array lookup(*) $ _temporary_;

ERROR: The non-variable based array lookup has been defined with zero elements.

27 do j=1 to howmany;

28 set have1 nobs=howmany;

29 lookup(j) = emplid;

ERROR: Too many array subscripts specified for array lookup.

30 end;

31

32 * check grid (HAVE2) for values in lookup (HAVE1);

33 array grid(11) $ mgrid_1-mgrid_11;

34 do until (last);

35 set have2 end=last;

36 do j=1 to 267;

37 found = whichn(lookup(j), of grid(*));

ERROR: Too many array subscripts specified for array lookup.

38 if found then do; emplid=lookup(j); mgrid=grid(found-1); output; end;

ERROR: Too many array subscripts specified for array lookup.

39 end;

40 end;

41 run;

NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).

37:19 37:33

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0 observations and 2 variables.

WARNING: Data set WORK.WANT was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):

real time 0.00 seconds

cpu time 0.01 seconds

Patrick · Posted 10-06-2015 01:49 AM

May be I missunderstand your data and how it's structured. It appears the same value in "have2" can exist more than once. Below code will return at least one row per source row from have1 - but return multiple rows if empid matches more than once to a value in have2.

In below code "&colon;" needs to be replaced with ":"

data have1;
  input empid;
cards;
1111
2222
3333
2124
9999
;

data have2;
  input mgrid_1 mgrid_2 mgrid_3 mgrid_4 mgrid_5;
cards;
101 1111 1212 2124 14322
201 2211 2222 3434 5656
301 3322 2124 5434 3333
401 3355 1199 55 77
;

data want(keep=empid mgrid);
  if _n_=1 then
    do;
      if 0 then set have1(keep=empid) have2(keep=mgrid_1 rename=(mgrid_1=mgrid));
      dcl hash h1(multidata&colon;'y');
      h1.defineKey('empid');
      h1.defineData('mgrid');
      h1.defineDone();

      array mgrids {*} mgrid_1 - mgrid_5;
      do while(not last);
        set have2 end=last;
        do i=2 to dim(mgrids);
          h1.add(key:mgrids[i], data&colon;mgrids[i-1]);
        end;
      end;
    end;
    call missing(of _all_);

    set have1;
    _rc=h1.find();
    output;
    do while(h1.find_next() = 0);
      output;
    end;
 run;

EEEY · Posted 10-08-2015 12:41 AM

Hi Patrick,

Thanks for this cool hash object solution. I learned before (a bit only) and felt it was not really useful but right now I realized that this is hash oject should be used for.

The only thing was that I had to remove (keep=emplid mgrid) to get the complete dataset and then did the seperate data step to keep these two columns only. Otherwise it only gave me one column mgrid.

Really appreciate.

EEEY · Posted 10-15-2015 02:19 AM

hI Patrick,

I am sure you are the expert of hash object.

Actually my dataset Have2 is even more complicated than I think.

So my

data have1;
  input empid;
cards;1111
2222
3333
2124
9999;

data have2;
  input mgrid_1 mgrid_2 mgrid_3 mgrid_4 mgrid_5;
cards;101 1111 1212 2124 14322
201 2211 2222 3434 5656
301 3322 2124 5434 3333
401 3355 1199 55 77
234 2222 2323 2121 55;


So I have added the last row into the dataset have2.

When this look-up process execute, note that value 2222 appeares at two spot in the dataset have2. 
One in column mgrid_2  and the other one in column mgrid_3.

The result I want to see would be like:

emplid mgrid col
1111   101   1
2222   2211  2
2222   234   1
3333   5434  4 
2124   3322  2

 The Column col I would like to add into the want dataset is the level of the manager from Have2 dataset.

Hope you could help me out again.

Thanks a lot.

Classroom Training Available!