DATA Step, Macro, Functions and more

Create unique identifier to link rows with duplicates across different variables

Reply
New User
Posts: 1

Create unique identifier to link rows with duplicates across different variables

[ Edited ]

Hi all-

 

I could really use some help with Base SAS programming.

 

I have a dataset with a unique identifier called clt_id. There are also 50 variables var1-var50 that have unique identifiers for property ownership.

 

So I have 51 collums and I would like to create an identifier that links the clt_id's together when there is a match between any of the 50 property identifiers.

 

So the collums look like this

clt_id   var1 ... var50

 

and I would like to have:

clt_id var1 ... var50 varm

where varm would be the same for two clt_ids if any var from one clt_id matched with any other var under another clt_id. There are a maximum of 2 occurances for each var so varm should only contain pairs or unique values.

 

I would like the identifier to be a concatanation of the two relevant clt_ids.

 

ie.

if first.clt_id > second.clt_id then ident = cats(of second.clt_id first.clt_id);
if first.clt_id < second.clt_id then ident = cats(of first.clt_id second.clt_id);

 

I tried stacking the collums into one called var, sort ascending var clt_id , then using the property ownership identifier on second.var as the identifier.

 

Then i sorted descending and did the same to apply the ownership identifier to the other match. This should work since I am only interested in matching in twos.

 

Heres what I have so far:

%macro stack();
	data stacked;
	set
	%do i=1 %to 50;
		have (keep=var&i clt_id rename=(var&i=var))
	%end;
;
	run;
%mend stack;
%stack();

proc sort data=stacked;
	by ascending var clt_id;
run;

data stacked;
	do _n_=1 by 1 until(last.var);
	set stacked;
	by var;
	if second.var then varm=var;
	end;
run;

data sort data=stacked;
	by descending var clt_id;
run;

data stacked;
	do _n_=1 by 1 until(last.var);
	set stacked;
	by var;
	if second.var then dinm=var;
run;

 

 I received the following error on my last data step:

 

ERROR 650-185: DATA STEP Component Object failure.
               Aborted during the COMPILATION phase.
ERROR 557-185: Variable second is not an object.

 

I thought I could refer to first and second when using by after the set. Anyone know why I am getting this error?

 

Thanks so much,

Krazee

PROC Star
Posts: 283

Re: Create unique identifier to link rows with duplicates across different variables

Posted in reply to krazee_koder

Hi, What is second.var? That seems interesting. Has SAS introduced an automatic variable second.var recently?

Do you mind charting out a sample table of your HAVE dataset and a WANT dataset with a few records to help me visualise better please. Thank you!

Super User
Posts: 5,497

Re: Create unique identifier to link rows with duplicates across different variables

Posted in reply to krazee_koder

Some of the rules before this gets too krazy ...

 

There is no such thing as "second." in SAS.  You have "first." and "last." and that's it.

 

To be allowed to use "first." and "last." there must be a BY statement earlier in the DATA step, using the same variable name.  So if your BY statement reads "by din;" then you have first.din and last.din to work with, but there is no last.var to work with.

 

To be allowed to use a BY statement, your data must be sorted by the variable(s) in the BY statement.  There are exceptions to this if you specify the word NOTSORTED, but save that for another day.

 

If you want help fixing your application, that might be possible.  I would advise skipping macro language entirely, however, until you get a working version of your program.  Then you can begin to think about how to add macro language to generate the same working program.

Trusted Advisor
Posts: 1,553

Re: Create unique identifier to link rows with duplicates across different variables

Posted in reply to krazee_koder

It is pitty you havn't post the full log. It is difficult to know which line made the ERROR without the full log.

 

You cannot use SECOND.VAR - what did you mean by it?

You can use first.var meaning first row of a value when sorted by var.

You can get the previous row value of the variable VAR using LAG function. (like: prev = lag(var); )

 

I cannot guess what you meant by 

data stacked;
	do _n_=1 by 1 until(last.var);
	set stacked;
	by var;
	if second.var then dinm=var;
run;

What is DINM ?

Where do you check that vars match?

If you have both CLT_IDs in memory and they are numeric you can conctenate them by:

varm = min(clt_id1, clt_id2) || max(clt_id1, clt_id2);

and no need to sort twice (ascending and descending).

 

 

 

Super User
Posts: 11,343

Re: Create unique identifier to link rows with duplicates across different variables

Posted in reply to krazee_koder

Here's an idea.

Provide some example data. You need not provide all 50 variables but 3 or 4 should be sufficient to show the procees.

Then show what the final result should look like for that data.

 

I have a feeling that you are making this way more complicated than it should be.

 

NOTE that First. and Last. refer to the ROW order in the data set. Also the value of First. is either 1 or 0 for true (the record is the first fo the specified group) or false (the record is not the first of the group).

 

Ask a Question
Discussion stats
  • 4 replies
  • 59 views
  • 0 likes
  • 5 in conversation