Re: Sort based on reference column and ID column

osbornejo · Posted 06-17-2022 06:00 PM

I have a large table that looks something like this:

ID	CrossRef
10001	.
10002	.
10003	.
...	...
12005	.
12006	10003
12007	10042
...	...
13015	10003
...	...

Any row that is missing a CrossRef is an original entry, and any row with a CrossRef is modifying the row with the matching ID. Larger ID values indicate later entries.

What I want is to sort this so that the original entry is immediately followed by any that modify it, e.g.:

ID	CrossRef
10001	.
10002	.
10003	.
12006	10003
13015	10003
10005	.
...	...
10042	.
12007	10042
...	...

mkeintz · Posted 06-17-2022 06:24 PM

This does what I understand you want:

data need / view=need;
  set have;
  _sortvar=coalesce(crossref,id);
run;
proc sort data=need out=want (drop=_:);
  by _sortvar id;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

osbornejo · Posted 06-21-2022 05:51 PM

This almost works - the problem I'm having is that each original entry (e.g. id = 10003) always has a CrossRef of . (i.e. missing), so the coalesce gets confused. What you're doing makes sense though.

Tom · Posted 06-21-2022 07:04 PM

@osbornejo wrote:

This almost works - the problem I'm having is that each original entry (e.g. id = 10003) always has a CrossRef of . (i.e. missing), so the coalesce gets confused. What you're doing makes sense though.

That does not make sense.

Try it.

data have;
  input ID	CrossRef ;
cards;
10001	.
10002	.
10003	.
12005	.
12006	10003
12007	10042
13015	10003
;

data step1;
  set have;
  orig_id = coalesce(crossref,id);
run;

proc sort;
  by orig_id id;
run;

Results

                Cross
Obs      ID      Ref     orig_id

 1     10001        .     10001
 2     10002        .     10002
 3     10003        .     10003
 4     12006    10003     10003
 5     13015    10003     10003
 6     12007    10042     10042
 7     12005        .     12005

osbornejo · Posted 06-22-2022 10:34 AM

I just checked my actual data again and I misspoke - the original entries have a CrossRef of 0, not missing. I think that's what is throwing the sort off, because when I tried it, I ended up with

Obs      ID      CrossRef orig_id

 1     10001        0     0
 2     10002        0     0
 3     10003        0     0
 4     12005        0     0
 5     12006    10003     10003
 6     13015    10003     10003
 7     12007    10042     10042

Also, ID and CrossRef are character variables, not numeric.

Tom · Posted 06-21-2022 07:07 PM

Is the issue that the CROSS_REF is not to the ORIGINAL id but to just the PREVIOUS id? Which might have itself been a CROSS_REF before?

mkeintz · Posted 06-22-2022 10:06 AM

@osbornejo wrote:

This almost works - the problem I'm having is that each original entry (e.g. id = 10003) always has a CrossRef of . (i.e. missing), so the coalesce gets confused. What you're doing makes sense though.

The task is "What I want is to sort this so that the original entry is immediately followed by any that modify it,".

which is what happens in the code I provided, given that "larger ID values indicate later entries.".

I would need some additional explanation of what you want, and some sample data and desired results in which the submitted code produces different results.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

osbornejo · Posted 06-22-2022 10:35 AM

I just checked my actual data again and I misspoke - the original entries have a CrossRef of 0, not missing. I think that's what is throwing the coalesce off, because when I tried it, I ended up with

Obs      ID      CrossRef orig_id

 1     10001        0     0
 2     10002        0     0
 3     10003        0     0
 4     12005        0     0
 5     12006    10003     10003
 6     13015    10003     10003
 7     12007    10042     10042

Also, ID and CrossRef are character variables, not numeric.

mkeintz · Posted 06-23-2022 08:51 AM

@osbornejo wrote:

I just checked my actual data again and I misspoke - the original entries have a CrossRef of 0, not missing. I think that's what is throwing the coalesce off

Yes, a zero is not a missing value, so coalesce will not skip over it to look for a subsequent non-missing value. This can be dealt with by some minor code modification.

But possibly more significant you also tell us

ID and CrossRef are character variables, not numeric.

Now the coalesce function will automatically convert arguments of entirely numeric characters to actual numeric values, with a note to that effect on your log. This is probably good, because any ID or CROSSREF values of less than 5 digits (forget the "0" for now), will be sorted in numeric order, rather than in lexicographic order if they had remained character. Otherwise "2345" would sort AFTER "12345".

So I would stick with the coalesce function, but change the first argument from CROSSREF to IFC(crossref='0','.',crossref), allowing a character "0" to be replaced by a ".", and treated by coalesce as a missing value.

Example:

data have;
  infile datalines truncover;
  input ID :$5.	CrossRef :$5.;
cards;
10001	0
10002	0
10003	0
12005	0
12006	10003
12007	10042
13015	10003
;

data need / view=need;
  set have;
  _sortvar=coalesce(ifc(crossref='0','.',crossref),id);
run;
proc sort data=need out=want (drop=_:);
  by _sortvar id;
run;

Edit note: actually the problem of character variables is not entirely solved above. The minor sort key (ID) can cause a problem it you apply the code above to the data below:

data have;
  infile datalines truncover;
  input ID :$5.	CrossRef :$5.;
cards;
2345    0
10001	0
10002	0
10003	0
12005	0
12006	10003
12007	10042
13015	10003
19999   2345
33333   2345
;

The fix would be:

data need / view=need;
  set have;
  _sortvar=coalesce(ifc(crossref='0','.',crossref),id);
  _sortvar2=right(id);
run;
proc sort data=need out=want (drop=_:);
  by _sortvar _sortvar2;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

osbornejo · Posted 06-23-2022 12:17 PM

Ok, I tried this and when I look at _sortvar in the need table, it is equal to . when the CrossRef is 0. I don't understand why that is happening - I thought coalesce() was supposed to skip missing values. Does the period work as a missing value when it is a character variable?

Tom · Posted 06-23-2022 12:21 PM

Only strings that contain all spaces will be treated as missing. Perhaps you are confused because the normal character informat $ will convert '.' into ' '.

mkeintz · Posted 06-23-2022 02:56 PM

@osbornejo wrote:
Ok, I tried this and when I look at _sortvar in the need table, it is equal to . when the CrossRef is 0. I don't understand why that is happening - I thought coalesce() was supposed to skip missing values. Does the period work as a missing value when it is a character variable?

If CrossRef="0" then the only time _sortvar should be missing is when ID has a non-numeric character or is completely blank. And in that case there would be an error message on your SAS log.

So I can't respond usefully to this result without relevant information. Please provide (1) the original data of the offending observations, and (2) the log of the data step that generated the _sortvar=. result. Otherwise, I am unable to reproduce the problem you describe.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

osbornejo · Posted 06-23-2022 03:10 PM

Unfortunately I can't provide the original data as it contains sensitive/identifiable information. However, I was able to get your code to work by replacing '.' in the ifc() with '' (empty string). My thanks to you and @Tom for your help!

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away