BookmarkSubscribeRSS Feed
osbornejo
Obsidian | Level 7

I have a large table that looks something like this:

 

IDCrossRef
10001.
10002.
10003.
......
12005.
1200610003
1200710042
......
1301510003
......

 

Any row that is missing a CrossRef is an original entry, and any row with a CrossRef is modifying the row with the matching ID.  Larger ID values indicate later entries.

 

What I want is to sort this so that the original entry is immediately followed by any that modify it, e.g.:

 

IDCrossRef
10001.
10002.
10003.
1200610003
1301510003
10005.
......
10042.
1200710042
......
12 REPLIES 12
mkeintz
PROC Star

This does what I understand you want:

 

data need / view=need;
  set have;
  _sortvar=coalesce(crossref,id);
run;
proc sort data=need out=want (drop=_:);
  by _sortvar id;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
osbornejo
Obsidian | Level 7

This almost works - the problem I'm having is that each original entry (e.g. id = 10003) always has a CrossRef of . (i.e. missing), so the coalesce gets confused. What you're doing makes sense though.

Tom
Super User Tom
Super User

@osbornejo wrote:

This almost works - the problem I'm having is that each original entry (e.g. id = 10003) always has a CrossRef of . (i.e. missing), so the coalesce gets confused. What you're doing makes sense though.


That does not make sense.

Try it.

data have;
  input ID	CrossRef ;
cards;
10001	.
10002	.
10003	.
12005	.
12006	10003
12007	10042
13015	10003
;

data step1;
  set have;
  orig_id = coalesce(crossref,id);
run;

proc sort;
  by orig_id id;
run;

Results

                Cross
Obs      ID      Ref     orig_id

 1     10001        .     10001
 2     10002        .     10002
 3     10003        .     10003
 4     12006    10003     10003
 5     13015    10003     10003
 6     12007    10042     10042
 7     12005        .     12005
osbornejo
Obsidian | Level 7

I just checked my actual data again and I misspoke - the original entries have a CrossRef of 0, not missing.  I think that's what is throwing the sort off, because when I tried it, I ended up with 

 

Obs      ID      CrossRef orig_id

 1     10001        0     0
 2     10002        0     0
 3     10003        0     0
 4     12005        0     0
 5     12006    10003     10003
 6     13015    10003     10003
 7     12007    10042     10042

 

Also, ID and CrossRef are character variables, not numeric. 

Tom
Super User Tom
Super User

Is the issue that the CROSS_REF is not to the ORIGINAL id but to just the PREVIOUS id?  Which might have itself been a CROSS_REF before?

mkeintz
PROC Star

@osbornejo wrote:

This almost works - the problem I'm having is that each original entry (e.g. id = 10003) always has a CrossRef of . (i.e. missing), so the coalesce gets confused. What you're doing makes sense though.


The task is "What I want is to sort this so that the original entry is immediately followed by any that modify it,".

which is what happens in the code I provided, given that "larger ID values indicate later entries.".

 

I would need some additional explanation of what you want, and some sample data and desired results in which the submitted code produces different results.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
osbornejo
Obsidian | Level 7

I just checked my actual data again and I misspoke - the original entries have a CrossRef of 0, not missing.  I think that's what is throwing the coalesce off, because when I tried it, I ended up with 

 

Obs      ID      CrossRef orig_id

 1     10001        0     0
 2     10002        0     0
 3     10003        0     0
 4     12005        0     0
 5     12006    10003     10003
 6     13015    10003     10003
 7     12007    10042     10042

 

Also, ID and CrossRef are character variables, not numeric. 

mkeintz
PROC Star

@osbornejo wrote:

I just checked my actual data again and I misspoke - the original entries have a CrossRef of 0, not missing.  I think that's what is throwing the coalesce off

Yes, a zero is not a missing value, so coalesce will not skip over it to look for a subsequent non-missing value.  This can be dealt with by some minor code modification.

 

But possibly more significant you also tell us

ID and CrossRef are character variables, not numeric. 


Now the coalesce function will automatically convert arguments of entirely numeric characters to actual numeric values, with a note to that effect on your log.  This is probably good, because any ID or CROSSREF values of less than 5 digits (forget the "0" for now), will be sorted in numeric order, rather than in lexicographic order if they had remained character.  Otherwise "2345" would sort AFTER "12345".

 

So I would stick with the coalesce function, but change the first argument from  CROSSREF to IFC(crossref='0','.',crossref), allowing a character "0" to be replaced by a ".", and treated by coalesce as a missing value.

 

Example:

 

data have;
  infile datalines truncover;
  input ID :$5.	CrossRef :$5.;
cards;
10001	0
10002	0
10003	0
12005	0
12006	10003
12007	10042
13015	10003
;

data need / view=need;
  set have;
  _sortvar=coalesce(ifc(crossref='0','.',crossref),id);
run;
proc sort data=need out=want (drop=_:);
  by _sortvar id;
run;

 

Edit note: actually the problem of character variables is not entirely solved above.  The minor sort key (ID) can cause a problem it you apply the code above to the data below:

 

data have;
  infile datalines truncover;
  input ID :$5.	CrossRef :$5.;
cards;
2345    0
10001	0
10002	0
10003	0
12005	0
12006	10003
12007	10042
13015	10003
19999   2345
33333   2345
;

The fix would be:

 

data need / view=need;
  set have;
  _sortvar=coalesce(ifc(crossref='0','.',crossref),id);
  _sortvar2=right(id);
run;
proc sort data=need out=want (drop=_:);
  by _sortvar _sortvar2;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
osbornejo
Obsidian | Level 7
Ok, I tried this and when I look at _sortvar in the need table, it is equal to . when the CrossRef is 0. I don't understand why that is happening - I thought coalesce() was supposed to skip missing values. Does the period work as a missing value when it is a character variable?
Tom
Super User Tom
Super User

Only strings that contain all spaces will be treated as missing.  Perhaps you are confused because the normal character informat $ will convert '.' into ' '.

mkeintz
PROC Star

@osbornejo wrote:
Ok, I tried this and when I look at _sortvar in the need table, it is equal to . when the CrossRef is 0. I don't understand why that is happening - I thought coalesce() was supposed to skip missing values. Does the period work as a missing value when it is a character variable?

If CrossRef="0" then the only time _sortvar should be missing is when ID has a non-numeric character or is completely blank.  And in that case there would be an error message on your SAS log.

 

So I can't respond usefully to this result without relevant information.  Please provide (1) the original data of the offending observations, and (2) the log of the data step that generated the _sortvar=. result.    Otherwise, I am unable to reproduce the problem you describe.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
osbornejo
Obsidian | Level 7
Unfortunately I can't provide the original data as it contains sensitive/identifiable information. However, I was able to get your code to work by replacing '.' in the ifc() with '' (empty string). My thanks to you and @Tom for your help!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1199 views
  • 7 likes
  • 3 in conversation