Re: Merging two data sets without adding records

Paul_NYS · Posted 06-15-2015 04:17 PM

Hi Everyone

I have a base data set that contains the info for a group of court cases--there is one record for each case. There is another data set I have that has dockets in it and I need one column in it for the cases in the base data set. The second data set has more than one record for each case--the multiple records are different dockets/dates for each case.

I would just like to grab the docket ID column from the second data set and add it to the first base data set populated with the most recent docket from those in the second docket data set--and keep the same original number of records in the first base data set. However, when I run the merge (using a "by" statement with a shared ID), the merge just creates multiple records in the first data set for each of multiple dockets in the second data set.

Any ideas how to set up a merge that will maintain the same records in the base data set and just add the most recent column (docket) from the second data set?

Also, here is the merge statement I am using:

data s3t1;

merge s3 (in=a) vgcleanD1 (keep=entity_id FileDate Docket);

if a;

by entity_id;

run;

Paul

sbb · Posted 06-15-2015 04:27 PM

Pursue using the IN=<flag_indicator> on the SET statement as an individual dataset option -- then test for those contributing instances that are for your BASE dataset -- suggest using SAS diagnostic code PUTLOG _ALL_; which will show evidence of conditions you "want" - suggest reviewing the SAS DOC on DATA step processing with SET and the IN= dataset option.

Also,"most recent" implies you have a way to order your "second" dataset with DESCENDING on a SORT with BY? Look at using EQUALS and NODUPKEY on a second sort step that has a BY omitting the 'date' variable -- that will remove duplicates other than the most recent. Here again, use PROC PRINT (intermediate) and/or PUTLOG _ALL_; or PROC FREQ; against your "ordered" and "de-dup'd" second dataset to prepare it for the final MERGE back with your BASE dataset using a DATA step with a MERGE (along with the IN= dataset parameter), a BY, and again using the IF <flag_indicator> to detect a BASE observation contributed condition for OUTPUT.

Scott Barry
SBBWorks, Inc.

Paul_NYS · Posted 06-15-2015 04:46 PM

I tried the sort with nodupkey and it didn't get me what I was looking for.

Hold off the above question, I didn't phrase it correctly.

Paul

Kurt_Bremser · Posted 06-16-2015 01:21 AM

Sort vgcleanD1 by entity_id descending FileDate;

Then do:

data s3t1;

merge s3 (in=a) vgcleanD1 (keep=entity_id FileDate Docket);

if a;

by entity_id;

if first.entity_id;

run;

This of course assumes that there is only one observation per entity_id in s3; if there are more, you need to make the reduction of vgcleanD1 in an intermediate step after the sort:

data vgcleanD1_int;

set vgcleanD1;

by entity_id;

if first.entity_id;

run;

Then use vgcleanD1_int in the merge, and omit the first.entity_id condition.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 06-15-2015 09:14 PM

Since you are doing a 1 to Many merge all of the records from the "Many" will still be marked as also being in the "one" since they match on the BY variables.

You can add

IF FIRST.ENTITY_ID;

to prevent it from taking more than one record per entity.

Merging two data sets without adding records