SRemm
Calcite | Level 5

Hello,

 

I have three large data sets that I merge into one. Data set 1 has 20121 rows, and data sets 2 and 3 each have 20143 rows. I sort them all on the Emp identifier and then run the merge.

 

After I merge the three I end up with 20143 rows, but 22 members have a duplicate Contribution amount that gets added twice by the merge.

 

To resolve it, I export my finished report into Excel, find the duplicate SSNs, and highlight them.

 

Then I filter the 20143 rows on the highlight color to bring them to the top of the list.

 

Then I manually update the Emp Contribution amount to zero for one of the duplicate values. I tried NODUPRECS/NODUPKEY, but they drop the entire duplicate row, and the other variable amounts in those rows are needed; dropping them throws off my other column totals.

 

I am looking for a MODIFY or UPDATE data step, or SQL with a CASE expression, that can clean up the rows with a duplicate member identifier and a duplicate contribution amount by updating one of them to zero.


Any help is greatly appreciated. 

 

SR- 

 


data abc_Merged_001AAa;
  merge abc_STEP2_SORTEDA abc_STEP3_SORTED_A abc_STEP1_SORTEDa;
  by ami;
run;

 


NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 20143 observations read from the data set WORK.abc_STEP2_SORTEDA.
NOTE: There were 20121 observations read from the data set WORK.abc_STEP3_SORTED_D.
NOTE: There were 20143 observations read from the data set WORK.abc_STEP1_SORTEDA.
NOTE: The data set WORK.abc_MERGED_001AAA has 20143 observations and 21 variables.
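
For context, the first NOTE above is the key symptom: in a match-merge, when one input has fewer rows in a BY group, its values are carried onto every remaining row of that group. A minimal illustration with hypothetical data:

data a;
  input id contribution;
  datalines;
1 100
;

data b;
  input id other;
  datalines;
1 5
1 6
;

data m;
  merge a b;
  by id;
run;
/* m has two rows for id=1, and contribution=100 appears on both,
   so it is counted twice in any downstream sum */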

 

1 ACCEPTED SOLUTION

Reeza
Super User

We could help with this much more easily if you could provide a small data sample that reflects the issue, along with what you want as the final output. It does not have to be your real data, but it should be representative of it; for example, if you're processing by specific groups, make sure to include more than one group of data.

 

After I merge the three I end up with 20143 rows, but 22 members have a duplicate Contribution amount that gets added twice by the merge.

 

Usually that tells me you need to add another condition to the merge, or de-duplicate one of the input data sets before the merge.
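
As a quick diagnostic (a sketch, assuming the data set names from your MERGE statement), you can list which key values repeat in a given input; run it against each of the three inputs to see where the duplicates live:

proc sql;
  /* list ami values that appear more than once in this input */
  select ami, count(*) as n_rows
  from abc_STEP1_SORTEDa
  group by ami
  having count(*) > 1;
quit;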

 

I am looking for a MODIFY or UPDATE data step, or SQL with a CASE expression, that can clean up the rows with a duplicate member identifier and a duplicate contribution amount by updating one of them to zero.

 

BY-group processing will identify the groups as long as you have a key set of variables that defines a 'group'.

Then, if a row is not the first of its group, you can set those values to 0 or to missing. You may prefer missing so the rows are not counted in the N for summary statistics: missing values are excluded automatically, whereas 0s could throw off averages or other calculations.

 

proc sort data=sashelp.cars out=cars;
  by make model mpg_highway;
run;

data identify_first;
  set cars;
  by make; /* note: BY only MAKE this time */

  /* use one of the following, not both (the second would overwrite the first): */
  if not first.make then mpg_highway = 0;   /* set repeats to 0 */
  /* if not first.make then mpg_highway = .;   set repeats to missing */

  keep make model mpg_highway; /* limit the variables for the example */
run;
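
A quick way to see the difference (using the identify_first data set from above): with repeats set to 0 they still count toward N and pull the mean down, whereas repeats set to missing would drop out of both N and the mean.

proc means data=identify_first n mean;
  class make;
  var mpg_highway;
run;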


5 REPLIES
SuryaKiran
Meteorite | Level 14

FIRST. processing would solve your problem.

Something like: if not first.id then col=0;

Thanks,
Suryakiran
SRemm
Calcite | Level 5

Thanks Suryakiran

 

So in col=0, col stands for the variable name?

 

SR- 

SuryaKiran
Meteorite | Level 14

Sort the data by the key variable you have (in your case I guess it's ami) and then change the variable values as required.

 

e.g.:

data have;
  input id val;
  datalines;
1 20
1 30
1 50
2 10
3 20
4 40
;

proc sort data=have;
  by id val;
run;

data want;
  set have;
  by id;
  if not first.id then val = 0; /* zero out all but the first value in each id group */
run;
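
For reference, the WANT data set from the step above comes out as:

id  val
 1   20
 1    0
 1    0
 2   10
 3   20
 4   40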
Thanks,
Suryakiran
SRemm
Calcite | Level 5

Hi Suryakiran,

 

That worked like a charm... 

 

Thanks again for your time. I hope this helps other people in the future.

 

Best,


SR

 

data want;
  set abc_Merged_001AAa;
  by AMI;
  if not first.AMI then EMP_CTB_AMT = 0;
run;
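
A quick sanity check on the result (a sketch, assuming the variable names from the step above): confirm that the row count, member count, and contribution total look right after zeroing the repeats.

proc sql;
  select count(*)            as n_rows,
         count(distinct AMI) as n_members,
         sum(EMP_CTB_AMT)    as total_contribution
  from want;
quit;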

