Hello,
I have three large data sets that I merge into one data set - one has 20121 rows and the other two have 20143 rows. I sort them all on the Emp identifier and then run the merge.
After the merge I end up with 20143 rows, but 22 members have a duplicate Contribution amount that gets added twice by the merge.
To resolve it, I export my finished report to Excel, find the duplicate SSNs, and highlight them.
Then I filter the 20143 rows on the highlight color to bring them to the top of the list.
Then I manually update the Emp Contribution amount to zero for one of the duplicate values. NODUPRECS and NODUPKEY did not work here because they drop the entire duplicate row, and the other variable amounts in that row are needed - losing them throws off my other column totals.
I am looking for a MODIFY or UPDATE data step, or SQL with a CASE expression, that can clean up the rows with a duplicate member identifier that also have a duplicate contribution amount, by updating one of them to zero.
Any help is greatly appreciated.
SR-
data abc_Merged_001AAa;
    MERGE abc_STEP2_SORTEDA abc_STEP3_SORTED_A abc_STEP1_SORTEDa;
    by ami;
RUN;
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 20143 observations read from the data set WORK.abc_STEP2_SORTEDA.
NOTE: There were 20121 observations read from the data set WORK.abc_STEP3_SORTED_D.
NOTE: There were 20143 observations read from the data set WORK.abc_STEP1_SORTEDA.
NOTE: The data set WORK.abc_MERGED_001AAA has 20143 observations and 21 variables.
We could probably help with this much more easily if you could provide a small data sample that reflects the issue and what you want as final output. It does not have to be your real data, but it should be representative - for example, if you're working by specific groups, include more than one group of data.
After the merge I end up with 20143 rows, but 22 members have a duplicate Contribution amount that gets added twice by the merge.
Usually that means you need to add another condition to the merge, or de-duplicate a data set before the merge - see the sketch below.
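If you want to see which input data set repeats the BY value before the merge, PROC SORT with NODUPKEY and DUPOUT= will split the extra rows into their own data set. A minimal sketch, assuming the data set names from your log (step2_nodup and step2_dups are placeholder names):

proc sort data=abc_STEP2_SORTEDA out=step2_nodup nodupkey dupout=step2_dups;
    by ami;
run;

* step2_dups now holds the surplus rows for any repeated ami value;
* inspect it to decide how to de-duplicate before the merge;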
I am looking for a MODIFY or UPDATE data step, or SQL with a CASE expression, that can clean up the rows with a duplicate member identifier that also have a duplicate contribution amount, by updating one of them to zero.
BY-group processing can identify the groups if you have a key set of variables that defines a 'group'.
Then, if a row is not the first of its group, you can set the value to 0 or to missing. You may prefer missing so those rows are not counted in the N for summary statistics: missing values are excluded automatically, whereas 0s could throw off averages or other calculations. The example below shows both options.
proc sort data=sashelp.cars out=cars;
    by make model mpg_highway;
run;

data identify_first;
    set cars;
    by Make; *note only make this time;
    if NOT first.make then mpg_highway=0; *option 1: set to 0;
    *if NOT first.make then mpg_highway=.; *option 2: set to missing - use instead of the line above, otherwise it overwrites option 1;
    keep make model mpg_highway; *limit data for example;
run;
@SRemm wrote:
Hello,
I have three large data sets that I merge into one data set - one has 20121 rows and the other two have 20143 rows. I sort them all on the Emp identifier and then run the merge.
After the merge I end up with 20143 rows, but 22 members have a duplicate Contribution amount that gets added twice by the merge.
To resolve it, I export my finished report to Excel, find the duplicate SSNs, and highlight them.
Then I filter the 20143 rows on the highlight color to bring them to the top of the list.
Then I manually update the Emp Contribution amount to zero for one of the duplicate values. NODUPRECS and NODUPKEY did not work here because they drop the entire duplicate row, and the other variable amounts in that row are needed - losing them throws off my other column totals.
I am looking for a MODIFY or UPDATE data step, or SQL with a CASE expression, that can clean up the rows with a duplicate member identifier that also have a duplicate contribution amount, by updating one of them to zero.
Any help is greatly appreciated.
SR-
data abc_Merged_001AAa;
    MERGE abc_STEP2_SORTEDA abc_STEP3_SORTED_A abc_STEP1_SORTEDa;
    by ami;
RUN;
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 20143 observations read from the data set WORK.abc_STEP2_SORTEDA.
NOTE: There were 20121 observations read from the data set WORK.abc_STEP3_SORTED_D.
NOTE: There were 20143 observations read from the data set WORK.abc_STEP1_SORTEDA.
NOTE: The data set WORK.abc_MERGED_001AAA has 20143 observations and 21 variables.
FIRST. would solve your problem.
Something like: if not first.id then col=0;
Thanks Suryakiran
So in col=0, col is the variable name?
SR-
Sort the data by the key variable you have (in your case I guess it's ami) and then change the variable values as required.
eg:
data have;
    input id val;
    datalines;
1 20
1 30
1 50
2 10
3 20
4 40
;
run;

proc sort data=have;
    by id val;
run;

data want;
    set have;
    by id;
    if not first.id then val=0;
run;
Hi Suryakiran,
That worked like a charm...
Thanks again for your time - hope this helps other people in the future...
Best,
SR
data want;
    set abc_Merged_001AAa;
    by AMI;
    if not first.AMI then EMP_CTB_AMT=0;
run;
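For reference, since the original question asked to zero only rows where both the member identifier and the contribution amount repeat, adding the amount to the BY statement would restrict the fix to true duplicates. A minimal sketch, assuming the merged data set and variable names above (merged_srt and want2 are placeholder names):

proc sort data=abc_Merged_001AAa out=merged_srt;
    by AMI EMP_CTB_AMT;
run;

data want2;
    set merged_srt;
    by AMI EMP_CTB_AMT;
    * zero the amount only when both the member id and the amount repeat;
    if not first.EMP_CTB_AMT then EMP_CTB_AMT=0;
run;

This leaves members whose rows carry different amounts untouched, whereas the accepted answer zeroes every row after the first within each member.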