- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi everyone,
If I have 2 datasets having the same variables and both of them having the variable named "report_date":
1. Daily_Data (imported into LIB "Work")
2. Consolidated_Data (stored at LIB "Mylib")
On daily basic, after importing "Daily_data", I am using the scripts:
data Mylib.consolidated_data;
set Mylib.Consolidated_data work.daily_data;
run;
to extend the dataset "Consolidated_data". However, to avoid combining the same source data twice or 3 times unintentionally, can we put a condition some thing like that:
Let say, the dataset "Daily_data" with the variable "Report_date" having one same value. For eg. 09/08/2018. Accordingly, if the variable "Report_date" in dataset "Consolidated_data" contains at least one observation equal to 09/08/2018, the combining step will be skipped.
Thank you,
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
To prevent accidental dupülication of entries, but still allow updates (if new data needs to be corrected), I'd do this:
data _null_;
set work.daily_data;
call symput('exclude',put(report_date,best.));
stop;
run;
data Mylib.consolidated_data;
set
Mylib.Consolidated_data (
where=(report_date ne &exclude)
)
work.daily_data
;
run;
Note how a little visual formatting makes the set statement easier to grok.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
To prevent accidental dupülication of entries, but still allow updates (if new data needs to be corrected), I'd do this:
data _null_;
set work.daily_data;
call symput('exclude',put(report_date,best.));
stop;
run;
data Mylib.consolidated_data;
set
Mylib.Consolidated_data (
where=(report_date ne &exclude)
)
work.daily_data
;
run;
Note how a little visual formatting makes the set statement easier to grok.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you very much. It was what I am looking for. My scripts have worked.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Assuming that your constant Report_Date var is always larger than the report_date values in the consolidated data set, you can do something like this
proc sql;
insert into Mylib.consolidated_data as cons
select * from work.daily_data as daily
having daily.Report_date > max(cons.Report_date);
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for your instruction but I would like to keep proc import. 🙂
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
First, note that your program will take longer and longer to run over time. Instead of using DATA step to combine the data sets, it would be much more efficient to use PROC APPEND:
proc append data=daily_data base=Mylib.consolidated_data;
run;
This assumes that the data set structure never changes (same variables, same lengths for each variable, every time).
Since the latest batch of data will appear at the end of the DATA step, you can rely on the final observation in the consolidated data set coming from the last batch of data. To extract the final observation, you could use something like this:
data final;
set Mylib.consolidated_data nobs=_nobs_ point=_nobs_;
output;
stop;
keep report_date;
rename report_date = final_previous_date;
run;
Since it reads just 1 observation, it will be speedy.
Then get the earliest date in your daily data set:
proc summary data=daily_data;
var report_date;
output out=daily_summary min=current_minimum_date;
run;
Finally, combine these into a way that runs PROC APPEND only when needed;
data _null_;
set daily_summary;
set final;
if current_minimum_date > final_previous_date then do;
call execute("proc append data=daily_data base=Mylib.consolidated_data; run;");
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Check if PROC SORT with NODUPKEY will work for you.
Suryakiran
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks. As I am quite newbie. Would you mind show me the detailed codes?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data Mylib.consolidated_data;
set Mylib.Consolidated_data work.daily_data;
run;
proc sort data Mylib.consolidated_data nodupkey;
by report_date;
run;
After the data is appended you can sort the data set with nodupkey option which will remove duplicate report_date.
Note: The second occurrence of the same by variable record will be deleted.
example:
data test;
input id num;
datalines;
1 10
1 20
2 10
3 10
;
run;
proc sort data=test nodupkey;
by id;
run;
/* If you want to keep the highest number */
data test;
input id num;
datalines;
1 10
1 20
2 10
3 10
;
run;
proc sort data=test;
by id descending num;
run;
proc sort data=test nodupkey;
by id;
run;
Suryakiran