04-01-2014 11:09 PM
I couldn't find an easy and efficient way in SAS to copy one dataset to another inside the same library.
# 3 below seems best to me, but it’s not obvious, and I couldn't find it explicitly documented anywhere.
Maybe there’s an easier way, so please comment.
I can think of 4 ways to copy one dataset to another inside the same library:
Example of # 3 (the destination dataset dest_ds needs to be empty or nonexistent, so I'm deleting it)
proc datasets nolist lib=MyLib;
append data = source_ds
out = dest_ds ;
04-02-2014 01:35 AM
#2 is the best option, hands down. Everything else is crushing peanuts with an elephant's hoof, IMO.
If you want to do it inside SAS, #1 is preferred. A simple "data xxx; set yyy; run;" will not be slower than #3, because it will be simply limited by I/O bandwidth.
Hint: Copying inside the same library will usually be slower than copying to another one, because you use the same HW for reading and writing. Different libs _may_ be located on different disks, giving you more aggregate bandwidth.
04-02-2014 04:16 AM
Sorry, I would disagree with that. #1 would be the better option principally because there is no error checking (unless you program it yourself) done, and generally the feedback from the operation is not looked at. You wont notice so much with copy, however if it fails (file not found, dir not exist, no permissions etc.) then you may find processes further down the line are using old data. Its even more dangerous with delete commands as it can default in other areas, or cause other unexpected problems. If you absolutely have to use DOS to move/copy/delete files I would suggest either writing some QC code around to check existence of all items and post process messages coming back etc. or alternatively use an app or code one yourself outside SAS. For anything else use the datastep as this will tell you any errors directly in your log and not continue to operate until you fix it. #3, #4 don't seem particularly intuitive or advantageous to my mind.
04-02-2014 04:49 AM
04-02-2014 05:57 AM
#3 should be faster than #1 because APPEND will use block reads and writes. Using PROC DELETE and PROC APPEND may be faster than PROC DATASETS for the reason that PROC DELETE was resurrected.
04-02-2014 05:00 AM
Rw9, the #1 option is a SAS datastep using the PDV. Sequential processing with a lot of not needed overhead.
I agree with Kurt as copying at OS level is not having all that SAS-overhead. In a smarter OS design this kind of copying could be propagated as a remote command executes on the SAN directly. In that case you will not even notice any IO. The best of IO for performance is no IO at all. And trust the binary copy of an OS as when you cannot trust that, dump it all as it rubbish.
Do you need to do it with SAS statements?
think on using Proc SQL creating a new table with all fields. That procedure is multithreaded and as result the order of records can change.
But when that order is not relevant, no problem.
Brings me back to your original question, why should you make an additional copy of an existing dataset?
Is it for archiving a history. you have aging (proc datasets): Base SAS(R) 9.3 Procedures Guide, Second Edition (little more as changing names)
Is it for maintaining a series of datasets, you have gdg's: SAS(R) 9.3 Language Reference: Concepts, Second Edition (every time a new version is created the old one is kept)
with gdg's you are making a copy and rename as part of the concept
04-02-2014 09:36 AM
Thank you for all of these responses.
I'm surprised to see how many people prefer #1 -- the simple data step.
Jaap asked a good question: why would anyone (in one's right mind) want to copy a dataset inside the same library?
The simple answer is because this is what some people are used to doing -- those people whose first operating system was Windows. I often make a simple-default copy of a file in the same folder, then rename it to a new name, and proceed to modify the new file.
In my current project I'm doing the same thing automatically (automating waste).
I'm downloading many monthly datasets into a folder/library with a date in their names, but I want to use only the most recent dataset in my program, so I'm copying the most recent dataset to another dataset (in the same folder) without the date in the name. In my program, I'm using the most recent dataset -- the one without the date in the name.
Jaap correctly suggested that this is a job for a Generational Data Group, so I really need to look into that. Maybe that's why I'm having difficulties -- because I should be using a GDG.
04-02-2014 10:29 AM
Why not just set a macro variable to the name of the dataset and avoid copying it at all?
You can automate the detection of which is the most recent by using the date embedded in the member name or by using other metadata such as MODATE.
04-02-2014 10:47 AM
I'm using several macro variables, but I can't use that specific one that Tom suggested because I don't know which date is the most recent. Instead of saying "use the file with the extension _20140402", I want to say "use the most recent file" -- which is what I think the GDG does... I think a zero in parentheses (0) refers to the most recent member of the GDG. I need to read that article that Jaap suggested.
04-02-2014 10:50 AM
But you can quickly scan all the file names and determine which is the latest one and then use that file, especially easy if you have an extension with the date as above.
04-02-2014 11:00 AM
You can query dictionary.tables and find tables (datasets) with either the latest creation date or modification date within any specified library.
04-02-2014 11:10 AM
What Reeza and ballardw sugest would work, and I might do that if the GDG organization doesn't work. I'm currently leaning toward the GDGs because it seems like the more direct option -- saying "use the most recent dataset" instead of first finding the name of the most recent dataset, saving the name of that dataset in a macro variable, and then using the dataset with that name.
04-02-2014 11:11 AM
If your table names are suffixed with date in that nice yyyymmdd format then you can just take the maximum value of the table name.
data base_20140301 base_20140308 base_20140315 ;
proc sql noprint ;
select max(memname) into :base separated by ' '
where libname = 'WORK' and memname like 'BASE_%'
%put Last BASE table name is &base ;
04-02-2014 11:40 AM
wow what is this suddenly a living thread. A lot more of good suggestions.
1- GDG and Aging are at their best when you would need the moving content (like moving average)
eg: the last 6 weeks processes (one dataset every week), the last 30 days etc.
As you can combine that period of available datasets and there is no need to design an adding / delete with dates.
2- using a macro-var like Tom's proposed moddat is an ideal one when
a/ you can predict/validate the name.
Eg: Every workingday coming in one dataset. You can increment the expected one with SAS-date processing setting up the macro
b/ the naming convention is data related normally the system-date. I would prefer a simdate setting so you can test with time-travelling
That was the years of milo-testing when having common implemented that. But when you have testdata to be repeated it also very usefull.
3- processing the last date available as ballardw had pointed
This will work when the incoming datasets are SAS-datasets. (transferred binary or saved directly)
But I think the creation dates are being set by the sender not necessary (last update done) as the date naming as the dataset
4- Data-step processing for *.txt files containing dates.
Often associated with first making a directory listing containing dates. (reeza)
I get the feeling getting some sense of the requirements and logic the wanted approach is more like option 2.