agoldma
Pyrite | Level 9

I couldn't find an easy and efficient way in SAS to copy one dataset to another inside the same library.

#3 below seems best to me, but it's not obvious, and I couldn't find it explicitly documented anywhere.

Maybe there’s an easier way, so please comment.

I can think of 4 ways to copy one dataset to another inside the same library:

  1. Data step -- very easy, the first thing everyone learns, but it's kind of slow and wastes resources
  2. The X command -- using the operating system, but that's not really SAS
  3. One PROC DATASETS using an APPEND statement -- appending to a nonexistent dataset (deleting the destination dataset before appending)
  4. Three PROC DATASETS -- first to COPY to a different library, then CHANGE the name of the new dataset, then COPY to the original library

Any others?


Example of #3 (the destination dataset dest_ds needs to be empty or nonexistent, so I'm deleting it first):

proc datasets nolist lib=MyLib;
   delete dest_ds;
   append data=source_ds out=dest_ds;
quit;
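For comparison, here is a sketch of #4; it assumes WORK can serve as the intermediate library, and the dataset names are placeholders:

proc datasets nolist lib=MyLib;
   copy out=work;            /* 1: copy to a different library */
   select source_ds;
quit;

proc datasets nolist lib=work;
   change source_ds=dest_ds; /* 2: rename the copy */
quit;

proc datasets nolist lib=work;
   copy out=MyLib;           /* 3: copy back to the original library */
   select dest_ds;
quit;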

13 REPLIES
Kurt_Bremser
Super User

#2 is the best option, hands down. Everything else is crushing peanuts with an elephant's hoof, IMO.

If you want to do it inside SAS, #1 is preferred. A simple "data xxx; set yyy; run;" will not be slower than #3, because it is simply limited by I/O bandwidth.

Hint: Copying inside the same library will usually be slower than copying to another one, because you use the same HW for reading and writing. Different libs _may_ be located on different disks, giving you more aggregate bandwidth.

RW9
Diamond | Level 26

Sorry, I would disagree with that. #1 would be the better option, principally because with the X command there is no error checking done (unless you program it yourself), and the feedback from the operation is generally not looked at. You won't notice it so much with a copy, but if it fails (file not found, directory doesn't exist, no permissions, etc.) you may find that processes further down the line are using old data. It's even more dangerous with delete commands, as it can default in other areas or cause other unexpected problems. If you absolutely have to use DOS to move/copy/delete files, I would suggest either writing some QC code around it to check the existence of all items and post-process the messages coming back, or alternatively using an app (or coding one yourself) outside SAS. For anything else, use the data step, as it will report any errors directly in your log and will not continue until you fix them. #3 and #4 don't seem particularly intuitive or advantageous to my mind.

data_null__
Jade | Level 19

#3 should be faster than #1 because APPEND will use block reads and writes. Using PROC DELETE and PROC APPEND may be even faster than PROC DATASETS; that speed is the reason PROC DELETE was resurrected.

16         proc datasets library=work;
17            delete abc;
NOTE: The file WORK.ABC (memtype=DATA) was not found, but appears on a DELETE statement.
18            append base=abc data=zzz;
19            run;

NOTE: Appending WORK.ZZZ to WORK.ABC.
NOTE: BASE data set does not exist. DATA file is being copied to BASE file.
INFO: Engine's block-read method is in use.
INFO: Engine's block-write method is in use.
NOTE: There were 1 observations read from the data set WORK.ZZZ.
NOTE: The data set WORK.ABC has 1 observations and 6 variables.
20            quit;

NOTE: PROCEDURE DATASETS used (Total process time):
      real time           0.04 seconds
      cpu time            0.03 seconds
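For reference, a sketch of the standalone PROC DELETE / PROC APPEND variant; MyLib, source_ds, and dest_ds are placeholder names:

/* Remove the destination first; APPEND then creates it by
   block-copying DATA to the (nonexistent) BASE. */
proc delete data=MyLib.dest_ds;
run;

proc append base=MyLib.dest_ds data=MyLib.source_ds;
run;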
jakarman
Barite | Level 11

RW9, the #1 option is a SAS data step using the PDV: sequential processing with a lot of unneeded overhead.

I agree with Kurt, as copying at the OS level does not carry all that SAS overhead. In a smarter OS design this kind of copy could be propagated as a remote command that executes on the SAN directly; in that case you would not even notice any I/O. The best I/O for performance is no I/O at all. And trust the binary copy of an OS; when you cannot trust that, dump the whole thing as rubbish.

Do you need to do it with SAS statements?

Think about using PROC SQL to create a new table with all fields (a sketch below). That procedure is multithreaded, so the order of records can change; when that order is not relevant, no problem.
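A minimal sketch of that PROC SQL approach; MyLib, source_ds, and dest_ds are placeholder names:

/* Creates a new table with all fields; row order is not guaranteed. */
proc sql;
   create table MyLib.dest_ds as
      select * from MyLib.source_ds;
quit;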

That brings me back to your original question: why should you make an additional copy of an existing dataset?

Is it for archiving history? You have aging (PROC DATASETS): Base SAS(R) 9.3 Procedures Guide, Second Edition (little more than changing names).

Is it for maintaining a series of datasets? You have GDGs: SAS(R) 9.3 Language Reference: Concepts, Second Edition (every time a new version is created, the old one is kept).

With GDGs, the copy-and-rename is part of the concept (see the sketch below).
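A minimal sketch of SAS generation datasets; the library and member names are illustrative:

/* Keep up to 5 generations; each rerun pushes the previous
   version down a generation instead of overwriting it. */
data MyLib.monthly (genmax=5);
   set work.new_month;
run;

/* GENNUM=0 (the default) reads the newest generation,
   GENNUM=-1 the one before it. */
proc print data=MyLib.monthly (gennum=-1);
run;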

---->-- ja karman --<-----
agoldma
Pyrite | Level 9

Thank you for all of these responses.

I'm surprised to see how many people prefer #1 -- the simple data step.

Jaap asked a good question: why would anyone (in their right mind) want to copy a dataset inside the same library?

The simple answer is because this is what some people are used to doing -- those people whose first operating system was Windows. I often make a simple-default copy of a file in the same folder, then rename it to a new name, and proceed to modify the new file.

In my current project I'm doing the same thing automatically (automating waste).

I'm downloading many monthly datasets into a folder/library, each with a date in its name, but I want to use only the most recent dataset in my program, so I'm copying the most recent dataset to another dataset (in the same folder) without the date in the name. In my program, I'm using the most recent dataset -- the one without the date in the name.

Jaap correctly suggested that this is a job for a Generational Data Group, so I really need to look into that. Maybe that's why I'm having difficulties -- because I should be using a GDG.

Tom
Super User

Why not just set a macro variable to the name of the dataset and avoid copying it at all?

You can automate the detection of which is the most recent by using the date embedded in the member name or by using other metadata such as MODATE.

agoldma
Pyrite | Level 9

I'm using several macro variables, but I can't use the specific one that Tom suggested, because I don't know which date is the most recent. Instead of saying "use the file with the suffix _20140402", I want to say "use the most recent file" -- which is what I think the GDG does... I think a zero in parentheses, (0), refers to the most recent member of the GDG. I need to read that documentation Jaap suggested.

Reeza
Super User

But you can quickly scan all the file names, determine which one is the latest, and then use that file -- especially easy if the names carry a date suffix as above.

ballardw
Super User

You can query dictionary.tables and find tables (datasets) with either the latest creation date or modification date within any specified library.
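A minimal sketch of that query, assuming the library is called MYLIB and the modification date is the right criterion:

proc sql noprint;
   /* Keep only the member with the latest modification date. */
   select memname into :latest trimmed
      from dictionary.tables
      where libname='MYLIB' and memtype='DATA'
      having modate=max(modate);
quit;

%put Most recent dataset: &latest;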

agoldma
Pyrite | Level 9

What Reeza and ballardw suggest would work, and I might do that if the GDG organization doesn't work out. I'm currently leaning toward GDGs because that seems like the more direct option -- saying "use the most recent dataset" instead of first finding the name of the most recent dataset, saving that name in a macro variable, and then using the dataset with that name.

Tom
Super User

If your table names are suffixed with the date in that nice yyyymmdd format, then you can just take the maximum value of the table name.

data base_20140301 base_20140308 base_20140315;
  x=1;
run;

proc sql noprint;
  select max(memname) into :base separated by ' '
    from dictionary.members
    where libname = 'WORK' and memname like 'BASE_%'
  ;
quit;

%put Last BASE table name is &base;

jakarman
Barite | Level 11

Wow, this has suddenly become a living thread, with a lot more good suggestions.
The options:

1. GDG and aging are at their best when you need moving content (like a moving average), e.g. the last 6 weeks of processing (one dataset every week), the last 30 days, etc. You can combine that whole period of available datasets, and there is no need to design add/delete logic around dates.

2. Using a macro variable as Tom proposed (based on MODATE) is ideal when:

  a) you can predict/validate the name, e.g. one dataset coming in every working day; you can increment the expected name with SAS date processing when setting up the macro variable (a sketch follows below);

  b) the naming convention is data-related, normally the system date. I would prefer a simulated-date setting so you can test with time travelling. That goes back to the years of milo-testing, when we had that commonly implemented; but it is also very useful when you have test data that must be repeatable.

3. Processing the last date available, as ballardw pointed out. This will work when the incoming datasets are SAS datasets (transferred in binary or saved directly). But I think the creation dates are set by the sender and do not necessarily match (as the last update done) the date in the dataset's name.

4. Data step processing for *.txt files containing dates, often associated with first making a directory listing containing dates (Reeza).

Getting some sense of the requirements and the logic, I get the feeling the wanted approach is more like option 2.
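A sketch of option 2a, deriving the expected member name from the run date; the base name and date format here are illustrative assumptions:

/* Build today's expected member name, e.g. base_20140402. */
%let expected = base_%sysfunc(today(), yymmddn8.);
%put Expecting dataset work.&expected (exists: %sysfunc(exist(work.&expected)));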

---->-- ja karman --<-----

