Solved: SCYP Training: Level 2 Practice: Sorting Data to Remove Duplicate Rows

ncd · Posted 01-28-2021 04:19 PM

Hi all,

I am trying to do the practice on Dashboard/ My courses/ SAS Programming 1: Essentials/ Lessons/ Lesson 3: Exploring and Validating Data.

The code I wrote is:

proc sort data=PG1.np_largeparks nodupkey out=park_clean dupout=park_dups;
by _all_;
run;

and the code solution says:

proc sort data=pg1.np_largeparks
          out=park_clean
          dupout=park_dups
          nodupkey;
    by _all_;
run;

Unfortunately, neither of them works. I pasted the log below. I can't figure out why it reports 0 observations. The solution says there should be 30 duplicates.

Thanks,

Cagri

ballardw · Posted 01-28-2021 04:43 PM

How many records were in PG1.np_largeparks when it was created at the set up of the training data sets?

I suspect an earlier PROC SORT with NODUPKEY but without OUT= sorted the data set in place (see the note in the log saying the data set is already sorted?) and already deleted the duplicate records. So there is nothing left to remove now.
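
One quick way to answer that question is to count the rows currently in the table. This is a minimal sketch, assuming the PG1 library from the course setup is already assigned; the column alias n_rows is only illustrative.

```sas
/* Count the rows currently stored in the course table. */
proc sql;
    select count(*) as n_rows
    from pg1.np_largeparks;
quit;
```

According to the replies later in this thread, a freshly created copy should have 153 rows; a count of 123 would suggest the duplicates were already removed in place.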

ncd · Posted 01-29-2021 08:39 AM

Interestingly enough, the data set had only 123 obs from the beginning. Somehow the deduplicated output overwrote the original file. I have now set the data up again from scratch, and the original file has 153 obs. Thanks for the quick reply.

Cynthia_sas · Posted 01-28-2021 05:28 PM

Hi:
If you want to restore the data back to the start point of class, all you need to do is rerun the program that makes the data. If you rerun the program (as you did when you initially set up the data), the class files will be refreshed.
As my LOG shows, right after the class data is created you should start with 153 rows in PG1.NP_LARGEPARKS, 30 of which are duplicate rows. So it appears that you've already deleted the dups from the LARGEPARKS data table.
Cynthia

ncd · Posted 01-29-2021 08:40 AM

Dear Cynthia, thank you so much.

I have rerun the program that creates the data, and now I have 153 obs in the raw file. Somehow the deduplicated file overwrote the original while I was working on it, or some other glitch occurred. Thanks for your help.

ballardw · Posted 01-29-2021 10:13 AM

When you do not use the OUT= option, PROC SORT replaces the input data set.

It is quite typical for people to use

Proc sort data=somedataset;
   by thisvar thatvar;
run;

Which sorts in place, i.e. replaces the original set with one sorted.

But if you use

Proc sort data=somedataset nodupkey;
   by thisvar thatvar;
run;

Then it replaces the data set with one sorted and with the duplicates removed.

This is the designed behavior and not a "glitch".

You would not be the first person to unintentionally delete records. Ask me how I know 😳
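
A minimal sketch of the non-destructive alternative, reusing the placeholder names somedataset, thisvar, and thatvar from the examples above. The WORK output names are hypothetical and chosen only for illustration.

```sas
/* OUT= sends the sorted, de-duplicated rows to a new data set, so
   somedataset itself is not replaced. DUPOUT= keeps the rows that
   NODUPKEY removed, in case they need to be inspected later.
   work.sorted_nodup and work.removed_rows are hypothetical names. */
proc sort data=somedataset
          out=work.sorted_nodup
          dupout=work.removed_rows
          nodupkey;
    by thisvar thatvar;
run;
```

With OUT= present, rerunning the step only rewrites the two WORK data sets, so the source keeps all of its original rows.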

SASRB · Posted 08-01-2023 10:57 AM

Hello,

Let me ask another question in this regard. Why was there neither an error nor any discrepancy in the output data when I put NODUPKEY before DUPOUT=park_dups?

How can I tell when the order of options is strict and when I can be "creative"?

1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 72         
 73         
 74         proc sort data=pg1.np_largeparks out=park_clean
 75         nodupkey dupout=park_dups;
 76         by _all_;
 77         run;
 
 NOTE: There were 153 observations read from the data set PG1.NP_LARGEPARKS.
 NOTE: 30 observations with duplicate key values were deleted.
 NOTE: The data set WORK.PARK_CLEAN has 123 observations and 5 variables.
 NOTE: The data set WORK.PARK_DUPS has 30 observations and 5 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           0.00 seconds
       user cpu time       0.01 seconds

Thank you.

Cynthia_sas · Posted 08-01-2023 01:02 PM

Hi:

We recommend that you refer to the documentation to find whether an option for a procedure is required to be specified a certain way. Here are 3 different invocations of PROC SORT. Note that all 3 invocations work, even if the options like DATA=, OUT=, DUPOUT= and NODUPKEY are listed in a different order each time:
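
The code from the original reply is not reproduced in this thread. As a rough sketch of what three such invocations might look like, assuming the PG1.NP_LARGEPARKS table and the output names used earlier in the thread:

```sas
/* Variation 1: DATA= and OUT= first, NODUPKEY last */
proc sort data=pg1.np_largeparks out=park_clean dupout=park_dups nodupkey;
    by _all_;
run;

/* Variation 2: NODUPKEY first, DATA= last */
proc sort nodupkey dupout=park_dups out=park_clean data=pg1.np_largeparks;
    by _all_;
run;

/* Variation 3: options interleaved */
proc sort dupout=park_dups data=pg1.np_largeparks nodupkey out=park_clean;
    by _all_;
run;
```

Each step reads the same input and overwrites WORK.PARK_CLEAN and WORK.PARK_DUPS, so all three should report the same counts in the log as long as the source table has not been modified in between.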

Generally, the procedure name must come right after the keyword PROC, and the remaining options on the PROC statement can usually be specified in any order. As a best practice, I always put the DATA= and OUT= options first when I code PROC SORT, but even DATA= is optional: if you omit it, SAS uses the most recently created data set, which is tracked by the _LAST_= system option.
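
To illustrate the point about omitting DATA=, here is a small sketch. The data set names work.demo and work.demo_nodup and the sample rows are made up for this example and are not part of the course data.

```sas
/* Create a small, hypothetical data set with one repeated row. */
data work.demo;
    input id name $;
    datalines;
1 a
1 a
2 b
;
run;

/* No DATA= option: PROC SORT reads the most recently created
   data set, which is WORK.DEMO here. OUT= is still specified so
   that WORK.DEMO itself is not overwritten. */
proc sort out=work.demo_nodup nodupkey;
    by _all_;
run;
```

Relying on the most recently created data set is easy to get wrong in longer programs, which is one reason to keep writing DATA= explicitly.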

Cynthia

SASRB · Posted 08-03-2023 06:30 AM

Thank you for your clarification.
It's good to know that the same outcome can be reached in slightly different ways.
