DATA Step, Macro, Functions and more

Necessary to repeat sorting?

Reply
N/A
Posts: 0

Necessary to repeat sorting?

My data set have 4 variables: Identity, Year, Gender and Amount. Each person(Identity) can have more than one record.

First I sort them

BY YEAR IDENTITY GENDER AMOUNT;

There are 25 different years in the data set. In the next data step I divide the data into 25 separate sets, YEAR_1 - YEAR25, with each year in its own set. I do nothing that affects the ordering whatsoever.

Then I want to count the number of individuals in each year by using FIRST.IDENTITY.

So I put BY IDENTITY in the code like this:

DATA Counting1;
SET YEAR_1;
BY Identity
/* Counting */
RUN;

and so on, for each year.

I get the error message that the sets are not ordered by Identity.

Is it it really necessary to use PROC SORT on data sets YEAR_1 - YEAR_25 before the counting?

I haven't done anything that should have affected the RELATIVE individual ordering of persons(variable Identity).

Susan
Super Contributor
Posts: 474

Re: Necessary to repeat sorting?

Hello Susan.

Data sorted by YEAR IDENTITY GENDER AMOUNT is in different thant by IDENTITY.

For example

YEAR IDENTITY GENDER AMOUNT
2008 1 M 10
2008 2 F 11
2009 1 F 12
2009 2 F 13

is sorted by YEAR IDENTITY GENDER AMOUNT not sorted by IDENTITY.

Try to look the IDENTITY column alone, you can easily see that it is not sorted.
YEAR IDENTITY GENDER AMOUNT
. 1 . .
. 2 . .
. 1 . .
. 2 . .

At row 3 the sequence is broke, as 1 (row 3) is lower than 2 (row 2).

Greetings from Portugal.

Daniel Santos at www.cgd.pt
N/A
Posts: 0

Re: Necessary to repeat sorting?

First I sorted the one and only original file BY YEAR IDENTITY GENDER AMOUNT;

If the years are 1976 - 2000 then the file should look like this

Year Id ....
1976 2
1976 4
1976 4
.
.
.

1977 1
1977 4
1977 6
.
.
.

1978 2
1978 3
1978 3
1978 4
1978 5
1978 6
.
.
.

Then I make 25 separate data sets for each year. The first three should start the same way as above and the other 22 sets in the same manner. I haven't changed the order.

I claim that the Identity column is ordered in each year set, and the counting is done separately in each year set.

I don't understand what you mean. Read my first message once more.

Susan
N/A
Posts: 0

Re: Necessary to repeat sorting?

My 1:58 PM message was directed to Daniel Santos.
Super Contributor
Super Contributor
Posts: 3,174

Re: Necessary to repeat sorting?

Suggest you share your SAS log output with source code revealed for further feedback/comment from *ANY* forum subscriber.

Scott Barry
SBBWorks, Inc.
Super Contributor
Super Contributor
Posts: 3,174

Re: Necessary to repeat sorting?

Your SAS log output should have sufficient diagnostic information to tell you what observation was the problem -- that information typically helps me identify my programming oversight, working back from the DATA step that generated the error showing my data. No, you should not need to sort your file by YEAR, presuming that there is only one YEAR value -- suggest doing a PROC FREQ just prior to the failing DATA step to be sure. Good desk-checking and self-initiated diagnostics nearly always save the day and improve efficiency.

Scott Barry
SBBWorks, Inc.
Trusted Advisor
Posts: 2,113

Re: Necessary to repeat sorting?

Susan,

In addition to Scott's comments, I'll add a couple.

You can do a PROC CONTENTS on YEAR_1 to see how SAS thinks it is sorted.

If you broke the dataset up into pieces using PROC SQL, it could have re-arranged the data without your knowledge as part of its internal optimization. The only way to guarantee that SQL maintains an order is to use the ORDER BY clause in the SELECT statement. (This "feature" is why you can't implement the LAG functionality within SQL.)

Doc Muhlbaier
Duke
Super Contributor
Posts: 474

Re: Necessary to repeat sorting?

OK, my mistake.
If you break the dataset by YEAR, the IDENTITY variable will maintain the ordering sequence (ordered), IF and ONLY IF the dataset is splitted sequentially, say processed through datastep (which processes each OBS sequentially). As Doc@Duke, said, you cannot assume the ordering sequence with PROC SQL, unless you explicitly say so. That is because SQL uses an optimizer that will choose the best strategy/technique to perform the task, and this may not involve a sequential processing of the dataset.

I agree with Scott, before making any further assumption about your task and the way you are doing it, you should share with us a bit of your code.

Greetings from Portugal.

Daniel Santos at www.cgd.pt
SAS Employee
Posts: 160

Re: Necessary to repeat sorting?

try

DATA Counting1;
SET YEAR_1(sortedby=Identity);
BY Identity
/* Counting */
RUN;

http://support.sas.com/onlinedoc/913/getDoc/en/lrdict.hlp/a000131184.htm

Why do you want to make 25 datasets, if you just want to count the number of individuals in each year ? You can do that with just your initial dataset.
N/A
Posts: 0

Re: Necessary to repeat sorting?

Spot on Geniz.

Every SAS data set has details held in a descriptor portion. This is a little header with details about the dataset. One of these details is the variable it is sorted by. When you run a proc sort SAS updates the sortedby value to reflect the variables the data has been sorted by.

When you run your SQL to break the data up the resulting data sets do not have the sortedby value set in the descriptor.

If you know for sure that the data is sorted in a particular way then you can set the sortedby value yourself as Geniz has pointed out, to save you running another proc sort.

If, when SAS comes to process the dataset, it finds the data is not actually sorted the way you say it is, it will throw an error.
Contributor
Posts: 49

Re: Necessary to repeat sorting?

Susan,

If you just want to count the number of unique ID per YEAR why not just use something like:

PROC SQL;
SELECT DISTINCT year
, COUNT(DISTINCT id) AS n_identifier
FROM alldata
GROUP BY year
;
QUIT;
Ask a Question
Discussion stats
  • 10 replies
  • 250 views
  • 0 likes
  • 6 in conversation