BookmarkSubscribeRSS Feed
curiosity
Obsidian | Level 7

Hi,

 

I have a friend who uses SAS Studio and she has sent me some code to run. I am using SAS 9.4

The code is mostly conditional datasteps (if-then statements) and proc sql and it is about 400 lines long.

However, while we end up with similar data sets in terms of number of observations and variables, some of our observations are slightly different.

 

The logs are clear except for this line that appears every now and then:
NOTE: The query requires remerging summary statistics back with the original data.

 

There are no errors at all when running the code.

 

Is there something I can do about this to follow certain steps to make sure we can get the same results from a set of code?

13 REPLIES 13
Patrick
Opal | Level 21

In principal the results should be the same. 

There could be environment differences like your friend running code with UTF-8 session encoding and you with ANSI (single byte encode environment). If that's the case then do you see any "funny" characters in your data? Such issues could cause character based expressions to return undesired results.

 

"some of our observations are slightly different"

What are the differences? Some rounding? Please be very specific and share a few examples.

Also try to identify the exact row where the differences occur and ideally find the expression in your code/data step for this row and data used that causes this difference.

 

"NOTE: The query requires remerging summary statistics back with the original data."

That's only a SAS note you get for certain SQL syntax. It's nothing to worry about and only worth looking into if it's about improving performance.

curiosity
Obsidian | Level 7

@Patrick wrote:

In principal the results should be the same. 

There could be environment differences like your friend running code with UTF-8 session encoding and you with ANSI (single byte encode environment). If that's the case then do you see any "funny" characters in your data? Such issues could cause character based expressions to return undesired results.

Thanks Patrick. You're right that there is some encoding issues in the data.


I import the data from a CSV into SAS 9.4 (Unicode) and clean the data and send my colleague the cleaned SAS dataset file.

She uses SAS onDemand for Academics. She uses my input dataset and writes code to do some work on it.

She sends me the final code to get to her completed dataset and shows me screenshots of her final dataset demonstrating the required results. 

I run her code on my local machine on 3 different versions: SAS 9.4 (Unicode), SAS 9.4 and SAS Studio. and then finally online on SAS onDemand.

 

When I import the data on my local machine on SAS 9.4 (Unicode) and run her code , it completes without any errors in the log but incorrect results.

The logic is completely garbled.

 

When I run the code on SAS 9.4 or SAS Studio locally:

 NOTE: Data file TEMP.EXPORTS.DATA is in a format that is native to another host, or the file 
       encoding does not match the session encoding. Cross Environment Data Access will be used, which 
       might require additional CPU resources and might reduce performance.
 79         run;
 
 ERROR: Some character data was lost during transcoding in the dataset TEMP.EXPORTS. Either the 
        data contains characters that are not representable in the new encoding or truncation occurred 
        during transcoding.

 

My friend is using SAS on Demand for Academics so I finally gave in and registered an account on there and I uploaded my dataset then ran her code and the results were perfect, similar to what she said.

 

Interestingly, the final dataset on my SAS 9.4 (Unicode) has the correct number of observations and variables but the values of certain observations in certain variables are not correct which indicates that the logic steps did not complete properly.

Since I am using more data/set steps and proc sql I read that it could be due to sorting issues so now I attempted to sort the data before and after every datastep/proc sql batch but still not luck.

 

 

 

"some of our observations are slightly different"

What are the differences? Some rounding? Please be very specific and share a few examples.

 

I have 8 variables that get chosen based on conditional logic of the initial dataset that is iterated over 20 times to end up with a final data set.

 

Some variables are created on the way to flag but the final dataset is the same number of variables as we started with, just fewer observations.

The logic moves around the values between observations and deletes the observations that are not useful anymore.

The final value in each observation of a particular variable is based on the logic of the prior dataset.

 

For example, if I have 3 variables Timestamp Ticker1 Ticker2 Ticker3 for each observation and after the logic steps when I compare my final dataset with my colleagues: for example under Ticker1, I see GOOG but when my friends code is run she shows, correctly, AAPL.

So somewhere in the logic we're getting different results. 

 

 

Also try to identify the exact row where the differences occur and ideally find the expression in your code/data step for this row and data used that causes this difference.

This is what I'm trying to do today. I don't want to just use this SAS on Demand if i don't have to for extended periods. I didn't even know there was a free version like this of SAS till now.  Its important for me to find out what is happening so I can trust my future results.

 

 

"NOTE: The query requires remerging summary statistics back with the original data."

That's only a SAS note you get for certain SQL syntax. It's nothing to worry about and only worth looking into if it's about improving performance.


Thank you.

FreelanceReinh
Jade | Level 19

Hi @curiosity,

 


@curiosity wrote:

The code is mostly conditional datasteps (if-then statements) and proc sql and it is about 400 lines long.


In addition to possibly different settings of SAS system options, PROC SQL is a very good candidate for creating "slightly different" results (or even greater discrepancies) on different machines. This is because "in the absence of an ORDER BY clause, the processor is not obligated to deliver results in any particular order" (Howard Schreier [2008]: PROC SQL by Example, p. 74). So the order of observations can be different at times. This, in turn, can have an impact, e.g., on the summary statistics that you mentioned:

48    data _null_;
49    if .1+.2+.3 ne .2+.3+.1 then put 'Surprised?';
50    run;

Surprised?
NOTE: DATA statement used (Total process time):

But, obviously, as soon as IF conditions like x<=0.6 on such a summary statistic (x=.1+.2+.3 vs. x=.2+.3+.1) yield different results (here: FALSE in the first case, TRUE in the second, using Windows SAS 9.4M5), the differences that can arise downstream are virtually unlimited.

 

Also note that "SAS Studio" is just the user interface. The relevant (SAS 9.4) maintenance releases or even the operating systems might be different.

Rick_SAS
SAS Super FREQ

In addition to the excellent comments by @FreelanceReinh and others, there are computational issues related to the actual SAS Workspace Server that is running the code.

 

For example, the SAS 9 code might be running locally on a PC that is running Windows whereas the SAS Studio code might be running on a Linux server. Different chip sets and different computers can lead to different results, although the differences are usually small (such as 1E-14).

curiosity
Obsidian | Level 7
Thank you, I did notice this when I started to do the proc compare part. The value differences were miniscule and when I say there's differences between the two outputs, I left out these small differences.
SASKiwi
PROC Star

From experience, I've also seen differences arising between SAS version and maintenance releases even on the same OS. It is quite possible that you and your friend are using different maintenance releases of 9.4. The best way to ensure you have the same output is to use PROC COMPARE to compare output datasets. This will highlight all structural and data differences within a SAS dataset.

curiosity
Obsidian | Level 7

Thank you for your reply.

 

I've spent the greater part of the day following this suggestion.  I'm not well versed in proc sql but i notice the first place the datasets diverge is with this set of code:

proc sql;

create table input_data3 as
select *,sum(flag) as flag_tot
from input_data2
group by id,flag
order by id,flag desc;

quit;

In the input_data2 file, the PROC COMPARE on both SAS 9.4 (Unicode) and SAS OnDemand give the exact same results


Then I run the above sql code also on  both SAS 9.4 (Unicode) and SAS OnDemand and the results are different and proc compare shows that about 25% of the observations have different values across multiple variables.

 

Is there anything about this code that should run differently on these two SAS systems?
I tried to use proc sort by id, desc flag prior to the proc sql and still see differences in the two systems.

FreelanceReinh
Jade | Level 19

@curiosity wrote:

(...)

proc sql;

create table input_data3 as
select *,sum(flag) as flag_tot
from input_data2
group by id,flag
order by id,flag desc;

quit;

(...)

Is there anything about this code that should run differently on these two SAS systems?
I tried to use proc sort by id, desc flag prior to the proc sql and still see differences in the two systems.


You will need to sort in (or after) the PROC SQL step. For a successful check of equality of two datasets their sort order must be uniquely determined. To achieve this, you will probably want to add more items to the ORDER BY clause because "select *, ..." suggests that ID and FLAG are not the only variables in input_data2.

 

Let me create an example dataset input_data2 with only one additional variable X:

data input_data2;
do id=1 to 4;
  do flag=0 to 1;
    do x=1 to 3;
      output;
    end;
  end;
end;
run;

Here are the first 12 observations of the resulting input_data3 after running your PROC SQL step on this dataset under Windows SAS 9.4M5:

Obs    id    flag    x    flag_tot

  1     1      1     3        3
  2     1      1     1        3
  3     1      1     2        3
  4     1      0     2        0
  5     1      0     3        0
  6     1      0     1        0
  7     2      1     1        3
  8     2      1     2        3
  9     2      1     3        3
 10     2      0     2        0
 11     2      0     1        0
 12     2      0     3        0

The "random" permutations of the (previously sorted) values 1, 2, 3 of X in the first four ID-FLAG BY groups illustrate the quote by Howard Schreier in my earlier post, especially when compared to the results obtained on input_data2(obs=18), i.e., the first 18 observations (comprising IDs 1, 2 and 3) of the original dataset. They start like this:

Obs    id    flag    x    flag_tot

  1     1      1     2        3
  2     1      1     1        3
  3     1      1     3        3
  4     1      0     2        0
  5     1      0     3        0
  6     1      0     1        0
  ...

The fact that just omitting ID 4 led to a different sort order of observations with ID=1 demonstrates how unstable these results are as long as X is not included in the ORDER BY clause.

 

curiosity
Obsidian | Level 7

 


The fact that just omitting ID 4 led to a different sort order of observations with ID=1 demonstrates how unstable these results are as long as X is not included in the ORDER BY clause.

 


 

Thank you for taking the time to explain it in such detail.


I am going to try and go over all the proc sql steps and find out what the issue is.
When i did the sort by _all_ it pushed the problem down a few lines so I think the issue has to do with sorting prior, using the by part correctly and sorting after the SQL command as well.

Patrick
Opal | Level 21

If you can run a Proc Compare then one of the SAS tables must be coming form another environment and though these very small numerical precisions could still exist.

With Proc Compare use a FUZZ factor - something like: fuzz=0.000000000001

 

If the results still differ and the columns in your order by statement - id,flag desc - don't uniquely identify a record then ensure that both source tables are sorted exactly the same before using Proc Compare.

proc sort data=<source table> out=for_comparison;
by _all_;
run;

 Even though the numerical differences can be very small, they "hurt" of course still in comparisons like <variable><=0.1

SASKiwi
PROC Star

What does your SAS log report for this query. Are you getting data re-merging notes by any chance? If so then I think you need to recode this step so you don't get them. You are likely to find that recoding this step will also resolve the data differences. 

curiosity
Obsidian | Level 7

 


@SASKiwi wrote:

What does your SAS log report for this query. Are you getting data re-merging notes by any chance? If so then I think you need to recode this step so you don't get them. You are likely to find that recoding this step will also resolve the data differences. 


I spent days trying to fix this and instead just started from scratch and on a Window system and it worked. And that code works on SAS OnDemand as well so don't really know what the reason why the initial code that worked on SAS OnDemand didn't work on SAS 9.4 Unicode was but its possible it was due to the 're-merging; aspect you mentioned

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1907 views
  • 11 likes
  • 6 in conversation