BookmarkSubscribeRSS Feed
_maldini_
Barite | Level 11

I am merging 2 SAS datasets by subject ID. The subject ID (FB_ID in attached screen shot) is stored as a character variable w/ length 20 in both datasets. There are roughly 3,900 observations in the merged dataset. Approximately 300 of the subject IDs appear in the dataset in scientific notation, despite the fact that the variable is a character varaible. Each observation associated w/ a subject ID appearing in scientific notation has missing values for the variables in one of the datasets. 

 

It is possible that the subject IDs appearing in scientific notation are exclusive to one of the datasets. This would explain why the values of the other variables would be missing. However, I can't verify this because I can't convert the scientifc notation, even when using a character format (FORMAT FB_ID $char20.).

 

Also, I have no idea why they appear in scientific notation.

 

When looking at the individual datasets (.csv), prior to merging, neither one of them have any subject IDs store in scientific notation. This notation is only found in the merged dataset.

 

 

Does this make sense to anyone? Thanks for your help.

 


Scientific notation.jpg
5 REPLIES 5
ballardw
Super User

I would back track the data in processing sequence to see where these first appear.

data test;

  set dsname;

   where index(FB_id, 'E') > 0;

run;

should find these. Try different data sets in the processing sequence to find the culprit dataset. The find out how that set was made.

 

That will likely give you the best place to address the issue.

 

 

_maldini_
Barite | Level 11

@ballardw Thanks for your suggestion. It is helpful.

 

Using your code, I found that the values in scientific notation appear in one of the sorted datasets (i.e. "challenge_sort") , but not in the unsorted dataset (i.e. "challenge"). Again, this makes no sense to me. Why would it matter whether the data were sorted by this variable for not?

 

This is the syntax that I am using to sort the datasets prior to the merge. This is just for the one dataset containing the scientific notation.

 

PROC SORT DATA=challenge OUT=challenge_sort;
BY FB_ID;
RUN;
ballardw
Super User

jcorroon wrote:

 

@ballardw Thanks for your suggestion. It is helpful.

 

Using your code, I found that the values in scientific notation appear in one of the sorted datasets (i.e. "challenge_sort") , but not in the unsorted dataset (i.e. "challenge"). Again, this makes no sense to me. Why would it matter whether the data were sorted by this variable for not?

 

This is the syntax that I am using to sort the datasets prior to the merge. This is just for the one dataset containing the scientific notation.

 

PROC SORT DATA=challenge OUT=challenge_sort;
BY FB_ID;
RUN;

I might suspect that you may have had something overwrite the simple sort you show from your description. I would look for a bit of

code similar to:

Data challenge_sort;

   set challenge_sort;

<other code>;

 

and see if there were any rename and assignments playing with the ID variable (create character from numeric).

 

Something else might be to compare the system dates on the input and output sets. It may be that Challenge was remade at a later date than challenge_sort and that later version now had the Id as a numeric.

 

A good flow chart of the intended order of operations tagged to the program files might be in order.

Kurt_Bremser
Super User

Take a thorough look at the log of the whole process, and scan for NOTEs about automatic conversion from character to numeric and vice versa. You might have introduced one or more by accident.

FreelanceReinh
Jade | Level 19

@_maldini_: Is it possible that Excel was involved at any point in the process? I'm asking because '1.00005E+14' (shown in your screenshot) is not typical for SAS ('1.0000457E14'), but it is the default format in which Excel (2013) displays a number like 100004567890123.

 

Your screenshot also shows a permanent informat $20. This suggests that raw data have been imported in an automated way (PROC IMPORT?), which is always a great opportunity to get inconsistent variable types.

 

Why does FB_ID have a length of 20 at all? The values shown in your screenshot have only 15 digits. If there were longer values, it would even be risky to store them as numeric values (in SAS, let alone in Excel) because of limitations in numerical accuracy.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 3083 views
  • 1 like
  • 4 in conversation