BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LPark
Calcite | Level 5

Greetings.

I have dissected and re-written this code in with checks for hours, and am completely stuck as to why the errors are occurring. This code involves taking combinations of cancer histology and primary site codes and mapping them to specific cancer diagnosis types:

data lymphoma2;

     merge lymphoma1 (in=in1 keep=sitecd histocd)

          map (keep=histocd diagcd);    

     by histocd;

     if in1;

     if 700<=sitecd<=729 then diagcd=38;

run;

in the above code, the merge takes the diagnosis codes (diagcd) from dataset "map" and maps them to the histology codes (histcd) in the cancer records from the dataset "lymphoma1". The if statement then takes records with site codes (sitecd) between 700 and 729 and hard codes them specifically to diagcd=38, instead of using the diagcd in the "map" dataset. However, after this data step, there are records where diagcd=38 but not the correct sitecd combination (ie, records with sitecd=770 were also mapped to diagcd=38). What is strange is if I had created a new variable, I wouldn't have the same problem:

data lymphoma2;

     merge lymphoma1 (in=in1 keep=sitecd histocd)

          map (keep=histocd diagcd);    

     by histocd;

     if in1;

     newcd=diagcd;

     if 700<=sitecd<=729 then newcd=38;

run;

I would really appreciate any insight as to why this error occurs. Thank you in advance for your help!

1 ACCEPTED SOLUTION

Accepted Solutions
Quentin
Super User

Hi,

Would I be correct to guess that this is a one-to-many merge?  That is, map is unique by histocd but lymphoma1 is not?

When you merge datasets using a BY statement, variables read from a data set are implicitly retained.  The PDV is initialized when the BY group changes.

In a one to many merge, if you reassign a value in the "one" dataset, that value will be retained until the by-group changes.  When I took a SAS training course, they had us work through the merge process and program data vector with paper, to understand this.

Here's an example of your problem:

data many;
input id test score;
cards;
1 1 10
1 2 15
1 3 20
2 1 10
2 2 15
2 3 20
;

data one;
input id gender $1;
cards;
1 M
2 F
;
run;

71   data want;
72     merge many (keep=id test score)
73           one  (keep=id gender)
74     ;
75     by id;
76     if test=2 then gender='X';
77     put (_all_)(=);
78   run;

id=1 test=1 score=10 gender=1
id=1 test=2 score=15 gender=X
id=1 test=3 score=20 gender=X
id=2 test=1 score=10 gender=2
id=2 test=2 score=15 gender=X
id=2 test=3 score=20 gender=X
NOTE: There were 6 observations read from the data set WORK.MANY.
NOTE: There were 2 observations read from the data set WORK.ONE.
NOTE: The data set WORK.WANT has 6 observations and 4 variables.

When you create a new variable, that variable is NOT retained, so everything works as you want.

A "rule" might be to avoid changing the values of any variables when you merge data in a one-to-many merge.  Another sign that you may be headed for trouble is when something just feels odd.  In my above example, why would I recode a patient-attribute (gender), based on a test-level (test) attribute?  In your example, why recode diagnostic code based on site id?

HTH,

--Q.

BASUG is hosting free webinars Next up: Jane Eslinger presenting PROC REPORT and the ODS EXCEL destination on Mar 27 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

View solution in original post

8 REPLIES 8
Tom
Super User Tom
Super User

I don't understand the question. Here is sample data that I think represents your "problem".

What value do you want for DIAGCD for the first observation if not the value from MAP?

data lymphoma1 ;

input histocd sitecd ;

cards;

1 700

2 770

;;;;

data map ;

  input histocd diagcd ;

cards;

1 0

2 38

;;;;

data lymphoma2;

     merge lymphoma1 (in=in1 keep=sitecd histocd)

          map (keep=histocd diagcd);  

     by histocd;

     if in1;

     if 700<=sitecd<=729 then diagcd=38;

put (_all_) (=);

run;


histocd=1 sitecd=700 diagcd=38

histocd=2 sitecd=770 diagcd=38

LPark
Calcite | Level 5

Dear Tom,

Thank you for responding to my plea for help. Let's change up the example:

data lymphoma1 ;

input histocd sitecd ;

cards;

1 700

2 770

;;;;

data map ;

  input histocd diagcd ;

cards;

1 10

2 11

;;;;


Then based on my code, I would expect the final result:

histocd=1 sitecd=700 diagcd=38 (this is the result of the hard code since sitecd=700)

histocd=2 sitecd=770 diagcd=11 (this comes straight from the map)

but when I run the code with the datasets I attached earlier, diagcd for records with sitecd outside the 700-729 range get changed to 38.

Please let me know if I can clarify further.

Tom
Super User Tom
Super User

If you reset the value of the IN= variable at the top of the data step loop you can see when new records are actually read from the dataset.

This will explain why value for DIAGCD are carried forward onto other observations for the same HISTOCD group.  A new record is NOT read from the MAP dataset to get back to the old value of DIAGCD.

data lymphoma ;

input histocd sitecd ;

cards;

1 600

1 700

1 770

;;;;

data map ;

  input histocd diagcd ;

cards;

1 11

;;;;

libname a 'c:\downloads';

data lymphoma2;

in1=0;

in2=0;

     merge lymphoma (in=in1 keep=sitecd histocd)

          map (in=in2 keep=histocd diagcd);

     by histocd;

if first.histocd then n=0;

n+1;

put (in1 in2 n histocd sitecd diagcd ) (=) '-> ' @;

if 700<=sitecd<=729 then diagcd=38;

put diagcd= ;

run;

in1=1 in2=1 n=1 histocd=1 sitecd=600 diagcd=11 -> diagcd=11

in1=1 in2=0 n=2 histocd=1 sitecd=700 diagcd=11 -> diagcd=38

in1=1 in2=0 n=3 histocd=1 sitecd=770 diagcd=38 -> diagcd=38

Quentin
Super User

Hi,

Would I be correct to guess that this is a one-to-many merge?  That is, map is unique by histocd but lymphoma1 is not?

When you merge datasets using a BY statement, variables read from a data set are implicitly retained.  The PDV is initialized when the BY group changes.

In a one to many merge, if you reassign a value in the "one" dataset, that value will be retained until the by-group changes.  When I took a SAS training course, they had us work through the merge process and program data vector with paper, to understand this.

Here's an example of your problem:

data many;
input id test score;
cards;
1 1 10
1 2 15
1 3 20
2 1 10
2 2 15
2 3 20
;

data one;
input id gender $1;
cards;
1 M
2 F
;
run;

71   data want;
72     merge many (keep=id test score)
73           one  (keep=id gender)
74     ;
75     by id;
76     if test=2 then gender='X';
77     put (_all_)(=);
78   run;

id=1 test=1 score=10 gender=1
id=1 test=2 score=15 gender=X
id=1 test=3 score=20 gender=X
id=2 test=1 score=10 gender=2
id=2 test=2 score=15 gender=X
id=2 test=3 score=20 gender=X
NOTE: There were 6 observations read from the data set WORK.MANY.
NOTE: There were 2 observations read from the data set WORK.ONE.
NOTE: The data set WORK.WANT has 6 observations and 4 variables.

When you create a new variable, that variable is NOT retained, so everything works as you want.

A "rule" might be to avoid changing the values of any variables when you merge data in a one-to-many merge.  Another sign that you may be headed for trouble is when something just feels odd.  In my above example, why would I recode a patient-attribute (gender), based on a test-level (test) attribute?  In your example, why recode diagnostic code based on site id?

HTH,

--Q.

BASUG is hosting free webinars Next up: Jane Eslinger presenting PROC REPORT and the ODS EXCEL destination on Mar 27 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
LPark
Calcite | Level 5

The merging seems to work fine - I checked the individual records before and after. I also verified the results with and without the "if" statement. To address your question - there are specific histology codes that are associated with non-Hodgkin lymphoma and leukemia. The "map" dataset, therefore, provides the correct diagnosis code to each of the histology codes from the "lymphoma" dataset. However, if a record with a non-Hodgkin lymphoma histology occurs at a site related to the brain or nervous system (700<=sitecd<=729), we want those records coded specifically to "central nervous system lymphoma" (diagcd 38). This is what the if statement is trying to accomplish, but for some reason, it's also mapping diagcd 38 to other sites. I attached datasets in my original posting, which I hope will illustrate the problem. If you run:

proc freq data=lymphoma2;

     tables diagcd*sitecd/list;

run;

you'll see that there are sitecd>729 that were mapped to diagcd=38. Without the "if" statement in the data step, those records would have been mapped to diagcd=37 or 39.

art297
Opal | Level 21

I think that Quentin already explain why you aren't getting what you expect. The following is one way to control the behavior that he described:

data lymphoma2 (drop=_diagcd);

     merge lymphoma1 (in=in1 keep=sitecd histocd)

           map (keep=histocd diagcd rename=(diagcd=_diagcd));   

     by histocd;

     if in1;

     if 700<=sitecd<=729 then diagcd=38;

     else diagcd=_diagcd;

run;

LPark
Calcite | Level 5

Thank you, all! I understand now - this has been very helpful. And I will be sure to create new variables for this purpose in the future.

ballardw
Super User

You may want to invest time in some custom formats if you do a lot of this type of coding. Once you have the formats or informats created you can do a lot with things, sometimes without even creating new variables, for grouping variables.

For instance I have codes that use clinic level identifiers to create regions or agency information. And if one of the clinics changes I only have to change that in the format and the new behavior is available at the next report time without changing the report code.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 958 views
  • 3 likes
  • 5 in conversation