BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
UniversitySas
Quartz | Level 8

So my long data looks like this:

data source;
INFILE DATALINES dlm='#';
length ID $ 3
VAR $ 32
CLASSIFICATION $ 32;
input ID VAR CLASSIFICATION;
DATALINES;
201 # PepsiCo # Name
201 # 111 main street # AddressLine1
201 # Philadelphia # City
201 # PA # State
201 # 21491 # Zipcode
201 # None # Relationship
201 # Charity # Status
201 # Help # Purpose
201 # 9001 # DollarAmount
201 # CocaCola # Name
201 # 245 Cork street # AddressLine1
201 # Floor 43 # AddressLine2
201 # Philadelphia # City
201 # PA # State
201 # 21492 # Zipcode
201 # None # Relationship
201 # Charity # Status
201 # Build Factory # Purpose
201 # 4100 # CurrentContribution
201 # 13101 # TotalAmount
331 # Adidas # Name
331 # 115 walnut avenue # AddressLine1
331 # New York # City
331 # NY # State
331 # 255191 # Zipcode
331 # Charity # Status
331 # Help # Purpose
331 # 143 # FutureContribution
331 #143 # TotalAmount
334 # Nike # Name
334 # 123 Stevens Road # ForeignAddressLine1
334 # apartment 4D # ForeignAddressLine2
334 # Denpasar # City
334 # Bali # State
334 # 2512 # Zipcode
334 # N/A # Relation
334 # Help # Purpose
334 # 2000 # CurrentContribution
334 # 2000 # TotalAmount
;
RUN;

The output I am after is this:

output.JPG

 

The basic structure of the data is this:

IDs appear multiple times, and they have names associated to them. The last time the ID appears, there is an additional entry called "total" which sums up the entire amount for "contribution" for that ID. Then the New ID starts.


Every "Name" has:

 

an "Address line 1". Sometimes it contains the string "Foreign", in which case I want the column that says "Foreign" to say "Y" and "N" otherwise.

 

There is sometimes also an "Address line 2", but NOT always. Regardless, I want to force an "Address line 2" into the data.

 

There is always a "city", "state" and "postcode"

Sometimes "Relation" is not there, so I want to force it there.

It should always comes 1 row after "postcode", and 1 row before "status"

 

Sometimes "status" is not there. It should always come two rows after "postcode" in the long data file. Needs to be forced too.

 

Sometimes "Purpose" is not there. It always comes three rows after "postcode" in the long data file. Needs to be forced too.

 

"Contribution Amount" is always there - but it either contains the string "current" or "future" in it. if it says "Future" I want the "Future" column to say "Y", if not then "N".

 

For rows that are sometimes not there, and have to be forced in; here I want the corresponding value to be a ".".


I actually made a similar post earlier - but the data at the time was IMPOSSIBLE to work with. By some stroke of luck, the raw data I received is singificantly improved in that it actually has the row names (even though they're not always listed). So I don't need to guess the structure of the data anymore.

 

Any help is seriously appreciated.

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User
/* step 1: create a unique id for ever name */
/* also do some manipulations as needed */
data int1;
set source;
retain newid 0;
if classification = 'Name' then newid + 1;
if substr(classification,1,7) = 'Foreign'
then do;
  classification = substr(classification,8);
  output;
  if index(classification,'1')
  then do;
    classification = 'Foreign';
    var = 'Y';
    output;
  end;
end;
else output;
run;

/* step 2: transpose along the unique id */
proc transpose
  data=int1
  out=int2 (drop=_name_)
;
by newid;
var var;
id classification;
run;

/* step 3: get a reference table for unique id to id */
proc sort
  data=int1 (keep=id newid)
  out=int3
  nodupkey
;
by newid;
run;

/* step 4: merge the original id back in */
data want (drop=newid);
merge
  int3
  int2
;
by newid;
run;

This code handles the transpose. You can add additional code in step 4 to force the horizontal order of variables, or to create derived columns (eg the "future" column), and to set default values for missing columns.

View solution in original post

5 REPLIES 5
Kurt_Bremser
Super User
/* step 1: create a unique id for ever name */
/* also do some manipulations as needed */
data int1;
set source;
retain newid 0;
if classification = 'Name' then newid + 1;
if substr(classification,1,7) = 'Foreign'
then do;
  classification = substr(classification,8);
  output;
  if index(classification,'1')
  then do;
    classification = 'Foreign';
    var = 'Y';
    output;
  end;
end;
else output;
run;

/* step 2: transpose along the unique id */
proc transpose
  data=int1
  out=int2 (drop=_name_)
;
by newid;
var var;
id classification;
run;

/* step 3: get a reference table for unique id to id */
proc sort
  data=int1 (keep=id newid)
  out=int3
  nodupkey
;
by newid;
run;

/* step 4: merge the original id back in */
data want (drop=newid);
merge
  int3
  int2
;
by newid;
run;

This code handles the transpose. You can add additional code in step 4 to force the horizontal order of variables, or to create derived columns (eg the "future" column), and to set default values for missing columns.

UniversitySas
Quartz | Level 8

You are a legend! This worked BEAUTIFULLY.

Just one question, I had some issues in step 2 when trying to do the transpose. In my data, the values in classification are actually much longer, upto 60 characters, and the first 40 or so characters are the same. 

 

The error message I get is:

 

NOTE: The above message was for the following BY group:
newid=12814
ERROR: The ID value "grantorcontriapprovedforfuture__" occurs twice in the same BY group.
ERROR: The ID value "grantorcontriapprovedforfuture__" occurs twice in the same BY group.
ERROR: The ID value "grantorcontriapprovedforfuture__" occurs twice in the same BY group.

My work-around was to hardcode a shorter string, but I had to do this for 20 different values.

Is there anything else I can do, so that it reads the string in it's entirety instead of the first 32 characters?

 

Thanks again so much for your help.

UniversitySas
Quartz | Level 8
In other words, are you saying that renaming them to shorter character lengths is the only work around?
Kurt_Bremser
Super User

@UniversitySas wrote:
In other words, are you saying that renaming them to shorter character lengths is the only work around?

Absolutely. Anything that exceeds the maximum variable name length is truncated, and when that leads to ambiguities, you get those ERROR messages.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 759 views
  • 2 likes
  • 2 in conversation