BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Chung-Li
Quartz | Level 8

Hi there, 

 

When I was processing dataset this morning, I faced a problem.

Let's take a look at my data first:

 

30210027甲三郎30210397乙先天冰30210069丙小姐30210493丁阿旺

30210044物阿梅30210021己阿有30240052更有言30210346心一?

...

 

Basic it's in the pattern of "8-digit number" + "Chinese name of a person", and it is stored in the text file.

My problem is: I don't know how to divide this seemingly random variable into the way I want?

 

As I said, I want to divide each row into "ID1" + "Name1" + "ID2" + "Name2" + "ID3" + "Name3" + "ID4" + "Name4"

As a whole, I wish the data looked like

ID1               Name1   ID2               Name2      ID3              Name3    ID4              Name4

 

30210027    甲三郎    30210397    乙先天冰    30210069    丙小姐    30210493    丁阿旺

30210044    物阿梅    30210021    己阿有        30240052   更有言     30210346    心一?

...

 

I've tried three ways

#1 Import

It leads to a result that I wouldn't conclude it as a failure, but it's definitely not a success:

未命名.png

 

 

#2 Infile

Some variables are wrong because of the random length of the Chinese name, plus there's no delimiter between each variable.

 

#3 Adding delimiter, such as ",", by my hand.

For this method, I added a comma between each variable.

It works actually. However, considering the total of around 10000 variables awaits me to process, It may not work that well as I thought...

 

Therefore, I'm here to ask for your kindly help.

Any suggestions would be helpful.

Thanks in advance! 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
data have;
infile cards truncover;
input x $80.;
cards;
30210027甲三郎30210397乙先天冰30210069丙小姐30210493丁阿旺
30210044物阿梅30210021己阿有30240052更有言30210346心一?
;
run;
data want;
 set have;
 id1=scan(x,1,,'kd');
 name1=scan(x,1,,'d');
  id2=scan(x,2,,'kd');
 name2=scan(x,2,,'d');
  id3=scan(x,3,,'kd');
 name3=scan(x,3,,'d');
  id4=scan(x,4,,'kd');
 name4=scan(x,4,,'d');
run;

View solution in original post

8 REPLIES 8
FredrikE
Rhodochrosite | Level 12

I would go for the first one "import" which puts it all in one variable - var.

Then i would walk through the dataset and for each line do a loop (do i = 1 to length(var)) and check each position in var, if its a numeric one then add it to a id variable and if it is a character put it as a name.

 

Usually I always ask for delimiter in the files sent to system, to prevent this kind of logic 🙂

 

//fredrik

Chung-Li
Quartz | Level 8
Fredrik,

Thank you for your help!
Sure, having delimiter can definitely make things easier, but sometimes things just don't go that well.
gamotte
Rhodochrosite | Level 12

Hello,

 

You can use a regex to add commas in your file each time a digit is followed by a non-digit

 Edit : and vice-versa

 

Example with sed :

> cat > test << EOF
> 012321akjghjkghj8576543jhgjhgjhg657654654kugjyfj
> 354354354jytjhfhgxchs9687657654jhgjhgjhgj65746576354yrzgredh
> EOF

 

> sed "s/\([0-9]\)\([^0-9]\)/$1,$2/g; s/\([^0-9]\)\([0-9]\)/$1,$2/g" test
01232,kjghjkgh,57654,hgjhgjh,5765465,ugjyfj
35435435,ytjhfhgxch,68765765,hgjhgjhg,574657635,rzgredh

Chung-Li
Quartz | Level 8
Thanks Gamotte!
I'll check "regex".
mkeintz
PROC Star

 

  1. Take the LONG_TEXT variable and change all numbers to blanks, producing  a space-separated list of names in NAME_LIST
  2. As you iteratively extract each name, blank it out in LONG_TEXT, leaving a space-separated list of ID's:

 

data have;
  input long_text $60.;
datalines;
30210027aaaaaaaa30210397bbbb30210069cccc30210493dddd
run;

data want;
  set have;
  name_list=translate(long_text,' ','0123456789');
  array name {6} $10;
  do I=1 to countw(name_list);
    name{I}=scan(name_list,I,' ');
    long_text=transtrn(long_text,trim(name{I}),' ');
  end;
  array id{6};
  do I=1 to countw(long_text);
    id{I}=input(scan(long_text,I,' '),10.);
  end;
drop name_text long_text; run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Chung-Li
Quartz | Level 8
Mkeintz,
What a masterpiece this is!
Also, it works really well when I apply it to my data!
I guess now I need to spend some time to study this code.

Thanks!
Ksharp
Super User
data have;
infile cards truncover;
input x $80.;
cards;
30210027甲三郎30210397乙先天冰30210069丙小姐30210493丁阿旺
30210044物阿梅30210021己阿有30240052更有言30210346心一?
;
run;
data want;
 set have;
 id1=scan(x,1,,'kd');
 name1=scan(x,1,,'d');
  id2=scan(x,2,,'kd');
 name2=scan(x,2,,'d');
  id3=scan(x,3,,'kd');
 name3=scan(x,3,,'d');
  id4=scan(x,4,,'kd');
 name4=scan(x,4,,'d');
run;
Chung-Li
Quartz | Level 8
Ksharp,

This is so understandable for me, and it works!
Thank you so much!

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 3542 views
  • 6 likes
  • 5 in conversation