BookmarkSubscribeRSS Feed
ArseneWenger
Fluorite | Level 6

Hi,

I am new to the hash code and I have a data step that I want to convert to hash code to decrease run time ( i have millions of rows). I would appreciate any help to covert same code to hash code.

 

Here is the logic of my datastep:  I am creating a new variable "tag" which can 1 or 0. If it is my first.ID and value of status is not 8 then 0 else 1 until it hits 1 again. So basically I am retaining the tag value until I hit any other value other than 8.

 

Example:

 

data try;
input id status;
datalines;
1 8
1 . 
1 1
1 .
2 8
2 .
2 .
2 .
3 1
3 .
3 8
3 .
4 .
4 .
4 .
5 .
5 .
5 8
5 .
;
run;

 

data try1 ;
set try;
by id;
retain tag 0;
if first.id then
tag=0;
if status=8 then
tag=1;
if status=1 then
tag=0;
run;

 

 

 

Output:

 id status tag
1 8 1
1 . 1
1 1 0
1 . 0
2 8 1
2 . 1
2 . 1
2 . 1
3 1 0
3 . 0
3 8 1
3 . 1
4 . 0
4 . 0
4 . 0
5 . 0
5 . 0
5 8 1
5 . 1

 

13 REPLIES 13
Astounding
PROC Star

Your existing program is fine.  How could hashing speed this up in any way?

ArseneWenger
Fluorite | Level 6
As I mentioned that I would several millions of rows and its taking forever.
Reeza
Super User

What's your definition of forever?

I don't think there's anyway to simplify this process so you're best off trying to figure out where/why it's being slow first. 

 

A few million rows should be processed pretty quickly unless you have memory or space restrictions.

ArseneWenger
Fluorite | Level 6

Regardless of time constraint can anyone help me understand how can the same code be converted to hash code for knowledge purpose? I would really appreciate that.

 

I am not lying when I said it's very very slow (several hours) and I feel I want to test if using hash code improves timing.

 

Thanks in advance for your help and understanding.

Reeza
Super User

If a data step takes hours for less than 10 million rows, you either have some other step your'e not showing (a sort?) or something else is wrong with your set up. It takes about 20 minutes for me to process (multiple procs/data steps/sql) 30 million rows.

Reeza
Super User

If you'd like to convert it as an academic exercise, feel free to do so.

 

Asking someone else to do it for you, isn't what this forum is for. If you're looking for help with it, show us what you have so far and we can help with what you correct/modify if it as necessary. And the 'requirement' to switch to hash makes me think this is likely homework.

 


@ArseneWenger wrote:

Regardless of time constraint can anyone help me understand how can the same code be converted to hash code for knowledge purpose? I would really appreciate that.

 

I am not lying when I said it's very very slow (several hours) and I feel I want to test if using hash code improves timing.

 

Thanks in advance for your help and understanding.


 

ArseneWenger
Fluorite | Level 6

Thanks, everyone!

Reeza this platform is for helping not for being mean. I am not a student in a college which you should have understood from my original post.

 

I have always experienced this behavior from you from all of my posts. 

I understand posting useless replies also bumps your rating as it doesn't see the content you are posting.

 

Again, thanks for spoiling the thread, I will sure figure out how to optimize this code. 

 

 

 

Reeza
Super User

@ArseneWenger wrote:

Thanks, everyone!

Reeza this platform is for helping not for being mean. I am not a student in a college which you should have understood from my original post.

 


Why should I understand you're not a student? We only see exactly what you post, so based on your question, I don't see any reason why I would conclude you're not a student. I don't care if you're a student, worker, manager or the CEO of a company, you'll get the same treatment from me. I care about how you phrase questions. It's one thing to ask for help, it's another to ask people to do work for you for free. People asking for the second without showing any attempts is my pet peeve, so yeah, I'm probably 'meaner' than necessary. I still think it's perfectly fair to ask someone to show what they've attempted before helping them. In fact, it's part of the requirements in other common forums (StackOverflow) and in general it's how you ask a  good question - see guidelines here:

https://stackoverflow.com/help/how-to-ask

and here:

https://communities.sas.com/t5/SAS-Communities-Library/How-to-ask-a-question-in-SAS-Analytics-U-Comm...


I understand posting useless replies also bumps your rating as it doesn't see the content you are posting.

 


I don't care about status, views or the number of posts on here. If I did, wouldn't answering your question get me more points and 'status'? In fact, the answer is probably less characters than typing out this response. 


Again, thanks for spoiling the thread, I will sure figure out how to optimize this code.  

My apologies for spoiling your thread, I promise to never spoil another thread for you ever again. 

novinosrin
Tourmaline | Level 20

Hi @ArseneWenger Sir/Madam, There's no reason for you get so heated about anything over a person or people just because they gave their opinion. Chill. The point all the super users were trying to make is that your question doesn't quite warrant a hash solution is all. All the three who responded in this thread are really amazing people and I have shamelessly used their solutions over the years. 

 

This is an amazing friendly forum and many have been benefitted(myself included) and so we shouldn't create room for unnecessary feelings of haste. The positive spirit of this forum should go on forever. Well, I hope.

 

OK, May i ask why do you want a hash solution? what are you trying to load in hash table? What would be your look up keys and what data would you associate with those keys? How would you add those to the table? Through direct load or load during the pass of dataset? If you choose the latter, how would you counter efficiency as this is going to be an overhead?  Now, if for some reason you loaded your data, how do you think the object dot notation method will fetch your  data? Doesn't this make a one full pass of a dataset, plus a hash table plus a routine of adding and fetching data into the PDV? So inessence, there is more I/O, memory and others etc to further affect performance?

 

If you can help me explain the above with logical reasoning and some critical thinking, we can make this discussion an enjoyable experience. Have a nice evening

ballardw
Super User

Hash is one way of looking things up in another data set. Since you do not show anything that resembles a look up then Hash does not seem likely to help.

 

You can get a small improvement by changing the code slightly:

data try1 ;
   set try;
   by id;
   retain tag 0;
   if first.id then  tag=0;
   if status=8 then  tag=1;
   Else if status=1 then  tag=0;
run;

Put the status value in the first if statement that occurs more often if such is the case.

 

 

If you have more such values then describe them.

ArseneWenger
Fluorite | Level 6

Thanks will try that.

ballardw
Super User

Another possibility for a small improvement especially if the number of missing is large.

 

data try1 ;
   set try;
   by id;
   retain tag 0;
   if first.id then  tag=0;
   select (status);
      when(1) tag=0;
      when(8) tag=1;
      otherwise ;
   end;
run;

The select branches directly to one of 3 instructions with the otherwise being any code executed for values other than 1 or 8 such as missing. Since you don't actually want to do anything specific then a null statement is appropriate.

 

The time improvement would not be large but the If/then/else/if will always do 2 comparisons if the status is missing where the select will do only one.

novinosrin
Tourmaline | Level 20

@ArseneWenger  My humble suggestion is to proceed with @ballardw simple yet neat approach with one pass and done and that is pretty much what all of us would. But here you go, just for your needs, have fun.- Nonetheless, I would never do this for my work

 

data try;

input id status;

datalines;

1 8

1 .

1 1

1 .

2 8

2 .

2 .

2 .

3 1

3 .

3 8

3 .

4 .

4 .

4 .

5 .

5 .

5 8

5 .

;

run;

 

 

data want;

    if _N_ = 1 then do;

      declare hash h(dataset: 'try', ordered: 'yes',multidata:'yes');

      declare hiter iter('h');

      h.defineKey('id');

      h.defineData('id','status');

      h.defineDone();

      call missing(id, status);

   end;

   rc = iter.first();

   do while (rc = 0);

   if lag(id) ne id then tag=0;

   if status=8 then tag=1;

   else if status=1 then tag=0;

     output;

      rc = iter.next();

   end;

   drop rc;

run;

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1284 views
  • 5 likes
  • 5 in conversation