Short guide to Ruby internals

This article is a practical guide to the internals of MRI Ruby. Our goal is to add a new feature to the language and to learn something about how Ruby works inside. This is a huge topic and we cannot cover everything in one article, so we will look only at the relevant parts. I expect you to be familiar with Ruby and basic C, and to have a general idea of how parsers work.

Ruby has a big and constantly changing codebase that is not very generous with comments. Therefore this article reads more like a detective story than a set of pointers to documentation.

Let me introduce you to the syntax we are about to implement:

records.map &:save(false)

It is an extension to the &:symbol syntax that allows you to bind some parameters to the proc. A quick recap of how &:symbol works: when you pass a block argument to a method using the &object syntax, object.to_proc gets called internally and the result is used as the block argument. Symbol#to_proc returns a proc that sends the symbol, as a method name, to the first argument the proc receives. So our feature is syntactic sugar for this:

records.map{ |o| o.save(false) }

Not a huge win in terms of characters typed, but it will allow us to produce more expressive code. Some other use cases:

json_array_of_hashes.map(&:[]('name'))
array.select(&:is_a?(Numeric))
array_of_strings.flat_map(&:split(','))

To simplify our task we are not going to add this syntax to all block arguments, only to symbols.
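The recap above can be checked in stock Ruby: & calls to_proc on whatever object it is given, not only on symbols. The sketch below uses a hypothetical MethodName class (it is not part of Ruby or of the patch in this article) whose to_proc mimics what Symbol#to_proc returns.

```ruby
# Hypothetical stand-in for a symbol: any object that responds to to_proc
# can be passed with the &object syntax.
class MethodName
  def initialize(name)
    @name = name
  end

  # Returns a proc that sends the stored method name to its first argument.
  def to_proc
    name = @name
    proc { |receiver, *args| receiver.public_send(name, *args) }
  end
end

words = %w[ruby parser]

with_symbol = words.map(&:upcase)                  # built-in Symbol#to_proc
with_object = words.map(&MethodName.new(:upcase))  # our own to_proc
with_block  = words.map { |w| w.upcase }           # explicit block
```

All three calls should produce the same array, which is the equivalence the rest of the article builds on.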

If you want to follow this article with the Ruby source I used, check out commit 44a9509f2ff2be85b97ade9806857e0948c29a1b.

Looking around

We will begin thinking about our implementation by looking at the relevant parts of Ruby. Symbol#to_proc is the natural place to start. RDoc tells us that we need to look at the sym_to_proc C function, which sits in the string.c file. I will put it here so you can look at the whole thing when needed. An explanation will follow.

 1 /*
 2  * call-seq:
 3  *   sym.to_proc
 4  *
 5  * Returns a _Proc_ object which respond to the given method by _sym_.
 6  *
 7  *   (1..3).collect(&:to_s)  #=> ["1", "2", "3"]
 8  */
 9
10 static VALUE
11 sym_to_proc(VALUE sym)
12 {
13     static VALUE sym_proc_cache = Qfalse;
14     enum {SYM_PROC_CACHE_SIZE = 67};
15     VALUE proc;
16     long id, index;
17     VALUE *aryp;
18
19     if (!sym_proc_cache) {
20         sym_proc_cache = rb_ary_tmp_new(SYM_PROC_CACHE_SIZE * 2);
21         rb_gc_register_mark_object(sym_proc_cache);
22         rb_ary_store(sym_proc_cache, SYM_PROC_CACHE_SIZE*2 - 1, Qnil);
23     }
24
25     id = SYM2ID(sym);
26     index = (id % SYM_PROC_CACHE_SIZE) << 1;
27
28     aryp = RARRAY_PTR(sym_proc_cache);
29     if (aryp[index] == sym) {
30       return aryp[index + 1];
31     }
32     else {
33         proc = rb_proc_new(sym_call, (VALUE)id);
34         aryp[index] = sym;
35         aryp[index + 1] = proc;
36         return proc;
37     }
38 }

Let's look at the sym_to_proc type signature:

static VALUE sym_to_proc(VALUE sym);

Function sym_to_proc takes one parameter of type VALUE and returns a VALUE. We know that in the Ruby world Symbol#to_proc accepts no arguments and should return a Proc instance. This very likely means two things. First: sym is a reference to self passed to the function. Second: since both the symbol and the proc have the same type (VALUE), that type is used to store any kind of Ruby object.

We are going to jump from one function to another very often. When doing this yourself, I suggest using Ctags to jump between function symbols quickly and ack to search inside the project. Personally I use Emacs with ack-mode and vanilla TAGS. Vim users have TAGS support and ack.vim; for Sublime I suggest the ctags plugin and vanilla project search.

Jumping to the VALUE definition:

typedef uintptr_t VALUE;

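VALUE is an unsigned integer wide enough to hold a pointer. So most values in Ruby are allocated somewhere and we just pass pointers around; a few special values are encoded directly in the number itself, as we will see shortly. Now back to our function (I use line numbers only for the main function we analyze):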

13 static VALUE sym_proc_cache = Qfalse;

Variable sym_proc_cache is static (which means that its value is preserved between calls of the function). The name tells us that it will be used as a cache, but for now it's initialized with Qfalse. Our first guess is that Qfalse is the instance of FalseClass, but let's check the definition:

#define Qfalse ((VALUE)RUBY_Qfalse)

Ok, it's a macro, casting RUBY_Qfalse to VALUE, our next stop is RUBY_Qfalse:

/* special constants - i.e. non-zero and non-fixnum constants */
enum ruby_special_consts {
#if USE_FLONUM
    RUBY_Qfalse = 0x00,
    ...
#else
    RUBY_Qfalse = 0,
    ...
#endif
};

RUBY_Qfalse is just a magic number cast to VALUE. It means Qfalse is a special object represented by a constant rather than by a pointer to allocated memory. Back to sym_to_proc:

14 enum {SYM_PROC_CACHE_SIZE = 67};
15 VALUE proc;
16 long id, index;
17 VALUE *aryp;

A couple of definitions: the cache size and some temporary variables. Moving on:

19 if (!sym_proc_cache) {
20     sym_proc_cache = rb_ary_tmp_new(SYM_PROC_CACHE_SIZE * 2);
21     rb_gc_register_mark_object(sym_proc_cache);
22     rb_ary_store(sym_proc_cache, SYM_PROC_CACHE_SIZE*2 - 1, Qnil);
23 }

Let's read the function names and try to understand what is going on. If sym_proc_cache is not set, we create an array with rb_ary_tmp_new, mark it as "in use" for the GC with rb_gc_register_mark_object, and set the element at index SYM_PROC_CACHE_SIZE*2 - 1 to Qnil (another global constant object like Qfalse). As a side effect this grows the array to a length of SYM_PROC_CACHE_SIZE*2 elements.

Now let's find out how the array is created. Jumping through rb_ary_tmp_new → ary_new → ary_alloc, we get to the line where an array is allocated:

NEWOBJ_OF(ary,
          struct RArray,
          klass,
          T_ARRAY |
          RARRAY_EMBED_FLAG |
          (RGENGC_WB_PROTECTED_ARRAY ? FL_WB_PROTECTED : 0));

This line uses the NEWOBJ_OF macro to give us a fresh object initialized with the correct type. Creating objects and managing their lifecycle is the job of the garbage collector defined in gc.c. To save us some time we won't pay attention to the GC this time, trusting Ruby to do the right thing.

But how do we do anything useful with the array? Here we use rb_ary_store to put nil into the array. Searching for the definition in header files points us to intern.h. This file is included in ruby.h and is visible in almost every file inside the Ruby source. It contains lots of helper functions, including some useful functions for manipulating arrays:

/* array.c */
void rb_mem_clear(register VALUE*, register long);
VALUE rb_assoc_new(VALUE, VALUE);
VALUE rb_check_array_type(VALUE);
VALUE rb_ary_new(void);
VALUE rb_ary_new_capa(long capa);
VALUE rb_ary_new_from_args(long n, ...);
VALUE rb_ary_new_from_values(long n, const VALUE *elts);
VALUE rb_ary_tmp_new(long);
void rb_ary_free(VALUE);
void rb_ary_modify(VALUE);
VALUE rb_ary_freeze(VALUE);
VALUE rb_ary_shared_with_p(VALUE, VALUE);
VALUE rb_ary_aref(int, VALUE*, VALUE);
VALUE rb_ary_subseq(VALUE, long, long);
void rb_ary_store(VALUE, long, VALUE);
VALUE rb_ary_dup(VALUE);
VALUE rb_ary_resurrect(VALUE ary);
VALUE rb_ary_to_ary(VALUE);
VALUE rb_ary_to_s(VALUE);
VALUE rb_ary_cat(VALUE, const VALUE *, long);
VALUE rb_ary_push(VALUE, VALUE);
VALUE rb_ary_pop(VALUE);
VALUE rb_ary_shift(VALUE);
VALUE rb_ary_unshift(VALUE, VALUE);
VALUE rb_ary_entry(VALUE, long);
VALUE rb_ary_each(VALUE);
VALUE rb_ary_join(VALUE, VALUE);
VALUE rb_ary_reverse(VALUE);
VALUE rb_ary_rotate(VALUE, long);
VALUE rb_ary_sort(VALUE);
VALUE rb_ary_sort_bang(VALUE);
VALUE rb_ary_delete(VALUE, VALUE);
VALUE rb_ary_delete_at(VALUE, long);
VALUE rb_ary_clear(VALUE);
VALUE rb_ary_plus(VALUE, VALUE);
VALUE rb_ary_concat(VALUE, VALUE);
VALUE rb_ary_assoc(VALUE, VALUE);
VALUE rb_ary_rassoc(VALUE, VALUE);
VALUE rb_ary_includes(VALUE, VALUE);
VALUE rb_ary_cmp(VALUE, VALUE);
VALUE rb_ary_replace(VALUE copy, VALUE orig);
VALUE rb_get_values_at(VALUE, long, int, VALUE*, VALUE(*)(VALUE,long));
VALUE rb_ary_resize(VALUE ary, long len);

They may come in handy later. Returning to our function:

25 id = SYM2ID(sym);

From the types involved we can see that SYM2ID turns a VALUE into a long. As usual, let's look at the definition:

#define SYM2ID(x) RSHIFT((unsigned long)(x),RUBY_SPECIAL_SHIFT)

We cast the VALUE to unsigned long and shift it right by RUBY_SPECIAL_SHIFT bits; the result is the symbol's ID.

26 index = (id % SYM_PROC_CACHE_SIZE) << 1;
27
28 aryp = RARRAY_PTR(sym_proc_cache);
29 if (aryp[index] == sym) {
30     return aryp[index + 1];
31 }

Here we are hashing id to an integer from 0 to SYM_PROC_CACHE_SIZE - 1 and then doubling the result, so index is always even. Again, from the name and types we can understand that the macro RARRAY_PTR takes a Ruby array and gives us a pointer to the underlying C array.

Now we check our cache for a hit by comparing aryp[index] with sym. We can compare the raw VALUEs because symbols never go away once created, so each symbol always has the same VALUE. In case of a hit we return the next element in the array. So we see that aryp is used as a simple list of pairs, where even elements are keys storing symbols and odd elements are cache values storing procs.

32 else {
33     proc = rb_proc_new(sym_call, (VALUE)id);
34     aryp[index] = sym;
35     aryp[index + 1] = proc;
36     return proc;
37 }

On a cache miss, we create a new proc with rb_proc_new, passing the function sym_call and id, the ID of the symbol. Then we update the cache with the new proc. We will not investigate for now how rb_proc_new works, because it would require us to jump into the VM code, which is a topic for a few articles on its own. We will assume it just creates a new proc using the passed C function pointer.

In the end sym_to_proc doesn't do much work after all. It is just a caching wrapper for a new proc created with sym_call.
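To make the caching scheme concrete, here is a minimal Ruby model of it. This is a sketch, not the C implementation: the SymProcCache class and its misses counter are hypothetical names, and object_id is used as a stand-in for SYM2ID, which is an assumption of the model.

```ruby
# A Ruby model of the cache in sym_to_proc: one flat array of
# [key, value, key, value, ...] pairs with a fixed number of slots.
class SymProcCache
  SLOTS = 67 # mirrors SYM_PROC_CACHE_SIZE

  attr_reader :misses

  def initialize
    @pairs = Array.new(SLOTS * 2) # even index: symbol, odd index: proc
    @misses = 0
  end

  def fetch(sym)
    # object_id stands in for SYM2ID here; that is an assumption of this model.
    index = (sym.object_id % SLOTS) << 1
    return @pairs[index + 1] if @pairs[index].equal?(sym)

    # Cache miss: build a new proc and overwrite whatever was in the slot.
    @misses += 1
    new_proc = proc { |receiver, *args, &blk| receiver.public_send(sym, *args, &blk) }
    @pairs[index] = sym
    @pairs[index + 1] = new_proc
    new_proc
  end
end

cache  = SymProcCache.new
first  = cache.fetch(:upcase)
second = cache.fetch(:upcase) # same slot, same key: returns the cached proc
```

Note the design choice this models: two symbols that hash to the same slot simply evict each other, so the cache never grows and needs no collision handling.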

Digging deeper

Now let's look at the sym_call body (I reformatted the code to fit the page width):

 1 static VALUE
 2 sym_call(VALUE args,
 3           VALUE sym,
 4           int argc,
 5           VALUE *argv,
 6           VALUE passed_proc)
 7 {
 8     VALUE obj;
 9
10     if (argc < 1) {
11         rb_raise(rb_eArgError, "no receiver given");
12     }
13     obj = argv[0];
14     return rb_funcall_with_block(obj,
15                                  (ID)sym,
16                                  argc - 1,
17                                  argv + 1,
18                                  passed_proc);
19 }

The type signature tells us that the function takes a few VALUE objects and a pointer to an array of VALUEs with length argc.

10 if (argc < 1) {
11     rb_raise(rb_eArgError, "no receiver given");
12 }

The function body checks that there is at least one argument and raises ArgumentError otherwise.

13 obj = argv[0];
14 return rb_funcall_with_block(obj,
15                                (ID)sym,
16                                argc - 1,
17                                argv + 1,
18                                passed_proc);

A safe guess is that we call the method identified by sym on the object obj, passing the remaining arguments and the block to it. The rb_funcall_with_block type signature:

VALUE
rb_funcall_with_block(VALUE recv,
                        ID mid,
                        int argc,
                        const VALUE *argv,
                        VALUE pass_procval);

Argument names in function declaration are consistent with our hypothesis. Method id (mid) is passed using an ID type:

typedef unsigned long ID;

It's just an unsigned long integer. In our function we got it from sym_to_proc.

33 proc = rb_proc_new(sym_call, (VALUE)id);

We convert the symbol to an ID (with SYM2ID), cast it to VALUE, and then inside sym_call we cast it back to ID. This tells us that method identifiers can be easily obtained from symbols. There is also a helper, rb_intern, that converts a C string to a method id.

Now we have seen how the proc created by rb_proc_new actually works. That means we are one step away from rewriting it to match our needs.
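The behaviour of sym_call can be sketched in a few lines of Ruby. The function name model_sym_call is hypothetical, and public_send is only my closest Ruby-level approximation of rb_funcall_with_block, so treat it as an assumption rather than an exact equivalent.

```ruby
# A Ruby model of sym_call. method_id plays the role of the ID that
# rb_proc_new stored alongside the C function pointer.
def model_sym_call(method_id, argv, passed_block = nil)
  raise ArgumentError, "no receiver given" if argv.empty?

  receiver = argv[0]
  # argv + 1 / argc - 1 in the C code: everything after the receiver.
  receiver.public_send(method_id, *argv.drop(1), &passed_block)
end

sum    = model_sym_call(:+, [1, 2])                            # 1.+(2)
mapped = model_sym_call(:map, [[1, 2, 3]], proc { |x| x * 2 }) # block is forwarded
```

The first element of argv is always the receiver, which is why the proc returned by Symbol#to_proc fails when it is called with no arguments at all.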

Defining C methods for ruby objects

When we looked at how to_proc works we missed one thing: how this method is declared as a Ruby method. For now we have the C function sym_to_proc, but we don't know how it is connected to to_proc.

Good news: this part of Ruby internals is well documented because it is used by all native extensions. You can read about it in class.c. Short version: Ruby methods with a C implementation are defined with the rb_define_method function. Let's look at its source:

void
rb_define_method(VALUE klass, const char *name,
                 VALUE (*func)(ANYARGS), int argc)
{
    rb_add_method_cfunc(klass,
                        rb_intern(name),
                        func,
                        argc,
                        NOEX_PUBLIC);
}

Function func can take its arguments in several styles. They are distinguished by argc:

zero or positive number: This means the method body function takes a fixed number of parameters
-1: This means the method body function is "argc and argv" style.
-2: This means the method body function is "self and args" style.
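The argc value is partly visible from Ruby through Method#arity. As far as I can tell, fixed counts are reported as-is and both variadic styles show up as -1. The sketch below assumes a stock (unpatched) MRI, where Symbol#to_proc is registered with 0, Integer#+ with 1 and Array#push with -1.

```ruby
# Method#arity reflects the argc a C method was registered with:
# fixed counts are reported as-is, variadic styles show up as -1.
to_proc_arity = :upcase.method(:to_proc).arity # no arguments in stock Ruby
plus_arity    = 1.method(:+).arity             # one fixed argument
push_arity    = [].method(:push).arity         # "argc and argv" style
```

On an interpreter built with the patch described in this article, the first value would presumably change, because to_proc is re-registered with -2.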

Planning the implementation

Now that we have seen all the relevant parts, we can think about an implementation for our feature. We will do it in two steps. First: we need to pass (curry) Symbol#to_proc arguments to the resulting proc. Second: add a new syntax that calls the modified to_proc function.

Compiling ruby from source

But before we can change any code, let me do the quick "Compile Ruby 101":

git clone git@github.com:ruby/ruby.git
cd ruby
autoconf
./configure
make

This will take a while, so go and have a stretch. You can test that everything works as expected with:

./ruby -rirb -e "IRB.start"

Rewriting `sym_to_proc`

Let's write a test case for our to_proc change:

[1,2,3,4].map(&:+.to_proc(3)) == [4,5,6,7]

Let's see it fail:

irb(main):001:0> [1,2,3,4].map(&:+.to_proc(3)) == [4,5,6,7]
ArgumentError: wrong number of arguments (1 for 0)
        from (irb):1:in `to_proc'
        from (irb):1

So our current goal is to extend sym_to_proc implementation so that it can take multiple arguments.

 rb_define_method(rb_cSymbol, "to_sym", sym_to_sym, 0);
-rb_define_method(rb_cSymbol, "to_proc", sym_to_proc, 0);
+rb_define_method(rb_cSymbol, "to_proc", sym_to_proc, -2);
 rb_define_method(rb_cSymbol, "succ", sym_succ, 0);

We changed to_proc to take "self and args" style arguments,

 static VALUE
-sym_to_proc(VALUE sym)
+sym_to_proc(VALUE sym, VALUE args)
 {

added additional arguments to sym_to_proc signature,

     enum {SYM_PROC_CACHE_SIZE = 67};
-    VALUE proc;
-    long id, index;
+    VALUE proc, proc_arguments;
+    long id, index, args_length = RARRAY_LEN(args);
     VALUE *aryp;

added some temporary variables and used RARRAY_LEN macro to get arguments length.

-    if (aryp[index] == sym) {
+    if (aryp[index] == sym && args_length == 0) {
        return aryp[index + 1];
     }

We are not going to cache procs that have curried arguments. Caching such procs would require some way to compare the arguments too, and that is too much work for us now.

     else {
-       proc = rb_proc_new(sym_call, (VALUE)id);
+       proc_arguments = rb_ary_new_capa(2);
+
+       rb_ary_store(proc_arguments, 0, sym);
+       if(RARRAY_LENINT(args) > 0) {
+         rb_ary_store(proc_arguments, 1, args);
+       } else {
+         rb_ary_store(proc_arguments, 1, Qnil);
+       }
+
+       proc = rb_proc_new(sym_call, proc_arguments);

Instead of passing symbol id to our proc we created a new array with initial capacity 2 and added our symbol and proc arguments to it.

+       if(args_length == 0){
        aryp[index] = sym;
        aryp[index + 1] = proc;
+       }
        return proc;

And we added a guard for our cache, updating it only if our proc has no arguments.

Rewriting `sym_call`

Now that the preparations are done and the arguments are passed along, we can update sym_call.

-sym_call(VALUE args, VALUE sym, int argc, VALUE *argv, VALUE passed_proc)
+sym_call(VALUE args,
+         VALUE sym_and_args,
+         int argc,
+         VALUE *argv,
+         VALUE passed_proc)
 {
-    VALUE obj;
+    ID id;
+    VALUE obj, tmp;
+    VALUE *aryp = RARRAY_PTR(sym_and_args);

We renamed the argument to sym_and_args, set up the variables that we need and got access to the array values with the RARRAY_PTR macro.

+    id = SYM2ID(aryp[0]);
     obj = argv[0];
+    argv++;
+    argc--;

We got method id from our original symbol (first element from sym_and_args). And then we got the first argument from the current invocation of proc, which is method receiver.

+    if(NIL_P(aryp[1])) {
+    } else if(argc > 0) {
+      tmp = rb_ary_plus(aryp[1], rb_ary_new_from_values(argc, argv));
+      argv = RARRAY_PTR(tmp);
+      argc = RARRAY_LENINT(tmp);
+    } else {
+      argv = RARRAY_PTR(aryp[1]);
+      argc = RARRAY_LENINT(aryp[1]);
+    }

If we have some original to_proc arguments (second element of array is not nil) and some arguments from the current call, we concatenate them to a new array. In other cases we use one of the passed arrays.

-    return rb_funcall_with_block(obj, (ID)sym, argc - 1, argv + 1, passed_proc);
+    return rb_funcall_with_block(obj, id, argc, argv, passed_proc);

And finally we change method call to accept new arguments.
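The argument handling described above can be modelled in plain Ruby. This is a sketch of the intended behaviour, not a translation of the C patch; merge_args and curried_sym_proc are hypothetical helper names, and public_send again approximates rb_funcall_with_block.

```ruby
# curried:   arguments given to to_proc, or nil when there were none
# call_args: arguments the proc receives after the receiver
def merge_args(curried, call_args)
  if curried.nil?
    call_args            # plain &:symbol behaviour
  elsif !call_args.empty?
    curried + call_args  # both present: curried arguments come first
  else
    curried              # only curried arguments
  end
end

# Builds a proc that behaves like the patched Symbol#to_proc(*curried).
def curried_sym_proc(sym, *curried)
  curried = nil if curried.empty?
  proc do |*argv, &blk|
    raise ArgumentError, "no receiver given" if argv.empty?

    receiver, *rest = argv
    receiver.public_send(sym, *merge_args(curried, rest), &blk)
  end
end

plus_three = [1, 2, 3, 4].map(&curried_sym_proc(:+, 3))
split_up   = ["a, b", "c"].flat_map(&curried_sym_proc(:split, ","))
upcased    = %w[a b].map(&curried_sym_proc(:upcase))
```

Putting the curried arguments before the call-time ones matches the order used by rb_ary_plus in the patch.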

Let's test our implementation:

irb(main):001:0> [1,2,3,4].map(&:+.to_proc(3)) == [4,5,6,7]
 => true

This implementation has some performance issues. In the worst case we create two new arrays for each call of the proc. Also, we don't cache procs with arguments, so our function will suffer some performance penalties when such a proc is created inside an iterator. But optimizing this is for sure beyond our current scope.

So now we have to_proc working and it's time to look at the parser.

Bison quick tour

Ruby uses Bison to generate the parser for the language. Bison produces code that takes a stream of language tokens (symbols, operators, reserved words, numbers, strings etc.) from the lexer and turns them into an abstract syntax tree.

The grammar for Ruby is defined inside parse.y. It is an 11 kloc monster with lots of stuff inside, but the good news is that we don't need to read the whole file and actually need only a small part of it. Grammar rules are the heart of a parser: they define the valid language constructions and how the AST should be built.

Now I will give you a brief overview of Bison using the Ruby source as an example, and then we will implement our feature. If you get stuck or need a more detailed introduction, you can look at a Bison cheatsheet or the Bison documentation.

High level structure of Bison grammar files:

%{
C declarations
%}

Bison declarations

%%
Grammar rules
%%

Additional C code

But there is one twist in the parse.y file. Ruby comes with the Ripper library for parsing Ruby source, and Ripper uses annotated comments to do its job. We will simply ignore them.

Let's look at the Bison declarations section. It begins with a %union block:

%union {
    VALUE val;
    NODE *node;
    ID id;
    int num;
    const struct vtable *vars;
}

It describes the possible types that values can have; this mapping is used when we define the language's tokens and symbols. We have already seen VALUE and ID, so let's look at the NODE definition:

typedef struct RNode {
    VALUE flags;
    VALUE nd_reserved;  /* ex nd_file */
    union {
        struct RNode *node;
        ID id;
        VALUE value;
        VALUE (*cfunc)(ANYARGS);
        ID *tbl;
    } u1;
    union {
        struct RNode *node;
        ID id;
        long argc;
        VALUE value;
    } u2;
    union {
        struct RNode *node;
        ID id;
        long state;
        struct rb_global_entry *entry;
        struct rb_args_info *args;
        long cnt;
        VALUE value;
    } u3;
} NODE;

It's a struct that can hold up to three values of different types and some flags. It can be used as a tree structure because the union fields can hold pointers to other NODEs.

The next step is to define the possible symbols of the Ruby language. They are created with the directives %token and %type, which represent terminal and nonterminal symbols respectively. From the Bison documentation:

A terminal symbol (also known as a token type) represents a class of syntactically equivalent tokens. You use the symbol in grammar rules to mean that a token in that class is allowed. The symbol is represented in the Bison parser by a numeric code, and the yylex function returns a token type code to indicate what kind of token has been read. You don't need to know what the code value is; you can use the symbol to stand for it.

A nonterminal symbol stands for a class of syntactically equivalent groupings. The symbol name is used in writing grammar rules. By convention, it should be all lower case.

Let's search parse.y for some relevant token definitions:

%token <id>   tIDENTIFIER tFID tGVAR tIVAR tCONSTANT tCVAR tLABEL
%type <node> expr_value arg_value primary_value fcall
%type <node> command_args aref_args opt_block_arg block_arg var_ref var_lhs
%type <id>   fsym keyword_variable user_variable sym
             symbol operation operation2 operation3

Let's take symbol as an example: it is a nonterminal symbol with type ID. And now the most important part, the grammar rule for symbol:

symbol : tSYMBEG sym
       {
          lex_state = EXPR_END;
          /*%%%*/
          $$ = $2;
          /*%
          $$ = dispatch1(symbol, $2);
          %*/
       }
       ;

Comment blocks are special annotation for Ripper, so we can ignore them.

The first line says that symbol is a sequence of two components: tSYMBEG and sym. tSYMBEG means ':', a special token marking the beginning of a symbol. sym is a dependent rule that describes the possible forms of the symbol body.

The body of a rule computes the rule's value. It should set the special variable $$ to some C value. It can use the values of dependent rules through numbered variables, starting with $1.

Let's look at the grammar rule sym:

sym   : fname
      | tIVAR
      | tGVAR
      | tCVAR
      ;

The vertical line separates alternatives: sym can be any one of fname, tIVAR, tGVAR or tCVAR. They stand for a function name, an instance variable, a global variable and a class variable respectively. Examples of valid input for the symbol rule are: :fname, :Fname, :@instance_variable, :$global_variable, :@@class_variable. Note that this rule doesn't cover the :"string" syntax.
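The alternatives of the sym rule can be observed from Ruby with the Ripper library mentioned earlier. The sketch below is only an illustration: symbol_body_token is a hypothetical helper, and it relies on the shape of the Ripper.sexp output for a lone symbol literal, which I am assuming stays as described in the comment.

```ruby
require "ripper"

# Returns the event name of the token that forms the symbol body,
# i.e. which alternative of the sym rule matched.
def symbol_body_token(source)
  # Assumed shape: [:program, [[:symbol_literal, [:symbol, [token, text, pos]]]]]
  Ripper.sexp(source).dig(1, 0, 1, 1, 0)
end

tokens = {
  ":name"   => symbol_body_token(":name"),
  ":Name"   => symbol_body_token(":Name"),
  ":@ivar"  => symbol_body_token(":@ivar"),
  ":$gvar"  => symbol_body_token(":$gvar"),
  ":@@cvar" => symbol_body_token(":@@cvar"),
}
```

The event names Ripper reports (for example an identifier versus an instance variable token) line up with the alternatives listed in the grammar rule.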

The definitions above declare a symbol named block_arg. From the name we infer that it has something to do with blocks. Let's look at the rule for it:

block_arg    : tAMPER arg_value
            {
                /*%%%*/
                $$ = NEW_BLOCK_PASS($2);
                /*%
                $$ = $2;
                %*/
            }

block_arg is a combination of an ampersand token and a dependent arg_value symbol. The result of our rule is arg_value passed through the NEW_BLOCK_PASS macro. From the symbol definitions above we can see that both arg_value and block_arg have type NODE. Let's take a close look at the NEW_BLOCK_PASS macro:

#define NEW_BLOCK_PASS(b) NEW_NODE(NODE_BLOCK_PASS,0,b,0)

It calls another macro, NEW_NODE, to create a NODE of type NODE_BLOCK_PASS, passing our arg_value as the third parameter. Inside, it builds a new NODE with the flags set to NODE_BLOCK_PASS and a reference to the block value stored as u2.node.
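Newer MRI versions (2.6 and later, to my knowledge) expose the parser's node tree through the experimental RubyVM::AbstractSyntaxTree API, which makes it possible to see the block-pass node from Ruby. This is a sketch under that assumption; node_types is a hypothetical helper, and the API did not exist in the Ruby version this article patches.

```ruby
# Collects the type of every node in a RubyVM::AbstractSyntaxTree tree.
def node_types(node, found = [])
  # Children can also be symbols, nil or plain arrays; skip those.
  return found unless node.is_a?(RubyVM::AbstractSyntaxTree::Node)

  found << node.type
  node.children.each { |child| node_types(child, found) }
  found
end

plain_call = node_types(RubyVM::AbstractSyntaxTree.parse("f(:a)"))
block_call = node_types(RubyVM::AbstractSyntaxTree.parse("f(&:a)"))
```

Only the call written with & should contain a node of type :BLOCK_PASS, the Ruby-visible name of NODE_BLOCK_PASS.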

Writing our rule

Our rule should add special case to block_arg rule, that covers :symbol(arguments) syntax. It should return NODE tree that is equivalent to the following ruby method call:

:symbol.to_proc(arguments)

To do that we need to know how to build the structure for a method call. Let's search parse.y for a rule that is responsible for calling a method with arguments. Searching for '.' brings us to the following rule (I removed the Ripper comments):

        | primary_value '.' operation2
            {
            $<num>$ = ruby_sourceline;
            }
          opt_paren_args
            {
            $$ = NEW_CALL($1, $3, $5);
            nd_set_line($$, $<num>4);
            }

We are using the NEW_CALL macro to build a NODE_CALL node with primary_value, operation2 and opt_paren_args as arguments. The primary_value rule covers many possibilities; one of them is a literal symbol, which is created by the rule literal:

literal : numeric
        | symbol
            {
            /*%%%*/
            $$ = NEW_LIT(ID2SYM($1));
            /*%
            $$ = dispatch1(symbol_literal, $1);
            %*/
            }
        | dsym
        ;

It has NODE type and can be built from the symbol rule by passing the ID through a combination of ID2SYM (the reverse of our old friend SYM2ID) and NEW_LIT. This gives us enough information to extend the block_arg rule:

block_arg    : tAMPER arg_value
            {
                /*%%%*/
                $$ = NEW_BLOCK_PASS($2);
                /*%
                $$ = $2;
                %*/
            }
+           | tAMPER symbol paren_args
+           {
+               /*%%%*/
+               $$ = NEW_BLOCK_PASS(NEW_CALL(NEW_LIT(ID2SYM($2)),
+                                                    rb_intern("to_proc"),
+                                                    $3));
+               /*%
+                  $$ = 0;
+               %*/
+           }
        ;

First we build rule definition out of the simple subrules tAMPER, symbol and paren_args. In the rule body we pass a new method call to the NEW_BLOCK_PASS. It is built with NEW_CALL that takes three arguments: literal value from the symbol (second argument, because first one is tAMPER), string "to_proc" converted to ID with rb_intern and paren arguments.

We see that ID is used heavily in the Ruby source, and there are many functions to convert IDs to various types. Function rb_intern is one of the most useful ones and is declared in the ruby.h header file. It takes a C string and converts it into an ID. Its declaration:

ID rb_intern(const char*);
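At the Ruby level the closest analogue of rb_intern is String#to_sym (also available as String#intern). It returns a Symbol rather than a raw ID, so the sketch below is an analogy, not a view into the C API.

```ruby
# Interning the same characters always yields the same Symbol object,
# which is why the parser can build the method name from a C string.
from_string = "to_proc".to_sym
same_object = from_string.equal?(:to_proc)

# An interned name can be used to call a method, just like the ID in NEW_CALL.
result = "abc".public_send("upcase".to_sym)
```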

Now let's compile our code and make sure that everything works as expected.

irb(main):001:0> [1,2,3,4].map &:*(10)
 => [10, 20, 30, 40]
irb(main):002:0> [1, "this", 3, "is", 4, "awesome"].select &:is_a?(String)
 => ["this", "is", "awesome"]
irb(main):003:0> ["a, b", "c"].flat_map(&:split(','))
 => ["a", " b", "c"]

It works! With a very small and tidy patch we have added a new syntax to the Ruby language. But there are a couple more things.

Running tests

Just to make sure that we didn't break anything, we should run the Ruby test suite:

$ make test
# truncated output
test succeeded
PASS all 1008 tests

Looks good to me. But for a real Ruby patch, we should write test cases for our feature.

Reimplementing this stuff in vanilla Ruby

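And as a final note, we can implement almost the same thing with five lines of vanilla Ruby (at the cost of replacing the built-in Symbol#[], which normally indexes into the symbol's name):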

class Symbol
  def [](*passed_args)
    proc { |o, *args| o.send(self, *passed_args, *args) }
  end
end

Let's test it:

irb(main):001:0> [1,2,3,4].map &:*[10]
 => [10, 20, 30, 40]

Now we know something about how Ruby actually works. One warning though: you should not run your production systems on custom patched Ruby versions just because you can. A better way is to submit the patch to the Ruby mailing list, or to create a Ruby extension.

Tune in next time for the dive into the wonders of GC and Ruby VM.

30 Mar 2014