This article is a practical guide to the internals of MRI Ruby. Our goal is to add a new feature to the language and to learn something about how Ruby works inside. Given how huge this topic is, we cannot cover everything in one article, so we will look only at the relevant parts. I expect you to be familiar with Ruby and basic C, and to have a rough idea of how parsers work.
Ruby has a big and constantly changing codebase that is not very generous with comments. Therefore this article is more like a detective story than a set of pointers to documentation.
Let me introduce you to the syntax we are about to implement:
It is an extension to the &:symbol syntax that allows you to bind some parameters to the proc.
Quick recap on how the &:symbol stuff works.
When you pass a block argument to a method using the &object syntax, object.to_proc gets called internally and the result is used as the block argument.
Symbol#to_proc returns a proc that sends self to the first passed argument.
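The recap above can be checked on any stock Ruby. A minimal sketch (the method name upcase_three_ways is only for illustration):

```ruby
# Three equivalent ways to upcase every word in a list.
def upcase_three_ways(words)
  with_symbol = words.map(&:upcase)               # & calls Symbol#to_proc for us
  with_proc   = words.map(&:upcase.to_proc)       # the conversion spelled out
  with_block  = words.map { |word| word.upcase }  # what both forms stand for
  [with_symbol, with_proc, with_block]
end

results = upcase_three_ways(%w[apple banana])
puts results.uniq.length  # prints 1: all three forms give the same array

# The proc sends :upcase to its first argument.
puts :upcase.to_proc.call("ruby")
```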
So our feature is syntactic sugar for this:
Not a huge win in terms of characters typed, but it will allow us to write more expressive code. Some other use cases:
To simplify our task we are not going to add this syntax to all block arguments, only to symbols.
If you want to follow this article with the Ruby source I used, check out commit 44a9509f2ff2be85b97ade9806857e0948c29a1b.
We will begin thinking about our implementation by looking at the relevant parts of Ruby.
Symbol#to_proc is our natural choice.
Rdoc tells us that we need to look at the sym_to_proc C function, which sits in the string.c file.
I will put it here so you can look at the whole thing when needed.
An explanation will follow.
Let's look at the sym_to_proc type signature:
The function sym_to_proc takes one parameter of type VALUE and returns a VALUE.
We know that in the Ruby world Symbol#to_proc accepts no arguments and should return a Proc instance.
This very likely means two things.
First: sym is a reference to self passed to the function.
Second: since both the symbol and the proc have the same type (VALUE), that type is used to store any kind of Ruby object.
We are going to jump from one function to another very often. When doing this yourself, I suggest using Ctags to quickly jump between function symbols and ack to search inside the project. Personally I use Emacs with ack-mode and vanilla TAGS. Vim users have TAGS support and ack.vim; for Sublime I suggest the ctags plugin and the vanilla project search.
Jumping to the VALUE definition:
VALUE is just a pointer-sized integer type.
So values in Ruby are allocated somewhere and we just pass pointers around.
Now back to our function (I number lines only for the main function we analyze):
The variable sym_proc_cache is static (which means that its value is kept between calls of the function).
The name tells us that it will be used as a cache, but for now it is initialized with Qfalse.
Our first guess is that Qfalse is an instance of the false class, but let's check the definition:
OK, it's a macro casting RUBY_Qfalse to VALUE. Our next stop is RUBY_Qfalse:
RUBY_Qfalse is just a magic number cast to VALUE.
It means Qfalse is a special object referenced by a constant.
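The same idea can be observed from the Ruby side without reading C. A small sketch, assuming MRI: false, nil and true are single shared objects, so identity checks on them always succeed. The concrete object id values are an implementation detail that depends on the Ruby version and platform, so they are printed here rather than assumed. The helper same_object? is a name introduced only for this example.

```ruby
# Returns true when two references point at the very same object.
def same_object?(a, b)
  a.equal?(b)
end

# Special constants are not allocated per use, so these are all identical.
puts same_object?(false, (1 == 2))
puts same_object?(nil, [].first)
puts same_object?(true, (1 == 1))

# Ordinary objects are allocated separately, even with equal contents.
puts same_object?(String.new("a"), String.new("a"))

# The ids of the special constants; the numbers are an MRI detail.
p [false, nil, true].map(&:object_id)
```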
Back to sym_to_proc:
A couple of definitions for the cache size and some temporary variables:
Let's read the function names and try to understand what is going on.
If sym_proc_cache is not set, we create an array with rb_ary_tmp_new, mark it as "in use" for the GC with rb_gc_register_mark_object, and store Qnil (another global constant object like Qfalse) at index SYM_PROC_CACHE_SIZE*2 - 1 (as a side effect, growing the array enough to hold SYM_PROC_CACHE_SIZE*2 objects).
Now let's find out how the array is created. Jumping through rb_ary_tmp_new → ary_new → ary_alloc we get to the line where an array is allocated:
This line uses the NEWOBJ_OF macro to give us a fresh object initialized with the correct type.
Creating objects and managing their lifecycle is the job of the Garbage Collector defined in gc.c.
To save some time we won't pay attention to the GC this time, trusting Ruby to do the right thing.
But how do we do anything useful with the array? Here we use rb_ary_store to put nil into the array.
Searching for its definition in the header files points us to intern.h.
This file is included in ruby.h and is visible in almost every file inside the Ruby source.
It contains lots of helper functions, including some useful functions for manipulating arrays:
They can come in handy later. Returning to our function:
From the types of the arguments we can see that SYM2ID turns a VALUE into a long. As usual, let's look at the definition:
We are bit-shifting the VALUE and casting it to a long integer (its width is platform-dependent, so not necessarily 32 bits).
Here we are hashing id to an integer from 0 to SYM_PROC_CACHE_SIZE - 1 and then doubling the result.
Again, from the name and types we can understand that the RARRAY_PTR macro takes a Ruby array and gives us a pointer to the underlying C array.
Now we check our cache for a hit by comparing the pointer stored in aryp at that index with self (sym).
We can do this because symbols never go away once created, so they have constant addresses.
In case of a hit we return the next element in the array.
So we see that aryp is used as a simple list of pairs, where even elements are the keys (symbols) and odd elements are the cached procs. The position of each pair comes from the hashed index.
On a cache miss, we create a new proc with rb_proc_new, passing the function sym_call and the id of the symbol.
And then we update the cache with the new proc.
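The cache logic walked through above can be modelled in a few lines of Ruby. This is only a sketch under stated assumptions: the class name SymProcCacheSketch, the slot count of 67 and the use of object_id as a stand-in for SYM2ID are illustrative choices, and the proc body stands in for what rb_proc_new builds around sym_call.

```ruby
# A Ruby model of the symbol-proc cache described above.
# Assumptions: the slot count and the object_id based hash are illustrative.
class SymProcCacheSketch
  SLOTS = 67

  def initialize
    # Flat array of pairs: [key0, proc0, key1, proc1, ...]
    @pairs = Array.new(SLOTS * 2)
  end

  def fetch(sym)
    index = (sym.object_id % SLOTS) << 1  # hash to a slot, then double
    return @pairs[index + 1] if @pairs[index].equal?(sym)  # cache hit

    # Cache miss: build the proc and overwrite whatever was in the slot.
    new_proc = proc { |receiver, *args| receiver.public_send(sym, *args) }
    @pairs[index] = sym
    @pairs[index + 1] = new_proc
    new_proc
  end
end

cache = SymProcCacheSketch.new
first = cache.fetch(:upcase)
second = cache.fetch(:upcase)
puts first.equal?(second)  # the second lookup reuses the cached proc
puts first.call("ruby")
```

Note that two symbols hashing to the same slot simply evict each other, which keeps the structure tiny at the cost of an occasional miss.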
We will not investigate for now how rb_proc_new works, because it would require us to jump into the VM code, which is a topic for a few articles on its own.
We will assume it just creates a new proc using the passed C function pointer.
In the end sym_to_proc doesn't do much work after all.
It is just a caching wrapper for a new proc created with sym_call.
Now let's look at the sym_call body (I reformatted the code to fit the page width):
The type signature tells us that the function takes a few VALUE objects and a pointer to an array of VALUEs with length argc.
The function body checks that there is at least one argument and raises an ArgumentError otherwise.
A safe guess is that we call the method sym on the object obj, passing some arguments and a block to it.
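To make that guess concrete, here is a hypothetical Ruby rendering of what sym_call appears to do. The name sym_call_sketch and the use of public_send are assumptions made for the example; the C code dispatches through rb_funcall_with_block, and the error raised here is Ruby's own ArgumentError.

```ruby
# Hypothetical Ruby rendering of the behaviour described for sym_call.
def sym_call_sketch(method_name, args, &block)
  # Mirrors the argc check: a receiver is required.
  raise ArgumentError, "no receiver given" if args.empty?

  # The first argument is the receiver, the rest are passed through.
  receiver, *rest = args
  receiver.public_send(method_name, *rest, &block)
end

sum = sym_call_sketch(:+, [1, 2])
doubled = sym_call_sketch(:map, [[1, 2, 3]]) { |n| n * 2 }
p sum      # prints 3
p doubled  # prints [2, 4, 6]
```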
The rb_funcall_with_block type signature:
The argument names in the function declaration are consistent with our hypothesis.
The method id (mid) is passed using the ID type:
It's just a long integer.
In our function we got it from sym_to_proc.
We pass the symbol converted to ID (with SYM2ID), cast it to VALUE, and then inside sym_call we cast it back to ID.
This tells us that all method identifiers can be easily converted from symbols.
There is also a helper, rb_intern, that converts a C string to a method id.
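The Ruby-level counterpart of this conversion is String#to_sym (also available as String#intern). A small sketch, with method_name_from as a hypothetical helper, showing that a name built from a string resolves to the same symbol as the literal and can be used to call a method:

```ruby
# Build a method name from a string, similar in spirit to rb_intern in C.
def method_name_from(string)
  string.to_sym
end

name = method_name_from("up" + "case")
puts name.equal?(:upcase)      # same symbol object as the literal
puts "ruby".public_send(name)  # usable as a method name
```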
Now we have seen how Symbol#to_proc actually works. That means we are one step away from rewriting it to match our needs.
When we looked at how to_proc works we missed one thing: how this method is declared as a Ruby method.
For now we have the C function sym_to_proc, but we don't know how it is connected to to_proc.
Good news: this part of Ruby internals is well documented, because it is used by all native extensions.
You can read about it in class.c.
Short version: Ruby methods with a C implementation are defined with the rb_define_method function.
Let's look at its source:
The function func can take several kinds of arguments.
They are distinguished with argc:
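One way to observe this distinction from Ruby is Method#arity. A minimal sketch, assuming the usual MRI behaviour where a fixed argument count shows up as a non-negative number and the variable-length "argc, argv, self" style shows up as -1. The helper registered_arity is a name made up for this example.

```ruby
# Report the arity of an instance method without calling it.
def registered_arity(klass, method_name)
  klass.instance_method(method_name).arity
end

p registered_arity(Symbol, :to_proc)  # 0: takes only self
p registered_arity(Array, :at)        # 1: self plus one fixed argument
p registered_arity(Array, :push)      # -1: any number of arguments
```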
Now that we have seen all the relevant parts, we can think about an implementation for our feature.
We will do it in two steps.
First: we need to pass (curry) the Symbol#to_proc arguments to the resulting proc.
Second: add a new syntax that will call the modified to_proc function.
But before we can change any code, let me do a quick "Compile Ruby 101":
This will take a while, so go have a stretch. You can test that everything works as expected with:
sym_to_proc
Let's write a test case for our to_proc change:
Let's see it fail:
So our current goal is to extend the sym_to_proc implementation so that it can take multiple arguments.
We changed to_proc to take "self and args" style arguments, added an additional argument to the sym_to_proc signature, added some temporary variables and used the RARRAY_LEN macro to get the number of arguments.
We are not going to cache procs that have curried arguments. Caching such a proc would require some way to compare the parameters too, and that is too much work for us now.
Instead of passing the symbol id to our proc, we create a new array with initial capacity 2 and add our symbol and the proc arguments to it.
And we added a guard for our cache, updating it only if our function has no arguments.
sym_call
Now we have done our preparations and passed the arguments to sym_call.
We renamed the argument to sym_and_args, set up the variables that we need and got to the array values with the RARRAY_PTR macro.
We got the method id from our original symbol (the first element of sym_and_args).
And then we got the first argument from the current invocation of the proc, which is the method receiver.
If we have some original to_proc arguments (the second element of the array is not nil) and some arguments from the current call, we concatenate them into a new array.
In other cases we use one of the passed arrays.
And finally we change the method call to accept the new arguments.
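The steps above can be modelled in Ruby to see the argument handling in one place. This is a sketch, not the C patch: curried_sym_proc is a hypothetical name, and the order of concatenation (bound arguments first, then the arguments of the current call) is an assumption.

```ruby
# Ruby model of the modified sym_to_proc / sym_call pair.
# Assumption: bound arguments come before the arguments of the current call.
def curried_sym_proc(sym, *bound_args)
  # Same layout as sym_and_args: [symbol, bound arguments or nil].
  sym_and_args = [sym, bound_args.empty? ? nil : bound_args]

  proc do |*call_args, &block|
    raise ArgumentError, "no receiver given" if call_args.empty?

    method_name, bound = sym_and_args
    receiver, *rest = call_args
    args =
      if bound && !rest.empty?
        bound + rest   # both present: concatenate into a new array
      else
        bound || rest  # otherwise reuse whichever array we have
      end
    receiver.public_send(method_name, *args, &block)
  end
end

p [1, 2, 3].map(&curried_sym_proc(:+, 10))  # prints [11, 12, 13]
p %w[a b].map(&curried_sym_proc(:upcase))   # prints ["A", "B"]
```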
Let's test our implementation:
This implementation has some performance issues. In the worst case we create two new arrays for each call of the proc. Also, we don't cache procs with arguments, so our function will suffer a performance penalty when executed inside an iterator. But optimizing this is beyond our current scope.
So now we have to_proc working and it's time to look at the parser.
Ruby uses Bison to generate the parser for the language. Bison produces code that takes a stream of language tokens (symbols, operators, reserved words, numbers, strings etc.) from the lexer and turns it into an abstract syntax tree.
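Both stages can be inspected from Ruby itself with Ripper, the parser library from the standard library. A minimal sketch: the event names Ripper reports (such as on_symbeg) are Ripper's own and are not the token names used in the grammar file, and token_types is a helper named only for this example.

```ruby
require "ripper"

# Lexer stage: one [type, text] pair per token in the source.
def token_types(source)
  Ripper.lex(source).map { |_pos, type, text, _state| [type, text] }
end

source = "names.map(&:upcase)"
p token_types(source)

# Parser stage: the tree built from those tokens, as nested arrays.
tree = Ripper.sexp(source)
p tree.first  # the root node type
```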
The grammar for Ruby is defined inside parse.y.
It is an 11 kloc monster with lots of stuff inside, but the good news is that we don't need to read the whole file; we actually need only a small part of it.
Grammar rules are the heart of a parser: they define the valid language constructions and how the AST should be built.
Now I will give you a brief overview of Bison using the Ruby source as an example, and then we will implement our feature. If you get stuck or need a more detailed introduction, you can look at a Bison cheatsheet or the Bison documentation.
High level structure of Bison grammar files:
%{
C declarations
%}
Bison declarations
%%
Grammar rules
%%
Additional C code
But there is one twist in the parse.y file.
Ruby comes with the Ripper library for parsing Ruby source.
And Ripper uses annotated comments to do its job.
We will simply ignore them.
Let's look at the Bison declarations section. It begins with a %union block:
It describes the possible types that values can have. This mapping will be used when we define the language tokens and symbols.
We have already seen VALUE and ID, so let's look at the NODE definition:
It's a struct that can hold up to three different values and some flags.
It can be used as a tree structure because the union fields contain a recursive link to NODE.
The next step is to define the possible symbols of the Ruby language. They are created with the %token and %type directives.
They represent terminal and nonterminal symbols respectively.
From Bison documentation:
A terminal symbol (also known as a token type) represents a class of syntactically equivalent tokens. You use the symbol in grammar rules to mean that a token in that class is allowed. The symbol is represented in the Bison parser by a numeric code, and the yylex function returns a token type code to indicate what kind of token has been read. You don't need to know what the code value is; you can use the symbol to stand for it.
A nonterminal symbol stands for a class of syntactically equivalent groupings. The symbol name is used in writing grammar rules. By convention, it should be all lower case.
Let's search parse.y for some relevant token definitions:
Let's take symbol as an example: it is a nonterminal symbol with type ID.
And now the most important part, the grammar rule for symbol:
Comment blocks are special annotations for Ripper, so we can ignore them.
The first line says that symbol is a concatenation of two dependent symbols: tSYMBEG and sym.
tSYMBEG means ':', a special token marking the beginning of a symbol.
sym is a dependent rule that describes the possible character combinations for the symbol body.
The body of a rule describes the return value for a block. It should set the special variable $$ to some C value.
It can use the return values of dependent rules through numbered variables, starting with $1.
Let's look at the grammar rule sym:
The vertical line is an alternation operator: our symbol can be one of the following: fname, tIVAR, tGVAR, tCVAR.
These are symbols for different combinations of characters and numbers representing a function name, an instance variable, a global variable and a class variable respectively.
Examples of valid character combinations for the symbol rule are: :fname, :Fname, :@instance_variable, :$global_variable, :@@class_variable.
Note that this rule doesn't cover the :"string" syntax.
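The alternatives listed above can be seen in the lexer output that Ripper exposes. A sketch, assuming that Ripper's on_symbeg event plays the role of tSYMBEG and that the event following it reflects which alternative of sym matched; the helper symbol_token_types is a name invented for this example.

```ruby
require "ripper"

# Token types produced by the lexer for a short piece of source.
def symbol_token_types(source)
  Ripper.lex(source).map { |_pos, type, _text, _state| type }
end

# Each literal starts with the symbol-begin token, followed by one body token.
%w[:fname :Fname :@ivar :$gvar :@@cvar].each do |source|
  puts "#{source} -> #{symbol_token_types(source).inspect}"
end
```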
The token definitions above define a symbol named block_arg.
From the name we infer that it has something to do with blocks.
Let's look at the rule for it.
block_arg is a combination of an ampersand token and a dependent arg_value symbol.
The result of our rule is arg_value passed through the NEW_BLOCK_PASS macro.
From the symbol definitions table we can see that both arg_value and block_arg have type NODE.
Let's take a close look at the NEW_BLOCK_PASS macro:
It calls another macro, NEW_NODE, to create a NODE of type NODE_BLOCK_PASS, passing our arg_value as the third parameter.
Inside, it builds a new NODE with the NODE_BLOCK_PASS flag and a reference to the block value as u2.node.
Our change should add a special case to the block_arg rule that covers the :symbol(arguments) syntax.
It should return a NODE tree that is equivalent to the following Ruby method call:
To do that we need to know how to build the structure for a method call.
Let's search parse.y for a rule that is responsible for calling a method with arguments.
Searching for '.' brings us to the following rule (I removed the Ripper comments):
We use the NEW_CALL macro to build a call node with primary_value, operation2 and opt_paren_args as arguments.
The primary_value rule describes many possibilities; one of them is a literal symbol, which is created with the literal rule:
It has NODE type and can be built out of the symbol rule by passing the ID through a combination of ID2SYM (the reverse of our old friend SYM2ID) and NEW_LIT.
This gives us enough information to extend the block_arg rule:
First we build the rule definition out of the simple subrules tAMPER, symbol and paren_args.
In the rule body we pass a new method call to NEW_BLOCK_PASS.
It is built with NEW_CALL, which takes three arguments: the literal value from the symbol (the second component, because the first one is tAMPER), the string "to_proc" converted to ID with rb_intern, and the paren arguments.
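The relationship between the two spellings can be checked on an unpatched Ruby with Ripper. A sketch, relying on Ripper.sexp returning nil for source it cannot parse: the desugared form with an explicit to_proc call is already valid syntax, while the new short form is rejected until the grammar is extended. The helper parses? and the sample sources are made up for this example.

```ruby
require "ripper"

# True when the source is syntactically valid for the running Ruby.
def parses?(source)
  !Ripper.sexp(source).nil?
end

desugared  = "list.map(&:fetch.to_proc(0))"
new_syntax = "list.map(&:fetch(0))"

puts parses?(desugared)   # already valid syntax on a stock Ruby
puts parses?(new_syntax)  # only accepted once the grammar is patched
```

This only checks syntax; on a stock Ruby the desugared form would still fail at run time, because the unmodified Symbol#to_proc takes no arguments.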
We see that ID is used heavily in the Ruby source, and there are many functions to convert it to and from various types.
The function rb_intern is one of the most useful ones and is declared in the ruby.h header file.
It takes a C string and converts it into an ID.
Type definition:
Now let's compile our code and make sure that everything works as expected.
It works! With a very small and tidy patch we have added a new syntax to the Ruby language. But there are a couple more things.
Just to make sure that we didn't break anything, we should run the Ruby test suite:
Looks good to me. But for a real Ruby patch, we should write test cases for our feature.
And as a final note, we can implement almost the same thing with five lines of vanilla Ruby:
Let's test it:
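A minimal sketch of what such a plain-Ruby version could look like. Symbol#with is a hypothetical name picked for this example, and the body is an assumption rather than the listing that belongs to the patch:

```ruby
# Hypothetical helper: Symbol#with is a name chosen for this sketch.
class Symbol
  def with(*bound_args, &block)
    # The receiver arrives first; bound arguments are passed before the rest.
    proc { |receiver, *rest| receiver.public_send(self, *bound_args, *rest, &block) }
  end
end

p [1, 2, 3].map(&:+.with(10))         # prints [11, 12, 13]
p %w[a,b c,d].map(&:split.with(","))  # prints [["a", "b"], ["c", "d"]]
```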
Now we know something about how Ruby actually works. One warning though: you should not run your production systems on custom-patched Ruby versions just because you can. A better way is to submit a patch to the Ruby mailing list, or to create a Ruby extension.
Tune in next time for the dive into the wonders of GC and Ruby VM.