Deep dive into elasticsearch-rails integration

about.gitlab.com

Elasticsearch powers global full-text search.
You can use it from the search field at the top of the page.

There is no demo today.

Mostly Ruby code except the very end.

Implementation Overview

  • elasticsearch-rails common use pattern
  • How GitLab uses it

Abbreviations:

  • elasticsearch -> es
  • Elasticsearch -> ES

Part 1:

How does elasticsearch-rails gem work?

elasticsearch-rails gem

  • Three libraries in one
    1. elasticsearch-persistence
    2. elasticsearch-rails
    3. elasticsearch-model (the important one)

Proxy

They bridge our models and the Elasticsearch server.

There are two types of proxies:

1. ClassMethodsProxy

handling class level tasks such as searching
Issue -> class proxy -> (reads from server)

2. InstanceMethodsProxy

handling instance level tasks such as indexing one document
issue -> instance proxy -> (writes to server)

Basic setup

class Book < ActiveRecord::Base
  include Elasticsearch::Model
  include Elasticsearch::Model::Callbacks
end

__elasticsearch__

The two __elasticsearch__ methods give us proxy objects.

p = b.__elasticsearch__
p.class # InstanceMethodsProxy
p.target # b

p = Book.__elasticsearch__
p.class # ClassMethodsProxy
p.target # Book

Proxy object

Proxy object contains all the commands we need.

Book.__elasticsearch__.import
Book.__elasticsearch__.search('foobar').records
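
To make the proxy idea concrete, here is a minimal plain-Ruby sketch of the pattern. It does not use the gem; `TinySearch`, `ClassProxy`, and `InstanceProxy` are hypothetical names invented for illustration, and the real proxies do much more.

```ruby
# A toy version of the proxy pattern; not the gem's implementation.
module TinySearch
  # A proxy only needs a reference back to what it wraps.
  class Proxy
    attr_reader :target

    def initialize(target)
      @target = target
    end
  end

  class ClassProxy < Proxy; end
  class InstanceProxy < Proxy; end

  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Class-level entry point: one proxy per model class.
    def __elasticsearch__
      @__elasticsearch__ ||= ClassProxy.new(self)
    end
  end

  # Instance-level entry point: one proxy per record.
  def __elasticsearch__
    @__elasticsearch__ ||= InstanceProxy.new(self)
  end
end

class Book
  include TinySearch
end

b = Book.new
puts b.__elasticsearch__.target.equal?(b)       # true
puts Book.__elasticsearch__.target.equal?(Book) # true
```

Both entry points share a name but live at different levels, which is why the model can hand out a class proxy and an instance proxy without any naming conflict.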

Indexing (1)

i = Issue.new(...)
i.save!

Indexing (2)

ES::Model::Callbacks module adds Rails model callbacks:

after_commit lambda { __es__.index_document  }, on: :create ⭐
after_commit lambda { __es__.update_document }, on: :update
after_commit lambda { __es__.delete_document }, on: :destroy

Indexing (3)

# Elasticsearch::Model::Indexing::InstanceMethods
def index_document(options={})
  document = self.as_indexed_json  # ①

  client.index(
    { index: index_name,
      type:  document_type,
      id:    self.id,
      body:  document }.merge(options)
  ) # ③
end

# Elasticsearch::Model::Serializing::InstanceMethods
def as_indexed_json(options={})
  self.as_json(options.merge root: false) # ②
end

Indexing (4) as_indexed_json

We can define our own index data:

class Issue < ActiveRecord::Base
  # Accept options so the signature matches the default implementation.
  def as_indexed_json(options = {})
    { title: self.title }
  end
end
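
The indexing flow above can be tried without a server. The following is a rough sketch under stated assumptions: `RecordingClient` is a hypothetical stand-in for the real HTTP client, the index name `'issues'` is made up, and the top-level `index_document` helper only mirrors the three steps on the slide (serialize, build the request, send it).

```ruby
# Stand-in for the HTTP client: records requests instead of sending them.
class RecordingClient
  attr_reader :requests

  def initialize
    @requests = []
  end

  def index(params)
    @requests << params
    { 'result' => 'created' }
  end
end

class Issue
  attr_reader :id, :title, :description

  def initialize(id:, title:, description:)
    @id = id
    @title = title
    @description = description
  end

  # Only the title is sent to the index; the description stays out.
  def as_indexed_json(_options = {})
    { title: title }
  end
end

# Mirrors index_document: serialize the record, build the request, send it.
def index_document(record, client, options = {})
  document = record.as_indexed_json
  client.index({ index: 'issues', id: record.id, body: document }.merge(options))
end

client = RecordingClient.new
issue = Issue.new(id: 1, title: 'Search is slow', description: 'internal details')
index_document(issue, client)
p client.requests.first
```

Because the request body comes entirely from as_indexed_json, overriding that one method is enough to control what becomes searchable.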

Method delegation

METHODS = [:search, :mapping, :mappings,
:settings, :index_name, :document_type, :import]

METHODS.each do |method|
  delegate method, to: :__elasticsearch__ unless
    self.public_instance_methods.include?(method)
end

Book.import
Book.search('foobar').records
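
The "delegate unless already defined" rule can be sketched in plain Ruby without ActiveSupport. This is only an illustration: `FakeClassProxy` and `Shortcuts` are hypothetical names, and the sketch uses `define_singleton_method` where the slide uses `delegate`.

```ruby
# Hypothetical proxy with two class-level commands.
class FakeClassProxy
  def search(query)
    "proxy search: #{query}"
  end

  def import
    'proxy import'
  end
end

module Shortcuts
  METHODS = %i[search import].freeze

  def self.included(base)
    METHODS.each do |name|
      # Skip names the model already defines, so existing behavior wins.
      next if base.respond_to?(name)

      base.define_singleton_method(name) do |*args|
        __elasticsearch__.public_send(name, *args)
      end
    end
  end
end

class Book
  def self.__elasticsearch__
    @__elasticsearch__ ||= FakeClassProxy.new
  end

  # Book already has its own import, so no shortcut is installed for it.
  def self.import
    'custom import'
  end

  include Shortcuts
end

puts Book.search('foobar') # proxy search: foobar
puts Book.import           # custom import
```

The guard matters for models that already define a method such as `search` or `import`: the shortcut is skipped and the proxy stays reachable through `__elasticsearch__`.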

Anatomy of a Proxy

class ClassMethodsProxy
  include ES::Model::Client::ClassMethods
  include ES::Model::Naming::ClassMethods
  include ES::Model::Indexing::ClassMethods
  include ES::Model::Searching::ClassMethods
  include ES::Model::Importing::ClassMethods
end

class InstanceMethodsProxy
  include ES::Model::Client::InstanceMethods
  include ES::Model::Naming::InstanceMethods
  include ES::Model::Indexing::InstanceMethods
  include ES::Model::Serializing::InstanceMethods
end

Anatomy of a proxy

Each proxy has access to a client.

def client
  @client ||= Elasticsearch::Model.client
end
  • just a Faraday client.
  • knows server url, port, etc.
  • by default, all proxies share the same client
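
A small sketch of the "shared by default, overridable per proxy" behavior, in plain Ruby. `TinyModel`, `TinyProxy`, `FakeClient`, and both URLs are hypothetical; the sketch assumes only the memoized lookup shown on the slide plus a writer for overriding it.

```ruby
# Hypothetical client that only remembers where the server is.
FakeClient = Struct.new(:url)

module TinyModel
  class << self
    attr_writer :client

    # Library-wide default client, created once.
    def client
      @client ||= FakeClient.new('http://localhost:9200')
    end
  end
end

class TinyProxy
  attr_writer :client

  # Falls back to the shared client unless this proxy was given its own.
  def client
    @client ||= TinyModel.client
  end
end

v1 = TinyProxy.new
v2 = TinyProxy.new
puts v1.client.equal?(v2.client) # true

v2.client = FakeClient.new('http://other-host:9200')
puts v1.client.equal?(v2.client) # false
```

A per-proxy client is what later allows two versions of the search code to talk to two different indices or servers.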

Part 2:

How does GitLab use elasticsearch-rails?

Problem

  • ES indexing took a long time (up to many days)
  • Some schema changes required re-indexing
  • During reindexing, search results can be incomplete

Goal:

Zero downtime search when data schema changes

#328

Decouple schema and search code for ActiveModels to allow for versioned schema

  • Have multiple versions of logic
  • dynamically choose which to call at run-time

Rails model -> Switchboard -> ES-rails proxy

⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️

The implementation will be explained in 3 parts.

Example: Snippet search

Snippet used to include the SnippetSearch module,
which contained the search-related logic.

However, a module does not allow dynamic swapping.

The proxy design is flexible in that we can have separate classes.

Instead of using the same class for all kinds of searches,
we can subclass these proxies:

  • SnippetClassProxy < ClassMethodsProxy
  • SnippetInstanceProxy < InstanceMethodsProxy

And then we can have different versions of Snippet proxies:

  • V12p1::SnippetClassProxy < SnippetClassProxy
  • V13p0::SnippetClassProxy < SnippetClassProxy

Common logic is extracted into a common superclass

V12p1::SnippetClassProxy

is a subclass of

V12p1::ApplicationClassProxy

is a subclass of

ClassMethodsProxy

How do we choose which version to use?

Switchboard

Switchboard

Previously:

model ---__es__---> proxies

Now:

model ---__es__---> switchboard ------> proxies

Switchboard Classes

  • MultiVersionClassProxy
  • MultiVersionInstanceProxy

Q:

How do we choose which version to route to?

A:

It is decided case by case.
For example, suppose we have two indices, v1 and v2:

Assuming v1 is in sync, v2 is still indexing:

method          | version
searching       | v1
indexing        | v1 & v2
removing index  | manually selected

elastic_reading_target

returns one version, the synced version, e.g.:

def elastic_reading_target
  version('V12p1')
end

elastic_writing_targets

returns an array of all versions

def elastic_writing_targets
  [
    version('V12p1'),
    version('V12p2')
  ]
end

methods_for_all_write_targets

Array of methods to be forwarded to all versions:

def methods_for_all_write_targets
  [:index_document, :delete_document,
  :update_document, :update_document_attributes]
end

methods_for_one_write_target

Array of methods that are not delegated automatically; the caller specifies which version to call:

def methods_for_one_write_target
  [:import, :create_index!, :delete_index!]
end

Switchboard Recap

method(s)                         | versions to delegate to
methods other than those below    | elastic_reading_target
methods_for_all_write_targets     | elastic_writing_targets
methods_for_one_write_target      | user defined

Forwarding to multiple write versions

def generate_forwarding
  methods_for_all_write_targets.each do |method|
    self.class.forward_to_all_write_targets(method) # ①
  end
  # ...
end

def self.forward_to_all_write_targets(method)
  return if respond_to?(method)

  define_method(method) do |*args|
    # ② call every write target
    responses = elastic_writing_targets.map do |elastic_target|
      elastic_target.public_send(method, *args)
    end
    # ③ return a failed response if there is one, otherwise the last one
    responses.find { |response|
      response['_shards']['successful'] == 0
    } || responses.last
  end
end
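
The fan-out can be exercised end to end with fake targets. This is a sketch under assumptions, not GitLab's code: `FakeTarget` and `Switchboard` are hypothetical, the canned response imitates only the `_shards.successful` field used above, and the guard uses `method_defined?` for simplicity.

```ruby
# Stand-in for one versioned proxy; returns a canned Elasticsearch-like response.
class FakeTarget
  attr_reader :calls

  def initialize(successful_shards)
    @successful_shards = successful_shards
    @calls = []
  end

  def index_document(*args)
    @calls << args
    { '_shards' => { 'successful' => @successful_shards } }
  end
end

class Switchboard
  attr_reader :elastic_writing_targets

  def initialize(targets)
    @elastic_writing_targets = targets
  end

  # Defines `method` so that it fans out to every write target.
  def self.forward_to_all_write_targets(method)
    return if method_defined?(method)

    define_method(method) do |*args|
      responses = elastic_writing_targets.map do |target|
        target.public_send(method, *args)
      end

      # Surface a failed response if any target reported zero successful shards.
      responses.find { |r| r['_shards']['successful'].zero? } || responses.last
    end
  end

  forward_to_all_write_targets(:index_document)
end

ok = FakeTarget.new(1)
failed = FakeTarget.new(0)
response = Switchboard.new([failed, ok]).index_document
puts response['_shards']['successful'] # 0
```

Every target is still called even when an earlier one fails, so a broken new index does not stop writes to the index that serves searches.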

Forwarding to single read version

def generate_forwarding
  # ... continue from earlier

  read_methods = elastic_reading_target
    .real_class.public_instance_methods  # ①

  read_methods -= methods_for_all_write_targets
  read_methods -= methods_for_one_write_target  # ②
  read_methods -= self.class.instance_methods
  read_methods.delete(:method_missing)

  read_methods.each do |method|
    self.class.forward_read_method(method) # ③
  end
end
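
The "read methods are whatever is left" rule can be checked in isolation. In this sketch, `FakeProxy` and `ReadSwitchboard` are hypothetical, the write lists are shortened, and forwarding is defined on the instance with `define_singleton_method` instead of on the class, purely to keep the example small.

```ruby
# Hypothetical proxy with one read method and two write methods.
class FakeProxy
  def search(query)
    "search #{query}"
  end

  def index_document
    'indexed'
  end

  def import
    'imported'
  end
end

class ReadSwitchboard
  WRITE_ALL = %i[index_document].freeze
  WRITE_ONE = %i[import].freeze

  def initialize(reading_target)
    @reading_target = reading_target
    generate_forwarding
  end

  private

  def generate_forwarding
    # Start from everything the target responds to publicly ...
    read_methods = @reading_target.class.public_instance_methods
    # ... then remove write methods and anything the switchboard already has.
    read_methods -= WRITE_ALL
    read_methods -= WRITE_ONE
    read_methods -= self.class.instance_methods

    read_methods.each do |name|
      define_singleton_method(name) do |*args|
        @reading_target.public_send(name, *args)
      end
    end
  end
end

board = ReadSwitchboard.new(FakeProxy.new)
puts board.search('foobar')              # search foobar
puts board.respond_to?(:index_document)  # false
```

Subtracting the switchboard's own instance methods also removes everything inherited from Object, so only methods specific to the proxy end up forwarded.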

class and real_class

elasticsearch-model tries to be smart and overrides the class method on the instance proxy.

SnippetInstanceProxy#class would be SnippetClassProxy 😱

This can result in cryptic errors.
To obtain the actual class, real_class is defined.

def real_class
  self.singleton_class.superclass
end
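
Why `singleton_class.superclass` works can be shown with a toy object. `SneakyProxy` is a hypothetical class whose only job is to override `class`, imitating the situation described above.

```ruby
class SneakyProxy
  # Mimics a proxy that overrides #class to return something else.
  def class
    :not_my_class
  end

  # An object's singleton class always inherits from its real class,
  # so this lookup is unaffected by the override above.
  def real_class
    singleton_class.superclass
  end
end

proxy = SneakyProxy.new
p proxy.class      # :not_my_class
p proxy.real_class # SneakyProxy
```

The lookup goes through Ruby's object model rather than through the overridable `class` method, which is why it stays reliable.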

How are targets specified?

def version(version)
  version = Elastic.const_get(version, false) if version.is_a?(String)
  # Now version is Elastic::V12p1
  version.const_get(proxy_class_name, false).new(data_target)
  # Now we return Elastic::V12p1::IssueInstanceProxy
end

def proxy_class_name
  "#{@data_class.name}InstanceProxy"
  # @data_class is the model class, e.g. `Issue`
end
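
The lookup can be run with stub namespaces. Everything here is a stand-in: the `Elastic` module, the two version modules, and `VersionResolver` are hypothetical and only mirror the shape of the code above.

```ruby
# Stub namespaces: one module per schema version, each with its own proxy class.
module Elastic
  module V12p1
    class IssueInstanceProxy
      attr_reader :target

      def initialize(target)
        @target = target
      end
    end
  end

  module V12p2
    class IssueInstanceProxy < V12p1::IssueInstanceProxy; end
  end
end

class Issue; end

class VersionResolver
  def initialize(data_target)
    @data_target = data_target
    @data_class = data_target.class
  end

  # Accepts a version name ('V12p1') or the namespace module itself.
  def version(version)
    namespace = version.is_a?(String) ? Elastic.const_get(version, false) : version
    namespace.const_get(proxy_class_name, false).new(@data_target)
  end

  private

  def proxy_class_name
    "#{@data_class.name}InstanceProxy"
  end
end

issue = Issue.new
resolver = VersionResolver.new(issue)
p resolver.version('V12p1').class       # Elastic::V12p1::IssueInstanceProxy
p resolver.version(Elastic::V12p2).class # Elastic::V12p2::IssueInstanceProxy
```

Passing `false` as the second argument to `const_get` restricts the lookup to the given namespace, so a missing version raises NameError instead of silently resolving to a constant defined elsewhere.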

2.3 Rails Model

Elastic::ApplicationVersionedSearch

  • provides the __elasticsearch__ methods
    (returning switchboard)
  • permission checks
  • misc

Elastic::ApplicationVersionedSearch

def __elasticsearch__(&block)
  @__elasticsearch__ ||=
    ::Elastic::MultiVersionInstanceProxy.new(self)
end

class_methods do
  def __elasticsearch__
    @__elasticsearch__ ||=
       ::Elastic::MultiVersionClassProxy.new(self)
  end
end

Now we can have two versions of the search code.
Each can have its own client, pointing to a different index.

We can even point two versions to two different cloud providers.

Current status: pending

Currently we only have one version.
We hard-code that version in elastic_reading_target and elastic_writing_targets.

I still think there are some benefits:

  • Cleaner model
  • Testing can be done on proxies

Next step

  • maybe we can rename the class to be actually "switchboard"?
  • maybe you would prefer to remove the switchboard?
  • generate_forwarding should not be done per initialization

Special Thanks

Markus Koller
Marcel van Remmerden
Denys Mishunov
Darva Satcher
Kai Armstrong
James Lopez

Q&A

Hello everyone, my name is Mark. I am currently on the Fulfillment team, but I was on the Search team for a while. I thought it would be beneficial to give a presentation on the Elasticsearch integration before my memory fades.

For those of you who don't know Elasticsearch, it offers full-text search. I believe it is currently accessed only through the top-right search field.

There is no demo today, because all the changes I am discussing are behind the scenes and do not affect the user.

So today's session is split into two parts. The first part briefly introduces the elasticsearch-rails architecture. The second part focuses on how we used the library in a slightly different way, and the reason behind it.

Since the word "elasticsearch" is so long, I'll often abbreviate it to fit it on the slides.

So, part one. elasticsearch-rails is a Ruby gem maintained by Elastic, the company.

The gem consists of three parts. elasticsearch-persistence allows Rails to save data on an Elasticsearch server instead of a SQL database. We don't use this. elasticsearch-rails provides some useful utilities for Rails. We use part of it for instrumentation. The most important part is elasticsearch-model. This enables ActiveModel to talk to the Elasticsearch server. We will only be covering this part today.

The core of elasticsearch-model is a proxy that bridges the Rails model and the server. All the search-related logic resides inside the proxy, and all search commands go through it.

There are two different proxies, one for the class level and one for the instance level. The instance-level proxy is closely coupled with a single ActiveRecord record. For example, it generates the data to store from the record. The class-level proxy handles higher-level commands, such as search or import.

In the official README, you will see that the simplest setup is to include the model module like this. By doing so, we gain all the search functionality.

After including it, the `__es__` methods become available at both the class level and the instance level. The class-level `__es__` method returns a ClassMethodsProxy, and the instance-level `__es__` method returns an InstanceMethodsProxy. Both proxies point back to their source through the `target` method.

We call all search-related functionality through these proxies.

Here I'll use indexing as an example to show how the gem works. When an issue is created, what happens behind the scenes?

Previously we included the `Callbacks` module, which sets up three after_commit callbacks for create, update, and delete. Since we are creating a record, the first callback is triggered. It calls the proxy's index_document method.

index_document first prepares the data for indexing. Here at point 1 we see it calls the as_indexed_json method. By default this method serializes all the attributes. Then a client is used to send this data to the server. At this point, everything is done.

But sometimes we don't necessarily want every attribute to be searchable. We might only want to search the title. Here we can override the default by defining our own as_indexed_json in our model. It just returns a hash of the things we need indexed.

The method name __elasticsearch__ is very long, and typing it all the time is tiresome. elasticsearch-rails also provides convenience methods to bypass this. A few of the methods are delegated to the proxy objects, so now we can just type Book.search directly.

Proxies are composed of many modules. This means we can cherry-pick only the things we need.

The proxy itself uses a client to send HTTP requests to the server. By default, all proxies share the same client, which is just a Faraday client. The client holds the information on where to find the server, such as the URL and the port.

In summary, the simple setup gives us two methods to access the proxy objects, and from there we can talk to the server.

At GitLab we use Elasticsearch in a slightly different way.

Until last year, we had issues with reindexing. Every time we reindexed, it would take as long as a week. During this time, search results would be incomplete. Yet we do need to reindex, since we change the data schema from time to time.

So our goal was to allow zero-downtime search when the data schema changes.

The approach we decided on was to allow multiple versions of the search code to co-exist, and to determine which version to call at run time.

For example, we used to put all snippet-related search logic in the SnippetSearch module, which is included by the Snippet model. This is less flexible if we want to have multiple versions. On the other hand, classes and objects are very suitable for this kind of task.

To keep things DRY, we also extract the common logic into a superclass.

We have a switchboard that redirects commands to the desired version.

Recall that earlier, in the simplest setup, `__es__` would take us from the record to the proxy. Now `__es__` takes us to a switchboard, which then forwards calls to the correct version class.

At the time I named the switchboard classes like this, which seems kind of wordy now.

The names of the switchboards are `MultiVersionClassProxy` and `MultiVersionInstanceProxy`. Not great names. I didn't have a good name when I wrote these.

To summarize: each cell of the table I showed at the beginning can be expressed with the four methods we just covered.

So how is the table represented in code? The `generate_forwarding` method is responsible for setting up method forwarding at boot time. Let's look at the first step. For each method in methods_for_all_write_targets we call `forward_to_all_write_targets`. This dynamically defines a method that calls each writing target (as indicated by circle 2). Lastly, we collect all the responses. We return the unsuccessful one if it exists; otherwise the call is considered successful.

That's it for write operations. For read operations, we only need to forward to the one version that is in sync. How do we determine which are read methods? We first take all the methods and filter out the write methods. We also filter out the instance methods here. Lastly we filter out method_missing. The remaining methods are considered read methods, and we call forward_read_method on each of them (see circle 3). The second step of generate_forwarding is to set up forwarding for the methods that need only one version.

Let's take a detour here and talk about the `real_class` method. elasticsearch-model tries to be smart and overrides the `class` method on the instance proxy to point to the class proxy. `SnippetInstanceProxy#class` would be `SnippetClassProxy`. This can result in cryptic errors. To obtain the actual class, I had to write a method of my own, and I call it `real_class`.

So when we call the version method, what do we get? We first get the namespace, and then we call proxy_class_name to get the name of the proxy we want. Lastly we fetch the proxy with that name from the namespace.

That's all for the switchboard. Now we are at the final part: connecting the Rails model to the switchboard. Each searchable model includes the ApplicationVersionedSearch module. It provides access to the switchboard, permission checks, and various other things.

Here we can see that it simply instantiates the switchboard classes, one for the class level and one for the instance level.

This is an overview diagram of where each class sits: we have ActiveModel at the top left, which has access to the switchboards, and the switchboards determine which version of the proxy to pass the command on to.

So we have explained all three layers of the integration. Now we can have two versions of the search logic. This means we can have separate client settings, pointing to different indices.

Theoretically we can even point to two different clouds. We can migrate from one cloud provider to the next.

Currently we only have one version. We hard-code `elastic_reading_target` and `elastic_writing_targets` to that version. This is because we switched our focus to enabling global search on gitlab.com first.

* In the past, we included many ES-related methods in our models
* Now the models are slimmer