Remove the need for external tools for managing Elasticsearch #6283

yurishkuro · 2024-12-01T00:22:22Z

PR #5797 is a good example of eliminating the need for a schema initialization script for Cassandra by moving the logic directly into the Jaeger binary.

We have three standalone tools for managing ES that we can also merge into the Jaeger binary directly. All of them are already implemented in Go, so they are fairly straightforward to embed in the Jaeger binary (fewer binary artifacts to release).

cmd/es-index-cleaner/ is used as a cron job to delete old ES indices. The logic can instead be executed from the Jaeger binary on a timer.
cmd/es-rollover/ - similar to cleaner but for rolling over indices.
esmapping-generator/ - just prints the index mappings to stdout (probably for manual installation)

Note: this issue is not about removing the existing binaries, which we still want to maintain until Jaeger v1 EOL in 2026. We just want to make them not required for operating Jaeger-v2.

Tasks

extend configuration for ES in Jaeger v2 to support additional parameters for the cleaner & rollover scripts, such as how often they run, including an enabled property for each.
enable index cleaner capability in v2, update integration tests to use that
enable index rollover capability in v2, update integration tests to use that
add a sub-command es-mappings to jaeger v2 binary to print es mappings
update documentation accordingly to no longer reference the separate scripts


Rohanraj123 · 2024-12-01T08:07:16Z

I am working on it

akstron · 2024-12-01T15:16:59Z

Hi @yurishkuro, for cmd/es-index-cleaner/ we would only need to add the filters to the config; the other fields used for client creation would already be available to us.
I believe the config below should suffice:

type Cleaner struct {
	// Separator between date fragments.
	IndexDateSeparator string
	// Whether to filter archive indices.
	Archive bool
	// Whether to filter rollover indices.
	Rollover bool
	// Indices created before this date will be deleted.
	DeleteBeforeThisDate time.Time
}

Does it sound good?

yurishkuro · 2024-12-01T17:32:16Z

IndexDateSeparator - if the main storage doesn't need this argument, why does cleaner need it?

Archive bool - archive storage is configured as separate storage in v2, so it can have its own cleaner if desired.

Rollover bool - not clear. What's the alternative?

DeleteBeforeThisDate time.Time - doesn't make sense, cleaner is a permanent periodic job, not something that takes an absolute time. We already have max span age parameter iirc, is it not sufficient to drive the behavior of the cleaner?

akstron · 2024-12-02T17:48:36Z

I mostly took out the filters used in the current implementation.

IndexDateSeparator - if the main storage doesn't need this argument, why does cleaner need it?

Yes, I don't think we need it. There is DateLayout in the index config, which we can probably use to find indices here.

Archive bool - archive storage is configured as separate storage in v2, so it can have its own cleaner if desired.

Makes sense. Thanks

Rollover bool - not clear. What's the alternative?

I was thinking I could reuse the exact filter that the current implementation uses for the rollover case. But I probably have to read more about how it is used here before commenting further. If you can provide me with some pointers, that would be really helpful. Thanks

DeleteBeforeThisDate time.Time - doesn't make sense, cleaner is a permanent periodic job, not something that takes an absolute time. We already have max span age parameter iirc, is it not sufficient to drive the behavior of the cleaner?

I was mostly referring to the computed point in time before which indices should be deleted. That calculation should actually be based on a time.Duration, i.e. the spanMaxAge you are referring to.

Also, looks like rohan has already started working on it.

akstron · 2024-12-02T17:52:49Z

Also, shouldn't this be added as an extension?

yurishkuro · 2024-12-03T00:40:04Z

it could be an extension, but I think it's so closely coupled with the storage implementation that it makes more sense for it to be just an option in the config

akstron · 2024-12-03T18:09:32Z

Rollover bool - not clear. What's the alternative?

Just wanted to add this info. We won't need this config either. Rollover indices can be treated like the index-per-day pattern, where we just need to delete them based on their creationTime.
In conclusion, I believe maxAge should be more than enough for cleaning indices.

akstron · 2024-12-18T17:13:59Z

Hi @Rohanraj123 , let me know if you are working on it? I would like to take up merging cmd/es-index-cleaner/. Thanks

Rohanraj123 · 2024-12-22T08:38:28Z

Hi @Rohanraj123 , let me know if you are working on it? I would like to take up merging cmd/es-index-cleaner/. Thanks

yeah go ahead please!

akstron · 2024-12-23T12:24:23Z

Thanks, I will start working on it.

akstron · 2024-12-23T14:22:22Z

Hi @yurishkuro , I see that for cmd/es-index-cleaner, the IndicesClient wraps a simple http client underneath instead of wrapping the es go client library. Do you know if there is any particular reason for it? If not, should we refactor it first to use the es go client library?

yurishkuro · 2024-12-23T20:52:18Z

ideally all logic should go through regular clients

Manik2708 · 2024-12-30T15:48:19Z

@yurishkuro I am a little confused about one thing: after this change, will we still ship the cmds (for those who want to keep these services under their own control), should we discard the cmds and move the code to pkg, or should we do a mixture of the two, keeping the cmds as they are but moving the code to a shared package for es and cmd?

yurishkuro · 2024-12-30T16:06:39Z

@Manik2708 they should become sub-commands of jaeger binary, i.e. added to this list:

$ go run ./cmd/jaeger help
2024/12/30 11:06:11 application version: git-commit=, git-version=, build-date=
Jaeger backend v2

Usage:
  jaeger [flags]
  jaeger [command]

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  components  Outputs available components in this collector distribution
  docs        Generates documentation
  help        Help about any command
  validate    Validates the config without running the collector
  version     Print the version.

Manik2708 · 2024-12-30T16:41:16Z

@yurishkuro Also is there any connection between RolloverFrequency and es-rollover?

yurishkuro · 2024-12-30T16:55:07Z

@Manik2708 you have to check the code where/how it's used

Manik2708 · 2025-01-02T08:09:00Z

@yurishkuro ES index rollover runs against a single server, but once it is integrated into the Jaeger binary we can have multiple servers. Should we always run rollover for all servers, or should we ask the user through the config file, like this:

type Rollover struct {
	...
	// RunForAllServers enables rollover for all servers
	RunForAllServers bool
	// RunForServers enables rollover for the provided servers if RunForAllServers is false
	RunForServers []string
}

yurishkuro · 2025-01-02T15:52:47Z

@Manik2708 there are two ways

Can we make the operation idempotent or no-op if already done? If we can then it's safe to run from all servers. This would be the preferred way. For example, Cassandra schema initialization works this way.
If we must have a single server doing this, then we will have to use leader/follower election, for which we already have the code (it's only used with adaptive sampling, look for DistributedLock).

Manik2708 · 2025-01-06T14:03:24Z

@Manik2708 there are two ways

Can we make the operation idempotent or no-op if already done? If we can then it's safe to run from all servers. This would be the preferred way. For example, Cassandra schema initialization works this way.

If we must have a single server doing this, then we will have to use leader/follower election, for which we already have the code (it's only used with adaptive sampling, look for DistributedLock).

@yurishkuro As far as I understand from reading the code and using it, all the operations of the es-rollover cmd seem to be idempotent. Still, I want to clear up some doubts:

Are all operations of es-rollover (cmd) idempotent?
If yes, can we then run it for all servers asynchronously, like this:

for _, server := range configuration.Servers {
	go startEsRollover(configuration, server, logger)
}

Should the init operation be run only once (after setting up Jaeger) or every time before running rollover?

yurishkuro · 2025-01-06T15:42:06Z

Are all operations of es-rollover (cmd) idempotent?

this is my question to you. Also the timing - if one instance initiates a rollover at :00min and then another at :05min, what's going to happen?

Why do you need to run goroutine per-server? There's only one cluster.

init

As I understand init needs to execute only once, so you may need to add a check for it to verify if it's already been executed on a cluster. I would approach all of steps like that - check first, then run.
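
The "check first, then run" approach can be sketched as follows. This is only an illustration under stated assumptions: `cluster` is a hypothetical in-memory stand-in for an ES cluster, and `errAlreadyExists` stands in for the `resource_already_exists_exception` error type; neither is part of Jaeger or of an ES client library.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the "resource already exists" error
// a real cluster would return (an assumption for this sketch).
var errAlreadyExists = errors.New("resource_already_exists_exception")

// cluster is a hypothetical in-memory stand-in for an ES cluster.
type cluster struct {
	indices map[string]bool
}

func (c *cluster) indexExists(name string) bool { return c.indices[name] }

func (c *cluster) createIndex(name string) error {
	if c.indices[name] {
		return errAlreadyExists
	}
	c.indices[name] = true
	return nil
}

// ensureIndex checks first and creates only when needed.
// It reports whether this call actually created the index.
func ensureIndex(c *cluster, name string) (bool, error) {
	if c.indexExists(name) {
		return false, nil // already initialized: no-op
	}
	err := c.createIndex(name)
	if errors.Is(err, errAlreadyExists) {
		// Another instance created it between our check and our create.
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	c := &cluster{indices: map[string]bool{}}
	// Calling twice is safe: the second call is a no-op.
	for i := 0; i < 2; i++ {
		created, err := ensureIndex(c, "jaeger-span-000001")
		fmt.Println(created, err)
	}
}
```

Treating the "already exists" error as success covers the race between the check and the create when several instances start at the same time.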

Manik2708 · 2025-01-06T20:30:07Z

@yurishkuro So as per my search:

Init was already made idempotent by "Make rollover idempotent" #1407. This can also be seen in the Go code:

jaeger/cmd/es-rollover/app/init/action.go

Lines 87 to 91 in 15a83a7

    
	// check for reason, ignore already exist error
	if strings.Contains(errorMap["type"].(string), "resource_already_exists_exception") {
		return nil
	}
}

So we can call init every time we call rollover; init will always check first and then run (as you said in your comment).

Rollover: The operation in Elasticsearch is itself idempotent, meaning rollover doesn't happen on data that has already been rolled over. So we can also call rollover every time without worrying about idempotency. (This is also why no es-rollover user has run into idempotency problems: rollover has no effect on already rolled-over data.)
Lookback: This only cleans the data, so if it is already cleaned it returns without any error (logging that no index was present for that alias).
The doubt I have is this: the official ES client takes an array of servers, so it takes care of multiple nodes of the same cluster, but the client used in es-rollover is a plain HTTP client, not an ES client. As far as I know (please correct me if I am wrong), in an ES cluster communicating with one node is sufficient to make a change in the cluster; the official client uses the server list for load balancing, leader/follower handling, or similar algorithms. So I think using a single server with leader/follower election would be a good choice here rather than using all servers. Running against all servers synchronously could also be an option, backed by an e2e test with a cluster of 3 nodes.

yurishkuro · 2025-01-06T20:58:20Z

correct, a command can be sent to any node.

zzzk1 · 2025-01-10T15:02:42Z

@yurishkuro For the esmapping-generator/, we should create a new command in cmd/internal/esmapping and reuse the logic from esmapping-generator/app in this command. After that, add internal/esmapping to all-in-one/main.go and jaeger/main.go using the following syntax: command.AddCommand(esmapping.Command()). Is that right?🤔

yurishkuro · 2025-01-10T16:08:58Z

@zzzk1 yes. There was a PR #6327 for that, I just realized I never submitted my review for it.

zzzk1 · 2025-01-11T05:22:38Z

@Rohanraj123 Are you still working on esmapping-generator?

Rohanraj123 · 2025-01-11T05:24:11Z

@Rohanraj123 Are you still working on esmapping-generator?

Yes, I was waiting for the review. I will (re)start working on it today.

Manik2708 · 2025-01-14T10:58:34Z

@yurishkuro These are the options I am considering for AutomatedRollover (requires UseILM and UseReadWriteAliases to be true).

type AutomatedRollover struct {
	Enabled                bool
	PolicyName             string   // If the user has created a policy and wants to roll over with that policy
	Conditions             []string // Conditions like max_age etc.
	SkipDependencies       bool     // Whether the user wants to apply rollover to the dependencies index or not
	DisableDeleteOperation bool     // Whether the user wants to disable the delete operation of automated rollover (by default it will be on)
}

And this as the flow:

Please review this, if it looks good, I would start working on it!

yurishkuro · 2025-01-14T15:31:01Z

@Manik2708

What is the official recommendation from ES when using ILM? Is the user expected to create read/write aliases, or is the policy doing that automatically?
I would treat policy name not as a signal for create/not-create, but just a default that the user can override, and Jaeger always checks if the policy exists and creates if it does not. I am worried about auto-creating the policy overall in terms of how much customization is required for that process - is the list of conditions completely sufficient to represent the ILM policy?

For your struct

I think enabled is unnecessary, we can treat the presence/absence of the config entry as the same bool.
I don't like fields SkipDependencies and DisableDeleteOperation, are they necessary?

Manik2708 · 2025-01-14T15:54:19Z

What is the official recommendation from ES when using ILM? Is the user expected to create read/write aliases, or is the policy doing that automatically?

As per my understanding, the aliases have to be set up manually; please see this: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html#ilm-gs-alias-apply-policy It requires index.lifecycle.rollover_alias in the template.

I would treat policy name not as a signal for create/not-create, but just a default that the user can override, and Jaeger always checks if the policy exists and creates if it does not. I am worried about auto-creating the policy overall in terms of how much customization is required for that process - is the list of conditions completely sufficient to represent the ILM policy?

This idea seems great! We are only concerned with the rollover action of the policy, which requires only conditions (even the rollover part of es-rollover asks only for conditions), so I think conditions would be enough for automated rollover. Please see this: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-rollover.html#ilm-rollover-options

I think enabled is unnecessary, we can treat the presence/absence of the config entry as the same bool.

I could agree, but I have a question that is linked to the next point: if, say, SkipDependencies and DisableDeleteOperation are eliminated, what would an empty Conditions array imply? I think in that case we have to keep this Enabled option.

I don't like fields SkipDependencies and DisableDeleteOperation, are they necessary?

For SkipDependencies, does that imply we should always roll over the dependencies index?

For DisableDeleteOperation, delete is an action of the ILM policy and is similar to cleaning; should we always enable it or always disable it? Please see this for context on the delete operation: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-delete.html

yurishkuro · 2025-01-14T16:07:56Z

I think in that case we have to keep this Enabled option.

Your config could look like:

  backends:
    my_es:
      elasticsearch:
        ...
        index_maintenance:

I.e. even if this index_maintenance section is completely empty, we can still tell whether it exists. This is a common pattern in OTEL config, e.g. for enabling or disabling an HTTP endpoint.
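
In Go this presence/absence pattern is commonly modeled with a pointer field that stays nil when the section is missing. The sketch below shows the idea using the standard library's `encoding/json` as a stand-in for the YAML-based config loader, so the empty section is written as `{}`; how an entirely empty YAML key is decoded depends on the loader and is not shown here. The type names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexMaintenance is a hypothetical settings section; it may be empty.
type indexMaintenance struct{}

// esConfig is a hypothetical backend config. The pointer field is nil
// when the section is absent, so no separate Enabled flag is needed.
type esConfig struct {
	IndexMaintenance *indexMaintenance `json:"index_maintenance"`
}

// maintenanceEnabled reports whether the section is present in raw.
func maintenanceEnabled(raw string) (bool, error) {
	var cfg esConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return false, err
	}
	return cfg.IndexMaintenance != nil, nil
}

func main() {
	for _, raw := range []string{`{"index_maintenance": {}}`, `{}`} {
		enabled, err := maintenanceEnabled(raw)
		fmt.Println(raw, "->", enabled, err)
	}
}
```

With this shape, fields inside the section can all have defaults, and removing the section from the config turns the feature off.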

Another high level question I have - do we even need to invest in custom ILM? Or should we be looking into data streams #4708, which may have different management process.

Manik2708 · 2025-01-14T16:15:19Z

I also studied data streams; the official docs suggest using ILM with data streams for rollover. https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html#data-streams-rollover

Manik2708 · 2025-01-14T16:32:12Z

@yurishkuro Then where should we head? ILM or data streams!

yurishkuro · 2025-01-14T16:34:36Z

Then where should we head? ILM or data streams!

@Manik2708 my preference is data streams, but like you said they are not mutually exclusive.

Manik2708 · 2025-01-14T16:39:25Z

Ok, then let's try with data streams?

Manik2708 · 2025-01-17T11:45:56Z

@yurishkuro I feel this issue needs a helper PR (it could be a prerequisite) that refactors the e2e tests of Index Rollover and Cleaner to make them reusable for v2 e2e tests, just like the storage integration tests. This would help contributors test the functionality e2e, and it would make reviewers' job a little easier by reducing PR size (you can see PR #6500 becoming very big because of this refactoring). Should I raise a separate PR, or should it be included with the functionality itself?

yurishkuro · 2025-01-18T00:51:34Z

smaller PRs are always better.

yurishkuro added good first issue, help wanted labels Dec 1, 2024

dosubot bot added area/storage storage/elasticsearch v2 labels Dec 1, 2024

Rohanraj123 mentioned this issue Dec 2, 2024

feat: Remove the need for external tools for managing ElasticSearch #6291

Open

4 tasks

Rohanraj123 mentioned this issue Dec 8, 2024

feat: Add esmapping-generator into jaeger binary #6327

Open

4 tasks

akstron mentioned this issue Dec 27, 2024

[WIP] Add ES Auto Index Cleaner #6425

Closed

4 tasks

Manik2708 mentioned this issue Jan 1, 2025

[WIP]Removed ES-Rollover binary and integrated with Jaeger Binary #6456

Closed

4 tasks

Manik2708 mentioned this issue Jan 7, 2025

[v2]Added Auto Index Rollover #6500

Draft

4 tasks

Manik2708 mentioned this issue Jan 16, 2025

[Feature]: Support elasticsearch data stream #4708

Open
