
I am planning to store events in Elasticsearch. The index can hold around 100 million events at any point in time. To de-dupe events, I am planning to create an _id of about 100 characters by concatenating the fields below:
entity_id - UUID (37 chars) +
event_creation_time (30 chars) +
event_type (30 chars)

This store will serve normal reads and writes along with aggregate queries (no updates or deletes).
Could you please let me know whether using such lengthy string _id values instead of the default IDs would have any performance impact or other side effects?

Thanks,
Harish

The [`_id`][1] field is **not indexed** and **not stored** by default, so the length of the `_id` should not cause a storage-related performance problem.

Since you will be indexing millions of documents, the only major performance issue you are likely to face is during bulk indexing. You have to make sure your `_id`s follow a sequential pattern. From the [Docs][2]:

> - If you don’t have a natural ID for each document, use Elasticsearch’s auto-ID functionality. It is optimized to avoid
> version lookups, since the autogenerated ID is unique.
> - If you are using your own ID, try to pick an ID that is [friendly to Lucene][3]. Examples include zero-padded sequential IDs, UUID-1,
> and nanotime; these IDs have **consistent, sequential patterns that
> compress well**. In contrast, IDs such as UUID-4 are essentially
> random, which offer poor compression and slow down Lucene.

In the "friendly to Lucene" blog post linked in the quote above, long-time Lucene committer Michael McCandless compares different ways of generating `_id`s, and IMO it is one of the finest articles I have read.
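To make the "sequential pattern" advice concrete for the composite ID in the question, here is a minimal sketch. The function name `make_event_id`, the `|` separator, and the choice to put a fixed-width UTC timestamp first are all my assumptions, not something from the docs or the blog. The idea is that events created close together share a long common prefix, while the same event always maps to the same `_id`, so it can still be de-duped. Whether this ordering actually helps your workload is something you would need to benchmark.

```python
from datetime import datetime, timezone


def make_event_id(entity_id: str, created_at: datetime, event_type: str) -> str:
    """Build a deterministic _id for an event (hypothetical helper).

    The timestamp goes first and has a fixed width, so IDs sort
    lexicographically in creation order and neighbouring events share
    a common prefix. created_at is assumed to be timezone-aware.
    """
    # Fixed-width UTC timestamp with microseconds, e.g. 20240102T030405.000006Z
    ts = created_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S.%fZ")
    return f"{ts}|{event_type}|{entity_id}"


if __name__ == "__main__":
    when = datetime(2024, 1, 2, 3, 4, 5, 6, tzinfo=timezone.utc)
    print(make_event_id("123e4567-e89b-12d3-a456-426614174000", when, "login"))
```

Because the ID is derived only from the three fields, re-sending the same event produces the same `_id` rather than a second document.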

Hope this helps!

[1]:

[To see links please register here]

[2]:

[To see links please register here]

[3]:

[To see links please register here]
