Performance impact of using a 100-character string as the _id column in Elasticsearch

#1
I am planning to store events in Elasticsearch. The index can hold around 100 million events at any point in time. To de-duplicate events, I am planning to build an _id of about 100 characters by concatenating the fields below:
entity_id - UUID (37 chars) +
event_creation_time (30 chars) +
event_type (30 chars)
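
For illustration, the concatenation described above might look like the following minimal sketch. The function name `build_event_id`, the `|` separator, and the timestamp format are assumptions made for the example; they are not taken from the question. It also assumes `created_at` is a timezone-aware `datetime`.

```python
from datetime import datetime, timezone
from uuid import UUID


def build_event_id(entity_id: UUID, created_at: datetime, event_type: str) -> str:
    """Join the three fields into one deterministic string (hypothetical helper)."""
    # Convert to UTC so the same instant always produces the same text,
    # which is what makes the resulting id usable for de-duplication.
    ts = created_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S.%fZ")
    # The "|" separator is an assumption; any character that cannot
    # appear inside the fields themselves would work.
    return f"{entity_id}|{ts}|{event_type}"


example_id = build_event_id(
    UUID("12345678-1234-5678-1234-567812345678"),
    datetime(2023, 1, 2, 3, 4, 5, 678901, tzinfo=timezone.utc),
    "login",
)
```

Because the same event always yields the same string, indexing it twice with this value as the `_id` targets the same document rather than creating a duplicate.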

This store will serve normal reads and writes along with aggregate queries (no updates or deletes).
Could you please let me know whether using such long string _id values instead of the default IDs would have any performance impact or other side effects?

Thanks,
Harish

#2
The [`_id`][1] field is **not indexed** and **not stored** by default, so there is no performance issue storage-wise.

Since you will be indexing millions of documents, the main performance concern is bulk indexing. Make sure your `_id`s follow a sequential pattern. From the [docs][2]:

> - If you don’t have a natural ID for each document, use Elasticsearch’s auto-ID functionality. It is optimized to avoid
> version lookups, since the autogenerated ID is unique.
> - If you are using your own ID, try to pick an ID that is [friendly to Lucene][3]. Examples include zero-padded sequential IDs, UUID-1,
> and nanotime; these IDs have **consistent, sequential patterns that
> compress well**. In contrast, IDs such as UUID-4 are essentially
> random, which offer poor compression and slow down Lucene.

In that blog post, long-time Lucene committer Michael McCandless compares different ways of generating `_id`s, and IMO it is one of the finest articles I have read.
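
As a rough sketch of the quoted advice applied to the fields from the question, one option is to put a zero-padded timestamp first so that IDs created close together share a common prefix. The helper name `sequential_event_id`, the epoch-millisecond input, and the `-` separator are all assumptions for the example, and whether this ordering actually helps for a given workload would need to be measured.

```python
import uuid


def sequential_event_id(created_at_ms: int, entity_id: uuid.UUID, event_type: str) -> str:
    """Build a time-prefixed id (hypothetical helper, not an Elasticsearch API)."""
    # Zero-pad the epoch-millisecond timestamp to a fixed width of 13 digits
    # so that lexicographic order of the strings matches chronological order.
    return f"{created_at_ms:013d}-{entity_id}-{event_type}"


ids = [
    sequential_event_id(ms, uuid.UUID(int=1), "login")
    for ms in (999, 1_700_000_000_000, 1_700_000_000_001)
]
```

The random-looking UUID still appears in the string, but only as a suffix after the sequential part.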

Hope this helps!

[1]:

[To see links please register here]

[2]:

[To see links please register here]

[3]:

[To see links please register here]
