A Comprehensive Guide to OpenSearch Architecture

July 15th, 2024 · Resources


OpenSearch is an open-source search and analytics engine derived from Elasticsearch. It was created by forking Elasticsearch and Kibana following concerns within the community about the direction of those projects, particularly regarding the licensing changes made by Elastic N.V. OpenSearch is developed and maintained by the OpenSearch community, which includes a diverse group of contributors and organizations committed to building and improving an open, collaborative, and transparent search and analytics platform.

This article outlines the main components and capabilities of the OpenSearch architecture. Understanding the architecture, including how indexing, querying, and data storage work, shows users how OpenSearch operates internally and helps them apply its capabilities to their specific requirements.

Understanding the architecture also makes it easier to integrate OpenSearch with other tools and services and to extend its functionality. In addition, knowing how data is stored, accessed, and secured within OpenSearch helps users put appropriate security measures in place to protect sensitive information and comply with relevant data protection regulations and standards.

Contents

OpenSearch Architecture
Hosted OpenSearch

OpenSearch Architecture

OpenSearch is composed of numerous components that work together to provide a fully functional service. To help you understand how OpenSearch operates, we have compiled a guide to the most crucial elements of the service.

Data Organization

In OpenSearch, data organization revolves around the concepts of indices, documents, fields, and mappings.

Index: An index is a logical collection of documents that share a similar structure and are stored and managed together. It serves as the primary unit for organizing and partitioning data within an OpenSearch cluster.
Indices: Indices are used to organize and partition data for efficient storage and retrieval. In OpenSearch, users can create multiple indices based on their data requirements, such as separating data by type or source.
Documents: A document is a JSON object that contains data to be indexed. Each document is stored in an index and has a unique identifier called the document ID. Documents can represent various entities or records, such as products, users, and log entries.
Fields: Fields are key-value pairs within a document that store specific data. Each field represents a piece of information about the document, such as a name, age, timestamp, etc. Fields can be of different data types, such as strings, numbers, dates, and booleans.
Mappings: Mappings outline the schema or structure of the documents within an index. They specify the data types and properties of each field, such as whether a field is indexed, analyzed, or stored. Mappings help guarantee consistency and accuracy in data indexing and searching.
Shards and Replicas: OpenSearch indexes are divided into shards, which are smaller, scalable units of data storage and processing. Each shard is a fully functional index in itself, and the shards of an index can be distributed across multiple nodes for parallel processing and fault tolerance. Replicas are copies of shards that provide redundancy and high availability. They also improve read performance by distributing query load across multiple copies.
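
To make these concepts concrete, here is a minimal sketch of what an index-creation request body and a matching document might look like, written as plain Python dictionaries. The index name `products` and every field name are assumptions chosen for illustration, and the helper `unmapped_fields` is hypothetical rather than part of any OpenSearch client.

```python
import json

# Hypothetical index name, used only for illustration.
INDEX_NAME = "products"

# Body of an index-creation request (for example, PUT /products).
index_body = {
    "settings": {
        "index": {
            "number_of_shards": 3,    # primary shards, set when the index is created
            "number_of_replicas": 1,  # replica copies of each primary shard
        }
    },
    "mappings": {
        "properties": {
            "name": {"type": "text"},         # analyzed for full-text search
            "category": {"type": "keyword"},  # exact values, useful for filters
            "price": {"type": "float"},
            "in_stock": {"type": "boolean"},
            "created_at": {"type": "date"},
        }
    },
}

# A document: a JSON object whose fields follow the mapping above.
document = {
    "name": "Wireless headphones",
    "category": "audio",
    "price": 79.99,
    "in_stock": True,
    "created_at": "2024-07-15T09:30:00Z",
}


def unmapped_fields(doc, body):
    """Return the document fields that the mapping does not declare."""
    declared = body["mappings"]["properties"]
    return sorted(set(doc) - set(declared))


print(json.dumps(index_body, indent=2))
print("Unmapped fields:", unmapped_fields(document, index_body))
```

With three primary shards and one replica each, this hypothetical index would consist of six shard copies spread across the cluster.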

Shards

Shards are the fundamental units of horizontal scalability and distributed data storage in OpenSearch architecture. They allow for the distribution of data across multiple nodes in a cluster, enabling parallel processing and enhancing performance. Shards are logical partitions of an index, each consisting of a subset of the index's data. When you create an index in OpenSearch, you specify the number of primary shards it should have. The data in the index is distributed across these primary shards.

By splitting an index into multiple shards, OpenSearch can distribute data across numerous nodes in a cluster. This horizontal scalability enables the cluster to manage large volumes of data and high query loads efficiently. Queries and operations on an index can also be executed in parallel across its shards, which increases throughput and reduces latency because each shard independently processes a portion of the workload. Sharding also supports fault tolerance when combined with replication. Each primary shard can have zero or more replica shards (one by default), which are copies of the primary that OpenSearch allocates to a different node. If a node in the cluster fails, the data held in its shards can still be served from the copies on other nodes, and a replica can be promoted to primary, providing high availability and data resilience.
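
The way documents spread across primary shards can be sketched as hashing a routing value (by default, the document ID) and taking the result modulo the number of primary shards. The snippet below is a simplified illustration of that idea only: it uses CRC32 as a stand-in hash, not the hash function OpenSearch uses internally, and the function names are hypothetical.

```python
import zlib


def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Pick a primary shard number for a routing value.

    CRC32 is a stand-in hash chosen to keep this sketch dependency-free;
    it is not the hash function OpenSearch uses internally.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the primary shard they would be routed to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_primary_shards)].append(doc_id)
    return shards


placement = distribute([f"doc-{i}" for i in range(100)], num_primary_shards=3)
for shard_number, ids in placement.items():
    print(f"shard {shard_number}: {len(ids)} documents")
```

Because the shard number depends on the primary shard count, the same document would map to a different shard if that count changed, which is one reason the count is chosen when the index is created.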

Inverted Index

OpenSearch is renowned for its ability to retrieve documents quickly due to its rapid full-text search capability. This ultra-fast search feature is facilitated by an inverted index, which includes the following components and details.

Tokenization: When documents are indexed in OpenSearch, the text content is first tokenized into individual terms or tokens. This process involves breaking down the text into words, numbers, or other meaningful units. Depending on the analyzer configured for the field, it can also lowercase terms, strip punctuation, and remove stop words.
Term Frequency (TF): For each document, the inverted index records the frequency of each term occurring within that document. This information is stored in the inverted index alongside the term itself.
Inversion: The inverted index inverts this information by creating a mapping between each term and the documents in which it appears. For each term, the index maintains a list of identifiers (such as document IDs or pointers) for the documents where that term occurs.
Posting Lists: The list of document identifiers associated with each term is known as a posting list. These posting lists are sorted to enable efficient retrieval and intersection operations during search queries.
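
The four steps above can be illustrated with a toy inverted index. This is a simplified sketch for intuition, not how OpenSearch stores its index on disk; the stop-word list, the tokenizer, and all function names are assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not an OpenSearch default).
STOP_WORDS = {"a", "an", "and", "the", "of", "is"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumeric characters, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency), sorted by doc_id."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


def search_all(index, query):
    """Return IDs of documents that contain every query term (posting-list intersection)."""
    terms = tokenize(query)
    if not terms:
        return []
    id_sets = [{doc_id for doc_id, _ in index.get(term, [])} for term in terms]
    return sorted(set.intersection(*id_sets))


docs = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "A quick quick fox and a dog",
}
index = build_inverted_index(docs)

print(index["quick"])                  # [(1, 1), (3, 2)]
print(search_all(index, "quick fox"))  # [1, 3]
print(search_all(index, "brown dog"))  # [2]
```

Looking up a term is a single dictionary access followed by a walk over a short sorted list, which is why searching an inverted index avoids scanning every document.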

Document Searching

To execute a search query, OpenSearch uses a distributed query-then-fetch algorithm. Queries are constructed with the query DSL (Domain-Specific Language) and executed against the indexed data. The steps below break down how a search works in OpenSearch.

Query DSL: OpenSearch offers a rich set of query DSL constructs that enable you to specify multiple search criteria and conditions. Common query types are as follows:
- Match Query: Searches for documents containing specific terms or phrases.
- Term Query: Matches documents that include a specific term in a particular field.
- Nested Query: Searches within nested objects or arrays within documents.
- Range Query: Finds documents where a field value falls within a specified range.
- Bool Query: Combines multiple query clauses using boolean logic (AND, OR, NOT).
- Full-Text Query: Uses techniques like fuzzy matching and stemming to find relevant documents based on the full-text content.
Query Execution: When you submit a search request to OpenSearch, it parses the supplied query DSL and executes the query against the inverted index. The index is consulted to identify documents that match the search criteria specified in the query.
Scoring: OpenSearch calculates a relevance score for each matched document based on factors such as how often the query terms occur in the document, how rare those terms are across the index, and the length of the field (the default similarity is BM25). This scoring mechanism helps rank search results so that the most relevant documents appear higher in the result set.
Retrieval: Once the search is executed and documents are scored, OpenSearch retrieves the top-ranked documents based on their relevance scores. These documents are then returned as search results to the user or application that initiated the query.
Pagination and Sorting: OpenSearch supports pagination and sorting of search results, enabling users to navigate through large result sets efficiently. Users can specify the number of results to return per page and the sorting criteria for ordering the results based on specific fields or relevance scores.
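
As an illustration of how these pieces fit into one request, the sketch below builds a search body that combines a match query, a term filter, a range filter, pagination, and sorting inside a bool query. The field names `name`, `in_stock`, and `price` and the helper `build_search_body` are assumptions made for the example; the resulting dictionary is the kind of JSON body that would be sent to an index's `_search` endpoint.

```python
import json


def build_search_body(text, min_price, max_price, page=1, page_size=10):
    """Build a query DSL body (hypothetical fields: name, in_stock, price)."""
    return {
        "from": (page - 1) * page_size,  # offset of the first hit on this page
        "size": page_size,               # number of hits per page
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"name": text}}],
                # "filter" clauses only include or exclude documents.
                "filter": [
                    {"term": {"in_stock": True}},
                    {"range": {"price": {"gte": min_price, "lte": max_price}}},
                ],
            }
        },
        # Most relevant first; ties broken by the cheaper price.
        "sort": [{"_score": "desc"}, {"price": "asc"}],
    }


body = build_search_body("wireless headphones", 20, 100, page=3, page_size=25)
print(json.dumps(body, indent=2))
```

Placing the term and range clauses under `filter` rather than `must` keeps them out of the relevance calculation, so only the full-text match influences ranking.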

Aggregations

OpenSearch aggregations are useful features that enable you to examine and summarize data in a variety of ways, offering valuable insights into the distribution, patterns, and trends within indexed documents. With aggregations, you can execute complex analytics on your data, such as computing statistics, finding unique values, and grouping documents based on specified criteria.

Types of aggregations: Metric aggregations, such as average, sum, and stats. Bucket aggregations, such as date histogram, range, and filters. Pipeline aggregations, such as bucket script, bucket selector, derivative, and cumulative sum.
Nested Aggregations: You can nest aggregations within each other to execute multi-level analytics. For example, you can compute average prices per category, and then further break down each category by subcategory.
Aggregation Framework: OpenSearch offers a flexible and expressive aggregation framework that enables you to compose aggregations hierarchically, allowing for complex analytics and insights generation.
Visualization and Analysis: Aggregation results can be visualized using tools such as OpenSearch Dashboards (the OpenSearch fork of Kibana), or they can be consumed programmatically for further analysis and decision-making.
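
The nested example mentioned above (average price per category, broken down further by subcategory) could be expressed roughly as the request body below. The field names `category`, `subcategory`, and `price` are assumptions, and `average_price_by` is a hypothetical local helper included only to show what a bucket aggregation with a metric sub-aggregation computes.

```python
from collections import defaultdict

# Request body: a terms (bucket) aggregation with nested sub-aggregations.
agg_body = {
    "size": 0,  # return aggregation results only, no individual hits
    "aggs": {
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {
                "avg_price": {"avg": {"field": "price"}},
                "by_subcategory": {
                    "terms": {"field": "subcategory"},
                    "aggs": {"avg_price": {"avg": {"field": "price"}}},
                },
            },
        }
    },
}


def average_price_by(docs, field):
    """Local illustration: bucket documents by a field, then average the price per bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc["price"])
    return {key: sum(prices) / len(prices) for key, prices in buckets.items()}


sample_docs = [
    {"category": "audio", "subcategory": "headphones", "price": 50.0},
    {"category": "audio", "subcategory": "speakers", "price": 150.0},
    {"category": "video", "subcategory": "monitors", "price": 300.0},
]

print(average_price_by(sample_docs, "category"))  # {'audio': 100.0, 'video': 300.0}
```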

Nodes

In OpenSearch, nodes are the individual instances or servers that make up a distributed OpenSearch cluster. Each node performs specific roles within the cluster, and together the nodes provide efficient indexing, searching, and data management.

Node Type	Description
Cluster manager	Manages the overall operation of a cluster and keeps track of the cluster state. This includes creating and deleting indexes, monitoring the nodes that join and leave the cluster, checking the health of each node in the cluster, and allocating shards to nodes.
Cluster manager eligible	Can be elected as the cluster manager. The eligible nodes select one node among themselves as the cluster manager node via a voting process.
Data	Stores and searches data. Executes all data-related operations (indexing, searching, aggregating) on local shards. These can be seen as the worker nodes of your cluster and require more disk space than any other node type.
Ingest	Pre-processes data before storing it in the cluster. Runs an ingest pipeline that transforms your data before adding it to an index.
Coordinating	Delegates client requests to the shards on the data nodes, collects and aggregates the results into one final result, and sends this result back to the client.
Dynamic	Assigns a specific node to custom work, such as machine learning (ML) tasks, so that this work does not consume resources on data nodes and does not affect core OpenSearch functionality.
Search	Provides access to searchable snapshots. Uses techniques such as caching frequently used segments and evicting the least used segments in order to access the searchable snapshot index, which is stored in a remote long-term storage source such as Amazon S3 or Google Cloud Storage.
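
The coordinating role in the table, together with the query-then-fetch flow described earlier, can be sketched as a scatter-gather loop: each shard returns its best local matches, and the coordinator merges them into one global result before fetching the documents. This is a toy model under strong assumptions. The score is a plain term count rather than a real relevance score, shards are in-memory dictionaries, and all names are hypothetical.

```python
import heapq


def query_phase(shard_docs, term, size):
    """Run on each shard: score local documents, return the top (score, doc_id) pairs."""
    hits = [(text.lower().split().count(term), doc_id) for doc_id, text in shard_docs.items()]
    hits = [hit for hit in hits if hit[0] > 0]  # keep only documents that match
    return heapq.nlargest(size, hits)


def coordinate_search(shards, term, size):
    """Run on the coordinating node: scatter the query, merge results, fetch documents."""
    candidates = []
    # Scatter: ask every shard for its local top hits.
    for shard_id, shard_docs in enumerate(shards):
        for score, doc_id in query_phase(shard_docs, term, size):
            candidates.append((score, doc_id, shard_id))
    # Gather: merge the per-shard lists into one global top list.
    top = heapq.nlargest(size, candidates)
    # Fetch: retrieve the full documents only for the global winners.
    return [
        {"_id": doc_id, "_score": score, "_source": shards[shard_id][doc_id]}
        for score, doc_id, shard_id in top
    ]


shards = [
    {"a": "open search engine", "b": "search search index"},
    {"c": "log analytics", "d": "search search search"},
]
results = coordinate_search(shards, "search", size=2)
print([hit["_id"] for hit in results])  # ['d', 'b']
```

Fetching full documents only after the merge means each shard sends back small (score, ID) pairs in the first phase, which keeps the traffic between nodes low.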

Hosted OpenSearch

Fully understanding all aspects of OpenSearch architecture and why it’s important for your organization is challenging, but this challenge can be alleviated by opting for a Hosted OpenSearch solution, such as the one provided by Logit.io. Logit.io's Hosted OpenSearch service manages all the infrastructure, including configuration, setup, and hosting, enabling you to create production-ready OpenSearch Stacks within minutes.

By using Logit.io, you can benefit from the extensive capabilities of OpenSearch without the challenging and time-consuming configuration and maintenance. Because Logit.io offers the most recent version of managed OpenSearch, it can also provide multi-tenancy for OpenSearch dashboards. This functionality lets platform users establish separate containers for storing index patterns, dashboards, visualizations, and reports. These containers, referred to as tenants, can also be set up to grant access exclusively to specific roles.

If you’re interested in finding out more regarding Logit.io Hosted OpenSearch, feel free to book a free demo, or begin exploring the platform for yourself with a 14-day free trial.

If you've enjoyed this article, why not read The Top 10 OpenSearch Plugins or Cassandra vs OpenSearch next?
