Eland lets you run Pandas-like analytics and ML directly on your Elasticsearch data without moving it.

Let’s see Eland in action. Imagine you have a dataset of web server logs in Elasticsearch, and you want to predict the response time of a particular request.

First, install Eland:

pip install eland

Now connect to your Elasticsearch cluster and create an Eland DataFrame over your data. Let’s assume your data is in an index named logs and you want to work with at most 100,000 documents:

import eland as ed
from elasticsearch import Elasticsearch

# Replace with your Elasticsearch connection details
es_client = Elasticsearch(
    cloud_id="YOUR_CLOUD_ID",
    api_key=("YOUR_API_KEY_ID", "YOUR_API_KEY_SECRET")
)

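# Create an Eland DataFrame backed by the "logs" index.
# Nothing is downloaded here; the DataFrame is a lazy view over the index.
try:
    df = ed.DataFrame(es_client=es_client, es_index_pattern="logs")
    # Cap the view at 100,000 documents
    df = df.head(100000)
    print("Successfully created Eland DataFrame.")
    print(df.head())
except Exception as e:
    print(f"Error loading data: {e}")
    # Handle connection or index errors here
    raise SystemExit(1)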

This df is not a Pandas DataFrame; it’s an Eland DataFrame. Eland translates your Pandas operations into Elasticsearch queries. For example, to select specific columns and filter rows:

# Select relevant columns and filter for successful requests (status code 200)
request_data = df[["request_method", "request_url", "response_time_ms", "status_code"]]
successful_requests = request_data[request_data["status_code"] == 200]

print("\nFiltered data head:")
print(successful_requests.head())

When you run request_data[request_data["status_code"] == 200], Eland doesn’t pull all 100,000 documents into memory. Instead, it constructs an Elasticsearch query that filters the data on the server, and only the few rows requested by head() travel back to Python. There is no .copy() call here: an Eland DataFrame is a read-only view over the index, so the pandas SettingWithCopyWarning does not come into play.
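To make the translation more concrete, here is a rough sketch of the kind of search request a column selection plus filter could become. It uses standard Elasticsearch query DSL, but it is an illustration only, not Eland’s actual output: the helper name build_filter_request is hypothetical, and the request Eland really sends may differ. Calling es_info() on an Eland DataFrame should show what it has actually built.

```python
import json


def build_filter_request(columns, field, value, size):
    """Hypothetical helper: search body that filters on one field
    and returns only the selected columns."""
    return {
        "size": size,                       # cap on the number of hits returned
        "_source": list(columns),           # only these fields come back
        "query": {"term": {field: value}},  # exact-match filter, run server-side
    }


body = build_filter_request(
    ["request_method", "request_url", "response_time_ms", "status_code"],
    field="status_code",
    value=200,
    size=5,
)
print(json.dumps(body, indent=2))
```

The point of the sketch is that both the row filter and the column selection live in the request itself, so the cluster does the narrowing before anything crosses the network.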

Now, let’s prepare this data for a machine learning model. We’ll one-hot encode categorical features like request_method and request_url. pd.get_dummies and the .str accessor are pandas features, so the first step is to bring the filtered rows into a pandas DataFrame with .to_pandas().

import pandas as pd

# Materialize the filtered rows client-side; from here on this is plain pandas
features = successful_requests.to_pandas()

# One-hot encode request_method
request_method_encoded = pd.get_dummies(features["request_method"], prefix="method")
features = pd.concat([features, request_method_encoded], axis=1)

# One-hot encode request_url - this can be complex, so for demonstration we simplify
# In a real scenario, you might need more sophisticated feature engineering for URLs
features["url_segment"] = features["request_url"].str.split("/").str[1]  # Example: get the first path segment
url_segment_encoded = pd.get_dummies(features["url_segment"], prefix="url")
features = pd.concat([features, url_segment_encoded], axis=1)

# Drop original categorical columns and the simplified url_segment
features = features.drop(columns=["request_method", "request_url", "status_code", "url_segment"])

print("\nData after one-hot encoding:")
print(features.head())

Notice where the boundary sits. Selecting columns, filtering rows, and computing aggregations stay in Elasticsearch, while pd.get_dummies and string manipulation run in pandas on the rows that .to_pandas() downloads. That call transfers every matching document, so do your filtering and column selection in Eland first to keep the download small. Eland encourages you to stay within its framework for as long as possible and to cross over to pandas only for steps it cannot express.

The core problem Eland solves is the data movement bottleneck. Traditional ML workflows often involve exporting massive datasets from operational databases (like Elasticsearch) to data science environments, which is slow, resource-intensive, and can lead to stale data. Eland allows you to leverage your existing Elasticsearch infrastructure for analytics and ML.

Internally, Eland acts as a translator. When you call a method like df.mean(), Eland converts this into an Elasticsearch aggregation query (e.g., using avg aggregation). For more complex operations, it might use Elasticsearch’s Painless scripting language or other server-side features. The goal is always to perform computation as close to the data as possible.
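As a minimal sketch of that idea, the snippet below builds the kind of request body an average could map to and reads the value back from a response. The helper names build_mean_request and read_mean are hypothetical, the sample response is hand-written to match the shape of an avg aggregation result, and none of this is taken from Eland’s source.

```python
def build_mean_request(field, filter_field=None, filter_value=None):
    """Hypothetical helper: request body asking Elasticsearch for one average."""
    body = {
        "size": 0,  # no documents needed, only the aggregation result
        "aggs": {f"{field}_mean": {"avg": {"field": field}}},
    }
    if filter_field is not None:
        body["query"] = {"term": {filter_field: filter_value}}
    return body


def read_mean(response, field):
    """Pull the single aggregated value out of a search response."""
    return response["aggregations"][f"{field}_mean"]["value"]


request = build_mean_request("response_time_ms", "status_code", 200)

# Hand-written stand-in for a cluster response (illustrative value only)
sample_response = {"aggregations": {"response_time_ms_mean": {"value": 123.5}}}

print(request)
print(read_mean(sample_response, "response_time_ms"))
```

Because size is 0, the response carries one number rather than any documents, which is why an average over millions of rows can cost very little on the Python side.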

The main levers you control are the Elasticsearch query itself and how you structure your Eland DataFrame. You can filter rows before retrieving anything, select only the columns you need, and make sure the index mapping gives Elasticsearch the correct type for each field.

One thing most people don’t realize is how Eland handles operations that cannot be directly translated to Elasticsearch queries. For instance, if you perform a complex multi-step aggregation that Elasticsearch doesn’t natively support, Eland might intelligently fetch the results of a partial Elasticsearch aggregation and then perform the final computation in Python. This is a hybrid approach that balances server-side efficiency with Python’s flexibility, avoiding the need to pull the entire raw dataset into memory.
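The general shape of such a hybrid computation can be sketched as follows. This is not Eland’s implementation: the response is a hand-written stand-in shaped like a terms aggregation with a sum sub-aggregation, the numbers are made up for illustration, and combine_buckets is a hypothetical helper. It also assumes every document in a bucket has a response_time_ms value, so that doc_count can serve as the divisor.

```python
# Stand-in for a partial, server-side result: one bucket per request method,
# each with a document count and a summed response time (illustrative values)
partial_response = {
    "aggregations": {
        "by_method": {
            "buckets": [
                {"key": "GET", "doc_count": 3, "total_ms": {"value": 300.0}},
                {"key": "POST", "doc_count": 1, "total_ms": {"value": 500.0}},
            ]
        }
    }
}


def combine_buckets(response):
    """Finish the computation in Python from per-bucket sums and counts."""
    buckets = response["aggregations"]["by_method"]["buckets"]
    per_method = {
        b["key"]: b["total_ms"]["value"] / b["doc_count"] for b in buckets
    }
    total_ms = sum(b["total_ms"]["value"] for b in buckets)
    total_docs = sum(b["doc_count"] for b in buckets)
    return per_method, total_ms / total_docs


per_method_mean, overall_mean = combine_buckets(partial_response)
print(per_method_mean)
print(overall_mean)
```

Only a handful of bucket summaries reach Python here, yet both the per-method means and the overall mean can be derived from them without touching the raw documents.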

The next step is to train a machine learning model, perhaps using scikit-learn, on this prepared data.
