Best practices when creating tasks¶
When it comes to creating tasks in Lime CRM, there are several factors to consider in order to achieve both resiliency and optimal performance. In this discussion, we'll delve into these important considerations. Get ready to equip yourself with the knowledge needed to create robust and high-performing tasks in Lime CRM!
1. Estimate the task duration¶
Understanding the amount of data the integration will handle is crucial for estimating the duration of a task. This preliminary analysis sets the foundation for a smooth integration process.
It's common to only use a small data set when testing a task during development. This is good since it speeds up the development process. However, it's important to be cautious as there are risks involved. It's possible to mistakenly assume that a task is fast based on testing it with a small data set, only to discover that it takes hours, or even days, to complete when running it with production data. So, while it's helpful for initial testing, we need to be mindful of the potential variations when working with larger data sets.
Use a large data set for capacity testing
Use a small data set for feature testing your tasks, but also perform capacity testing with a data set at least as large as the one in the production environment. Always use anonymized data in development environments.
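A rough first estimate can be made by timing a small sample and extrapolating. The sketch below assumes the work scales linearly with the number of items, which is often optimistic, and `process_item` is a hypothetical stand-in for the real per-item work. It is a sanity check, not a replacement for capacity testing.

```python
import time


def estimate_total_seconds(process_item, sample_items, total_count):
    """Time a sample run and extrapolate linearly to total_count items."""
    start = time.perf_counter()
    for item in sample_items:
        process_item(item)
    elapsed = time.perf_counter() - start
    seconds_per_item = elapsed / len(sample_items)
    return seconds_per_item * total_count


# Hypothetical stand-in for the real per-item work of a task
def process_item(item):
    return item * 2


estimate = estimate_total_seconds(process_item, list(range(1_000)), 2_000_000)
print(f"Estimated duration for 2,000,000 items: {estimate:.1f} s")
```

If the estimate already lands in the range of hours, that is a strong signal to optimize before running against production-sized data.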
2. Estimate expected memory usage¶
When dealing with hefty data volumes, it's easy to overlook the memory consumption aspect.
It's worth noting that the memory requirements for a task may vary between cloud and on-premise environments.
In our cloud environment, tasks that consume excessive memory are automatically terminated. If you check the logs, you'll notice that the task was killed with SIGKILL.
However, this automatic termination behavior does not apply to our on-premise environment. In this case, it becomes your responsibility to ensure that memory usage remains in check. If the memory consumption exceeds limits, it can adversely affect the server and lead to degraded performance.
For optimal performance, we strongly advise keeping your task's memory usage below 200 MB. Going beyond this threshold is often an indication of suboptimal implementation.
< 200 MB
As a rule of thumb, a task should never consume more than 200 MB of memory.
So, let's keep a watchful eye on memory usage to ensure smooth sailing in both cloud and on-premise environments. It's all about finding that sweet spot for optimal performance!
Estimate memory usage¶
To estimate the memory usage of a task, a simple approach is to utilize the Resource Monitor application on your computer. By examining the memory consumption of the process associated with your task, you can obtain a quick estimate of its memory usage. When calculating the memory usage of your task, it's important to exclude the memory consumed by the task handler itself while in an idle state. Typically, the task handler occupies around 150-200 MB of memory when it's not actively processing any tasks. This means that the maximum amount of memory the task handler should consume while running a single task is about 400 MB (up to 200 MB for the idle task handler plus 200 MB for the task).
Use the Resource Monitor
Utilize the Resource Monitor application on your computer to measure the memory consumption.
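As a complement to the Resource Monitor, Python's standard `tracemalloc` module can report the peak memory allocated by Python code while a function runs. The sketch below is one possible approach, not part of the Lime tooling, and `build_large_list` is a hypothetical stand-in for a task body. Note that `tracemalloc` only tracks allocations made through Python, so the number will differ from the process total shown by the Resource Monitor.

```python
import tracemalloc


def measure_peak_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)


# Hypothetical task body that builds a large list in memory
def build_large_list():
    return len([i for i in range(200_000)])


count, peak_mb = measure_peak_memory_mb(build_large_list)
print(f"Processed {count} items, peak Python allocations: {peak_mb:.1f} MB")
```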
Data chunking¶
Harness the power of Data Chunking! Here's a short Python code example that demonstrates the benefits of using a generator when fetching large amounts of data from another system:
def fetch_data_from_external_system():
    # Simulating data fetching from an external system
    for i in range(1, 1000001):
        yield i

# Fetching data using a generator
data_generator = fetch_data_from_external_system()

# Processing data from the generator
for data in data_generator:
    # Perform desired operations on each data item
    logger.info(data)
In this example, the fetch_data_from_external_system() function generates data using a simple for loop. Instead of fetching all the data at once and storing it in a list or another data structure, it utilizes a generator by using the yield keyword. This allows the data to be fetched and generated incrementally as needed, without consuming excessive memory.
The data generator can then be iterated over using a for loop, processing each data item as desired. This approach ensures that only one data item is loaded into memory at a time, making it more memory-efficient and suitable for handling large datasets.
The next function
Python's next() function returns the next item of an iterator. You can use this if you only want to obtain a single item from the generator.
By using generators, you can fetch and process data in manageable chunks, reducing memory consumption and improving overall performance when dealing with substantial amounts of data.
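As a small illustration of pulling single items with `next()`, the sketch below uses a hypothetical `fetch_ids` generator. Passing a default as the second argument avoids a `StopIteration` error once the generator is empty.

```python
def fetch_ids():
    # Hypothetical generator standing in for an external data source
    yield from (101, 102, 103)


id_generator = fetch_ids()

first_id = next(id_generator)          # 101
second_id = next(id_generator)         # 102
remaining = list(id_generator)         # [103]
exhausted = next(id_generator, None)   # None, since the generator is now empty
```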
Another simple way to get lazy evaluation is to use the map and filter functions. In Python 3, map() applies a function to a sequence of elements and returns a lazy iterator (a map object) rather than a list. Like a generator, it produces results as you iterate over it instead of computing and storing them all upfront in memory.
This means that you can get generator-like behavior with a single line of code, cool!
def fetch_from_external_system(id_):
    ...

ids = [171, 772, 838]

# Apply the mapping function using map()
data_generator = map(fetch_from_external_system, ids)
The map and filter functions are awesome!
The map and filter functions create lazy iterators that behave like generators.
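To make the laziness visible, the sketch below chains `map` and `filter` and records when the fetch function is actually called. The `fetch_from_external_system` body and the `active` field are assumptions made up for illustration.

```python
calls = []


def fetch_from_external_system(id_):
    # Hypothetical fetch; records the call so laziness is visible
    calls.append(id_)
    return {"id": id_, "active": id_ % 2 == 0}


ids = [171, 772, 838]

records = map(fetch_from_external_system, ids)
active_records = filter(lambda record: record["active"], records)

calls_before_iteration = list(calls)   # nothing has been fetched yet
active_ids = [record["id"] for record in active_records]
```

No fetch happens until the final list comprehension iterates over `active_records`, and each record is fetched and filtered one at a time.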
Sometimes, working with a larger data set can actually speed up your task. In such instances, you can retrieve larger data chunks before initiating the processing phase. Here's an example that demonstrates batching a specific number of fetches from the data_generator mentioned earlier:
import itertools

# Fetching and processing data in batches using itertools.islice
batch_size = 50
data_batches = (
    list(itertools.islice(data_generator, batch_size))
    for _ in itertools.count()
)

# Processing each batch of data
for batch in data_batches:
    if not batch:
        # An empty list means the generator is exhausted
        break
    # Perform desired operations on the batch of data
    logger.info(batch)

In this example, the data_batches generator expression generates batches of data using itertools.islice. Each batch is converted to a list with list(), because the iterator returned by islice is always truthy, so the emptiness check only works once the batch is a list.
The code then iterates over data_batches, processing each batch of data individually or collectively based on the desired operations, and stops at the first empty batch.
One caveat with generators that is good to be aware of is that they can only be iterated over once:
>>> my_iterator = iter([1, 2, 3, 4])
>>> [x for x in my_iterator]
[1, 2, 3, 4]
>>> [x for x in my_iterator]
[]
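If a second pass over the data is needed, one option is to call the generator function again to obtain a fresh generator, as sketched below with a hypothetical `fetch_numbers` function. Keep in mind that this repeats the underlying work, such as the calls to the external system.

```python
def fetch_numbers():
    # Hypothetical generator function standing in for a data source
    yield from range(1, 5)


numbers = fetch_numbers()
first_pass = list(numbers)    # [1, 2, 3, 4]
second_pass = list(numbers)   # [] because the generator is exhausted

# Calling the generator function again returns a fresh generator
fresh_pass = list(fetch_numbers())  # [1, 2, 3, 4]
```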
By leveraging batching, you can enhance the speed of your task by working with larger chunks of data at once. Give it a try and witness the efficiency boost in action!
Memory control
Use generators and fetch data in batches by utilizing itertools.islice.
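The batching pattern can also be wrapped in a small reusable helper. The `iter_batches` function below is a hypothetical name for a sketch of that idea. On Python 3.12 and later, `itertools.batched` offers similar behavior and yields tuples instead of lists.

```python
import itertools


def iter_batches(iterable, batch_size):
    """Yield lists of up to batch_size items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(range(1, 8), 3))
# batches == [[1, 2, 3], [4, 5, 6], [7]]
```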
Learn more about generators
Check out this tutorial, if you want to learn more about generators.
Reduce memory usage when working with Lime objects¶
It's crucial to think about memory usage when working with Lime objects. Here are two essential guidelines to follow:
- Avoid storing large collections of Lime objects in lists. Beware of memory consumption! A list holding 100,000 Lime objects can occupy over 100 MB of memory.
- Optimize database operations by committing changes in batches. Aim for a suitable batch size, typically around 50 objects per transaction. This approach enhances performance and reduces memory overhead, ensuring efficient utilization of system resources.
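The two guidelines above can be combined by streaming items and committing them in fixed-size batches. In the sketch below, `commit_batch` is a hypothetical callable standing in for whatever adds a batch of objects to a single transaction and commits it. The actual Lime CRM calls are intentionally not shown.

```python
import itertools


def commit_in_batches(items, commit_batch, batch_size=50):
    """Pass items to commit_batch in lists of at most batch_size."""
    iterator = iter(items)
    committed = 0
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            break
        commit_batch(batch)  # e.g. add the objects to one transaction and commit
        committed += len(batch)
    return committed


# Hypothetical stand-in that records batch sizes instead of writing to a database
batch_sizes = []
total = commit_in_batches(range(120), lambda batch: batch_sizes.append(len(batch)))
```

Only one batch is held in memory at a time, so memory usage stays roughly constant regardless of how many items are processed.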
3. Make the task performant¶
Beware of poorly written tasks! They may work fine with small datasets but can severely impact system performance and slow down operations when dealing with production data. Optimize your tasks to handle larger datasets and keep your solution running smoothly.
Identify the bottlenecks¶
To measure the performance of your task and identify the parts that are taking the longest, you can use various techniques and tools, including APM (Application Performance Monitoring) tools like APM in Kibana or manual logging. Here's an approach using manual logging:
- Import the time module in your Python code:
    import time
- Place timestamps at different points in your code to measure the execution time:
    start_time = time.time()
    # Code segment 1
    # ...
    segment1_time = time.time()
    # Code segment 2
    # ...
    segment2_time = time.time()
    # Code segment 3
    # ...
    end_time = time.time()
- Calculate the duration of each code segment by subtracting the respective timestamps:
    duration_segment1 = segment1_time - start_time
    duration_segment2 = segment2_time - segment1_time
    duration_segment3 = end_time - segment2_time
- Log the durations or print them for analysis:
    logger.debug(f"Duration of segment 1: {duration_segment1}")
    logger.debug(f"Duration of segment 2: {duration_segment2}")
    logger.debug(f"Duration of segment 3: {duration_segment3}")
By placing timestamps at different points and calculating the durations, you can identify which parts of your task are taking the longest to execute.
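When there are many segments to time, the manual timestamps can be wrapped in a context manager. This is a minimal sketch. The `log_duration` helper and the `durations` dictionary are hypothetical names, and `time.perf_counter()` is used here because it is intended for measuring short intervals.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

durations = {}


@contextmanager
def log_duration(label):
    """Log and record how long the wrapped block takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[label] = time.perf_counter() - start
        logger.debug("Duration of %s: %.3f s", label, durations[label])


with log_duration("fetch"):
    data = list(range(100_000))   # placeholder for code segment 1

with log_duration("transform"):
    total = sum(data)             # placeholder for code segment 2
```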
Trim down those database queries!¶
Minimize the number of queries during integration to avoid performance bottlenecks and database overload.
Example: Importing contracts¶
Let's supercharge the integration process! Imagine you're importing contracts into the Lime database while ensuring no duplicates are created. The conventional approach involves querying the database for each contract, which can be time-consuming, especially for large datasets.
But here's a counterintuitive twist: sometimes, prioritizing memory usage leads to the right solution. For instance, storing 1 million integers in memory only takes up around 80 MB. This "good memory investment" can work wonders and significantly expedite your task.
To maximize efficiency, let's strike the perfect balance between query management and memory utilization. Here's a faster approach: pre-fetch the existing contract IDs and store them in a nifty set data structure within your code. By doing so, you reduce the number of database queries from one per contract to just one in total. This optimization allows you to swiftly check for contract existence in-memory, turbocharging your integration process.
With this optimized strategy, you'll be importing contracts with lightning speed, leaving no room for duplicates and effortlessly conquering the integration challenge. Let's level up your performance!
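The pre-fetch approach might look roughly like the sketch below. Both `fetch_existing_contract_ids` and the shape of the contract dictionaries are assumptions for illustration. In a real task the function would run a single query against the Lime database.

```python
def fetch_existing_contract_ids():
    # Hypothetical: one query returning the external IDs of all existing contracts
    return [1001, 1002, 1003]


def iter_new_contracts(incoming_contracts):
    """Yield contracts that are not already present, using one pre-fetch."""
    existing_ids = set(fetch_existing_contract_ids())
    for contract in incoming_contracts:
        if contract["id"] in existing_ids:
            continue  # already imported, skip without querying the database
        existing_ids.add(contract["id"])  # also guards against duplicates in the input
        yield contract


incoming = [{"id": 1002}, {"id": 2001}, {"id": 2001}, {"id": 2002}]
created = list(iter_new_contracts(incoming))
```

Membership checks against a set take roughly constant time, so the cost per contract no longer depends on a database round trip.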
4. Write good log messages¶
Don't underestimate the power of comprehensive logging in your integrations! Logging plays a crucial role in troubleshooting and gaining insights into the behavior of your integration, especially when dealing with failures or unexpected issues. By writing detailed log messages, you create a valuable resource for understanding and resolving any hiccups along the way. So, let's make logging a priority and pave the way for smooth sailing through any integration challenges that come our way!
Use logs to track the execution flow¶
Using logs to track the execution flow provides a detailed and chronological record of the program's actions, aiding in understanding the sequence of operations and facilitating effective debugging and troubleshooting. Below are two examples of well-crafted log statements that provide valuable information for understanding program execution and troubleshooting:
logger.info("Trying to fetch json file from external system...")
json_file = get_file_from_external_system()
and
logger.info("Performing data validation before processing")
validate_data(data)
Log input parameters¶
Logging input parameters helps capture the initial state of a function and provides a reference for understanding how functions are being invoked and with what values. See example below:
logger.info(f"Processing data for user: {username}")
Warning
Never log sensitive data such as API keys or other credentials.
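One way to log input parameters without leaking credentials is to mask known sensitive keys before logging. The sketch below is an illustration only. The key names in `SENSITIVE_KEYS` and the `redact` helper are assumptions and need to be adapted to the parameters your task actually receives.

```python
import logging

logger = logging.getLogger(__name__)

SENSITIVE_KEYS = {"api_key", "password", "token"}  # assumed names, adjust as needed


def redact(params):
    """Return a copy of params with sensitive values masked for logging."""
    return {
        key: "***" if key.lower() in SENSITIVE_KEYS else value
        for key, value in params.items()
    }


params = {"username": "alice", "api_key": "secret-123", "batch_size": 50}
safe_params = redact(params)
logger.info("Starting import with parameters: %s", safe_params)
```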
Log exception details¶
Logging exception details provides valuable insights into the occurrence of errors, including the specific error message, stack trace, and relevant contextual information, aiding in effective debugging and issue resolution.
import logging
import requests

logger = logging.getLogger(__name__)

try:
    url = "https://api.example.com/some-endpoint"
    logger.info(f"Sending API request to {url}")
    response = requests.get(url)
    response.raise_for_status()
    # Process the response data...
except requests.exceptions.RequestException as e:
    # The response may not exist if the request itself failed, so log the exception
    logger.warning(f"API request failed: {e}", exc_info=True)
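When the stack trace is wanted in the log, the standard library's `logger.exception` is a convenient option: it logs at ERROR level and appends the traceback of the exception currently being handled. The `parse_amount` function below is a hypothetical example that uses only the standard library.

```python
import logging

logger = logging.getLogger(__name__)


def parse_amount(raw_value):
    """Convert raw_value to a float, logging full details on failure."""
    try:
        return float(raw_value)
    except ValueError:
        # logger.exception logs at ERROR level and appends the stack trace
        logger.exception("Could not parse amount from %r", raw_value)
        return None


parsed_ok = parse_amount("12.5")
parsed_bad = parse_amount("twelve")
```

Including the offending input value in the message gives the contextual information needed to reproduce the failure.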
5. Write good documentation for the task¶
Create a docstring for the task to explain the purpose of the task and how it works.
When writing a good docstring in Python, it's important to provide a clear and concise description of the function's purpose and functionality. Clearly state what the function does and any important details or caveats.
Using typing annotations in the function signature can help convey the expected types of input parameters and return values. This allows users to understand the data types the function works with and what it produces.
Example:
def import_data_from_blob():
    """
    Fetch JSON files from Azure Blob Storage and import data into the Lime database.

    This function retrieves JSON files stored in Azure Blob Storage, reads their content,
    and performs data import operations into the Lime database.
    """
    ...
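To show how typing annotations and a docstring work together, here is a variant with parameters and a return value. The function name, its parameters, and the `Args`/`Returns` layout are assumptions chosen for illustration, not a prescribed Lime convention.

```python
def import_contracts_from_blob(container_name: str, batch_size: int = 50) -> int:
    """
    Import contracts from JSON files in a blob container.

    Args:
        container_name: Name of the container to read JSON files from.
        batch_size: Number of objects to commit per transaction.

    Returns:
        The number of contracts that were imported.
    """
    ...
```

The annotations document the expected types, while the docstring explains what each parameter means and what the caller gets back.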