Working recently with Hibernate batch processing, I followed the default pattern of calling session.clear() periodically. In this article I'm going to describe what problems this approach can cause and how to avoid them.
Batching means processing multiple entities in some way. Multiple can mean anything: a hundred, a thousand, a million, or a hundred million. It's nothing uncommon to fetch rows from a database and do something with the fetched data, one row after another. However, Hibernate has something called the first level cache, which is basically an in-memory storage (a Map) that keeps all entities fetched in the current session. This means that fetched entity references will survive your local function scope and will be kept in memory until the session is closed.
If you want to fetch 100,000,000 entities from a DB, where each of them potentially drags along other entities in eager relations, you can quickly end up with an out of memory error. This is why the Hibernate tutorial encourages doing a periodic session.clear() during such batch processing.
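For reference, a minimal sketch of that periodic-clear pattern (assuming an open Session named session, a hypothetical MyEntity and an arbitrary batch size of 1000):

// sketch only: session and MyEntity are assumed to exist
AtomicInteger count = new AtomicInteger();
try (Stream<MyEntity> entities = session
        .createQuery("from MyEntity", MyEntity.class)
        .getResultStream()) {
    entities.forEach(e -> {
        // ... process e ...
        if (count.incrementAndGet() % 1000 == 0) {
            session.flush(); // push pending changes to the database
            session.clear(); // drop everything from the first level cache
        }
    });
}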
Before we discuss the main subject, let's take a look at the alternatives to batch processing with session.clear().
As stated in the tutorial, as an alternative you can use a StatelessSession. Entities fetched in a stateless session aren't cached in the first level cache. However, when you take a glance at its limitations you can see, among others, the following: a stateless session doesn't interact with the second level cache either, doesn't do transactional write-behind or automatic dirty checking, doesn't cascade operations to associated instances, ignores collections, doesn't fetch lazy associations transparently, and its operations bypass Hibernate's event model and interceptors.
What does it mean exactly? It means that all batch-processing code written using stateless sessions has to be written from scratch with all those limitations in mind, and you can't reuse any of your already written and tested business service methods, which most probably rely on the features mentioned above. You just have to write this code again, only for batch-processing purposes.
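For illustration, such a stateless batch loop could look roughly like this (a sketch assuming a Hibernate SessionFactory is at hand; note that updates must be issued explicitly because there is no dirty checking):

StatelessSession ss = sessionFactory.openStatelessSession();
Transaction tx = ss.beginTransaction();
try {
    for (MyEntity e : ss.createQuery("from MyEntity", MyEntity.class).list()) {
        // ... process e; lazy associations, cascades and dirty checking are not available here ...
        ss.update(e); // changes must be written explicitly
    }
    tx.commit();
} catch (RuntimeException ex) {
    tx.rollback();
    throw ex;
} finally {
    ss.close();
}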
It’s never been an option for me.
The second alternative, processing each entity in a separate transaction, is a more reasonable option. If you can process your entities in separate transactions, meaning that a failure while processing one entity doesn't influence the processing of the others (for example, we send a bunch of emails and if one fails, the others will still be sent), you can just create a separate transaction for each entity and forget about the problem. This is because in frameworks like Spring or Micronaut a session is created together with a new transaction, and after processing a single entity it's flushed and garbage collected, which also removes all fetched entity references from the first level cache. Conceptually it can be written in the following way:
@Transactional @Singleton
public class MyRepository {

    public MyEntity findById(UUID id) {
        // ...
    }

    public List<UUID> findIdsForBatchProcessing() {
        // ...
    }
}

@Singleton // non-transactional
public class MyService {

    @Inject protected MyRepository repository;

    // this method shouldn't be called from transactional code, because an outer
    // transaction would extend the session boundary, so the first level cache
    // wouldn't be cleaned up after each process() invocation
    @Transactional(Transactional.TxType.NOT_SUPPORTED)
    public void doBatch() {
        repository.findIdsForBatchProcessing().forEach(this::process);
    }

    @Transactional
    protected void process(UUID id) {
        MyEntity e = repository.findById(id);
        // ...
    }
}
When the process() method invocation starts, a new session is created together with a new transaction, and the entity is loaded by its id. When the method invocation ends, the session is flushed and closed, which also means removing all entities in its first level cache from memory.
Note that in older frameworks like Spring the above invocation from doBatch() to process() will not be called transactionally, because the self-invocation bypasses the transactional proxy, and requires additional effort. In newer frameworks like Micronaut it works out of the box.
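A common workaround in Spring, sketched here under the assumption of proxy-based @Transactional (the class names are hypothetical), is to move process() into a separate bean so the call goes through the proxy:

@Service
public class BatchItemProcessor {

    @Autowired
    private MyRepository repository;

    // called through the Spring proxy, so each call opens its own
    // transaction (and session) as long as no outer transaction is active
    @Transactional
    public void process(UUID id) {
        MyEntity e = repository.findById(id);
        // ...
    }
}

@Service
public class MyBatchService {

    @Autowired
    private BatchItemProcessor processor;

    // no transaction here: each entity gets its own short-lived session
    public void doBatch(List<UUID> ids) {
        ids.forEach(processor::process);
    }
}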
One can ask what happens when we have 100,000,000 entities. In such a scenario there are 100,000,000 UUIDs fetched from the database, which makes 100,000,000 x 128 bits per UUID = 100,000,000 x 16 bytes = ~1.5 GB of data. It can pose a problem with storing it in local RAM if we use really thin machines for our services, and it certainly poses a problem with transferring this data from the database server to the service server over the network in a single step.
To avoid fetching all the IDs from the DB you can split this set into a few smaller ones, which will be fetched separately. However, I'd try to avoid checking count(*) and paginating this data somehow, because between transactions new entities can appear, some of them can be deleted, or the internal database ordering can change. A better solution is to generate a unique batchProcessingId and keep it with the processed entity:
@Entity
public class MyEntity {

    protected UUID batchProcessingId;
    // ...
}

@Transactional @Singleton
public class MyRepository {

    @Inject protected EntityManager entityManager;

    public List<UUID> findIdsForBatchProcessing(UUID batchProcessingId, int maxResults) {
        // entities that have never been processed have a null batchProcessingId,
        // so the null check is needed to pick them up as well
        return entityManager.createQuery(
                "select e.id from MyEntity e"
                + " where e.batchProcessingId is null or e.batchProcessingId != :batchProcessingId", UUID.class)
            .setParameter("batchProcessingId", batchProcessingId)
            .setMaxResults(maxResults)
            .getResultList();
    }
}
@Singleton
public class MyService {

    @Inject protected MyRepository repository;

    @Transactional(Transactional.TxType.NOT_SUPPORTED)
    public void doBatch() {
        UUID batchProcessingId = UUID.randomUUID();
        List<UUID> list;
        do {
            list = repository.findIdsForBatchProcessing(batchProcessingId, 1000);
            list.forEach(id -> process(id, batchProcessingId));
        } while (!list.isEmpty());
    }

    @Transactional
    protected void process(UUID id, UUID batchProcessingId) {
        MyEntity e = repository.findById(id);
        // ...
        // marking the entity excludes it from the next findIdsForBatchProcessing() call
        e.batchProcessingId = batchProcessingId;
    }
}
But, for the purpose of this article, we use neither a StatelessSession nor separate transactions. We just want to use the same (outer) transaction for the whole processing. So, what's the problem with session.clear()?
Usually, during batch processing in a single transaction, you are going to accomplish the following workflow:
1. Preparation: load some entities you will need later.
2. Batch processing: process a large number of entities, periodically clearing the session.
3. Finalization: use the entities loaded in step 1 to complete the work.
For example, in step 1 you can fetch all users whose emails have not been sent yet, in step 2 you send those emails, and in step 3 you want to update User.sentEmailsCount. The problem is that if you do session.clear() in the second step, all the entities from step 1 become detached and you can't use them in the finalization stage.
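A rough sketch of this situation (the repository, emailService and User members below are hypothetical, just to illustrate the detachment problem):

List<User> users = userRepository.findUsersWithPendingEmails(); // step 1: preparation

for (User user : users) {                                       // step 2: batch processing
    emailService.send(user);
    session.clear(); // periodic cleanup - detaches the users fetched in step 1 too!
}

for (User user : users) {                                       // step 3: finalization
    user.setSentEmailsCount(user.getSentEmailsCount() + 1);
    // the user is detached now: this change will never be flushed, and touching
    // a lazy association here would throw LazyInitializationException
}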
The example above is very simplified. To solve this problem you can re-attach all the detached User-s to the session before updating the sentEmailsCount field. But if you consider a more complex example, where more entities are fetched from the DB in step 1 and then should be used in step 3, it can become uncontrollable and throw LazyInitializationException all over the place. And the worst case scenario is when batch processing can be called from different service methods, which have completely different sets of entities loaded, and you can't tell what should be re-attached to the session after batch processing to ensure the whole logic will be executed properly.
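Re-attaching in the simple case could look roughly like this (a sketch; note that merge() returns new managed instances, so the detached references must be replaced):

// re-attach the detached users by merging them back into the session
users = users.stream()
        .map(entityManager::merge)
        .collect(Collectors.toList());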
In such a scenario, instead of doing session.clear() I prefer to detach only the entities that were fetched in the batch processing step and preserve all previously loaded entities, to ensure the outer logic can be executed appropriately. Here's the conceptual implementation:
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.persistence.EntityManager; // or jakarta.persistence, depending on the JPA version
import org.hibernate.engine.spi.EntityKey;
import org.hibernate.internal.SessionImpl;

public class HibernateBatchCleaner {

    protected SessionImpl session;

    // keys of all entities that were already present in the session before the batch started
    protected Set<EntityKey> snapshot = new HashSet<>();

    public HibernateBatchCleaner(EntityManager entityManager) {
        this.session = entityManager.unwrap(SessionImpl.class);
        doSnapshot();
    }

    protected void doSnapshot() {
        Arrays.stream(session.getPersistenceContext().reentrantSafeEntityEntries()).forEach(entry -> {
            snapshot.add(entry.getValue().getEntityKey());
        });
    }

    public void cleanup() {
        session.flush(); // existing changes should be applied, and then the memory cleaned up
        Arrays.stream(session.getPersistenceContext().reentrantSafeEntityEntries()).forEach(entry -> {
            // detach only entities loaded after the snapshot, i.e. during the batch processing
            if (!snapshot.contains(entry.getValue().getEntityKey())) {
                session.detach(entry.getKey());
            }
        });
    }
}
@Transactional @Singleton
public class MyRepository {

    @Inject protected EntityManager entityManager;

    public Stream<MyEntity> findEntitiesForBatchProcessing() {
        // using a stream we don't load the whole result set into memory, but load it in bunches
        return entityManager.createQuery("select e from MyEntity e" /* ... */, MyEntity.class)
            .getResultStream();
    }

    public void doBatch(Stream<MyEntity> stream, Consumer<MyEntity> c) {
        // take a snapshot of the entities already present in the session
        HibernateBatchCleaner cleaner = new HibernateBatchCleaner(entityManager);
        try {
            stream.forEach(entity -> {
                c.accept(entity);
                cleaner.cleanup(); // detach only what the batch step has loaded
            });
        } finally {
            cleaner.cleanup();
        }
    }
}
@Transactional @Singleton
public class MyService {

    @Inject protected MyRepository repository;

    public void doBatch() {
        repository.doBatch(repository.findEntitiesForBatchProcessing(), this::process);
    }

    protected void process(MyEntity entity) {
        // ...
    }
}
This way we only clean up the entities loaded during the batch processing, while preserving those loaded in the preparation stage; they are still attached to the session and we can change them after the processing is done.