Tons of tickets "OTRS Scheduler Daemon Cron: ArticleSearchIndexRebuild"


Post by hkais »

I am receiving tons of tickets (two tickets every two minutes) with

Subject
OTRS Scheduler Daemon Cron: ArticleSearchIndexRebuild

Body
Error: Active indexing process already running! Skipping...

What I did to get rid of this issue:
  • rebooted
  • started

    Code: Select all

    sudo -u otrs /opt/otrs/bin/otrs.Console.pl Maint::Ticket::FulltextIndexRebuildWorker --children 5 --limit 99999999 --force-pid
  • still receiving the tickets

I used the above command based on this thread: viewtopic.php?f=62&t=38498&p=171889&hil ... #p171889

Post by Johannes »

Hi,

what is the output of:

Code: Select all

bin/otrs.Console.pl Maint::Ticket::FulltextIndex --status
What is the configuration of:

Code: Select all

Daemon::SchedulerCronTaskManager::Task###ArticleSearchIndexRebuild
Why did you manually start the reindexing, especially with such high numbers? This can lead to high load and therefore longer runtimes.
Usually the workers are handled by the daemon itself, not started manually by the user. If you start the task manually, you block the daemon job for it, which leads to the error/warning mail you are getting.
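
You can quickly verify that the daemon itself is alive and scheduling the task (a quick check, assuming the standard /opt/otrs install path; Maint::Daemon::Summary lists the daemon's tasks in OTRS 6 / Znuny):

Code: Select all

    # is the daemon running?
    sudo -u otrs /opt/otrs/bin/otrs.Daemon.pl status
    # list the recurring cron tasks the daemon manages
    sudo -u otrs /opt/otrs/bin/otrs.Console.pl Maint::Daemon::Summary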

Regards
Johannes

Post by hkais »

Johannes wrote: 21 Feb 2022, 18:06 what is the output of:

Code: Select all

bin/otrs.Console.pl Maint::Ticket::FulltextIndex --status
Indexed Articles: 7.9% (649189/8241771)
it will take a while
Johannes wrote: 21 Feb 2022, 18:06 What is the configuration of:

Code: Select all

Daemon::SchedulerCronTaskManager::Task###ArticleSearchIndexRebuild
(screenshot of the SysConfig setting attached: Screenshot_02682.png)
Johannes wrote: 21 Feb 2022, 18:06 Why did you manually start the reindexing, especially with such high numbers? This can lead to high load and therefore longer runtimes.
Usually the workers are handled by the daemon itself, not started manually by the user. If you start the task manually, you block the daemon job for it, which leads to the error/warning mail you are getting.
I started it because the other thread reported it as the solution to my issue. I was already getting the errors/warnings before that.

I would have expected ArticleSearchIndexRebuild to detect a running cron job and simply skip the execution for now. But it is not clear to me whether it is intentionally implemented to always report.

I am open to learning: what would have been the better approach to run the full rebuild?

Post by Johannes »

The error is reported because the IndexWorker was started on its own, which is not how it was designed. Starting it manually is only meant for cases where the process got killed somehow and is stuck (full disk or similar).
Maint::Ticket::FulltextIndex tries to start its own workers because the job does not know anything about the workers you started manually.

I would suggest this to fix the handling:
1) stop all existing workers
2) make sure they are really stopped
3) mark everything for reindex

Code: Select all

Maint::Ticket::FulltextIndex --rebuild
4) let the daemon start the indexing process itself, using the defaults
5) just watch the progress using the status flag
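
Putting the steps together, a minimal shell sketch (paths assume the standard /opt/otrs install; the pkill/pgrep pattern is an assumption about how the manually started worker processes show up in the process list):

Code: Select all

    # 1) + 2) stop the manually started workers and verify they are gone
    sudo pkill -f 'Maint::Ticket::FulltextIndexRebuildWorker'
    pgrep -af 'Maint::Ticket::FulltextIndexRebuildWorker' || echo 'no workers left'
    # 3) mark everything for reindex
    sudo -u otrs /opt/otrs/bin/otrs.Console.pl Maint::Ticket::FulltextIndex --rebuild
    # 4) the daemon picks the job up on its own; 5) watch the progress
    sudo -u otrs /opt/otrs/bin/otrs.Console.pl Maint::Ticket::FulltextIndex --status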

You may still get a warning mail if the child processes take really long, but you don't have to act on it. Just have a regular look at the index summary using:

Code: Select all

Maint::Ticket::FulltextIndex --status
With your article count I would assume everything should be done in 2 to at most 3 hours.
If it takes longer, your DB performance might be a bit off.

Or you can just wait until the job you started is done on your machine. But I would assume it is consuming a lot of RAM and will eventually swap, which slows it down even more. So I would go with the defaults.

Regards

Post by hkais »

Johannes wrote: 21 Feb 2022, 20:54 3) mark everything for reindex

Code: Select all

Maint::Ticket::FulltextIndex --rebuild
Just to be sure I understand:
the rebuild marks all tickets as "not indexed". Have I got that right?

What is the mechanism/algorithm for reindexing all the tickets?
First the newest tickets, going further into the past with each run?

And does the rebuild work in chunks of 20,000 tickets per run?
So every cron trigger will reindex 20,000 tickets?
Johannes wrote: 21 Feb 2022, 20:54 With your article count I would assume everything should be done in 2 to at most 3 hours.
If it takes longer, your DB performance might be a bit off.
It has already been running for some hours and has done about 9%.
The 5 processes I started are all running at about 100% CPU.
Disk and MariaDB are idling.
Johannes wrote: 21 Feb 2022, 20:54 Or you can just wait until the job you started is done on your machine. But I would assume it is consuming a lot of RAM and will eventually swap, which slows it down even more. So I would go with the defaults.
No swap usage at all.
Only about 16 GB of the available 32 GB RAM is consumed; MariaDB takes about 22% of RAM and is stable there.
If I had known that CPU is the limit, I would have added 16 CPUs and started 15 processes to make my disks & MariaDB work harder ;-)

So my monitoring shows exactly the opposite behavior: low IOPS due to the large MariaDB memory caches, and high CPU.

Post by Johannes »

hkais wrote: 21 Feb 2022, 21:05 Just to be sure I understand:
the rebuild marks all tickets as "not indexed". Have I got that right?
Correct.
hkais wrote: 21 Feb 2022, 21:05 What is the mechanism/algorithm for reindexing all the tickets?
First the newest tickets, going further into the past with each run?
A ticket search is performed; as far as I remember, old to new.
hkais wrote: 21 Feb 2022, 21:05 And does the rebuild work in chunks of 20,000 tickets per run?
So every cron trigger will reindex 20,000 tickets?
Correct, split across 4 worker jobs.
hkais wrote: 21 Feb 2022, 21:05 If I had known that CPU is the limit, I would have added 16 CPUs and started 15 processes to make my disks & MariaDB work harder
This would not really help with MariaDB, unless you have a very special setup: MariaDB is single-threaded in 99% of cases. With Postgres, MSSQL or Oracle you get more performance out of more CPUs. The Perl code itself starts multiple workers to get more performance.
But there is still a catch: it takes much more time to fetch all viewable/searchable content for all of your 8.2 million articles at once than it takes to fetch just 20k of them. That's the reason for the limit. It seems strange, but you will need much more time trying to fit a complete cow into your mouth than eating the cow in smaller chunks. ;)
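
To illustrate the chunking idea, a minimal Perl sketch of the scheme described above (not the actual Znuny code; the numbers match the defaults mentioned in this thread):

Code: Select all

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw(ceil);

    # pretend these are the article IDs flagged for reindexing
    # (in reality ~8.2 million of them, ordered old to new)
    my @FlaggedArticleIDs = ( 1 .. 100_000 );

    my $Limit   = 20_000;    # articles per cron run (the default limit)
    my $Workers = 4;         # worker processes per run (the default)

    # one cron run only takes the next $Limit flagged articles ...
    my @Batch = splice @FlaggedArticleIDs, 0, $Limit;

    # ... and splits them evenly across the workers
    my $PerWorker = ceil( @Batch / $Workers );
    while ( my @Chunk = splice @Batch, 0, $PerWorker ) {
        printf "worker reindexes %d articles\n", scalar @Chunk;
    }
    printf "%d articles left for the next cron run\n", scalar @FlaggedArticleIDs;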

Bye

Post by hkais »

Johannes wrote: 21 Feb 2022, 21:56
hkais wrote: 21 Feb 2022, 21:05 What is the mechanism/algorithm for reindexing all the tickets?
First the newest tickets, going further into the past with each run?
A ticket search is performed; as far as I remember, old to new.
Can I control this somehow? For me it would be interesting to have the newest tickets first; the older ones can take as much time as needed.
Johannes wrote: 21 Feb 2022, 21:56 This would not really help with MariaDB, unless you have a very special setup: MariaDB is single-threaded in 99% of cases. With Postgres, MSSQL or Oracle you get more performance out of more CPUs. The Perl code itself starts multiple workers to get more performance.
But there is still a catch: it takes much more time to fetch all viewable/searchable content for all of your 8.2 million articles at once than it takes to fetch just 20k of them. That's the reason for the limit. It seems strange, but you will need much more time trying to fit a complete cow into your mouth than eating the cow in smaller chunks. ;)
Now I am more confused.
AFAIK the limit is how many articles get processed in one run.
So if I understand you properly, the Perl code is now trying to run one chunk of 8.2 million articles. That would be a total mess in terms of performance.

If I had started my previous run with --limit 20000, would it have terminated after 20,000 processed articles, or would it have run in chunks of 20,000 and processed all 8.2 million articles? That is not fully clear to me.

Can I kill/terminate the processing without corrupting anything?

Post by Johannes »

Now I am more confused.
AFAIK the limit is how many articles get processed in one run.
So if I understand you properly, the Perl code is now trying to run one chunk of 8.2 million articles. That would be a total mess in terms of performance.

If I had started my previous run with --limit 20000, would it have terminated after 20,000 processed articles, or would it have run in chunks of 20,000 and processed all 8.2 million articles? That is not fully clear to me.
Yes it is a mess, but you set the limit, so you made the mess yourself ;) as far as I can tell from your first post.

With the default config it iterates through all tickets, 20k per run split across 4 workers, until the index is at 100%, i.e. all articles "marked for reindex" are processed. No need to start anything manually or modify anything.

You may lose the current progress, but the search index is volatile and can be dropped and rebuilt at any time.
Can I control this somehow? For me it would be interesting to have the newest tickets first; the older ones can take as much time as needed.
No, not without modifying the code, which is not recommended.

Post by hkais »

Johannes wrote: 21 Feb 2022, 23:11 Yes it is a mess, but you set the limit, so you made the mess yourself ;) as far as I can tell from your first post.
Fully confirmed :shock: due to my missing knowledge of the background processes...
Johannes wrote: 21 Feb 2022, 23:11 With the default config it iterates through all tickets, 20k per run split across 4 workers, until the index is at 100%, i.e. all articles "marked for reindex" are processed. No need to start anything manually or modify anything.
OK, I have stopped my wrong processing.
So my initial config ran 5,000 articles per worker (20,000 / 4 workers), which led to the issue of not processing fast enough and reporting to my users.
I have now reduced my processing configuration to a tenth, 2,000 articles with 5 workers, which would process 400 articles per worker every minute.

To be sure: do 5 workers mean 5 Perl processes get fired up?

If so, I only see 4 processes running at 100%, so I assume my config of 2,000 was also not picked up for the new cron processing. Can I somehow see what the cron run is really using:
20,000 / 4 workers
or
2,000 / 5 workers
?

One additional question is still open for me:
what does the parameter MaximumParallelInstances do in this case?

Post by Johannes »

Your admins only get the notification when you start the processing from the shell. If the daemon starts the process it should not trigger a warning.

You can't see the config the worker uses. There is only one SysConfig setting and the worker uses it. You can only override it with the shell parameters, like --limit and so on.

20k tickets should not take that long to be processed. Just stop everything, mark everything for rebuild and let the daemon do its job. You can still work, it’s just a background job.

You could also measure the time it takes for the 20k tickets to be processed, to get an idea how long the whole rebuild will take. Just start one worker job (without parameters) and measure the runtime.

This varies based on your db setting and general system performance.
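
For example, a sketch of that one-off measurement run (defaults, timed with the shell's time built-in; the path assumes the standard /opt/otrs install):

Code: Select all

    time sudo -u otrs /opt/otrs/bin/otrs.Console.pl Maint::Ticket::FulltextIndexRebuildWorker
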
One additional question is still open for me:
what does the parameter MaximumParallelInstances do in this case?
A daemon job can spin up multiple instances. Not every job supports this (indexing does not); it is just a required parameter for the job config.

Post by hkais »

Johannes wrote: 21 Feb 2022, 21:56 This would not really help with MariaDB, unless you have a very special setup: MariaDB is single-threaded in 99% of cases. With Postgres, MSSQL or Oracle you get more performance out of more CPUs. The Perl code itself starts multiple workers to get more performance.
But there is still a catch: it takes much more time to fetch all viewable/searchable content for all of your 8.2 million articles at once than it takes to fetch just 20k of them. That's the reason for the limit. It seems strange, but you will need much more time trying to fit a complete cow into your mouth than eating the cow in smaller chunks. ;)
I restarted it yesterday via the cron scheduler to see if the performance changes. Sadly it did not change at all.
The errors are gone and no tickets are being created anymore.

But it is still unclear to me why the tickets were created before, since there was no manual trigger of the TicketIndex run.
As the problem is gone for now there is no need to investigate, but it is still strange that a regular indexing error appeared out of nowhere.

Post by hkais »

Any option to increase the speed of the reindexing?

Since yesterday's restart:

Code: Select all

Indexed Articles: 8.7%
With this speed the reindexing will take:
started roughly 9 pm yesterday, now it is about 11 pm
=> 26 h runtime so far

26 h / 8.7% ≈ 3 h per percent
=> ~300 h ≈ 12.5 days of total runtime to reindex

My database and disks are waiting most of the time while Perl is running at 100%:
(CPU and I/O monitoring screenshots attached: Screenshot_02686.png and Screenshot_02687.png)
I currently have 6 CPUs enabled for processing, so two are left over for Znuny operations.

It is not clear to me why Perl is eating that much CPU while progress is so slow.
MariaDB very rarely shows queries being processed in

Code: Select all

show processlist;
So it looks like neither the disks nor MariaDB are the problem at all?

Any ideas how to speed up this process, or what the bottleneck is here?

Post by Johannes »

Hi,

that's way too long, you are correct. But without access to your instance it's really hard to tell.

On the Perl side nothing fancy is happening:
- Perform an ArticleSearch on the flagged articles to get a list of relevant IDs
- Split up into chunks (workers)
- For every worker it's the same:
-- Delete the index entries for the given article
--- GetArticle
--- GetSearchableContent

Code: Select all

    
    # Content to search
    my %DataKeyMap = (
        'MIMEBase_From'    => 'From',
        'MIMEBase_To'      => 'To',
        'MIMEBase_Cc'      => 'Cc',
        'MIMEBase_Bcc'     => 'Bcc',
        'MIMEBase_Subject' => 'Subject',
        'MIMEBase_Body'    => 'Body',
    );
    # get the article content itself
    my %ArticleData = $Self->ArticleGet(
        TicketID      => $Param{TicketID},
        ArticleID     => $Param{ArticleID},
        UserID        => $Param{UserID},
        DynamicFields => 0,
    );
-- Filter the string or not, based on your settings
-- Rebuild the index (insert the index entry for this article)
-- Remove the rebuild flag
-- done

Back to the performance:

There are so many variables here which are very hard to check without access:
- MySQL monitoring.
Queries per second is interesting (see the sketch after this list).
It is not clear to me why Perl is eating that much CPU while progress is so slow.
Thousands of small inserts. SHOW PROCESSLIST only helps to find long-running queries, not thousands of small ones (mytop would be a start).
- Is it a cluster?
- Is replication active?
- Size of the db itself
- MySQL Cache settings, is read or write preferred?
- Modified Filter in the Sysconfig?
- Monitoring of IO Wait to get an idea what really happens
- Type of host (virt. | container | bare metal)
- Type of FS (and whether attachments are in the FS)
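
For the queries-per-second item above, a quick way to sample it from the shell (standard MySQL/MariaDB client tools; adjust user and credentials to your setup):

Code: Select all

    # prints an average as "Queries per second avg" since server start
    mysqladmin -u root -p status
    # or read the raw counter twice and diff it over the interval
    mysql -u root -p -e "SHOW GLOBAL STATUS LIKE 'Queries';"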

I think the customer backend does not play a role in this game, because only article information is fetched.

It could also be related to missing or too many indices on the DB itself after a migration. And sadly a whole lot more that I'm not thinking of right now.

But without real digging and a lot of time for debugging on your instance, I doubt we will be able to help.

Greetings

Post by hkais »

Johannes wrote: 23 Feb 2022, 09:53 On the Perl side nothing fancy is happening:
- Perform an ArticleSearch on the flagged articles to get a list of relevant IDs
- Split up into chunks (workers)
- For every worker it's the same:
-- Delete the index entries for the given article
--- GetArticle
--- GetSearchableContent

[quoted Perl snippet snipped; see the full code in Johannes's post above]
-- Filter the string or not, based on your settings
-- Rebuild the index (insert the index entry for this article)
-- Remove the rebuild flag
-- done
In which file is this?
And is it possible to trace what the process is doing?
AFAIK, if the database+disk were too slow, Perl should not be at 100% CPU usage, since it would spend more time waiting than processing on the CPU.
Johannes wrote: 23 Feb 2022, 09:53 There are so many variables here which are very hard to check without access:
- MySQL monitoring.
Queries per second is interesting.
Let me check what is already running here.
Johannes wrote: 23 Feb 2022, 09:53
It is not clear to me why Perl is eating that much CPU while progress is so slow.
Thousands of small inserts. SHOW PROCESSLIST only helps to find long-running queries, not thousands of small ones (mytop would be a start).
(mytop screenshot attached: Screenshot_02690.png)
Johannes wrote: 23 Feb 2022, 09:53 - Is it a cluster?
No, it is a virtual machine which can get enough CPUs + RAM and has fast SAN storage underneath.
Johannes wrote: 23 Feb 2022, 09:53 - Is replication active?
no
Johannes wrote: 23 Feb 2022, 09:53 - Size of the db itself
46GB on disk (du -h /var/lib/mysql/otrs)
Johannes wrote: 23 Feb 2022, 09:53 - MySQL Cache settings, is read or write preferred?
It is configured with large DB caches to have a fast UI, so read-preferred. Which parameters are interesting to you?
Johannes wrote: 23 Feb 2022, 09:53 - Modified Filter in the Sysconfig?
Puhh, yes, but years ago. I need to check whether it was documented at the time. If you tell me what you need, I can recheck those settings.
Johannes wrote: 23 Feb 2022, 09:53 - Monitoring of IO Wait to get an idea what really happens
IOPS and IO wait are monitored and are nowhere near the capacity the storage array is able to deliver.
Johannes wrote: 23 Feb 2022, 09:53 - Type of host (virt. | container | bare metal)
virtualized / vsphere
Johannes wrote: 23 Feb 2022, 09:53 - Type of FS (and whether attachments are in the FS)
ext4
Attachments are in the DB.
Johannes wrote: 23 Feb 2022, 09:53 I think the customer backend does not play a role in this game, because only article information is fetched.

It could also be related to missing or too many indices on the DB itself after a migration. And sadly a whole lot more that I'm not thinking of right now.
If it were an index issue, AFAIK it should be visible in the processlist? From my experience with DBs, an index issue makes either CPU or IOPS explode on the DB, and as a result you can see some queries taking ages.
Anyway, any idea how to quickly verify that the indexes on the DB are OK?
I am also not sure whether MariaDB is capable of deferred index updates. Or does anyone have experience with dropping indexes and rebuilding them afterwards?

Post by Johannes »

OK, first of all: you need to get the attachments out of the database.

MySQL and most other databases are very bad at handling blob storage; moving attachments out is one of the first things to do on a production instance.
Blobs can't be indexed or searched very well. And depending on your MySQL config you may also have a large ibdata file?
That would explain your performance issue. It would point to a wrong setting / historical change of innodb_file_per_table: in earlier versions all data was stored in one file, so even after you export all article information (plain text + attachments) the file would stay the same size. If file_per_table is correct (or the DB was imported on a new instance), the size should roughly decrease by the amount of article storage.

To export the attachments you can use the console command for the article storage switch. See the documentation here:
https://doc.znuny.org/doc/manual/admin/ ... rs-storage
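
A hedged example of the call (in current Znuny/OTRS 6 releases the console command is Admin::Article::StorageSwitch; verify the name and options against your version before running):

Code: Select all

    sudo -u otrs /opt/otrs/bin/otrs.Console.pl Admin::Article::StorageSwitch --target ArticleStorageFS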

Note: You may temporarily need double the article storage space on your MySQL partition, because MySQL alters tables using copy, change, write back.

I would also suggest using a separate share/mount or an LVM partition for it, to make it easier to extend if needed.
I would recommend XFS for the article storage mount; with ext4 you may run into problems with the available inodes.

After the migration the database should be way smaller.
- Is it a cluster?
- Is replication active?
I was referring to the DB, not the VM.

It is configured with large DB caches to have a fast UI, so read-preferred. Which parameters are interesting to you?
OK. I can't help with MySQL performance tuning. We usually hire someone if it is special, or (as a first try) create a fresh config to reduce possible leftovers from the past.
- Modified Filter in the Sysconfig?
Puhh, yes, but years ago. I need to check whether it was documented at the time. If you tell me what you need, I can recheck those settings.
I'm talking about the SearchIndex filters. There are only two:
- Ticket::SearchIndex::Attribute
- Ticket::SearchIndex::Filters
They may not even be relevant if you use unfiltered storage, which improves indexing performance but stores a lot of unnecessary stuff.
IOPS and IO wait are monitored and are nowhere near the capacity the storage array is able to deliver.
If you say so.

Anyway, any idea how to quickly verify that the indexes on the DB are OK?
The Support Assessment would tell you.
A first check is: bin/otrs.Console.pl Maint::Database::Check

Greetings

Edit: I missed this:
In which file is this?
And is it possible to trace what the process is doing?
AFAIK, if the database+disk were too slow, Perl should not be at 100% CPU usage, since it would spend more time waiting than processing on the CPU.
Kernel/System/Ticket/Article/Backend/MIMEBase.pm

Yes, you can use tools like strace or Devel::NYTProf. But as far as I can tell, the Perl code is not the problem here. I already tested two days ago: 950k articles took about 55 minutes. Times 10 for your article count is ~550 minutes, which is "normal" with the current implementation and usually done overnight. You can increase the number of workers, but I doubt the result would change a lot.
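
If you want to profile a single worker run anyway, a sketch using Devel::NYTProf (assumes the module is installed from CPAN; run from /opt/otrs):

Code: Select all

    # profile one worker run; writes ./nytprof.out
    sudo -u otrs perl -d:NYTProf bin/otrs.Console.pl Maint::Ticket::FulltextIndexRebuildWorker
    # render the profile as browsable HTML under ./nytprof/
    nytprofhtml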