
Wikisource Ebooks: Investigate job queue for more efficient ebook generation [16H]
Closed, Resolved · Public · Nov 4 2020

Description

As a Wikisource user, I want the team to investigate a queue-running system, so it can be determined whether a) such a system would improve reliability to a meaningful degree, and b) the work would be manageable and within scope for the team.

Background: We should investigate adding a queue-running system, similar to what we built for #eventmetrics, for WSExport. This would mean that users would submit a request for an ebook, it'd be added to the queue, and they'd get a status page indicating the progress. The queue would first generate the epub, which would then be available for download, and then it'd generate the derivative forms (PDF etc.) and make those available when done. This would help with errors such as T250614. This system would also effectively give us a cache (e.g. if two people request the same ebook, only one queue process would need to run). Task for that: T222936.

Acceptance Criteria:

  • Investigate the primary work that would need to be done in order to implement a queue-running system, similar to what we built for #eventmetrics, for WSExport.
  • Investigate the main challenges, risks, and possible dependencies associated with implementing such a system
  • Provide a general estimate/idea, if possible, of the potential impact it may have on ebook export reliability.
    • In other words, do we have a strong hunch that this could, indeed, improve reliability (and in a considerable way)? Why or why not?
  • Provide a general estimate/rough sense of the level of effort required to do such work

NOTES:

  • We will also want to see how often many people are downloading at once (i.e., how often there are issues that implementing a job queue could address). This is data we could potentially work with Jennifer to get.
  • We will investigate UX questions related to informing users of the status of a download (i.e., whether it is almost done, whether there are errors and they should try again, etc.) in a separate investigation. Refer to the parent task T256707 for relevant design & UX work on the ebook export process.

Details

Due Date
Nov 4 2020, 5:00 AM

Event Timeline

Restricted Application added a subscriber: Aklapper.
ifried renamed this task from Add job queue for more efficient ebook generation to Wikisource Ebooks: Add job queue for more efficient ebook generation.Jun 11 2020, 10:39 PM
ifried renamed this task from Wikisource Ebooks: Add job queue for more efficient ebook generation to Wikisource Ebooks: Investigate job queue for more efficient ebook generation.
ifried updated the task description.
ARamirez_WMF renamed this task from Wikisource Ebooks: Investigate job queue for more efficient ebook generation to Wikisource Ebooks: Investigate job queue for more efficient ebook generation [8H}.Jun 19 2020, 12:03 AM
ARamirez_WMF moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.
ARamirez_WMF renamed this task from Wikisource Ebooks: Investigate job queue for more efficient ebook generation [8H} to Wikisource Ebooks: Investigate job queue for more efficient ebook generation [8H].Aug 20 2020, 11:29 PM
ifried updated the task description.
ARamirez_WMF renamed this task from Wikisource Ebooks: Investigate job queue for more efficient ebook generation [8H] to Wikisource Ebooks: Investigate job queue for more efficient ebook generation [16H].Sep 24 2020, 5:57 PM
ARamirez_WMF changed the subtype of this task from "Task" to "Deadline".

An important distinction to be made with Event Metrics: ultimately, the "report" data in Event Metrics gets pre-stored in a database, indefinitely. This isn't a problem in terms of storage because the reports are just numbers. In addition, Event Metrics records a timestamp of when the report was generated, and you as the user will always get that version of the report until you ask for an updated one. In our case, we end up with an epub (or other format), which is not as cheap to store, and we also want to automatically ensure the user is served the latest possible version. So the two systems won't work exactly the same.

In its simplest form, the purpose of a job queue would be to ensure there are never too many resource-intensive processes running at the same time. Beyond that, we also need to cache the exported file for some period of time. For now I'm just going to go off of the investigation at T222936 and recommend a brief period, say 10 minutes.

Borrowing from the system we use for Event Metrics, I envision this working something like the following:

First we need a table in the database to keep track of the jobs. The schema could look something like:

  • id – unique ID of the job
  • filename – unique filename for the exported book, something like [title]-[lang]-[font].[format]
  • submitted – when the job was submitted
  • status – status of the job, stored as a smallint but in English it's one of:
    • queued – waiting to be spawned by cron
    • started – currently processing
    • failed_timeout – timed out (we'll come up with a maximum period of time that any job should run, Sam and others probably know what a good value would be)
    • failed_unknown – failed for some other, unknown reason

There's no "completed" status because the system will refer to the file system to determine whether a job is needed.

So the whole pipeline might look something like:

  1. Request comes in
  2. Check if a file already exists for the desired work/format
    1. If it exists, serve it. Nothing else needs to be done.
  3. (no file exists) Check the job table to see if a job is pending for the desired work/format.
    1. If a job exists, the client goes back to step 1 (continually pinging the server until a file becomes available, with some reasonable timeout before failing gracefully)
  4. (no job exists) A row is added to the job table and set to "queued"
  5. A cron is run every minute that spawns queued jobs (setting the status to "started"), never allowing more than N jobs to be running at the same time (N can probably be a fairly high threshold)
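
A rough sketch of the server-side half of this pipeline (steps 2–4); the function and path names here are invented, not WSExport's actual API:

```
import os
import sqlite3
from datetime import datetime, timezone

CACHE_DIR = "/var/cache/wsexport"  # assumed location of exported files
STATUS_QUEUED = 0                  # as in the schema sketch above

def handle_request(filename: str) -> dict:
    """Serve the cached file, report a pending job, or enqueue a new one."""
    path = os.path.join(CACHE_DIR, filename)

    # Step 2: if the file already exists, serve it; nothing else to do.
    if os.path.exists(path):
        return {"state": "ready", "path": path}

    conn = sqlite3.connect("wsexport_jobs.db")

    # Step 3: a job is already pending, so the client should poll again.
    row = conn.execute(
        "SELECT status FROM job WHERE filename = ?", (filename,)
    ).fetchone()
    if row is not None:
        return {"state": "pending", "status": row[0]}

    # Step 4: no file and no job; queue one for the cron to pick up.
    conn.execute(
        "INSERT INTO job (filename, submitted, status) VALUES (?, ?, ?)",
        (filename, datetime.now(timezone.utc).isoformat(), STATUS_QUEUED),
    )
    conn.commit()
    return {"state": "queued"}
```

One nice side effect: the UNIQUE constraint on filename would reject the second of two racing inserts, so two simultaneous first requests for the same book can't enqueue duplicate jobs (the real handler would need to catch that and fall back to the pending path).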

Under this system, we will never be exporting the same thing twice at the same time. This doesn't cover the caching part (more on that below); it just ensures any unique combination of title/format/font is only exported once at a time, and ensures that at any given time we have enough resources (RAM and such) to properly export a work. I don't think we've confirmed that RAM and/or CPU overhead is actually a major problem, but the job queue can nonetheless ensure it stays that way.

Now, more about this cron job. Its workflow could be something like:

  1. Start processing
  2. Once complete, the file exists (which is the first thing that gets checked on each request), so the corresponding row in the job table can be deleted. The next request will serve the actual file.
  3. If the job fails, we set the status to indicate this (whether it was due to timeout or something unknown). This needs to be persisted so that on the next ping from the client, we can tell them the export failed.
    1. We don't have emailing of errors set up yet for WSExport, but when we do, it could email us when this happens.
  4. Delete files that are over 10 minutes old

#4 is effectively the caching bit. It could be done by a separate cron, but I worry about the two competing with each other (one is just about to serve a book when the other deletes it).
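
Here's how the cron could combine the spawn and cleanup steps in one pass, avoiding that race. MAX_RUNNING, the cache path, and the "wsexport-cli" command are all placeholders:

```
import os
import sqlite3
import subprocess
import time

CACHE_DIR = "/var/cache/wsexport"   # assumed location of exported files
DB = "wsexport_jobs.db"
MAX_RUNNING = 10                    # "probably can be a fairly high threshold"
CACHE_TTL = 10 * 60                 # keep finished files for ten minutes
STATUS_QUEUED, STATUS_STARTED = 0, 1

def run_cron():
    conn = sqlite3.connect(DB)

    # Only spawn as many new jobs as the concurrency cap allows.
    running = conn.execute(
        "SELECT COUNT(*) FROM job WHERE status = ?", (STATUS_STARTED,)
    ).fetchone()[0]
    queued = conn.execute(
        "SELECT id, filename FROM job WHERE status = ? LIMIT ?",
        (STATUS_QUEUED, max(MAX_RUNNING - running, 0)),
    ).fetchall()

    for job_id, filename in queued:
        conn.execute("UPDATE job SET status = ? WHERE id = ?",
                     (STATUS_STARTED, job_id))
        conn.commit()
        # Fire and forget: the spawned process would delete its own row on
        # success, or set failed_timeout / failed_unknown on error.
        subprocess.Popen(["wsexport-cli", "export", filename])

    # Step 4, the caching bit: delete files more than ten minutes old.
    now = time.time()
    for name in os.listdir(CACHE_DIR):
        path = os.path.join(CACHE_DIR, name)
        if now - os.path.getmtime(path) > CACHE_TTL:
            os.remove(path)

if __name__ == "__main__":
    run_cron()
```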

A few things I'm unsure about:

  • Should we cache just the epub, the final format, or both?
  • For small books that export quickly, the job queue system could slow down the experience significantly, because you have to wait up to a minute for the next cron run. Perhaps the queue could be bypassed if we can programmatically determine that a work isn't very expensive to export (say, by number of pages). This is basically what we did for Event Metrics. We also have the option of simulating a cron rate of less than a minute by having multiple cron jobs and putting a sleep in them. Definitely a hack, but from quick research this isn't unheard of (see the sketch after this list).
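
For what it's worth, the multiple-crons-plus-sleep trick could look like the following; the module name, paths, and 30-second cadence are all invented for illustration:

```
# Install this wrapper twice in crontab, staggered by a sleep, to get a
# 30-second cadence:
#
#   * * * * * python3 /srv/wsexport/cron_wrapper.py 0
#   * * * * * python3 /srv/wsexport/cron_wrapper.py 30
import sys
import time

from cron_worker import run_cron  # the worker sketched above

if __name__ == "__main__":
    offset = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    time.sleep(offset)  # delay this run partway into the minute
    run_cron()
```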

I'm going to stop here for now and let the other engineers read this, and we will discuss tomorrow in our meeting. I'm not certain whether the above is the best approach, but it would allow us to steal a lot of already-written code from Event Metrics.

ARamirez_WMF changed Due Date from Oct 21 2020, 4:00 AM to Nov 4 2020, 5:00 AM.Oct 22 2020, 7:39 PM