Moirai: A Time Series Foundation Model for Universal Forecasting

TL;DR: Moirai is a cutting-edge time series foundation model, offering universal forecasting capabilities. It stands out as a versatile forecasting model capable of addressing diverse forecasting tasks across multiple domains, frequencies, and variables in a zero-shot manner. To achieve this, Moirai tackles four major challenges: (i) construction of LOTSA, a large-scale and diverse time series dataset comprising 27 billion observations spanning 9 distinct domains; (ii) development of multiple patch size projection layers, allowing a single model to capture temporal patterns across varied frequencies; (iii) implementation of an any-variate attention mechanism, empowering a single model to produce forecasts over any number of variables; and (iv) integration of a mixture distribution to model flexible predictive distributions. Through comprehensive evaluation in both in-distribution and out-of-distribution settings, Moirai demonstrates its prowess as a zero-shot forecaster, consistently delivering competitive or superior performance compared with full-shot models.

The need for a universal forecaster

Time series data pervades a vast range of domains, including retail, finance, manufacturing, healthcare, and the natural sciences. Across these sectors, time series forecasting is a critical application with significant implications for decision making. Although significant strides have been made in deep learning for time series forecasting, most recent advances still predominantly adhere to the conventional paradigm of training one model per dataset with a fixed, pre-defined context and prediction length. Such a paradigm inevitably imposes a substantial burden in terms of the computational cost of training these models, especially when scaling to large numbers of customers.

For example, growing demand for cloud computing services has magnified the importance of efficiently managing resources in IT infrastructure. Operational forecasting has emerged as a critical component of the resource-management pipeline, serving as the main driving factor for capacity planning, budget planning, scenario risk assessment, cost optimization, and anomaly detection. However, with the ever-increasing demand for compute resources and the growing size of IT infrastructure, the ability of service providers to address forecasting needs across a multitude of tasks is constantly challenged, on top of having to build task- or user-specific forecasters.

This motivates us to move towards the universal forecasting paradigm (see Figure 1), where a single large pre-trained model is able to handle any time series forecasting problem.

Figure 1. A universal forecaster is a large pre-trained model capable of handling any time series forecasting problem. It is trained on a large-scale time series dataset spanning multiple domains. Compared to the existing paradigm, universal forecasting faces three key problems: i) multiple frequencies, ii) any-variate forecasting, and iii) varied distributions.

The challenges of building a universal forecaster

The paradigm shift towards foundation models was first sparked by the field of Natural Language Processing (NLP), which successfully trained Large Language Models (LLMs) on diverse web-scale data, capable of tackling a wide variety of downstream tasks and even multiple languages. One key innovation that enables LLMs to handle diverse languages is Byte Pair Encoding (BPE) – converting heterogeneous languages into a unified format. Unlike NLP, the field of time series has no BPE equivalent, making it non-trivial to build a time series foundation model that can handle the heterogeneity of time series data.

  • Firstly, the frequency (e.g., minutely, hourly, daily sampling rates) of a time series plays an important role in determining the patterns present in the data. However, cross-frequency learning poses challenges due to negative interference, and existing approaches typically circumvent this issue on multi-frequency datasets by training one model per frequency.
  • Secondly, time series data exhibit heterogeneity in dimensionality, where multivariate time series may have differing numbers of variables. Moreover, each variable typically measures a semantically distinct quantity across datasets. While treating each variable of a multivariate time series independently can mitigate this issue, a universal model should ideally be flexible enough to consider interactions between variables and account for exogenous covariates.
  • Thirdly, probabilistic forecasting is a critical requirement for many applications. However, different datasets have varied support and distributional properties. For instance, using a symmetric distribution (e.g., Normal, Student's t) as the predictive distribution may not be appropriate for certain time series. Consequently, standard approaches that pre-define a simple parametric distribution may lack the flexibility needed to capture the diverse range of datasets effectively.
  • Lastly, building a large pre-trained model capable of universal forecasting necessitates a comprehensive dataset spanning diverse domains. Unfortunately, existing time series datasets are generally insufficiently large and diverse to support the training of such models.
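To make the third challenge concrete, here is a minimal NumPy sketch (illustrative only, not Moirai code) comparing a single Gaussian against an equal-weight two-component Gaussian mixture on bimodal data; the mixture achieves a much lower negative log-likelihood, showing why a single fixed symmetric distribution can be inadequate.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    # Log-density of a Normal(mu, sigma) distribution.
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

rng = np.random.default_rng(0)
# Bimodal data that no single symmetric distribution fits well.
data = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])

# Single Gaussian fitted by moments.
nll_single = -gaussian_logpdf(data, data.mean(), data.std()).mean()

# Equal-weight two-component mixture with components at the two modes.
log_half = np.log(0.5)
nll_mixture = -np.logaddexp(
    log_half + gaussian_logpdf(data, -3.0, 0.5),
    log_half + gaussian_logpdf(data, 3.0, 0.5),
).mean()

print(nll_mixture < nll_single)  # True: the mixture fits far better
```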

Our Novel Solution: Unified Training of Universal Time Series Forecasting Transformers

Figure 2. The overall architecture of Moirai. The visualization depicts a 3-variate time series, where variates 0 and 1 represent target variables (i.e., those to be forecasted), and variate 2 serves as a dynamic covariate (with known values in the forecast horizon). Using a patch size of 64, each variate is patchified into three tokens. The patch embeddings, along with sequence and variate identifiers, are fed into the Transformer. The shaded patches in the visualization denote the forecast horizon to be predicted. The corresponding output representations of these patches are then mapped into the parameters of the mixture distribution.

To address these challenges, we introduce novel modifications (see Figure 2) to the conventional time series Transformer architecture to handle the heterogeneity of arbitrary time series data. Here are the most important components and contributions of our work:

  • Firstly, we propose to handle the problem of varied frequencies in time series data by learning multiple input and output projection layers. These layers are designed to handle the diverse patterns present in time series of different frequencies. By employing patch-based projections with larger patch sizes for high-frequency data and vice versa, the projection layers are specialized to learn the patterns specific to each frequency.
  • Secondly, we tackle the problem of varied dimensionality using our proposed Any-variate Attention mechanism. This approach simultaneously considers both the time and variate axes as a single sequence, leveraging Rotary Position Embeddings (RoPE) and learned binary attention biases to encode the time and variate axes, respectively. Importantly, Any-variate Attention enables the model to accept an arbitrary number of variates as input.
  • Thirdly, we overcome the problem of requiring flexible predictive distributions by introducing a mixture of parametric distributions. By optimizing the negative log-likelihood of a flexible distribution, we ensure our model is competitive with target-metric optimization, an important feature for pre-training universal forecasters. This approach allows for subsequent evaluation using any target metric.
  • Lastly, to facilitate the training of our large time series model, we introduce LOTSA, the largest collection of open time series datasets, built by collating publicly available sources of time series data. This effort aims to cover a wide spectrum of domains, consolidating datasets from various sources with differing formats. The resulting collection spans 9 domains, with a total of 27B observations, with key statistics in Tables 2 and 3. Further details on the key properties of these datasets, such as the domain, frequency, number of time series, number of target variates, number of past covariates, and the total number of observations, can be found in our research paper (
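As a rough illustration of the first contribution above, the sketch below maps each frequency to a patch size and applies the matching projection layer; the frequency codes, sizes, and function names are hypothetical, not Moirai's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 8

# One projection matrix per supported patch size (hypothetical mapping;
# the idea is that higher-frequency data gets larger patch sizes).
PATCH_SIZES = {"H": 64, "D": 32, "W": 16}  # frequency code -> patch size
weights = {p: rng.normal(0, 0.02, (p, D_MODEL)) for p in PATCH_SIZES.values()}

def embed(series: np.ndarray, freq: str) -> np.ndarray:
    """Patchify a 1-D series and project each non-overlapping patch with
    the layer matching its frequency's patch size."""
    p = PATCH_SIZES[freq]
    n = len(series) // p                      # number of patches
    patches = series[: n * p].reshape(n, p)   # (n_patches, patch_size)
    return patches @ weights[p]               # (n_patches, d_model)

hourly = rng.normal(size=192)   # e.g. 8 days of hourly data
tokens = embed(hourly, "H")
print(tokens.shape)  # (3, 8): 192 / 64 = 3 tokens, each of dimension 8
```

A single model can thus ingest series of very different sampling rates while keeping the token sequence length manageable.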

Deeper Dive: Moirai

As illustrated in Figure 2, Moirai follows a (non-overlapping) patch-based approach to modeling time series with a masked encoder architecture. One of our proposed modifications to extend the architecture to the any-variate setting is to "flatten" multivariate time series, considering all variates as a single sequence. Patches are subsequently projected into vector representations via a multi-patch-size input projection layer. The [mask] indicates a learnable embedding that replaces patches falling within the forecast horizon. The output tokens are then decoded via the multi-patch-size output projection into the parameters of the mixture distribution. While not visualized, (non-learnable) instance normalization is applied to inputs/outputs, in line with current standard practice for deep forecasting models.
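The flattening step can be sketched as follows. This is a simplified illustration, not Moirai's code: it masks the horizon of every variate (whereas a dynamic covariate, like variate 2 in Figure 2, would keep its known values), and it omits the learnable [mask] embedding, projections, and instance normalization.

```python
import numpy as np

PATCH = 64
MASK = np.zeros(PATCH)  # stand-in for the learnable [mask] embedding

def flatten_multivariate(series: np.ndarray, horizon: int):
    """Flatten a (n_variates, length) series into one token sequence of
    patches, masking patches inside the forecast horizon and tagging each
    token with its time index and variate index."""
    v, length = series.shape
    n = length // PATCH
    mask_from = (length - horizon) // PATCH   # first patch index in the horizon
    tokens, time_id, variate_id = [], [], []
    for var in range(v):
        for t in range(n):
            patch = series[var, t * PATCH:(t + 1) * PATCH]
            tokens.append(MASK if t >= mask_from else patch)
            time_id.append(t)
            variate_id.append(var)
    return np.stack(tokens), np.array(time_id), np.array(variate_id)

series = np.random.default_rng(0).normal(size=(3, 192))  # 3 variates, as in Figure 2
tokens, time_id, variate_id = flatten_multivariate(series, horizon=64)
print(tokens.shape)  # (9, 64): 3 variates x 3 patches each
```

The time and variate indices produced here are what RoPE and the binary attention biases would respectively encode inside Any-variate Attention.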

In our pre-training task, we formulate the objective as optimizing the log-likelihood of the mixture distribution. The design of both the data distribution and the task distribution are two critical aspects of the pre-training pipeline. This design imparts versatile capabilities to our Large Time Series Model (LTM), enabling it to adapt to a range of downstream tasks. This flexibility stands in contrast to the existing deep forecasting paradigm, in which models are typically trained for specific datasets and settings.
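The objective above amounts to a mixture negative log-likelihood over the forecast horizon. The toy version below uses only Gaussian components for simplicity (Moirai's actual mixture combines several parametric families, and the parameters come from the output projection rather than being fixed):

```python
import numpy as np

def mixture_nll(y, means, scales, logits):
    """Negative log-likelihood of targets under a Gaussian mixture whose
    per-step parameters would be emitted by the output projection."""
    # Normalize mixture weights in log space.
    log_w = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
    # Component log-densities, broadcast over the component axis.
    log_pdf = (-0.5 * np.log(2 * np.pi * scales ** 2)
               - (y[..., None] - means) ** 2 / (2 * scales ** 2))
    # Log-sum-exp over components, averaged over time steps.
    return -np.logaddexp.reduce(log_w + log_pdf, axis=-1).mean()

rng = np.random.default_rng(0)
y = rng.normal(size=32)            # targets in the forecast horizon
means = np.zeros((32, 3))          # 3 mixture components per step
scales = np.ones((32, 3))
logits = np.zeros((32, 3))         # uniform mixture weights
print(mixture_nll(y, means, scales, logits))
```

With identical components, this reduces exactly to the single-Gaussian negative log-likelihood; the benefit of the mixture appears once components with different locations, scales, or families are allowed.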


We train Moirai in 3 sizes – small/base/large with 14M/91M/311M parameters! On in-distribution evaluations using the Monash Time Series Forecasting Benchmark, Moirai shows remarkable performance, beating all baselines.

In out-of-distribution/zero-shot forecasting evaluations, Moirai consistently demonstrates competitive performance, and in some cases surpasses state-of-the-art full-shot models. This advantage is observed across both probabilistic forecasting and long-sequence forecasting benchmarks.

Here are some visualizations of zero-shot forecasts from Moirai on unseen datasets. As depicted, Moirai adeptly produces forecasts marked by discernible seasonal patterns on ETTh1-1 and ETTh1-2, while also accurately capturing trend patterns on ETTm1-1 and ETTm1-2. These illustrations underscore Moirai's capacity to deliver insightful predictions across varied cases.

Impact: Why Moirai Matters

Moirai provides robust zero-shot forecasting capabilities across a diverse range of time series spanning multiple domains and frequencies. By harnessing the power of large-scale data pretraining, this time series foundation model transforms the landscape, departing from the outdated one-model-per-dataset approach. It offers substantial benefits to users in downstream forecasting tasks, eliminating the need for the additional data, extensive computational resources, and expert input typically required to achieve accurate forecasts with deep learning models. Additionally, Moirai's ability to handle multivariate time series of any dimension further democratizes accurate forecasting by reducing reliance on both computational resources and deep learning expertise. Besides being a significant breakthrough for academia, Moirai has diverse applications including IT operations, sales forecasting, capacity planning, energy forecasting, and many others.

The Bottom Line

  • Moirai is designed to perform universal forecasting with masked encoder-based time series Transformers.
  • LOTSA is the largest collection of open data for pre-training time series forecasting models.
  • Moirai addresses the key challenges of universal forecasting, supporting diverse domains, multiple frequencies, and any-variate forecasting in a zero-shot manner.
  • Evaluated in both in-distribution and out-of-distribution settings, Moirai shines as a zero-shot forecaster, delivering competitive or even superior performance compared with full-shot models.

Explore More

Salesforce AI invites you to dive deeper into the concepts discussed in this blog post (see links below). Connect with us on social media and our website to get regular updates on this and other research projects.

  • Learn more: Check out our research paper (, which describes our work in greater detail.
  • Code: Check out our code on GitHub:
  • Dataset: Check out the LOTSA data on Hugging Face:
  • Contact us:
  • Follow us on Twitter: @SalesforceResearch, @Salesforce
  • Blog: To read other blog posts, please see
  • Main site: To learn more about all the exciting projects at Salesforce AI Research, please visit our main website at

About the Authors

Gerald Woo is a Ph.D. candidate in the Industrial PhD Program at Singapore Management University and a researcher at Salesforce AI Research Asia. His research focuses on deep learning for time series, including representation learning and forecasting.

Chenghao Liu is a Lead Applied Scientist at Salesforce AI Research Asia, working on AIOps research, including time series forecasting, anomaly detection, and causal machine learning.

Doyen Sahoo is the Director of Salesforce AI Research Asia. Doyen leads various projects on AI for IT Operations (AIOps), AI for Software, and time series intelligence – working on both fundamental and applied research.

Caiming Xiong holds the positions of Managing Director and Vice President at Salesforce AI Research. He oversees the development and application of technologies such as Large Language Models (LLMs), multimodal LLMs, Large Action Models, AI for software, time series, and other foundational research areas. Additionally, Caiming directs the transition of these AI projects from research into production environments.
