PyTorch is a widely used framework for building LSTM models, especially in large projects. Its torch.nn.LSTM module provides:
- Support for multiple stacked layers.
- Automatic gate handling and state tracking.
- Automatic gradient computation through .backward(), which makes training LSTMs straightforward.
- Easy transfer of the model to a GPU for faster training.
PyTorch LSTM documentation: https://docs.pytorch.org/docs/stable/generated/torch.nn.LSTM.html. Let’s quickly recall the math:
- Forget Gate
\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]
- Input Gate
\[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \]
- Cell Update
\[ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \]
- Output Gate
\[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \]
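- New Cell State
\[ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \]
- Hidden State Update
\[ h_t = o_t \odot \tanh(C_t) \]
Here \(\odot\) denotes element-wise multiplication, while \(\cdot\) in the gate equations is a matrix-vector product.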
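The gate equations above can be traced in a few lines of code. The following is a minimal sketch of a single time step, not PyTorch's internal implementation: the function name lstm_step, the single stacked weight matrix W, and the gate order (forget, input, candidate, output) are assumptions made for illustration.

```python
import torch


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the gate equations above.

    W maps the concatenation [h_{t-1}, x_t] to the four stacked
    pre-activations (forget, input, candidate, output). This stacking
    order is an assumption of the sketch, not PyTorch's own layout.
    """
    hidden_size = h_prev.shape[-1]
    z = torch.cat([h_prev, x_t], dim=-1) @ W.T + b  # (batch, 4 * hidden_size)
    f_pre, i_pre, g_pre, o_pre = z.split(hidden_size, dim=-1)
    f_t = torch.sigmoid(f_pre)    # forget gate
    i_t = torch.sigmoid(i_pre)    # input gate
    c_tilde = torch.tanh(g_pre)   # candidate cell state
    o_t = torch.sigmoid(o_pre)    # output gate
    c_t = f_t * c_prev + i_t * c_tilde   # new cell state
    h_t = o_t * torch.tanh(c_t)          # new hidden state
    return h_t, c_t


torch.manual_seed(0)
batch, input_size, hidden_size = 2, 3, 4   # illustrative sizes
x_t = torch.randn(batch, input_size)
h_prev = torch.zeros(batch, hidden_size)
c_prev = torch.zeros(batch, hidden_size)
W = torch.randn(4 * hidden_size, hidden_size + input_size) * 0.1
b = torch.zeros(4 * hidden_size)

h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)
print(h_t.shape, c_t.shape)
```

Both returned tensors have shape (batch, hidden_size). In practice nn.LSTM does this for every time step and every layer, so this function is only useful for understanding the math.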
Read more about LSTM here.
1. Parameters
1.1. input_size
“The number of expected features in the input x”
This is the dimensionality of the input at each time step, i.e. how many features each element of your sequence has.
E.g. you’re passing a batch of stock price data where each day includes:
- Only close price: input_size = 1
- Open, high, low, close, volume: input_size = 5
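As a quick illustration of the two cases above, the sketch below feeds random tensors of the matching shapes into nn.LSTM. The batch size, sequence length, and hidden size are arbitrary values chosen for the example.

```python
import torch
import torch.nn as nn

batch_size, seq_len = 4, 30  # illustrative values

# Only the close price: one feature per day.
close_only = torch.randn(batch_size, seq_len, 1)
lstm_1 = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
out_1, _ = lstm_1(close_only)

# Open, high, low, close, volume: five features per day.
ohlcv = torch.randn(batch_size, seq_len, 5)
lstm_5 = nn.LSTM(input_size=5, hidden_size=16, batch_first=True)
out_5, _ = lstm_5(ohlcv)

# The output size depends on hidden_size, not on input_size.
print(out_1.shape, out_5.shape)
```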
1.2. hidden_size
“The number of features in the hidden state h”
It defines the size of the hidden state and the cell state, and therefore the memory capacity of your LSTM. We can think of it as the number of cells (units) in each LSTM layer.
hidden_size | Pros | Cons |
---|---|---|
Small (16–32) | Fast, low memory, less overfitting | Can’t model complex patterns |
Medium (64–128) | Good balance between capacity and speed | Still limited on very complex data |
Large (256–512+) | Can model long-term dependencies and complex patterns | Slower, more memory, higher risk of overfitting |
We should start with 64–128 and tune based on model performance.
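To get a feel for how quickly the cost grows with hidden_size, the sketch below counts the parameters of a single-layer LSTM for a few sizes. The helper count_params is a hypothetical name introduced for this example, and the closed-form count assumes one layer with bias enabled.

```python
import torch.nn as nn


def count_params(module):
    # Total number of learnable values in the module.
    return sum(p.numel() for p in module.parameters())


for hidden_size in (16, 64, 256):
    lstm = nn.LSTM(input_size=1, hidden_size=hidden_size)
    # 4 gates, each with input weights, recurrent weights and two bias vectors.
    expected = 4 * (1 * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    print(hidden_size, count_params(lstm), expected)
```

The recurrent weights grow with the square of hidden_size, which is why large hidden sizes are noticeably slower and heavier.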
1.3. num_layers
“Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1”
Instead of using a single LSTM layer, PyTorch allows you to stack several on top of each other. num_layers specifies how many LSTM layers are stacked vertically in your model:
- num_layers = 1: one LSTM layer processes the sequence.
- num_layers = 2: two LSTM layers, output (hidden states) of layer 1 is fed into layer 2.
- num_layers = N: the model has N stacked LSTM layers, one feeding into the next.
Pros | Cons |
---|---|
Increases model capacity and abstraction power. Lower layers learn basic patterns, higher layers learn more abstract features. | More layers, more parameters, slower training, and higher overfitting risk. Not always helpful for small datasets. |
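The stacking can be observed in the tensor shapes. This is a small sketch with arbitrary sizes: the returned h_n holds one final hidden state per layer, and the input weights of the second layer are sized for hidden_size features because that layer reads the first layer's outputs.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 30, 1)  # (batch, seq, feature), illustrative sizes
lstm = nn.LSTM(input_size=1, hidden_size=16, num_layers=3, batch_first=True)
out, (h_n, c_n) = lstm(x)

print(out.shape)                 # hidden states of the top layer at every time step
print(h_n.shape)                 # final hidden state of each of the 3 layers
print(lstm.weight_ih_l0.shape)   # layer 0 reads the raw input (1 feature)
print(lstm.weight_ih_l1.shape)   # layer 1 reads layer 0's output (16 features)
```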
1.4. bias
“If False, then the layer does not use bias weights b_ih and b_hh. Default: True”
bias=True means that each LSTM layer includes bias vectors in its internal computations. These biases are added to the linear transformations inside each gate of the LSTM: Forget Gate, Input Gate, Cell Update, Output Gate. It helps the model:
- Produce a useful gate response even if the inputs are zero.
- Shift the operating point of the activation functions (sigmoid and tanh).
- Fit the data more flexibly, which often helps convergence.
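The effect of the flag is visible in the list of learnable parameters. The sketch below uses an arbitrary hidden size and simply prints the parameter names with and without bias.

```python
import torch.nn as nn

with_bias = nn.LSTM(input_size=1, hidden_size=8, bias=True)
without_bias = nn.LSTM(input_size=1, hidden_size=8, bias=False)

# With bias=True the layer has bias_ih_l0 and bias_hh_l0 in addition to the weights.
print([name for name, _ in with_bias.named_parameters()])
print([name for name, _ in without_bias.named_parameters()])

# Each bias vector stacks one bias per gate: 4 * hidden_size values.
print(with_bias.bias_ih_l0.shape)
```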
1.5. batch_first
“If True, then the input and output tensors are provided as (batch, seq, feature) instead of (seq, batch, feature). Note that this does not apply to hidden or cell states. See the Inputs/Outputs sections below for details. Default: False”
The batch_first parameter controls the input and output tensor shape format.
- batch_first=True: Input shape is (batch_size, seq_len, input_size). This matches how most people naturally organize and understand data, with the batch as the first dimension.
- batch_first=False: Input shape is (seq_len, batch_size, input_size). This time-major layout is the default for PyTorch's recurrent layers.
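The two layouts can be compared side by side. This is a minimal sketch with arbitrary sizes; note that the hidden state keeps the same shape in both cases, as the quoted documentation points out.

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 4, 30, 1, 16  # illustrative

lstm_bf = nn.LSTM(input_size, hidden_size, batch_first=True)
out_bf, (h_bf, _) = lstm_bf(torch.randn(batch_size, seq_len, input_size))

lstm_tm = nn.LSTM(input_size, hidden_size, batch_first=False)
out_tm, (h_tm, _) = lstm_tm(torch.randn(seq_len, batch_size, input_size))

print(out_bf.shape)            # (batch, seq, hidden)
print(out_tm.shape)            # (seq, batch, hidden)
print(h_bf.shape, h_tm.shape)  # both (num_layers, batch, hidden)
```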
1.6. dropout
“If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0”
Dropout is a regularization technique used to prevent overfitting. It works by randomly turning off some neurons (setting them to 0) during training. In nn.LSTM, dropout is applied to the outputs passed from one stacked layer to the next, not to the recurrent connections between time steps and not after the last layer. It therefore only has an effect when num_layers > 1.
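A small experiment makes the train/eval difference concrete. In this sketch (sizes and the dropout probability are arbitrary), two forward passes on the same input differ in training mode because each pass draws a new dropout mask, and agree in evaluation mode where dropout is disabled.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 10, 1)
lstm = nn.LSTM(input_size=1, hidden_size=16, num_layers=2, dropout=0.5, batch_first=True)

lstm.train()  # dropout active between layer 1 and layer 2
train_a, _ = lstm(x)
train_b, _ = lstm(x)

lstm.eval()   # dropout disabled
eval_a, _ = lstm(x)
eval_b, _ = lstm(x)

print(torch.equal(train_a, train_b))  # the two training passes use different masks
print(torch.equal(eval_a, eval_b))    # evaluation passes are deterministic
```

This is why model.eval() should be called before making predictions with a model that uses dropout.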
1.7. bidirectional
“If True, becomes a bidirectional LSTM. Default: False”
Setting bidirectional=True means the LSTM will process the input sequence in both forward and backward directions. Benefits of a bidirectional LSTM:
- The model can see both past and future context at each time step.
- Often improves performance on NLP and sequence modeling tasks.
- Useful for text classification, speech recognition, and similar tasks where the whole sequence is available.
When not to use a bidirectional LSTM:
- Predicting the future: stock price forecasting, weather prediction, sensor forecasting, etc. Future values are not available at prediction time.
- Real-time / streaming data: live voice-to-text, robot movement control, etc.
- Lightweight / faster models: a bidirectional LSTM has about 2x more parameters and is slower.
- Tasks that don’t need full context: event detection, some anomaly detection, etc.
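The practical consequence of bidirectional=True is a change in tensor shapes. The sketch below uses arbitrary sizes, and the linear head is a hypothetical downstream layer added only to show that it must accept 2 * hidden_size features.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 30, 1)  # (batch, seq, feature), illustrative sizes
lstm = nn.LSTM(input_size=1, hidden_size=16, bidirectional=True, batch_first=True)
out, (h_n, c_n) = lstm(x)

print(out.shape)  # last dimension is 2 * hidden_size (forward and backward concatenated)
print(h_n.shape)  # first dimension is num_layers * 2 directions

# Final state of the forward pass (h_n[0]) and of the backward pass (h_n[1]).
final = torch.cat([h_n[0], h_n[1]], dim=-1)
head = nn.Linear(2 * 16, 1)  # hypothetical head; note the doubled input size
print(head(final).shape)
```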
1.8. proj_size
“If > 0, will use LSTM with projections of corresponding size. Default: 0”
proj_size lets you reduce the size of the LSTM’s output hidden state h_t while keeping a large internal memory size (the cell state keeps size hidden_size). When to use proj_size:
- We want a powerful LSTM but don’t need large output vectors. This avoids wasting parameters in later layers.
- We use a deep LSTM (e.g. num_layers=3 or more).
- We use a bidirectional LSTM, since bidirectional doubles the output size.
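The shapes below illustrate the split between a small output and a large internal memory. The sizes are arbitrary; proj_size must be smaller than hidden_size.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 30, 1)  # (batch, seq, feature), illustrative sizes
lstm = nn.LSTM(input_size=1, hidden_size=64, proj_size=16, batch_first=True)
out, (h_n, c_n) = lstm(x)

print(out.shape)  # output uses proj_size features
print(h_n.shape)  # hidden state is projected to proj_size
print(c_n.shape)  # cell state keeps the full hidden_size
```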
2. PyTorch
2.1. LSTM model
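import torch.nn as nn


class StockLSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        # Stacked LSTM; inputs are expected as (batch, seq, feature).
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        # Maps the last hidden state to one predicted value.
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)         # out: (batch, seq, hidden_size)
        out = self.fc(out[:, -1, :])  # keep only the last time step
        return out


model = StockLSTM()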
- nn: the neural network module in PyTorch.
- nn.LSTM: the main LSTM layer.
- nn.Linear: converts the last hidden state into a single output (the predicted value).
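A quick sanity check of the model is to pass a dummy batch through it. The sketch below repeats the StockLSTM class so that it runs on its own; the batch of 8 random windows is made up for the test. It also checks that, for this unidirectional LSTM, the last time step of the output is the final hidden state of the top layer.

```python
import torch
import torch.nn as nn


class StockLSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])


torch.manual_seed(0)
model = StockLSTM()
x = torch.randn(8, 50, 1)  # 8 windows of 50 days with 1 feature (dummy data)
pred = model(x)
print(pred.shape)          # one predicted value per window

# The last time step of the output matches the top layer's final hidden state.
out, (h_n, _) = model.lstm(x)
print(torch.allclose(out[:, -1, :], h_n[-1]))
```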
2.2. Prepare the data
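import yfinance as yf

def load_stock_data(ticker, start, end):
    # Download daily prices and keep only the closing price.
    df = yf.download(ticker, start=start, end=end)[["Close"]]
    df = df.dropna()
    # Column names may be tuples (MultiIndex); keep the first level.
    df.columns = [col[0] if isinstance(col, tuple) else col for col in df.columns]
    return df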
- yfinance.download() gets stock data between start and end dates.
- [["Close"]] keeps only the closing price, which is the series used for prediction here.
- df.dropna removes any rows where data is missing, ensuring clean input to the model.
- df.columns flattens column names in case they are tuples.
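The cleaning steps can be tried without a network call. The sketch below builds a small hypothetical frame whose column name is a tuple, similar to what a download may return, and applies the same selection, dropna, and flattening. The prices are made up.

```python
import pandas as pd

# Hypothetical frame with a tuple column name, e.g. ("Close", "MANU").
df = pd.DataFrame(
    [[17.0], [float("nan")], [17.4]],
    columns=pd.MultiIndex.from_tuples([("Close", "MANU")]),
)

df = df[["Close"]].dropna()  # keep the close price, drop the missing row
df.columns = [col[0] if isinstance(col, tuple) else col for col in df.columns]

print(list(df.columns), len(df))
```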
2.3. Normalize the input data
from sklearn.preprocessing import MinMaxScaler

def scale_data(df):
    # Fit the scaler and map every value into [0, 1].
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(df.values)
    return scaled, scaler
This function normalizes the input data before training, using MinMaxScaler from sklearn.preprocessing. It scales all values to the range [0, 1] with the formula:
\[ x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]
LSTMs (and most neural networks) train much more efficiently on scaled data.
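A worked example with three made-up prices shows the formula in action, along with inverse_transform, which is what later maps a scaled prediction back to a price.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

prices = np.array([[10.0], [15.0], [20.0]])  # made-up values, shape (n_samples, 1)

scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices)
print(scaled.ravel())    # 10 maps to 0.0, 15 to 0.5, 20 to 1.0

restored = scaler.inverse_transform(scaled)
print(restored.ravel())  # back to the original prices
```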
2.4. Time-series data
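import numpy as np

def create_sequences(data, seq_length=50):
    x, y = [], []
    # Slide a window of seq_length values over the series.
    for i in range(len(data) - seq_length):
        x.append(data[i : i + seq_length])  # input window
        y.append(data[i + seq_length])      # value right after the window
    return np.array(x), np.array(y)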
This function converts raw time-series data into a supervised learning format, where each input is a sequence of values and the target is the next value. The input parameters:
- data: the time-series data you want to model (NumPy array or list of numbers).
- seq_length: defines the length of each input sequence that the LSTM will use to predict the next value.
E.g. suppose your input params are:
data = [10, 11, 12, 13, 14, 15, 16]
seq_length = 3
So, your output data should be:
X (sequence of 3 values) | y (next value) |
---|---|
[10, 11, 12] | 13 |
[11, 12, 13] | 14 |
[12, 13, 14] | 15 |
[13, 14, 15] | 16 |
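The table can be reproduced by running the function on the same data. The sketch repeats create_sequences so that it runs on its own, and also shows the shapes obtained when the data is a single column, as it is after scaling.

```python
import numpy as np


def create_sequences(data, seq_length=50):
    x, y = [], []
    for i in range(len(data) - seq_length):
        x.append(data[i : i + seq_length])
        y.append(data[i + seq_length])
    return np.array(x), np.array(y)


data = np.array([10, 11, 12, 13, 14, 15, 16])
X, y = create_sequences(data, seq_length=3)
print(X.tolist())  # [[10, 11, 12], [11, 12, 13], [12, 13, 14], [13, 14, 15]]
print(y.tolist())  # [13, 14, 15, 16]

# With a column of values, shape (n, 1), the windows gain a feature axis.
X2, y2 = create_sequences(data.reshape(-1, 1), seq_length=3)
print(X2.shape, y2.shape)  # (4, 3, 1) (4, 1)
```

The (samples, seq_length, 1) shape is what an LSTM with batch_first=True and input_size=1 expects.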
2.5. Training
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def load_data(start: str = "2020-01-01", end: str = "2025-01-01"):
    # 1. Load stock data
    df = load_stock_data("MANU", start=start, end=end)
    df.reset_index(inplace=True)

    # 2. Scale the "Close" data
    scaled_data, scaler = scale_data(df[["Close"]])
    df["ScaledClose"] = scaled_data

    # 3. Create input sequences for the model
    X_np, y_np = create_sequences(scaled_data, seq_length=50)

    # 4. Convert to PyTorch tensors
    X_tensor = torch.tensor(X_np, dtype=torch.float32)
    y_tensor = torch.tensor(y_np, dtype=torch.float32)

    # 5. Setup DataLoader
    dataset = TensorDataset(X_tensor, y_tensor)
    loader = DataLoader(dataset, batch_size=64, shuffle=True)

    # 6. Define model, loss, optimizer
    model = StockLSTM(input_size=1, hidden_size=64, num_layers=2)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    # 7. Train the model
    loss_history = []
    for epoch in range(10):
        epoch_loss = 0.0
        for X_batch, y_batch in loader:
            output = model(X_batch)
            loss = loss_fn(output, y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        avg_loss = epoch_loss / len(loader)
        loss_history.append(avg_loss)

    # 8. Predict the next price
    model.eval()
    with torch.no_grad():
        last_seq = torch.tensor(scaled_data[-50:], dtype=torch.float32).unsqueeze(0)
        predicted = model(last_seq).numpy()
        predicted_price = scaler.inverse_transform(predicted)[0][0]

    return {
        "loss_history": [float(l) for l in loss_history],
        "predicted_next_price": float(predicted_price),
    }
- Loads the Manchester United (MANU) stock price.
- LSTM: 2 layers deep with a hidden size of 64.
- Loss function: Mean Squared Error (MSE).
- Optimizer: Adam, which adapts the learning rate per parameter and usually works well without much tuning.
- Runs for 10 epochs of mini-batch gradient descent, for demonstration. Increase the number of epochs after confirming everything works; around 100 is a reasonable starting point. Always monitor training and validation loss to avoid overfitting.
- Takes the most recent 50 scaled values as input, matching the seq_length used for training, so the model sees the most recent trend. For a real application, the window length depends on the goal, data behavior, and model capacity.
- loss_history: a list of average losses from each training epoch.
- predicted_next_price: the model’s predicted next stock price based on the most recent 50 time steps.
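The training function above tracks only the training loss. One possible way to monitor validation loss as well is sketched below. It is not the author's code: it uses a synthetic sine series instead of stock data, a hypothetical TinyLSTM model, a chronological 80/20 split, and full-batch updates to stay short.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic series standing in for scaled prices (an assumption of this sketch).
series = (np.sin(np.linspace(0, 20, 400)) + 1) / 2
seq_length = 20
X = np.stack([series[i : i + seq_length] for i in range(len(series) - seq_length)])
y = series[seq_length:]
X = torch.tensor(X, dtype=torch.float32).unsqueeze(-1)  # (n, seq_length, 1)
y = torch.tensor(y, dtype=torch.float32).unsqueeze(-1)  # (n, 1)

# Chronological split: the last 20% of windows are held out for validation.
split = int(0.8 * len(X))
X_train, y_train, X_val, y_val = X[:split], y[:split], X[split:], y[split:]


class TinyLSTM(nn.Module):  # hypothetical small model for the demo
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(1, 16, batch_first=True)
        self.fc = nn.Linear(16, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])


model = TinyLSTM()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

history = []
for epoch in range(5):
    model.train()
    optimizer.zero_grad()
    train_loss = loss_fn(model(X_train), y_train)  # full-batch step, for brevity
    train_loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val)
    history.append((train_loss.item(), val_loss.item()))
    print(f"epoch {epoch}: train={train_loss.item():.4f} val={val_loss.item():.4f}")
```

If the validation loss starts rising while the training loss keeps falling, the model is likely overfitting and training can be stopped.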
Here is the sample response:
{
"loss_history": [
0.06302432411987531,
0.026943786952056382,
0.01185139650969129,
0.006286132811127524,
0.004926267729483937,
0.004644053485734682,
0.003801512115291859,
0.0035777755669857327,
0.003405814849477457,
0.0030937544805438896
],
"predicted_next_price": 17.11322021484375
}
The full source code: https://huggingface.co/spaces/insightaiglobal/stock-prediction/tree/main