1. Introduction

In the landscape of deep learning, models are typically designed to predict a target variable $y$ given an input $\mathbf{x}$. Autoencoders subvert this paradigm. At its core, an autoencoder is a neural network trained to reproduce its own input, effectively learning to approximate the identity function $f(\mathbf{x}) \approx \mathbf{x}$.

While training a network to act as a simple “copy machine” might sound mathematically trivial, the true power of an autoencoder lies in its architectural constraints. By forcing the input data through a low-dimensional bottleneck before reconstructing it, the network is restricted from simply memorizing the input space. Instead, it is compelled to learn a compact, informative representation of the data’s underlying continuous manifold. This compressed latent representation serves as a powerful foundation for a multitude of advanced downstream tasks.

1.1. The Autoencoder Concept

Autoencoders belong to the broader family of self-supervised learning architectures. In these systems, the supervision signal is derived inherently from the data itself rather than from expensive, human-annotated labels. The network learns to predict missing or transformed parts of the data from the remaining parts.

In the specific case of an autoencoder, the “missing part” is the entirety of the original input. By minimizing the reconstruction error between the input data and the output prediction, the network naturally learns to encode the most salient features necessary for faithful reconstruction, discarding noise and redundant information in the process.

Figure 1 – High-level conceptual diagram of an autoencoder, illustrating the input mapping to a compressed latent code and expanding back to the reconstructed output [1].

1.2. Primary Objectives

The motivation behind deploying autoencoders generally falls into three primary objectives:

1.3. Article Structure

Beyond basic reconstruction, the autoencoder framework serves as a conceptual bridge between traditional representation learning and modern generative modeling. This article will deconstruct the architecture, implementation, and evolution of autoencoders. The progression is organized as follows:

2. Core Architecture and Mathematical Foundations

At the heart of every autoencoder lies a simple but powerful premise: a neural network can learn to compress information and subsequently reconstruct it. This is achieved through a tripartite architecture composed of an encoder, a latent space (or bottleneck), and a decoder. Each component plays a distinct functional role in transforming high-dimensional input data into a lower-dimensional manifold and then projecting it back to the original input space.

Figure 2 – Detailed schematic of encoder, bottleneck, decoder, and loss feedback [2].

2.1. The Encoder Mapping

The encoder is a deterministic mapping function, typically parameterized by a neural network, that compresses the input vector into a latent representation.

Let the input data be denoted as $\mathbf{x} \in \mathbb{R}^D$. The encoder function $f_{\theta}(\cdot)$, parameterized by weights and biases $\theta$, maps the input to a latent vector $\mathbf{z} \in \mathbb{R}^d$:

$$\mathbf{z} = f_{\theta}(\mathbf{x})$$

For a standard feedforward layer, this operation can be expanded as $\mathbf{z} = \sigma(\mathbf{W} \mathbf{x} + \mathbf{b})$, where $\mathbf{W}$ is the weight matrix, $\mathbf{b}$ is the bias vector, and $\sigma(\cdot)$ is a non-linear activation function (such as ReLU or GELU). The encoder’s objective is to capture the most salient, invariant features of the data distribution while systematically discarding noise and redundancy.

2.2. The Latent Space (Bottleneck)

The latent space, often referred to as the bottleneck, is the compressed internal representation carrying the structural essence of the input. Its dimensionality, $d$, dictates the information capacity of the network.

Geometrically, the latent space can be interpreted as a low-dimensional manifold embedded within the high-dimensional input space $\mathbb{R}^D$. By forcing the network to route all information through this restrictive bottleneck ($d < D$), we prevent it from trivially copying the input. Instead, the network must learn the intrinsic coordinates of this manifold, so that the code $\mathbf{z}$ of each input resembling the training data decodes to a structurally valid reconstruction.

2.3. The Decoder Mapping

The decoder performs the inverse geometric transformation, projecting the low-dimensional latent codes back into the original, high-dimensional input space.

Denoted as $g_{\phi}(\cdot)$ and parameterized by $\phi$, the decoder maps the latent vector $\mathbf{z}$ to a reconstructed output $\hat{\mathbf{x}} \in \mathbb{R}^D$:

$$\hat{\mathbf{x}} = g_{\phi}(\mathbf{z})$$

Together, the encoder and decoder form a composite function. The network’s success is determined by how closely the reconstruction $\hat{\mathbf{x}} = g_{\phi}(f_{\theta}(\mathbf{x}))$ mirrors the original input $\mathbf{x}$.

2.4. Reconstruction Loss Functions: A Probabilistic Perspective

Training an autoencoder involves minimizing a reconstruction loss $\mathcal{L}(\mathbf{x}, \hat{\mathbf{x}})$. While standard implementations treat this as a simple error metric, it is fundamentally grounded in Maximum Likelihood Estimation (MLE). The choice of loss function implies a specific probabilistic assumption about the underlying data distribution.

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|_2^2$$

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{D} \left[ x_{i,j} \log \hat{x}_{i,j} + (1 - x_{i,j}) \log (1 - \hat{x}_{i,j}) \right]$$
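The link between these two losses and maximum likelihood can be written out per sample. As a sketch, assume the decoder output $\hat{\mathbf{x}}$ is the mean of an isotropic Gaussian with a fixed variance $\sigma^2$ (the fixed variance is an assumption made here for illustration). The negative log-likelihood then reduces to a scaled squared error plus a constant:

```latex
-\log p(\mathbf{x} \mid \hat{\mathbf{x}})
  = -\log \mathcal{N}\!\left(\mathbf{x};\, \hat{\mathbf{x}},\, \sigma^2 \mathbf{I}\right)
  = \frac{1}{2\sigma^2}\,\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2
    + \frac{D}{2}\log\!\left(2\pi\sigma^2\right)
```

If instead each component $x_j \in [0, 1]$ is modeled as a Bernoulli variable with parameter $\hat{x}_j$, the negative log-likelihood is the binary cross-entropy:

```latex
-\log p(\mathbf{x} \mid \hat{\mathbf{x}})
  = -\sum_{j=1}^{D} \left[ x_j \log \hat{x}_j + (1 - x_j)\log(1 - \hat{x}_j) \right]
```

Averaging either expression over the $N$ training samples gives the corresponding loss up to constants that do not affect the minimizer.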

2.5. Undercomplete vs. Overcomplete Networks

Autoencoders are categorized by the relationship between the input dimension $D$ and the latent dimension $d$: a network is undercomplete when $d < D$ and overcomplete when $d \geq D$.

2.6. Training Procedure

Despite being an unsupervised algorithm, an autoencoder is trained using the standard supervised learning machinery; the only difference is that the input $\mathbf{x}$ acts as its own target label.

The optimization loop proceeds as follows:

  1. Forward Pass: Compute the latent representation $\mathbf{z} = f_{\theta}(\mathbf{x})$ and the reconstruction $\hat{\mathbf{x}} = g_{\phi}(\mathbf{z})$.
  2. Loss Calculation: Evaluate the objective function $\mathcal{L}(\mathbf{x}, \hat{\mathbf{x}})$, augmented with any regularization terms (e.g., sparsity or contractive penalties).
  3. Backward Pass: Backpropagate the error gradients $\nabla_{\theta, \phi} \mathcal{L}$ through the decoder and then through the encoder using the chain rule.
  4. Parameter Update: Adjust the parameters $\theta$ and $\phi$ using gradient-based optimizers such as Adam or Stochastic Gradient Descent (SGD).
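The four steps can be sketched in a few lines of PyTorch. This is a toy illustration rather than a tuned recipe: the layer sizes, learning rate, step count, and the random stand-in batch are all assumptions chosen to keep the example small.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy encoder/decoder; the 20 -> 4 -> 20 sizes are illustrative.
encoder = nn.Sequential(nn.Linear(20, 4), nn.ReLU())
decoder = nn.Sequential(nn.Linear(4, 20), nn.Sigmoid())

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-2)
loss_fn = nn.MSELoss()

x = torch.rand(64, 20)  # stand-in batch with values in [0, 1]

losses = []
for step in range(200):
    z = encoder(x)            # 1. forward pass: latent code ...
    x_hat = decoder(z)        #    ... and reconstruction
    loss = loss_fn(x_hat, x)  # 2. loss: the input is its own target
    optimizer.zero_grad()
    loss.backward()           # 3. backward pass through decoder, then encoder
    optimizer.step()          # 4. parameter update
    losses.append(loss.item())
```

In a real pipeline the single fixed batch would be replaced by a data loader, and any regularization term would be added to `loss` before calling `backward()`.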

Each layer is followed by a non-linear activation function, such as ReLU, to introduce expressive capacity. The final decoder layer typically uses a Sigmoid activation when input data are normalized to the [0, 1] range.

Finding the right latent size is therefore a balance between compression efficiency and reconstruction quality. This trade-off is typically explored empirically through experiments on validation data.


3. The Multilayer Perceptron (MLP) Autoencoder

The Multilayer Perceptron (MLP) autoencoder is the most fundamental instantiation of the autoencoder family. Built exclusively from fully connected (dense) layers, this architecture treats input data as flat, one-dimensional vectors. While it is the standard choice for tabular datasets, sensor arrays, or low-dimensional continuous signals, examining its application to image data reveals both the core principles of representation learning and the inherent limitations of feedforward networks.

3.1. Architectural Overview

An MLP autoencoder dictates that every neuron in a given layer is connected to every neuron in the subsequent layer. The architecture symmetrically scales down the input into the latent space and then scales it back up.

Consider a standard baseline task: reconstructing grayscale images from the MNIST dataset. An original $28 \times 28$ pixel image is first flattened into a vector $\mathbf{x} \in \mathbb{R}^{784}$. The network architecture sequentially compresses this vector:

Figure 3 – Example structure of an MLP autoencoder and sample reconstructions for the MNIST dataset. The left column shows input digits; the right column shows reconstructed outputs [3].

Each intermediate layer applies a non-linear activation function, such as ReLU ($f(x) = \max(0, x)$), to learn complex, non-linear mappings. The final layer of the decoder typically applies a Sigmoid activation ($f(x) = \frac{1}{1 + e^{-x}}$) to constrain the reconstructed pixel values strictly within the $[0, 1]$ range, matching the normalized input distribution.

3.2. Bottleneck Sizing: The Compression Trade-off

The dimensionality of the latent space is the most critical hyperparameter in an MLP autoencoder. It governs a strict trade-off between compression efficiency and reconstruction fidelity:

Figure 4 – Latent space visualization. By projecting the 32-dimensional latent vectors down to 2D using t-SNE or UMAP, we can observe how the autoencoder naturally clusters similar structural inputs, such as digits of the same class, without any label supervision.

3.3. PyTorch Implementation

The following code provides a robust, minimal implementation of an MLP autoencoder in PyTorch using nn.Sequential blocks. This structure emphasizes the symmetry between the encoder and decoder.
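A minimal sketch of such a model is shown here. The latent size of 32 matches the value mentioned for Figure 4; the hidden widths (128 and 64) and the class name `MLPAutoencoder` are illustrative assumptions, not values taken from the original implementation.

```python
import torch
from torch import nn


class MLPAutoencoder(nn.Module):
    """Symmetric fully connected autoencoder for flattened 28x28 images."""

    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        # Encoder: 784 -> 128 -> 64 -> latent (hidden widths are assumptions).
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        # Decoder mirrors the encoder; Sigmoid keeps outputs in [0, 1].
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Flatten (N, 1, 28, 28) images to (N, 784) before encoding.
        z = self.encoder(x.flatten(start_dim=1))
        return self.decoder(z)


model = MLPAutoencoder()
batch = torch.rand(16, 1, 28, 28)   # stand-in for a batch of MNIST images
recon = model(batch)                # shape (16, 784), values in [0, 1]
latent = model.encoder(batch.flatten(start_dim=1))
```

The reconstruction comes back flattened, so it has to be compared against `batch.flatten(start_dim=1)` in the loss, or reshaped to `(N, 1, 28, 28)` for plotting.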

3.4. Reconstruction Quality and Spatial Limitations

When trained on visually simple datasets like MNIST, the MLP autoencoder successfully learns to reconstruct the inputs. However, a critical analysis of the outputs reveals a distinct blurriness and occasional structural tearing.

Figure 5 – MNIST reconstruction samples from latent space

This degradation is not merely a failure of capacity, but a fundamental architectural flaw when dealing with high-dimensional spatial data:

  1. Loss of Spatial Locality: By flattening a 2D image into a 1D vector, the MLP destroys the local grid structure. It has no built-in notion that two adjacent pixels are more closely related than two pixels in opposite corners of the image.
  2. Lack of Translation Invariance: If a feature (like a curved edge) appears in the top-left of an image during training, an MLP must learn a completely separate set of weights to recognize that exact same edge if it appears in the bottom-right.
  3. Parameter Inefficiency: Fully connected layers require a massive number of parameters. Scaling an MLP to high-resolution images (e.g., $1024 \times 1024 \times 3$, which flattens to more than 3 million inputs) quickly becomes computationally impractical and highly prone to overfitting.
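To put a number on the parameter cost, the sketch below counts the weights and biases of a single dense layer on a flattened $1024 \times 1024 \times 3$ image and compares it with a single $3 \times 3$ convolution producing the same number of output features. The choice of 128 output units and channels is an assumption for illustration.

```python
def dense_params(n_in: int, n_out: int) -> int:
    # One weight per input-output pair, plus one bias per output unit.
    return n_in * n_out + n_out


def conv2d_params(c_in: int, c_out: int, kernel: int) -> int:
    # One kernel x kernel filter per (input channel, output channel) pair,
    # plus one bias per output channel. Independent of image resolution.
    return c_in * c_out * kernel * kernel + c_out


flat_inputs = 1024 * 1024 * 3            # 3,145,728 values after flattening
dense = dense_params(flat_inputs, 128)   # 402,653,312 parameters
conv = conv2d_params(3, 128, 3)          # 3,584 parameters

print(f"dense layer: {dense:,} parameters")
print(f"conv layer:  {conv:,} parameters")
```

The dense count grows with the number of pixels, while the convolutional count depends only on kernel size and channel depth.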

To effectively reconstruct and generate high-fidelity spatial data, the architecture must natively respect local correlations—a requirement that leads us directly to the Convolutional Autoencoder.


4. The Convolutional Neural Network (CNN) Autoencoder

As established in the previous section, treating high-dimensional spatial data (like images) as flattened one-dimensional vectors strips away critical structural context. The Multilayer Perceptron is fundamentally agnostic to spatial locality. To build an autoencoder capable of capturing the rich, hierarchical structure of visual data, we must embed a strong inductive bias into the architecture. The Convolutional Neural Network (CNN) Autoencoder achieves this by leveraging local receptive fields, shared weights, and spatial pooling.

4.1. Spatial Coherence and the Convolutional Inductive Bias

Natural images contain profound local correlations; adjacent pixels are highly likely to share similar properties and belong to the same structural features (e.g., edges, textures). A CNN autoencoder respects this spatial coherence through the convolution operation.

Instead of learning a unique, dense weight for every pixel combination, a convolutional encoder sweeps learned filters (kernels) across the input. This provides two critical theoretical advantages:

  1. Parameter Efficiency: The weights of a kernel are shared across the entire spatial domain, drastically reducing the parameter count and mitigating the risk of overfitting on high-resolution data.
  2. Translation Equivariance: A feature learned in one region of the image can be recognized in another, because shifting the input simply shifts the resulting feature map (pooling then adds a degree of invariance). If the encoder learns a Gabor-like filter to detect a vertical edge in the top-left corner, that exact same filter will encode a vertical edge in the bottom-right.

Figure 6 – CNN autoencoder architecture showing spatial down-sampling through convolutions and up-sampling through transposed convolutions, converging at a dense bottleneck [4].

4.2. Downsampling and Upsampling Mechanics

A CNN autoencoder replaces dense compression with a geometric reduction of spatial dimensions accompanied by an expansion in channel depth.
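The spatial arithmetic can be checked by hand. The sketch below uses the standard output-size formulas for a strided convolution and a transposed convolution (dilation 1). The specific settings (kernel 3, stride 2, padding 1, output padding 1) are an assumed, commonly used configuration for $28 \times 28$ inputs, not necessarily the one used in this article.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    # Output size of a convolution along one spatial axis (dilation = 1).
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size: int, kernel: int, stride: int = 1,
                       padding: int = 0, output_padding: int = 0) -> int:
    # Output size of a transposed convolution along one axis (dilation = 1).
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Encoder: two stride-2 convolutions halve the resolution twice.
h1 = conv_out(28, kernel=3, stride=2, padding=1)   # 14
h2 = conv_out(h1, kernel=3, stride=2, padding=1)   # 7

# Decoder: two stride-2 transposed convolutions undo the reduction.
u1 = conv_transpose_out(h2, kernel=3, stride=2, padding=1, output_padding=1)  # 14
u2 = conv_transpose_out(u1, kernel=3, stride=2, padding=1, output_padding=1)  # 28
```

The `output_padding=1` term is what resolves the ambiguity between 13 and 14 (or 27 and 28) when inverting a stride-2 convolution.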

4.3. PyTorch Implementation

The following implementation demonstrates a robust CNN autoencoder designed for single-channel images like MNIST. Notice how the encoder transforms the spatial tensor into a true vector bottleneck to force holistic representation learning, before reshaping it for the spatial decoder.
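A minimal sketch of a model with this shape is shown here: strided convolutions, a flattening step into a vector bottleneck, and transposed convolutions on the way back. The channel counts (16 and 32), the latent size of 32, and the class name `ConvAutoencoder` are illustrative assumptions.

```python
import torch
from torch import nn


class ConvAutoencoder(nn.Module):
    """Convolutional autoencoder with a dense vector bottleneck."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28 -> 14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14 -> 7
            nn.ReLU(),
            nn.Flatten(),                                           # (N, 32*7*7)
            nn.Linear(32 * 7 * 7, latent_dim),                      # vector bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 7 * 7),
            nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),                            # back to a spatial tensor
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),        # 7 -> 14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),        # 14 -> 28
            nn.Sigmoid(),                                           # pixels in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


model = ConvAutoencoder()
images = torch.rand(8, 1, 28, 28)   # stand-in for a batch of MNIST images
codes = model.encoder(images)       # shape (8, 32)
recon = model(images)               # shape (8, 1, 28, 28)
```

Unlike the MLP variant, the output keeps the image layout, so the loss can be computed directly against `images`.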

4.4. Comparative Analysis: MLP vs. CNN Autoencoders

When comparing the empirical results of CNN autoencoders against their MLP counterparts on visual data, the theoretical advantages of convolutional architectures translate into stark performance differences:

| Aspect | MLP Autoencoder | CNN Autoencoder |
| --- | --- | --- |
| Input Format | Flattened vectors (1D) | Spatial tensors (2D/3D) |
| Spatial Awareness | None (destroys grid context) | Preserved (exploits local correlation) |
| Parameter Efficiency | Extremely low; scales quadratically | High; scales by kernel size and depth |
| Feature Hierarchy | Global, unstructured | Local → Global (edges to objects) |
| Reconstruction Quality | Blurry, lacking fine structural details | Sharp, preserving edges and textures |

The CNN autoencoder thus forms the essential bridge between simple reconstructive networks and modern, high-fidelity computer vision systems. By successfully mapping high-dimensional spatial data into reliable, dense latent codes, these networks unlock powerful downstream applications. In the next section, we will explore how this reconstructive capacity is practically deployed to solve unsupervised problems, most notably in the domain of anomaly detection.

5. Practical Applications: Anomaly Detection and Beyond

While autoencoders are fundamentally designed for representation learning, their unique ability to learn the intrinsic structure of a dataset without labels makes them incredibly powerful for applied machine learning. One of the most ubiquitous and commercially valuable applications of this architecture is unsupervised anomaly detection. By leveraging the reconstruction error as a measurable proxy for “normality,” we can identify out-of-distribution events in complex, high-dimensional spaces where traditional rule-based logic fails.

5.1. The Principle of Anomaly Detection

The mathematical intuition behind autoencoder-based anomaly detection relies on the geometry of the latent space.

During the training phase, the autoencoder is exposed exclusively to “normal” or baseline data. Consequently, the encoder learns to map only the manifold of this normal distribution, and the decoder learns to project from this specific subspace back to the original dimensions. The network allocates its limited parameter capacity entirely to minimizing the reconstruction loss of typical structural patterns.

During inference, when the network encounters an anomalous input $\mathbf{x}_{\text{anom}}$ (e.g., a defective part, a fraudulent transaction, or a network intrusion), the encoder is forced to project this unseen pattern onto the nearest point of the “normal” latent manifold. Because the anomaly lacks the structural correlations the network was optimized for, the decoder will reconstruct it poorly, yielding a high reconstruction error.

The anomaly detection pipeline is formalized as follows:

  1. Compute Reconstruction Error: For a new sample $\mathbf{x}$, compute $E(\mathbf{x}) = \|\mathbf{x} - f_{\text{dec}}(f_{\text{enc}}(\mathbf{x}))\|_2^2$.
  2. Establish a Threshold: Define a scalar threshold $\tau$. This is typically derived statistically from the validation set of normal data (e.g., the 95th or 99th percentile of baseline reconstruction errors).
  3. Classification: If $E(\mathbf{x}) > \tau$, flag the sample as anomalous.
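The three steps can be sketched without a trained network. In the example below, `reconstruct` is a hypothetical stand-in for `decoder(encoder(x))`: it keeps only the mean of the vector, which behaves like a one-number bottleneck. The synthetic "normal" data, the 99th-percentile choice, and all function names are assumptions for illustration.

```python
import math
import random


def reconstruct(x):
    # Hypothetical stand-in for decoder(encoder(x)): a one-number bottleneck
    # that keeps only the mean of the vector and broadcasts it back.
    mean = sum(x) / len(x)
    return [mean] * len(x)


def reconstruction_error(x):
    # Step 1: squared L2 distance between the input and its reconstruction.
    return sum((a - b) ** 2 for a, b in zip(x, reconstruct(x)))


def nearest_rank_percentile(values, q):
    # Smallest value with at least q percent of the data at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


random.seed(0)


def normal_sample(dim=8):
    # Synthetic "normal" data: a nearly flat vector with small Gaussian jitter.
    level = random.random()
    return [level + random.gauss(0, 0.05) for _ in range(dim)]


validation = [normal_sample() for _ in range(500)]

# Step 2: threshold from the 99th percentile of normal validation errors.
tau = nearest_rank_percentile([reconstruction_error(x) for x in validation], 99)


def is_anomaly(x):
    # Step 3: flag samples whose error exceeds the threshold.
    return reconstruction_error(x) > tau


spiky = [0.0, 1.0] * 4   # alternating pattern unlike anything in the normal set
print(is_anomaly(spiky), is_anomaly([0.3] * 8))
```

By construction, a percentile threshold flags a small fraction of normal validation samples as well, so the percentile directly sets the expected false-positive rate on normal data.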

Figure 8 – Anomaly detection thresholding concept. A histogram showing the distribution of reconstruction errors for normal data versus anomalous data, illustrating the separability provided by the threshold [5].

5.2. Cross-Domain Applications

This reconstructive approach to outlier detection is highly adaptable and has been deployed across numerous technical domains:

5.3. Expanding Utility: Denoising and Pretraining

Beyond anomaly detection, the autoencoder framework serves as a versatile utility belt for data scientists:

Figure 9 – Examples of the Denoising Autoencoder process. The network successfully removes injected Gaussian noise, recovering the underlying signal structure [6].


6. The Generative Extension: Variational Autoencoders (VAE)

The standard autoencoder is a powerful tool for compression and feature extraction, but it harbors a critical limitation: it is not a generative model. Because it is trained only to minimize reconstruction error, nothing constrains how codes are arranged in the latent space, which tends to be irregular and full of gaps. If you sample an arbitrary point $\mathbf{z}$ from the latent space of an MLP or CNN autoencoder, the decoder will likely output meaningless noise.

To transition from a reconstructive model to a true generative model, we must enforce a continuous, densely packed probabilistic structure on the latent space. This is the foundational breakthrough of the Variational Autoencoder (VAE).

6.1. From Deterministic Encoding to Probabilistic Inference

In a classical autoencoder, the encoder maps an input $\mathbf{x}$ to a single, deterministic latent vector $\mathbf{z}$. In a VAE, the encoder maps the input to a probability distribution over the latent space.

Specifically, the encoder $q_\phi(\mathbf{z}|\mathbf{x})$ outputs the parameters of a multivariate Gaussian distribution: a mean vector $\boldsymbol{\mu}$ and a variance vector $\boldsymbol{\sigma}^2$.

$$f_{\text{enc}}(\mathbf{x}) = (\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$$

During the forward pass, the latent representation $\mathbf{z}$ is stochastically sampled from this distribution:

$$\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$$

This probabilistic mapping ensures that similar inputs are encoded into overlapping topological regions, creating a smooth, continuous manifold where every local point corresponds to a valid, sensible reconstruction.

Figure 10 – Comparison of latent spaces. The standard autoencoder creates isolated points, while the VAE creates overlapping Gaussian spheres, ensuring a continuous generative manifold [7].

6.2. The Reparameterization Trick

A fundamental engineering challenge arises when introducing stochastic sampling into a neural network: you cannot backpropagate gradients through a random sampling operation. The sampling step blocks the deterministic chain rule required to update the encoder’s weights $\phi$.

The reparameterization trick resolves this by decoupling the randomness from the learned parameters. Instead of sampling $\mathbf{z}$ directly from the parameterized distribution, we sample a noise variable $\boldsymbol{\epsilon}$ from a standard Gaussian, $\mathcal{N}(0, \mathbf{I})$, and deterministically scale and shift it:

$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \quad \text{where} \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$$

By routing the stochasticity into the independent auxiliary variable $\boldsymbol{\epsilon}$, the operations acting on $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ become standard differentiable nodes in the computational graph.
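A small PyTorch check makes the gradient flow concrete. Here `mu` and `sigma` are standalone tensors standing in for encoder outputs (an assumption for illustration; in a real VAE they would be produced by the encoder network).

```python
import torch

torch.manual_seed(0)

# Stand-ins for the encoder outputs; requires_grad lets us inspect gradients.
mu = torch.zeros(3, requires_grad=True)
sigma = torch.ones(3, requires_grad=True)

# The randomness lives entirely in eps, which needs no gradient.
eps = torch.randn(3)

# Reparameterized sample: a deterministic function of mu and sigma given eps.
z = mu + sigma * eps

# Backpropagate a simple scalar function of z.
z.sum().backward()

# dz/dmu = 1 and dz/dsigma = eps, so both parameters receive gradients.
print(mu.grad)
print(sigma.grad)
```

Because `z` is an ordinary arithmetic expression of `mu` and `sigma`, autograd treats `eps` as a constant input and the chain rule applies as usual.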

Figure 11 – The Reparameterization Trick. A computational graph showing how the stochastic node is sidestepped to allow uninterrupted gradient flow back to the encoder [2].

6.3. The Evidence Lower Bound (ELBO)

Training a VAE requires optimizing a dual-objective loss function. We want to maximize the likelihood of the data while ensuring the learned latent distributions closely resemble a chosen prior (typically a standard normal distribution). We achieve this by maximizing the Evidence Lower Bound (ELBO), which translates to minimizing the following loss function:

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[-\log p_\theta(\mathbf{x}|\mathbf{z})] + D_{\text{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z})\big)$$

This equation consists of two opposing forces:

  1. The Reconstruction Loss (first term): Measures how well the decoder $p_\theta(\mathbf{x}|\mathbf{z})$ reconstructs the input from the sampled latent vector. This is typically implemented as Mean Squared Error (MSE) or Binary Cross-Entropy (BCE).
  2. The Kullback-Leibler (KL) Divergence (second term): Acts as a regularizer. It measures how far the encoder’s predicted distribution $q_\phi(\mathbf{z}|\mathbf{x})$ diverges from the standard normal prior $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$ (it is not a true distance, since it is asymmetric). It pushes the latent distributions to pack closely around the origin, preventing the network from “cheating” by placing distributions arbitrarily far apart to avoid overlap.

For a multivariate Gaussian with diagonal covariance, the KL divergence term has a closed-form analytical solution:

$$D_{\text{KL}} = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2\right)$$
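The closed form is easy to evaluate directly. The sketch below implements it in plain Python, taking the log-variance $\log \sigma_j^2$ as input; the function name `kl_to_standard_normal` is a hypothetical helper introduced here.

```python
import math


def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    # summed over the latent dimensions.
    return -0.5 * sum(
        1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


# The divergence vanishes when the posterior equals the prior.
kl_same = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting the mean by 1 in one dimension (unit variance) costs 0.5.
kl_shifted = kl_to_standard_normal([1.0], [0.0])

# Shrinking the variance also moves away from the prior.
kl_narrow = kl_to_standard_normal([0.0], [math.log(0.25)])
```

Each term is non-negative and is zero only when $\mu_j = 0$ and $\sigma_j^2 = 1$, which is why this penalty pulls every encoded distribution toward the prior.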

6.4. PyTorch Implementation

In practice, for numerical stability, the encoder is designed to output the log-variance ($\log \sigma^2$) rather than the variance directly.
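A minimal sketch of a VAE along these lines, using fully connected layers and a log-variance head. The layer sizes, the latent dimension of 16, the BCE reconstruction term, and the names `VAE` and `vae_loss` are illustrative assumptions rather than the article's own implementation.

```python
import torch
from torch import nn
import torch.nn.functional as F


class VAE(nn.Module):
    def __init__(self, input_dim: int = 784, hidden_dim: int = 256,
                 latent_dim: int = 16):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.mu_head = nn.Linear(hidden_dim, latent_dim)       # mean
        self.log_var_head = nn.Linear(hidden_dim, latent_dim)  # log(sigma^2)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor):
        h = F.relu(self.hidden(x))
        mu, log_var = self.mu_head(h), self.log_var_head(h)
        # Reparameterization: sigma = exp(0.5 * log_var) is always positive.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps
        return self.decoder(z), mu, log_var


def vae_loss(x_hat, x, mu, log_var):
    # Reconstruction term (Bernoulli likelihood), summed over pixels.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Closed-form KL divergence to the standard normal prior.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # Average over the batch.
    return (recon + kl) / x.size(0)


torch.manual_seed(0)
model = VAE()
x = torch.rand(4, 784)              # stand-in for flattened images in [0, 1]
x_hat, mu, log_var = model(x)
loss = vae_loss(x_hat, x, mu, log_var)
loss.backward()                     # gradients reach both encoder heads
```

Predicting the log-variance lets the network output any real number while the implied variance stays strictly positive, which avoids taking the logarithm of a value near zero.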

6.5. Generative Sampling and Latent Space Interpolation

Once the VAE is trained, the encoder is completely discarded for generation tasks. To synthesize entirely novel data, we simply sample a vector $\mathbf{z}$ from the prior $\mathcal{N}(0, \mathbf{I})$ and pass it through the decoder.

Furthermore, because the KL divergence forces the space to be dense and continuous, we can perform latent space interpolation. By taking the latent vectors of two distinct real images ($\mathbf{z}_A$ and $\mathbf{z}_B$) and calculating the linear sequence between them ($\mathbf{z}_t = (1-t)\mathbf{z}_A + t\mathbf{z}_B$), the decoder will generate a sequence of images that smoothly morphs from the first image to the second, representing structurally valid, intermediate semantic concepts.
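The interpolation itself is a one-line formula applied at evenly spaced values of $t$. The sketch below computes the path in plain Python; the helper name `interpolate` and the two-dimensional example vectors are assumptions for illustration. Each point on the path would then be passed through the trained decoder.

```python
def interpolate(z_a, z_b, steps):
    # Evenly spaced points z_t = (1 - t) * z_a + t * z_b for t in [0, 1].
    # Requires steps >= 2 so that both endpoints are included.
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path


z_a = [0.0, 2.0]   # hypothetical latent code of the first image
z_b = [1.0, 0.0]   # hypothetical latent code of the second image
path = interpolate(z_a, z_b, steps=5)
```

The first and last entries of `path` are the two original codes, and the midpoint is their average.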

Figure 12 – Latent interpolation between MNIST digits, demonstrating how the model learns smooth semantic transitions (e.g., a “3” morphing cleanly into an “8”) rather than abrupt, noisy pixel shifts.

6.6. Legacy in Modern Generative AI

The Variational Autoencoder was a paradigm shift. Its probabilistic formulation of latent spaces served as the conceptual and architectural foundation for the current era of generative AI:

7. Conclusion

Autoencoders represent a masterclass in the power of architectural constraints. By simply forcing a neural network to compress and reconstruct its own input, we transition from relying on expensive, human-annotated labels to unlocking the intrinsic, self-supervised structure of the data itself.

Throughout this post, we have traced the evolution of this architecture:

Whether you are building a predictive maintenance pipeline to detect manufacturing anomalies, or studying the latent diffusion models that power today’s state-of-the-art image synthesis, the autoencoder remains an indispensable architectural pillar in the deep learning repertoire.

Hands-On Exploration: Interactive Autoencoder Framework

If you want to move beyond the theory and experiment with these architectures firsthand, I have built a modular, Interactive Autoencoder Framework available on my GitHub:

https://github.com/turhancan97/simple-autoencoder-demo

This repository provides a complete, YAML-configurable PyTorch pipeline to train and visualize MLP, CNN, and VAE models on datasets like MNIST. Its standout feature is the real-time demo.py application, which generates a visual 2D mapping of the trained bottleneck. By simply clicking and dragging your mouse across the latent space clusters, you can watch the decoder continuously generate and morph reconstructions in real-time. It is an excellent practical tool for observing the structural differences between the deterministic latent space of a standard MLP and the smooth, continuous generative manifold of a VAE.

References

  1. How to create autoencoder using dropout in Dense layers using Keras. Stack Overflow. Published 2019. Accessed October 30, 2025. https://stackoverflow.com/questions/55233114/how-to-create-autoencoder-using-dropout-in-dense-layers-using-keras
  2. Bernstein M. Matthew N. Bernstein. Published March 14, 2023. Accessed October 30, 2025. https://mbernste.github.io/posts/vae/
  3. rvislaywade. Visualizing MNIST using a Variational Autoencoder. Kaggle. Published March 8, 2018. Accessed October 30, 2025. https://www.kaggle.com/code/rvislaywade/visualizing-mnist-using-a-variational-autoencoder
  4. Deep Discriminative Latent Space for Clustering. ResearchGate. Published online 2018. https://doi.org/10.48550/arXiv.1805.10795
  5. AutoEncoder Convolutional Neural Network for Pneumonia Detection. ResearchGate. Published online 2024. https://doi.org/10.48550/arXiv.2409.02142
  6. Rosebrock A. Denoising autoencoders with Keras, TensorFlow, and Deep Learning. PyImageSearch. Published February 24, 2020. https://pyimagesearch.com/2020/02/24/denoising-autoencoders-with-keras-tensorflow-and-deep-learning/
  7. Van de Kleut A. Alexander Van de Kleut. Published May 14, 2020. https://avandekleut.github.io/vae/