
Title: Autoencoding Galaxy Spectra I: Architecture
Authors: P. Melchior, Y. Liang, C. Hahn, et al.
First author’s institution: Department of Astrophysical Sciences, Princeton University, NJ, USA, Center for Statistics & Machine Learning, Princeton University, NJ, USA
Status: Submitted to AJ
Imagine for a moment that you have never seen a car before and you want to build a toy car as a gift for your car-loving friend. Understandably, when faced with this dilemma, you may feel that the outlook is bleak. However, now suppose another friend comes along and has the bright idea to send you to the nearest busy street corner, explaining that everything passing by can be considered a car. Then they bid you farewell. Armed with this new information, you are revitalized. After a few minutes of watching vehicles pass by, you quickly form a picture in your mind of what a car might look like. After about an hour, you head back inside ready to take on your challenge: you have to build your own (toy) car based only on what you’ve seen.
This example is obviously contrived, but it provides a useful touchpoint for understanding how an autoencoder, or, more generally, any unsupervised machine learning algorithm, might work (see these Astrobites for examples of machine learning used in astronomy). If you think about how you would approach the above challenge, the basic principle might be obvious: with only observations of passing cars, you would pick out the patterns you noticed and use them to “reconstruct” a definition of a car in your mind. You may find that most of them have four wheels, many have headlights, and they all generally have the same shape. From this, you can make a decent toy car that your friend will be proud of. The basic idea behind an unsupervised learning task is this: the algorithm is presented with data and tries to identify the relevant features of that data to serve some specified goal. In the particular case of an autoencoder, that goal is to learn how to reconstruct the original data from a compressed representation, just as you would try to do by building a toy car from memory. Specifically, you (or the computer) boil down observations of cars (data) into shared characteristics (called ‘latent features’) and reconstruct the car from those characteristics (reconstruct the data). This process is illustrated schematically in Figure 1.
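To make the compress-then-reconstruct idea concrete, here is a minimal sketch of the simplest possible autoencoder: a purely linear one, built with a truncated SVD. This is not the architecture from today's paper. The synthetic data, the variable names, and the choice of three latent features are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "observations": 200 samples of 50 values each that secretly
# depend on only 3 underlying parameters (the latent features), plus noise.
n_samples, n_features, n_latent = 200, 50, 3
true_latents = rng.normal(size=(n_samples, n_latent))
mixing = rng.normal(size=(n_latent, n_features))
data = true_latents @ mixing + 0.01 * rng.normal(size=(n_samples, n_features))

# "Training": find the n_latent directions that capture most of the variance.
mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)
components = vt[:n_latent]  # shape (n_latent, n_features)


def encode(x):
    """Compress each sample down to n_latent numbers."""
    return (x - mean) @ components.T


def decode(z):
    """Reconstruct the full sample from its latent representation."""
    return z @ components + mean


latents = encode(data)
reconstruction = decode(latents)
rel_error = np.linalg.norm(data - reconstruction) / np.linalg.norm(data)
print(latents.shape, f"relative reconstruction error: {rel_error:.4f}")
```

Each 50-value sample is squeezed into three numbers and then rebuilt from them. The reconstruction works here only because the toy data really are low-dimensional, which is exactly the bet an autoencoder makes about real data.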

Applying it to astronomy
To understand how this machine learning method is applied in today’s paper, let’s take the example of car reconstruction a step further. Instead of letting you observe a random sample of cars driving by, let’s say this time your friend was less helpful and simply showed you pictures of a few different cars on their phone. Based on this, you might be able to do a decent job of producing something that looks like a car, but your car might not run very well (e.g., the wheels might be fixed to the chassis, so the car won’t roll). Alternatively, your friend may want to challenge you further and only describe how a car works. If so, you might be able to make a decently functional toy car, but it won’t look very accurate. These situations are similar to some of the challenges present in current approaches to modeling galaxy spectra, the subject of today’s paper.
Approaches to modeling galaxy spectra can be divided into empirical, data-driven models and theoretical models. The first of these is the equivalent of the pictures of cars your friend showed you: astronomers use ‘template’ spectra and observations of local galaxies to construct model spectra that fit observations of systems at high redshift. Although useful, these templates are usually based on observations of local galaxies and may therefore cover only a limited wavelength range once the cosmological redshift correction is applied. Theoretical models, on the other hand, reflect the last suggestion from your friend; that is, they construct model spectra based on a physical understanding of the emission and absorption in the interstellar medium and in stars and nebulae. These models are interpretable and physically motivated, so they can be applied at high redshifts, for example, but they usually rely on approximations and therefore cannot accurately capture the complexity of real spectra.
Despite these challenges, the historical utility of applying template spectra to describe new observations of other spectra suggests that these data may not be as inherently complex as they appear. Perhaps the differences between spectra boil down to a few relevant parameters. This goes back to the discussion around autoencoders and motivates the approach of today’s paper: finding a low-dimensional embedding (read: simple representation) of spectra that makes reconstruction an easy matter.
How to make a galaxy spectrum
Most conventional galaxy spectrum analysis pipelines work by converting the observed (redshifted) spectrum to the emitted spectrum in the rest frame of the galaxy and fitting a model to it. This means that the usable wavelength range is limited to the range shared by all of the different spectra in a survey sample. In their architecture, today’s authors choose to keep the spectra as observed, i.e., they do not perform any kind of redshift processing prior to their analysis. This allows them to run the algorithm over the entire wavelength range of each observation, thereby retaining more data. Today’s paper presents this algorithm, called SPENDER, which is schematically represented in Figure 2.
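A quick back-of-the-envelope sketch shows why a common rest-frame grid throws data away. Observed and rest-frame wavelengths are related by λ_obs = (1 + z) λ_rest, so galaxies at different redshifts cover different rest-frame ranges even when observed with the same instrument. The observed range and the three redshifts below are illustrative assumptions, not values from the paper.

```python
import numpy as np

obs_min, obs_max = 3800.0, 9200.0        # illustrative observed range (Angstrom)
redshifts = np.array([0.05, 0.2, 0.4])   # hypothetical galaxies

# Rest-frame coverage of each galaxy: lambda_rest = lambda_obs / (1 + z)
rest_min = obs_min / (1 + redshifts)
rest_max = obs_max / (1 + redshifts)

# Range shared by every galaxy (all that a common rest-frame grid can use) ...
shared = (rest_min.max(), rest_max.min())
# ... versus the range covered by at least one galaxy in the sample.
combined = (rest_min.min(), rest_max.max())

print(f"shared rest-frame range:   {shared[0]:.0f} to {shared[1]:.0f} Angstrom")
print(f"combined rest-frame range: {combined[0]:.0f} to {combined[1]:.0f} Angstrom")
```

The shared range is much narrower than the combined one, and it shrinks further as the redshift spread of the sample grows. Keeping each spectrum in the observed frame avoids cropping to that intersection.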

The algorithm takes an input spectrum and first passes it through three convolutional layers to reduce its dimensionality. Then, since the spectra have not been shifted to the rest frame, the processed data is passed through an attention layer. This is very similar to what you do when you watch cars pass by on the street: even though there are many cars passing by, all moving at different speeds in different places, you focus your attention on specific cars and specific features of those cars to train your neural network (read: brain) to know what a car is. This layer does the same thing by identifying which parts of the spectrum to focus on; that is, where the relevant emission and absorption lines may be. Finally, to finish encoding the data, it is passed through a multi-layer perceptron (MLP), which transforms the data into the galaxy’s rest frame and compresses it into s dimensions (the desired dimensionality of the latent space).
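The attention step can be sketched with a generic softmax-weighted average. This is a simplified stand-in and not necessarily how SPENDER implements its attention module. The function name `attention_pool`, the array sizes, and the fake spectral line at pixel 30 are assumptions for illustration. The example also shows why attention helps with spectra that have not been de-redshifted: on a logarithmically spaced wavelength grid a redshift moves every feature by the same number of pixels, and the pooled output does not care where along the spectrum the feature sits.

```python
import numpy as np


def attention_pool(values, scores):
    """Average `values` along the last axis, weighted by softmax(scores)."""
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights * values).sum(axis=-1), weights


rng = np.random.default_rng(1)
n_channels, n_pixels = 4, 100
values = rng.normal(size=(n_channels, n_pixels))   # features from earlier layers
scores = rng.normal(size=(n_channels, n_pixels))   # "how interesting is this pixel?"
scores[:, 30] += 10.0  # pretend a strong spectral line sits at pixel 30

pooled, weights = attention_pool(values, scores)

# Shift everything by 17 pixels (circularly, for simplicity), mimicking a
# redshift moving the line along a log-wavelength grid.
pooled_shifted, weights_shifted = attention_pool(
    np.roll(values, 17, axis=-1), np.roll(scores, 17, axis=-1)
)

print("attention peaks at pixel", weights.argmax(axis=-1), "then", weights_shifted.argmax(axis=-1))
print("pooled output unchanged by the shift:", np.allclose(pooled, pooled_shifted))
```

The attention weights follow the line to its new position, while the pooled features handed to the next stage stay the same.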
Now the model has to decode the embedded data and try to reconstruct the original spectrum. It does this by passing the data through three ‘activation layers’ that process the data through some preset function. These layers transform the simple, low-dimensional (latent) representation of the data into the rest-frame spectrum of the galaxy. Finally, this rest-frame spectrum is transformed back to the observed frame, completing the reconstruction.
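The last step, moving a rest-frame model back to the observed frame, amounts to stretching the wavelength axis by (1 + z) and resampling onto the instrument's pixels. Below is a minimal sketch using linear interpolation. The helper `to_observed_frame`, the wavelength grids, and the single Gaussian emission line are hypothetical, and the sketch ignores the flux scaling and instrumental resolution that a real pipeline would have to handle.

```python
import numpy as np


def to_observed_frame(rest_wave, rest_flux, z, obs_wave):
    """Redshift a rest-frame spectrum and resample it onto an observed grid."""
    shifted_wave = rest_wave * (1 + z)
    # Linear interpolation; pixels outside the model's coverage become NaN.
    return np.interp(obs_wave, shifted_wave, rest_flux, left=np.nan, right=np.nan)


# Hypothetical rest-frame model: flat continuum plus one emission line.
rest_wave = np.linspace(2000.0, 10000.0, 8001)    # 1 Angstrom steps
line_center = 5000.0
rest_flux = 1.0 + 5.0 * np.exp(-0.5 * ((rest_wave - line_center) / 3.0) ** 2)

# Illustrative observed grid and redshift.
obs_wave = np.linspace(3800.0, 9200.0, 5401)      # 1 Angstrom steps
z = 0.2
obs_flux = to_observed_frame(rest_wave, rest_flux, z, obs_wave)

peak_wave = obs_wave[np.nanargmax(obs_flux)]
print(f"line emitted at {line_center:.0f} A is observed at {peak_wave:.0f} A")
```

Because the rest-frame model covers a wider range than any single observation, the same model can be compared against galaxies at many different redshifts.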
In practice, the contributions of different parts of the data to the final result depend on initially unknown weights. To learn these weights, the model is trained: the reconstructed and original data are compared, and the weights are adjusted (roughly by trial and error) until an optimal set of weights is achieved.
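The “trial and error” is usually more systematic than it sounds: neural networks are typically trained by gradient descent, which nudges every weight in the direction that reduces the mismatch between the reconstruction and the original. Here is a minimal sketch for a toy linear autoencoder. The data, sizes, learning rate, and number of steps are assumptions chosen for illustration and have nothing to do with the actual SPENDER training setup.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 10 measured values per sample, driven by 2 hidden parameters.
n_samples, n_features, n_latent = 200, 10, 2
latents_true = rng.normal(size=(n_samples, n_latent))
mixing = rng.normal(size=(n_latent, n_features)) / np.sqrt(n_features)
data = latents_true @ mixing + 0.01 * rng.normal(size=(n_samples, n_features))

# Initially unknown weights: start from small random values.
w_enc = 0.1 * rng.normal(size=(n_features, n_latent))
w_dec = 0.1 * rng.normal(size=(n_latent, n_features))

learning_rate, n_steps = 0.02, 3000
losses = []
for step in range(n_steps):
    z = data @ w_enc                      # encode
    residual = z @ w_dec - data           # decode and compare with the original
    losses.append(np.sum(residual**2) / n_samples)

    # Gradients of the loss with respect to each weight matrix.
    grad_dec = 2.0 / n_samples * z.T @ residual
    grad_enc = 2.0 / n_samples * data.T @ residual @ w_dec.T

    # Nudge the weights downhill.
    w_dec -= learning_rate * grad_dec
    w_enc -= learning_rate * grad_enc

print(f"loss before training: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

The loss printed at the end should be far smaller than at the start, meaning the learned weights now reconstruct the data from just two numbers per sample.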
So how does it do?
The results of running the SPENDER model on an example spectrum from a galaxy in the Sloan Digital Sky Survey are shown in Figure 3.

Visually, the model seems to perform well in reproducing the given spectrum. Figure 3 also shows one of the advantages of such a model. Not only is the model able to reproduce the different complexities of the spectrum, but by varying the resolution of the reconstructed spectrum, it is able to distinguish overlapping (or blended) features in the input data (see the two adjacent OII lines in Figure 3, for example). Finally, the way SPENDER is constructed means that data can be passed to the model as it is received from the instrument: because the model is trained on inputs that have not been de-redshifted or cleaned, it learns to incorporate that processing into its analysis. Such an architecture can also be used to generate mock spectra and provides a new approach to modeling galaxy spectra in detail, alleviating some of the problems inherent in existing empirical galaxy modeling approaches.
Astrobite edited by Katie Proctor
Featured Image Credit: Adapted from the paper and Bisier (via FreeImages)
About Sahil Hegde
I am a first-year astrophysics PhD student at UCLA. I now use semi-analytic models to study the formation of the first stars and galaxies in the Universe. I graduated from Columbia University and am originally from the San Francisco Bay Area. Outside of astronomy you’ll find me playing tennis, surfing (read: wiping out), and playing board games/TTRPGs!