
Python Scripts to Prepare ArXiv Submissions

Generally, papers are written to be published at conferences or in journals. While some journals care about the LaTeX source used to compile the submitted papers, most venues just expect compiled PDFs to be submitted. ArXiv, however, requires the full LaTeX source, which is compiled on the ArXiv servers. As the LaTeX source of every ArXiv paper can be downloaded, preparing a submission usually involves removing all comments and unused figures/files, and "flattening" the directory structure, since ArXiv does not handle subdirectories well. In this article, I want to share two simple scripts that take care of the latter two problems: removing unused files and flattening.

Introduction

Preparing a paper for ArXiv usually involves the same steps. Authors generally want to remove comments and unused files — for example, images for figures/plots. Also, ArXiv usually expects a "flat" submission, that is, ArXiv does not work well with sub-directories in the uploaded submission. Over the years, I came up with a set of Python scripts to remove unused files and flatten the remaining files. In this article, I want to share these scripts.

The code, including basic documentation, can be found on GitHub:

Code on GitHub

Requirements are minimal: the scripts should not need any Python packages beyond a standard Python 3.x installation.

Usage

Make sure to create a backup of the LaTeX project before using the below scripts.

sanitize_submission.py: Often, I found myself generating lots of figures and plots for various experiments and then only showing a subset in the final paper. This script uses the snapshot package to identify the files actually used in the paper and remove all others. To make this possible, \RequirePackage{snapshot} has to be included in the LaTeX project (before \documentclass). Compiling the project then generates a .dep file that lists included files as

*{file}   {./images/./logo/logo.pdf}{0000/00/00 v0.0}
*{file}   {./images/./logo/mpi_is.png}{0000/00/00 v0.0}
...

The dependency file can be scanned in Python to identify all used files and match these against the files found in a provided directory. Afterwards, the unused files can be deleted. This is done by first running
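The parsing step can be sketched as follows. This is a hypothetical simplification, not the actual script; the function name and regex are my assumptions based on the .dep format shown above:

```python
import os
import re

def parse_dep_file(dep_path):
    """Return the set of file paths listed in a snapshot .dep file.

    Hypothetical sketch of the parsing step; the actual script may differ.
    Matches lines of the form:
      *{file}   {./images/./logo/logo.pdf}{0000/00/00 v0.0}
    """
    pattern = re.compile(r'\*\{file\}\s*\{([^}]+)\}')
    used = set()
    with open(dep_path) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                # Normalize paths like "./images/./logo/logo.pdf".
                used.add(os.path.normpath(match.group(1)))
    return used
```

The resulting set can then be compared against a directory listing; anything in the directory but not in the set is a candidate for deletion.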

python sanitize_submission.py --mode=check --dep_file=paper.dep --asset_directory=images/ --extensions=png

which should list all files found in images/, identifying those actually used in paper.tex (which generated paper.dep). If the list looks reasonable, the changes can be made final using

python sanitize_submission.py --mode=delete --dep_file=paper.dep --asset_directory=images/ --extensions=png

Using the --extensions flag, the script can be run only for specific file types (like images, or PDF figures/plots).

flatten_submission.py can subsequently be used to "flatten" the remaining files. By this, I mean moving images and figures from sub-directories to the root directory. The main problem with doing this automatically is that the same filenames might be used in different directories. To avoid this problem, the directory names are kept as prefixes.
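The prefixing scheme can be illustrated with a small helper. This is a hypothetical sketch of the idea, not the actual script; the separator choice is an assumption:

```python
import os

def flatten_name(path):
    """Map a nested asset path to a flat filename, keeping directory
    names as prefixes so that files with the same basename in different
    directories stay distinct.

    Hypothetical helper illustrating the scheme; the actual script may
    choose separators differently.
    """
    # e.g. images/logo/logo.pdf -> images_logo_logo.pdf
    return os.path.normpath(path).replace(os.sep, '_')
```

This way, images/a/logo.pdf and images/b/logo.pdf map to different flat names and do not collide in the root directory.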

By default, running

python flatten_submission.py --mode=check --asset_directory=images/ --extensions=png,jpg,jpeg

will only consider \includegraphics statements. Other includes, such as \input statements, are generally more difficult to handle, but can be supported if they are used consistently and a corresponding regular expression can be written. This command will print out a list of files that will be renamed/copied. If the list looks good,

python flatten_submission.py --mode=copy --asset_directory=images/ --extensions=png,jpg,jpeg

will make the changes final. As filenames are often limited to 64 characters, the script includes an update_filepath function that can be used to shorten parts of filenames using pattern matching.
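The rewriting of \includegraphics statements can be sketched as follows. Again, this is a simplified illustration rather than the actual script; the function name and regex are assumptions:

```python
import re

def rewrite_includegraphics(tex, rename_map):
    """Rewrite the paths inside \\includegraphics{...} statements.

    Hypothetical sketch; the real script may use a different regex.
    rename_map maps paths as written in the .tex file to flat names.
    Paths not in the map are left unchanged.
    """
    pattern = re.compile(r'(\\includegraphics(?:\[[^\]]*\])?\{)([^}]+)(\})')

    def repl(match):
        # Keep the optional [width=...] argument, swap only the path.
        path = match.group(2)
        return match.group(1) + rename_map.get(path, path) + match.group(3)

    return pattern.sub(repl, tex)
```

Applied to every .tex file after copying the assets to their flat names, this keeps the document compilable from the root directory.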

Conclusion

Both scripts above are simple tools to prepare papers for ArXiv. Usually, I would remove comments manually, then make a backup and run the scripts in the order described above. Afterwards, there should not be unused files and all files should be in the root directory — ready to be uploaded to ArXiv.

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.