
Python Scripts to Prepare ArXiv Submissions

Generally, papers are written to be published at conferences or in journals. While some journals care about the LaTeX source used to compile the submitted papers, most venues just expect compiled PDFs to be submitted. ArXiv, however, requires the full LaTeX source, which is compiled on the ArXiv servers. As the LaTeX source of every ArXiv paper can be downloaded, preparing a submission usually involves removing all comments and unused figures/files, and "flattening" the directory structure, since ArXiv does not handle subdirectories well. In this article, I want to share two simple scripts that take care of the latter two problems: removing unused files and flattening.

Introduction

Preparing a paper for ArXiv usually involves the same steps. Authors generally want to remove comments and unused files — for example, images for figures/plots. Also, ArXiv usually expects a "flat" submission, that is, ArXiv does not work well with sub-directories in the uploaded submission. Over the years, I came up with a set of Python scripts to remove unused files and flatten the remaining files. In this article, I want to share these scripts.

The code, including basic documentation, can be found on GitHub:

Code on GitHub

Requirements are minimal: the scripts should not need any Python packages beyond a standard Python 3.x installation.

Usage

Make sure to create a backup of the LaTeX project before using the below scripts.

sanitize_submission.py: Often, I found myself generating lots of figures and plots for various experiments and then only showing a subset in the final paper. This script uses the snapshot package to identify the files actually used in the paper and remove all others. To make this possible, \RequirePackage{snapshot} has to be included in the LaTeX project (before \documentclass). Compiling the project then generates a .dep file that lists included files as

*{file}   {./images/./logo/logo.pdf}{0000/00/00 v0.0}
*{file}   {./images/./logo/mpi_is.png}{0000/00/00 v0.0}
...

The dependency file can be scanned in Python to identify all used files and match these against the files found in a provided directory. Afterwards, the unused files can be deleted. This is done by first running
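The parsing step can be sketched as follows. This is a hypothetical simplification, not the actual script; the function name and regex are my assumptions based on the .dep format shown above:

```python
import os
import re

def parse_dep_file(dep_path):
    """Return the set of file paths listed in a snapshot .dep file.

    Hypothetical sketch of the parsing step; the actual script may differ.
    Matches lines of the form:
      *{file}   {./images/./logo/logo.pdf}{0000/00/00 v0.0}
    """
    pattern = re.compile(r'\*\{file\}\s*\{([^}]+)\}')
    used = set()
    with open(dep_path) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                # Normalize paths like "./images/./logo/logo.pdf".
                used.add(os.path.normpath(match.group(1)))
    return used
```

The resulting set can then be compared against a directory listing; anything in the directory but not in the set is a candidate for deletion.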

python sanitize_submission.py --mode=check --dep_file=paper.dep --asset_directory=images/ --extensions=png

which should list all files found in images/, identifying those actually used in paper.tex (which generated paper.dep). If the list looks reasonable, the changes can be made final using

python sanitize_submission.py --mode=delete --dep_file=paper.dep --asset_directory=images/ --extensions=png

Using the --extensions flag, the script can be run only for specific file types (like images, or PDF figures/plots).

flatten_submission.py can subsequently be used to "flatten" the remaining files. By this, I mean moving images and figures from sub-directories to the root directory. The main problem with doing this automatically is that the same filenames might be used in different directories. To avoid this problem, the directory names are kept as prefixes.
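The prefixing scheme can be illustrated with a small helper. This is a hypothetical sketch of the idea, not the actual script; the separator choice is an assumption:

```python
import os

def flatten_name(path):
    """Map a nested asset path to a flat filename, keeping directory
    names as prefixes so that files with the same basename in different
    directories stay distinct.

    Hypothetical helper illustrating the scheme; the actual script may
    choose separators differently.
    """
    # e.g. images/logo/logo.pdf -> images_logo_logo.pdf
    return os.path.normpath(path).replace(os.sep, '_')
```

This way, images/a/logo.pdf and images/b/logo.pdf map to different flat names and do not collide in the root directory.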

By default, running

python flatten_submission.py --mode=check --asset_directory=images/ --extensions=png,jpg,jpeg

will only consider \includegraphics statements. Other includes, such as \input statements, are generally more difficult to handle, but can be supported if they are used consistently and a corresponding regular expression can be written. This command will print out a list of files that will be renamed/copied. If the list looks good,

python flatten_submission.py --mode=copy --asset_directory=images/ --extensions=png,jpg,jpeg

will make the changes final. As filenames are often limited to 64 characters, the script includes an update_filepath function that can be used to shorten parts of filenames using pattern matching.
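The rewriting of \includegraphics statements can be sketched as follows. Again, this is a simplified illustration rather than the actual script; the function name and regex are assumptions:

```python
import re

def rewrite_includegraphics(tex, rename_map):
    """Rewrite the paths inside \\includegraphics{...} statements.

    Hypothetical sketch; the real script may use a different regex.
    rename_map maps paths as written in the .tex file to flat names.
    Paths not in the map are left unchanged.
    """
    pattern = re.compile(r'(\\includegraphics(?:\[[^\]]*\])?\{)([^}]+)(\})')

    def repl(match):
        # Keep the optional [width=...] argument, swap only the path.
        path = match.group(2)
        return match.group(1) + rename_map.get(path, path) + match.group(3)

    return pattern.sub(repl, tex)
```

Applied to every .tex file after copying the assets to their flat names, this keeps the document compilable from the root directory.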

Conclusion

Both scripts above are simple tools to prepare papers for ArXiv. Usually, I would remove comments manually, then make a backup and run the scripts in the order described above. Afterwards, there should not be unused files and all files should be in the root directory — ready to be uploaded to ArXiv.

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.