Introduction
Preparing a paper for ArXiv usually involves the same steps. Authors generally want to remove comments and unused files — for example, images for figures/plots. Also, ArXiv usually expects a "flat" submission, that is, ArXiv does not work well with sub-directories in the uploaded submission. Over the years, I came up with a set of Python scripts to remove unused files and flatten the remaining files. In this article, I want to share these scripts.
The code including basic documentation can be found on GitHub:
Code on GitHub
Requirements are minimal: the scripts should not need any Python packages beyond those included in a standard Python 3.x installation.
Usage
Make sure to create a backup of the LaTeX project before using the scripts below.
sanitize_submission.py: Often, I found myself generating lots of figures and plots for various experiments and then only showing a subset in the final paper. This script uses the snapshot package to identify those actually used in the paper and remove all others. To make this possible, \RequirePackage{snapshot}
has to be included in the LaTeX project. Compiling it will then generate a .dep
file that lists included files as
*{file} {./images/./logo/logo.pdf}{0000/00/00 v0.0}
*{file} {./images/./logo/mpi_is.png}{0000/00/00 v0.0}
...
The dependency file can be scanned in Python to identify all used files and match these to the files found in a provided directory. Afterwards, the files not used can be deleted. This is done by first running
python sanitize_submission.py --mode=check --dep_file=paper.dep --asset_directory=images/ --extensions=png
which should list all files found in images/
and identify those used in paper.tex
(which generated paper.dep
). If the list looks reasonable, the changes can be made final using
python sanitize_submission.py --mode=delete --dep_file=paper.dep --asset_directory=images/ --extensions=png
Using the --extensions
flag, the script can be run only for specific file types (like images, or PDF figures/plots).
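The dependency-file scan described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function names read_used_files and find_unused_files, the root argument, and the assumption that every used asset appears as a *{file} entry with a path relative to the project root are all mine.

```python
import os
import re
from pathlib import Path

# Matches entries such as: *{file} {./images/./logo/logo.pdf}{0000/00/00 v0.0}
FILE_ENTRY = re.compile(r"\*\{file\}\s*\{([^}]*)\}")


def read_used_files(dep_text):
    """Return the normalized paths of all *{file} entries in the .dep text."""
    used = set()
    for match in FILE_ENTRY.finditer(dep_text):
        # normpath collapses "./images/./logo/logo.pdf" to "images/logo/logo.pdf".
        used.add(os.path.normpath(match.group(1).strip()))
    return used


def find_unused_files(dep_text, root, asset_directory, extensions):
    """List files below root/asset_directory that the .dep text does not mention.

    Hypothetical helper: paths in the .dep file are assumed to be relative
    to root, and extensions is a set such as {"png", "jpg"}.
    """
    used = read_used_files(dep_text)
    unused = []
    for path in sorted((Path(root) / asset_directory).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower().lstrip(".") not in extensions:
            continue
        relative = os.path.normpath(os.path.relpath(path, root))
        if relative not in used:
            unused.append(path)
    return unused
```

Printing the returned list corresponds to a dry run; deleting the listed files (for example with path.unlink()) would only follow once the list has been checked by hand.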
flatten_submission.py can subsequently be used to "flatten" the remaining files. By this, I mean moving images and figures from sub-directories to the root directory. The main problem with doing this automatically is that the same filenames might be used in different directories. To avoid this problem, the directory names are kept as prefixes.
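To make the prefixing idea concrete, here is a small sketch of how a nested path could be mapped to a flat filename. The underscore separator, the function names flatten_path and flatten_all, and the example directory names are assumptions for illustration; the actual script may build its names differently.

```python
from pathlib import PurePosixPath


def flatten_path(filepath, separator="_"):
    """Join all path components into one filename, keeping directories as prefixes."""
    # PurePosixPath drops "." components, so "./images/./logo/logo.pdf"
    # has the parts ("images", "logo", "logo.pdf").
    return separator.join(PurePosixPath(filepath).parts)


def flatten_all(filepaths, separator="_"):
    """Map every path to its flat name and fail if two paths would collide."""
    flat_names = {}
    for filepath in filepaths:
        flat = flatten_path(filepath, separator)
        if flat in flat_names.values():
            raise ValueError(f"flattened name {flat!r} is not unique")
        flat_names[filepath] = flat
    return flat_names


# Two files with the same name in different (hypothetical) directories
# end up with distinct flat names.
print(flatten_all(["images/mnist/accuracy.png", "images/cifar/accuracy.png"]))
```

The explicit collision check is there because prefixing alone is not airtight: with an underscore separator, a_b/c.png and a/b_c.png would both become a_b_c.png.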
By default, running
python flatten_submission.py --mode=check --asset_directory=images/ --extensions=png,jpg,jpeg
will only consider \includegraphics
statements. Other includes such as \input
statements are generally more difficult to handle but can be supported if they are used consistently and a corresponding regular expression can be derived. The command above prints a list of files that will be renamed/copied. If the list looks good,
python flatten_submission.py --mode=copy --asset_directory=images/ --extensions=png,jpg,jpeg
will make the changes final. As filenames are often limited to 64 characters, the script includes an update_filepath
function that can be used to shorten parts of filenames using pattern matching.
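The following sketch shows one way the \includegraphics handling and the pattern-based shortening could fit together. It is an illustration under assumptions, not the script itself: the regular expression, the rewrite_includes and shorten helpers, and the (pattern, replacement) list are hypothetical, and shorten only stands in for whatever update_filepath actually does.

```python
import re

# \includegraphics[options]{path}: group 1 keeps the command and its options,
# group 2 is the path. Options containing "]" are not handled by this sketch.
INCLUDEGRAPHICS = re.compile(r"(\\includegraphics\s*(?:\[[^\]]*\])?\s*)\{([^}]+)\}")


def shorten(name, replacements):
    """Apply (pattern, replacement) pairs in order to shorten a filename."""
    for pattern, replacement in replacements:
        name = re.sub(pattern, replacement, name)
    return name


def rewrite_includes(tex, asset_directory, replacements=()):
    """Return the rewritten LaTeX source and a mapping old path -> flat filename."""
    prefix = asset_directory.rstrip("/") + "/"
    renames = {}

    def replace(match):
        path = match.group(2).strip()
        if not path.startswith(prefix):
            # Leave includes outside the asset directory untouched.
            return match.group(0)
        flat = shorten(path.replace("/", "_"), replacements)
        renames[path] = flat
        return f"{match.group(1)}{{{flat}}}"

    return INCLUDEGRAPHICS.sub(replace, tex), renames


tex = r"\includegraphics[width=0.5\textwidth]{images/experiments/accuracy.png}"
new_tex, renames = rewrite_includes(tex, "images/", [(r"experiments", "exp")])
print(new_tex)
print(renames)
```

Returning the rename mapping separately from the rewritten source keeps a dry run cheap: the mapping can be printed for inspection, and the files are only copied and the .tex file only written once it looks right.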
Conclusion
Both scripts are simple tools for preparing papers for ArXiv. Usually, I remove comments manually, make a backup, and then run the scripts in the order described above. Afterwards, there should be no unused files left and all remaining files should be in the root directory, ready to be uploaded to ArXiv.