Metadata-Version: 2.1
Name: arxivloader
Version: 1.0.2
Summary: Wrapper for the arXiv API
Home-page: https://github.com/stammler/arxivloader/
Author: Sebastian Stammler
Author-email: sebastian.stammler@gmail.com
Maintainer: Sebastian Stammler
License: BSD
Project-URL: Source Code, https://github.com/stammler/arxivloader/
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering :: Physics
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: bs4
Requires-Dist: lxml
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: requests
Requires-Dist: tqdm

# arXiv Loader

[![GitHub](https://img.shields.io/github/license/stammler/arxivloader) ](https://github.com/stammler/arxivloader/blob/master/LICENSE) [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](https://github.com/stammler/arxivloader/blob/master/.github/CODE_OF_CONDUCT.md)  
[![PyPI - Downloads](https://img.shields.io/pypi/dm/arxivloader?label=PyPI%20downloads)](https://pypistats.org/packages/arxivloader)

This tool is a wrapper around the [arXiv API](https://arxiv.org/help/api/) that retrieves metadata of articles published on arXiv as a `pandas.DataFrame`.  
Please abide by the [Terms of Usage](https://arxiv.org/help/api/tou) of the arXiv API.

## Installation

`pip install arxivloader`

## Usage

Please consult the [arXiv API documentation](https://arxiv.org/help/api/user-manual#_query_interface) for help in constructing a valid query string.

### Searching by keyword

To search for a keyword, the query needs to start with `search_query=` followed by a prefix and the search word.  
Possible prefixes are:

| Prefix | Explanation       |
|:-------|:------------------|
| ti     | Title             |
| au     | Author            |
| abs    | Abstract          |
| co     | Comments          |
| jr     | Journal Reference |
| cat    | Subject Category  |
| rn     | Report Number     |
| id     | arXiv ID          |
| all    | All of the above  |

Please have a look at the [arXiv API documentation](https://arxiv.org/help/api/user-manual#query_details) for details.

```
import arxivloader

keyword = "DustPy"
prefix = "all"
query = "search_query={pf}:{kw}".format(pf=prefix, kw=keyword)
columns = ["id", "title", "authors"]

df = arxivloader.load(query, columns=columns)
print(df)
```

|    | id           | title                                                               | authors                                                                              |
|---:|:-------------|:--------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
|  0 | 2207.00322v2 | DustPy: A Python Package for Dust Evolution in Protoplanetary Disks | Sebastian Markus Stammler; Tilman Birnstiel                                          |
|  1 | 2110.04007v1 | The formation of wide exoKuiper belts from migrating dust traps     | E. Miller; S. Marino; S. M. Stammler; P. Pinilla; C. Lenz; T. Birnstiel; Th. Henning |
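
Several prefixed terms can be combined with the Boolean operators of the arXiv API (`AND`, `OR`, `ANDNOT`). The following is only a minimal sketch: `build_search_query` is a hypothetical helper and not part of `arxivloader`, and it assumes single-word keywords that need no quoting or URL encoding.

```python
# Hypothetical helper, not part of arxivloader: builds a query string only.
ALLOWED_PREFIXES = {"ti", "au", "abs", "co", "jr", "cat", "rn", "id", "all"}
ALLOWED_OPERATORS = {"AND", "OR", "ANDNOT"}


def build_search_query(terms, operator="AND"):
    """Join (prefix, keyword) pairs into a search_query string."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError("Unknown operator: {}".format(operator))
    parts = []
    for prefix, keyword in terms:
        if prefix not in ALLOWED_PREFIXES:
            raise ValueError("Unknown prefix: {}".format(prefix))
        parts.append("{pf}:{kw}".format(pf=prefix, kw=keyword))
    # Operators are separated from the terms by "+", as in the date example.
    return "search_query=" + "+{}+".format(operator).join(parts)


query = build_search_query([("au", "Stammler"), ("ti", "DustPy")])
print(query)  # search_query=au:Stammler+AND+ti:DustPy
```

The resulting string can be passed to `arxivloader.load()` in the same way as the hand-written queries in this document.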

### Searching by id

To search for a specific arXiv ID the query needs to start with `id_list=` followed by a comma-separated list of arXiv IDs:

```
import arxivloader

IDs = ["1909.04674", "1909.10526"]
query = "id_list={}".format(",".join(IDs))
columns = ["id", "title", "authors"]

df = arxivloader.load(query, columns=columns)

print(df)
```

|    | id           | title                                                               | authors                                                                                                       |
|---:|:-------------|:--------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------|
|  0 | 1909.04674v1 | The DSHARP Rings: Evidence of Ongoing Planetesimal Formation?       | Sebastian M. Stammler; Joanna Drazkowska; Til Birnstiel; Hubert Klahr; Cornelis P. Dullemond; Sean M. Andrews |
|  1 | 1909.10526v1 | Including Dust Coagulation in Hydrodynamic Models of Protoplanetary Disks: Dust Evolution in the Vicinity of a Jupiter-mass Planet  | Joanna Drazkowska; Shengtai Li; Til Birnstiel; Sebastian M. Stammler; Hui Li                                  |
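
Note that the returned `id` column carries a version suffix (for example `v1`), while the requested IDs do not. If the two need to be compared, the suffix can be stripped first. This is a small sketch with a hypothetical helper `strip_version`; the `returned` list stands in for `df["id"]` from the example above.

```python
import re


def strip_version(arxiv_id):
    """Remove a trailing version suffix such as 'v1' from an arXiv ID."""
    return re.sub(r"v\d+$", "", arxiv_id)


requested = ["1909.04674", "1909.10526"]
# Stand-in for list(df["id"]) from the query above.
returned = ["1909.04674v1", "1909.10526v1"]

# IDs that were requested but are missing from the result.
missing = set(requested) - {strip_version(i) for i in returned}
print(missing)  # set()
```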

### Filtering specific articles by keywords

If both `search_query=` and `id_list=` are present, the given arXiv articles are filtered by the given keyword.

```
import arxivloader

keyword = "DSHARP"
prefix = "ti"
IDs = ["1909.04674", "1909.10526"]
query = "search_query={pf}:{kw}&id_list={ids}".format(pf=prefix, kw=keyword, ids=",".join(IDs))
columns = ["id", "title", "authors"]

df = arxivloader.load(query, columns=columns)

print(df)
```

|    | id           | title                                                         | authors                                                                                                       |
|---:|:-------------|:--------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------|
|  0 | 1909.04674v1 | The DSHARP Rings: Evidence of Ongoing Planetesimal Formation? | Sebastian M. Stammler; Joanna Drazkowska; Til Birnstiel; Hubert Klahr; Cornelis P. Dullemond; Sean M. Andrews |

### Searching by date

It is possible to only retrieve entries in a specified date window.  
This query selects all publications that have been submitted to `astro-ph.EP` on July 1st 2022 between 8am and 1pm.

```
import arxivloader

prefix = "cat"
cat = "astro-ph.EP"
submittedDate = "[20220701080000+TO+20220701130000]"
query = "search_query={pf}:{cat}+AND+submittedDate:{sd}".format(pf=prefix, cat=cat, sd=submittedDate)
columns = ["id", "title", "authors", "published"]

df = arxivloader.load(query, columns=columns, sortBy="submittedDate", sortOrder="ascending")
print(df)
```

|    | id           | title                                                               | authors                                                               | published           |
|---:|:-------------|:--------------------------------------------------------------------|:----------------------------------------------------------------------|:--------------------|
|  0 | 2207.00273v1 | Whistler Waves As a Signature of Converging Magnetic Holes in Space Plasmas | Wence Jiang; Daniel Verscharen; Hui Li; Chi Wang; Kristopher G. Klein | 2022-07-01 08:55:54 |
|  1 | 2207.00322v2 | DustPy: A Python Package for Dust Evolution in Protoplanetary Disks | Sebastian Markus Stammler; Tilman Birnstiel                           | 2022-07-01 10:25:59 |
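
Instead of typing the date window by hand, it can be generated from `datetime` objects. The sketch below uses a hypothetical helper `submitted_date_range` and assumes the same 14-digit `YYYYMMDDhhmmss` format as the example above.

```python
from datetime import datetime


def submitted_date_range(start, end):
    """Format two datetimes as a submittedDate range for the query string."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    fmt = "%Y%m%d%H%M%S"  # same layout as "20220701080000" above
    return "[{a}+TO+{b}]".format(a=start.strftime(fmt), b=end.strftime(fmt))


sd = submitted_date_range(datetime(2022, 7, 1, 8), datetime(2022, 7, 1, 13))
query = "search_query=cat:astro-ph.EP+AND+submittedDate:{sd}".format(sd=sd)
print(query)
```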

### Searching by category

It is possible to search a large number of articles by category. Please be responsible with the traffic this query causes.

```
import arxivloader

keyword = "astro-ph.EP"
prefix = "cat"
query = "search_query={pf}:{kw}".format(pf=prefix, kw=keyword)
columns = ["id", "title", "primary_category", "categories", "published"]

df = arxivloader.load(query, columns=columns, sortBy="submittedDate", sortOrder="descending", num=1000, page_size=100)

print(df.head(5))
```

|    | id           | title                                                                  | primary_category   | categories                 | published           |
|---:|:-------------|:-----------------------------------------------------------------------|:-------------------|:---------------------------|:--------------------|
|  0 | 2210.11357v1 | The Key Factors Controlling the Seasonality of Planetary Climate       | physics.ao-ph      | physics.ao-ph; astro-ph.EP | 2022-10-20 15:45:43 |
|  1 | 2210.11305v1 | On the origin of the dichotomy of stellar activity cycles              | astro-ph.SR        | astro-ph.SR; astro-ph.EP   | 2022-10-20 14:34:33 |
|  2 | 2210.11207v1 | $\texttt{KOBEsim}$: a Bayesian observing strategy algorithm for planet detection in radial velocity blind-search | astro-ph.EP        | astro-ph.EP; astro-ph.IM   | 2022-10-20 12:33:03 |
|  3 | 2210.11103v1 | Lower-than-expected flare temperatures for TRAPPIST-1                  | astro-ph.SR        | astro-ph.SR; astro-ph.EP   | 2022-10-20 08:55:47 |
|  4 | 2210.10909v1 | TOI-3884 b: A rare 6-R$_{\oplus}$ planet that transits a low-mass star with a giant and likely polar spot | astro-ph.EP        | astro-ph.EP                | 2022-10-19 22:19:15 |

## Options

`arxivloader.load()` has several keyword arguments:

| Keyword     | Default value  | Description                                                                 |
|:------------|:---------------|:----------------------------------------------------------------------------|
| `num`       | 10             | Maximum total number of entries to be retrieved.                            |
| `start`     | 0              | Starting index of query.                                                    |
| `page_size` | 10             | The entries are retrieved in pages. The maximum allowed page size is 30000. |
| `delay`     | 3.             | Delay in seconds between page requests.                                     |
| `sortBy`    | `"relevance"`  | Possible values: `"relevance"`, `"lastUpdatedDate"`, `"submittedDate"`.     |
| `sortOrder` | `"descending"` | Possible values: `"descending"`, `"ascending"`.                             |
| `columns`   | `["id", "title", "summary", "authors", "primary_category", "categories", "comments", "updated", "published", "doi", "links"]`  | List of the columns the `pandas.DataFrame` should contain.                          |
| `timeout`   | 10.            | Timeout in seconds for a single page.                                       |
| `verbosity` | 2              | Level of verbosity.                                                         |

The default options are usually good enough.  
The `delay` has to be at least three seconds to keep the load on the arXiv API fair.  
It can happen that the arXiv API does not respond to a query. `timeout` sets the time after which `arxivloader` assumes a failed attempt; it will retry at most five times. Please note that `timeout` needs to be larger than the time the arXiv API takes to process the query, which depends on `page_size`. Consider two minutes for ten thousand entries in a page.  
If `verbosity` is `0`, `arxivloader` will not display anything on screen. If `verbosity` is `1`, `arxivloader` will print out the number of retrieved entries at the end of execution. If `verbosity` is `2`, `arxivloader` will additionally show a progress bar.
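
To get a feeling for how `num`, `page_size` and `delay` interact, the number of page requests and the minimum waiting time can be estimated up front. This is only a rough sketch: `estimate_paging` is a hypothetical helper, and it assumes one request per page with the delay applied only between consecutive requests, which may differ from what `arxivloader` does internally.

```python
import math


def estimate_paging(num, page_size=10, delay=3.0):
    """Estimate the number of page requests and the minimum total delay."""
    pages = math.ceil(num / page_size)
    # Assumption: the delay is only applied between consecutive pages.
    min_wait = max(pages - 1, 0) * delay
    return pages, min_wait


pages, min_wait = estimate_paging(num=1000, page_size=100, delay=3.0)
print(pages, min_wait)  # 10 27.0
```

Under these assumptions, a larger `page_size` reduces the number of requests, but each request takes longer to process, so `timeout` may need to be raised accordingly.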

## Acknowledgements

Thank you to arXiv for use of its open access interoperability.
