import re
regex = re.compile(r"@(?P<citekey>\w[\w:.#$%&\-+?<>~/]*\w+)", flags=re.MULTILINE)
test_str = ("The [Pandoc manual](https://pandoc.org/MANUAL.html#citations) defines the following syntax for citation keys:\n\n"
"> The citation key must begin with a letter, digit, or `_`, and may contain alphanumerics, `_`, and internal punctuation characters (`:.#$%&-+?<>~/`). Here are some examples:\n\n"
"## Valid citekeys\n\n"
"Manubot supports citations like `@source:identifier`, where `source` is one of the options described below. The citekeys in this section are valid according to the Pandoc syntax.\n\n"
"1. DOI (Digital Object Identifier), cite like `@doi:10.15363/thinklab.4`.\n"
" Shortened versions of DOIs can be created at [shortdoi.org](http://shortdoi.org/).\n"
" shortDOIs begin with `10/` rather than `10.` and can also be cited.\n"
" For example, Manubot will expand `@doi:10/993` to the DOI above.\n"
" We suggest using shortDOIs to cite DOIs containing forbidden characters, such as `(` or `)`.\n"
"2. PubMed Central ID, cite like `@pmcid:PMC4497619`.\n"
"3. PubMed ID, cite like `@pmid:26158728`.\n"
"4. _arXiv_ ID, cite like `@arxiv:1508.06576v2`.\n"
"5. ISBN (International Standard Book Number), cite like `@isbn:9781339919881`.\n"
"6. URL / webpage, cite like `@url:https://nyti.ms/1QUgAt1`.\n"
" URL citations can be helpful if the above methods return incorrect metadata.\n"
" For example, `@doi:10.1038/ng.3834` [incorrectly handles](https://github.com/manubot/manubot/issues/158) the consortium name resulting in a blank author, while `@url:https://doi.org/10.1038/ng.3834` succeeds.\n"
" Similarly, `@url:https://doi.org/10.1101/142760` is a [workaround](https://github.com/manubot/manubot/issues/16) to set the journal name of bioRxiv preprints to _bioRxiv_.\n"
"7. Wikidata Items, cite like `@wikidata:Q50051684`.\n"
" Note that anyone can edit or add records on [Wikidata](https://www.wikidata.org), so users are encouraged to contribute metadata for hard-to-cite works to Wikidata as an alternative to using a `raw` citation.\n"
"8. For references that do not have any of the persistent identifiers above, use a raw citation like `@raw:old-manuscript`.\n"
" Metadata for raw citations must be provided manually.\n\n"
"Cite multiple items at once like:\n\n"
"```md\n"
"Here is a sentence with several citations [@doi:10.15363/thinklab.4; @pmid:26158728; @arxiv:1508.06576; @isbn:9780394603988].\n"
"```\n\n"
"More information at https://github.com/manubot/rootstock/blob/master/USAGE.md#citations\n\n"
"## Invalid citekeys\n\n"
"Citekeys in this section would be nice to support, but notice that they do not completely match the regex:\n\n"
"Citekey with parentheses @doi:10.1016/S0022-2836(05)80360-2\n"
"Citekey with closing slash @https://www.google.com/\n"
"Citekey with equal sign @https://openreview.net/forum?id=HkwoSDPgg\n\n"
"See https://github.com/jgm/pandoc/issues/6026 for discussion on a more flexible markdown syntax for citation keys.\n\n")
matches = regex.finditer(test_str)
for match_num, match in enumerate(matches, start=1):
    print(f"Match {match_num} was found at {match.start()}-{match.end()}: {match.group()}")
    for group_num, group in enumerate(match.groups(), start=1):
        print(f"Group {group_num} found at {match.start(group_num)}-{match.end(group_num)}: {group}")
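
# The "Invalid citekeys" examples in test_str are still matched by the pattern,
# but only up to the first unsupported character. The sketch below makes that
# truncation visible. The names `citekey_pattern`, `partial_citekey`, and
# `invalid_examples` are illustrative assumptions, not part of the generated
# snippet above.
import re

# Same pattern as `regex` above, repeated here so this check runs on its own.
citekey_pattern = re.compile(r"@(?P<citekey>\w[\w:.#$%&\-+?<>~/]*\w+)")


def partial_citekey(intended):
    """Return the part of `intended` that the pattern captures after an `@`."""
    match = citekey_pattern.search("@" + intended)
    return match.group("citekey") if match else None


# The three identifiers listed under "Invalid citekeys" in test_str.
invalid_examples = (
    "doi:10.1016/S0022-2836(05)80360-2",          # contains parentheses
    "https://www.google.com/",                    # ends with a slash
    "https://openreview.net/forum?id=HkwoSDPgg",  # contains an equal sign
)
for intended in invalid_examples:
    captured = partial_citekey(intended)
    # A captured value shorter than the intended one means the key was truncated.
    print(f"{intended!r} is captured as {captured!r}")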