
Comparing changes

base repository: akoumjian/datefinder
base: v0.7.1
...
head repository: akoumjian/datefinder
compare: master

Commits on Jul 15, 2019

  1. Improve strict=True

    ecatkins committed Jul 15, 2019
    38101d9

Commits on Jul 16, 2019

  1. c5b4717
  2. delete

    ecatkins committed Jul 16, 2019
    8bb67e4

Commits on Jun 25, 2020

  1. Create FUNDING.yml

    akoumjian authored Jun 25, 2020
    2837aa8

Commits on Aug 2, 2020

  1. 9ee6ef9
  2. f65e27e
  3. 2790dbe
  4. Add test with full year

    akoumjian committed Aug 2, 2020
    0b86495

Commits on Aug 5, 2020

  1. Add MANIFEST.in

    synapticarbors authored Aug 5, 2020
    148dea3

Commits on Jul 13, 2021

  1. bbcd7e0
  2. a9bd897
  3. added conda badge

    sugatoray authored Jul 13, 2021
    32c8af0
  4. bc5ce3b

Commits on Sep 5, 2021

  1. added danish days and months

    Need this wonderful package for danish date detection, hence my addition of a few time entities.
    MalteHB authored Sep 5, 2021
    c27b758

Commits on Jul 31, 2022

  1. Merge pull request #159 from MalteHB/patch-1

    added danish days and months
    akoumjian authored Jul 31, 2022
    70309a4
  2. Merge pull request #157 from sugatoray/docs/update_readme

    added conda installation option to readme
    akoumjian authored Jul 31, 2022
    c48c651
  3. Fix issue #138 : Thu not recognised by regex

    janto authored and akoumjian committed Jul 31, 2022
    93fd71c
  4. Merge branch 'janto-master'

    akoumjian committed Jul 31, 2022
    2397240
  5. Merge pull request #130 from synapticarbors/patch-1

    Add MANIFEST.in to include license file in source distribution
    akoumjian authored Jul 31, 2022
    fcb70b8
  6. Merge pull request #109 from ecatkins/strict_fix

    Improve strict=True
    akoumjian authored Jul 31, 2022
    f2ab941
  7. cd5b1c3
  8. Create python-package.yml

    akoumjian authored Jul 31, 2022
    cb17634
  9. Create python-publish.yml

    akoumjian authored Jul 31, 2022
    2e7c487
  10. Update python-package.yml

    akoumjian authored Jul 31, 2022
    1fb7b34
  11. bd7c6ce
  12. Minting 0.7.3

    akoumjian committed Jul 31, 2022
    ab59f2b

Commits on Aug 4, 2022

  1. Update README.rst

    akoumjian authored Aug 4, 2022
    3e9219d
  2. 8485fa2
  3. Fixes #172

    akoumjian committed Aug 4, 2022
    2741777

Commits on Jan 22, 2023

  1. Add date range using '-'

    Joe Walker committed Jan 22, 2023
    78718c6

Commits on Jan 23, 2023

  1. Merge pull request #185 from joe-walker/date_range_with_hyphen

    Add date range using '-'
    akoumjian authored Jan 23, 2023
    5376ece
1 change: 1 addition & 0 deletions .github/FUNDING.yml
@@ -0,0 +1 @@
github: akoumjian
40 changes: 40 additions & 0 deletions .github/workflows/python-package.yml
@@ -0,0 +1,40 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Python package

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.8", "3.9", "3.10"]

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install flake8 pytest
        python setup.py install
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest
39 changes: 39 additions & 0 deletions .github/workflows/python-publish.yml
@@ -0,0 +1,39 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Upload Python Package

on:
  release:
    types: [published]

permissions:
  contents: read

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v3
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install build
    - name: Build package
      run: python -m build --wheel
    - name: Publish package
      uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
      with:
        user: __token__
        password: ${{ secrets.PYPI_API_TOKEN }}
18 changes: 0 additions & 18 deletions .travis.yml

This file was deleted.

1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
include LICENSE
21 changes: 11 additions & 10 deletions README.rst
@@ -1,9 +1,9 @@
datefinder - extract dates from text
====================================

.. image:: https://img.shields.io/travis/akoumjian/datefinder/master.svg
:target: https://travis-ci.org/akoumjian/datefinder
:alt: travis build status
.. image:: https://github.com/akoumjian/datefinder/actions/workflows/python-package.yml/badge.svg
:target: https://github.com/akoumjian/datefinder
:alt: Build Status

.. image:: https://img.shields.io/pypi/dm/datefinder.svg
:target: https://pypi.python.org/pypi/datefinder/
@@ -13,10 +13,6 @@ datefinder - extract dates from text
:target: https://pypi.python.org/pypi/datefinder
:alt: pypi version

.. image:: https://img.shields.io/gitter/room/nwjs/nw.js.svg
:target: https://gitter.im/datefinder/Lobby
:alt: gitter chat


A python module for locating dates inside text. Use this package to extract all sorts
of date like strings from a document and turn them into datetime objects.
@@ -28,10 +24,13 @@ This module finds the likely datetime strings and then uses
Installation
------------

**With pip**

.. code-block:: sh
pip install datefinder
**Note: I do not publish the version on conda forge and cannot verify its integrity.**

How to Use
----------
@@ -58,8 +57,10 @@ How to Use
2005-01-15 00:00:00
Support
-------
Demo
----

- 🎞️ `Video demo`_ by Calmcode.io. :star:

You can talk to us on `Gitter <https://gitter.im/datefinder/Lobby>`_ or just submit an issue on `github <https://github.com/akoumjian/datefinder/issues/>`_.
.. _Video demo: https://calmcode.io/shorts/datefinder.py.html

75 changes: 58 additions & 17 deletions datefinder/__init__.py
@@ -5,6 +5,7 @@
from datefinder.date_fragment import DateFragment
from .constants import (
REPLACEMENTS,
DELIMITERS_PATTERN,
TIMEZONE_REPLACEMENTS,
STRIP_CHARS,
DATE_REGEX,
@@ -20,8 +21,14 @@ class DateFinder(object):
Locates dates in a text
"""

def __init__(self, base_date=None):
def __init__(self, base_date=None, first="month"):
self.base_date = base_date
self.dayfirst = False
self.yearfirst = False
if first == "day":
self.dayfirst = True
if first == "year":
self.yearfirst = True

def find_dates(self, text, source=False, index=False, strict=False):

@@ -64,8 +71,14 @@ def _find_and_replace(self, date_string, captures):
# 2. match ' to'
# 3. match ' to '
# but never match r'(\s|)to(\s|)' which would make 'october' > 'ocber'
# but also include delimiters, like this 'date: '
full_match_pattern = (
r"(^|{delimiters_pattern}){key}($|{delimiters_pattern})".format(
delimiters_pattern=DELIMITERS_PATTERN, key=key
)
)
date_string = re.sub(
r"(^|\s)" + key + r"(\s|$)",
full_match_pattern,
replacement,
date_string,
flags=re.IGNORECASE,
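
Review note on the hunk above: the widened boundary pattern can be exercised in isolation. The following is a minimal standalone sketch, not the library's code path. `replace_token` and `replace_token_old` are hypothetical helper names, and the `DELIMITERS_PATTERN` value is copied from `datefinder/constants.py` as it appears in this comparison.

```python
import re

# Value copied from datefinder/constants.py as shown in this comparison.
DELIMITERS_PATTERN = r"[/\:\-\,\s\_\+\@]+"


def replace_token(date_string, key, replacement, delimiters=DELIMITERS_PATTERN):
    """Hypothetical helper: replace `key` only when it is bounded by the
    start/end of the string or by a run of delimiter characters."""
    pattern = r"(^|{d}){k}($|{d})".format(d=delimiters, k=key)
    return re.sub(pattern, replacement, date_string, flags=re.IGNORECASE)


def replace_token_old(date_string, key, replacement):
    """Previous behaviour for comparison: only whitespace counts as a boundary."""
    pattern = r"(^|\s)" + key + r"(\s|$)"
    return re.sub(pattern, replacement, date_string, flags=re.IGNORECASE)


# 'to' inside 'october' is preceded by a letter, so it is left alone.
print(repr(replace_token("october 5", "to", " ")))
# 'date' followed by ': ' now matches because ':' is a delimiter.
print(repr(replace_token("date: 5", "date", " ")))
# With the old whitespace-only boundary the same token is not replaced.
print(repr(replace_token_old("date: 5", "date", " ")))
```

Note that the surrounding delimiter run is part of the match, so it is consumed along with the token. As in the diff, `key` is interpolated without `re.escape`, which assumes the replacement keys contain no regex metacharacters.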
@@ -99,7 +112,12 @@ def parse_date_string(self, date_string, captures):
# For well formatted string, we can already let dateutils parse them
# otherwise self._find_and_replace method might corrupt them
try:
as_dt = parser.parse(date_string, default=self.base_date)
as_dt = parser.parse(
date_string,
default=self.base_date,
dayfirst=self.dayfirst,
yearfirst=self.yearfirst,
)
except (ValueError, OverflowError):
# replace tokens that are problematic for dateutil
date_string, tz_string = self._find_and_replace(date_string, captures)
@@ -113,7 +131,12 @@ def parse_date_string(self, date_string, captures):

try:
logger.debug("Parsing {0} with dateutil".format(date_string))
as_dt = parser.parse(date_string, default=self.base_date)
as_dt = parser.parse(
date_string,
default=self.base_date,
dayfirst=self.dayfirst,
yearfirst=self.yearfirst,
)
except Exception as e:
logger.debug(e)
as_dt = None
@@ -139,9 +162,11 @@ def extract_date_strings_inner(self, text, text_start=0, strict=False):
if rng and len(rng) > 1:
range_strings = []
for range_str in rng:
range_strings.extend(self.extract_date_strings_inner(range_str[0],
text_start=range_str[1][0],
strict=strict))
range_strings.extend(
self.extract_date_strings_inner(
range_str[0], text_start=range_str[1][0], strict=strict
)
)
for range_string in range_strings:
yield range_string
return
@@ -159,6 +184,7 @@ def extract_date_strings_inner(self, text, text_start=0, strict=False):
# digits_modifiers = captures.get('digits_modifiers')
# days = captures.get('days')
months = captures.get("months")
years = captures.get("years")
# timezones = captures.get('timezones')
# delimiters = captures.get('delimiters')
# time_periods = captures.get('time_periods')
@@ -169,9 +195,16 @@ def extract_date_strings_inner(self, text, text_start=0, strict=False):
if len(digits) == 3: # 12-05-2015
complete = True
elif (len(months) == 1) and (
len(digits) == 2
len(digits) == 2
): # 19 February 2013 year 09:10
complete = True
elif (len(years) == 1) and (len(digits) == 2): # 09/06/2018
complete = True

elif (
(len(years) == 1) and (len(months) == 1) and (len(digits) == 1)
): # '19th day of May, 2015'
complete = True

if not complete:
continue
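
Review note on the hunk above: the `strict=True` completeness rules can be read as a small predicate over the capture groups. The sketch below restates them as a standalone function. `is_complete` is a hypothetical name, and the shape of the `captures` dict (lists of matched strings under `digits`, `months`, and `years`) is an assumption based on the `captures.get(...)` calls in the diff.

```python
def is_complete(captures):
    """Hypothetical restatement of the strict-mode checks in the diff.

    `captures` is assumed to map group names to lists of matched strings.
    """
    digits = captures.get("digits") or []
    months = captures.get("months") or []
    years = captures.get("years") or []

    if len(digits) == 3:  # 12-05-2015
        return True
    if len(months) == 1 and len(digits) == 2:  # 19 February 2013
        return True
    if len(years) == 1 and len(digits) == 2:  # 09/06/2018
        return True
    if len(years) == 1 and len(months) == 1 and len(digits) == 1:
        return True  # '19th day of May, 2015'
    return False


print(is_complete({"digits": ["12", "05", "2015"]}))
print(is_complete({"months": ["May"], "digits": ["19"], "years": ["2015"]}))
print(is_complete({"digits": ["12"]}))
```

The two new branches both rely on the `years` capture, which is why the hunk also adds `years = captures.get("years")`.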
@@ -185,12 +218,12 @@ def extract_date_strings_inner(self, text, text_start=0, strict=False):
yield match_str, indices, captures

def tokenize_string(self, text):
'''
"""
Get matches from source text. Method merge_tokens will later compose
potential date strings out of these matches.
:param text: source text like 'the big fight at 2p.m. mountain standard time on ufc.com'
:return: [(match_text, match_group, {match.capturesdict()}), ...]
'''
"""
items = []

last_index = 0
@@ -202,19 +235,19 @@ def tokenize_string(self, text):
group = self.get_token_group(captures)

if indices[0] > last_index:
items.append((text[last_index:indices[0]], '', {}))
items.append((text[last_index : indices[0]], "", {}))
items.append((match_str, group, captures))
last_index = indices[1]
if last_index < len(text):
items.append((text[last_index:len(text)], '', {}))
items.append((text[last_index : len(text)], "", {}))
return items

def merge_tokens(self, tokens):
'''
"""
Makes potential date strings out of matches, got from tokenize_string method.
:param tokens: [(match_text, match_group, {match.capturesdict()}), ...]
:return: potential date strings
'''
"""
MIN_MATCHES = 3
fragments = []
frag = DateFragment()
@@ -264,7 +297,7 @@ def get_token_group(captures):
lst = captures.get(gr)
if lst and len(lst) > 0:
return gr
return ''
return ""

@staticmethod
def split_date_range(text):
@@ -284,7 +317,9 @@ def split_date_range(text):
return parts


def find_dates(text, source=False, index=False, strict=False, base_date=None):
def find_dates(
text, source=False, index=False, strict=False, base_date=None, first="month"
):
"""
Extract datetime strings from text
@@ -306,9 +341,15 @@ def find_dates(text, source=False, index=False, strict=False, base_date=None):
:param base_date:
Set a default base datetime when parsing incomplete dates
:type base_date: datetime
:param first:
Whether to interpret the the first value in an ambiguous 3-integer date
(01/02/03) as the month, day, or year. Values can be `month`, `day`, `year`.
Default is `month`.
:type first: str|unicode
:return: Returns a generator that produces :mod:`datetime.datetime` objects,
or a tuple with the source text and index, if requested
"""
date_finder = DateFinder(base_date=base_date)
date_finder = DateFinder(base_date=base_date, first=first)
return date_finder.find_dates(text, source=source, index=index, strict=strict)
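
Review note on the new `first` parameter: the constructor maps it onto the `dayfirst` and `yearfirst` flags that are passed to `parser.parse`. The sketch below mirrors that mapping and shows the intended effect on an ambiguous `01/02/03`. It uses only the standard library; `first_to_flags` and `interpret` are hypothetical names, and `interpret` is a deliberately simplified stand-in for dateutil (it assumes a 20xx century and skips dateutil's own heuristics).

```python
from datetime import date


def first_to_flags(first="month"):
    """Mirror the constructor logic in the diff: any value other than
    'day' or 'year' leaves both flags False (month-first)."""
    dayfirst = first == "day"
    yearfirst = first == "year"
    return dayfirst, yearfirst


def interpret(a, b, c, first="month"):
    """Simplified stand-in for an ambiguous 3-integer date (assumption:
    two-digit years are in the 2000s; real dateutil is more involved)."""
    dayfirst, yearfirst = first_to_flags(first)
    if yearfirst:
        year, month, day = a, b, c
    elif dayfirst:
        day, month, year = a, b, c
    else:
        month, day, year = a, b, c
    return date(2000 + year, month, day)


# 01/02/03 under each setting
print(interpret(1, 2, 3, first="month"))  # month/day/year
print(interpret(1, 2, 3, first="day"))    # day/month/year
print(interpret(1, 2, 3, first="year"))   # year/month/day
```

One consequence of the mapping as written in the diff is that an unrecognised value such as `first="days"` is not rejected; it silently behaves like `"month"`.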
7 changes: 3 additions & 4 deletions datefinder/constants.py
@@ -4,13 +4,12 @@
POSITIONNAL_TOKENS = r"next|last"
DIGITS_PATTERN = r"\d+"
DIGITS_SUFFIXES = r"st|th|rd|nd"
DAYS_PATTERN = "monday|tuesday|wednesday|thursday|friday|saturday|sunday|mon|tue|tues|wed|thur|thurs|fri|sat|sun"
MONTHS_PATTERN = r"january|february|march|april|may|june|july|august|september|october|november|december|enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre|jan\.?|ene\.?|feb\.?|mar\.?|apr\.?|abr\.?|may\.?|jun\.?|jul\.?|aug\.?|ago\.?|sep\.?|sept\.?|oct\.?|nov\.?|dec\.?|dic\.?"
DAYS_PATTERN = "monday|tuesday|wednesday|thursday|friday|saturday|sunday|mandag|tirsdag|onsdag|torsdag|fredag|lørdag|søndag|mon|tue|tues|wed|thu|thur|thurs|fri|sat|sun|man|tir|tirs|ons|tor|tors|fre|lør|søn"
MONTHS_PATTERN = r"january|february|march|april|may|june|july|august|september|october|november|december|enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre|januar|februar|marts|april|maj|juni|juli|august|september|oktober|november|december|jan[\.\s]|ene[\.\s]|feb[\.\s]|mar[\.\s]|apr[\.\s]|abr[\.\s]|may[\.\s]|maj[\.\s]|jun[\.\s]|jul[\.\s]|aug[\.\s]|ago[\.\s]|sep[^A-Za-z]|sept[\.\s]|oct[\.\s]|okt[\.\s]|nov[\.\s]|dec[\.\s]|dic[\.\s]"
TIMEZONES_PATTERN = "ACDT|ACST|ACT|ACWDT|ACWST|ADDT|ADMT|ADT|AEDT|AEST|AFT|AHDT|AHST|AKDT|AKST|AKTST|AKTT|ALMST|ALMT|AMST|AMT|ANAST|ANAT|ANT|APT|AQTST|AQTT|ARST|ART|ASHST|ASHT|AST|AWDT|AWST|AWT|AZOMT|AZOST|AZOT|AZST|AZT|BAKST|BAKT|BDST|BDT|BEAT|BEAUT|BIOT|BMT|BNT|BORT|BOST|BOT|BRST|BRT|BST|BTT|BURT|CANT|CAPT|CAST|CAT|CAWT|CCT|CDDT|CDT|CEDT|CEMT|CEST|CET|CGST|CGT|CHADT|CHAST|CHDT|CHOST|CHOT|CIST|CKHST|CKT|CLST|CLT|CMT|COST|COT|CPT|CST|CUT|CVST|CVT|CWT|CXT|ChST|DACT|DAVT|DDUT|DFT|DMT|DUSST|DUST|EASST|EAST|EAT|ECT|EDDT|EDT|EEDT|EEST|EET|EGST|EGT|EHDT|EMT|EPT|EST|ET|EWT|FET|FFMT|FJST|FJT|FKST|FKT|FMT|FNST|FNT|FORT|FRUST|FRUT|GALT|GAMT|GBGT|GEST|GET|GFT|GHST|GILT|GIT|GMT|GST|GYT|HAA|HAC|HADT|HAE|HAP|HAR|HAST|HAT|HAY|HDT|HKST|HKT|HLV|HMT|HNA|HNC|HNE|HNP|HNR|HNT|HNY|HOVST|HOVT|HST|ICT|IDDT|IDT|IHST|IMT|IOT|IRDT|IRKST|IRKT|IRST|ISST|IST|JAVT|JCST|JDT|JMT|JST|JWST|KART|KDT|KGST|KGT|KIZST|KIZT|KMT|KOST|KRAST|KRAT|KST|KUYST|KUYT|KWAT|LHDT|LHST|LINT|LKT|LMT|LMT|LMT|LMT|LRT|LST|MADMT|MADST|MADT|MAGST|MAGT|MALST|MALT|MART|MAWT|MDDT|MDST|MDT|MEST|MET|MHT|MIST|MIT|MMT|MOST|MOT|MPT|MSD|MSK|MSM|MST|MUST|MUT|MVT|MWT|MYT|NCST|NCT|NDDT|NDT|NEGT|NEST|NET|NFT|NMT|NOVST|NOVT|NPT|NRT|NST|NT|NUT|NWT|NZDT|NZMT|NZST|OMSST|OMST|ORAST|ORAT|PDDT|PDT|PEST|PET|PETST|PETT|PGT|PHOT|PHST|PHT|PKST|PKT|PLMT|PMDT|PMMT|PMST|PMT|PNT|PONT|PPMT|PPT|PST|PT|PWT|PYST|PYT|QMT|QYZST|QYZT|RET|RMT|ROTT|SAKST|SAKT|SAMT|SAST|SBT|SCT|SDMT|SDT|SET|SGT|SHEST|SHET|SJMT|SLT|SMT|SRET|SRT|SST|STAT|SVEST|SVET|SWAT|SYOT|TAHT|TASST|TAST|TBIST|TBIT|TBMT|TFT|THA|TJT|TKT|TLT|TMT|TOST|TOT|TRST|TRT|TSAT|TVT|ULAST|ULAT|URAST|URAT|UTC|UYHST|UYST|UYT|UZST|UZT|VET|VLAST|VLAT|VOLST|VOLT|VOST|VUST|VUT|WARST|WART|WAST|WAT|WDT|WEDT|WEMT|WEST|WET|WFT|WGST|WGT|WIB|WIT|WITA|WMT|WSDT|WSST|WST|WT|XJT|YAKST|YAKT|YAPT|YDDT|YDT|YEKST|YEKST|YEKT|YEKT|YERST|YERT|YPT|YST|YWT|zzz"
## explicit north american timezones that get replaced
NA_TIMEZONES_PATTERN = "pacific|eastern|mountain|central"
ALL_TIMEZONES_PATTERN = TIMEZONES_PATTERN + "|" + NA_TIMEZONES_PATTERN
DELIMITERS_PATTERN = r"[/\:\-\,\s\_\+\@]+"

# Allows for straightforward datestamps e.g 2017, 201712, 20171223. Created with:
# YYYYMM_PATTERN = '|'.join(['19\d\d'+'{:0>2}'.format(mon)+'|20\d\d'+'{:0>2}'.format(mon) for mon in range(1, 13)])
@@ -163,7 +162,7 @@
STRIP_CHARS = " \n\t:-.,_"

# split ranges
RANGE_SPLIT_PATTERN = r'\Wto\W|\Wthrough\W'
RANGE_SPLIT_PATTERN = r'\Wto\W|\Wthrough\W|\W-\W'

RANGE_SPLIT_REGEX = re.compile(RANGE_SPLIT_PATTERN,
re.IGNORECASE | re.MULTILINE | re.UNICODE | re.DOTALL)
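
Review note on the `RANGE_SPLIT_PATTERN` change: the added `\W-\W` alternative only splits on a hyphen that has a non-word character on both sides, so hyphens inside a numeric date are untouched. A minimal sketch using the pattern and flags from the diff (the sample strings are illustrative assumptions):

```python
import re

# Pattern and flags as they appear in the new version of constants.py.
RANGE_SPLIT_PATTERN = r"\Wto\W|\Wthrough\W|\W-\W"
RANGE_SPLIT_REGEX = re.compile(
    RANGE_SPLIT_PATTERN, re.IGNORECASE | re.MULTILINE | re.UNICODE | re.DOTALL
)

# A hyphen surrounded by spaces now separates the two ends of a range.
print(RANGE_SPLIT_REGEX.split("Dec 1 - Dec 5"))
# Hyphens between digits are word-adjacent, so the date stays in one piece.
print(RANGE_SPLIT_REGEX.split("2015-12-05"))
# The existing 'to' and 'through' separators behave as before.
print(RANGE_SPLIT_REGEX.split("Jan 1 through Jan 5"))
```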
2 changes: 1 addition & 1 deletion setup.py
@@ -23,7 +23,7 @@
# Versions should comply with PEP440. For a discussion on single-sourcing
# the version across setup.py and the project code, see
# https://packaging.python.org/en/latest/single_source_version.html
version="0.7.1",
version="0.7.3",
description="Extract datetime objects from strings",
long_description=long_description,
# The project's main homepage.