Thomas Gerald

Defense

At the end of the course, you will present a paper as a group (3 people per group). Your presentation must:

Introduce the main idea and the context
Describe the different contributions
Explain the experimental protocol
Explain the pros and cons of the approach

Session 1 : Introduction

Lectures (slides)
CYK implementation in python

Report Instruction

The objective of these exercises is to introduce PoS (part-of-speech) tagging and syntactic parsing.

Part 1: PoS tagging

For a set of sentences, you will apply different PoS tagging approaches:

Manual annotation
spaCy annotation (https://spacy.io/): pip install -U spacy, then python -m spacy download en_core_web_sm (to download the small English model)
sciSpaCy (https://allenai.github.io/scispacy/): pip install scispacy

For manual part-of-speech annotation, the tags are limited to the following list (https://universaldependencies.org/u/pos/):

ADJ, ADP, AUX, DET, NOUN, NUM, PROPN, PUNCT, SCONJ, VERB

The objective is to compare the different PoS tagging approaches and to discuss their strengths and weaknesses.
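One possible way to organize the comparison is sketched below in Python. The helper names (`spacy_tags`, `agreement`, `disagreements`) are hypothetical, and the second tag sequence in the usage lines is invented for illustration, not actual tagger output. The sketch also assumes that both annotations use the same tokenization, which you should check when comparing manual tags with spaCy or sciSpaCy output.

```python
def spacy_tags(sentence, model_name="en_core_web_sm"):
    """Return (token, UPOS tag) pairs from a spaCy pipeline.

    Requires spaCy and the named model to be installed beforehand.
    """
    import spacy  # imported lazily so the helpers below work without spaCy

    nlp = spacy.load(model_name)
    return [(token.text, token.pos_) for token in nlp(sentence)]


def agreement(tags_a, tags_b):
    """Fraction of tokens on which two tag sequences agree."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotations must cover the same tokens")
    if not tags_a:
        return 0.0
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def disagreements(tokens, tags_a, tags_b):
    """List the (token, tag_a, tag_b) triples where the annotations differ."""
    return [(tok, a, b) for tok, a, b in zip(tokens, tags_a, tags_b) if a != b]


# Illustration only: one manual reading and one hypothetical alternative reading.
tokens = ["Time", "flies", "like", "an", "arrow", "."]
manual = ["NOUN", "VERB", "ADP", "DET", "NOUN", "PUNCT"]
other = ["NOUN", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]

print(agreement(manual, other))
print(disagreements(tokens, manual, other))
```

Listing the disagreeing tokens, rather than reporting only a score, makes it easier to discuss where each approach is strong or weak.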

Part 2: Syntactic Parsing

In this exercise, you will reuse the previously annotated samples and apply the CYK algorithm for constituency parsing. In particular, you will run the algorithm with the different possible PoS tags for each word and produce the corresponding constituency trees. The main objective is to find a grammar that yields the parse you want (you can take inspiration from "Context-free Grammar and syntactic tree: Example", slide 33). You should discuss how the results depend on the PoS tags and on the context-free grammar chosen. Note that the non-terminals are limited to:

VP (verbal phrase)
NP (noun phrase)
PP (prepositional phrase)
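As a rough illustration of how CYK combines PoS tags into constituents, here is a minimal Python sketch. It is not the course implementation: the function name `cyk`, the nested-tuple tree encoding, the extra start symbol S, and every rule in `BINARY_RULES` are assumptions made for this example. The grammar is written directly in Chomsky normal form, with the PoS tags acting as pre-terminals, and punctuation is dropped from the input.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form: (left child, right child) -> possible parents.
# S is an assumed start symbol; NP, VP and PP are the allowed phrase labels.
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("DET", "NOUN"): ["NP"],
    ("NP", "PP"): ["NP"],
    ("ADP", "NP"): ["PP"],
    ("VERB", "NP"): ["VP"],
    ("VERB", "PP"): ["VP"],
    ("VP", "PP"): ["VP"],
}


def cyk(words, candidate_tags, rules=BINARY_RULES):
    """Return a mapping symbol -> list of trees covering the whole sentence.

    candidate_tags[i] lists the PoS tags allowed for words[i], so several
    readings of an ambiguous word can be explored in a single run.
    """
    n = len(words)
    # chart[i][j] maps a symbol to all trees that span words[i:j]
    chart = [[defaultdict(list) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for tag in candidate_tags[i]:
            chart[i][i + 1][tag].append((tag, word))
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for split in range(start + 1, end):
                left_cell = chart[start][split]
                right_cell = chart[split][end]
                for left_sym, left_trees in list(left_cell.items()):
                    for right_sym, right_trees in list(right_cell.items()):
                        for parent in rules.get((left_sym, right_sym), []):
                            for lt in left_trees:
                                for rt in right_trees:
                                    chart[start][end][parent].append((parent, lt, rt))
    return chart[0][n]


words = "The spy saw the cop with the telescope".split()
tags = [["DET"], ["NOUN"], ["VERB"], ["DET"], ["NOUN"], ["ADP"], ["DET"], ["NOUN"]]
trees = cyk(words, tags)["S"]
print(len(trees))  # 2: the PP attaches either to the verb phrase or to "the cop"
for tree in trees:
    print(tree)
```

With this toy grammar, the two trees differ only in the rule that consumes the PP (VP -> VP PP versus NP -> NP PP). Removing one of these rules forces a single attachment, which is one way to study how the grammar choice changes the parse.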

Compare your results with the spaCy syntactic analysis (constituency parser). If you run into dependency problems, you can use the online Berkeley constituency parser instead.

Dataset

General domain:

The cat sat on the couch.
Time flies like an arrow.
The spy saw the cop with the telescope.
The spy saw the cop with the revolver.

Specific domain:

Biology: Arabidopsis thaliana seedlings exhibit longer hypocotyls when they are grown under high ambient temperature, which is defined as thermomorphogenesis.
Astronomy: A spectrogram of PSN J10354824+3900279 obtained on Dec. 19.33 UT suggests that this is a type-Ia at redshift z 0.044.

Report Submission

• Work together in small groups (up to 3 students per group)
• Prepare a final report composed of a header and three sections:
  • Header: first and last names and e-mail addresses of all group members
  • First part: manual analysis of the sentences
  • Second part: automatic analysis
  • Third part: observations
• Send your report (PDF) to thomas.gerald@universite-paris-saclay.fr (by January 16th)

Session 2 : Lemmatization & Tokenization

Lectures (slides)

Report Instruction

Part 1: Create your lemmatizer

In this exercise, the objective is to create your own lemmatizer for the French language. We will test different lemmatization approaches:

Based on a dictionary
Based on a machine learning approach (you can use sklearn, or define your own architecture with PyTorch)
With and without the PoS tag given as input

You should report the performance of each proposed algorithm and compare it to the spaCy lemmatizer.
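To make the dictionary-based variant and the effect of the PoS tag concrete, here is a minimal Python sketch. The function names, the four-entry lexicon, and the fallback policy (first lemma seen for a form, unknown forms returned unchanged) are all assumptions for illustration; a real lexicon would be built from an annotated French resource.

```python
def build_lexicon(entries):
    """Build two lookup tables from (form, pos, lemma) triples."""
    with_pos, without_pos = {}, {}
    for form, pos, lemma in entries:
        key = form.lower()
        with_pos[(key, pos)] = lemma
        # Without a PoS tag, keep the first lemma seen for this form.
        without_pos.setdefault(key, lemma)
    return with_pos, without_pos


def lemmatize(form, pos, lexicon):
    """Look up a lemma; pass pos=None to ignore the PoS tag."""
    with_pos, without_pos = lexicon
    key = form.lower()
    if pos is not None and (key, pos) in with_pos:
        return with_pos[(key, pos)]
    # Unknown forms are returned unchanged (lowercased).
    return without_pos.get(key, key)


def accuracy(samples, lexicon, use_pos=True):
    """Share of (form, pos, gold_lemma) samples lemmatized correctly."""
    correct = sum(
        lemmatize(form, pos if use_pos else None, lexicon) == gold
        for form, pos, gold in samples
    )
    return correct / len(samples)


# Tiny hand-written lexicon: "suis" is ambiguous between être and suivre.
entries = [
    ("suis", "AUX", "être"),
    ("suis", "VERB", "suivre"),
    ("chevaux", "NOUN", "cheval"),
    ("mangeons", "VERB", "manger"),
]
lexicon = build_lexicon(entries)

print(accuracy(entries, lexicon, use_pos=True))   # 1.0
print(accuracy(entries, lexicon, use_pos=False))  # 0.75
```

The drop from 1.0 to 0.75 comes entirely from the ambiguous form "suis", which is the kind of case worth examining when comparing the runs with and without PoS tags.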

Part 2: Implementing BPE algorithm

The second exercise is a guided implementation of the BPE algorithm proposed in "Neural Machine Translation of Rare Words with Subword Units" by Rico Sennrich, Barry Haddow and Alexandra Birch (2016).
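As a point of reference before starting the notebook, the core BPE learning loop can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not the notebook's expected solution: the function names, the tuple-based word representation, and the toy word frequencies are chosen for this example, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency table."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first most frequent pair wins ties
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
print(vocab)
```

Representing each word as a tuple of symbols avoids the string and regular-expression handling that a space-separated representation would need, at the cost of rebuilding the vocabulary dictionary after each merge.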

Report Submission

Work together in small groups (up to 3 students per group)
Submit a single report archive (tar.gz or zip) containing the two notebooks. Each notebook should include a description of your choices and a conclusion.

Text Mining And Chatbot
