Text Mining And Chatbot
Defense
- Introduce the main idea and the context
- Describe the different contributions
- Explain the experimental protocol
- Explain the pros and cons of the approach
Session 1 : Introduction
- Lectures (slides)
- CYK implementation in Python
Report Instruction
The objective of the exercises is to introduce PoS tagging and syntactic parsing.
Part 1: PoS tagging
For a set of sentences, students will apply different PoS tagging approaches:
- Manual annotation
- SpaCy annotation (https://spacy.io/): pip install -U spacy
- python -m spacy download en_core_web_sm (to download the small model for English)
- sciSpaCy (https://allenai.github.io/scispacy/): pip install scispacy
ADJ, ADP, AUX, DET, NOUN, NUM, PROPN, PUNCT, SCONJ, VERB
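One possible way to compare two annotations of the same sentence is a token-level agreement score. The sketch below is only an illustration: the helper name `tag_agreement` is hypothetical, and both tag sequences are manual readings of "Time flies like an arrow." written for this example, not the output of any tagger. Tags produced by spaCy (for instance via `token.pos_`) could be passed in the same way.

```python
def tag_agreement(tokens, tags_a, tags_b):
    """Return (agreement ratio, list of (token, tag_a, tag_b) disagreements)."""
    if not (len(tokens) == len(tags_a) == len(tags_b)):
        raise ValueError("tokens and both tag sequences must have the same length")
    disagreements = [
        (tok, a, b) for tok, a, b in zip(tokens, tags_a, tags_b) if a != b
    ]
    matches = len(tokens) - len(disagreements)
    return matches / len(tokens), disagreements


tokens = ["Time", "flies", "like", "an", "arrow", "."]
# Reading 1 (illustrative): "time" is the subject, "flies" the verb.
reading_1 = ["NOUN", "VERB", "ADP", "DET", "NOUN", "PUNCT"]
# Reading 2 (illustrative): "time flies" is a noun compound, "like" the verb.
reading_2 = ["NOUN", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]

ratio, diffs = tag_agreement(tokens, reading_1, reading_2)
print(f"agreement: {ratio:.2f}")
for tok, a, b in diffs:
    print(f"{tok}: {a} vs {b}")
```

Listing the disagreeing tokens, rather than reporting only the ratio, makes it easier to discuss where each approach fails.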
The objective is to compare the different PoS tagging approaches and to discuss their strengths and weaknesses.
Part 2: Syntactic Parsing
In this exercise you will use the previously annotated samples and apply the CYK algorithm for constituency parsing. Notably, you will apply the algorithm considering the different possible PoS tags for each word and produce the constituency trees. The main objective is to find the grammar that yields the parse you want (you can take inspiration from "Context-free Grammar and syntactic tree: Example", slide 33). You should discuss how the results vary with the PoS tags and the context-free grammar chosen. Note that we limit the non-terminals to:
- VP (verb phrase)
- NP (noun phrase)
- PP (prepositional phrase)
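As a starting point, the CYK chart filling can be sketched as below. This is a minimal sketch under several assumptions that are not part of the course material: the toy grammar `BINARY_RULES` is hypothetical, `S` is assumed as the start symbol, PoS tags play the role of pre-terminals, the final period is dropped, and only one tree per symbol and span is kept, so ambiguous parses are not enumerated.

```python
# Hypothetical toy grammar in Chomsky normal form:
# (left child, right child) -> list of possible parent symbols.
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("DET", "NOUN"): ["NP"],
    ("VERB", "NP"): ["VP"],
    ("VERB", "PP"): ["VP"],
    ("ADP", "NP"): ["PP"],
}


def cyk_parse(words, pos_candidates, rules=BINARY_RULES, start="S"):
    """Return one parse tree (nested tuples) rooted in `start`, or None.

    pos_candidates[i] lists the PoS tags allowed for words[i].
    """
    n = len(words)
    # chart[i][j] maps a symbol to one tree covering words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for tag in pos_candidates[i]:
            chart[i][i + 1][tag] = (tag, word)
    for span in range(2, n + 1):          # width of the span
        for i in range(n - span + 1):     # start of the span
            j = i + span
            for k in range(i + 1, j):     # split point
                for left, ltree in chart[i][k].items():
                    for right, rtree in chart[k][j].items():
                        for parent in rules.get((left, right), []):
                            # Keep the first tree found for each symbol.
                            chart[i][j].setdefault(parent, (parent, ltree, rtree))
    return chart[0][n].get(start)


words = ["The", "cat", "sat", "on", "the", "couch"]
# "couch" is given two candidate tags to show how alternatives are handled.
pos_candidates = [["DET"], ["NOUN"], ["VERB"], ["ADP"], ["DET"], ["NOUN", "VERB"]]
tree = cyk_parse(words, pos_candidates)
print(tree)
```

To study attachment ambiguity (for example in the telescope sentence), the chart cells would need to store every tree for a symbol instead of only the first one.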
Dataset
General domain:
- The cat sat on the couch.
- Time flies like an arrow.
- The spy saw the cop with the telescope.
- The spy saw the cop with the revolver.
Specific domain:
- Biology: Arabidopsis thaliana seedlings exhibit longer hypocotyls when they are grown under high ambient temperature, which is defined as thermomorphogenesis.
- Astronomy: A spectrogram of PSN J10354824+3900279 obtained on Dec. 19.33 UT suggests that this is a type-Ia at redshift z 0.044.
Report Submission
- Work together in small groups (up to 3 students per group)
- Prepare a final report composed of a header and three sections:
  - Header: first and last names and e-mail of all members of the group
  - First part: manual analysis of sentences
  - Second part: automatic analysis
  - Third part: observations
- Send your report (PDF) to thomas.gerald@universite-paris-saclay.fr (up to January 16th)
Session 2 : Lemmatization & Tokenization
- Lectures (slides)
Report Instruction
Part 1: Create your lemmatizer
In this exercise, the objective is to create your own lemmatizer for the French language. We will test different lemmatization approaches:
- Based on a dictionary
- Based on a machine learning approach (you can use sklearn) or on your own architecture defined with PyTorch
- With and without the PoS tag given as input
You should report the performance of the proposed algorithms and compare it to the spaCy lemmatizer.
Part 2: Implementing BPE algorithm
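As a reminder of what BPE learns: starting from words split into characters, the most frequent pair of adjacent symbols is merged, and the process is repeated. The sketch below is a minimal illustration in the spirit of Sennrich et al. (2016), not the guided notebook itself; the function names and the tie-breaking rule (first pair encountered) are assumptions, and the word frequencies are a small toy vocabulary.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair by its concatenation."""
    merged = "".join(pair)
    # Match the pair only at symbol boundaries (whitespace-delimited).
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Return the list of learned merges and the final segmented vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Words are space-separated symbols; </w> marks the end of a word.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, segmented = learn_bpe(toy_vocab, 3)
print(merges)
print(segmented)
```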
The second exercise is a guided implementation of the BPE algorithm proposed in "Neural Machine Translation of Rare Words with Subword Units" by Rico Sennrich, Barry Haddow and Alexandra Birch (2016).
Report Submission
- Work together in small groups (up to 3 students per group)
- Submit one report archive (tar.gz or zip) containing the two notebooks (including a conclusion and a description of your choices)