Text Mining And Chatbot


Defense (10 min presentation / 5 min questions) - 02/20/2026 - D101

At the end of the class, you will present a paper in groups (3 people per group). Your presentation must at least:
  • Introduce the main ideas and the context (i.e., show that you are aware of the domain)
  • Describe the different contributions
  • Explain the experimental protocol
  • Explain the pros and cons of the approach and of the experimental setting
  • Discuss the results and the impact on the field
  • Conclude

Sample List of proposed scientific papers

You can also choose an article that interests you and send it to us for approval (articles from ACL, EMNLP, or NLP conferences are more likely to be accepted).

Session 1 : Introduction

Report Instruction

The objective of the exercises is to introduce PoS tagging and syntactic parsing.
Part 1: PoS tagging
For a set of sentences, students will apply different PoS tagging approaches:
  • Manual annotation
  • spaCy annotation (https://spacy.io/): pip install -U spacy, then python -m spacy download en_core_web_sm (to download the small model for English)
  • sciSpaCy (https://allenai.github.io/scispacy/): pip install scispacy
For manual part-of-speech annotation, the tags considered are limited to the following list (https://universaldependencies.org/u/pos/):

ADJ, ADP, AUX, DET, NOUN, NUM, PROPN, PUNCT, SCONJ, VERB

The objective is to compare the different PoS tagging approaches and to discuss their strengths and weaknesses.
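As a minimal sketch of how such a comparison could be set up, the snippet below tags a sentence with spaCy and measures token-level agreement between two tag sequences. The helper names (`tag_with_spacy`, `agreement`, `outside_tagset`) are illustrative, and the two annotations of "Time flies like an arrow." are only possible readings, not reference answers.

```python
ALLOWED_TAGS = {"ADJ", "ADP", "AUX", "DET", "NOUN", "NUM",
                "PROPN", "PUNCT", "SCONJ", "VERB"}


def tag_with_spacy(sentence, model="en_core_web_sm"):
    """Return (token, UPOS tag) pairs predicted by a spaCy pipeline."""
    import spacy  # imported lazily so the helpers below work without spaCy

    nlp = spacy.load(model)
    return [(token.text, token.pos_) for token in nlp(sentence)]


def outside_tagset(tags):
    """Tags that fall outside the restricted list used for manual annotation."""
    return set(tags) - ALLOWED_TAGS


def agreement(reference, predicted):
    """Fraction of positions where two tag sequences agree, plus the mismatches."""
    if len(reference) != len(predicted):
        raise ValueError("Tokenizations differ; align tokens before comparing.")
    mismatches = [
        (i, ref, pred)
        for i, (ref, pred) in enumerate(zip(reference, predicted))
        if ref != pred
    ]
    return 1 - len(mismatches) / len(reference), mismatches


# One possible manual annotation of "Time flies like an arrow."
manual = ["NOUN", "VERB", "ADP", "DET", "NOUN", "PUNCT"]
# A hypothetical alternative reading ("time flies" as a compound, "like" as a verb)
alternative = ["NOUN", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]

score, diffs = agreement(manual, alternative)
print(f"agreement = {score:.2f}, mismatches = {diffs}")
```

To compare against a tagger, `[tag for _, tag in tag_with_spacy("Time flies like an arrow.")]` can be passed as `predicted`. An automatic tagger may emit tags outside the restricted list, which `outside_tagset` helps to spot.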
Part 2: Syntactic Parsing
In this exercise, you will use the previously annotated samples and apply the CYK algorithm for constituency parsing. In particular, you will apply the algorithm while considering the different possible PoS tags for each word, and produce the constituency trees. The main objective is to find the right grammar to obtain the parse you want (you can take inspiration from Context-free Grammar and syntactic tree: Example - slide 33). You should discuss how the results vary depending on the PoS tags and the context-free grammar chosen. Note that we limit the non-terminals to:
  • VP (verbal phrase)
  • NP (noun phrase)
  • PP (prepositional phrase)
Compare with the spaCy syntactic analysis (constituency parser). If you run into dependency problem(s), you can use the Berkeley constituency parser online.
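The CYK procedure described in this part can be sketched as follows. This is a minimal illustration, not the expected solution: it assumes a grammar in Chomsky normal form, a start symbol `S` in addition to VP, NP and PP, PoS tags used as pre-terminals, and punctuation removed beforehand. The grammar `GRAMMAR` is a hypothetical example that you would need to adapt.

```python
from collections import defaultdict


def cyk_parse(words, tag_options, binary_rules, start="S"):
    """CYK over a sentence where each word has a set of candidate PoS tags.

    binary_rules maps (left, right) -> set of parent symbols (CNF grammar).
    Returns every tree rooted in `start`, as nested tuples.
    """
    n = len(words)
    # chart[i][j] maps a symbol to the list of trees covering words i..j-1
    chart = [[defaultdict(list) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, (word, tags) in enumerate(zip(words, tag_options)):
        for tag in sorted(tags):
            chart[i][i + 1][tag].append((tag, word))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for left, left_trees in chart[i][k].items():
                    for right, right_trees in chart[k][j].items():
                        for parent in binary_rules.get((left, right), ()):
                            for lt in left_trees:
                                for rt in right_trees:
                                    chart[i][j][parent].append((parent, lt, rt))
    return chart[0][n].get(start, [])


def bracket(tree):
    """Render a tree in bracketed notation."""
    if len(tree) == 2:  # leaf: (tag, word)
        return f"({tree[0]} {tree[1]})"
    return f"({tree[0]} {bracket(tree[1])} {bracket(tree[2])})"


# Hypothetical toy grammar: (left child, right child) -> parents
GRAMMAR = {
    ("NP", "VP"): {"S"},
    ("DET", "NOUN"): {"NP"},
    ("NP", "PP"): {"NP"},
    ("VERB", "NP"): {"VP"},
    ("VP", "PP"): {"VP"},
    ("ADP", "NP"): {"PP"},
}

words = "The spy saw the cop with the telescope".split()
# "saw" is given two candidate tags to illustrate PoS ambiguity
tags = [{"DET"}, {"NOUN"}, {"VERB", "NOUN"}, {"DET"}, {"NOUN"},
        {"ADP"}, {"DET"}, {"NOUN"}]

trees = cyk_parse(words, tags, GRAMMAR)
for tree in trees:
    print(bracket(tree))
```

With this grammar the sentence receives two parses, one attaching the prepositional phrase to the noun phrase "the cop" and one attaching it to the verb phrase. Removing either `("NP", "PP")` or `("VP", "PP")` from the grammar is one way to select a single reading.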
Dataset
General domain:

  • The cat sat on the couch.
  • Time flies like an arrow.
  • The spy saw the cop with the telescope.
  • The spy saw the cop with the revolver.

Specific domain:

  • Biology: Arabidopsis thaliana seedlings exhibit longer hypocotyls when they are grown under high ambient temperature, which is defined as thermomorphogenesis.
  • Astronomy: A spectrogram of PSN J10354824+3900279 obtained on Dec. 19.33 UT suggests that this is a type-Ia at redshift z 0.044.

Report Submission

  • Work together in small groups (up to 3 students per group)
  • Prepare a final report composed of a header and three sections:
      • Header: first and last names and e-mail of all members of the group
      • First part: manual analysis of sentences
      • Second part: automatic analysis
      • Third part: observations
  • Send your report (PDF) to thomas.gerald@universite-paris-saclay.fr (by January 16th)

Session 2 : Lemmatization & Tokenization

Report Instruction

Part 1: Create your lemmatizer
In this exercise, the objective is to create your own lemmatizer for the French language. We will test different lemmatization approaches:
  • Based on a dictionary
  • Based on a machine learning approach (you can use sklearn, or define your own architecture with PyTorch)
  • With and without PoS tags given as input
You should report the performance of the proposed algorithms and compare it to the spaCy lemmatizer.
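As an illustration of the dictionary-based variant, here is a minimal sketch that looks up a lemma with or without a PoS tag and falls back to the surface form for unknown words. The function names and the five-entry lexicon are hypothetical; a real lexicon would be built from an annotated French corpus and evaluated on held-out data.

```python
from collections import Counter, defaultdict


def build_lexicon(entries):
    """Build lookup tables from (form, pos, lemma) triples."""
    with_pos = {}
    lemma_counts = defaultdict(Counter)
    for form, pos, lemma in entries:
        with_pos[(form.lower(), pos)] = lemma
        lemma_counts[form.lower()][lemma] += 1
    # Without a PoS tag, keep the most frequent lemma seen for each form
    without_pos = {
        form: counts.most_common(1)[0][0]
        for form, counts in lemma_counts.items()
    }
    return with_pos, without_pos


def lemmatize(form, lexicon, pos=None):
    """Look up a lemma; unknown forms are returned unchanged (lowercased)."""
    with_pos, without_pos = lexicon
    key = form.lower()
    if pos is not None and (key, pos) in with_pos:
        return with_pos[(key, pos)]
    return without_pos.get(key, key)


def accuracy(examples, lexicon, use_pos):
    """Share of (form, pos, lemma) examples whose lemma is predicted correctly."""
    correct = sum(
        lemmatize(form, lexicon, pos if use_pos else None) == lemma
        for form, pos, lemma in examples
    )
    return correct / len(examples)


# Hypothetical toy data; two forms are ambiguous without a PoS tag
entries = [
    ("portes", "NOUN", "porte"),
    ("portes", "VERB", "porter"),
    ("suis", "AUX", "être"),
    ("suis", "VERB", "suivre"),
    ("chats", "NOUN", "chat"),
]
lexicon = build_lexicon(entries)

print(lemmatize("portes", lexicon, pos="VERB"))
print(accuracy(entries, lexicon, use_pos=True))
print(accuracy(entries, lexicon, use_pos=False))
```

On this toy data the PoS tag is what disambiguates "portes" and "suis", which is the kind of difference the comparison with and without PoS input is meant to expose.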
Part 2: Implementing BPE algorithm
The second exercise is a guided implementation of the BPE algorithm proposed in "Neural Machine Translation of Rare Words with Subword Units" by Rico Sennrich, Barry Haddow and Alexandra Birch (2016).
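For orientation, the merge-learning loop of BPE can be sketched as below. This is a compact illustration under some assumptions, not the guided notebook's solution: words are represented as tuples of symbols with an end-of-word marker `</w>`, ties between equally frequent pairs are broken by first occurrence, and the function names are illustrative.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` by the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency dict."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair, first seen on ties
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy word frequencies
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=5)
print(merges)
print(vocab)
```

The learned merges are applied in the same order at tokenization time, so frequent words such as "low" end up as a single symbol while rarer words stay split into subword units.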

Report Submission

  • Work together in small groups (up to 3 students per group)
  • You should submit one report archive (tar.gz or zip) containing the two notebooks, which should include a conclusion and a description of your choices

Session 5 : Language Model

Report Instruction

Part 1: Create Your Language Model
In this exercise, the objective is to create your own transformer language model (trained for only a few iterations). You will need the RoPE implementation and possibly some pretrained weights with the following parameters:
  • vocabulary size: 32768 (same as in your file)
  • embed_size: 256
  • intermediate_size: 512
  • num_heads: 4
  • hidden_layers: 8
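To make the role of RoPE concrete, here is a small pure-Python sketch of rotary position embeddings applied to a single attention head. It is not the implementation referred to above: it assumes the interleaved pairing convention (dimensions 2i and 2i+1 are rotated together) and a base of 10000, both of which may differ from the provided code. The head dimension follows from the parameters listed (256 / 4 = 64).

```python
import math

# Parameters listed above; HEAD_DIM is derived from them
CONFIG = {
    "vocab_size": 32768,
    "embed_size": 256,
    "intermediate_size": 512,
    "num_heads": 4,
    "hidden_layers": 8,
}
HEAD_DIM = CONFIG["embed_size"] // CONFIG["num_heads"]


def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by position-dependent angles."""
    dim = len(vector)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even head dimension.")
    rotated = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)  # frequency decreases with i
        cos, sin = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated += [x * cos - y * sin, x * sin + y * cos]
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary deterministic query and key vectors for one head
query = [math.sin(i) for i in range(HEAD_DIM)]
key = [math.cos(2 * i) for i in range(HEAD_DIM)]

# The attention score depends only on the relative offset (here 2)
score_a = dot(rope(query, 5), rope(key, 3))
score_b = dot(rope(query, 12), rope(key, 10))
print(HEAD_DIM, score_a, score_b)
```

The two scores agree up to floating-point error because each pair is rotated by an angle proportional to its position, so the query-key dot product only sees the difference between positions.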
Part 2: Fine tuning using LoRA
In this exercise, the objective is to adapt a language model.
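As a rough illustration of the idea behind LoRA, the sketch below adds a trainable low-rank update to a frozen weight matrix, so that the layer computes W x + (alpha / r) B A x. It is written in pure Python for readability; the class name, the rank, the scaling factor and the 256 x 256 layer size are illustrative assumptions, and a practical version would use PyTorch tensors.

```python
import random


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer with a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weights, never updated
        self.scale = alpha / rank
        # A starts with small random values, B with zeros, so the update is
        # zero at initialization and the layer matches the pretrained one
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


rng = random.Random(1)
d = 256  # illustrative layer size
weight = [[rng.gauss(0, 0.02) for _ in range(d)] for _ in range(d)]
layer = LoRALinear(weight, rank=8, alpha=16)

x = [rng.gauss(0, 1) for _ in range(d)]
unchanged = layer.forward(x) == matvec(weight, x)
print(unchanged, layer.trainable_parameters(), d * d)
```

With rank 8, only 4096 values are trained instead of the 65536 entries of the full matrix, which is where the memory savings of this kind of adaptation come from.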

Report Submission

  • Work together in small groups (up to 3 students per group)
  • You should submit one report archive (tar.gz or zip) containing the two notebooks, which should include a conclusion and a description of your choices.

Session 6 : Chatbot

In this lecture, we will review the objectives of and approaches to chatbot systems. We will particularly focus on task-oriented chatbots and Natural Language Understanding (NLU).

Exercises : Intent detection and slot filling

You will find the instructions for the exercise in the notebooks located in the archive. Once the two exercises are complete, send the notebooks (no need to create a PDF report this time). This exercise is optional, and you should prioritize last week's submission on LLM pretraining.
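To give an idea of the data involved in slot filling, the sketch below groups token-level tags into slots. It assumes the common BIO tagging scheme (B- opens a slot, I- continues it, O is outside any slot); the notebooks may use a different format, and the intent and slot names here are hypothetical.

```python
def extract_slots(tokens, tags):
    """Group BIO-tagged tokens into (slot_name, text) pairs."""
    slots = []
    name, words = None, []

    def flush():
        # Close the slot currently being built, if any
        if name is not None:
            slots.append((name, " ".join(words)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != name):
            # A new slot starts (an I- tag with no matching open slot
            # is treated as the start of a slot as well)
            flush()
            name, words = tag[2:], [token]
        elif tag.startswith("I-"):
            words.append(token)
        else:  # "O": outside any slot
            flush()
            name, words = None, []
    flush()
    return slots


# Hypothetical annotated utterance for a task-oriented chatbot
utterance = {
    "intent": "book_flight",
    "tokens": ["book", "a", "flight", "from", "Paris", "to", "New", "York"],
    "tags": ["O", "O", "O", "O", "B-from_city", "O", "B-to_city", "I-to_city"],
}

slots = extract_slots(utterance["tokens"], utterance["tags"])
print(utterance["intent"], slots)
```

Intent detection is then a sentence-level classification problem (one label per utterance), while slot filling is a sequence labeling problem (one tag per token).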