Shelkid — Detecting veiled school bullying: scientific dossier

What we hope for from you

We address ourselves, respectfully, to researchers and institutions working on cyberviolence, school climate and natural language processing. Our approach is non-profit, free and not monetised: we are seeking neither funding nor endorsement — only your perspective, and, if the idea speaks to you, a collaboration.

If this seems relevant to you, access to an annotated French cohort of children aged 8 to 13 (consensus ground truth) would be invaluable to us. This is our main blocker: this age range is almost absent from the literature, and all our results, obtained on adolescents, remain to be tested on it.
In time, we would also welcome the perspective of a child psychiatrist to clinically validate the thresholds, since our sensors have been tested only on adults.

If we are turning to you, it is because these two doors — field data and clinical validation — can only be opened with researchers of your experience. We have tried to do our part: build the system, and honestly document its limits. For the rest, your expertise would be infinitely useful to us. We are fully aware of the value of your time, and thank you in advance for the attention you may be willing to give this work.

ML update — 12–13 June 2026 — Six successive evaluation runs on the ML engine (embeddings + logistic regression + 27 structural features).

New results (proof_fusion.py, 27 active features):

Run 4 — absolute record at FP 5% · config SENS+STRUCT · FR+CyberAgressionAdo · recall 61% · bullying 85%.
Data: crashtest-v2 FR + Sprugnoli WhatsApp FR + 3,015 real CyberAgressionAdo-v2 messages. Features B3: silence >2h + night-time hours 2am-5am. Level: semi-real/balanced.
Run 5 — B5 reciprocity regressive · FP 5%→21%. B5 disabled pending multi-participant data.
Data blocker: De Kindertelefoon / krisenchat / TRAILS corpora required.
Run 6 — CyberAgressionAdo-Large (night of 13/06) · config SENS+STRUCT · FR+CyberLarge · FP 2% · recall 62% · bullying 74% (+6 pts vs Run 4).
Data: + 5,608 real CyberAgressionAdo-Large messages (Zenodo, 36 scenarios). The v2+Large combination (bullying=63%) is less effective than Large alone — the two corpora add noise to each other. Provisional champion candidate: FR+CyberLarge. To be confirmed over more runs.
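
As an illustration of the B3 features named in Run 4 (silence >2h, night-time hours 2am-5am), here is a minimal sketch of how such features could be derived from message timestamps. It is not the code of proof_fusion.py: the function name `b3_features`, the output keys and the choice of a count and a share as outputs are assumptions made for the example.

```python
from datetime import datetime, timedelta

SILENCE_GAP = timedelta(hours=2)   # threshold quoted in the text: silence > 2h
NIGHT_START, NIGHT_END = 2, 5      # window quoted in the text: 2am-5am

def b3_features(timestamps):
    """Return hypothetical B3-style features for one conversation.

    timestamps: list of datetime objects, one per message, in any order.
    """
    ts = sorted(timestamps)
    if not ts:
        return {"n_long_silences": 0, "night_share": 0.0}
    # Gaps between consecutive messages that exceed the silence threshold.
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    n_long_silences = sum(1 for gap in gaps if gap > SILENCE_GAP)
    # Share of messages sent inside the night-time window [2am, 5am).
    n_night = sum(1 for t in ts if NIGHT_START <= t.hour < NIGHT_END)
    return {"n_long_silences": n_long_silences, "night_share": n_night / len(ts)}

# Toy conversation (invented timestamps, not project data).
example = [
    datetime(2026, 6, 12, 21, 0),
    datetime(2026, 6, 12, 21, 5),
    datetime(2026, 6, 13, 2, 30),   # arrives 5h25 after the previous message, at night
    datetime(2026, 6, 13, 2, 40),
]
features = b3_features(example)
print(features)
```

Whether the real features are counts, shares or binary flags is not stated in this dossier; the sketch only fixes the two thresholds that are.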

Honesty about the "signature" of the whisper (updated 14/06/2026): we tested two structural-signature hypotheses on real data, and we publish both negative results. (1) "The victim disappears / goes silent" — retracted: it was a parsing artefact of the Sprugnoli corpus (8 conversations out of 10 had a timestamp in place of the speaker). (2) "Convergence" (several against one) — visible on Sprugnoli (role-play, 72%) but NOT confirmed on real data with a control: non-discriminative on Conversations Gone Awry (Cornell, AUC ≈ 0.49) and a minority (≤ 18%) on OAA. Honest conclusion: no structural signature is, to date, validated on real data. The detector relies on content (an explainable classifier, at its known ceiling); the real blocker remains access to longitudinal group data with healthy control groups.

Methodological note: the FP figures quoted (Run 4: 5%, Run 6: 2%) are measured on a balanced semi-real dataset, not on the real SMS/chat register. The two figures are not directly comparable to the 5.40% of the rules engine (§7).

Executive summary

Quantified summary — calibrated, without over-claim

Cross-cutting warning: all our empirical results concern adolescents (12-17 years), never the target group (8-13 years). Pre-adolescence differs qualitatively (indirect aggression peaks around 11-13 years, language and usage change). Each figure below is therefore an optimistic upper bound for the target population, to be recalibrated.

6–21% · WHISPER recall (real)

The true headline metric. Recall on veiled aggression, the tool's reason for being, is 50% (E2b) to 60% (E5) on a balanced 50/50 dataset (semi-real, optimistic); on real data (gold-adolescent, n=58), it drops to 6-21%, that is 3 to 10× lower. The honest headline metric is therefore this low real recall. The thesis remains unconfirmed as long as this recall is not improved and validated on a sufficient sample.

Level: semi-real + real.

+0.21 · AUC thread − message

Suggestive indication, not foundational. On the same corpus, an isolated message is almost indistinguishable from noise (AUC 0.556) whereas the full thread becomes discriminative (AUC 0.762). But n=39, with neither a DeLong test nor a CI: at this sample size the intervals probably overlap, and 0.556 is not distinguishable from 0.500. Above all, on the only clean held-out set actually tested (n=9, §12), the thread AUC falls to 0.72 with p=0.43, which is non-significant: the 0.762 is a point estimate on a small sample, not an established result. Consistent with a temporal thesis, never a proof.

Level: exploratory / suggestive.

0.574 · inter-rater alpha

0.574 overall, but 0.707 on the thread, above the floor (0.667). Inter-rater agreement (Krippendorff's alpha, n=9 raters) is 0.574 overall; it rises to 0.707 on the full thread and falls to 0.314 on the isolated message. Honest caveat: threads and isolated messages are different conversations (the isolated ones chosen as ambiguous) → consistent with the thesis, not a proof; the clean test (same exchange isolated vs threaded) remains to be done. Label noise caps any supervised metric → the product "flags to a human", never accuses.

Level: real (adolescents).

0/17 · recall of a mini-LLM alone

Documented public regression. GPT-4o-mini, reported at 75% recall on English synthetic cases, dropped to 0/17 in ALARM recall on real French cases. The honest lesson is "synthetic does not transfer to real" (not "never a model alone": large LLMs are otherwise our best AUC scorers).

Level: real (adolescents).

~18–55% · estimated PPV

The decisive figure, and it is bleak. Derived calculation (not measured): with the optimistic 60% recall and 5.40% false positives at the thread level, the positive predictive value is ~18% at 2% prevalence, ~37% at 5%, ~55% at 10%. At the real recall (6-21%), the PPV would be even lower. At realistic prevalence, the majority of alerts would be false alerts on a child. Establishing this figure is a prerequisite for any clinical validation.

Level: derived calculation.

5.40% · real false positives (thread)

Distribution FROZEN. On a real French SMS/chat register, false positives at the thread level fell from 9.44% to 5.40% after a fix (not validated out-of-sample: optimistic, to be re-measured). The 0.00% boasted elsewhere comes from a non-adversarial synthetic control corpus: optimistic, not predictive. No release until the real rate is under control and the thresholds clinically validated.

Level: real (adolescents).

Framing note. The high performance on severe and explicit cases (ALARM) is deliberately removed from the argument: these cases are already well caught by simple rules. Measuring one's performance on the easy part is not the innovation — what matters here is solely the recall on the whisper (50-60% balanced, 6-21% real), acknowledged as low.

The thesis, reframed — what we are really submitting. Our claim is not "we know how to detect veiled bullying": the tests do not yet demonstrate it. It is more fundamental, and more cautious — does relational aggression leave a measurable signature in the dynamics of a relationship, independently of the content of the messages? If so, school bullying is only one application among others (exclusion, isolation, manipulation). It is this question — the observation of a phenomenon, not the performance of a product — that we submit for critique.

Section 1

Problem and stakes

real: external epidemiology · exploratory: technical blockers

School bullying and its digital extension are widespread, and their consequences can be grave, sometimes dramatic (without being systematically lethal). In France, according to self-reported figures, 18% of young people report being cyberbullied (INSEE Références 2025 / DEPP) and 37% report bullying or cyberbullying (e-Enfance/3018, Caisse d'Épargne–Audirep barometer, May 2025); these rates measure self-reported violence in a broad sense, and not bullying in the strict Olweus sense (repetition, duration, intent to harm, power imbalance). Among 6-18-year-old victims, 25% report having thought about suicide or self-harm, 39% among girls (e-Enfance 2025). Suicide is the second leading cause of death among 15-24-year-olds (Santé publique France 2023). International (non-French, 2014-2015) meta-analyses suggest an association with an excess risk of suicidal ideation and behaviours, and not with direct mortality: van Geel et al. (JAMA Pediatrics 2014, n=284,375) report an OR for suicidal ideation of 3.12 for cyberbullying vs 2.16 for traditional bullying; Holt et al. (Pediatrics 2015) report an OR of 2.34 (ideation) and 2.94 (suicidal behaviours).

The blocker is not explicit bullying (insults, threats), already well caught by rules. It is veiled bullying — repeated mockery without insult, silent exclusion, innuendo, group irony. This is exactly the canonical construct of relational aggression (Crick & Grotpeter, 1995; Björkqvist, Lagerspetz & Kaukiainen; Underwood) — a form of violence that targets the tie and the status, whether it travels through a direct or indirect channel. Its charge is not in the words but in the repetition, the relationship and the effect on the target — which makes it invisible to lexical filtering. An additional structural blocker: end-to-end encryption (iMessage, Signal, WhatsApp native on iOS) makes these channels opaque (0% detection possible). Any solution must therefore operate where the text is readable by the child themselves: on their device.

Figure 1 — Scale of the phenomenon in France (public sources)

Self-reported figures (not bullying in the strict Olweus sense). Sources: INSEE Références / DEPP 2025; e-Enfance/3018, Caisse d'Épargne–Audirep barometer, May 2025. Suicide is the second leading cause of death among 15-24-year-olds (Santé publique France 2023).

Figure 2 — Excess suicide risk associated with (cyber)bullying (odds ratios, meta-analyses)

Excess risk of suicidal ideation and behaviours (ideation ≠ mortality; international studies, non-French). Sources: van Geel et al., JAMA Pediatrics 2014 (n=284,375); Holt et al., Pediatrics 2015 (meta-analysis). Justifies the clinical stakes of early detection.

Section 2

The data: consensus annotation of a public French corpus

real — annotation

Data is the project's true bottleneck, not the model. Our ground-truth foundation is CyberAgressionAdo (Ollagnier et al., CNRS, Zenodo 14770265): 5,608 messages from real French adolescents' conversations, anonymised. This corpus is public and published — what is new is our annotation layer by multi-rater consensus, not the corpus itself. We attribute the contribution clearly: protocol and labels, never the resource.

Composition annotated by consensus: 202 conversations (118 "whisper", 61 "clear (nothing-to-report)", 23 "ambiguous"). An ultra-strict subset of 69 "gold" conversations serves as the highest-quality benchmark.

Two distinct setups not to be confused. The "median of 177 votes per conversation" comes from the gamified crowd collection (anonymous annotators) — it is what produces the consensus labels. Krippendorff's alpha (section 5) is computed on a panel of 9 raters, a different setup. Honest consequence: the alpha of 0.574 concerns the panel, not the labels actually trained on (which come from the crowd) — the agreement of the crowd collection has not been measured, so the real label-noise floor of the training labels may be worse than 0.574. The construct validity of anonymous crowd votes on subtle bullying is itself debatable — an acknowledged limitation.

In parallel, a gamified collection campaign fed the annotation: 16 games, 6,088 responses over 12.5 days (27 May – 9 June), and ~194 root first-names (an estimate that may over-count participants, to be refined). This count and that of the annotation measure different things and do not combine arithmetically.

Derived extensions — to be downgraded to exploratory / illustrative

Two synthetic datasets exist and must, for consistency with the lesson of the 0/17 ("synthetic does not transfer to real"), be treated as purely exploratory/illustrative, never evaluative: GroomingFR-Synth v1 (9,002 synthetic messages, CC BY-NC 4.0) and a non-adversarial control corpus of 720 synthetic neutral messages. The latter serves to illustrate the false-positive behaviour, but its 0.00% is optimistic and not predictive (section 7).

Figure 3 — Consensus-annotated corpus (202 consensus conversations)

Class	n	Share	Level
Whisper (veiled aggression)	118	58%	real
Clear (nothing-to-report)	61	30%	real
Ambiguous	23	11%	real
Consensus total	202	100%	real
Ultra-strict gold subset	69	—	real
Labels via crowd collection (median votes/conv.)	177	—	crowd setup
Panel for Krippendorff's alpha (raters)	9	—	panel setup

Real level (adolescents). Source: CyberAgressionAdo, Ollagnier et al., CNRS (Zenodo 14770265), internal consensus annotation. Labels = crowd; alpha = panel of 9 raters: two distinct setups. Enriched annotation base (58% positives) ≠ natural prevalence (assumed 2-10%, §7) → every class metric (recall, precision, PPV) must be recalibrated on the real prevalence, and appears optimistic by construction on this base.

Section 3

Central result: isolated message versus full thread

exploratory / suggestive — n=39, no inferential test

On the same real corpus and the same metric (AUC, n=39 conversations), an isolated message is almost indistinguishable from noise (AUC 0.556, barely above a surface baseline of 0.527), whereas the full thread becomes discriminative (AUC 0.762, apparent gain +0.21). This is consistent with the idea that the charge of the whisper is in the trajectory, not in the message.

This is not a "proven foundational result". At n=39, neither a paired DeLong test nor a bootstrap CI was performed; the intervals of the two AUCs very probably overlap, and 0.556 is not distinguishable from 0.500. As long as the CI of the difference does not exclude 0, this gain remains suggestive / exploratory. The inferential test remains to be done, ideally consolidated on the 202 consensus conversations with a pre-declared test set (to guard against forking paths: it is undocumented why 39 conversations were used rather than the 69 gold or the 202).
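
The missing inferential step could take the form of a paired bootstrap on the AUC difference. The sketch below is an illustration under stated assumptions, not the project's evaluation code: the function names, the percentile method, the 20/19 class split and the randomly generated scores are all hypothetical, and a DeLong test would be an alternative. The key point is that conversations are resampled jointly, so both scorings are always compared on the same resample.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_auc_diff(labels, scores_a, scores_b, n_boot=2000, seed=0):
    """Percentile 95% CI of AUC(b) - AUC(a), resampling conversations jointly."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # a resample needs both classes for the AUC to be defined
        auc_a = auc(y, [scores_a[i] for i in idx])
        auc_b = auc(y, [scores_b[i] for i in idx])
        diffs.append(auc_b - auc_a)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Synthetic stand-in data (NOT project data): n=39, hypothetical 20/19 class split.
rng = random.Random(1)
labels = [1] * 20 + [0] * 19
message_scores = [rng.random() for _ in labels]            # uninformative scores
thread_scores = [0.3 * y + rng.random() for y in labels]   # mildly informative scores
low, high = paired_bootstrap_auc_diff(labels, message_scores, thread_scores)
print(f"95% CI of the AUC difference: [{low:.3f}, {high:.3f}]")
```

If the resulting interval contains 0, the gain cannot be called established at this sample size, which is the reading already adopted in this section.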

Figure 4 — AUC: isolated message vs full thread (same corpus, n=39)

Exploratory / suggestive level, n=39, with neither a CI nor a DeLong test. Source: CyberAgressionAdo (CNRS). 0.556 is not distinguishable from chance (0.500) at this sample size. Gain +0.21 = consistent trend, not significantly established. Interpretation caveat: this gain may reflect a mere increase in the context provided to the model; demonstrating that it stems from the relational and temporal structure (and not from text volume alone) is an experimental blocker in its own right, and it is the core of the collaboration we are seeking.

Section 4

The progression scale: from surface to on-device

semi-real: re-balanced datasets · real: support corpus

Performance progresses in steps. In AUC: surface 0.527 → small model 0.587 → frozen distilled model 0.684 → on-device distilled model 0.762 → language model ~0.79. The on-device version approaches the language model while running on the device — reading the thread locally, with that content never transmitted nor shown to a human. Important caveats: (1) the 0.762 is a point value on small n, non-significant on the only held-out tested (0.72, p=0.43, §12); (1b) numerical coincidence not to be over-interpreted: the 0.762 of the "full thread" in the ablation (§3) and the 0.762 of the "on-device distilled" model (here, §4) are two distinct measurements (window-vs-message ablation on one side, distillation/fine-tune on the other) that happen to land on the same value — it is not the same run reused; (2) these steps do not form a homogeneous learning curve — each step changes ≥1 condition (data volume, frozen vs fine-tuned model, sometimes a different held-out): it is a comparison of configurations, not a continuous progression measured on a single bench.

Reconciling 0.79 vs 0% (anticipated). The "~0.79" is an unmatched AUC benchmark of large LLMs (GPT-4o: AUC 0.787; Gemini-2.5: AUC 0.802). It does not contradict the "0/17" in section 6: the latter measures the ALARM recall of a mini-model (GPT-4o-mini), a different setup and model. Large LLMs are on the contrary our best AUC scorers — which is why we do not generalise "never a model alone".

On the 3-class classification (macro F1, Test B v2 dataset — 107 scenarios balanced 50/50, hence semi-real: re-scripted excerpts, optimistic metrics, not the natural prevalence), the progression of the fine-tuned models goes from E-ablation 22.8% to E5_QUALITY 81.4%. Total cost of the 5 fine-tunings: ~22-25 USD.

No "robust takeaways". The McNemar test (Holm-Bonferroni correction) is not significant at n=107: the gaps between models are unconfirmed trends, not proofs. "Monolingual beats multilingual" (78.3 vs 74.2) is noise at this stage. Furthermore, alpha=0.574 caps the achievable performance: label uncertainty exceeds the gap between models — reporting three significant figures on such labels is not defensible. The only defensible gap is the ablation (22.8% vs ~78%, enormous), which indicates that data enrichment (ToxiFrench 52,274 + control 720) is necessary.
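
For readers who want to re-run the comparison, here is a minimal sketch of an exact McNemar test followed by a Holm-Bonferroni correction. The discordant counts in the example are invented for illustration and are not the project's results; the function names are ours, and the exact binomial form is an assumption (the dossier does not say which variant of the test was used).

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: scenarios model A gets right and model B gets wrong; c: the reverse.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where it is rejected."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    m = len(p_values)
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down procedure: stop at the first non-rejection
    return reject

# Hypothetical discordant counts for three model pairs (NOT project data).
pairs = [(12, 6), (10, 9), (15, 4)]
p_values = [mcnemar_exact(b, c) for b, c in pairs]
decisions = holm_bonferroni(p_values)
for (b, c), p, rejected in zip(pairs, p_values, decisions):
    print(f"b={b} c={c} p={p:.4f} rejected after Holm: {rejected}")
```

In this toy example the third pair has an uncorrected p-value below 0.05 yet is not rejected once the correction is applied, which is the kind of situation the paragraph above describes.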

Figure 5 — AUC escalation: from surface to language model

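Level: semi-real for the fine-tuned models, real for the corpus. The ~0.79 = unmatched AUC benchmark (GPT-4o 0.787 / Gemini-2.5 0.802). The on-device version (0.762) lands on the same value as the full thread of §3; these are two distinct measurements, not the same run (see caveat 1b above).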

Figure 6 — Macro F1 3 classes per fine-tuned model (Test B v2, n=107, balanced)

Semi-real level (balanced 50/50 dataset, optimistic) — McNemar n.s. at n=107. Gaps < ~6 pts = trends, bounded by label noise (alpha 0.574). Only the gap to the ablation (22.8) is interpretable. Phase C requires ≥196 scenarios.

Section 5

Human reliability: humans too fail on the isolated message

real (adolescents) — n=9 raters, 202 conversations

Before asking a machine to decide, one must know whether humans agree. The overall Krippendorff's alpha is 0.574, CI [0.31; 0.79] — moderate agreement, honestly weak at the bounds, and below the floor of 0.667 the project set itself. The breakdown is telling: on the isolated message, agreement drops to 0.314; on the full thread, it rises to 0.707. On the double-coded subsample of the reliability pilot (18 conversations, 9 raters), thirteen reach ≥80% agreement.
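
To make the agreement figure reproducible, a minimal sketch of Krippendorff's alpha for nominal labels follows. It assumes nominal categories and allows a different number of raters per conversation; whether the project used the nominal or another variant is not stated here, so this is an illustration and not the computation behind the 0.574. The function name and the toy labels are ours.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: list of lists, each holding the labels given to one conversation."""
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a conversation rated once carries no agreement information
        # Every ordered pair of ratings from two different raters, weighted 1/(m-1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # only one category in use: alpha is undefined
    return 1 - observed / expected

# Toy data (NOT project data): two raters, four conversations, two disagreements.
toy_units = [
    ["whisper", "clear"],
    ["whisper", "clear"],
    ["whisper", "whisper"],
    ["clear", "clear"],
]
alpha_toy = krippendorff_alpha_nominal(toy_units)
print(f"alpha = {alpha_toy:.3f}")
```

On the toy data the raters agree on half of the conversations, which is barely better than what the label frequencies alone would produce, so alpha stays close to 0.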

This machine/human parallel is NOT an independent validation. Machine and humans both degrade on the isolated message for the same trivial reason: there is less information in a message than in a thread (confounding with "amount of information"). If the labels were created at the thread level, "the thread predicts better" is partly tautological on both sides. A true corroboration would require an external signal (clinical outcome, target's self-report). We therefore present this parallel as an internal consistency expected by construction, not as proof that "this is the nature of the object".

Figure 7 — Inter-rater agreement (Krippendorff's alpha): isolated message vs thread

Real level (adolescents), n=9 raters, 202 conversations. Doctrinal floor: 0.667. Double-coded subsample: 13 of the 18 pilot conversations (9 raters) at ≥80% agreement. Internal consistency with section 3 — not an independent corroboration.

Blocker #0 — the most fundamental objection, made a topic in its own right. This alpha of 0.574 is not one limit among others: it caps the entire programme. If human raters agree only at this level on what a whisper is, then the ground truth itself is unstable — and one can legitimately ask: what exactly does a model trained on this consensus learn? We embrace this as the first research object, prior to any performance: to co-build an operational definition of the whisper and a robust annotation protocol (explicit guidelines, rater training, adjudication of disagreements) for the 8-13 age group. As long as this agreement is not raised, no supervised metric is fully interpretable.

Section 6

What we have cleanly ruled out

real (adolescents/adults)

Rigour is also measured by what one eliminates. Four avenues were tested and then ruled out on real data; they are documented as negative results, not to be re-proposed.

Keystroke (typing rhythm). Tested on DUX (36 subjects) and EmoSurv (124 adults). Indistinguishable as a reliable emotional marker, consistent with the literature. Ruled out.
French encoders. 4 versions tested, indistinguishable from one another (~0.73 AUC each; the choice of encoder brings no marginal gain). The lever is not the encoder.
Generic veil (simple veiled toxicity, ToxiFrench). On 25,838 examples, AUC 0.488 — statistically indistinguishable from chance (0.500) (an AUC slightly below 0.5 may signal noise or a label inversion, to be checked). Generic toxicity does not capture school whispering.
Mini-LLM alone. GPT-4o-mini, reported at 75% recall on 30 English synthetic cases, dropped to 0/17 in ALARM recall on real French adolescent cases (11.8% on the whisper, 94.1% on the neutral ones). Public communication cancelled.

Scope of the lesson strictly limited. The "0/17" holds for one mini-model, one metric (ALARM recall), n=17. It does not establish "never a model alone" (large LLMs are our best AUC scorers). The true inference is: synthetic does not transfer to real — which, applied honestly, also downgrades our own synthetic datasets (GroomingFR-Synth, control 720) to exploratory. Testing GPT-4o/Gemini on the same 17 cases remains to be done before any doctrinal assertion.

Figure 8 — Avenues tested and ruled out (on real data)

Avenue	Data	Result	Verdict
Keystroke (typing rhythm)	DUX 36 + EmoSurv 124 (adults)	No reliable emotional marker	ruled out
FR encoders (4 versions)	—	~0.73 AUC, indistinguishable from one another	ruled out
Generic veil (ToxiFrench)	25,838 examples	AUC 0.488 ≈ chance (0.500)	ruled out
Mini-LLM alone (ALARM recall)	17 real FR adolescent cases	0/17 (0%)	ruled out

Real level. Failure register documented for the community. The 0.488 is indistinguishable from chance, not a "perfect chance". The 0/17 motivated caution on synthetic data, not the bundle doctrine.

Transfer: no existing corpus captures the whisper (direct evidence, Igor 09/06). We tested whether a detector trained elsewhere "sees" our whisper: (1) generic English implicit hate (ISHate) → AUC 0.58; (2) Ocampo's published peace_hatebert model, as-is → 0.57; (3) a model trained on the French covert aggression of CyberAgressionAdo-Large (0.90 in-domain) → 0.34, clearly below chance (0.500). Caveat (same reservation as for the 0.488): an AUC of 0.34, well below chance, is a likely signal of a label inversion or a bug, to be checked; until the sign is confirmed, this 0.34 cannot serve as proof that the whisper is a "distinct construct". The English transfers (0.58 / 0.57), by contrast, remain usable as evidence. No clear positive transfer is observed, neither from English nor even from French covert aggression. Peer relational whispering is therefore plausibly a distinct construct, poorly covered by current corpora: a candidate justification for blocker no. 1 (annotating this object), to be consolidated. n=58, to be confirmed; sign of the 0.34 to be checked.
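
The reservation about the sign can be checked mechanically: if the positive class is coded the wrong way round, the AUC is mirrored around 0.5. The sketch below shows this on toy scores (not project data); the helper `auc` is our own minimal implementation, used only for illustration.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: scores that rank the classes mostly the wrong way round.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.2, 0.4, 0.1, 0.9, 0.3, 0.8]

as_coded = auc(labels, scores)
labels_flipped = auc([1 - y for y in labels], scores)
print(f"as coded: {as_coded:.3f}, with labels flipped: {labels_flipped:.3f}")
```

Under an inversion, an observed 0.34 would correspond to 0.66 once the sign is corrected, which would change the reading of the French transfer; hence the check comes before any interpretation.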

Figure 9 — Regression of the mini-LLM alone: English synthetic vs French real

Measure	Data	Value
Reported ALARM recall	30 EN synthetic cases	75%
Measured ALARM recall	17 real FR adolescent cases	0/17 (0%)
Real WHISPER recall	real FR adolescent cases	11.8%
Real NOTHING (neutral) recall	real FR adolescent cases	94.1%

Real level (measurement of 30 May). Lesson: English synthetic does not transfer to natural French adolescent text. Scope limited to this model and this metric.

Section 7

False positives, predictive value and distribution freeze

real: FR SMS/chat register · derived calculation: PPV

A detector that cries wolf is unusable. On the synthetic control corpus (720 neutral messages), the false-alert rate is 0.00% — but this corpus is non-adversarial, hence optimistic and not predictive. On a real French register (SMS/chat), the honest measurement is quite different: at the message level, E5 produces ~3.56% false positives; at the thread level, 9.44% initially, brought down to 5.40% after a first fix. Reservation: this fix was not validated on a held-out distinct from the evaluation bench — if it was tuned on the same register, the 5.40% is optimistic, to be re-measured out-of-sample (the real out-of-sample FP would probably be higher). At this stage, the distribution has been FROZEN.

The missing decisive figure: the positive predictive value (PPV)

Reporting recall alone masks the base-rate problem. Derived calculation (Bayes' formula, not measured, prevalence assumptions) from our figures: recall 60%, false positives 5.40% at the thread level. At realistic and low prevalence, the majority of alerts would be false on a child — a decisive figure for a tool affecting minors, and a prerequisite for the clinical validation in section 10. Acknowledged inconsistency in the calculation below: the 60% recall comes from a balanced 50/50 dataset (semi-real, optimistic) while the 5.40% false-positive rate comes from a real French SMS/chat register — these are two distinct datasets, and the real PPV is bleaker. At the real recall (6-21%, mid-range ~15%) with the same 5.40% FP, the PPV at 5% prevalence falls to ~13% (vs ~37% at the optimistic 60% recall): at realistic prevalence, nearly 9 in 10 alerts would be false alerts on a child.

Figure 10 — Estimated PPV according to whisper prevalence (derived calculation, not measured)

Assumed prevalence	Recall	False positives	Estimated PPV	Reading
2%	60%	5.40%	~18%	~4 alerts out of 5 are false
5%	60%	5.40%	~37%	the majority of alerts remain false
10%	60%	5.40%	~55%	barely one alert in two is correct

Level: derived calculation, not measured. PPV = (prev·recall) / (prev·recall + (1−prev)·FP). Assumptions: recall 60% (E5 whisper, balanced semi-real dataset), FP 5.40% (real thread) — two distinct datasets. At the real recall (~15%, gold-adolescent n=58), the PPV drops to ~5% (prev 2%), ~13% (prev 5%), ~24% (prev 10%) — even lower. Without the real base rate of the target population, no FP/recall figure is interpretable.
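
The derived calculation can be reproduced in a few lines. The sketch below applies the formula of the caption to the two recall assumptions discussed in this section; the function name `ppv` is ours, and the inputs are the figures quoted in the text (recall 60% or ~15%, false positives 5.40%, prevalence 2%, 5% and 10%).

```python
def ppv(prevalence, recall, fp_rate):
    """Positive predictive value: prev*recall / (prev*recall + (1-prev)*FP)."""
    true_alerts = prevalence * recall
    false_alerts = (1 - prevalence) * fp_rate
    return true_alerts / (true_alerts + false_alerts)

FP_RATE = 0.054  # 5.40% false positives at the thread level (real register)

# Optimistic balanced recall first, then the mid-range of the real recall.
for recall in (0.60, 0.15):
    for prevalence in (0.02, 0.05, 0.10):
        value = ppv(prevalence, recall, FP_RATE)
        print(f"recall={recall:.0%} prevalence={prevalence:.0%} PPV={value:.1%}")
```

With these inputs the function returns values that round to the figures of the table (~18%, ~37%, ~55%) and of the caption (~5%, ~13%, ~24%). It remains a derived calculation: the recall and the false-positive rate still come from two distinct datasets.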

Severe and explicit cases are out of scope. Detections on severe/explicit cases (open insults, direct threats) are already caught by simple rules and do not constitute the innovation: they are deliberately removed from the argument. Any high performance figure would relate to this trivial scope — the only metric that matters here is the recall on the whisper (50-60% balanced, 6-21% real), still low and acknowledged as such.

Figure 11 — False positives: optimistic synthetic vs honest real, and freeze

Real level for the FR SMS/chat register; the 0.00% synthetic is optimistic and not predictive. Status: distribution FROZEN as long as the real rate is not under control and the thresholds clinically validated. Quarantine: the very flattering internal synthetic figures (1.6%, 0%, threshold curves) are exploratory, outside the conclusions.

Section 8 — unvalidated exploratory extensions

Systemic hypothesis and behavioural sensors

exploratory — outside the validated scope

Section separated on purpose. The project's core of evidence is the conversational trajectory of the whisper. Everything that follows (network hypothesis, grooming detection, 8 behavioural sensors) is an exploratory extension not validated on the target group, presented separately so as not to be confused with an established result. Grooming and veiled bullying are two distinct problems.

Network hypothesis: measuring the deformation of the social tie

A deeper hypothesis guides the R&D: bullying would be a loss of reversibility in the relational network (progressive exclusion, group pressure, asymmetric silence); one would measure not the message but the deformation of the social fabric. This hypothesis rests on an established literature, namely bullying as a group process (Salmivalli et al., 1996) and social ostracism (Williams, 2009), but it has not been tested by us on the target group. That literature is "consistent with" the hypothesis; it never establishes the mechanism, and the hypothesis must not steer the architecture as if it were an established result.

8 behavioural sensors — validated on adults only

Eight sensors are wired in (typing, sleep, mobility, self-censorship, silence, dissonance…), all validated on adults/students, never on minors. A legitimate question from an evaluator: why pile up 8 sensors that cannot be validated on the target group rather than hardening the single signal that works? Honest answer: these are leads, not strengths, and each must be scientifically justified or abandoned. They remain here in reserve, outside the performance conclusions.

Figure 12 — Theoretical anchors of the network hypothesis (established references)

Anchor	Reference	Contribution
Bullying = group process	Salmivalli et al., 1996 (Aggressive Behavior)	roles: ringleader, assistant, reinforcer, bystander, defender
Social ostracism	Williams, 2009	exclusion / silence as a threat mechanism
Signed-graph balance	Cartwright & Harary, 1956	"all against one" = converging negative edges

Level: theoretical anchor (established literature), not a measurement of the project. The network hypothesis has not been tested by us on the 8-13 target group. "Consistent with" — does not steer the architecture as an established result.

Section 9

Ethics as an architectural constraint — and its legal tensions

design commitments — non-negotiable

The ethical principles are wired into the architecture: (1) 100% on-device detection; (2) content is analysed locally, never transmitted, never stored, never shown to a human (a signal is derived, no copy is kept; the content detector reads the text, but locally and without it leaving the device — the latency/reaction layer, for its part, needs no content); (3) nothing to parents automatically; (4) the signal belongs to the child; (5) referral to a trusted third party (not necessarily the parents) and the 3018 helpline. Two technical safeguards: the bundle rule (no isolated sensor ever triggers; 7-day baseline, minimum 3 converging sensors) and the dentelle (lace) lock (across 24 settings, 0 games offered to a child in danger — final clinical adjudication). These are design decisions, not measured results.
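
To make the bundle rule concrete, here is a minimal sketch of how such a gate could be written. Only the two constants taken from the text (7-day baseline, minimum 3 converging sensors) come from the design; the deviation criterion (a z-score against the sensor's own baseline), its threshold, the function names and the toy values are assumptions made for illustration, not the shipped logic.

```python
from statistics import mean, pstdev

BASELINE_DAYS = 7    # from the text: 7-day baseline
MIN_CONVERGING = 3   # from the text: at least 3 converging sensors
Z_THRESHOLD = 2.0    # assumption: deviation needed for a sensor to count

def sensor_deviates(history, today):
    """True if today's value departs from the sensor's own baseline (hypothetical criterion)."""
    if len(history) < BASELINE_DAYS:
        return False  # no complete baseline yet: the sensor stays silent
    baseline = history[-BASELINE_DAYS:]
    spread = pstdev(baseline)
    if spread == 0:
        return False  # flat baseline: z-score undefined, the sketch stays silent
    return abs(today - mean(baseline)) / spread > Z_THRESHOLD

def bundle_signal(histories, today_values):
    """Signal only when enough sensors deviate together; one sensor alone never triggers."""
    deviating = [name for name, history in histories.items()
                 if sensor_deviates(history, today_values[name])]
    return len(deviating) >= MIN_CONVERGING, deviating

# Toy daily values for four of the sensors named in this dossier (invented numbers).
histories = {
    "typing": [10, 11, 9, 10, 10, 11, 9],
    "sleep": [8, 7.5, 8, 8.5, 8, 7.5, 8],
    "silence": [1, 2, 1, 1, 2, 1, 2],
    "mobility": [5, 6, 5, 5, 6, 5, 6],
}
two_deviating = {"typing": 20, "sleep": 4, "silence": 1, "mobility": 5}
three_deviating = dict(two_deviating, silence=6)

print(bundle_signal(histories, two_deviating))
print(bundle_signal(histories, three_deviating))
```

In the first call two sensors deviate and no signal is raised; in the second a third sensor joins them and the gate opens. What counts as "converging" in the real system is a design decision that this sketch does not settle.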

Unresolved legal tensions for 8-13-year-olds — to be addressed, not concealed. The slogan "the child never surveilled" is debatable: local processing remains processing. For this age range, three open questions arise: parental authority (8-13-year-olds are fully subject to it in French law); GDPR consent (the digital age of consent is 15 in France — an 8-13-year-old cannot consent alone, so parental consent is required); mandatory reporting in case of danger. "The signal belongs to the child" comes into direct tension with parental authority. The system must be validated by a GDPR-minors lawyer and possibly the CNIL — without which the academic and clinical partnership will be refused on legal-risk grounds.

Caution bound on the "0/n". Derived calculation (rule of three): a 0 observed on small n is not a guarantee. For the 0/24 dentelle, the upper bound of the 95% CI is ~12.5% (≈ 3/24): one cannot exclude up to ~1 child in 8 being poorly served. For the 0/720 synthetic, the upper bound is ~0.4%. Any "0" therefore means "no case observed, but n insufficient to exclude a residual rate".
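
The rule of three is an approximation of the exact one-sided 95% bound for zero observed events, which solves (1 − p)^n = 0.05. A short sketch comparing the two for the sample sizes quoted above (the function names are ours):

```python
def rule_of_three(n):
    """Approximate 95% upper bound on a rate when 0 events were seen in n trials."""
    return 3 / n

def exact_upper_bound(n, confidence=0.95):
    """Exact one-sided upper bound for 0/n: solve (1 - p)**n = 1 - confidence."""
    return 1 - (1 - confidence) ** (1 / n)

for n in (24, 720):
    approx = rule_of_three(n)
    exact = exact_upper_bound(n)
    print(f"0/{n}: rule of three {approx:.2%}, exact bound {exact:.2%}")
```

For n=24 the exact bound is slightly below the 12.5% given by the rule of three, so the approximation errs on the cautious side; for n=720 the two are practically identical.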

Figure 13 — Wired-in ethical safeguards (design commitments)

Commitment	Implementation	Nature
On-device detection	100%, no server dependency	design
Content transmitted, stored or shown to a human	never — analysed locally, signal derived, no copy	design
Automatic reporting to parents	none	design
Ownership of the signal	the child (in tension with parental authority 8-13)	to be settled in law
Trusted third party + 3018	yes	design
Bundle rule	min. 3 converging sensors, never a single one	design
Dentelle	0/24 settings (CI upper bound ~12.5%)	not guaranteed at small n

Design commitments, not performance measurements. Anti-outing LGBTQ+ and coverage of minorities: gap identified, fixed, not yet measured for effectiveness. Clinical thresholds still to be validated.

Section 10

Roadmap — blockers by criticality

real: blockers · exploratory: sensors

Figure 14 — Blockers in order of criticality

#	Blocker	Why it blocks
1	Annotated FR 8-13 cohort	Field absent from the literature; no external baseline; the whole evidence chain (measured on adolescents) to be re-validated
2	Clinical validation of thresholds (child psychiatrist)	Sensors validated on adults only; dentelle/synthesis thresholds = unmeasured human adjudication → blocks deployment
3	Statistical consolidation (Phase C)	≥196 scenarios (McNemar significance), DeLong on the foundational AUC, ECE/temperature-scaling recalibration, closing the whisper shortfall (50-60% balanced → 6-21% real)
4	End-to-end test on a real physical device	The only end-to-end proof still missing

Real level for the identified blockers; exploratory for the behavioural sensors. Research framework, free, non-monetised child-protection mission.
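Blocker 3 mentions ECE and temperature-scaling recalibration. As a minimal sketch of the ECE part only, the function below computes a binned expected calibration error for a binary detector. The function name, the choice of 10 equal-width bins and the binary formulation (mean predicted probability versus observed positive rate per bin) are assumptions for illustration, not the project's evaluation code.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary predictions.

    probs  : predicted probabilities of the positive class, in [0, 1]
    labels : ground-truth labels, 0 or 1
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        # Weight each bin's calibration gap by its share of the samples.
        ece += len(bucket) / len(probs) * abs(mean_prob - positive_rate)
    return ece


# Hypothetical scores: an over-confident detector versus a calibrated one.
print(expected_calibration_error([0.9, 0.9], [0, 0]))
print(expected_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))
```

With small benches most bins hold few or no items, so the ECE itself is noisy; this is one more reason the statistical consolidation depends on a larger scenario set.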

Section 11

Data and computing resources

public corpora + in-house corpus · controlled infrastructure

The domain's bottleneck is French data. The project combines public corpora, an original corpus annotated by adolescents, and a free, controlled computing infrastructure — a guarantee of reproducibility and of no data leakage.

Figure 15 — Main datasets used (full list: 32 corpora, see bibliography)

Corpus	Volume	Licence / source	Use
CyberAgressionAdo v1/v2 (Ollagnier, Cabrio, Villata, Blaya)	FR, multiparty	CNRS / HAL	core of the engine (FR whisper)
Shelkid in-house corpus	6,088 responses · 16 games · ~194 root first-names (probably fewer — one participant produces several codes; pilot: 28 codes = 9 real children)	own collection, anonymous	adolescent consensus annotation
CoMeRe	132,166 real FR SMS/chat	CC-BY	false-positive bench (real register)
SynBullying	14,222 messages	EN + synthetic	roles, conversational harm
Sprugnoli (translated to FR)	2,192 messages (role-play 12-13 years)	research	annotated exclusion (rare in FR)
Generic veil (ToxiFrench, MLMA-indirect, TRAC, ImplicitHate)	25,838 examples	open licences	transfer test — negative result (AUC 0.488)
4 FR encoders (CamemBERT v1/v2, DistilCamemBERT, mDeBERTa)	—	MIT / open	comparison (indistinguishable ~0.73)

Children's personal data: never published, aggregated only. The 32 referenced corpora and the ~20 evaluated models appear in the bibliography below.

Computing infrastructure — free, dedicated, reproducible

All measurements (reproduction of the E5 champion, distillation, embeddings, fine-tuning) run on a dedicated GPU server, NVIDIA L40S 48 GB (Ollama + PyTorch/CUDA stack), with no dependency on a paid API. This choice supports reproducibility and ensures that no data is transmitted to a third party. It is also consistent with the "on-device" design: the distilled model (AUC 0.762) is sized to run on the child's device, with the content never leaving the phone. The end-to-end test on a physical device has not yet been performed.

Annex — bibliography

References (184)

none invented · consolidated references

References used or cited by the project, classified by domain. References whose precise details remain to be completed will be finalised before any formal publication. Duplicates consolidated.

Bullying, relational aggression & ostracism

1. Björkqvist & Lagerspetz (1988-1992) — indirect aggression. 2. Crick & Grotpeter (1995), Child Development 66(3). 3. Olweus (~1993-1996), rBVQ/OBVQ. 4. Salmivalli et al. (1996), Aggressive Behavior 22(1). 5. Salmivalli & Voeten (2004), PMC11851402. 6. Smith, P. K. — cyberbullying. 7. Salmivalli (1996+) — group roles. 8. Vaillancourt — status/cortisol. 9. Latané & Darley — bystander intervention. 10. Williams (2009) — ostracism. 11. Casper & Card (2017) — meta-analysis. 12. Cyberball (Williams et al.). 13. Sargioti et al. (2022), DABSS, PMC9969485.

Networks & group dynamics

14. Vicsek (1995). 15. Cartwright & Harary (1956). 16. Heider (1946). 17. Ballerini & Cavagna (2012).

Sequential detection & weak signal

18. Neyman-Pearson. 19. Page (1954), CUSUM. 20. Donoho & Jin (2004), Higher Criticism. 21. SPRT. 22. eRisk CLEF 2024/2025 (ERDE), CEUR-WS Vol-3740/4038. 23. SINAI (2025), arXiv:2509.19861.

Self-excitation & contagion

24. Hawkes (1971). 25. Masuda et al. 26. Soni et al. (2020), FM-Hawkes. 27. Rizoiu et al. (2017/2020), arXiv:2006.06167. 28. Yao, Chelmis & Zois (2020/2021), 10.1145/3441141.

Stigmergy, quorum, thresholds & phases

29. Bonabeau & Theraulaz. 30. Keller & Segel (1971). 31. Bassler & Miller. 32. Newman — percolation. 33. Granovetter (1978). 34. Hegselmann & Krause. 35. Noelle-Neumann — spiral of silence. 36. Ising (mean field).

Reciprocity, reconciliation, anthropology

37. Nowak & Sigmund — indirect reciprocity. 38. Ohtsuki — leading-eight. 39. Nowak (1992+). 40. de Waal & van Roosmalen (1979). 41. Strauss (2019) — hyenas. 86. Girard (1972/1982). 87. Minuchin (1974). 88. Cybernetics (feedback). 89. Hatfield — emotional contagion.

Linguistics & pragmatics

42. Searle (1975). 43. Grice (1975). 44. ElSherief et al. (EMNLP 2021), Latent Hatred. 45. Menini & Moretti (FBK/IRIT). 46. Lu et al. (Georgia Tech, 2025) — victim-centered.

NLP / toxicity / hate speech & corpora

47. Van Hee et al. (2018), AMiCA, PLOS ONE. 48. Ollagnier, Cabrio, Villata & Blaya (2022/2024), CyberAgressionAdo, LREC/LREC-COLING/TAL. 49. Cheng, Silva & Liu (HANT). 50. Losada, Crestani & Parapar (eRisk). 51. Danescu-Niculescu-Mizil et al. (Cornell 2018), Conversations Gone Awry. 52. Hartvigsen et al. (~2022), ToxiGen. 53. Sprugnoli (2018, FR transl. 2026), W18-5107. 54. SynBullying. 55. TRAC-1. 56. MLMA. 57. HateCheck. 58. M-Phasis (LREC 2022). 59. CAD (CC-BY-4.0). 60. ConvAbuse (CC-BY-4.0). 61. ImplicitHate. 62. Civil Comments (CC0). 63. CoMeRe (132,166 msgs, CC-BY). 64. "What's up, Switzerland?" (UZH). 65. 88milSMS. 66. textdetox FR. 67. French Hate Superset. 68. ToxiFrench (Sciara et al. 2025), arXiv:2508.11281. 69. Jigsaw / Perspective API. 70. Detoxify / XLM-R toxic.

Transformers & French models

71. Martin et al. (2020), CamemBERT, arXiv:1911.03894. 72. Antoun, Sagot & Seddah (2023), CamemBERTa, arXiv:2306.01497. 73. Antoun et al. (2024), CamemBERT 2.0, arXiv:2411.08868. 74. cmarkea (2022), DistilCamemBERT. 75. Le et al. (2020), FlauBERT, arXiv:1912.05372. 76. He et al. (2021), DeBERTaV3, arXiv:2111.09543. 77. Conneau et al. (2020), XLM-R, arXiv:1911.02116. 78. Caselli et al. (2020), HateBERT, arXiv:2010.12472. 79. Hinton, Vinyals & Dean (2015), Distillation, arXiv:1503.02531.

Grooming & manipulation

80. O'Connell (2003). 81. Street et al. (2024), arXiv:2409.07958. 82. BF-PSR (USP). 83. PAN12/PANC, Vogt et al. (2021), ACL. 84. Park et al. (2025), SCoRL, arXiv:2503.06627. 85. Patronus AI (2024).

Calibration & uncertainty

124. Platt (1999). 125. Zadrozny & Elkan (2002), KDD. 126. Vovk, Gammerman & Shafer (2005). 127. Gal & Ghahramani (2016), arXiv:1506.02142. 128. Lakshminarayanan et al. (2017), arXiv:1612.01474. 129. Manokhin (2024), TACL.

Temporal & multi-actor architectures

130. Gu & Dao (2023), Mamba, arXiv:2312.00752. 131. Jacobs, Van Hee & Hoste (2020), arXiv:2010.06640. 132. Pradhan et al. (2024), ESIHGNN, arXiv:2405.03960. 133. Jiao et al. (2020), PMC8625403. 134. Losada & Crestani (2016), CLEF eRisk.

Additional arXiv (2025-2026)

135. Chehbouni et al., arXiv:2501.12537 (AAAI 2025). 136. Langlais et al. — backtranslation, ~1.7 M synthetic FR tweets. 137. Park et al., arXiv:2503.06627 (NAACL 2025). 138. arXiv:2502.12563 (2025). 139. Sciara et al., arXiv:2508.11281.

Mycology & criticality (counter-examples)

90. Simard — Wood Wide Web (debated hypothesis). 91. Karst, Jones & Hoeksema (2023), Nature Ecology & Evolution. 92. Bak (1996+) — self-organised criticality.

Prevalence & public health (France/Europe)

93. INSEE Références 2025 (SSMSI/DEPP). 94. e-Enfance/Caisse d'Épargne (2024, Audirep). 95. Santé publique France (2023). 96. Senate Report (2021). 97. EU Kids Online (LSE). 98. JRC, European Commission (2025). 99. Görzig, Milosevic & Staksrud (2017). 100. European Child & Adolescent Psychiatry (2022). 101. Tippett & Wolke (2014). 102. Arcep/CREDOC (2024). 103. DEPP (2024). 104. Elfe/Inserm cohort. + van Geel et al. (JAMA Pediatrics 2014, n=284,375); Holt et al. (Pediatrics 2015).

French research (Blaya, Debarbieux)

105. Blaya (2025), La cyberviolence, Que sais-je ?, PUF, ISBN 9782715429345. 106. Blaya, L'école à l'ère du 2.0, HAL halshs-03534707. 107. Blaya et al., CyberAgressionAdo-v1, HAL hal-03765860. 108. Debarbieux, Du climat scolaire (MEN). 109. Debarbieux (2011), Refuser l'oppression quotidienne.

Educational programmes & framework

110. KiVa (Turku). 111. Kärnä et al., PubMed 23659182. 112. Olweus (Norway). 113. Eoullim (Korea). 114. NIER/kokoro (Japan). 115. FUSE (Ireland). 116. Common Sense. 117. CEOP/Internet Matters/MediaSmarts. 118. StopBullying.gov. 119. Be Internet Awesome (Google). 120. eSafety (Australia). 121. Kit ISC « Vivre ensemble » (ANCT). 122. pHARe (éduscol). 123. OK Groomer.

Private data (with consent) & network corpora

140. SNARE (Groningen). 141. Xinyin Chen — Shanghai. 142. China SAOM 2025. 143. Chile/Santiago (RSiena 2019). 144. KiVa (Turku). 145. TRAILS/PROSPER/Add Health (USA). 146. Future Proofing (Australia, n=934). 147. Kalahari Meerkat (Clutton-Brock et al. 2023). 148. Amboseli Baboon (babase). 149. Cayo Santiago macaques (CPRC). 150. Hyenas (Holekamp/Strauss). 151. GSHS (WHO). 152. PISA/TIMSS/KCYPS. 153. Newcomb (1956). 154. Czech cohort (Prague).

Parked EN corpora

155. Reddit (~2M). 156. EDOS (SemEval-2023 t.10). 157. CONDA/Gab. 158. Jigsaw Toxicity (159k).

Law & ethics

159. EU AI Act (Art. 5, 3(34), Annex III). 160. Guidelines C(2025) 5052. 161. GDPR Art. 8/9. 162. CNIL Reco 4 & 5. 163. ICO Children's Code (UK). 164. COPPA revised (USA). 165. Age Appropriate Design Code (UK 2021).

Tools, libraries & miscellaneous

166. R xergm/btergm/RSiena. 167. ewstools. 168. earlywarnings (R). 169. ONNX Runtime Mobile. 170. HF AutoTrain. 171. optimum-cli. 172. Hugging Face. 173. Modal. 174. Zenodo/GitHub/figshare. 175. Ortolang. 176. E. Stark (UZH). 177. A. Ollagnier (UCA). 178. C. Blaya. 179. É. Debarbieux. 180. Franklin (2017), No Oracle. 181. Boettiger-Hastings — prosecutor's fallacy. 182. ROOST Coalition (2025). 183. OFMIN (28,767 reports, 2024). 184. StopNCII.org.

184 references extracted from the project's working documents — none invented. A few precise details will be finalised before formal publication. Duplicates consolidated (CyberAgressionAdo, KiVa, eRisk). Per-file provenance available on request.

Section 12

Inferential tests — what the p-values say, and do not say

real computation — internal GPU server (Igor)

At reviewers' request, we ran the missing inference tests — DeLong on the AUC, bootstrap confidence interval, McNemar between models — and we report the result whatever it is.

Test	Result	Reading
AUC on clean held-out (n=9)	0.72	DeLong p = 0.43: at n=9, the sample size is too small to conclude (95% CI [0.17; 1.00], bootstrap [0.13; 1.00]). Result neither confirmed nor refuted — consistent with the hypothesis, but inconclusive for lack of data.
McNemar (candidate vs distilled)	p = 1.0	No significant difference between models: the observed gap is compatible with noise.
Enlarged bench gold58 (n=58)	0.92 ⚠	Apparently p < 0.001 — but this bench likely overlaps the training data (the frozen champion scored only 0.58 on a fresh bench). Discarded out of caution, consistent with the retraction of eval-gold100. Not retained as proof.

Inferential conclusion, in all honesty: at this stage, our results are compatible with the hypothesis, but the sample size (on an uncontaminated held-out bench) is insufficient to conclude — this is the limiting factor, not (yet) the model. This is precisely what motivates blocker #1: only a larger annotated corpus on the 8-13 target group will allow a decision. For rigour, the high AUCs obtained on re-balanced benches or possibly seen during training are not retained as proofs. Computation run on 9 June 2026 on an internal GPU server (Igor); scripts and outputs kept.
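To illustrate why a held-out bench of n=9 yields such a wide interval, here is a minimal sketch of a percentile bootstrap on the AUC, using only the standard library. The labels and scores are hypothetical toy data, not the project's bench, and the function names, the number of resamples and the seed are assumptions. It does not reproduce the figures in the table above.

```python
import random


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the AUC (labels must be 0 or 1)."""
    auc(labels, scores)  # raises early if one class is missing
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # keep only resamples containing both classes
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[round((1 - level) / 2 * n_boot)]
    hi = stats[round((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Hypothetical toy bench with n=9 (4 positives, 5 negatives).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.35, 0.8, 0.5, 0.3, 0.2, 0.1]

print("AUC:", auc(labels, scores))
print("95% bootstrap interval:", bootstrap_auc_ci(labels, scores))
```

At this sample size a few resampled items are enough to move the AUC substantially, so the interval cannot exclude chance regardless of the model.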

Annex

Acknowledged limitations

What we will not be able to know before a supervised deployment. Three unknowns will remain open even after the pilot: the real false-positive rate on the target population and uses; the real predictive value (for lack of a known base rate); and acceptability to the children themselves. We write this so that no partner learns it after the fact: these are structural limits, not oversights.

Population shift (the deepest gap). No result has been obtained on the 8-13 target group. All empirical results concern adolescents (CyberAgressionAdo) or adults (EmoSurv, DUX). The whole evidence chain — AUC 0.762, alpha 0.574, recalls — is to be re-validated on 8-13-year-olds, where indirect aggression, language and usage differ qualitatively. "Validated on adolescents, intended for children": this is the object of blocker #1.

Small n, no inference on the key results. Foundational AUC on n=39 with neither a DeLong test nor a CI (hence "suggestive", not "foundational"); McNemar not significant on n=107 (Holm-Bonferroni) → the gaps between models are trends, not established differences. Phase C ≥196 scenarios required.
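For readers less familiar with the McNemar test, the sketch below shows the exact (binomial) two-sided version on paired predictions. Only the discordant pairs, where exactly one model is right, carry information. The function names and the toy predictions are hypothetical, and the Holm-Bonferroni correction mentioned above is not included.

```python
from math import comb


def discordant_counts(truth, pred_a, pred_b):
    """Count items where exactly one of the two models is correct."""
    b = sum(1 for t, a, m in zip(truth, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(truth, pred_a, pred_b) if a != t and m == t)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree: no evidence of a difference
    # Under H0, each discordant pair favours either model with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical labels and predictions, for illustration only.
truth = [1, 1, 1, 0, 0, 0, 1, 0]
model_a = [1, 1, 0, 0, 0, 1, 1, 0]
model_b = [1, 0, 1, 0, 1, 0, 1, 0]

b, c = discordant_counts(truth, model_a, model_b)
print("discordant pairs:", b, c)
print("exact McNemar p-value:", mcnemar_exact(b, c))
```

Because the test depends only on discordant pairs, two models that agree on most items need many scenarios before any gap can reach significance.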

Label noise caps the performance. Alpha 0.574 < floor 0.667, CI [0.31; 0.79]. Annotation uncertainty exceeds the gap between models: no supervised metric should be reported to more than 2 significant figures as long as alpha < 0.667.

Weak headline metric: whisper 50-60% recall on a balanced dataset, but 6-21% on real data (gold-adolescent n=58). On real data, it is therefore the majority of subtle cases (≈ 79-94%) that escape the system, not just half. This is the priority line; the thesis remains unconfirmed as long as the whisper is not significantly > chance.

PPV probably mediocre. Derived calculation: ~18% at 2% prevalence, ~37% at 5%, ~55% at 10%. At prevalences of 5% or below, the majority of alerts would therefore be false. Real base rate unknown.
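The derived calculation follows from Bayes' rule. The sketch below is illustrative: the source does not state which operating point was used, so the recall of 60% and the false-positive rate of 5.40% are assumptions, chosen because they reproduce the three quoted values after rounding. The function name is hypothetical.

```python
def ppv(recall: float, false_positive_rate: float, prevalence: float) -> float:
    """Positive predictive value: share of alerts that correspond to true cases."""
    true_alerts = recall * prevalence
    false_alerts = false_positive_rate * (1.0 - prevalence)
    if true_alerts + false_alerts == 0:
        raise ValueError("no alerts are raised at this operating point")
    return true_alerts / (true_alerts + false_alerts)


ASSUMED_RECALL = 0.60   # assumption: not stated with the PPV figures in the source
ASSUMED_FPR = 0.054     # assumption: matches the 5.40% thread-level rate quoted elsewhere

for prevalence in (0.02, 0.05, 0.10):
    value = ppv(ASSUMED_RECALL, ASSUMED_FPR, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {value:.1%}")
```

The dependence on prevalence is steep: with the same detector, halving the base rate roughly halves the share of alerts that are true, which is why the unknown real base rate matters so much.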

Real false positives not under control. 5.40% at the thread level after fix → distribution FROZEN. The 0.00% synthetic is optimistic. Definitive real rate known only in a passive deployment phase.

"Real" vs "semi-real". The 50/50 balanced datasets (36, 107) are re-scripted excerpts → semi-real and optimistic, never the natural prevalence. The provenance of each dataset (39/36/69/107/17) must be traced and a primary dataset pre-declared (anti forking-paths).

Quarantine of the synthetic figures. False alerts at 1.6% then 0%, coverages 92/83%, threshold curves: all synthetic, exploratory, outside the conclusions. For consistency with the lesson of the 0/17, GroomingFR-Synth and the control 720 are downgraded to illustrative.

Unresolved 8-13 legal tensions. Parental authority / GDPR consent (<15 years) / mandatory reporting. Validation by a GDPR-minors lawyer and possibly the CNIL required. "Never surveilled" to be qualified: local processing = processing.

Systemic hypothesis and sensors. The network hypothesis = theoretical anchor (Salmivalli, Williams), not tested by us on the target group; outside the validated scope. 8 sensors validated on adults. Encryption (iMessage/Signal/WhatsApp native on iOS): 0% detection possible. E2E test on a physical device: never performed.

Calibrated vocabulary. "Proof/proven" is reserved for CIs that clearly exclude chance on the target group (none at this stage). Everywhere else: "suggests / consistent with / indication".