class: inverse center middle title-slide .top[ <img src="data:image/png;base64,#media/logos.png" width="100%" /> ] ### Preliminary analysis of crowdsourced sound data with FOSS ##### <span style='color:#D580E6;'>Nicolas Roelandt</span>, P. Aumond, L. Moisan .bottom[ ##### <span style='color:#D580E6;'>_FOSDEM 2023_</span>] ###### Press <span style='color:#D580E6;'>P</span> to access notes --- ## Introduction Traffic noise is a major health concern : -- - 1 million healthy life years (DALYs) lost each year in Western Europe due to traffic noise [WHO 2011](https://intranet.euro.who.int/__data/assets/pdf_file/0008/136466/e94888.pdf) -- - social cost of noise in France estimated at 147 billion euros per year [ADEME 2021](https://librairie.ademe.fr/air-et-bruit/4815-cout-social-du-bruit-en-france.html) --- ## How to find problematic areas ? -- - Direct measure on the whole area is not possible -- - Traditional way is simulation from traffic counts (air, rail, road) and infrastructure .center[<img src="data:image/png;base64,#https://raw.githubusercontent.com/Universite-Gustave-Eiffel/lasso-data-analysis/main/docs/presentations/202208_FOSS4G/media/carte-bruit-bouguenais.png" style="width: 45%" /> ] .center[Map generated with [NoiseModelling](https://noise-planet.org/noisemodelling.html)] ??? simulation: not real data --- ### UMRAE proposal : Capture sound environment with a smartphone app. .pull-left[<img src="data:image/png;base64,#https://noise-planet.org/assets/img/noisecapture/1.2.7/NoiseCapture_Measurement_spectrogram.jpg" style="width: 50%" />] .pull-right[<img src="data:image/png;base64,#https://noise-planet.org/assets/img/noisecapture/1.2.7/NoiseCapture_Results.jpg" style="width: 50%" />] .center[[NoiseCapture is available on F-Droid](https://f-droid.org/en/packages/org.noise_planet.noisecapture)] ??? crowdsourced data: lots of real data, large area coverage, disparate and poor quality material --- ### NoisePlanet Project - NoiseModelling: generate noise maps from Open Source geodata - NoiseCapture : measure and share sound environment - OnoMap : Spatial Data Infrastructure - Community maps .center[<img src="data:image/png;base64,#./media/noise_planet.png" style="width: 55%" /> ] .center[[noise-planet.org](https://noise-planet.org/)] --- layout: false class: inverse center middle ## What can we do with the data collected by the app ? ??? The question is straight forward. In this presentation, I'll speak about the dataset, the trails we are exploring, the results we got so far and some difficulties we had and how to mitigate those. --- ## NoiseCapture dataset -- - 3 years data extraction (2017-2020, still collecting) -- - 260 000 tracks worldwide -- - sound spectrum, tags and gps localization -- - ODC Open Database License v1.0 [data.univ-gustave-eiffel.fr/dataset.xhtml?persistentId=doi:10.25578/J5DG3W](https://data.univ-gustave-eiffel.fr/dataset.xhtml?persistentId=doi:10.25578/J5DG3W) ??? sound spectrum: third of octave, each 1s --- ## How to characterize of the user environment with the collected data ? -- 2 possibilities : -- - from the sound spectrum (ongoing analysis) -- - from the *tags* defined by the contributor ??? notes - from the sound spectrum: probably the hardest. Needs pattern recognition and maybe machine learning. Ongoing work - From the tags: not all tracks have tags. Information can be easily analyse with statistical software. We choose the second approach in a first time because it is easiest to do but the sound spectrum will be analysed in a second stage. If the tags can help to describe the sound spectrum, it will be a real plus. --- ## Database and subset .pull-left[ - 260 422 tracks - 124 363 with tags - 50280 not indoor or tests - 47 412 duration > 5 s - 11 492 in France ] -- .pull-right[<img src="data:image/png;base64,#https://universite-gustave-eiffel.github.io/lasso-data-analysis/articles/plots/tags_repartition.png" style="width: 150%" />] --- ## Toolkit A quite simple one: -- - PostgreSQL/PostGIS -- - R -- - Lots of R packages : Tidyverse, sf, geojsonsf, stats, suncalc... -- - Dependencies : Pandoc, Markdown, Reveal.js, Proj, GEOS, GDAL, etc... ??? The toolkit is relatively simple, PostgreSQL/Postgis because the data source is a PostgreSQL dump. So in order to access it, you have to set up a database to store the data. R is my work horse for many years, it is use, among other tools, by the UMRAE team. The complexity is underneath. The large amount of R package make reproducibilty complex. --- layout: true ### What do we found in the dataset ? --- #### Well known temporal sound source dynamics -- .pull-left[ <img src="data:image/png;base64,#https://github.com/Universite-Gustave-Eiffel/lasso-data-analysis/blob/main/docs/presentations/202208_FOSS4G/media/animals-tags.png?raw=true" style="width: 100%" /> .center[Bird songs at dawn] ] -- .pull-right[ <img src="data:image/png;base64,#https://github.com/Universite-Gustave-Eiffel/lasso-data-analysis/blob/main/docs/presentations/202208_FOSS4G/media/roads-tags.png?raw=true" style="width: 100%" /> .center[Commuters traffic noise] ] ??? The graph on the left shows the proportion of the "animals" tags around sunrise time (the center is the sunrise time of the day it was recorded). We can see a peak during the 1 hour before the sunrise. It is a well known temporal dynamic for bird songs. On the right, it is the proportion of "roads" tags (often associated with traffic noise) in local time. We can see two peaks around 8 to 9 AM and 18 to 19 PM. It is very similar to what is observed with commute times. --- #### Physical events -- .pull-left[ <img src="data:image/png;base64,#https://github.com/Universite-Gustave-Eiffel/lasso-data-analysis/blob/main/docs/presentations/202208_FOSS4G/media/wind-tags.png?raw=true" style="width: 90%" /> r(7) = .93 (p < 0.01) between `wind` tag proportion and the measured wind force ] -- .pull-right[ <img src="data:image/png;base64,#https://github.com/Universite-Gustave-Eiffel/lasso-data-analysis/blob/main/docs/presentations/202208_FOSS4G/media/rain-tags.png?raw=true" style="width: 90%" /> r(6) = 0.68 (p < 0.1) between the `rain` tag proportion and the measured rain fall ] ??? In a second time, we looked if the presence of certain tags related to physical events like the rain or the wind where coherent with the measure recorded by the national weather service. On the left, we can see the proportion of the tag "wind" regarding the wind force (Beaufort scale). The correlation is very good : O.93. On the right, this the proportion of rain tag regarding the measured rainfall. The correlation is not as good : 0.68. --- layout: true ## Reproducible Science is an issue --- ### Good - [Data available](https://data.univ-gustave-eiffel.fr/dataset.xhtml?persistentId=doi:10.25578/J5DG3W) - [Source code available](https://github.com/Universite-Gustave-Eiffel/lasso-data-analysis) (SQL scripts and R notebooks) - Setup available -- ### Bad - Some notebooks needs work on reproducibility (and code factoring) - Information on software environment is too scarce (and hard to reuse) --- ### Some avenues of investigation - R package [Renv](https://rstudio.github.io/renv/articles/renv.html) - [Docker](https://www.docker.com/) - [Guix](https://guix.gnu.org/) ??? - Renv: R package that can store and recreate an R environment (R and packages versions), limited to R - Docker : small virtual machines that can be re-build and/or re-run. There is R and Postgis images ready to use - Guix: to my knowledge the best way to reproduce a software environment --- layout: false class: inverse center middle # Conclusion --- layout: true # Conclusion --- - Crowdsourced data can be useful for science - This dataset is usable - FOSS are <span style='color:#D580E6;'>key for Reproducible Science</span> - Reproducible Science is <span style='color:#D580E6;'>hard to achieve</span> - Notebooks **are not enough** -- .center[ [data.univ-gustave-eiffel.fr/dataset.xhtml?persistentId=doi:10.25578/J5DG3W](https://data.univ-gustave-eiffel.fr/dataset.xhtml?persistentId=doi:10.25578/J5DG3W) <img src="data:image/png;base64,#https://noise-planet.org/assets/img/logos/noise_planet.png" style="width: 30%" /> [noise-planet.org](https://noise-planet.org/) ] ??? - This dataset is usable and there is still a lot to study - FOSS are a not only useful for the industry and science but <span style='color:#D580E6;'>key for Reproducible Science</span> - Reproducible Science is <span style='color:#D580E6;'>hard to achieve</span> and has to take in account as soon as possible within the project - Notebooks can be use as laboratory notebooks and can help reproductibility but **are not enough** --- layout: false class: center inverse .center[ <img src="data:image/png;base64,#media/logos.png" width="100%" /> ] # Thanks! Nicolas Roelandt - Univ. Gustave Eiffel [nicolas.roelandt@univ-eiffel.fr](mailto:nicolas.roelandt@univ-eiffel.fr) [@NRoelandt@sciences.re](https://social.sciences.re/@NRoelandt) This presentation : https://s.42l.fr/FOSDEM2023-LASSO Access to code source : [github.com/Universite-Gustave-Eiffel/lasso-data-analysis](https://github.com/Universite-Gustave-Eiffel/lasso-data-analysis) Detailed articles and notebooks : [universite-gustave-eiffel.github.io/lasso-data-analysis/articles/](https://universite-gustave-eiffel.github.io/lasso-data-analysis/articles/) .bottom-left[ ###### Slides created via the R packages [xaringan](https://github.com/yihui/xaringan) and [gadenbuie/xaringanthemer](https://github.com/gadenbuie/xaringanthemer) ]