Social Science Computing Unit - Budapest
Founding purpose
Aid the production of valuable social science research involving complex data processing, on large and interconnected datasets.
Our goal is to build and maintain a diverse catalog of complex, mostly large-scale datasets that we can use to create:
- reproducible and extendable multidisciplinary research
- scalable computational methods that generalize well for problems dealing with data concerning human behavior, social networks, geolocation, organizational structure and similar useful topics
- clear rules, procedures and infrastructure for data sharing and access
- connections between researchers of different fields who might benefit from similar data sources or computational approaches
- systems that automatically update datasets and research projects based on them
These datasets are created in response to researcher requests, either by collecting and organizing publicly available data or by processing purchased data.
Challenges
Most challenges arise from the complexity of the intertwined chains of operations applied to the data, which require taking into account factors that are usually absent from social science research. Others arise from access management, where sensitive information needs to be processed in a published and reproducible way.
Most common resulting issues:
- Difficulty of presentation
- Difficulty of reuse, including reproduction and extension
- Redundant, repeated work
- Possible cascading errors
Many of the main issues are expanded in this article
Approach
Operational Practice Borrowed from Software Engineering
- Adopt carefully selected tooling designed for large-scale, complex, data-related software projects.
- Public and searchable task board for transparency and cooperation.
- Clear set of guiding principles like FAIR and JDDCP that can be cited during task prioritization and other decision making.
Openness
All reports are open, all software is free and open-source, and all data has a publicly available version.
Output
Data Projects
Self-hosted datasets that strictly adhere to an internally developed template, are maintained and updated while involved in projects, and are fully contained in a Git repository stored on GitHub.
- contains metadata, storage configuration and description of environments
- all datasets have at least one open and publicly available subset
- the public subset can be scrambled, anonymized, randomly sampled, or produced in any other way
- the public subset must be usable by any project that uses the dataset
- subsets are created using code available with the dataset
- each dataset contains data from only one source
- merging datasets from different sources happens at the project level
- all results are reproducible with one line of code
- dataset subsets are interchangeable with a simple configuration modification
- full pipeline is visualized and documented
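The subset requirements above can be illustrated with a minimal sketch. It is not the unit's actual template: the subset names, the `person_id` and `age` fields, the salt, and the helper functions are all hypothetical, chosen only to show how a public subset might be generated by code shipped with the dataset and how a project might switch subsets by changing one setting. Salted hashing is pseudonymization rather than full anonymization, so real sensitive data would likely need stronger treatment.

```python
import hashlib
import random

# Hypothetical subset definitions; names and fractions are illustrative assumptions.
SUBSETS = {
    "complete": {"fraction": 1.0, "anonymize": False},
    "public_sample": {"fraction": 0.1, "anonymize": True},
}


def pseudonymize(value, salt):
    """Replace an identifier with a truncated, salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def make_subset(records, name, id_field="person_id", salt="demo-salt", seed=42):
    """Build the named subset from a list of dict records."""
    conf = SUBSETS[name]
    if conf["fraction"] >= 1.0:
        chosen = list(records)
    else:
        rng = random.Random(seed)  # fixed seed keeps the sample reproducible
        size = max(1, round(len(records) * conf["fraction"]))
        chosen = rng.sample(records, size)
    if not conf["anonymize"]:
        return [dict(r) for r in chosen]
    # Copy each record and swap the identifier for its pseudonym.
    return [{**r, id_field: pseudonymize(str(r[id_field]), salt)} for r in chosen]


def mean_age(records):
    """Example project step that runs on any subset sharing the same schema."""
    return sum(r["age"] for r in records) / len(records)


# Synthetic stand-in data; a real dataset would be loaded from storage.
records = [{"person_id": f"p{i}", "age": 20 + i % 50} for i in range(1000)]

active_subset = "public_sample"  # the single setting a project would change
subset = make_subset(records, active_subset)
print(len(subset), mean_age(subset))
```

Because every subset keeps the same fields, the project step `mean_age` runs unchanged whether `active_subset` points at the public sample or at the complete data.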
Research Software
Open-source software that is used across many projects and datasets, covering functionality that is worth abstracting into its own package and is not available in existing open-source options.
- tested
- documented
- quality controlled
- open source
Contributions
- aswan
- endremborza (2023-05-07)
- papsebestyen (2022-05-01)
- colassigner
- endremborza (2022-11-23)
- dvc
- endremborza (2021-10-14)
- parquetranger
- endremborza (2023-05-30)
- sqlmermaid
- endremborza (2023-05-07)
- ydata-profiling
- endremborza (2022-04-06)