QuantNet, an online GitHub-based organization, is an integrated environment consisting
of different types of statistics-related documents and program codes called Quantlets.
The QuantNet Style Guide and the yamldebugger package enable standardized auditing and
validation of YAML-annotated software repositories within this organization. The behavior
statistics of QuantNet users are measured with Web Metrics from Google Analytics.
We show how the search queries obtained from Google's metrics can be used in test
collections to calibrate and evaluate the information retrieval (IR) performance of
QuantNet's search engine, QuantNetXploRer. For that purpose, different text mining
(TM) models are examined by means of the new TManalyzer package. Further, we
introduce the Validation Pipeline (Vali-PP) and apply it to the YAML data. Vali-PP
is a functional multi-stage instrument for cluster analysis that provides a multivariate statistical
analysis of the co-occurrence distribution of the pipeline's driving factors. The
new package rgithubS, which enables a GitHub-wide search for code and repositories using
the GitHub Search API and which is an essential element of the QuantNet Mining
infrastructure, is briefly presented.
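As a rough illustration of what a search against the GitHub Search API looks like, the following Python sketch builds request URLs for the API's public `/search/code` and `/search/repositories` endpoints. It is not the rgithubS interface (rgithubS is an R package whose functions are not described here). The helper name `build_search_url`, its parameters, and the organization name used in the example are assumptions made for illustration only.

```python
from urllib.parse import urlencode

GITHUB_API = "https://api.github.com"


def build_search_url(kind, terms, qualifiers=None, per_page=30):
    """Build a GitHub Search API URL (hypothetical helper, not part of rgithubS).

    kind       -- "code" or "repositories" (the two search types named above)
    terms      -- list of free-text search terms
    qualifiers -- optional dict of search qualifiers, e.g. {"org": "SomeOrg"}
    per_page   -- number of results requested per page
    """
    if kind not in ("code", "repositories"):
        raise ValueError("kind must be 'code' or 'repositories'")
    parts = list(terms)
    # Qualifiers are appended to the query string in the form key:value.
    for key, value in (qualifiers or {}).items():
        parts.append(f"{key}:{value}")
    query = urlencode({"q": " ".join(parts), "per_page": per_page})
    return f"{GITHUB_API}/search/{kind}?{query}"


# Example: search for the term "yaml" in code, restricted to one organization.
# The organization name is an assumption used only for illustration.
example_url = build_search_url("code", ["yaml"], {"org": "QuantLet"})
```

The sketch only constructs the URL; sending the request, authenticating, and paging through results are left out on purpose.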
The TManalyzer results show that, for all considered single-term queries, the number
of true positives is maximal for the latent semantic analysis model configuration LSA50.
The Vali-PP analysis indicates that the optimality of the combination of LSA50 and hierarchical
clustering (HC) applies to 70–90% of the cluster sizes for most of the considered
quality indices. Further, we can infer that more accurate and comprehensive metadata
increase the clustering quality. Subsequently, the findings of our experimental design are
incorporated into QuantNetXploRer. The GitHub API driven QuantNetXploRer can
be found and mined at http://www.quantlet.de.
Keywords: Code Search, Software Repositories, Text Mining, Information Retrieval, Smart
Data, YAML, GitHub Search API, Google Analytics, Web Metrics, LSA, GVSM, Cluster
Validation, Quality Indices, Validation Pipeline