Wikipedia is a free online encyclopedia whose content is created and maintained by its users. In data science it is widely used as a text corpus for NLP and text-processing projects.
Questions tagged [wikipedia]
9 questions
1
vote
0 answers
Search for similar Wikipedia articles based on a set of keywords
I want to solve two questions:
Which Wikipedia articles could be interesting to me based on a list of keywords generated from the search terms I normally use in Google (exported via Google Takeout)?
Which Wikipedia articles could be…
Pascal Widmann
- 23
- 3
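One possible direction for a question like this: rank article texts by TF-IDF cosine similarity against the keyword list. A minimal sketch, assuming the article texts are already available locally; the titles, texts, and keywords below are placeholders:

```python
# Rank Wikipedia articles by TF-IDF cosine similarity to a keyword query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = {
    "Machine learning": "Machine learning is the study of algorithms that improve through experience.",
    "Football": "Football is a family of team sports played with a ball.",
}  # placeholder texts; in practice, fetch these from a dump or the API

keywords = ["machine", "learning", "algorithms"]  # e.g. from Google Takeout

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(articles.values())
query_vec = vectorizer.transform([" ".join(keywords)])

scores = cosine_similarity(query_vec, doc_matrix).ravel()
for title, score in sorted(zip(articles, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {title}")
```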
1
vote
2 answers
IterativeImputer Evaluation
I am having a hard time evaluating my imputation model.
I used an iterative imputer to fill in the missing values in all four columns.
As the estimator for the iterative imputer I am using a random forest model; here is my code for…
StarGit
- 13
- 3
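A common way to evaluate imputation when the real gaps have no ground truth: artificially mask entries whose values are known, impute them, and score the error. A minimal sketch with scikit-learn's IterativeImputer and a random forest estimator; the data here is synthetic:

```python
# Mask known values, impute them, and compare against the ground truth.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))        # stand-in for the four columns

# Artificially hide 10% of the entries so the true values are known.
mask = rng.random(X_true.shape) < 0.10
X_missing = X_true.copy()
X_missing[mask] = np.nan

imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50),
                           random_state=0)
X_imputed = imputer.fit_transform(X_missing)

rmse = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))
print(f"RMSE on masked entries: {rmse:.3f}")
```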
1
vote
0 answers
doc2vec - paragraph or article as document
I'm trying to train a doc2vec model on the German wiki corpus. While looking for best practices, I've found several different ways to create the training data.
Should I split every Wikipedia article by natural paragraph into several…
jonas
- 143
- 4
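Either choice maps onto gensim's API the same way, since a TaggedDocument can hold a whole article or a single paragraph. A minimal sketch with article-level tags, assuming the paragraphs have already been extracted from the dump; the toy German text below is a placeholder. For paragraph-level training, emit one TaggedDocument per paragraph instead:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

articles = {
    "Berlin": ["Berlin ist die Hauptstadt Deutschlands.",
               "Die Stadt hat rund 3,7 Millionen Einwohner."],
}  # placeholder; in practice, parsed from the German wiki dump

# One document per article, tagged with its title.
docs = [TaggedDocument(words=simple_preprocess(" ".join(paras)), tags=[title])
        for title, paras in articles.items()]

model = Doc2Vec(vector_size=100, min_count=1, epochs=20)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
```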
1
vote
1 answer
Minimum number of features for Naïve Bayes model
I keep reading that Naive Bayes needs fewer features than many other ML algorithms. But what is the minimum number of features you actually need to get good results (90% accuracy) with a Naive Bayes model? I know there is no objective answer to…
E. Turok
- 11
- 1
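Since there is no general rule, one practical approach is to measure accuracy as a function of the number of features kept. A minimal sketch using chi-squared feature selection on a public text dataset; the choice of categories and k values is arbitrary:

```python
# Accuracy of Multinomial Naive Bayes vs. number of selected features.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

for k in (10, 100, 1000, 5000):
    pipe = make_pipeline(CountVectorizer(),
                         SelectKBest(chi2, k=k),
                         MultinomialNB())
    score = cross_val_score(pipe, data.data, data.target, cv=5).mean()
    print(f"k={k:>5}: accuracy={score:.3f}")
```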
1
vote
1 answer
How can I use the Wikipedia2vec model to embed my articles' named entities when 40% of the entities are not in Wikipedia?
I have news articles in my dataset containing named entities. I want to use the Wikipedia2vec model to encode the articles' named entities, but some of the entities (around 40%) from our dataset's articles are not present in Wikipedia.
Please suggest…
sajankar9
- 11
- 2
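One common fallback for out-of-vocabulary entities is to average the word vectors of the entity's surface tokens. A minimal sketch with the wikipedia2vec package, assuming a pretrained model file; the file name is a placeholder, and missing lookups are assumed to raise KeyError:

```python
import numpy as np
from wikipedia2vec import Wikipedia2Vec

model = Wikipedia2Vec.load("enwiki_20180420_100d.pkl")  # placeholder model file

def embed_entity(name):
    try:
        return model.get_entity_vector(name)   # entity is in Wikipedia
    except KeyError:
        vecs = []
        for token in name.lower().split():
            try:
                vecs.append(model.get_word_vector(token))
            except KeyError:
                pass                           # token also out of vocabulary
        return np.mean(vecs, axis=0) if vecs else None
```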
0
votes
0 answers
Can a dataset built upon another have a more restrictive license?
I found a dataset built on top of a Wikipedia dump, available in the Huggingface Datasets library. The Wikipedia dump is licensed under CC BY-SA and the Huggingface Datasets library under Apache-2.0, but there is no license specified for the dataset I…
Agata
- 1
0
votes
0 answers
Wikipedia corpus for NLP - Cleaned sentences
I can see many Wikipedia dumps out there.
I am looking for a corpus built from Wikipedia in which every line is one sentence, without any Wikipedia meta tags.
Nathan B
- 241
- 1
- 2
- 5
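If no ready-made corpus fits, one can be built: strip the markup with a tool such as wikiextractor, then split the plain text into one sentence per line. A minimal sketch of the splitting step with NLTK; file names are placeholders, and newer NLTK versions may need the "punkt_tab" resource instead of "punkt":

```python
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

with open("wiki_plaintext.txt", encoding="utf-8") as src, \
     open("wiki_sentences.txt", "w", encoding="utf-8") as dst:
    for paragraph in src:
        paragraph = paragraph.strip()
        if not paragraph:
            continue                    # skip blank lines between articles
        for sentence in sent_tokenize(paragraph):
            dst.write(sentence + "\n")  # one cleaned sentence per line
```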
0
votes
1 answer
Correlation between Wikipedia translated pages and number of in-links looks weird (scatterplot)?
I'm trying to find a correlation measure for the number of Wikipedia pages an entity (an article) has been translated into vs the number of links that point to that page (both measures that can indicate the popularity of a page).
For instance I have
Work,…
Idkwhatnomeis
- 3
- 2
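One likely cause of the "weird" scatterplot is that both counts are heavy-tailed, so most points pile up near the origin on linear axes. A minimal sketch with synthetic data showing how log-log axes can make the relationship visible:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
translations = rng.pareto(2.0, 500) * 5 + 1              # synthetic heavy-tailed counts
inlinks = translations * rng.lognormal(0, 0.5, 500) * 10

plt.scatter(translations, inlinks, s=8, alpha=0.5)
plt.xscale("log")                    # log axes spread out the crowded low end
plt.yscale("log")
plt.xlabel("number of translations")
plt.ylabel("number of in-links")
plt.show()
```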
0
votes
1 answer
What correlation measure for Wikipedia translated pages vs number of in-links?
I'm trying to find a correlation measure for the number of Wikipedia pages an entity (an article) has been translated into vs the number of links that point to that page (both measures that can indicate the popularity of a page). Is it possible to…
Idkwhatnomeis
- 3
- 2
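For heavy-tailed counts like these, Spearman's rank correlation is a reasonable first choice, since it does not assume a linear relationship. A minimal sketch with scipy; the counts below are made up:

```python
from scipy.stats import spearmanr

translations = [3, 12, 45, 2, 88, 7]    # made-up per-article counts
inlinks = [15, 60, 400, 9, 1200, 30]

rho, p_value = spearmanr(translations, inlinks)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```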