Highest Voted Questions - Data Science Stack Exchange

8

votes

1 answer

What tokenizer does OpenAI's GPT3 API use?

I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use, before I submit an API call. Currently I often submit prompts that yield a 'too-many-tokens' error. The closest I got to an answer…

python-3.x tokenization gpt

asked Jul 08 '21 at 18:07

Herman Autore

83
1
3

8

votes

1 answer

What is the difference between "fully developed decision trees" and "shallow decision trees"?

While reading the Ensemble methods section of the scikit-learn docs, I saw that bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods, which usually work best with weak models (e.g.,…

scikit-learn decision-trees ensemble-modeling

asked Jan 11 '16 at 07:07

Mithril

373
6
15

8

votes

2 answers

What is the difference between BERT and RoBERTa?

I want to understand the difference between BERT and RoBERTa. I saw the article below. https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8 It mentions that RoBERTa was trained on 10x more data but I don't…

bert transformer

asked Jul 01 '21 at 11:02

Noman Tanveer

83
1
1
7

8

votes

2 answers

Image clustering by similarity measurement (CW-SSIM)

I'm trying to use scikit-learn and pyssim to cluster a set of fewer than 100 images. The end goal is to place the images into several buckets (clusters) according to the calculated similarity measure, CW-SSIM. The task seems to be trivial,…

machine-learning r python scikit-learn k-means

asked Jan 10 '16 at 19:44

Oleg Puzanov

111
1
4

8

votes

4 answers

How to give name to topics created using LDA?

I have categorized 800,000 documents into 500 categories using Mahout topic modelling. Instead of representing each topic by its top 5 or 10 words, I want to infer a generic name for the group using any existing algorithm. For the…

machine-learning data-mining nlp text-mining topic-model

asked Jan 07 '16 at 04:28

adihere

81
1
1
2

8

votes

2 answers

How to teach neural network a policy for a board game using reinforcement learning?

I need to use reinforcement learning to teach a neural net a policy for a board game. I chose Q-learning as the specific algorithm. I'd like a neural net to have the following structure: layer - rows * cols + 1 neurons - input - values of…

machine-learning neural-network reinforcement-learning q-learning

asked Jan 05 '16 at 13:28

Luke

189
1
11

8

votes

1 answer

Why does a restricted Boltzmann machine (RBM) tend to learn very similar weights?

These are 4 different weight matrices that I got after training a restricted Boltzmann machine (RBM) with ~4k visible units and only 96 hidden units/weight vectors. As you can see, the weights are extremely similar - even black pixels on the face are…

rbm

asked Aug 11 '14 at 21:13

ffriend

2,791
16
18

8

votes

4 answers

How to select a particular column in Spark (pyspark)?

testPassengerId = test.select('PassengerId').map(lambda x: x.PassengerId) I want to select the PassengerId column and make an RDD of it. But .select is not working. It says 'RDD' object has no attribute 'select'.

apache-spark pyspark

asked Jan 03 '16 at 02:10

dsl1990

181
1
1
2

8

votes

1 answer

Coreference Resolution for German Texts

Does anyone know of a library for performing coreference resolution on German texts? As far as I know, OpenNLP and Stanford NLP are not able to perform coreference resolution for German texts. The only tool that I know is CorZu which is a python…

machine-learning nlp

asked Aug 11 '14 at 12:25

Pasmod Turing

463
2
6

8

votes

1 answer

Where exactly does $\geq 1$ come from in the SVM optimization problem constraint?

I've understood that SVMs are binary, linear classifiers (without the kernel trick). They have training data $(x_i, y_i)$ where $x_i$ is a vector and $y_i \in \{-1, 1\}$ is the class. As they are binary, linear classifiers, the task is to find a…

machine-learning svm

asked Dec 26 '15 at 19:42

Martin Thoma

18,630
31
92
167

8

votes

2 answers

Machine Learning: Single input to variable number of outputs

Is there a machine learning algorithm that maps a single input to an output list of variable length? If so, are there any implementations of the algorithm for public use? If not, what do you recommend as a workaround? In my case, the input is a…

machine-learning

asked Dec 24 '15 at 21:09

ricksmt

183
1
5

8

votes

1 answer

Human recognition in images using a HOG descriptor and SVM classifier performs poorly

I'm using a HOG descriptor, coupled with an SVM classifier, to recognise humans in pictures. I'm using the Python wrappers for OpenCV. I've used the excellent tutorial at pyimagesearch, which explains what the algorithm does and furnishes hints on how…

python computer-vision object-recognition

asked Dec 21 '15 at 10:38

martina

255
2
8

8

votes

2 answers

Pylearn2 vs TensorFlow

I am about to dive into a long NN research project and wanted a push in the direction of either Pylearn2 or TensorFlow. As of Dec 2015, has the community started to lean in one direction or the other? This link has given me concern about getting tied to…

machine-learning python neural-network

asked Dec 04 '15 at 14:24

user3155053

183
3

8

votes

1 answer

When do I have to use aucPR instead of auROC? (and vice versa)

I'm wondering whether, to validate a model, it is sometimes better to use aucPR instead of aucROC. Do these cases depend only on the "domain & business understanding"? In particular, I'm thinking about the "unbalanced class problem" where, it seems…

machine-learning data-mining cross-validation model-evaluations

asked Nov 24 '15 at 11:50

jmvllt

619
1
8
15

8

votes

5 answers

Best way to search for a similar document given the ngram

I have a database of about 200 documents whose ngrams I have extracted. I want to find the document in my database that is most similar to a query document. In other words, I want to find the document in the database that shares the most number of…

nlp similarity search information-retrieval

asked Nov 17 '15 at 03:06

okebz

113
4
