I have a data set in Excel format with account names, reported symptoms, a determined root cause, and a date in month-year format for each row. I am trying to implement a Mahout-like system to determine which symptoms an account is likely to report, using something like user-based similarity. In other words, I want to adapt a recommender system into a deterministic system that flags the probable symptoms an account may report. Instead of ratings, I can use the frequency with which each account reports each symptom. Is it possible to use a programming language or any other software to build such a system?

Here is an example:

Account : X Symptoms : AB, AD, AB, AB

Account : Y Symptoms : AE, AE, AB, AB, EA

For the sake of this example, let's assume that all the dates are this month.

Expected output: Account : X Symptom : AE

Here both accounts have reported AB two or more times. I could fix such a number as a threshold when looking for probable symptoms.

SRS
  • "Is it possible to use a programming language or any other software to build such system" <- Sure. But why are you not using Mahout? Maybe you could rephrase the question a little, but overall what you are about to do with symptoms is not that different from what recommenders do with products. – Regenschein Jan 06 '15 at 21:34
  • I guess so. Technically, I am not recommending symptoms here; I am just pointing out the possible symptoms a user might report. That is why I called it a deterministic system as opposed to a recommendation system. – SRS Jan 06 '15 at 21:51
  • The format of the data needed for Mahout is a bit different from the one that I have now. If I arrange the data set in the format **Account Name**, **Symptom**, **Count**, would that be enough to work with Mahout? – SRS Jan 06 '15 at 21:53
  • Yes, the input format is UserId, ItemId, "Preference". Maybe just give it a try: https://mahout.apache.org/users/recommender/userbased-5-minutes.html – Regenschein Jan 06 '15 at 22:19
  • Is there a way to work with Mahout on data that has userName and SymptomName strings rather than numerical IDs? When I tried to run the program on such data, it reported **Failed to load class "org.slf4j.impl.StaticLoggerBinder"** and a *Number Format Exception* error. – SRS Jan 07 '15 at 17:31
  • It sounds like you just need to write some wrapper functions, or a wrapper program. – shadowtalker Jan 12 '15 at 04:31
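The wrapper suggested in the comments could look something like the sketch below. It assumes the data has already been reduced to (account name, symptom name, count) rows, and it maps each name to an integer so the output matches the numeric UserId, ItemId, Preference layout mentioned above. The function names `encode_ids` and `to_csv` are hypothetical helpers, not part of Mahout.

```python
import csv
import io

def encode_ids(rows):
    """Map string account/symptom names to consecutive integer IDs.

    rows: iterable of (account_name, symptom_name, count) tuples.
    Returns (encoded_rows, account_ids, symptom_ids); the two dicts
    let you translate numeric results back to names later.
    """
    account_ids, symptom_ids = {}, {}
    encoded = []
    for account, symptom, count in rows:
        # setdefault assigns the next free integer the first time a name is seen
        a = account_ids.setdefault(account, len(account_ids) + 1)
        s = symptom_ids.setdefault(symptom, len(symptom_ids) + 1)
        encoded.append((a, s, count))
    return encoded, account_ids, symptom_ids

def to_csv(encoded_rows):
    """Render the encoded rows as comma-separated text, one triple per line."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(encoded_rows)
    return buffer.getvalue()

# Illustrative rows based on the example in the question.
rows = [("X", "AB", 3), ("X", "AD", 1), ("Y", "AE", 2), ("Y", "AB", 2), ("Y", "EA", 1)]
encoded, account_ids, symptom_ids = encode_ids(rows)
csv_text = to_csv(encoded)
```

Keeping the two dictionaries around is the important part: the recommender only ever sees integers, and the dictionaries are what turn its output back into account and symptom names.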

1 Answer

This looks to me like a plain old recommendation problem: the Accounts are the USERS and the Symptoms are the ITEMS. Each time an Account reports a particular Symptom, your system increments a count.

This gives you the following dataset:

ACCOUNT, SYMPTOM, COUNT
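A minimal sketch of building that table, assuming the spreadsheet rows have already been loaded as (account, symptom) pairs, one per report. The sample data is taken from the example in the question, and the variable names are illustrative only.

```python
from collections import Counter

# One (account, symptom) pair per reported row, as in the question's example.
reports = [
    ("X", "AB"), ("X", "AD"), ("X", "AB"), ("X", "AB"),
    ("Y", "AE"), ("Y", "AE"), ("Y", "AB"), ("Y", "AB"), ("Y", "EA"),
]

# Counter increments the count each time the same pair appears.
counts = Counter(reports)

# Flatten into sorted (ACCOUNT, SYMPTOM, COUNT) rows.
table = sorted((account, symptom, n) for (account, symptom), n in counts.items())

for row in table:
    print(row)
```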

Now you can use or implement any sort of recommender system (Mahout is only one option; have you seen MyMediaLite?), or you can even write your own.

Let's reuse your ideas:

  • You'd like to use a user-based similarity.
  • If an Account has shown the same Symptom 2 or more times, it seems to be important.

So you can filter out the (Account, Symptom) pairs with a count lower than 2, and from the rest create the following datasets:

  • User, Item dataset:

ACCOUNT, SYMPTOM

  • Table with a unique column containing all Users:

ACCOUNT

  • Table with a unique column containing all Items:

SYMPTOM
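The filtering step and the three datasets can be sketched as follows, assuming the counts are available as (ACCOUNT, SYMPTOM, COUNT) tuples. `build_tables` and its `min_count` parameter are hypothetical names chosen for illustration.

```python
def build_tables(table, min_count=2):
    """Drop pairs below the threshold and derive the three datasets.

    table: iterable of (account, symptom, count) tuples.
    Returns (pairs, accounts, symptoms), each sorted and de-duplicated.
    """
    # Keep only pairs reported at least min_count times.
    pairs = sorted({(a, s) for a, s, n in table if n >= min_count})
    # Unique accounts and symptoms that survive the filter.
    accounts = sorted({a for a, _ in pairs})
    symptoms = sorted({s for _, s in pairs})
    return pairs, accounts, symptoms

# Counts from the example in the question.
table = [("X", "AB", 3), ("X", "AD", 1), ("Y", "AB", 2), ("Y", "AE", 2), ("Y", "EA", 1)]
pairs, accounts, symptoms = build_tables(table)
```

Note that an account whose symptoms all fall below the threshold disappears from the ACCOUNT table entirely, so the threshold is worth tuning on real data.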

Now you can directly use the User-KNN algorithm from MyMediaLite.

Once the recommender model is trained, you can pass any ACCOUNT as input and it will return a ranked list of the SYMPTOMS most likely to appear.
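To make the idea concrete, here is a rough plain-Python sketch of user-based nearest-neighbour recommendation over the filtered (ACCOUNT, SYMPTOM) pairs. It is not MyMediaLite's implementation or API; the Jaccard similarity, the `recommend` function, and its `k` and `top_n` parameters are assumptions made for illustration.

```python
from collections import defaultdict

def recommend(pairs, account, k=5, top_n=3):
    """Rank symptoms the account has not shown, using its k most similar accounts."""
    # Build account -> set of symptoms.
    items = defaultdict(set)
    for a, s in pairs:
        items[a].add(s)
    target = items.get(account, set())

    # Jaccard similarity between the target and every other account.
    sims = []
    for other, other_items in items.items():
        if other == account:
            continue
        union = target | other_items
        sim = len(target & other_items) / len(union) if union else 0.0
        if sim > 0:
            sims.append((sim, other))
    neighbours = sorted(sims, reverse=True)[:k]

    # Score each unseen symptom by the summed similarity of neighbours that show it.
    scores = defaultdict(float)
    for sim, other in neighbours:
        for s in items[other] - target:
            scores[s] += sim

    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [s for s, _ in ranked[:top_n]]

# Filtered pairs from the example in the question (count >= 2).
pairs = [("X", "AB"), ("Y", "AB"), ("Y", "AE")]
print(recommend(pairs, "X"))  # ['AE']
```

On the question's example, X and Y share AB, so Y is X's only neighbour and Y's remaining symptom AE is what gets suggested for X.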

Note: Initially ignore the time; later you could use it to partition your data into past/future and evaluate the recommendations in a more realistic way. ;-)
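A small sketch of that past/future evaluation, assuming each report carries its date as a (year, month) tuple. The cutoff, the sample dates, and the hit-rate metric are all illustrative choices, and `split_by_month` and `hit_rate` are hypothetical helper names.

```python
def split_by_month(reports, cutoff):
    """Split (account, symptom, (year, month)) rows at a (year, month) cutoff."""
    # Tuples compare element by element, so (2014, 12) < (2015, 1).
    past = [(a, s) for a, s, ym in reports if ym < cutoff]
    future = [(a, s) for a, s, ym in reports if ym >= cutoff]
    return past, future

def hit_rate(predictions, future):
    """Fraction of evaluated accounts with at least one predicted symptom seen later.

    predictions: dict mapping account -> list of predicted symptoms.
    Only accounts that appear in the future data are evaluated.
    """
    actual = {}
    for a, s in future:
        actual.setdefault(a, set()).add(s)
    scored = [a for a in predictions if a in actual]
    if not scored:
        return 0.0
    hits = sum(1 for a in scored if set(predictions[a]) & actual[a])
    return hits / len(scored)

# Made-up dates, purely to show the mechanics.
reports = [
    ("X", "AB", (2014, 11)), ("Y", "AB", (2014, 11)),
    ("Y", "AE", (2014, 12)), ("X", "AE", (2015, 1)),
]
past, future = split_by_month(reports, cutoff=(2015, 1))
# Train on `past`, predict per account, then compare against `future`.
score = hit_rate({"X": ["AE"]}, future)
```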

Augusto