algorithm - Predicting missing data values in a database -


I have a database that includes a whole group of records (about 600,000), where some records are missing in some fields Are there. My goal is to find a way to guess based on existing data, whether missing data values ​​(so I can fill them).

One option I see is clustering - that is to represent those records which are completed in the form of points of some places, looking for groups of digits, and then when the anonymous values ​​of data If a record is given, then try to find out if there is a group that is present in it, then the existing data values ​​are consistent. Although this is not possible because some data fields are on a nominal scale (e.g., color) and thus can not be applied in order.

Another idea is that I would like some types of potential models that will estimate the data, train it on existing data, and then use it for extrapolation.

Which algorithms are available to the above, and is there any available software that implements those algorithms (this software is going to be in C #)

Dealing with missing values ​​is a common question to do with the actual meaning of the data.

You can use (expanded) in many ways:

  1. Ignore the data row. This is usually done when the class label is missing (you consider the data mining target classification), or many attributes are missing from the row (not just one).

  2. Use a global stability to fill the unavailable values. such as "unknown", "N / A" or zero infinite it is used because sometimes it does not make sense to try and guess only the missing value. For example, if you have a DB, say, the candidates of college and residence status are missing for some people, filling in it does not matter ...

  3. Meaning of Attribution For example, if the average income of a US family is X, then you can use that value to replace the missing income values.

  4. Use properties for all samples of the same class . Say you have a car pricing DB, which categorizes cars among other things for "luxury" and "low budget" and you are working with the missing prices in the cost area.

  5. Along with the average cost of all luxury cars, the replacement cost of a luxury car is probably more accurate, if you find factors in low budget cars> worth Use data mining algorithms to estimate Value regression can be determined using estimate based tools, Bezian formalities, decision trees, clustering algorithms, that step method # 4 (k-minis \ maiden Etc.) Was used to generate input. ID3 tree generation), they are relatively easy and there are so many examples on the net.

For the package, if you can buy it and you look at Microsoft World SQL Server Analysis Services (SSAS for short), which applies most of the above mentioned above is.

Here are some links to the free data mining software package:

  • WEKA, however, not a C # that learning trees and baysian learning (to use Ruby) a lot It's a good introduction too:
  • / P>

    This is also a Ruby Library which gives me very useful (also for learning): < / P>

    Many of these samples should be online in any language online for these algorithms, so I'm sure you can easily get C # stuff Edit ...

    Edit:

    This forgot me in my original post. This is definitely one if you are playing with data mining ... Download (it SQL Server Analysis Services are required - SSAS - which is not free but you can download a test)

    You should try to easily play and exclude different techniques in Excel and before you apply this stuff yourself again, since you are in the Microsoft ecosystem, You can also go to SSAS based solutions and decide to do this for SQL Server users :)


Comments

Popular posts from this blog

c# - ListView onScroll event -

PHP - get image from byte array -

Linux Terminal Problem with Non-Canonical Terminal I/O app -