python - convert list of list to dataframe

Problem: I want to convert a list of list into a dataframe.

Setup: I have the following list:

``data = [[(1,0.8),(2,0.2)],[(0,0.1),(1,0.3),(2,0.6)],[(0,0.05),(1,0.05),(2,0.3),(3,0.4),(4,0.2)]]``

This is an LDA Document-Topic Probability List from `gensim`, in which each list is a document and each tuple is one of five topic probabilities. (See an earlier question I posted on Stack Overflow here). The first element in the tuple is the topic number; the second element is the probability of that topic for the document.

Note that while some documents (like the 3rd list) can have up to five tuples (topic probabilities), gensim LDA does not output probabilities for topics with a probability below 0.01. Therefore, documents 1 and 2 have fewer than five tuples.

Goal: Use for loops to create a Document-Topic Probability matrix such that:

``ProbMatrix = [(0,0.8,0.2,0,0),(0.1,0.3,0.6,0,0),(0.05,0.05,0.3,0.4,0.2)]``

As noted above, for "missing" tuples (topics), zeros need to be plugged in. Once I get this list, I figure I can use the pandas `DataFrame` constructor to produce my final output (df) such that

``df = pd.DataFrame(ProbMatrix)``

My (Failed) Attempt:

``ProbMatrix = []
for i in data:  # each document i
    for j in i:  # each topic j
        if j[0] == 0:
            ProbMatrix[i,0].append(j[1])
        elif j[0] == 1:
            ProbMatrix[i,1].append(j[1])
        elif j[0] == 2:
            ProbMatrix[i,2].append(j[1])
        elif j[0] == 3:
            ProbMatrix[i,3].append(j[1])
        elif j[0] == 4:
            ProbMatrix[i,4].append(j[1])``

The problem is how I'm referencing ProbMatrix because I'm receiving the following error:

``TypeError: list indices must be integers, not tuple``

Bonus (that is, it'd be even better if you can help):

One problem I've found with gensim LDA is that, as mentioned, it does not output probabilities less than 0.01, even if `minimum_probability = None`. For example, see this earlier post. The example above is illustrative in that the topic probabilities sum to 1 for each document. However, in reality the output looks more like this:

``data = [[(1,0.79),(2,0.2)],  # topic 1 probability 0.79 from 0.8
[(0,0.09),(1,0.3),(2,0.6)],  # topic 0 probability 0.09 from 0.1
[(0,0.05),(1,0.05),(2,0.3),(3,0.4),(4,0.2)]]``

What I'm looking for is, instead of putting zeros in for the unknown topic probabilities, to spread the remaining probability evenly over the missing topics so that the topic probabilities for each document sum to 1. For example, this would result in a ProbMatrix:

``ProbMatrix = [(0.0033,0.79,0.2,0.0033,0.0033),(0.09,0.3,0.6,0.005,0.005),(0.05,0.05,0.3,0.4,0.2)]``

I'm not 100% sure what you are asking, but I think this is what you are looking for to get the `probmatrix` list you showed. You can do it like this:

``````
import pandas as pd

data = [[(1,0.8),(2,0.2)],
        [(0,0.1),(1,0.3),(2,0.6)],
        [(0,0.05),(1,0.05),(2,0.3),(3,0.4),(4,0.2)]]
probmatrix = []

for i in data:                # each document
    tmp = [0,0,0,0,0]         # one slot per topic, zero by default
    for j in i:               # each (topic, probability) tuple
        tmp[j[0]] = j[1]      # topic number picks the slot
    probmatrix.append(tmp)

df = pd.DataFrame(probmatrix)
print(df)

      0     1    2    3    4
0  0.00  0.80  0.2  0.0  0.0
1  0.10  0.30  0.6  0.0  0.0
2  0.05  0.05  0.3  0.4  0.2
``````

Since you know there will only be five elements, you can initialize a `tmp` list with five zeros and just replace the entries that are non-zero.
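If the number of topics is not fixed at five, the same idea can be wrapped in a small helper. This is only a sketch: `to_prob_matrix` and its `num_topics` parameter are names made up here for illustration, and inferring the topic count from the largest topic number seen is an assumption that may not hold for your model.

```python
import pandas as pd

data = [[(1, 0.8), (2, 0.2)],
        [(0, 0.1), (1, 0.3), (2, 0.6)],
        [(0, 0.05), (1, 0.05), (2, 0.3), (3, 0.4), (4, 0.2)]]


def to_prob_matrix(docs, num_topics=None):
    """Return one row per document, with 0.0 for topics that are missing."""
    if num_topics is None:
        # Assumption: the highest topic number seen tells us the topic count.
        num_topics = max(topic for doc in docs for topic, _ in doc) + 1
    rows = []
    for doc in docs:
        row = [0.0] * num_topics   # start every topic at zero
        for topic, prob in doc:
            row[topic] = prob      # overwrite the topics that were reported
        rows.append(row)
    return rows


prob_matrix = to_prob_matrix(data)
df = pd.DataFrame(prob_matrix)
```

Passing `num_topics` explicitly is probably safer when the last topic might be missing from every document.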

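Not sure if it is what you want, but `i` is a document (a list), not an integer index, and you are using it to address `ProbMatrix`. You could make `ProbMatrix = {}` instead of `ProbMatrix = []` to use it as a dictionary, although the key would then have to be hashable, such as the document's position from `enumerate`, rather than the list `i` itself.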
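One way the dictionary idea could be taken further is to turn each document into its own `{topic: probability}` dictionary and let pandas line up the columns. This is a sketch rather than a tested recipe for real gensim output; `records` and `num_topics` are illustrative names, and the topic count of five is taken from the question.

```python
import pandas as pd

data = [[(1, 0.8), (2, 0.2)],
        [(0, 0.1), (1, 0.3), (2, 0.6)],
        [(0, 0.05), (1, 0.05), (2, 0.3), (3, 0.4), (4, 0.2)]]
num_topics = 5  # assumption: the topic count is known up front

# dict(doc) turns [(1, 0.8), (2, 0.2)] into {1: 0.8, 2: 0.2}
records = [dict(doc) for doc in data]

df = (pd.DataFrame(records)
        .reindex(columns=range(num_topics))  # order the topic columns 0..4
        .fillna(0.0))                        # missing topics become 0.0
```

The `reindex` step matters because the columns would otherwise appear in the order the topic numbers are first encountered.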

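You cannot index a list of lists with `[i,j]`: that expression passes the tuple `(i, j)` as the index, which is what the error message is complaining about. You should first have a list of lists. Try:

``````
ProbMatrix[i].append(j[1])  # add a number to the list at row i (i must be an integer index)
``````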

Maybe I didn't get why you need 2 indices. In that case, chain the indices and assign instead of appending:

``````
ProbMatrix[i][j[0]] = j[1]  # row i (an integer), column j[0] (the topic number)
``````
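To make the difference between the two indexing styles concrete, here is a small sketch. The names `matrix`, `error_message` and `prob_matrix` are illustrative only, and the exact wording of the error text varies between Python versions.

```python
data = [[(1, 0.8), (2, 0.2)],
        [(0, 0.1), (1, 0.3), (2, 0.6)],
        [(0, 0.05), (1, 0.05), (2, 0.3), (3, 0.4), (4, 0.2)]]

matrix = [[0.0, 0.0], [0.0, 0.0]]   # a plain list of lists

try:
    matrix[0, 1] = 0.5              # the index here is the tuple (0, 1)
except TypeError as exc:
    error_message = str(exc)        # same kind of error as in the question

matrix[0][1] = 0.5                  # chained indexing: row 0, then column 1

# Applied to the question: enumerate supplies the integer row index.
prob_matrix = [[0.0] * 5 for _ in data]   # one independent row per document
for doc_num, doc in enumerate(data):
    for topic, prob in doc:
        prob_matrix[doc_num][topic] = prob
```

The rows are built with a list comprehension so that each document gets its own list; `[[0.0] * 5] * len(data)` would make every row the same object.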

If you know the desired shape of your output, you can use `np.zeros` to create a zero-filled NumPy array and fill it accordingly.

``````
import numpy as np
import pandas as pd

# `data` is the list of lists from the question
probMatrix = np.zeros(shape=(3,5))  # size of (num docs, k topics)

for doc_num, probs in enumerate(data):
    for k_index, prob in probs:
        probMatrix[doc_num, k_index] = prob
``````

Which will return:

``````
array([[ 0.  ,  0.8 ,  0.2 ,  0.  ,  0.  ],
       [ 0.1 ,  0.3 ,  0.6 ,  0.  ,  0.  ],
       [ 0.05,  0.05,  0.3 ,  0.4 ,  0.2 ]])
``````

Which can be loaded directly into a pandas dataframe if needed, or is pretty useful just as it is.
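For the bonus part of the question, the same array approach might be extended so that the probability left over in each row is split evenly across the topics gensim did not report. This is a sketch under assumptions: `num_topics`, `reported`, `missing` and `leftover` are illustrative names, the data is the second example from the question, and clamping a negative leftover to zero is a choice made here, not something the question specifies.

```python
import numpy as np

data = [[(1, 0.79), (2, 0.2)],
        [(0, 0.09), (1, 0.3), (2, 0.6)],
        [(0, 0.05), (1, 0.05), (2, 0.3), (3, 0.4), (4, 0.2)]]
num_topics = 5  # assumption: the topic count is known up front

prob_matrix = np.zeros((len(data), num_topics))

for doc_num, probs in enumerate(data):
    reported = [topic for topic, _ in probs]
    for topic, prob in probs:
        prob_matrix[doc_num, topic] = prob

    # Topics with no tuple share whatever probability is left over.
    missing = [t for t in range(num_topics) if t not in reported]
    if missing:
        leftover = 1.0 - prob_matrix[doc_num].sum()
        prob_matrix[doc_num, missing] = max(leftover, 0.0) / len(missing)
```

Rows that already list every topic are left untouched, so they only sum to 1 if the reported values do.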