
The General Index


Size: 15.58 GiB
Seeds: 0
Peers: 0
Completed: 116


README file for the General Index

- There are no rights reserved on this public domain data.
- This is an alpha release dated October 4, 2021.
- The General Index was created by Public.Resource.Org, Inc., a 501(c)(3) nonprofit.
- The URL for the General Index is https://archive.org/details/GeneralIndex
- The data files total 4.7 tbytes, but will expand to 37.9 tbytes when unzipped.
- The corpus of 107,233,728 articles has been split into 16 slices, numbered from 0 to f.
- The files in this distribution were created using the Postgres pg_dump command.
- The collection is not complete and text extraction was not always successful.
- The metadata and sample files are here (https://archive.org/download/GeneralIndex/data).
- The ngrams and keywords files are each on their own item.
- Ngrams are on identifiers with the naming scheme GeneralIndex.ngrams.n where n=0..f
- Keywords are on identifiers with the scheme GeneralIndex.keywords.n where n=0..f
- So, ngram slice 0 is at https://archive.org/download/GeneralIndex.ngrams.0

You can see all the items here:
https://archive.org/search.php?query=%22general%20index%22%20AND%20collection%3Amulticasting

1. The ngrams Table

- The _ngrams table is the core of the General Index.
- SpaCy is used to extract ngrams, from unigrams to 5-grams, into the doc_ngrams_n tables.
- There are 355,279,820,087 rows in total.
- Each row represents how many instances of an n-gram are in an article.
- The files unzip to 2.1-2.3 tbytes each, for a total of 36 tbytes.
- There are 3 sample files generated using head and fgrep.

2. The keywords table

- The _keywords table extracts the meaningful terms in a document.
- YAKE is used to extract document keywords.
- There are 19,740,906,314 rows.
- The files unzip to 95-102 gbytes each, for a total of 1.6 tbytes.
- Sample files are available.

3. The metadata table

- The _info table attempts to map an md5 unique identifier to metadata.
- In some cases, we are unable to extract appropriate metadata.
- In some cases, the data may be wrong.
- The files unzip to 70 gbytes total.
- A sample file is available.
- *NEW* An updated combined metadata file that unzips to 70 gbytes is available.
- The slice metadata files have also been updated with enhanced metadata.

An easy way to begin is to work with a single slice: load the keywords and metadata for that slice only. While we provide Postgres load files, feel free to parse these into other formats. We hope to add other information, such as tf-idf, in the future.

==========
The Tables
==========

doc_ngrams_n – 16 slices: 0-f
  dkey [text]: document key (md5 hash of document)
  ngram [text]: proper case version of ngrams (unigrams, bigrams, trigrams, 4grams, 5grams)
  ngram_lc [text]: lower case version of ngrams – best for search
  ngram_tokens [int]: number of tokens (words) in the ngram (e.g., unigrams: 1, bigrams: 2)
  term_freq [numeric]: number of occurrences of the ngram in the document
  doc_count [int]: always 1 (used for other analytic purposes)
  insert_date [date]: date record inserted into table, initial load has a null insert_date

doc_keywords_n – 16 slices: 0-f
  dkey [text]: document key (md5 hash of document)
  keywords [text]: proper case version of keywords captured by YAKE process, from 1 to 5grams
  keywords_lc [text]: lower case version of keywords
  keywords_tokens [int]: number of tokens (words) in the keywords phrase (e.g., unigrams: 1, bigrams: 2)
  keyword_score [numeric]: YAKE score of how meaningful the word is in the document; the smaller the value, the more meaningful
  doc_count [int]: always 1 (used for other analytic purposes)
  insert_date [date]: date record inserted into table, initial load has a null insert_date

doc_info_n – 16 slices: 0-f
  dkey [text]: document key (md5 hash of document)
  meta_doi [text]: DOI for doc from doc_meta source
  doc_doi [text]: DOI for doc from original text
  doi [text]: DOI for doc from doc_meta if available, else from original text
  doc_pub_date [date]: publish date for document from original text
  meta_pub_date [date]: publish date for document from doc_meta
  pub_date [date]: publish date for document from doc_meta if available, else from original text
  doc_authors [text]: list of authors from original text
  meta_authors [text]: list of authors from doc_meta
  authors [text]: list of authors from doc_meta if available, else from original text
  doc_title [text]: document title from original text
  meta_title [text]: document title from doc_meta
  title [text]: document title from doc_meta if available, else from original text

/sign/ Carl Malamud (carl@media.org) :seal:
Last revised: Mon Oct 22 12:17:08 PDT 2021
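The slice naming scheme in the README can be turned into a list of item URLs with a few lines of Python. This is only a sketch: the README confirms the URL for ngram slice 0, and the helper below assumes the other ngram slices and the keyword slices follow the same download URL pattern. The function name `slice_urls` is made up for illustration.

```python
# Sketch: enumerate the archive.org item URLs for all 16 slices.
# Assumption: every slice follows the pattern the README shows for
# ngram slice 0 (https://archive.org/download/GeneralIndex.ngrams.0).
BASE = "https://archive.org/download"
SLICES = "0123456789abcdef"  # 16 slices, numbered 0 to f


def slice_urls(kind: str) -> list[str]:
    """Return one URL per slice; kind is 'ngrams' or 'keywords'."""
    if kind not in ("ngrams", "keywords"):
        raise ValueError(f"unknown table kind: {kind!r}")
    return [f"{BASE}/GeneralIndex.{kind}.{s}" for s in SLICES]


ngram_urls = slice_urls("ngrams")
keyword_urls = slice_urls("keywords")
print(ngram_urls[0])
```

Since each ngram slice unzips to 2.1-2.3 tbytes, starting from a single keyword slice (95-102 gbytes) is the lighter option.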
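To make the table descriptions concrete, here is a toy sketch of how the keywords and metadata tables for one slice might be queried together. It uses an in-memory SQLite database instead of Postgres so that it runs anywhere. The table names `doc_keywords_0` and `doc_info_0` are an assumed reading of the `_n` suffix, only a subset of the columns is created, and every row (titles, DOIs, scores) is invented for illustration.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow the README; the table names for slice 0 are assumed.
conn.executescript("""
CREATE TABLE doc_keywords_0 (
    dkey TEXT, keywords TEXT, keywords_lc TEXT,
    keywords_tokens INTEGER, keyword_score NUMERIC
);
CREATE TABLE doc_info_0 (
    dkey TEXT, meta_doi TEXT, doc_doi TEXT,
    meta_title TEXT, doc_title TEXT
);
""")

# Invented document keys: md5 hashes of placeholder strings.
d1 = hashlib.md5(b"toy-document-a").hexdigest()
d2 = hashlib.md5(b"toy-document-b").hexdigest()

conn.executemany(
    "INSERT INTO doc_keywords_0 VALUES (?, ?, ?, ?, ?)",
    [
        (d1, "Neural Network", "neural network", 2, 0.02),
        (d1, "Training", "training", 1, 0.15),
        (d2, "Neural Network", "neural network", 2, 0.30),
    ],
)
conn.executemany(
    "INSERT INTO doc_info_0 VALUES (?, ?, ?, ?, ?)",
    [
        # Document A has metadata from doc_meta; B only has extracted text.
        (d1, "10.0000/example.a", None, "Toy Paper A", None),
        (d2, None, "10.0000/example.b", None, "Toy Paper B"),
    ],
)

# Search on the lower-case column, as the README suggests for ngrams.
# COALESCE mirrors the "doc_meta if available, else original text" rule.
# A smaller keyword_score means a more meaningful keyword, so sort ascending.
rows = conn.execute(
    """
    SELECT COALESCE(i.meta_title, i.doc_title) AS title,
           COALESCE(i.meta_doi, i.doc_doi) AS doi,
           k.keyword_score
    FROM doc_keywords_0 AS k
    JOIN doc_info_0 AS i ON i.dkey = k.dkey
    WHERE k.keywords_lc = ?
    ORDER BY k.keyword_score ASC
    """,
    ("neural network",),
).fetchall()

for title, doi, score in rows:
    print(title, doi, score)
```

In the real data the `doi`, `title`, `authors`, and `pub_date` columns already hold the combined value, so the COALESCE calls are only there to show how those columns relate to their `meta_` and `doc_` counterparts.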

Comments

someone needs to hook this baby up to Sci-hub.

Meanwhile, I love the guy in the video at https://archive.org/details/GeneralIndex

Is he even real? He's like a deepfaked geek.

I would have called it "Open Journal Index" or something.

"General" implies it would be "General". He means "General" within his world of academia only.

omicron wrote:

I would have called it "Open Journal Index" or something.

That makes sense!

omicron wrote:

"General" implies it would be "General". He means "General" within his world of academia only.

I do not follow..."general" implies it would be "general" is a tautology, but I think I know what you mean: it implies a non-specialized list of subjects and/or not limited to any one topic, correct?

But in what way is "general" limited to academia only?

anyway, for anybody who is confused, the archive.org website has a more complete explanation: this is an index to articles, and does not seem to contain the content of the articles.

Am I wrong, OP?

given that you can download the torrent from archive.org, I hope this torrent is linked to that other one so they can feed each other.

or how to use it... but im 1000% getting it...