Size | Seeds | Peers | Completed |
15.58 GiB | 0 | 0 | 116 |
README file for the General Index
- There are no rights reserved on this public domain data.
- This is an alpha release dated October 4, 2021.
- The General Index was created by Public.Resource.Org, Inc., a 501(c)(3) nonprofit.
- The URL for the General Index is https://archive.org/details/GeneralIndex
- The data files total 4.7 tbytes, but will expand to 37.9 tbytes when unzipped.
- The corpus of 107,233,728 articles has been split into 16 slices, numbered from 0 to f.
- The files in this distribution were created using the Postgres pg_dump command.
- The collection is not complete and text extraction was not always successful.
- The metadata and sample files are here (https://archive.org/download/GeneralIndex/data).
- The ngrams and keywords files are each on their own item.
- Ngrams are on identifiers with the naming scheme GeneralIndex.ngrams.n where n=0..f
- Keywords are on identifiers with the scheme GeneralIndex.keywords.n where n=0..f
- So, ngram slice 0 is at https://archive.org/download/GeneralIndex.ngrams.0
You can see all the items here:
https://archive.org/search.php?query=%22general%20index%22%20AND%20collection%3Amulticasting
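The naming scheme above is regular enough to enumerate. Here is a minimal Python sketch, assuming the download URL pattern shown for ngram slice 0 also holds for every other slice and for the keywords items. The README spells out only the ngrams.0 URL, and the helper name slice_urls is made up for illustration.

```python
# Hex digits 0..f name the 16 slices.
SLICES = "0123456789abcdef"
BASE = "https://archive.org/download"


def slice_urls(kind):
    """Return the 16 item URLs for kind = 'ngrams' or 'keywords'."""
    if kind not in ("ngrams", "keywords"):
        raise ValueError("kind must be 'ngrams' or 'keywords'")
    # Assumption: every slice follows the pattern given for ngrams.0.
    return [f"{BASE}/GeneralIndex.{kind}.{s}" for s in SLICES]


for url in slice_urls("ngrams")[:2]:
    print(url)
```

Each URL points at an archive.org item page, not at a single file, so the file names inside each item still need to be looked up there.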
1. The ngrams Table
- The _ngrams table is the core of the General Index.
- The spaCy library is used to extract ngrams, from unigrams to 5-grams, into the doc_ngrams_n tables.
- There are 355,279,820,087 rows in total.
- Each row represents how many instances of an n-gram are in an article.
- The files unzip to 2.1-2.3 tbytes each, for a total of 36 tbytes.
- There are 3 sample files generated using head and fgrep.
2. The keywords table
- The _keywords table extracts the meaningful terms in a document.
- YAKE is used to extract document keywords.
- There are 19,740,906,314 rows.
- The files unzip to 95-102 gbytes each, for a total of 1.6 tbytes.
- Sample files are available.
3. The metadata table
- The _info table attempts to map an md5 unique identifier to metadata.
- In some cases, we are unable to extract appropriate metadata.
- In some cases, the data may be wrong.
- The files unzip to 70 gbytes total.
- A sample file is available.
- *NEW* An updated combined metadata file that unzips to 70 gbytes is available.
- The slice metadata files have also been updated with enhanced metadata.
An easy way to begin is to load the keywords and metadata for a single slice.
While we provide Postgres load files, feel free to parse these into other formats.
We hope to add other information, such as tf-idf, in the future.
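For readers who want to parse the load files into another format, the following Python sketch may help. It assumes the dumps are plain-text pg_dump output in which each table's data sits in a COPY ... FROM stdin; block with a column list, tab-separated fields, \N for null, and a closing \. line. The README does not say which pg_dump format was used, so check a sample file first. The function names, the schema-qualified table name, and the sample row are invented for illustration.

```python
import io

_ESCAPES = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\"}


def unescape(field):
    """Undo the common backslash escapes of Postgres COPY text format."""
    out, i = [], 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            nxt = field[i + 1]
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def iter_copy_rows(lines):
    """Yield (table, row_dict) for each data row inside a COPY block."""
    table, columns = None, None
    for raw in lines:
        line = raw.rstrip("\n")
        if columns is None:
            # Outside a data block: look for the start of the next one.
            if line.startswith("COPY ") and line.endswith("FROM stdin;"):
                head, _, rest = line[5:].partition(" (")
                table = head
                columns = [c.strip() for c in rest.split(")")[0].split(",")]
            continue
        if line == "\\.":  # end-of-data marker
            table, columns = None, None
            continue
        values = [None if v == "\\N" else unescape(v) for v in line.split("\t")]
        yield table, dict(zip(columns, values))


# Made-up sample in the assumed format; not taken from the real files.
sample = io.StringIO(
    "COPY public.doc_keywords_0 (dkey, keywords, keywords_lc, "
    "keywords_tokens, keyword_score, doc_count, insert_date) FROM stdin;\n"
    "0123abcd\tGene Expression\tgene expression\t2\t0.0123\t1\t\\N\n"
    "\\.\n"
)
rows = list(iter_copy_rows(sample))
print(rows[0][1]["keywords_lc"])
```

All values come back as strings (or None), so numeric columns such as keyword_score still need converting. Because the function reads line by line, it can be fed a decompressed stream without loading a whole slice into memory.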
==========
The Tables
==========
doc_ngrams_n – 16 slices: 0-f
dkey [text]: document key (md5 hash of document)
ngram [text]: proper case version of ngrams (unigrams, bigrams, trigrams, 4grams, 5grams)
ngram_lc [text]: lower case version of ngrams – best for search
ngram_tokens [int]: number of tokens (words) in the ngram (e.g., unigrams: 1, bigrams: 2)
term_freq [numeric]: number of occurrences of the ngram in the document
doc_count [int]: always 1 (used for other analytic purposes)
insert_date [date]: date record inserted into table, initial load has a null insert_date
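To make the shape of a doc_ngrams_n row concrete, here is a rough Python sketch that counts unigrams through 5-grams in one document. It is not the pipeline that built the index: it splits on whitespace instead of using spaCy, and it counts case-sensitive ngrams, which is an assumption about how rows are grouped. The function name ngram_rows is hypothetical.

```python
from collections import Counter


def ngram_rows(text, max_n=5):
    """Return one dict per distinct n-gram (n = 1..max_n) with its count."""
    tokens = text.split()  # crude stand-in for spaCy tokenization
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[(" ".join(tokens[i:i + n]), n)] += 1
    return [
        {"ngram": gram, "ngram_lc": gram.lower(), "ngram_tokens": n, "term_freq": freq}
        for (gram, n), freq in counts.items()
    ]


for row in ngram_rows("The cat saw the cat"):
    if row["ngram_tokens"] <= 2:
        print(row)
```

In the real tables each such row also carries the dkey of the document it came from.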
doc_keywords_n – 16 slices: 0-f
dkey [text]: document key (md5 hash of document)
keywords [text]: proper case version of keywords captured by YAKE process, from 1 to 5grams
keywords_lc [text]: lower case version of keywords
keywords_tokens [int]: number of tokens (words) in the keywords phrase (e.g., unigrams: 1,
bigrams: 2)
keyword_score [numeric]: YAKE score of how meaningful the keyword is in the document; the smaller
the value, the more meaningful
doc_count [int]: always 1 (used for other analytic purposes)
insert_date [date]: date record inserted into table, initial load has a null insert_date
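Because a smaller keyword_score means a more meaningful keyword, ranking a document's keywords means sorting in ascending order. Here is a small Python sketch with made-up rows. The helper name top_keywords is hypothetical, and the scores are parsed from strings on the assumption that rows were read from a text dump.

```python
def top_keywords(rows, k=3):
    """Return the k rows with the lowest keyword_score (most meaningful first)."""
    return sorted(rows, key=lambda r: float(r["keyword_score"]))[:k]


# Made-up rows for illustration only; real rows come from a doc_keywords_n slice.
rows = [
    {"keywords_lc": "protein folding", "keyword_score": "0.004"},
    {"keywords_lc": "results", "keyword_score": "0.310"},
    {"keywords_lc": "molecular dynamics", "keyword_score": "0.012"},
]
for r in top_keywords(rows, k=2):
    print(r["keywords_lc"], r["keyword_score"])
```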
doc_info_n – 16 slices: 0-f
dkey [text]: document key (md5 hash of document)
meta_doi [text]: DOI for doc from doc_meta source
doc_doi [text]: DOI for doc from original text
doi [text]: DOI for doc from doc_meta if available, else from original text
doc_pub_date [date]: publish date for document from original text
meta_pub_date [date]: publish date for document from doc_meta
pub_date [date]: publish date for document from doc_meta if available, else from original text
doc_authors [text]: list of authors from original text
meta_authors [text]: list of authors from doc_meta
authors [text]: list of authors from doc_meta if available, else from original text
doc_title [text]: document title from original text
meta_title [text]: document title from doc_meta
title [text]: document title from doc_meta if available, else from original text
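The combined columns (doi, pub_date, authors, title) all follow the same rule: use the doc_meta value if available, else the value from the original text. Here is a Python sketch of that rule, assuming "available" means neither null nor an empty string. The README does not define it more precisely, and the function names and sample row are invented.

```python
def coalesce(meta_value, doc_value):
    """Prefer the doc_meta value; fall back to the value extracted from the text."""
    return meta_value if meta_value not in (None, "") else doc_value


def combined_fields(row):
    """Rebuild doi, pub_date, authors and title from their meta_/doc_ pairs."""
    return {
        name: coalesce(row.get(f"meta_{name}"), row.get(f"doc_{name}"))
        for name in ("doi", "pub_date", "authors", "title")
    }


# Made-up row for illustration only.
row = {
    "meta_doi": None,
    "doc_doi": "10.1000/example",
    "meta_title": "A Title From Metadata",
    "doc_title": "a title from the text",
}
print(combined_fields(row))
```

This can be useful for checking a row's combined columns against their sources, since the README warns that some metadata may be missing or wrong.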
/sign/ Carl Malamud (carl@media.org) :seal:
Last revised: Mon Oct 22 12:17:08 PDT 2021
Comments
:0
wow
someone needs to hook this baby up to Sci-hub.
Meanwhile, I love the guy in the video at https://archive.org/details/GeneralIndex
Is he even real? He's like a deepfaked geek.
slightly narcissistic
I would have called it "Open Journal Index" or something.
"General" implies it would be "General". He means "General" within his world of academia only.
omicron wrote:
That makes sense!
I do not follow..."general" implies it would be "general" is a tautology, but I think I know what you mean: it implies a non-specialized list of subjects and/or not limited to any one topic, correct?
But in what way is "general" limited to academia only?
anyway, for anybody who is confused, the archive.org website has a more complete explanation: this is an index to articles, and does not seem to contain the content of the articles.
Am I wrong, OP?
given that you can download the torrent from archive.org, I hope this torrent is linked to that other one so they can feed each other.
idk what this is
or how to use it... but im 1000% getting it...