Selecting columns still keeps a reference to the old DataFrame

Question 1

I am using the following code:

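from pyspark.sql import functions as f

# xx, yy, zz are placeholder id values
random = [("ABC", xx, 1),
          ("DEF", yy, 1),
          ("GHI", zz, 0)]
randomColumns = ["name", "id", "male"]
randomDF = spark.createDataFrame(data=random, schema=randomColumns)
test_df = randomDF.select("name", "id")
# "male" was not selected into test_df, yet this filter still runs
test_df.filter(f.col("male") == '1').show()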

From the code above, I expected an error, because test_df does not select the male column from the original dataframe. Surprisingly, the query above runs fine without any error and outputs the following:

+---------+-------+
|name     |     id|
+---------+-------+
|      abc|     xx|
|      def|     yy|
+---------+-------+

I would like to understand the logic behind what Spark is doing. According to the Spark documentation, select returns a new dataframe. So why is it still able to use the male column from the parent dataframe?

chenzhongpu · Answer 1 · 2021-11-24T01:29:03

This is caused by the DAG generated by Spark. Some operators (or transformations) are lazily executed, so they pave the way for Spark to optimize the DAG.

In this example there are two steps: select (or project, in SQL jargon) first, and filter later. But in fact, when the query is executed, the filter runs first and the select afterwards, because that is more efficient.

You can verify this conclusion with the explain() method:

test_df.filter(f.col("flag") == '1').explain()

It will output:

== Physical Plan ==
*(1) Project [dept_name#0, dept_id#1L]
+- *(1) Filter (isnotnull(flag#2L) AND (flag#2L = 1))
   +- *(1) Scan ExistingRDD[dept_name#0,dept_id#1L,flag#2L]
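To make the reordering concrete, here is a minimal sketch in plain Python. It is only a toy model, not how Spark is implemented: lists of dicts stand in for a DataFrame, the helper names project and filter_rows are hypothetical, and the id values 1, 2, 3 are made up because the question leaves them as placeholders. It uses the question's column names (name, id, male) rather than the ones in the plan above.

```python
# Toy model only: plain Python lists of dicts stand in for a DataFrame.
# The id values are invented for illustration.
rows = [
    {"name": "ABC", "id": 1, "male": 1},
    {"name": "DEF", "id": 2, "male": 1},
    {"name": "GHI", "id": 3, "male": 0},
]


def project(rows, columns):
    """Keep only the requested columns (plays the role of a Project node)."""
    return [{c: r[c] for c in columns} for r in rows]


def filter_rows(rows, predicate):
    """Keep only rows matching the predicate (plays the role of a Filter node)."""
    return [r for r in rows if predicate(r)]


def is_male(row):
    return row["male"] == 1


# Order as written in the question: project first, then filter on "male".
# Evaluated literally, the column is already gone, so the lookup fails.
try:
    filter_rows(project(rows, ["name", "id"]), is_male)
    literal_order_failed = False
except KeyError:
    literal_order_failed = True

# Order shown in the physical plan: filter the source rows, then project.
result = project(filter_rows(rows, is_male), ["name", "id"])

print(literal_order_failed)
print(result)
```

In this toy version the filter can only see the male column when it runs against the source rows, which matches the Filter node sitting below the Project node in the plan.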

mazaneicha · Answer 2 · 2021-11-24T14:25:52

Adding to @chenzhongpu's answer, please note that if you define a temp view on top of your test_df, the query will fail:

test_df.createOrReplaceTempView("test_df")
spark.sql("select * from test_df where flag = 1").show()
Traceback (most recent call last): ...
:
pyspark.sql.utils.AnalysisException: u"cannot resolve '`flag`' given input columns: [test_df.dept, test_df.id]; line 1 pos 24;
'Project [*]
 +- 'Filter ('flag = 1)
   +- SubqueryAlias `test_df`
      +- Project [dept#0, id#2L]
         +- LogicalRDD [dept#0, flag#1L, id#2L], false

...because the select (= the Project node in the execution plan) is going to precede the filter (attempted via the where clause).
