Big Data Engineer Resume
Irving, Texas
SUMMARY:
- Big data engineer with 6 years of experience who undertakes complex assignments, meets tight deadlines, and delivers superior performance. Possesses practical knowledge in data analytics and optimization. Applies strong analytical skills to inform senior management of key trends identified in the data. Operates with a strong sense of urgency and thrives in a fast-paced setting.
- Experienced in developing scripts using Spark, Hadoop, Sqoop, YARN, Tez, Impala, Automic UC4, Oozie, Kafka, and Hive.
- Well versed in Scala, Python, HQL scripts, and streaming data.
- Extensive experience in data transformation, data mapping from source to target database schemas.
- Experienced in building data pipelines on the big data stack.
- Experienced in working with Python for statistical analysis, along with MySQL and NoSQL databases, across various environments and business processes.
- Experience in handling large datasets with R, Python, and libraries such as pandas, NumPy, and scikit-learn.
- Experienced working with NoSQL databases such as Cassandra.
- Experienced in parallel processing to maximize calculation speed and efficiency.
- Experience with Cloudera and Hortonworks distributions.
- Well versed in version control tools such as Git and Bitbucket.
- Worked in an Agile environment, following strict Agile practices using Jira.
TECHNICAL SKILLS:
PROFESSIONAL EXPERIENCE:
Confidential, Irving, Texas
Big Data Engineer
Responsibilities:
- Build a data pipeline for loading data into the insight store.
- Develop a data model in Cassandra for storing insights.
- Develop a Spark Scala batch application to process data and load it into the insight store (see the sketch below).
- Develop a Spark Streaming application to process data and load it into the insight store.
- Schedule jobs on Oozie to automate workflows.
- Design and develop data pipelines using Spark.
- Ingest data from Oracle databases and file feeds into Hive using Spark.
- Create, validate, and maintain scripts to load data from various data sources into HDFS.
- Create external tables in Hive to read data from Hadoop.
- Develop shell scripts to retrieve file-feed data located on a remote cluster.
- Unit test Spark jobs and Oozie workflows.
- Log errors through Log4j to Logstash, Elasticsearch, and Kibana.
- Unit test the Hive, Spark, and Cassandra data pipeline.
- Visualize current trends on Kibana dashboards.
- Save data into Kafka topics and Hive for auditing purposes.
Environment: Spark, Scala, IntelliJ, Kibana, shell scripting, Hive, Hadoop, Cassandra, Kafka.
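The batch and streaming applications above were built in Spark/Scala; as a rough illustration of the batch flow only, here is a minimal PySpark sketch that reads from Hive, aggregates, and writes to a Cassandra insight store. The keyspace, table, and column names are hypothetical, and it assumes the spark-cassandra-connector is available on the cluster.

```python
# Minimal PySpark sketch of a Hive -> transform -> Cassandra batch flow.
# Keyspace, table, and column names are hypothetical; the original application was in Scala.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("insight-batch-loader")
    .enableHiveSupport()                                           # read source data from Hive
    .config("spark.cassandra.connection.host", "cassandra-host")   # assumes spark-cassandra-connector
    .getOrCreate()
)

# Read raw events previously ingested into Hive
events = spark.table("raw_db.customer_events")

# Aggregate events into per-customer insights
insights = (
    events
    .groupBy("customer_id")
    .agg(
        F.count("*").alias("event_count"),
        F.max("event_ts").alias("last_event_ts"),
    )
)

# Write the result to the Cassandra insight store
(
    insights.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="insights", table="customer_insights")
    .mode("append")
    .save()
)
```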
Confidential
Big Data Engineer
Responsibilities:
- Develop HQL queries to perform DDL and DML operations.
- Develop transformation scripts using Spark SQL to implement logic on data lake tables (see the sketch below).
- Develop workflows between Hadoop, Hive, and MariaDB.
- Schedule jobs on Oozie to automate workflows.
- Design and develop data pipelines using Automic and the Aorta framework.
- Ingest data from MariaDB into Hadoop using the Aorta framework.
- Create, validate, and maintain scripts to load data from various data sources into HDFS.
- Create external tables in Hive to read data from Hadoop.
- Develop shell scripts to build a framework between Hive and MariaDB.
- Unit test Oozie jobs and workflows.
- Develop data quality (DQ) scripts to validate and maintain data quality for downstream applications.
- Process data by developing Spark Scala applications.
- Validate ThoughtSpot (BI tool) data against base tables in the data lake.
- Perform end-to-end testing of the Opportunities and KPI applications.
- Validate inventory data in the data lake against Teradata and Informix data.
Environment: Hive, Hadoop, Scala, HQL, MariaDB, SQL, Aorta framework, Automic, Teradata, Informix, Hortonworks.
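A minimal sketch of the Spark SQL transformation step referenced above, written in PySpark for brevity (the production scripts ran as Spark Scala applications inside the Automic/Aorta pipeline). Database, table, and column names are hypothetical.

```python
# Minimal PySpark sketch of a Spark SQL transformation over data-lake tables.
# Database, table, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lake-transformation")
    .enableHiveSupport()
    .getOrCreate()
)

# Apply business logic with Spark SQL on external Hive tables over the lake
daily_sales = spark.sql("""
    SELECT store_id,
           sales_date,
           SUM(sale_amount)         AS total_sales,
           COUNT(DISTINCT order_id) AS order_count
    FROM lake_db.sales_fact
    WHERE sales_date = CURRENT_DATE()
    GROUP BY store_id, sales_date
""")

# Persist the result for downstream consumers (e.g. DQ checks and ThoughtSpot validation)
(
    daily_sales.write
    .mode("overwrite")
    .saveAsTable("mart_db.daily_store_sales")
)
```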
Hadoop and Spark Developer
Confidential
Responsibilities:
- Worked on a live 60-node Hadoop cluster running CDH5.
- Worked with structured, semi-structured, and unstructured data totaling 23 TB.
- Developed a Sqoop incremental import job for importing data into HDFS.
- Loaded data from HDFS into the Spark shell to process stock data.
- Used YARN as the resource manager for parallel computing.
- Performed Spark transformations and actions on different file formats such as Parquet, Avro, JSON, and text files.
- Exported new datasets into staging tables in MariaDB.
- Developed Spark programs in Scala (RDDs, DataFrames, and Spark SQL) to handle mathematical calculations and deployed them on the cluster.
- Developed Spark Streaming programs in Scala using DStreams to handle live data and analyze market trends (see the sketch below).
Environment: Hadoop ecosystem, HDFS, MapReduce, Sqoop, Spark, Spark SQL, Scala, R, Amazon S3, MariaDB, YARN, Cloudera.
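The streaming programs above were written in Scala; the following minimal PySpark Streaming (DStream) sketch shows the same shape of processing. The socket source, host/port, and record layout are hypothetical stand-ins for the real market-data feed.

```python
# Minimal PySpark Streaming (DStream) sketch for processing a live feed.
# The original programs were in Scala; the source host/port and record layout are hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stock-trend-stream")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Ingest live ticks from a socket source (stand-in for the real feed)
lines = ssc.socketTextStream("feed-host", 9999)

# Parse "symbol,price" records and average prices per symbol in each batch
records = lines.map(lambda line: line.split(","))
pairs = records.map(lambda rec: (rec[0], (float(rec[1]), 1)))
avg_price = (
    pairs
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda s: s[0] / s[1])
)

avg_price.pprint()  # print per-batch averages to the driver log

ssc.start()
ssc.awaitTermination()
```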
Hadoop and Spark Developer
Confidential
Responsibilities:
- Built data pipelines to load and transform large sets of structured, semi-structured, and unstructured data.
- Imported data from HDFS into Hive using HiveQL.
- Created Hive tables and loaded and analyzed data using Hive queries.
- Created partitioned and bucketed Hive tables to improve performance.
- Developed a Sqoop import job for importing data into HDFS.
- Explored components such as SparkContext, Spark SQL, DataFrames, pair RDDs, and accumulators to improve the performance and optimization of existing algorithms.
- Processed millions of records using Hadoop jobs.
- Implemented Spark code in Python for RDD transformations and actions in Spark applications (see the sketch below).
- Used Java UDF JAR files to mask PII data.
- Defined and contributed to the development of standards, guidelines, design patterns, and common development frameworks and components.
- Worked with leadership to understand scope, derive estimates, schedule and allocate work, manage tasks and projects, and present status updates to IT and business leaders as required.
Environment: Hadoop ecosystem, Hive, HDFS, Python, Eclipse, Sqoop, Apache Spark, Java.
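A minimal PySpark sketch of RDD transformations and actions of the kind referenced above. The HDFS path, record layout, and column positions are hypothetical placeholders.

```python
# Minimal PySpark sketch of RDD transformations and actions.
# The input path and record layout are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-transformations").getOrCreate()
sc = spark.sparkContext

# Load raw text records from HDFS
records = sc.textFile("hdfs:///data/raw/transactions/*.txt")

# Transformations: parse pipe-delimited rows, drop malformed ones, key by account
parsed = records.map(lambda line: line.split("|"))
valid = parsed.filter(lambda cols: len(cols) == 3)
by_account = valid.map(lambda cols: (cols[0], float(cols[2])))

# Action: total amount per account, a sample of which is collected to the driver
totals = by_account.reduceByKey(lambda a, b: a + b)
for account, amount in totals.take(10):
    print(account, amount)

spark.stop()
```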
Confidential
Data Analyst
Responsibilities:
- Analyze Nielsen data to forecast television show impressions using time series analysis in R.
- Assess data quality using Python scripts and provide insights using pandas.
- Provide metrics to the upfront (ad sales) team through Tableau dashboards to support ad revenue decisions.
- Perform statistical analysis using R.
- Monitor scheduled jobs on Databricks.
- Interface with other technology teams to extract, transform, and load data from a wide variety of data sources using SQL and other ETL solutions.
- Implement feature engineering logic in pandas on a subset of data used for forecasting (see the sketch below).
Environment: Python, Tableau, R, Machine Learning.
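A minimal pandas sketch of the feature engineering step referenced above. The file name, column names, and specific features (calendar, lag, and rolling-window features) are hypothetical placeholders chosen to illustrate typical forecasting inputs.

```python
# Minimal pandas sketch of feature engineering on a subset of data used for forecasting.
# The file name and column names are hypothetical placeholders.
import pandas as pd

# Load the subset of impressions data to be engineered
df = pd.read_csv("impressions_subset.csv", parse_dates=["air_date"])

# Calendar features commonly used by time-series forecasting models
df["day_of_week"] = df["air_date"].dt.dayofweek
df["month"] = df["air_date"].dt.month
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

# Lag and rolling-window features computed per show
df = df.sort_values(["show_id", "air_date"])
df["impressions_lag_1"] = df.groupby("show_id")["impressions"].shift(1)
df["impressions_roll_4"] = (
    df.groupby("show_id")["impressions"]
      .transform(lambda s: s.shift(1).rolling(window=4).mean())
)

# Drop rows without enough history for the lag/rolling features
features = df.dropna(subset=["impressions_lag_1", "impressions_roll_4"])
print(features.head())
```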